
Automated Document Text Extraction Using a Small Language Model (SLM)


An intelligent document processing system that extracts structured information from invoices, forms, and scanned documents using a fine-tuned DistilBERT model and transfer learning.

Quick Start

1. Installation

# Clone the repository
git clone https://github.com/sanjanb/small-language-model.git
cd small-language-model

# Install dependencies
pip install -r requirements.txt

# Install Tesseract OCR (Windows)
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH or set TESSERACT_PATH environment variable

# Install Tesseract OCR (Ubuntu/Debian)
sudo apt install tesseract-ocr

# Install Tesseract OCR (macOS)
brew install tesseract

2. Quick Demo

# Run the interactive demo
python demo.py

# Option 1: Complete demo with training and inference
# Option 2: Train model only
# Option 3: Test specific text

3. Web Interface

# Start the web API server
python api/app.py

# Open your browser to http://localhost:8000
# Upload documents or enter text for extraction

Project Overview

This system combines OCR technology, text preprocessing, and a fine-tuned DistilBERT model to automatically extract structured information from documents. It uses transfer learning to adapt a pretrained transformer for document-specific Named Entity Recognition (NER).
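
The transfer-learning step can be pictured as attaching a fresh token-classification head to the pretrained DistilBERT encoder and fine-tuning it on labeled document text. Below is a minimal sketch using the Hugging Face transformers API with an illustrative subset of the label set; the project's actual model setup lives in src/model.py.

# Minimal sketch: adapting pretrained DistilBERT for token classification (NER).
# The label list here is illustrative, not the project's full set.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE", "B-AMOUNT", "I-AMOUNT"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# The pretrained encoder weights are reused; only the classification head
# starts from scratch and is fine-tuned on the auto-labeled document data.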

Key Capabilities

  • Multi-format Support: PDF, DOCX, PNG, JPG, TIFF, BMP
  • Dual OCR Engine: Tesseract + EasyOCR for improved accuracy (see the sketch after this list)
  • Smart Entity Extraction: Names, dates, amounts, addresses, phones, emails
  • Transfer Learning: Fine-tuned DistilBERT for document-specific tasks
  • Web API: RESTful endpoints with interactive interface
  • Hybrid Extraction: Regex validation combined with ML predictions for higher accuracy
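
The dual OCR engine can be approximated as a simple fallback: try Tesseract first and switch to EasyOCR when the result looks too thin. This is a rough sketch assuming the pytesseract and easyocr packages; the project's actual OCR logic is in src/data_preparation.py.

# Rough sketch of a dual-OCR fallback (assumes pytesseract and easyocr are installed).
import pytesseract
from PIL import Image
import easyocr

def extract_text(image_path: str, min_chars: int = 20) -> str:
    """Try Tesseract first, fall back to EasyOCR for low-yield images."""
    text = pytesseract.image_to_string(Image.open(image_path)).strip()
    if len(text) >= min_chars:
        return text
    # EasyOCR returns a plain list of recognized strings when detail=0.
    reader = easyocr.Reader(["en"], gpu=False)
    return "\n".join(reader.readtext(image_path, detail=0))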

System Architecture

graph TD
    A[Document Input] --> B[OCR Processing]
    B --> C[Text Cleaning]
    C --> D[Tokenization]
    D --> E[DistilBERT NER Model]
    E --> F[Entity Extraction]
    F --> G[Post-processing]
    G --> H[Structured JSON Output]

    I[Training Data] --> J[Auto-labeling]
    J --> K[Model Training]
    K --> E

Project Structure

small-language-model/
├── src/                     # Core source code
│   ├── data_preparation.py  # OCR & dataset creation
│   ├── model.py             # DistilBERT NER model
│   ├── training_pipeline.py # Training orchestration
│   └── inference.py         # Document processing
├── api/                     # Web API service
│   └── app.py               # FastAPI application
├── config/                  # Configuration files
│   └── settings.py          # Project settings
├── data/                    # Data directories
│   ├── raw/                 # Input documents
│   └── processed/           # Processed datasets
├── models/                  # Trained models
├── results/                 # Training results
│   ├── plots/               # Training visualizations
│   └── metrics/             # Evaluation metrics
├── tests/                   # Unit tests
├── demo.py                  # Interactive demo
├── requirements.txt         # Dependencies
└── README.md                # This file

Usage Examples

Python API

from src.inference import DocumentInference

# Load trained model
inference = DocumentInference("models/document_ner_model")

# Process a document
result = inference.process_document("path/to/invoice.pdf")
print(result['structured_data'])
# Output: {'Name': 'John Doe', 'Date': '01/15/2025', 'Amount': '$1,500.00'}

# Process text directly
result = inference.process_text_directly(
    "Invoice sent to Alice Smith on 03/20/2025 Amount: $2,300.50"
)
print(result['structured_data'])

REST API

# Upload and process a file
curl -X POST "http://localhost:8000/extract-from-file" \
     -H "accept: application/json" \
     -H "Content-Type: multipart/form-data" \
     -F "file=@invoice.pdf"

# Process text directly
curl -X POST "http://localhost:8000/extract-from-text" \
     -H "Content-Type: application/json" \
     -d '{"text": "Invoice INV-001 for John Doe $1000"}'

Web Interface


  1. Go to http://localhost:8000
  2. Choose "Upload File" or "Enter Text" tab
  3. Upload document or paste text
  4. Click "Extract Information"
  5. View structured results

Configuration

Model Configuration

from src.model import ModelConfig

config = ModelConfig(
    model_name="distilbert-base-uncased",
    max_length=512,
    batch_size=16,
    learning_rate=2e-5,
    num_epochs=3,
    entity_labels=['O', 'B-NAME', 'I-NAME', 'B-DATE', 'I-DATE', ...]
)

Environment Variables

# Optional: Custom Tesseract path
export TESSERACT_PATH="/usr/bin/tesseract"

# Optional: CUDA for GPU acceleration
export CUDA_VISIBLE_DEVICES=0

Testing

# Run all tests
python -m pytest tests/

# Run specific test module
python tests/test_extraction.py

# Test with coverage
python -m pytest tests/ --cov=src --cov-report=html

Performance Metrics

Entity Type   Precision   Recall   F1-Score
NAME          0.95        0.92     0.94
DATE          0.98        0.96     0.97
AMOUNT        0.93        0.91     0.92
INVOICE_NO    0.89        0.87     0.88
EMAIL         0.97        0.94     0.95
PHONE         0.91        0.89     0.90

Supported Entity Types

  • NAME: Person names (John Doe, Dr. Smith)
  • DATE: Dates in various formats (01/15/2025, March 15, 2025)
  • AMOUNT: Monetary amounts ($1,500.00, 1000 USD)
  • INVOICE_NO: Invoice numbers (INV-1001, BL-2045)
  • ADDRESS: Street addresses
  • PHONE: Phone numbers (555-123-4567, +1-555-123-4567)
  • EMAIL: Email addresses (user@domain.com)
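
As a rough illustration of the formats listed above, the regex-validation side of the pipeline can be sketched as follows. The patterns are simplified examples, not the project's exact rules.

import re

# Simplified validation patterns for a few entity types (illustrative only).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?1?-?\d{3}-\d{3}-\d{4}"),
    "AMOUNT": re.compile(r"\$\d{1,3}(,\d{3})*(\.\d{2})?"),
    "DATE": re.compile(r"\d{2}/\d{2}/\d{4}"),
}

def validate(entity_type: str, value: str) -> bool:
    """Return True if an extracted value matches the expected format."""
    pattern = PATTERNS.get(entity_type)
    return bool(pattern and pattern.fullmatch(value.strip()))

print(validate("EMAIL", "user@domain.com"))   # True
print(validate("AMOUNT", "$1,500.00"))        # True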

Training Your Own Model

1. Prepare Your Data

# Place your documents in data/raw/
mkdir -p data/raw
cp your_invoices/*.pdf data/raw/

2. Run Training Pipeline

from src.training_pipeline import TrainingPipeline, create_custom_config

# Create custom configuration
config = create_custom_config()
config.num_epochs = 5
config.batch_size = 16

# Run training
pipeline = TrainingPipeline(config)
model_path = pipeline.run_complete_pipeline("data/raw")

3. Evaluate Results

Training automatically generates:

  • Loss curves: results/plots/training_history.png
  • Metrics: results/metrics/evaluation_results.json
  • Model checkpoints: models/document_ner_model/
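
A small sketch for inspecting the metrics file listed above after a run, assuming it is a JSON document (the exact schema depends on the pipeline):

import json
from pathlib import Path

# Load and pretty-print the evaluation metrics written by the training pipeline.
metrics_path = Path("results/metrics/evaluation_results.json")
with metrics_path.open() as f:
    metrics = json.load(f)

print(json.dumps(metrics, indent=2))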

Deployment

Docker Deployment

FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

# Install Tesseract
RUN apt-get update && apt-get install -y tesseract-ocr

COPY . .
EXPOSE 8000

CMD ["python", "api/app.py"]

Cloud Deployment

  • AWS: Deploy using ECS or Lambda
  • Google Cloud: Use Cloud Run or Compute Engine
  • Azure: Deploy with Container Instances

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Support


Star this repository if it helped you!
