Automated Document Text Extraction Using Small Language Model (SLM)
An intelligent document processing system that extracts structured information from invoices, forms, and scanned documents using a fine-tuned DistilBERT model and transfer learning.
Quick Start
1. Installation
```bash
# Clone the repository
git clone https://github.com/sanjanb/small-language-model.git
cd small-language-model

# Install dependencies
pip install -r requirements.txt

# Install Tesseract OCR (Windows)
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH or set TESSERACT_PATH environment variable

# Install Tesseract OCR (Ubuntu/Debian)
sudo apt install tesseract-ocr

# Install Tesseract OCR (macOS)
brew install tesseract
```
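Before running the demo, you can verify that Tesseract is reachable from Python. This is a minimal sanity check, assuming pytesseract and Pillow are among the installed dependencies; the sample image path is illustrative:

```python
# Sanity check: confirm Tesseract is visible to Python.
# Assumes pytesseract and Pillow are installed (e.g., via requirements.txt);
# the sample image path below is illustrative.
import pytesseract
from PIL import Image

print(pytesseract.get_tesseract_version())  # raises if the binary is missing

text = pytesseract.image_to_string(Image.open("data/raw/sample_invoice.png"))
print(text[:200])
```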
2. Quick Demo
```bash
# Run the interactive demo
python demo.py

# Option 1: Complete demo with training and inference
# Option 2: Train model only
# Option 3: Test specific text
```
3. Web Interface
```bash
# Start the web API server
python api/app.py

# Open your browser to http://localhost:8000
# Upload documents or enter text for extraction
```
Project Overview
This system combines OCR technology, text preprocessing, and a fine-tuned DistilBERT model to automatically extract structured information from documents. It uses transfer learning to adapt a pretrained transformer for document-specific Named Entity Recognition (NER).
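To illustrate the mechanism (this is a sketch of standard Hugging Face Transformers usage, not the project's internal API), a fine-tuned token-classification checkpoint can be queried directly once training has produced models/document_ner_model:

```python
# Sketch: querying a fine-tuned token-classification checkpoint directly.
# The model directory is the one produced by the training pipeline.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_dir = "models/document_ner_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(model_dir)

# aggregation_strategy="simple" merges B-/I- subword predictions into spans
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")

for ent in ner("Invoice INV-1001 issued to John Doe on 01/15/2025 for $1,500.00"):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```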
Key Capabilities
- Multi-format Support: PDF, DOCX, PNG, JPG, TIFF, BMP
- Dual OCR Engine: Tesseract + EasyOCR for maximum accuracy
- Smart Entity Extraction: Names, dates, amounts, addresses, phones, emails
- Transfer Learning: Fine-tuned DistilBERT for document-specific tasks
- Web API: RESTful endpoints with interactive interface
- High Accuracy: Regex validation + ML predictions
System Architecture
```mermaid
graph TD
    A[Document Input] --> B[OCR Processing]
    B --> C[Text Cleaning]
    C --> D[Tokenization]
    D --> E[DistilBERT NER Model]
    E --> F[Entity Extraction]
    F --> G[Post-processing]
    G --> H[Structured JSON Output]

    I[Training Data] --> J[Auto-labeling]
    J --> K[Model Training]
    K --> E
```
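The Entity Extraction stage in the diagram amounts to merging per-token BIO labels into entity spans. A runnable sketch of that merge (the tokens and tags below stand in for real model output; the project's own implementation may differ):

```python
# Sketch of the "Entity Extraction" step: merging per-token BIO tags
# into entity spans. Tokens and tags are illustrative model output.
def merge_bio(tokens, tags):
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"label": tag[2:], "text": tok}
        elif tag.startswith("I-") and current and tag[2:] == current["label"]:
            current["text"] += " " + tok
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities

tokens = ["Invoice", "sent", "to", "Alice", "Smith", "on", "03/20/2025"]
tags   = ["O", "O", "O", "B-NAME", "I-NAME", "O", "B-DATE"]
print(merge_bio(tokens, tags))
# [{'label': 'NAME', 'text': 'Alice Smith'}, {'label': 'DATE', 'text': '03/20/2025'}]
```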
Project Structure
```
small-language-model/
├── src/                      # Core source code
│   ├── data_preparation.py   # OCR & dataset creation
│   ├── model.py              # DistilBERT NER model
│   ├── training_pipeline.py  # Training orchestration
│   └── inference.py          # Document processing
├── api/                      # Web API service
│   └── app.py                # FastAPI application
├── config/                   # Configuration files
│   └── settings.py           # Project settings
├── data/                     # Data directories
│   ├── raw/                  # Input documents
│   └── processed/            # Processed datasets
├── models/                   # Trained models
├── results/                  # Training results
│   ├── plots/                # Training visualizations
│   └── metrics/              # Evaluation metrics
├── tests/                    # Unit tests
├── demo.py                   # Interactive demo
├── requirements.txt          # Dependencies
└── README.md                 # This file
```
Usage Examples
Python API
```python
from src.inference import DocumentInference

# Load the trained model
inference = DocumentInference("models/document_ner_model")

# Process a document
result = inference.process_document("path/to/invoice.pdf")
print(result['structured_data'])
# Output: {'Name': 'John Doe', 'Date': '01/15/2025', 'Amount': '$1,500.00'}

# Process text directly
result = inference.process_text_directly(
    "Invoice sent to Alice Smith on 03/20/2025 Amount: $2,300.50"
)
print(result['structured_data'])
```
REST API
```bash
# Upload and process a file
curl -X POST "http://localhost:8000/extract-from-file" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@invoice.pdf"

# Process text directly
curl -X POST "http://localhost:8000/extract-from-text" \
  -H "Content-Type: application/json" \
  -d '{"text": "Invoice INV-001 for John Doe $1000"}'
```
Web Interface
1. Go to http://localhost:8000
2. Choose the "Upload File" or "Enter Text" tab
3. Upload a document or paste text
4. Click "Extract Information"
5. View the structured results
Configuration
Model Configuration
```python
from src.model import ModelConfig

config = ModelConfig(
    model_name="distilbert-base-uncased",
    max_length=512,
    batch_size=16,
    learning_rate=2e-5,
    num_epochs=3,
    entity_labels=['O', 'B-NAME', 'I-NAME', 'B-DATE', 'I-DATE', ...]
)
```
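The entity_labels list above is truncated. Whatever the full set, token-classification models need consistent label/ID maps; here is an illustrative way to build BIO labels from base entity names (the names mirror the Supported Entity Types section below and may not match the project's exact configuration):

```python
# Building BIO label maps for token classification; the entity names are
# taken from the Supported Entity Types section and may not match the
# project's exact configuration.
base_entities = ["NAME", "DATE", "AMOUNT", "INVOICE_NO", "ADDRESS", "PHONE", "EMAIL"]
labels = ["O"] + [f"{prefix}-{ent}" for ent in base_entities for prefix in ("B", "I")]

id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}
print(id2label)  # {0: 'O', 1: 'B-NAME', 2: 'I-NAME', 3: 'B-DATE', ...}
```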
Environment Variables
```bash
# Optional: Custom Tesseract path
export TESSERACT_PATH="/usr/bin/tesseract"

# Optional: CUDA for GPU acceleration
export CUDA_VISIBLE_DEVICES=0
```
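For reference, a custom Tesseract path is typically consumed on the Python side like this (a sketch; config/settings.py may handle it differently):

```python
# How TESSERACT_PATH can be wired into pytesseract; config/settings.py
# may do this differently.
import os
import pytesseract

tesseract_path = os.environ.get("TESSERACT_PATH")
if tesseract_path:
    # pytesseract exposes the binary location as a module-level attribute
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
```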
Testing
```bash
# Run all tests
python -m pytest tests/

# Run specific test module
python tests/test_extraction.py

# Test with coverage
python -m pytest tests/ --cov=src --cov-report=html
```
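A new test follows the usual pytest pattern. A hypothetical example, reusing DocumentInference from the Python API section and assuming a trained model already exists:

```python
# tests/test_text_extraction.py -- hypothetical test; assumes a trained
# model exists at models/document_ner_model.
from src.inference import DocumentInference

def test_extracts_amount_from_text():
    inference = DocumentInference("models/document_ner_model")
    result = inference.process_text_directly("Total due: $2,300.50 on 03/20/2025")
    assert "Amount" in result["structured_data"]
```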
Performance Metrics
| Entity Type | Precision | Recall | F1-Score |
|---|---|---|---|
| NAME | 0.95 | 0.92 | 0.94 |
| DATE | 0.98 | 0.96 | 0.97 |
| AMOUNT | 0.93 | 0.91 | 0.92 |
| INVOICE_NO | 0.89 | 0.87 | 0.88 |
| EMAIL | 0.97 | 0.94 | 0.95 |
| PHONE | 0.91 | 0.89 | 0.90 |
Supported Entity Types
- NAME: Person names (John Doe, Dr. Smith)
- DATE: Dates in various formats (01/15/2025, March 15, 2025)
- AMOUNT: Monetary amounts ($1,500.00, 1000 USD)
- INVOICE_NO: Invoice numbers (INV-1001, BL-2045)
- ADDRESS: Street addresses
- PHONE: Phone numbers (555-123-4567, +1-555-123-4567)
- EMAIL: Email addresses (user@domain.com)
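The regex validation mentioned under Key Capabilities can be sketched with patterns like the following (illustrative patterns only; the project's actual rules may be stricter or cover more formats):

```python
# Illustrative validation patterns for ML-predicted entities; the
# project's actual rules may differ.
import re

PATTERNS = {
    "DATE": re.compile(r"\d{1,2}/\d{1,2}/\d{4}"),
    "AMOUNT": re.compile(r"\$\d{1,3}(,\d{3})*(\.\d{2})?"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"(\+1-)?\d{3}-\d{3}-\d{4}"),
}

def validate(label: str, text: str) -> bool:
    """Accept an ML prediction only if it matches the expected shape."""
    pattern = PATTERNS.get(label)
    return bool(pattern.fullmatch(text)) if pattern else True

print(validate("AMOUNT", "$1,500.00"))  # True
print(validate("DATE", "not a date"))   # False
```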
Training Your Own Model
1. Prepare Your Data
```bash
# Place your documents in data/raw/
mkdir -p data/raw
cp your_invoices/*.pdf data/raw/
```
2. Run Training Pipeline
```python
from src.training_pipeline import TrainingPipeline, create_custom_config

# Create custom configuration
config = create_custom_config()
config.num_epochs = 5
config.batch_size = 16

# Run training
pipeline = TrainingPipeline(config)
model_path = pipeline.run_complete_pipeline("data/raw")
```
3. Evaluate Results
Training automatically generates:
- Loss curves: `results/plots/training_history.png`
- Metrics: `results/metrics/evaluation_results.json`
- Model checkpoints: `models/document_ner_model/`
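To inspect the metrics afterwards, load the JSON directly (a sketch; the exact structure of the file depends on the evaluation code, so this just pretty-prints whatever is there):

```python
# Inspect evaluation output; the JSON layout depends on the training run.
import json

with open("results/metrics/evaluation_results.json") as f:
    metrics = json.load(f)

print(json.dumps(metrics, indent=2))
```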
Deployment
Docker Deployment
```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

# Install Tesseract
RUN apt-get update && apt-get install -y tesseract-ocr

COPY . .
EXPOSE 8000
CMD ["python", "api/app.py"]
```
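A typical workflow is `docker build -t doc-extractor .` followed by `docker run -p 8000:8000 doc-extractor` (the image name is arbitrary), after which the web interface is available at http://localhost:8000 as before.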
Cloud Deployment
- AWS: Deploy using ECS or Lambda
- Google Cloud: Use Cloud Run or Compute Engine
- Azure: Deploy with Container Instances
Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Hugging Face Transformers for the DistilBERT model
- Tesseract OCR for optical character recognition
- EasyOCR for additional OCR capabilities
- FastAPI for the web framework
Support
- Email: your-email@domain.com
- Issues: GitHub Issues
- Documentation: Project Wiki
Star this repository if it helped you!