
DeepSeek OCR Integration

This document explains how to use the DeepSeek OCR integration in your RAG system.

Features

  • Text Extraction: Extract text from images using DeepSeek OCR
  • Grounding: Locate specific text within images
  • Markdown Conversion: Convert document images to markdown format
  • RAG Integration: Query the RAG system with OCR-extracted text
  • Multi-language Support: Supports over 50 languages

API Endpoints

1. Extract Text from Image

POST /ocr/extract-text/
  • Input: Image file (multipart/form-data)
  • Optional: Custom prompt
  • Output: Extracted text

2. Extract Text with Grounding

POST /ocr/extract-with-grounding/
  • Input: Image file + target text (optional)
  • Output: Text with location information
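
A minimal client sketch for this endpoint. It assumes the image goes in a file form field (as in the extract-text example further below) and that the optional target text is sent as a target_text form field; that field name is an assumption, not confirmed here.

import requests

# Find where a specific string appears in the image
# ('target_text' is an assumed form field name; adjust to the actual API)
with open('image.jpg', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/ocr/extract-with-grounding/',
        files={'file': f},
        data={'target_text': 'Invoice Number'}
    )

print(response.json())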

3. Convert to Markdown

POST /ocr/convert-to-markdown/
  • Input: Document image
  • Output: Markdown formatted text
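
A similar hedged sketch for markdown conversion, again assuming the file form field and that the converted markdown is returned in the JSON body:

import requests

# Convert a document image to markdown
with open('document.png', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/ocr/convert-to-markdown/',
        files={'file': f}
    )

# The exact response key is not documented here, so print the whole payload
print(response.json())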

4. Query with OCR Text

POST /ocr/query/
  • Input: Query + conversation history + extracted text
  • Output: RAG response enhanced with OCR text

Frontend Usage

  1. Upload Image: Click the "+" button in the input area
  2. Select Image: Choose an image file from your device
  3. OCR Processing: The system will automatically extract text
  4. Options:
    • Use Extracted Text: Copy the text to the input field
    • Query with OCR: Ask questions about the image content
    • Cancel: Close the OCR modal

Configuration

Create a .env file with the following variables:

# DeepSeek OCR Configuration
DEEPSEEK_OCR_MODEL=deepseek-ai/DeepSeek-OCR
DEEPSEEK_OCR_DEVICE=auto  # auto, cpu, cuda
DEEPSEEK_OCR_MAX_TOKENS=512
DEEPSEEK_OCR_TEMPERATURE=0.1

# Optional: Custom model path for local models
# DEEPSEEK_OCR_MODEL_PATH=/path/to/local/model

# Optional: Hugging Face token for private models
# HF_TOKEN=your_huggingface_token_here
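
For reference, a minimal sketch of how these variables might be read in Python using python-dotenv; the project's actual settings loader may differ.

import os
from dotenv import load_dotenv  # from the python-dotenv package

load_dotenv()  # load variables from .env in the working directory

MODEL_NAME  = os.getenv('DEEPSEEK_OCR_MODEL', 'deepseek-ai/DeepSeek-OCR')
DEVICE      = os.getenv('DEEPSEEK_OCR_DEVICE', 'auto')   # auto, cpu, cuda
MAX_TOKENS  = int(os.getenv('DEEPSEEK_OCR_MAX_TOKENS', '512'))
TEMPERATURE = float(os.getenv('DEEPSEEK_OCR_TEMPERATURE', '0.1'))

# Optional overrides
MODEL_PATH = os.getenv('DEEPSEEK_OCR_MODEL_PATH')  # local model directory, if set
HF_TOKEN   = os.getenv('HF_TOKEN')                 # Hugging Face token for private models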

Installation

  1. Install dependencies:
pip install -r requirements.txt
  2. Set up environment variables (optional):
cp .env.example .env
# Edit .env with your configuration
  3. Run the application:
uvicorn main:app --reload
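
After the server starts, a quick smoke test from Python; this assumes the FastAPI interactive docs are exposed at /docs, which is the framework default but is not confirmed in this document.

import requests

# Confirm the app is up; /docs is FastAPI's default docs page (an assumption here)
resp = requests.get('http://localhost:8000/docs')
print(resp.status_code)  # 200 means the server is serving requests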

Model Requirements

For CPU (Laptop) Setup:

  • RAM: At least 8GB (16GB recommended)
  • Storage: ~2GB for model download
  • CPU: Modern multi-core processor (Intel i5/AMD Ryzen 5 or better)
  • Performance: Expect 10-30 seconds per image on CPU

For GPU Setup:

  • GPU: CUDA-compatible NVIDIA card (see the check after this list)
  • VRAM: At least 4GB
  • RAM: 16GB+ recommended
  • Performance: Expect 2-5 seconds per image on GPU
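
Before setting DEEPSEEK_OCR_DEVICE=cuda, you can confirm that PyTorch actually sees a CUDA device with a quick check:

import torch

# Verify that a CUDA-capable GPU is visible to PyTorch
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f'CUDA available: {torch.cuda.get_device_name(0)}')
    print(f'VRAM: {props.total_memory / 1024**3:.1f} GB')
else:
    print('No CUDA device found; keep DEEPSEEK_OCR_DEVICE=cpu or auto')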

Performance Tips

For CPU (Laptop) Users:

  1. CPU Optimization: Already configured for CPU usage
  2. Image Size: Keep images at or below 1024x1024 pixels for faster processing (see the resize sketch after this list)
  3. Memory Management: Close other applications to free up RAM
  4. Model Caching: The model is cached after first load
  5. Processing Time: Expect 10-30 seconds per image on CPU
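
A simple way to enforce the 1024x1024 limit from tip 2, using Pillow's thumbnail, which downsizes in place while preserving aspect ratio:

from PIL import Image

# Downscale an image so neither side exceeds 1024 px before sending it to OCR
def resize_for_ocr(path: str, out_path: str, max_side: int = 1024) -> None:
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # no-op if the image is already small enough
    img.save(out_path)

resize_for_ocr('large_scan.jpg', 'scan_small.jpg')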

For GPU Users:

  1. GPU Usage: Set DEEPSEEK_OCR_DEVICE=cuda for GPU acceleration
  2. Batch Processing: Process multiple images efficiently
  3. Memory Management: Monitor GPU memory usage for large images
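
For the memory-management tip, PyTorch exposes simple counters you can log between requests (a sketch; how you wire it into the app is up to you):

import torch

# Report current and peak GPU memory used by PyTorch allocations
def log_gpu_memory() -> None:
    if torch.cuda.is_available():
        used = torch.cuda.memory_allocated() / 1024**3
        peak = torch.cuda.max_memory_allocated() / 1024**3
        print(f'GPU memory: {used:.2f} GB in use, {peak:.2f} GB peak')

log_gpu_memory()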

Error Handling

The system includes comprehensive error handling:

  • File type validation
  • Model loading errors
  • OCR processing failures
  • Network connectivity issues
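
On the client side, a hedged sketch of handling these failures when calling the OCR endpoints; the error payload shape is an assumption, so it relies only on the HTTP status code:

import requests

try:
    with open('image.jpg', 'rb') as f:
        response = requests.post(
            'http://localhost:8000/ocr/extract-text/',
            files={'file': f},
            timeout=120  # OCR can take tens of seconds on CPU
        )
    response.raise_for_status()  # raises on 4xx/5xx (e.g. bad file type, OCR failure)
    print(response.json())
except requests.exceptions.ConnectionError:
    print('Could not reach the server; is uvicorn running?')
except requests.exceptions.HTTPError as err:
    print(f'OCR request failed: {err}')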

Examples

Basic Text Extraction

import requests

# Upload image and extract text
with open('image.jpg', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/ocr/extract-text/',
        files={'file': f}
    )
    
result = response.json()
print(result['extracted_text'])
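
The extract-text endpoint also accepts an optional custom prompt. A hedged variant, assuming the prompt is passed as a prompt form field (the field name is not documented here):

# Same request with a custom prompt ('prompt' is an assumed field name)
with open('image.jpg', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/ocr/extract-text/',
        files={'file': f},
        data={'prompt': 'Extract only the table contents.'}
    )

print(response.json()['extracted_text'])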

Query with OCR

# Query about extracted text
response = requests.post(
    'http://localhost:8000/ocr/query/',
    json={
        'query': 'What is the main topic?',
        'conversation_history': [],
        'extracted_text': 'Your extracted text here...'
    }
)

Troubleshooting

Common Issues

  1. Model Loading Error: Ensure you have sufficient RAM/VRAM
  2. CUDA Error: Check GPU compatibility and drivers
  3. Memory Error: Reduce image size or use CPU mode
  4. Network Error: Check internet connection for model download

Debug Mode

Enable debug logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Support

For issues and questions:

  1. Check the logs for error messages
  2. Verify your environment configuration
  3. Test with smaller images first
  4. Check GPU memory usage

License

This integration uses DeepSeek OCR which is licensed under Apache 2.0.