# DeepSeek OCR Integration
This document explains how to use the DeepSeek OCR integration in your RAG system.
## Features

- **Text Extraction**: Extract text from images using DeepSeek OCR
- **Grounding**: Locate specific text within images
- **Markdown Conversion**: Convert document images to Markdown format
- **RAG Integration**: Query the RAG system with OCR-extracted text
- **Multi-language Support**: Supports over 50 languages
## API Endpoints

### 1. Extract Text from Image

`POST /ocr/extract-text/`

- **Input**: Image file (`multipart/form-data`)
- **Optional**: Custom prompt
- **Output**: Extracted text
### 2. Extract Text with Grounding

`POST /ocr/extract-with-grounding/`

- **Input**: Image file + target text (optional)
- **Output**: Text with location information
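A minimal sketch of calling this endpoint with `requests`; the `target_text` form field name is an assumption based on the description above, so check the interactive API docs for the exact name:

```python
import requests

# Hypothetical grounding request: the 'target_text' form field
# name is assumed from the endpoint description above.
with open('invoice.jpg', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/ocr/extract-with-grounding/',
        files={'file': f},
        data={'target_text': 'Total Amount'}  # optional per the docs
    )
print(response.json())  # extracted text plus location information
```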
### 3. Convert to Markdown

`POST /ocr/convert-to-markdown/`

- **Input**: Document image
- **Output**: Markdown-formatted text
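A similar sketch for Markdown conversion; the response field names are not specified above, so the example prints the full JSON payload:

```python
import requests

# Convert a scanned document page to Markdown.
with open('document.png', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/ocr/convert-to-markdown/',
        files={'file': f}
    )
print(response.json())  # inspect the payload for the Markdown field
```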
### 4. Query with OCR Text

`POST /ocr/query/`

- **Input**: Query + conversation history + extracted text
- **Output**: RAG response enhanced with OCR text
## Frontend Usage

1. **Upload Image**: Click the "+" button in the input area
2. **Select Image**: Choose an image file from your device
3. **OCR Processing**: The system automatically extracts the text
4. **Options**:
   - **Use Extracted Text**: Copy the text to the input field
   - **Query with OCR**: Ask questions about the image content
   - **Cancel**: Close the OCR modal
## Configuration

Create a `.env` file with the following variables:

```env
# DeepSeek OCR Configuration
DEEPSEEK_OCR_MODEL=deepseek-ai/DeepSeek-OCR
DEEPSEEK_OCR_DEVICE=auto  # auto, cpu, cuda
DEEPSEEK_OCR_MAX_TOKENS=512
DEEPSEEK_OCR_TEMPERATURE=0.1

# Optional: Custom model path for local models
# DEEPSEEK_OCR_MODEL_PATH=/path/to/local/model

# Optional: Hugging Face token for private models
# HF_TOKEN=your_huggingface_token_here
```
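As a sketch of how these variables might be read at startup (assuming `python-dotenv`, which this document does not confirm is in `requirements.txt`):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # read .env from the working directory

MODEL_NAME = os.getenv("DEEPSEEK_OCR_MODEL", "deepseek-ai/DeepSeek-OCR")
DEVICE = os.getenv("DEEPSEEK_OCR_DEVICE", "auto")
MAX_TOKENS = int(os.getenv("DEEPSEEK_OCR_MAX_TOKENS", "512"))
TEMPERATURE = float(os.getenv("DEEPSEEK_OCR_TEMPERATURE", "0.1"))
```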
## Installation

1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Set up environment variables (optional):

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```

3. Run the application:

   ```bash
   uvicorn main:app --reload
   ```
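Once the server is running, a quick smoke test from Python (apps launched via `uvicorn main:app` typically serve FastAPI's interactive docs at `/docs`, though this document does not confirm the framework):

```python
import requests

# Smoke test: expect HTTP 200 if the app started correctly.
response = requests.get('http://localhost:8000/docs')
print(response.status_code)
```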
## Model Requirements

### For CPU (Laptop) Setup

- **RAM**: At least 8 GB (16 GB recommended)
- **Storage**: ~2 GB for the model download
- **CPU**: Modern multi-core processor (Intel i5 / AMD Ryzen 5 or better)
- **Performance**: Expect 10-30 seconds per image
### For GPU Setup

- **GPU**: CUDA-compatible (NVIDIA)
- **VRAM**: At least 4 GB
- **RAM**: 16 GB+ recommended
- **Performance**: Expect 2-5 seconds per image
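The `auto` device setting in the configuration above picks between these two setups. A common resolution pattern looks like the following sketch, assuming a PyTorch backend (the actual logic lives in the service code):

```python
import torch

def resolve_device(setting: str = "auto") -> str:
    """Map DEEPSEEK_OCR_DEVICE to a concrete torch device string."""
    if setting == "auto":
        return "cuda" if torch.cuda.is_available() else "cpu"
    return setting  # explicit "cpu" or "cuda"

print(resolve_device("auto"))
```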
## Performance Tips

### For CPU (Laptop) Users

- **CPU Optimization**: The default configuration already targets CPU usage
- **Image Size**: Keep images at or below 1024x1024 pixels for faster processing (see the resize sketch after this list)
- **Memory Management**: Close other applications to free up RAM
- **Model Caching**: The model is cached after the first load
- **Processing Time**: Expect 10-30 seconds per image on CPU
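A minimal resize sketch using Pillow (an assumption; any imaging library works):

```python
from PIL import Image  # assumption: Pillow is installed

# Downscale to at most 1024x1024 before uploading; thumbnail()
# resizes in place, preserves aspect ratio, and never upscales.
img = Image.open('image.jpg')
img.thumbnail((1024, 1024))
img.save('image_small.jpg')
```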
### For GPU Users

- **GPU Usage**: Set `DEEPSEEK_OCR_DEVICE=cuda` for GPU acceleration
- **Batch Processing**: Process multiple images efficiently
- **Memory Management**: Monitor GPU memory usage for large images (see the sketch after this list)
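A rough way to check GPU memory from Python, again assuming a PyTorch backend:

```python
import torch

# Report current GPU memory usage on device 0.
if torch.cuda.is_available():
    used = torch.cuda.memory_allocated() / 1024**2
    total = torch.cuda.get_device_properties(0).total_memory / 1024**2
    print(f"GPU memory: {used:.0f} MiB used / {total:.0f} MiB total")
```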
## Error Handling

The system includes comprehensive error handling for:

- File type validation
- Model loading errors
- OCR processing failures
- Network connectivity issues
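On the client side, a defensive calling pattern might look like this (a sketch; the server's error payload format is not specified here):

```python
import requests

try:
    with open('image.jpg', 'rb') as f:
        response = requests.post(
            'http://localhost:8000/ocr/extract-text/',
            files={'file': f},
            timeout=120,  # OCR can take tens of seconds on CPU
        )
    response.raise_for_status()  # surface 4xx/5xx as exceptions
except requests.exceptions.RequestException as exc:
    print(f"OCR request failed: {exc}")
else:
    print(response.json())
```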
## Examples

### Basic Text Extraction

```python
import requests

# Upload an image and extract its text
with open('image.jpg', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/ocr/extract-text/',
        files={'file': f}
    )

result = response.json()
print(result['extracted_text'])
```
### Query with OCR

```python
import requests

# Ask a question about previously extracted text
response = requests.post(
    'http://localhost:8000/ocr/query/',
    json={
        'query': 'What is the main topic?',
        'conversation_history': [],
        'extracted_text': 'Your extracted text here...'
    }
)
print(response.json())
```
## Troubleshooting

### Common Issues

- **Model Loading Error**: Ensure you have sufficient RAM/VRAM
- **CUDA Error**: Check GPU compatibility and drivers
- **Memory Error**: Reduce the image size or use CPU mode
- **Network Error**: Check your internet connection for the model download
### Debug Mode

Enable debug logging:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
```
## Support
For issues and questions:
- Check the logs for error messages
- Verify your environment configuration
- Test with smaller images first
- Check GPU memory usage
## License

This integration uses DeepSeek OCR, which is licensed under Apache 2.0.