---
title: DeepSeek-OCR
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - ocr
  - vision-language-model
  - document-processing
  - vllm
  - deepseek
license: mit
---

# DeepSeek-OCR with vLLM

High-performance document OCR using [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) with vLLM for efficient batch processing.

## 🚀 Quick Start with Hugging Face Jobs

Process any image dataset without needing your own GPU:

```bash
# Basic usage (Gundam mode - adaptive resolution)
hf jobs run --flavor l4x1 \
  --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  input-dataset \
  output-dataset

# Quick test with 10 samples
hf jobs run --flavor l4x1 \
  --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  your-input-dataset \
  your-output-dataset \
  --max-samples 10
```

That's it! The script will:

- ✅ Process images from your dataset
- ✅ Add OCR results as a new `markdown` column
- ✅ Push results to a new dataset with automatic documentation
- 📊 View results at: `https://huggingface.co/datasets/[your-output-dataset]`

## 📋 Features

### Model Capabilities

- 📐 **LaTeX equations** - Mathematical formulas preserved in LaTeX format
- 📊 **Tables** - Extracted and formatted as HTML/markdown
- 📝 **Document structure** - Headers, lists, and formatting maintained
- 🖼️ **Image grounding** - Spatial layout and bounding box information
- 🔍 **Complex layouts** - Multi-column and hierarchical structures
- 🌍 **Multilingual** - Supports multiple languages

### Performance

- ⚡ **vLLM AsyncEngine** - Optimized for throughput (~2500 tokens/s on A100)
- 🎯 **Multiple resolution modes** - Choose speed vs. quality
- 🔥 **Large context** - Up to 8K tokens
- 💪 **Batch optimized** - Efficient async processing

## 🎛️ Resolution Modes

| Mode | Resolution | Vision Tokens | Best For |
|------|-----------|---------------|----------|
| `tiny` | 512×512 | 64 | Fast testing, simple documents |
| `small` | 640×640 | 100 | Balanced speed/quality |
| `base` | 1024×1024 | 256 | High-quality documents |
| `large` | 1280×1280 | 400 | Maximum quality, detailed docs |
| `gundam` | Dynamic | Adaptive | Large documents, best overall |

## 💻 Usage Examples

### Basic Processing

```bash
# Default (Gundam mode)
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  my-images-dataset \
  ocr-results
```

### Fast Processing for Testing

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  large-dataset \
  test-output \
  --max-samples 100
```

### Random Sampling

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  ordered-dataset \
  random-sample \
  --max-samples 50 \
  --shuffle \
  --seed 42
```

### Custom Image Column

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  davanstrien/ufo-ColPali \
  ufo-ocr \
  --image-column image
```

### Private Output Dataset

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  private-input \
  private-output \
  --private
```

## 📝 Command-Line Options

### Required Arguments

| Argument | Description |
|----------|-------------|
| `input_dataset` | Input dataset ID from Hugging Face Hub |
| `output_dataset` | Output dataset ID for Hugging Face Hub |

### Optional Arguments

| Option | Default | Description |
|--------|---------|-------------|
| `--image-column` | `image` | Column containing images |
| `--model` | `deepseek-ai/DeepSeek-OCR` | Model to use |
| `--resolution-mode` | `gundam` | Resolution preset (tiny/small/base/large/gundam) |
| `--max-model-len` | `8192` | Maximum model context length |
| `--max-tokens` | `8192` | Maximum tokens to generate |
| `--gpu-memory-utilization` | `0.75` | GPU memory usage (0.0-1.0) |
| `--prompt` | `\n<\|grounding\|>Convert...` | Custom prompt |
| `--hf-token` | - | Hugging Face API token (or use env var) |
| `--split` | `train` | Dataset split to process |
| `--max-samples` | None | Limit samples (for testing) |
| `--private` | False | Make output dataset private |
| `--shuffle` | False | Shuffle dataset before processing |
| `--seed` | `42` | Random seed for shuffling |

## 📊 Output Format

The script adds two new columns to your dataset:

1. **`markdown`** - The OCR text in markdown format
2. **`inference_info`** - JSON metadata about the processing

### Inference Info Structure

```json
[
  {
    "column_name": "markdown",
    "model_id": "deepseek-ai/DeepSeek-OCR",
    "processing_date": "2025-10-21T12:00:00",
    "resolution_mode": "gundam",
    "base_size": 1024,
    "image_size": 640,
    "crop_mode": true,
    "prompt": "\n<|grounding|>Convert the document to markdown.",
    "max_tokens": 8192,
    "gpu_memory_utilization": 0.75,
    "max_model_len": 8192,
    "script": "main.py",
    "script_version": "1.0.0",
    "space_url": "https://huggingface.co/spaces/davanstrien/deepseek-ocr",
    "implementation": "vllm-async (optimized)"
  }
]
```

## 🔧 Technical Details

### Architecture

- **Model**: DeepSeek-OCR (3B parameters, based on Qwen2.5-VL)
- **Inference Engine**: vLLM 0.8.5 with AsyncEngine
- **Image Preprocessing**: Custom dynamic tiling based on aspect ratio
- **Vision Encoders**: Custom CLIP + SAM encoders
- **Context Length**: Up to 8K tokens
- **Optimization**: Flash Attention 2.7.3, async batch processing

### Hardware Requirements

- **Minimum**: L4 GPU (24GB VRAM) - `--flavor l4x1`
- **Recommended**: L40S (48GB VRAM) or A10G (24GB VRAM) - `--flavor l40sx1` or `--flavor a10g-large`
- **Maximum Performance**: A100 (40GB+ VRAM) - `--flavor a100-large`

### Speed Benchmarks

| GPU | Resolution | Speed | Notes |
|-----|-----------|-------|-------|
| L4 | Tiny | ~5-8 img/s | Good for testing |
| L4 | Gundam | ~2-3 img/s | Balanced |
| A100 | Gundam | ~8-12 img/s | Production speed |
| A100 | Large | ~5-7 img/s | Maximum quality |
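### Reading the Output in Python

Once a job finishes, the output dataset can be inspected with the `datasets` library. The sketch below uses a hypothetical helper, `summarize_inference_info` (not part of this Space's code), to parse the `inference_info` JSON column documented above; the example payload is abbreviated from the structure shown earlier.

```python
import json

def summarize_inference_info(info_json: str) -> dict:
    """Map each entry's column_name to the model that produced it."""
    entries = json.loads(info_json)
    return {e["column_name"]: e["model_id"] for e in entries}

# Abbreviated payload matching the documented structure:
example = json.dumps([
    {
        "column_name": "markdown",
        "model_id": "deepseek-ai/DeepSeek-OCR",
        "resolution_mode": "gundam",
    }
])

print(summarize_inference_info(example))  # {'markdown': 'deepseek-ai/DeepSeek-OCR'}

# Against a real output dataset you would do something like:
#   from datasets import load_dataset
#   ds = load_dataset("your-output-dataset", split="train")
#   summarize_inference_info(ds[0]["inference_info"])
```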
## 📚 Example Workflows

### 1. Process Historical Documents

```bash
hf jobs run --flavor l40sx1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  historical-scans \
  historical-text \
  --resolution-mode large \
  --shuffle
```

### 2. Extract Tables from Reports

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  financial-reports \
  extracted-tables \
  --resolution-mode gundam \
  --prompt "\n<|grounding|>Convert the document to markdown."
```

### 3. Multi-language Documents

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  multilingual-docs \
  ocr-output \
  --resolution-mode base
```

## 🔗 Related Resources

- **Model**: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)
- **HF Jobs**: [Documentation](https://huggingface.co/docs/huggingface_hub/en/guides/jobs)

## 📄 License

MIT License - see the model card for details

## 🙏 Acknowledgments

- DeepSeek AI for the OCR model
- vLLM team for the inference engine
- Hugging Face for Jobs infrastructure

---

Built with ❤️ using [vLLM](https://github.com/vllm-project/vllm) and [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
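When scripting around the `--resolution-mode` flag, the preset values from the tables above can be collected into a small lookup. This is an illustrative reconstruction, not the Space's actual code: the Gundam values come from the `inference_info` example, while `crop_mode` for the fixed-resolution presets is an assumption (only Gundam mode is documented as using crops).

```python
# Hypothetical lookup (not the Space's actual code) mapping each
# --resolution-mode preset to the settings documented above.
RESOLUTION_MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    # Gundam mode tiles 640px crops over a 1024px base view
    # (values taken from the inference_info example above).
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}

def settings_for(mode: str) -> dict:
    """Return the preset settings for a resolution mode, or raise on typos."""
    if mode not in RESOLUTION_MODES:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(RESOLUTION_MODES)}")
    return RESOLUTION_MODES[mode]

print(settings_for("gundam"))  # {'base_size': 1024, 'image_size': 640, 'crop_mode': True}
```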