---
title: DeepSeek-OCR
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - ocr
  - vision-language-model
  - document-processing
  - vllm
  - deepseek
license: mit
---

# DeepSeek-OCR with vLLM

High-performance document OCR using [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) with vLLM for efficient batch processing.

## 🚀 Quick Start with Hugging Face Jobs

Process any image dataset without needing your own GPU:

```bash
# Basic usage (Gundam mode - adaptive resolution)
hf jobs run --flavor l4x1 \
  --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  input-dataset \
  output-dataset

# Quick test with 10 samples
hf jobs run --flavor l4x1 \
  --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  your-input-dataset \
  your-output-dataset \
  --max-samples 10
```

That's it! The script will:

- ✅ Process images from your dataset
- ✅ Add OCR results as a new `markdown` column
- ✅ Push results to a new dataset with automatic documentation
- 📊 View results at: `https://huggingface.co/datasets/[your-output-dataset]`

## 📋 Features

### Model Capabilities

- 📐 **LaTeX equations** - Mathematical formulas preserved in LaTeX format
- 📊 **Tables** - Extracted and formatted as HTML/markdown
- 📝 **Document structure** - Headers, lists, and formatting maintained
- 🖼️ **Image grounding** - Spatial layout and bounding box information
- 🔍 **Complex layouts** - Multi-column and hierarchical structures
- 🌍 **Multilingual** - Supports multiple languages

### Performance

- ⚡ **vLLM AsyncEngine** - Optimized for throughput (~2500 tokens/s on A100)
- 🎯 **Multiple resolution modes** - Choose speed vs. quality
- 🔥 **Large context** - Up to 8K tokens
- 💪 **Batch optimized** - Efficient async processing

## 🎛️ Resolution Modes

| Mode | Resolution | Vision Tokens | Best For |
|------|-----------|---------------|----------|
| `tiny` | 512×512 | 64 | Fast testing, simple documents |
| `small` | 640×640 | 100 | Balanced speed/quality |
| `base` | 1024×1024 | 256 | High-quality documents |
| `large` | 1280×1280 | 400 | Maximum quality, detailed docs |
| `gundam` | Dynamic | Adaptive | Large documents, best overall |

## 💻 Usage Examples

### Basic Processing

```bash
# Default (Gundam mode)
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  my-images-dataset \
  ocr-results
```

### Fast Processing for Testing

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  large-dataset \
  test-output \
  --max-samples 100
```

### Random Sampling

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  ordered-dataset \
  random-sample \
  --max-samples 50 \
  --shuffle \
  --seed 42
```

### Custom Image Column

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  davanstrien/ufo-ColPali \
  ufo-ocr \
  --image-column image
```

### Private Output Dataset

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  private-input \
  private-output \
  --private
```

## 📝 Command-Line Options

### Required Arguments

| Argument | Description |
|----------|-------------|
| `input_dataset` | Input dataset ID from Hugging Face Hub |
| `output_dataset` | Output dataset ID for Hugging Face Hub |

### Optional Arguments

| Option | Default | Description |
|--------|---------|-------------|
| `--image-column` | `image` | Column containing images |
| `--model` | `deepseek-ai/DeepSeek-OCR` | Model to use |
| `--resolution-mode` | `gundam` | Resolution preset (tiny/small/base/large/gundam) |
| `--max-model-len` | `8192` | Maximum model context length |
| `--max-tokens` | `8192` | Maximum tokens to generate |
| `--gpu-memory-utilization` | `0.75` | GPU memory usage (0.0-1.0) |
| `--prompt` | `\n<\|grounding\|>Convert...` | Custom prompt |
| `--hf-token` | - | Hugging Face API token (or use env var) |
| `--split` | `train` | Dataset split to process |
| `--max-samples` | None | Limit samples (for testing) |
| `--private` | False | Make output dataset private |
| `--shuffle` | False | Shuffle dataset before processing |
| `--seed` | `42` | Random seed for shuffling |

## 📊 Output Format

The script adds two new columns to your dataset:

1. **`markdown`** - The OCR text in markdown format
2. **`inference_info`** - JSON metadata about the processing

### Inference Info Structure

```json
[
  {
    "column_name": "markdown",
    "model_id": "deepseek-ai/DeepSeek-OCR",
    "processing_date": "2025-10-21T12:00:00",
    "resolution_mode": "gundam",
    "base_size": 1024,
    "image_size": 640,
    "crop_mode": true,
    "prompt": "\n<|grounding|>Convert the document to markdown.",
    "max_tokens": 8192,
    "gpu_memory_utilization": 0.75,
    "max_model_len": 8192,
    "script": "main.py",
    "script_version": "1.0.0",
    "space_url": "https://huggingface.co/spaces/davanstrien/deepseek-ocr",
    "implementation": "vllm-async (optimized)"
  }
]
```

## 🔧 Technical Details

### Architecture

- **Model**: DeepSeek-OCR (3B parameters, based on Qwen2.5-VL)
- **Inference Engine**: vLLM 0.8.5 with AsyncEngine
- **Image Preprocessing**: Custom dynamic tiling based on aspect ratio
- **Vision Encoders**: Custom CLIP + SAM encoders
- **Context Length**: Up to 8K tokens
- **Optimization**: Flash Attention 2.7.3, async batch processing

### Hardware Requirements

- **Minimum**: L4 GPU (24GB VRAM) - `--flavor l4x1`
- **Recommended**: L40S (48GB VRAM) or A10G (24GB VRAM) - `--flavor l40sx1` or `--flavor a10g-large`
- **Maximum Performance**: A100 (40GB+ VRAM) - `--flavor a100-large`

### Speed Benchmarks

| GPU | Resolution | Speed | Notes |
|-----|-----------|-------|-------|
| L4 | Tiny | ~5-8 img/s | Good for testing |
| L4 | Gundam | ~2-3 img/s | Balanced |
| A100 | Gundam | ~8-12 img/s | Production speed |
| A100 | Large | ~5-7 img/s | Maximum quality |
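### Reading the Output in Python

Once a job finishes, the output dataset can be inspected with the `datasets` library. The sketch below uses a hypothetical helper, `summarize_inference_info` (not part of this Space's code), to parse the `inference_info` JSON column documented above; the example payload is abbreviated from the structure shown earlier.

```python
import json

def summarize_inference_info(info_json: str) -> dict:
    """Map each entry's column_name to the model that produced it."""
    entries = json.loads(info_json)
    return {e["column_name"]: e["model_id"] for e in entries}

# Abbreviated payload matching the documented structure:
example = json.dumps([
    {
        "column_name": "markdown",
        "model_id": "deepseek-ai/DeepSeek-OCR",
        "resolution_mode": "gundam",
    }
])

print(summarize_inference_info(example))  # {'markdown': 'deepseek-ai/DeepSeek-OCR'}

# Against a real output dataset you would do something like:
#   from datasets import load_dataset
#   ds = load_dataset("your-output-dataset", split="train")
#   summarize_inference_info(ds[0]["inference_info"])
```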
## 📚 Example Workflows

### 1. Process Historical Documents

```bash
hf jobs run --flavor l40sx1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  historical-scans \
  historical-text \
  --resolution-mode large \
  --shuffle
```

### 2. Extract Tables from Reports

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  financial-reports \
  extracted-tables \
  --resolution-mode gundam \
  --prompt "\n<|grounding|>Convert the document to markdown."
```

### 3. Multi-language Documents

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  multilingual-docs \
  ocr-output \
  --resolution-mode base
```

## 🔗 Related Resources

- **Model**: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)
- **HF Jobs**: [Documentation](https://huggingface.co/docs/huggingface_hub/en/guides/jobs)

## 📄 License

MIT License - see the model card for details

## 🙏 Acknowledgments

- DeepSeek AI for the OCR model
- vLLM team for the inference engine
- Hugging Face for Jobs infrastructure

---

Built with ❤️ using [vLLM](https://github.com/vllm-project/vllm) and [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
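When scripting around the `--resolution-mode` flag, the preset values from the tables above can be collected into a small lookup. This is an illustrative reconstruction, not the Space's actual code: the Gundam values come from the `inference_info` example, while `crop_mode` for the fixed-resolution presets is an assumption (only Gundam mode is documented as using crops).

```python
# Hypothetical lookup (not the Space's actual code) mapping each
# --resolution-mode preset to the settings documented above.
RESOLUTION_MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    # Gundam mode tiles 640px crops over a 1024px base view
    # (values taken from the inference_info example above).
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}

def settings_for(mode: str) -> dict:
    """Return the preset settings for a resolution mode, or raise on typos."""
    if mode not in RESOLUTION_MODES:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(RESOLUTION_MODES)}")
    return RESOLUTION_MODES[mode]

print(settings_for("gundam"))  # {'base_size': 1024, 'image_size': 640, 'crop_mode': True}
```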