Transformers
Safetensors
DIVEdoc
docvqa
distillation
VLM
document-understanding
OCR-free

1 Introduction

DIVE-Doc is a VLM architecture designed as a trade-off between end-to-end lightweight architectures and LVLMs for the DocVQA task. It processes its inputs end-to-end, without relying on external tools such as OCR: given a document image and a question, it returns an answer.

2 Model Summary

DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs. Where the first category pairs a lightweight visual encoder with a lightweight language decoder, and LVLMs pair a large visual encoder with a large decoder, DIVE-Doc combines a small visual encoder with a large decoder to balance model size and performance. It is built by distilling the SigLIP-400m visual encoder of PaliGemma into a small hierarchical Swin transformer initialized with the weights of Donut, while reusing the original Gemma decoder. This reduces the visual encoder's parameter count by 80%. The model is then finetuned with LoRA adapters, which have been merged into the base model using PEFT's merge_and_unload. Trained on the DocVQA dataset for both the distillation and finetuning steps, this strategy allows DIVE-Doc to remain competitive with LVLMs while outperforming lightweight architectures.
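
For reference, folding LoRA adapters back into the base weights follows PEFT's standard merge_and_unload workflow. The sketch below is a generic illustration of that step, not something users need to repeat for this checkpoint (the published weights are already merged); the ./lora_adapter directory is a placeholder path, not a file shipped with this repository.

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model (custom code hosted on the Hub).
base_model = AutoModelForCausalLM.from_pretrained(
    "JayRay5/DIVE-Doc-ARD-HRes", trust_remote_code=True
)

# Attach a trained LoRA adapter (placeholder path) and fold its low-rank
# updates into the base weights, returning a plain transformers model.
peft_model = PeftModel.from_pretrained(base_model, "./lora_adapter")
merged_model = peft_model.merge_and_unload()

# The merged model can be saved and used without the peft dependency.
merged_model.save_pretrained("./dive-doc-merged")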

3 Quick Start

Direct Use

From the Transformers library

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("JayRay5/DIVE-Doc-ARD-HRes", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("JayRay5/DIVE-Doc-ARD-HRes", trust_remote_code=True)

# Load the document image and write a question about it.
image = Image.open("your_image_document_path/image_document.png").convert("RGB")
question_example = "What is the name of the author?"

# Preprocess the image-question pair and match the model's device and dtype.
inputs = (
    processor(text=question_example, images=image, return_tensors="pt", padding=True)
    .to(model.device)
    .to(model.dtype)
)
input_length = inputs["input_ids"].shape[-1]

# Greedy decoding of the answer.
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Keep only the newly generated tokens and decode them into the answer.
generated_ids = output_ids[0][input_length:]
answer = processor.decode(generated_ids, skip_special_tokens=True)

print(answer)

From the GitHub repository

Installation
git clone https://github.com/JayRay5/DIVE-Doc.git
cd DIVE-Doc
conda create -n dive-doc-env python=3.11.5
conda activate dive-doc-env
pip install -r requirements.txt
Inference example using the model repository and Gradio

In app.py, set the path variable to "JayRay5/DIVE-Doc-ARD-HRes":

if __name__ == "__main__":
    path = "JayRay5/DIVE-Doc-ARD-HRes"
    app(path) 

Then run:

python app.py

This will start a Gradio web interface where you can interact with the model.
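
For reference, a minimal Gradio wrapper along these lines is sketched below; the repository's app.py is the authoritative version, and the answer_question helper name here is illustrative only. The loading and generation code mirrors the Quick Start above.

import gradio as gr
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

path = "JayRay5/DIVE-Doc-ARD-HRes"
processor = AutoProcessor.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)

def answer_question(image: Image.Image, question: str) -> str:
    # Preprocess, generate greedily, and return only the new tokens.
    inputs = (
        processor(text=question, images=image, return_tensors="pt", padding=True)
        .to(model.device)
        .to(model.dtype)
    )
    input_length = inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    return processor.decode(output_ids[0][input_length:], skip_special_tokens=True)

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="DIVE-Doc DocVQA demo",
)
demo.launch()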

Uses

Direct Use

This model is designed to answer a question about a single-page document image. It was trained mainly on the DocVQA dataset, which consists of industry documents.

Downstream Use

This model can be finetuned on other document VQA datasets, such as InfographicVQA, to improve its performance on infographic documents. A sketch of such a finetuning setup is given below.
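
As a starting point, a further round of LoRA finetuning could look like the following minimal outline, assuming the PEFT library and a preprocessed dataset of (image, question, answer) triples. The target module names and hyperparameters are illustrative assumptions, not values validated for DIVE-Doc.

from transformers import AutoProcessor, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

path = "JayRay5/DIVE-Doc-ARD-HRes"
processor = AutoProcessor.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)

# Illustrative LoRA configuration; target_modules must match the
# attention projection names actually used by the decoder.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, train with your own loop or transformers.Trainer on
# examples preprocessed with the processor, then merge the adapters
# with merge_and_unload as shown in the Model Summary section.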

Citation

BibTeX:

@inproceedings{Bencharef_2025_ICCV,
    author    = {Bencharef, Rayane and Rahiche, Abderrahmane and Cheriet, Mohamed},
    title     = {DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {7547-7556}
}

Contact

rayane.bencharef.1@ens.etsmtl.ca
