---
library_name: transformers
license: mit
datasets:
- lmms-lab/DocVQA
---

## 1 Introduction
DIVE-Doc is a VLM architecture for the DocVQA task, built as a trade-off between end-to-end lightweight architectures and LVLMs.
It processes its inputs end to end, without relying on external tools such as OCR.
It takes a document image and a question as input and returns an answer. <br>
- **Repository:** [GitHub](https://github.com/JayRay5/DIVE-Doc)
- **Paper:** [DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA](https://openaccess.thecvf.com/content/ICCV2025W/VisionDocs/html/Bencharef_DIVE-Doc_Downscaling_foundational_Image_Visual_Encoder_into_hierarchical_architecture_for_ICCVW_2025_paper.html)


## 2 Model Summary
DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs.
Whereas the former pair a lightweight visual encoder with a lightweight language decoder, and LVLMs pair a large visual encoder with a large decoder,
DIVE-Doc combines a small visual encoder with a large decoder to balance model size and performance.
It is built by distilling the [SigLIP-400m](https://arxiv.org/abs/2303.15343) visual encoder of [PaliGEMMA](https://arxiv.org/abs/2407.07726) into a small hierarchical [Swin transformer](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper) initialized with the weights of [Donut](https://link.springer.com/chapter/10.1007/978-3-031-19815-1_29), while reusing the original [GEMMA](https://arxiv.org/abs/2403.08295) decoder. 
This enables DIVE-Doc to reduce its visual encoder's parameter count by 80%.
Moreover, the model is finetuned using LoRA adapters, which have been merged into the base model using [merge_and_unload](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload).
Both the distillation and finetuning steps are trained on the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html), a strategy that allows DIVE-Doc to be competitive with LVLMs while outperforming lightweight architectures.
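
The adapter-merging step referenced above follows the standard PEFT workflow. Below is a minimal, generic sketch of that step for reference only (the shipped checkpoint already contains the merged weights); the auto class and paths are placeholders, not artifacts of this repository.
```python
# Generic sketch of merging LoRA adapters into a base model with PEFT.
# The auto class and the paths below are hypothetical placeholders; the
# actual training and merging code lives in the GitHub repository.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base_model")
model = PeftModel.from_pretrained(base, "path/to/lora_adapters")

# Fold the LoRA weights into the base weights and drop the adapter wrappers,
# leaving a plain transformers model that can be saved and shared.
model = model.merge_and_unload()
model.save_pretrained("path/to/merged_model")
```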


## 3 Quick Start

### Installation
```bash
git clone https://github.com/JayRay5/DIVE-Doc.git
cd DIVE-Doc
conda create -n dive-doc-env python=3.11.5
conda activate dive-doc-env
pip install -r requirements.txt
```
### Inference example using the model repository and gradio
In `app.py`, set the `path` variable to `"JayRay5/DIVE-Doc-FRD"`:
```python
if __name__ == "__main__":
    path = "JayRay5/DIVE-Doc-FRD"
    app(path) 
```
Then run:
```bash
python app.py
```
This will start a [gradio](https://www.gradio.app/) web interface where you can use the model.
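
If you prefer to query the running app from Python rather than through the browser, the `gradio_client` package can be used roughly as below. This is a hedged sketch: the endpoint, input order, and file-handling call depend on how `app.py` defines the interface and on the installed `gradio_client` version, so adjust accordingly.
```python
# Hedged sketch: querying the locally running Gradio app programmatically.
# The argument order and file handling are assumptions that depend on how
# app.py builds the interface; older gradio_client versions accept a plain
# file path instead of handle_file().
from gradio_client import Client, handle_file

client = Client("http://127.0.0.1:7860")      # default local Gradio address
answer = client.predict(
    handle_file("document_page.png"),         # placeholder document image
    "What is the invoice number?",            # placeholder question
)
print(answer)
```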
## Uses


### Direct Use

This model is designed to answer a question about a single-page document image and is trained mostly on industrial documents from the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html). 


### Downstream Use 

This model can be finetuned on other document VQA datasets, such as [InfographicVQA](https://openaccess.thecvf.com/content/WACV2022/html/Mathew_InfographicVQA_WACV_2022_paper.html), to improve its performance on infographic documents.
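
As an illustration only (not the authors' training recipe), attaching fresh LoRA adapters for such a finetuning run could look roughly like the sketch below; the loading call, target module names, and hyperparameters are assumptions, and the actual training scripts are in the GitHub repository.
```python
# Illustrative sketch: attaching fresh LoRA adapters for further finetuning
# on another document-VQA dataset. The auto class, target module names, and
# hyperparameters are assumptions; see the GitHub repository for the scripts
# actually used by DIVE-Doc.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

model = AutoModel.from_pretrained("JayRay5/DIVE-Doc-FRD", trust_remote_code=True)

lora_config = LoraConfig(
    r=16,                                # adapter rank (illustrative value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], # decoder attention projections (assumed names)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the adapter weights are trainable
```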



## Citation

**BibTeX:**

```bibtex
@inproceedings{Bencharef_2025_ICCV,
    author    = {Bencharef, Rayane and Rahiche, Abderrahmane and Cheriet, Mohamed},
    title     = {DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {7547-7556}
}
```
## Contact

rayane.bencharef.1@ens.etsmtl.ca