Add initial model card for UniFilter-Qwen2.5-1.5B

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +76 -0
README.md ADDED
@@ -0,0 +1,76 @@
---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# UniFilter-Qwen2.5-1.5B: A Unified Multimodal Data Quality Classifier

This repository contains **UniFilter-Qwen2.5-1.5B**, an efficient multimodal large language model (MLLM) that serves as a unified multimodal data quality classifier. It is presented in the paper [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162).

UniFilter generates quality scores for both image-text caption data and interleaved document data. These scores can be used to filter for high-quality data and thereby strengthen the capabilities of pre-trained MLLMs.

* **Project Page:** [https://victorwz.github.io/UniFilter](https://victorwz.github.io/UniFilter)
* **Code Repository:** [https://github.com/Victorwz/UniFilter](https://github.com/Victorwz/UniFilter)

## Introduction

UniFilter is a unified multimodal data quality classifier for high-quality multimodal data filtering. It generates quality scores for both image-text caption data and interleaved document data, and filtering on these scores can significantly strengthen the capabilities of pre-trained MLLMs.

The code repository supports:
- synthetic data generation
- UniFilter training
- quality score generation with [UniFilter-Qwen2.5-1.5B](https://huggingface.co/weizhiwang/UniFilter-Qwen2.5-1.5B)

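
As a rough illustration of the filtering step mentioned in the introduction, the sketch below keeps only the samples whose quality score meets a cutoff. The record layout (`uid` and `score` keys), the `filter_by_score` helper, and the threshold value are all hypothetical and chosen for illustration; the actual output format of the scoring scripts is not described in this model card.

```python
from typing import Iterable


def filter_by_score(records: Iterable[dict], threshold: float) -> list[dict]:
    """Keep records whose quality score is at or above the threshold.

    Assumes each record is a dict with a numeric "score" key
    (a hypothetical layout, not the scoring scripts' real output format).
    """
    return [r for r in records if r["score"] >= threshold]


# Toy scores, made up for illustration only.
samples = [
    {"uid": "a", "score": 3.7},
    {"uid": "b", "score": 0.4},
    {"uid": "c", "score": 2.1},
]
kept = filter_by_score(samples, threshold=2.0)
print([r["uid"] for r in kept])  # ['a', 'c']
```

A suitable threshold depends on the score scale and on how much data should survive filtering, so it is best treated as a tunable setting.
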
## Installation

If you only need quality score generation, install just the customized LLaVA package.

```Shell
conda create -n unifilter python=3.10
conda activate unifilter
# Run from the directory that contains the customized LLaVA package (the cloned code repository)
pip install -e LLaVA
pip install flash-attn==2.5.2 --no-build-isolation
```

## Quality Score Generation

UniFilter can generate quality scores for both caption data and interleaved document data. The examples below are taken from the GitHub README.

### Caption Data Quality Scoring
```Shell
python data_scoring/data_quality_classifier_caption_scoring.py \
    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
    --tar-file-path data/datacomp/medium_vanilla_filter \
    --gpu-id 0 \
    --batch-size 4 \
    --tars-per-gpu 256
```

### Interleaved Data Quality Scoring
```Shell
python data_scoring/data_quality_classifier_interleaved_scoring.py \
    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
    --tar-file-path data/OBELICS/obelics_webdataset \
    --gpu-id 0 \
    --batch-size 1 \
    --tars-per-gpu 128
```

Parameters to note:
- `--gpu-id`: the index of the current machine, used for large-scale score generation across multiple machines
- `--model-path`: path to the UniFilter model checkpoint
- `--tar-file-path`: path to the webdataset tars of image-text caption data or interleaved document data
- `--tars-per-gpu`: the number of webdataset tars a single GPU runs inference on

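
To make the interplay of `--gpu-id` and `--tars-per-gpu` concrete, here is one way such a split could work: worker `i` takes the `i`-th consecutive block of `tars_per_gpu` files from the sorted tar list. This is an assumption about the partitioning scheme, not a description of the scripts' actual logic, and `tars_for_worker` is a hypothetical helper.

```python
def tars_for_worker(tar_files: list[str], gpu_id: int, tars_per_gpu: int) -> list[str]:
    """Return the contiguous block of tar files assigned to one worker.

    Assumed scheme (not confirmed by the repository): worker `gpu_id`
    handles files [gpu_id * tars_per_gpu, (gpu_id + 1) * tars_per_gpu).
    """
    if gpu_id < 0 or tars_per_gpu <= 0:
        raise ValueError("gpu_id must be >= 0 and tars_per_gpu must be > 0")
    ordered = sorted(tar_files)  # fixed order so every worker computes the same split
    start = gpu_id * tars_per_gpu
    return ordered[start:start + tars_per_gpu]


# Ten made-up tar names, split into blocks of four per worker.
tars = [f"{i:08d}.tar" for i in range(10)]
print(tars_for_worker(tars, gpu_id=0, tars_per_gpu=4))  # first four tars
print(tars_for_worker(tars, gpu_id=2, tars_per_gpu=4))  # remaining two tars
```

Under this assumed scheme, covering N tars takes ceil(N / tars_per_gpu) workers. With 256 tars per worker, for example, a dataset of 1,000 tars would need four workers with indices 0 through 3.
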
## Citation

Please cite our paper if you find this repository interesting or helpful:

```bibtex
@article{UniFilter,
  title={Train a Unified Multimodal Data Quality Classifier with Synthetic Data},
  author={Wang, Weizhi and Lin, Rongmei and Li, Shiyang and Lockard, Colin and Sarkhel, Ritesh and Lokegaonkar, Sanket and Shang, Jingbo and Yan, Xifeng and Zalmout, Nasser and Li, Xian},
  journal={arXiv preprint arXiv:2510.15162},
  year={2025}
}
```