Commit 5cb3ef3 (verified) by nielsr (HF Staff) · 1 parent: 5bf951b

Add initial model card for UniFilter-Qwen2.5-1.5B

This PR adds the initial model card for the UniFilter-Qwen2.5-1.5B model, which is a Unified Multimodal Data Quality Classifier.

It includes:
- A link to the paper: [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162)
- The `license: cc-by-nc-4.0` metadata field
- The `library_name: transformers` metadata field, which enables the "how to use" widget and automated code snippets
- The `pipeline_tag: image-text-to-text` metadata field, for better discoverability on the Hub
- Links to the project page and GitHub repository for further details
- Sample usage snippets for quality score generation, taken directly from the GitHub README

Please review and merge this PR if everything looks good.

Files changed (1)
  1. README.md +76 -0
README.md ADDED
@@ -0,0 +1,76 @@
+ ---
+ license: cc-by-nc-4.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ ---
+
+ # UniFilter-Qwen2.5-1.5B: A Unified Multimodal Data Quality Classifier
+
+ This repository contains the **UniFilter-Qwen2.5-1.5B** model, an efficient multimodal large language model (MLLM) designed as a Unified Multimodal Data Quality Classifier. It is presented in the paper [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162).
+
+ UniFilter generates quality scores for both image-text caption data and interleaved document data. These scores can be used to filter for high-quality data and thereby strengthen the capabilities of pre-trained MLLMs.
+
+ * **Project Page:** [https://victorwz.github.io/UniFilter](https://victorwz.github.io/UniFilter)
+ * **Code Repository:** [https://github.com/Victorwz/UniFilter](https://github.com/Victorwz/UniFilter)
+
+ ## Introduction
+ UniFilter is a Unified Multimodal Data Quality Classifier for high-quality multimodal data filtering. It generates quality scores for both image-text caption data and interleaved document data, and these scores can then be used to filter for high-quality data, significantly strengthening the capability of pre-trained MLLMs.
+
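A minimal sketch of how such scores might drive filtering, assuming each sample has already been scored and stored as a dictionary. The `quality_score` field name, the threshold value, and the `filter_by_quality` helper are hypothetical and not part of the UniFilter code base; the scores in the toy records are made up.

```python
from typing import Iterable


def filter_by_quality(samples: Iterable[dict], threshold: float,
                      score_key: str = "quality_score") -> list[dict]:
    """Keep samples whose score is at least `threshold`.

    `score_key` is a hypothetical field name; samples without a score are dropped.
    """
    kept = []
    for sample in samples:
        score = sample.get(score_key)
        if score is not None and score >= threshold:
            kept.append(sample)
    return kept


# Toy records with made-up scores, only to show the call.
scored = [
    {"caption": "a dog on a beach", "quality_score": 3.6},
    {"caption": "click here", "quality_score": 0.4},
    {"caption": "img_0042.jpg"},  # never scored, so it is dropped
]
high_quality = filter_by_quality(scored, threshold=3.0)
```

In practice the threshold would be chosen from the score distribution of the dataset being filtered rather than fixed in advance.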
+ This repo supports:
+ - synthetic data generation
+ - UniFilter training
+ - quality score generation with [UniFilter-Qwen2.5-1.5B](https://huggingface.co/weizhiwang/UniFilter-Qwen2.5-1.5B)
+
+ ## Installation
+
+ If you only need quality score generation, install just the customized LLaVA package.
+
+ ```Shell
+ conda create -n unifilter python=3.10
+ conda activate unifilter
+ pip install -e LLaVA
+ pip install flash-attn==2.5.2 --no-build-isolation
+ ```
+
+ ## Quality Score Generation
+
+ The UniFilter model can generate quality scores for both caption data and interleaved document data. The examples below are taken directly from the GitHub README.
+
+ ### Caption Data Quality Scoring
+ ```Shell
+ python data_scoring/data_quality_classifier_caption_scoring.py \
+     --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
+     --tar-file-path data/datacomp/medium_vanilla_filter \
+     --gpu-id 0 \
+     --batch-size 4 \
+     --tars-per-gpu 256
+ ```
+
+ ### Interleaved Data Quality Scoring
+ ```Shell
+ python data_scoring/data_quality_classifier_interleaved_scoring.py \
+     --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
+     --tar-file-path data/OBELICS/obelics_webdataset \
+     --gpu-id 0 \
+     --batch-size 1 \
+     --tars-per-gpu 128
+ ```
+
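When the same script has to be started for several worker indices, the command line can be assembled programmatically. The sketch below only builds argument lists from the flags shown in the README commands; the `build_scoring_command` helper and the choice of two workers are assumptions made for illustration, and nothing is executed.

```python
import shlex


def build_scoring_command(script: str, model_path: str, tar_file_path: str,
                          gpu_id: int, batch_size: int, tars_per_gpu: int) -> list[str]:
    """Return the argv list for one scoring process (hypothetical helper)."""
    return [
        "python", script,
        "--model-path", model_path,
        "--tar-file-path", tar_file_path,
        "--gpu-id", str(gpu_id),
        "--batch-size", str(batch_size),
        "--tars-per-gpu", str(tars_per_gpu),
    ]


commands = [
    build_scoring_command(
        script="data_scoring/data_quality_classifier_caption_scoring.py",
        model_path="weizhiwang/UniFilter-Qwen2.5-1.5B",
        tar_file_path="data/datacomp/medium_vanilla_filter",
        gpu_id=index,
        batch_size=4,
        tars_per_gpu=256,
    )
    for index in range(2)  # two workers, chosen only for illustration
]

for argv in commands:
    # Print a shell-safe command line; pass argv to subprocess.run to launch it.
    print(shlex.join(argv))
```

Building an argv list instead of a single string avoids the line-continuation and quoting pitfalls of hand-edited multi-line shell commands.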
+ Parameters to note:
+ - `--gpu-id`: for large-scale score generation across multiple machines, the index of the machine
+ - `--model-path`: path to the UniFilter model checkpoint
+ - `--tar-file-path`: path to the webdataset tars of image-text caption data or interleaved document data
+ - `--tars-per-gpu`: the number of webdataset tars a single GPU runs inference on
+
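One plausible reading of `--gpu-id` and `--tars-per-gpu` is that each worker takes a contiguous slice of the sorted tar list. The sketch below illustrates that arithmetic only; it is an assumption about the partitioning, not the scoring scripts' confirmed logic, and `tars_for_worker` and the shard names are hypothetical.

```python
def tars_for_worker(all_tars: list[str], gpu_id: int, tars_per_gpu: int) -> list[str]:
    """Return the contiguous slice of tars assigned to worker `gpu_id`.

    Assumed scheme: worker i handles tars [i * tars_per_gpu, (i + 1) * tars_per_gpu).
    """
    if gpu_id < 0 or tars_per_gpu <= 0:
        raise ValueError("gpu_id must be >= 0 and tars_per_gpu must be > 0")
    start = gpu_id * tars_per_gpu
    return sorted(all_tars)[start:start + tars_per_gpu]


# Ten made-up shard names split across workers of four tars each.
shards = [f"{i:05d}.tar" for i in range(10)]
first = tars_for_worker(shards, gpu_id=0, tars_per_gpu=4)
last = tars_for_worker(shards, gpu_id=2, tars_per_gpu=4)
```

Under this scheme the last worker may receive fewer than `tars_per_gpu` tars, and any index past the end of the list receives none.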
+ ## Citation
+
+ Please cite our paper if you find this repository interesting or helpful:
+
+ ```bibtex
+ @article{UniFilter,
+ title={Train a Unified Multimodal Data Quality Classifier with Synthetic Data},
+ author={Wang, Weizhi and Lin, Rongmei and Li, Shiyang and Lockard, Colin and Sarkhel, Ritesh and Lokegaonkar, Sanket and Shang, Jingbo and Yan, Xifeng and Zalmout, Nasser and Li, Xian},
+ journal={arXiv preprint arXiv:2510.15162},
+ year={2025}
+ }
+ ```