weizhiwang/UniFilter-Qwen3-0.6B

+---
+license: cc-by-nc-4.0
+library_name: transformers
+pipeline_tag: image-text-to-text
+---
+# UniFilter-Qwen2.5-1.5B: A Unified Multimodal Data Quality Classifier
+This repository contains the **UniFilter-Qwen2.5-1.5B** model, an efficient multimodal large language model (MLLM) that serves as a unified multimodal data quality classifier. It is presented in the paper [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162).
+UniFilter is capable of generating quality scores for both image-text caption data and interleaved document data. These scores can be used for high-quality data filtering to strengthen the capabilities of pre-trained MLLMs.
+*   **Project Page:** [https://victorwz.github.io/UniFilter](https://victorwz.github.io/UniFilter)
+*   **Code Repository:** [https://github.com/Victorwz/UniFilter](https://github.com/Victorwz/UniFilter)
+## Introduction
+UniFilter is a unified multimodal data quality classifier: a single model assigns quality scores to both image-text caption data and interleaved document data. Filtering pre-training data on these scores strengthens the capabilities of the resulting MLLMs.
+The code repository supports:
+ - synthetic data generation
+ - UniFilter training
+ - quality score generation with [UniFilter-Qwen2.5-1.5B](https://huggingface.co/weizhiwang/UniFilter-Qwen2.5-1.5B).
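To make the filtering step concrete, here is a minimal sketch of thresholding on quality scores. It assumes a hypothetical tab-separated file `scores.tsv` with one `sample_id<TAB>score` row per sample and a hypothetical cutoff of 3. Neither the file layout nor the cutoff comes from the UniFilter scripts, so both would need adapting to the actual score output.

```shell
#!/bin/sh
# Hypothetical input: one "sample_id<TAB>score" row per sample.
# This layout is an assumption, not the format written by the UniFilter scripts.
printf 'a001\t4\nb002\t1\nc003\t3\n' > scores.tsv

# Print the IDs of samples whose score is at or above the cutoff.
# $1 = scores file, $2 = cutoff (both hypothetical).
filter_by_score() {
    awk -F'\t' -v t="$2" '$2 + 0 >= t + 0 { print $1 }' "$1"
}

filter_by_score scores.tsv 3 > kept_ids.txt
cat kept_ids.txt
```

The kept IDs can then be used to select the matching samples from the original dataset.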
+## Installation
+If you only need quality score generation, install just the customized LLaVA package.
+```Shell
+conda create -n unifilter python=3.10
+conda activate unifilter
+pip install -e LLaVA
+pip install flash-attn==2.5.2 --no-build-isolation
+```
+## Quality Score Generation
+The UniFilter model can be used to generate quality scores for both caption data and interleaved document data. Below are examples directly from the GitHub README.
+### Caption Data Quality Scoring
+```Shell
+python data_scoring/data_quality_classifier_caption_scoring.py \
+    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
+    --tar-file-path data/datacomp/medium_vanilla_filter \
+    --gpu-id 0 \
+    --batch-size 4 \
+    --tars-per-gpu 256
+```
+### Interleaved Data Quality Scoring
+```Shell
+python data_scoring/data_quality_classifier_interleaved_scoring.py \
+    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
+    --tar-file-path data/OBELICS/obelics_webdataset \
+    --gpu-id 0 \
+    --batch-size 1 \
+    --tars-per-gpu 128
+```
+Parameters to note:
+- `--gpu-id`: index of the current machine when large-scale score generation is split across multiple machines
+- `--model-path`: path to the UniFilter model checkpoint
+- `--tar-file-path`: path to the webdataset tars holding image-text caption data or interleaved document data
+- `--tars-per-gpu`: number of webdataset tars a single GPU runs inference on
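As a rough illustration of how `--gpu-id` and `--tars-per-gpu` might combine in a large run, the sketch below works out how many workers are needed and prints one command per worker index without running anything. The total of 1024 tars is a hypothetical figure, and the idea that worker `i` handles its own block of `--tars-per-gpu` tars is an assumption about the scoring script, not something confirmed here.

```shell
#!/bin/sh
# Hypothetical numbers: 1024 tars in total, 256 tars handled per worker.
total_tars=1024
tars_per_gpu=256

# Ceiling division: number of workers needed to cover every tar.
num_workers=$(( (total_tars + tars_per_gpu - 1) / tars_per_gpu ))

# Print (do not run) one caption-scoring command per worker index.
# Assumption: each --gpu-id value selects a separate block of tars.
print_commands() {
    i=0
    while [ "$i" -lt "$num_workers" ]; do
        echo "python data_scoring/data_quality_classifier_caption_scoring.py" \
             "--model-path weizhiwang/UniFilter-Qwen2.5-1.5B" \
             "--tar-file-path data/datacomp/medium_vanilla_filter" \
             "--gpu-id $i --batch-size 4 --tars-per-gpu $tars_per_gpu"
        i=$((i + 1))
    done
}

print_commands
```

Each printed line can be run on the machine with the matching index once the assumption about how tars are divided has been checked against the script.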
+## Citation
+Please cite our paper if you find this repository interesting or helpful:
+```bibtex
+@article{UniFilter,
+   title={Train a Unified Multimodal Data Quality Classifier with Synthetic Data},
+   author={Wang, Weizhi and Lin, Rongmei and Li, Shiyang and Lockard, Colin and Sarkhel, Ritesh and Lokegaonkar, Sanket and Shang, Jingbo and Yan, Xifeng and Zalmout, Nasser and Li, Xian},
+   journal={arXiv preprint arXiv:2510.15162},
+   year={2025}
+ }
+```