Add initial model card for UniFilter-Qwen2.5-1.5B
This PR adds the initial model card for the UniFilter-Qwen2.5-1.5B model, a Unified Multimodal Data Quality Classifier.
It includes:
- Linking to the paper: [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162)
- Adding the `license: cc-by-nc-4.0`
- Adding the `library_name: transformers` to enable the "how to use" widget and automated code snippets.
- Adding the `pipeline_tag: image-text-to-text` for better discoverability on the Hub.
- Including links to the project page and GitHub repository for further details.
- Providing sample usage code snippets for quality score generation, taken directly from the GitHub README so they remain accurate and functional.
Please review and merge this PR if everything looks good.
@@ -0,0 +1,76 @@
---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# UniFilter-Qwen2.5-1.5B: A Unified Multimodal Data Quality Classifier

This repository contains the **UniFilter-Qwen2.5-1.5B** model, an efficient multimodal large language model (MLLM) designed as a Unified Multimodal Data Quality Classifier. It is presented in the paper [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162).

UniFilter generates quality scores for both image-text caption data and interleaved document data. These scores can be used for high-quality data filtering to strengthen the capabilities of pre-trained MLLMs.

* **Project Page:** [https://victorwz.github.io/UniFilter](https://victorwz.github.io/UniFilter)
* **Code Repository:** [https://github.com/Victorwz/UniFilter](https://github.com/Victorwz/UniFilter)

## Introduction

UniFilter is a Unified Multimodal Data Quality Classifier for high-quality multimodal data filtering. It generates quality scores for both image-text caption data and interleaved document data, and these scores can be used to filter for high-quality data that significantly strengthens the capabilities of pre-trained MLLMs.

This repo supports:
- synthetic data generation
- UniFilter training
- quality score generation with [UniFilter-Qwen2.5-1.5B](https://huggingface.co/weizhiwang/UniFilter-Qwen2.5-1.5B)

## Installation

If you only need quality score generation, install just the customized LLaVA package:

```Shell
conda create -n unifilter python=3.10
conda activate unifilter
pip install -e LLaVA
pip install flash-attn==2.5.2 --no-build-isolation
```
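
After installation, a quick import check can confirm the environment is usable. This is a minimal sanity-check sketch; it assumes the customized LLaVA package installs under the usual `llava` module name:

```python
# Minimal post-install sanity check (assumes the customized LLaVA package
# exposes the usual `llava` module name; adjust if the fork renames it).
import torch
import llava

print("CUDA available:", torch.cuda.is_available())
print("LLaVA package loaded from:", llava.__file__)
```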

## Quality Score Generation

The UniFilter model can be used to generate quality scores for both caption data and interleaved document data. The examples below are taken directly from the GitHub README.

### Caption Data Quality Scoring

```Shell
python data_scoring/data_quality_classifier_caption_scoring.py \
    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
    --tar-file-path data/datacomp/medium_vanilla_filter \
    --gpu-id 0 \
    --batch-size 4 \
    --tars-per-gpu 256
```
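
The exact output format of the scoring script is defined in the GitHub repository. As a rough post-processing illustration only, assuming the per-sample scores are collected into a JSONL file with a `quality_score` field (a hypothetical layout, not the script's actual output), threshold-based filtering could look like:

```python
import json

# Hypothetical post-processing sketch: keep only caption samples whose
# UniFilter quality score clears a chosen threshold. The file name and the
# `quality_score` field are illustrative assumptions, not the actual output
# format of data_quality_classifier_caption_scoring.py.
THRESHOLD = 3.0  # pick per dataset after inspecting the score distribution

kept = []
with open("caption_scores.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if record["quality_score"] >= THRESHOLD:
            kept.append(record)

print(f"kept {len(kept)} of the scored samples")
```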

### Interleaved Data Quality Scoring

```Shell
python data_scoring/data_quality_classifier_interleaved_scoring.py \
    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
    --tar-file-path data/OBELICS/obelics_webdataset \
    --gpu-id 0 \
    --batch-size 1 \
    --tars-per-gpu 128
```
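
Before launching a large interleaved scoring run, it can help to confirm the shards are readable. A minimal sketch using the `webdataset` package (the shard filename below is illustrative) is:

```python
import webdataset as wds

# Peek at the first few samples of one interleaved shard; the shard path is
# illustrative and should point at one of the tars under --tar-file-path.
dataset = wds.WebDataset("data/OBELICS/obelics_webdataset/00000.tar")
for i, sample in enumerate(dataset):
    print(sample["__key__"], sorted(sample.keys()))
    if i >= 2:
        break
```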

Parameters to note:
- `--gpu-id`: for large-scale score generation across multiple machines, the index of the machine to run on
- `--model-path`: path to the UniFilter model checkpoint
- `--tar-file-path`: path to the webdataset tars of image-text caption data or interleaved document data (a packing sketch follows this list)
- `--tars-per-gpu`: the number of webdataset tars for a single GPU to run inference on
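
If your caption data is not yet packed as webdataset tars, a minimal packing sketch with `webdataset.ShardWriter` is shown below; the paths and the `jpg`/`txt` field names are assumptions for illustration, so check the GitHub repository for the exact sample layout the scoring scripts expect.

```python
import webdataset as wds

# Illustrative sketch: pack (image, caption) pairs into webdataset tars that
# can then be passed via --tar-file-path. Paths and field names are assumptions,
# not necessarily the exact layout the scoring scripts expect.
samples = [
    ("images/000001.jpg", "a dog running on the beach"),
    ("images/000002.jpg", "a bowl of fresh fruit on a table"),
]

# The output directory (data/my_captions here) must already exist.
with wds.ShardWriter("data/my_captions/%05d.tar", maxcount=10000) as sink:
    for idx, (image_path, caption) in enumerate(samples):
        with open(image_path, "rb") as img:
            sink.write({
                "__key__": f"{idx:06d}",
                "jpg": img.read(),
                "txt": caption,
            })
```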

## Citation

Please cite our paper if you find this repository interesting or helpful:

```bibtex
@article{UniFilter,
  title={Train a Unified Multimodal Data Quality Classifier with Synthetic Data},
  author={Wang, Weizhi and Lin, Rongmei and Li, Shiyang and Lockard, Colin and Sarkhel, Ritesh and Lokegaonkar, Sanket and Shang, Jingbo and Yan, Xifeng and Zalmout, Nasser and Li, Xian},
  journal={arXiv preprint arXiv:2510.15162},
  year={2025}
}
```