Add initial model card for UniFilter-Qwen2.5-1.5B

This PR adds the initial model card for the UniFilter-Qwen2.5-1.5B model, which is a Unified Multimodal Data Quality Classifier.

It includes:
- Linking to the paper: [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162)
- Adding the `license: cc-by-nc-4.0` metadata.
- Adding the `library_name: transformers` metadata to enable the "how to use" widget and automated code snippets.
- Adding the `pipeline_tag: image-text-to-text` metadata for better discoverability on the Hub.
- Including links to the project page and GitHub repository for further details.
- Providing sample usage code snippets for quality score generation directly from the GitHub README, ensuring they are accurate and functional.

Please review and merge this PR if everything looks good.

Files changed (1)

README.md +76 -0

README.md ADDED

	@@ -0,0 +1,76 @@

+---
+license: cc-by-nc-4.0
+library_name: transformers
+pipeline_tag: image-text-to-text
+---
+# UniFilter-Qwen2.5-1.5B: A Unified Multimodal Data Quality Classifier
+This repository contains the **UniFilter-Qwen2.5-1.5B** model, which is an efficient multimodal large language model (MLLM) designed as a Unified Multimodal Data Quality Classifier. It is presented in the paper [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162).
+UniFilter is capable of generating quality scores for both image-text caption data and interleaved document data. These scores can be used for high-quality data filtering to strengthen the capabilities of pre-trained MLLMs.
+*   **Project Page:** [https://victorwz.github.io/UniFilter](https://victorwz.github.io/UniFilter)
+*   **Code Repository:** [https://github.com/Victorwz/UniFilter](https://github.com/Victorwz/UniFilter)
+## Introduction
+UniFilter is a Unified Multimodal Data Quality Classifier for high-quality multimodal data filtering. It generates quality scores for both image-text caption data and interleaved document data, and these scores can then be used to filter training data and strengthen the capabilities of pre-trained MLLMs.
+The code repository supports:
+ - synthetic data generation
+ - UniFilter training
+ - quality score generation with [UniFilter-Qwen2.5-1.5B](https://huggingface.co/weizhiwang/UniFilter-Qwen2.5-1.5B).
+## Installation
+If you only need quality score generation, install only the customized LLaVA package.
+```Shell
+conda create -n unifilter python=3.10
+conda activate unifilter
+pip install -e LLaVA
+pip install flash-attn==2.5.2 --no-build-isolation
+```
+## Quality Score Generation
+The UniFilter model can be used to generate quality scores for both caption data and interleaved document data. Below are examples directly from the GitHub README.
+### Caption Data Quality Scoring
+```Shell
+python data_scoring/data_quality_classifier_caption_scoring.py \
+    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
+    --tar-file-path data/datacomp/medium_vanilla_filter \
+    --gpu-id 0 \
+    --batch-size 4 \
+    --tars-per-gpu 256
+```
+### Interleaved Data Quality Scoring
+```Shell
+python data_scoring/data_quality_classifier_interleaved_scoring.py \
+    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
+    --tar-file-path data/OBELICS/obelics_webdataset \
+    --gpu-id 0 \
+    --batch-size 1 \
+    --tars-per-gpu 128
+```
+Parameters to note:
+- `--gpu-id`: the index of the current machine when large-scale score generation is distributed across multiple machines
+- `--model-path`: path to the UniFilter model checkpoint
+- `--tar-file-path`: path to the webdataset tars containing image-text caption data or interleaved document data
+- `--tars-per-gpu`: the number of webdataset tars that a single GPU runs inference on
+## Citation
+Please cite our paper if you find this repository interesting or helpful:
+```bibtex
+@article{UniFilter,
+   title={Train a Unified Multimodal Data Quality Classifier with Synthetic Data},
+   author={Wang, Weizhi and Lin, Rongmei and Li, Shiyang and Lockard, Colin and Sarkhel, Ritesh and Lokegaonkar, Sanket and Shang, Jingbo and Yan, Xifeng and Zalmout, Nasser and Li, Xian},
+   journal={arXiv preprint arXiv:2510.15162},
+   year={2025}
+ }
+```
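
As a rough illustration of how the `--gpu-id` and `--tars-per-gpu` parameters documented in the README above might fit together, the dry-run sketch below computes how many workers would be needed to cover a dataset and prints one caption-scoring command per worker without executing anything. It assumes that each worker, identified by its `--gpu-id`, covers `--tars-per-gpu` tars; the README does not spell out the script's actual sharding logic. The dataset size of 1000 tars and the helper names `num_workers` and `print_commands` are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: prints scoring commands, executes none of them.
# ASSUMPTION: worker i (passed as --gpu-id i) covers --tars-per-gpu tars,
# so ceil(total_tars / tars_per_gpu) workers cover the whole dataset.
# The scoring script's real sharding behavior is not confirmed here.

# Number of workers needed. $1 = total tars, $2 = tars per worker.
num_workers() {
    # Integer ceiling division.
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Print one caption-scoring command per worker index.
# $1 = total tars, $2 = tars per worker.
print_commands() {
    workers=$(num_workers "$1" "$2")
    gpu_id=0
    while [ "$gpu_id" -lt "$workers" ]; do
        echo "python data_scoring/data_quality_classifier_caption_scoring.py" \
             "--model-path weizhiwang/UniFilter-Qwen2.5-1.5B" \
             "--tar-file-path data/datacomp/medium_vanilla_filter" \
             "--gpu-id $gpu_id --batch-size 4 --tars-per-gpu $2"
        gpu_id=$((gpu_id + 1))
    done
}

# Hypothetical dataset of 1000 tars, 256 tars per worker.
print_commands 1000 256
```

Each printed line could then be run on its own machine or handed to a job scheduler, once the sharding assumption has been checked against the scoring script.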