Create README.md
        README.md
    ADDED
    
    | @@ -0,0 +1,119 @@ | |
| 1 | 
            +
            ---
         | 
| 2 | 
            +
            base_model:
         | 
| 3 | 
            +
            - Qwen/Qwen3-0.6B
         | 
| 4 | 
            +
            - google/siglip2-so400m-patch14-384
         | 
| 5 | 
            +
            datasets:
         | 
| 6 | 
            +
            - weizhiwang/unifilter_train_data
         | 
| 7 | 
            +
            license: mit
         | 
| 8 | 
            +
            pipeline_tag: image-text-to-text
         | 
| 9 | 
            +
            library_name: transformers
         | 
| 10 | 
            +
            ---
         | 
| 11 | 
            +
             | 
| 12 | 
            +
            # UniFilter
         | 
| 13 | 
            +
             | 
| 14 | 
            +
            Official implementation of [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162), accepted to EMNLP 2025 Findings.
         | 
| 15 | 
            +
             | 
| 16 | 
            +
            - 📝 [Paper](https://huggingface.co/papers/2510.15162)
         | 
| 17 | 
            +
            - 🌐 [Project Page](https://victorwz.github.io/UniFilter)
         | 
| 18 | 
            +
            - 💻 [GitHub Repository](https://github.com/Victorwz/UniFilter)
         | 
| 19 | 
            +
             | 
| 20 | 
            +
            ## Release
         | 
| 21 | 
            +
            - [10/21/2025] 🔥 We released the UniFilter model at [UniFilter-Qwen3-0.6B](https://huggingface.co/weizhiwang/UniFilter-Qwen3-0.6B). It is built on Qwen3-0.6B and SigLIP-2 and achieves better classification performance with far fewer model parameters.
         | 
| 22 | 
            +
            - [10/19/2025] 🔥 We released the UniFilter model at [UniFilter-Qwen2.5-1.5B](https://huggingface.co/weizhiwang/UniFilter-Qwen2.5-1.5B). Empowered by a strong 1.5B LLM backbone, it achieves the best inference speed for quality score generation and the best classification accuracy.
         | 
| 23 | 
            +
             | 
| 24 | 
            +
             | 
| 25 | 
            +
            ## Introduction
         | 
| 26 | 
            +
            UniFilter is a Unified Multimodal Data Quality Classifier for High-Quality Multimodal Data Filtering. It generates quality scores for both image-text caption data and interleaved document data. These scores can then be used to filter for high-quality data, which significantly strengthens the capability of pre-trained MLLMs.
         | 
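As a rough illustration of how such scores might drive filtering, the sketch below keeps only the samples at or above a cutoff. The `quality_score` key, the sample records, and the threshold value are all hypothetical; they are not taken from the UniFilter scoring scripts.

```python
from typing import Iterable


def filter_by_quality(samples: Iterable[dict], threshold: float) -> list[dict]:
    """Keep samples whose quality score meets the threshold.

    The "quality_score" key and the threshold are hypothetical names
    chosen for this sketch, not the format written by UniFilter.
    """
    return [s for s in samples if s["quality_score"] >= threshold]


# Hypothetical scored samples; real scores come from the scoring scripts.
scored = [
    {"id": "a", "quality_score": 2.7},
    {"id": "b", "quality_score": 0.4},
    {"id": "c", "quality_score": 1.9},
]
kept = filter_by_quality(scored, threshold=1.5)
print([s["id"] for s in kept])
```

In practice the cutoff would be tuned to the target retention ratio of the dataset being filtered.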
| 27 | 
            +
             | 
| 28 | 
            +
            This repo supports:
         | 
| 29 | 
            +
             - synthetic data generation
         | 
| 30 | 
            +
             - UniFilter training
         | 
| 31 | 
            +
             - quality score generation with [UniFilter-Qwen2.5-1.5B](https://huggingface.co/weizhiwang/UniFilter-Qwen2.5-1.5B).
         | 
| 32 | 
            +
             | 
| 33 | 
            +
            ## Installation
         | 
| 34 | 
            +
            If you only need quality score generation, install just the customized LLaVA package.
         | 
| 35 | 
            +
             | 
| 36 | 
            +
            ```Shell
         | 
| 37 | 
            +
            conda create -n unifilter python=3.10
         | 
| 38 | 
            +
            conda activate unifilter
         | 
| 39 | 
            +
            pip install -e LLaVA
         | 
| 40 | 
            +
            pip install flash-attn==2.5.2 --no-build-isolation
         | 
| 41 | 
            +
            ```
         | 
| 42 | 
            +
             | 
| 43 | 
            +
            ## Synthetic Data Generation for UniFilter Training
         | 
| 44 | 
            +
            We instruct Claude-3 or Claude-3.5 to generate the desired (multimodal data example, quality score) pairs across 4 designated quality levels.
         | 
| 45 | 
            +
            The synthetic data generation scripts are:
         | 
| 46 | 
            +
             - [claude_sonnet_caption_data_generation.py](data_prepare/caption_data_scripts/claude_sonnet_caption_data_generation.py)
         | 
| 47 | 
            +
             - [claude_sonnet_interleaved_data_generation.py](data_prepare/interleaved_data_scripts/claude_sonnet_interleaved_data_generation.py)
         | 
| 48 | 
            +
             | 
| 49 | 
            +
            ## Data Preparation for UniFilter Training
         | 
| 50 | 
            +
            UniFilter is trained on a large-scale set of (multimodal data example, quality score) pairs, which contains both caption data and interleaved document data. The synthetic multimodal example-score paired data are available at [UniFilter-Post-Train-Data](https://huggingface.co/datasets/weizhiwang/unifilter_train_data).
         | 
| 51 | 
            +
             | 
| 52 | 
            +
            ## UniFilter Training
         | 
| 53 | 
            +
            We developed the UniFilter training and scoring codebase on top of the [LLaVA-Unified](https://github.com/Victorwz/LLaVA-Unified) repo, which is adapted from LLaVA and adds support for recent LLMs and vision encoders.
         | 
| 54 | 
            +
            <!-- An additional [LlavaPhi3Classifier](LLaVA/llava/model/language_model/llava_phi3.py#235) class is customized as the model class for UniFilter. -->
         | 
| 55 | 
            +
             | 
| 56 | 
            +
            UniFilter consists of three modules: the vision encoder, the visual projector, and the LLM backbone. Unlike an MLLM, the LLM backbone has no language modeling head; we replace it with a score generation head. The modules are specified with the following arguments:
         | 
| 57 | 
            +
            - `--mm_projector_type`: visual projector, e.g. `aapool_mlp`, an average-pooling visual projector that produces 144 tokens per image
         | 
| 58 | 
            +
            - `--vision_tower`: vision encoder, e.g. SigLIP-SO-400M with 384px resolution
         | 
| 59 | 
            +
            - `--model_name_or_path`: LLM backbone, e.g. Qwen2.5-0.5B-Instruct
         | 
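To make the 144-token figure concrete, here is a minimal sketch of how average pooling could reduce a patch grid to 144 visual tokens. It assumes a 384px input with 14px patches yields a 27x27 grid that is pooled to 12x12 using the bin rule of PyTorch-style adaptive average pooling. The function name and the single-channel, pure-Python grid are illustrative only; the actual `aapool_mlp` implementation may differ.

```python
def adaptive_avg_pool_2d(grid: list[list[float]], out_size: int) -> list[list[float]]:
    """Average-pool a square grid down to out_size x out_size.

    Bin i covers [floor(i * n / out), ceil((i + 1) * n / out)), so
    neighbouring bins may overlap by one cell when n is not a multiple
    of out_size.
    """
    n = len(grid)
    bounds = [
        ((i * n) // out_size, ((i + 1) * n + out_size - 1) // out_size)
        for i in range(out_size)
    ]
    pooled = []
    for r0, r1 in bounds:
        row = []
        for c0, c1 in bounds:
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


# Assumption: 384px image, 14px patches -> 27 patches per side (729 tokens).
patches_per_side = 384 // 14
grid = [[1.0] * patches_per_side for _ in range(patches_per_side)]

# Assumption: the projector pools to a 12 x 12 grid, i.e. 144 tokens.
pooled = adaptive_avg_pool_2d(grid, out_size=12)
num_tokens = len(pooled) * len(pooled[0])
print(patches_per_side, num_tokens)
```

Pooling before the MLP shortens the visual sequence roughly fivefold under these assumptions, which reduces the per-sample cost of the LLM backbone during scoring.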
| 60 | 
            +
             | 
| 61 | 
            +
             | 
| 62 | 
            +
            ### Visual Projector Pre-Training (Stage 1)
         | 
| 63 | 
            +
             | 
| 64 | 
            +
            Please download the 558K subset of the [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) caption dataset.
         | 
| 65 | 
            +
             | 
| 66 | 
            +
            Training script with DeepSpeed ZeRO-2: [`pretrain.sh`](scripts/v1_5/pretrain.sh).
         | 
| 67 | 
            +
             | 
| 68 | 
            +
             | 
| 69 | 
            +
            ### UniFilter Classifier Training (Stage 2)
         | 
| 70 | 
            +
             | 
| 71 | 
            +
             | 
| 72 | 
            +
            Training script with DeepSpeed ZeRO-3: [`train_classifier.sh`](scripts/v1_5/train_classifier.sh).
         | 
| 73 | 
            +
             | 
| 74 | 
            +
            The training script uploads metrics to wandb. The checkpoint with the highest quality classification accuracy on the validation sets is saved as the best UniFilter model.
         | 
| 75 | 
            +
             | 
| 76 | 
            +
             | 
| 77 | 
            +
            ## Quality Score Generation
         | 
| 78 | 
            +
             | 
| 79 | 
            +
            ### Caption Data Quality Scoring
         | 
| 80 | 
            +
            ```Shell
         | 
| 81 | 
            +
            python data_scoring/data_quality_classifier_caption_scoring.py \
         | 
| 82 | 
            +
                --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
         | 
| 83 | 
            +
                --tar-file-path data/datacomp/medium_vanilla_filter \
         | 
| 84 | 
            +
                --gpu-id 0 \
         | 
| 85 | 
            +
                --batch-size 4 \
         | 
| 86 | 
            +
                --tars-per-gpu 256
         | 
| 87 | 
            +
            ```
         | 
| 88 | 
            +
             | 
| 89 | 
            +
            ### Interleaved Data Quality Scoring
         | 
| 90 | 
            +
            ```Shell
         | 
| 91 | 
            +
            python data_scoring/data_quality_classifier_interleaved_scoring.py \
         | 
| 92 | 
            +
                --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
         | 
| 93 | 
            +
                --tar-file-path data/OBELICS/obelics_webdataset \
         | 
| 94 | 
            +
                --gpu-id 0 \
         | 
| 95 | 
            +
                --batch-size 1 \
         | 
| 96 | 
            +
                --tars-per-gpu 128
         | 
| 97 | 
            +
            ```
         | 
| 98 | 
            +
             | 
| 99 | 
            +
            Parameters to note:
         | 
| 100 | 
            +
            - `--gpu-id`: for large-scale score generation across multiple machines, the index of the current machine
         | 
| 101 | 
            +
            - `--model-path`: path to the UniFilter model checkpoint
         | 
| 102 | 
            +
            - `--tar-file-path`: path to the webdataset image-text caption data or interleaved document data tars
         | 
| 103 | 
            +
            - `--tars-per-gpu`: the number of webdataset tars for a single GPU to run inference on
         | 
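One plausible way these two parameters interact is that each worker index selects a contiguous slice of the tar list. The sketch below illustrates that reading only. The helper `tars_for_worker`, the shard names, and the slicing scheme are assumptions, not the behavior confirmed by the scoring scripts.

```python
def tars_for_worker(all_tars: list[str], gpu_id: int, tars_per_gpu: int) -> list[str]:
    """Return the contiguous slice of tar files for one worker index.

    Assumed partitioning: worker k handles tars
    [k * tars_per_gpu, (k + 1) * tars_per_gpu).
    """
    start = gpu_id * tars_per_gpu
    return all_tars[start:start + tars_per_gpu]


# Hypothetical shard names; real names depend on how the webdataset was built.
all_tars = [f"{i:08d}.tar" for i in range(600)]

first = tars_for_worker(all_tars, gpu_id=0, tars_per_gpu=256)
last = tars_for_worker(all_tars, gpu_id=2, tars_per_gpu=256)
print(len(first), first[0], len(last), last[0])
```

Under this scheme, covering N tars requires launching ceil(N / tars_per_gpu) workers with consecutive `--gpu-id` values, and the final worker may receive fewer tars than the others.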
| 104 | 
            +
             | 
| 105 | 
            +
            ## Citation
         | 
| 106 | 
            +
             | 
| 107 | 
            +
            Please cite our paper if you find this repository interesting or helpful:
         | 
| 108 | 
            +
            ```bibtex
         | 
| 109 | 
            +
            @article{UniFilter,
         | 
| 110 | 
            +
               title={Train a Unified Multimodal Data Quality Classifier with Synthetic Data},
         | 
| 111 | 
            +
               author={Wang, Weizhi and Lin, Rongmei and Li, Shiyang and Lockard, Colin and Sarkhel, Ritesh and Lokegaonkar, Sanket and Shang, Jingbo and Yan, Xifeng and Zalmout, Nasser and Li, Xian},
         | 
| 112 | 
            +
               journal={arXiv preprint arXiv:2510.15162},
         | 
| 113 | 
            +
               year={2025}
         | 
| 114 | 
            +
             }
         | 
| 115 | 
            +
            ```
         | 
| 116 | 
            +
             | 
| 117 | 
            +
            ## Acknowledgement
         | 
| 118 | 
            +
             | 
| 119 | 
            +
            - [LLaVA](https://github.com/haotian-liu/LLaVA): the codebase we built upon for UniFilter training.
         | 
