arxiv:2601.10305

DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Published on Jan 15

· Submitted by

Kaicheng Yang on Jan 16

Upvote

Authors:

Hengyu Shen ,

Tiancheng Gu ,

Zelong Sun ,

Weidong Cai ,

Ziyong Feng ,

Kaicheng Yang

Abstract

A large-scale Chinese image-text dataset called DanQing is introduced to advance vision-language pretraining, demonstrating superior performance in various downstream tasks through continual pretraining of the SigLIP2 model.

AI-generated summary

Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pretraining. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pretraining has substantially lagged behind, due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. Different from existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Common CC-BY 4.0 license.