arxiv:2601.10922

What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

Published on Jan 16 · Submitted by Rajkumar Rawal on Jan 19
Abstract

Data curation for multimodal reasoning shows that difficulty-based example selection on aligned datasets drives performance gains, while increasing dataset size mainly reduces variance and synthetic augmentation heuristics often degrade performance.

AI-generated summary

We study data curation for multimodal reasoning through the NeurIPS 2025 Data Curation for Vision-Language Reasoning (DCVLR) challenge, which isolates dataset selection by fixing the model and training protocol. Using a compact curated dataset derived primarily from Walton Multimodal Cold Start, our submission placed first in the challenge. Through post-competition ablations, we show that difficulty-based example selection on an aligned base dataset is the dominant driver of performance gains. Increasing dataset size does not reliably improve mean accuracy under the fixed training recipe, but mainly reduces run-to-run variance, while commonly used diversity and synthetic augmentation heuristics provide no additional benefit and often degrade performance. These results characterize DCVLR as a saturation-regime evaluation and highlight the central role of alignment and difficulty in data-efficient multimodal reasoning.

Community

Some key observations from the paper:

i. Difficulty-based example selection is the dominant driver of performance:
Selecting challenging but learnable examples yields the largest gains in multimodal reasoning accuracy, outperforming other curation strategies (a selection sketch follows this list).

ii. Increasing dataset size does not reliably improve mean accuracy:
Once a well-aligned base dataset is chosen, larger datasets mainly reduce run-to-run variance rather than boosting average performance.

iii. Data curation operates in a saturation regime:
Most performance improvements come from a relatively small number of carefully curated examples, with diminishing returns from adding more data.

iv. Common diversity heuristics provide little or no benefit:
Techniques such as clustering-based diversity, category balancing, and synthetic augmentation often fail to improve performance and can even degrade accuracy.

v. Alignment between dataset, benchmark, and base model is crucial:
Strong alignment amplifies the effectiveness of difficulty filtering and explains why compact, well-aligned datasets can outperform larger but less aligned ones.
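
The paper's selection code is not reproduced here, so the following is only a minimal sketch of what difficulty-based filtering typically looks like, assuming difficulty is proxied by the base model's failure rate over repeated sampled answers. The `model.generate` interface, the `Example` fields, and the band thresholds `lo`/`hi` are all hypothetical illustrations, not the authors' actual implementation.

```python
# Minimal sketch of difficulty-based example selection.
# All names (Example, model.generate, lo/hi thresholds) are hypothetical;
# the DCVLR submission's actual scoring and thresholds may differ.
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    image_path: str
    answer: str


def difficulty_score(model, example: Example, n_samples: int = 8) -> float:
    """Proxy difficulty as the model's failure rate over sampled answers.

    0.0 = always answered correctly (too easy to teach anything),
    1.0 = never answered correctly (possibly mislabeled or unlearnable).
    """
    wrong = sum(
        model.generate(example.question, example.image_path) != example.answer
        for _ in range(n_samples)
    )
    return wrong / n_samples


def select_hard_but_learnable(model, pool, lo=0.25, hi=0.9, budget=2000):
    """Keep examples the model sometimes fails on.

    Drops examples it always solves (no training signal) and examples it
    never solves (likely noise), then keeps the hardest ones up to `budget`.
    """
    scored = [(difficulty_score(model, ex), ex) for ex in pool]
    kept = [(s, ex) for s, ex in scored if lo <= s <= hi]
    kept.sort(key=lambda t: t[0], reverse=True)  # hardest first
    return [ex for _, ex in kept[:budget]]
```

The band `[lo, hi]` is one way to encode "challenging but learnable": the upper cutoff discards examples the base model never solves, which the paper's ablations suggest is where synthetic or noisy data would otherwise creep in. Applied to a pool derived from a well-aligned source such as Walton Multimodal Cold Start, this kind of filter yields the compact curated dataset the summary describes.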
