Submitted by lkdhy 108 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm OpenMOSS (SII, FNLP) 2 1
Submitted by vyokky 8 GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents Microsoft 1
Submitted by h-otsuka 5 The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms · 6 authors 1
Submitted by jihanyang 3 Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts NYU VisionX 1
Submitted by spapi 3 How to Evaluate Speech Translation with Source-Aware Neural MT Metrics · 5 authors 1
Submitted by taesiri 2 Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots ByteDance Seed 1
Submitted by mucai 2 Contamination Detection for VLMs using Multi-Modal Semantic Perturbation University of Wisconsin - Madison 1