Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation Paper • 1909.04696 • Published Sep 10, 2019
GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation Paper • 2505.13441 • Published May 19 • 1
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts Paper • 2511.04655 • Published Nov 6 • 7
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding Paper • 2511.04668 • Published Nov 6 • 4
SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models Paper • 2412.07755 • Published Dec 10, 2024 • 2
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24, 2024 • 63
Internet Explorer: Targeted Representation Learning on the Open Web Paper • 2302.14051 • Published Feb 27, 2023 • 1
COLA: How to adapt vision-language models to Compose Objects Localized with Attributes? Paper • 2305.03689 • Published May 5, 2023 • 3