kaizuberbuehler's Collections

Vision Language Models

BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 26

TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper • 2404.12803 • Published • 30

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • 2404.13013 • Published • 31

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper • 2404.06512 • Published • 30

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper • 2404.05719 • Published • 83

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Paper • 2404.05726 • Published • 23

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Paper • 2404.03413 • Published • 28

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Paper • 2311.16502 • Published • 37

Kosmos-2: Grounding Multimodal Large Language Models to the World
Paper • 2306.14824 • Published • 34

CogVLM: Visual Expert for Pretrained Language Models
Paper • 2311.03079 • Published • 28

Pegasus-v1 Technical Report
Paper • 2404.14687 • Published • 33

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 57

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
Paper • 2404.16375 • Published • 18

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper • 2404.16790 • Published • 10

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper • 2404.16994 • Published • 36

What matters when building vision-language models?
Paper • 2405.02246 • Published • 103

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Paper • 2405.21075 • Published • 26

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Paper • 2406.04325 • Published • 75

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 51

OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 41

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Paper • 2406.09411 • Published • 19

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Paper • 2406.09403 • Published • 23

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper • 2406.08707 • Published • 17

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Paper • 2406.11833 • Published • 63

VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper • 2406.11816 • Published • 25

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Paper • 2406.09961 • Published • 55

Needle In A Multimodal Haystack
Paper • 2406.07230 • Published • 54

Wolf: Captioning Everything with a World Summarization Framework
Paper • 2407.18908 • Published • 32

Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model
Paper • 2408.00754 • Published • 24

OmniParser for Pure Vision Based GUI Agent
Paper • 2408.00203 • Published • 25

LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper • 2408.10188 • Published • 52

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 51

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
Paper • 2409.01071 • Published • 27

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 78

NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 74

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Paper • 2409.18125 • Published • 34

OmniBench: Towards The Future of Universal Omni-Language Models
Paper • 2409.15272 • Published • 30

Progressive Multimodal Reasoning via Active Retrieval
Paper • 2412.14835 • Published • 73

Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 147

Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Paper • 2412.08737 • Published • 54

OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper • 2412.01169 • Published • 13

PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper • 2412.03555 • Published • 133

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper • 2411.17465 • Published • 90

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Paper • 2411.14794 • Published • 13

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 86

LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 129

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
Paper • 2411.07461 • Published • 23

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Paper • 2411.06176 • Published • 45

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Paper • 2411.04996 • Published • 51

Analyzing The Language of Visual Tokens
Paper • 2411.05001 • Published • 24

DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation
Paper • 2411.04999 • Published • 18

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Paper • 2501.01904 • Published • 33

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Paper • 2501.08326 • Published • 33

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper • 2501.06186 • Published • 65

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Paper • 2501.05510 • Published • 43

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Paper • 2501.04003 • Published • 27

Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Paper • 2501.09012 • Published • 10

Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Paper • 2501.09755 • Published • 36

Do generative video models learn physical principles from watching videos?
Paper • 2501.09038 • Published • 34

FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper • 2501.09747 • Published • 27

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper • 2501.12380 • Published • 85

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Paper • 2501.12368 • Published • 45

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Paper • 2501.13106 • Published • 90

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Paper • 2501.13826 • Published • 25

Temporal Preference Optimization for Long-Form Video Understanding
Paper • 2501.13919 • Published • 23

PixelWorld: Towards Perceiving Everything as Pixels
Paper • 2501.19339 • Published • 17

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
Paper • 2502.04328 • Published • 30

Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Paper • 2502.03738 • Published • 11

Scaling Pre-training to One Hundred Billion Data for Vision Language Models
Paper • 2502.07617 • Published • 29

CoS: Chain-of-Shot Prompting for Long Video Understanding
Paper • 2502.06428 • Published • 10

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Paper • 2502.09560 • Published • 35

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Paper • 2502.09621 • Published • 28

Exploring the Potential of Encoder-free Architectures in 3D LMMs
Paper • 2502.09620 • Published • 26

mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
Paper • 2502.08468 • Published • 15

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
Paper • 2502.09696 • Published • 43

PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
Paper • 2502.14282 • Published • 29

Qwen2.5-VL Technical Report
Paper • 2502.13923 • Published • 207

Soundwave: Less is More for Speech-Text Alignment in LLMs
Paper • 2502.12900 • Published • 85

Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
Paper • 2502.16033 • Published • 18

VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
Paper • 2502.18906 • Published • 12

Token-Efficient Long Video Understanding for Multimodal LLMs
Paper • 2503.04130 • Published • 96

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Paper • 2503.01743 • Published • 89

Visual-RFT: Visual Reinforcement Fine-Tuning
Paper • 2503.01785 • Published • 84

EgoLife: Towards Egocentric Life Assistant
Paper • 2503.03803 • Published • 46

Unified Video Action Model
Paper • 2503.00200 • Published • 14

Unified Reward Model for Multimodal Understanding and Generation
Paper • 2503.05236 • Published • 123

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Paper • 2503.07536 • Published • 88

MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
Paper • 2503.07365 • Published • 61

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
Paper • 2503.05132 • Published • 57

World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
Paper • 2503.10480 • Published • 55

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper • 2503.10291 • Published • 36

R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning
Paper • 2503.05379 • Published • 38

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Paper • 2503.06749 • Published • 31

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
Paper • 2503.07608 • Published • 23

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Paper • 2503.10615 • Published • 17

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Paper • 2503.08525 • Published • 17

VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
Paper • 2503.06492 • Published • 11

CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance
Paper • 2503.10391 • Published • 11

Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru
Paper • 2503.07587 • Published • 11

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Paper • 2410.14669 • Published • 39

VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge
Paper • 2504.10342 • Published • 10

Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
Paper • 2503.12533 • Published • 68

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper • 2503.15558 • Published • 50

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Paper • 2503.14478 • Published • 48

API Agents vs. GUI Agents: Divergence and Convergence
Paper • 2503.11069 • Published • 37

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper • 2503.12605 • Published • 35

DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
Paper • 2503.12797 • Published • 32

Cube: A Roblox View of 3D Intelligence
Paper • 2503.15475 • Published • 30

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Paper • 2503.12937 • Published • 30

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
Paper • 2503.12329 • Published • 27

Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Paper • 2503.16257 • Published • 25

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Paper • 2503.11579 • Published • 22

VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Paper • 2503.13444 • Published • 17

CLS-RL: Image Classification with Rule-Based Reinforcement Learning
Paper • 2503.16188 • Published • 11

Free-form language-based robotic reasoning and grasping
Paper • 2503.13082 • Published • 11

Qwen2.5-Omni Technical Report
Paper • 2503.20215 • Published • 166

Video-R1: Reinforcing Video Reasoning in MLLMs
Paper • 2503.21776 • Published • 79

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
Paper • 2503.19757 • Published • 51

LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Paper • 2503.19990 • Published • 35

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
Paper • 2503.19622 • Published • 31

Judge Anything: MLLM as a Judge Across Any Modality
Paper • 2503.17489 • Published • 23

Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
Paper • 2503.18013 • Published • 20

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
Paper • 2503.18923 • Published • 14

Can Large Vision Language Models Read Maps Like a Human?
Paper • 2503.14607 • Published • 10

Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Paper • 2504.00883 • Published • 66

Towards Physically Plausible Video Generation via VLM Planning
Paper • 2503.23368 • Published • 40

Unicorn: Text-Only Data Synthesis for Vision Language Model Training
Paper • 2503.22655 • Published • 39

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
Paper • 2503.24376 • Published • 38

Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
Paper • 2504.00595 • Published • 36

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
Paper • 2504.02587 • Published • 32

Scaling Analysis of Interleaved Speech-Text Language Models
Paper • 2504.02398 • Published • 31

Scaling Language-Free Visual Representation Learning
Paper • 2504.01017 • Published • 32

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
Paper • 2504.00502 • Published • 25

DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Paper • 2503.23573 • Published • 12

SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 200

OmniSVG: A Unified Scalable Vector Graphics Generation Model
Paper • 2504.06263 • Published • 180

Kimi-VL Technical Report
Paper • 2504.07491 • Published • 132

One-Minute Video Generation with Test-Time Training
Paper • 2504.05298 • Published • 110

Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Paper • 2504.05599 • Published • 85

An Empirical Study of GPT-4o Image Generation Capabilities
Paper • 2504.05979 • Published • 64

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
Paper • 2504.07956 • Published • 47

Scaling Laws for Native Multimodal Models
Paper • 2504.07951 • Published • 29

VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
Paper • 2504.02949 • Published • 21

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
Paper • 2504.07934 • Published • 20

Towards Visual Text Grounding of Multimodal Large Language Model
Paper • 2504.04974 • Published • 16

Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)
Paper • 2504.03151 • Published • 15

MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
Paper • 2504.03641 • Published • 14

V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
Paper • 2504.06148 • Published • 13

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Paper • 2504.06958 • Published • 12

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Paper • 2504.10479 • Published • 300

Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability
Paper • 2504.08003 • Published • 49

ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Paper • 2504.10514 • Published • 48

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Paper • 2504.08837 • Published • 43

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
Paper • 2504.13169 • Published • 39

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Paper • 2504.09925 • Published • 38

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Paper • 2504.07615 • Published • 33

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Paper • 2504.10068 • Published • 30

SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Paper • 2504.11468 • Published • 30

Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Paper • 2504.10465 • Published • 27

ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering
Paper • 2504.05506 • Published • 25

NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Paper • 2504.13055 • Published • 19

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Paper • 2504.13180 • Published • 19

Breaking the Data Barrier -- Building GUI Agents Through Task Generalization
Paper • 2504.10127 • Published • 17

TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Paper • 2504.09641 • Published • 16

VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Paper • 2504.09130 • Published • 12

Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
Paper • 2504.08727 • Published • 12

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
Paper • 2504.15279 • Published • 77

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Paper • 2504.15271 • Published • 66

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Paper • 2504.17432 • Published • 39

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
Paper • 2504.17207 • Published • 30

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
Paper • 2504.15280 • Published • 25

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
Paper • 2504.15415 • Published • 22