OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents Paper • 2510.24563 • Published 6 days ago • 22
What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT Paper • 2509.19284 • Published Sep 23 • 22
UItron: Foundational GUI Agent with Advanced Perception and Planning Paper • 2508.21767 • Published Aug 29 • 12
Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models Paper • 2508.02120 • Published Aug 4 • 19
Attention Basin: Why Contextual Position Matters in Large Language Models Paper • 2508.05128 • Published Aug 7 • 4
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents Paper • 2507.19478 • Published Jul 25 • 30
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning Paper • 2507.00432 • Published Jul 1 • 79
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs Paper • 2506.14245 • Published Jun 17 • 42
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents Paper • 2506.11763 • Published Jun 13 • 71
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? Paper • 2506.11928 • Published Jun 13 • 23
Hidden in plain sight: VLMs overlook their visual representations Paper • 2506.08008 • Published Jun 9 • 7
Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity Paper • 2506.09250 • Published Jun 10 • 27
Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead Paper • 2504.00294 • Published Mar 31 • 10
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders Paper • 2503.18878 • Published Mar 24 • 119
Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking Paper • 2503.19855 • Published Mar 25 • 29
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning Paper • 2503.21620 • Published Mar 27 • 62