MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28 • 170
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge Paper • 2509.07968 • Published Sep 9 • 14
Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet Paper • 2509.06861 • Published Sep 8 • 8
Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning Paper • 2502.11962 • Published Feb 17 • 38
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis Paper • 2506.02096 • Published Jun 2 • 52
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models Paper • 2505.10554 • Published May 15 • 120