CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration? Paper • 2510.24505 • Published Oct 28, 2025
CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents Paper • 2511.02734 • Published Nov 4, 2025
Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models Paper • 2506.17114 • Published Jun 20, 2025
Diversity-Enhanced Reasoning for Subjective Questions Paper • 2507.20187 • Published Jul 27, 2025