On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral • Paper • 2512.04220 • Published 26 days ago
Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning • Paper • 2510.03669 • Published Oct 4