GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training • Paper • 2509.24494 • Published about 1 month ago