ExGRPO: Learning to Reason from Experience

Unearth and learn high-value experience in RLVR.

📢 News • 📖 Introduction • 🚀 Getting Started

🔧 Usage • 📊 Evaluation • ✨ Acknowledgement • 📬 Contact • 📝 Citation

📢News

[2025/10/03] ExGRPO paper is available on arXiv.

📖Introduction

Existing RLVR methods for reasoning tasks predominantly rely on on-policy optimization, which discards online rollouts after a single update, wasting valuable exploration signals and constraining scalability. We conduct a systematic analysis of experience utility in RLVR and identify question difficulty and trajectory entropy as effective online proxies for assessing experience quality. Building on these insights, we propose ExGRPO, a novel framework that strategically manages and replays high-value experiences through bucketed prioritization and mixed-policy optimization, enabling more efficient and stable RLVR training.

Key Highlights:

Experience Value Modeling: Introduces two online proxy metrics, rollout correctness and trajectory entropy, for quantifying the value of RLVR experience.
ExGRPO Framework: Built on top of GRPO, ExGRPO introduces a systematic experience management mechanism and an experience optimization objective to maximize the benefit of past explorations.
Generalization and Stability: Demonstrates broad applicability across different backbone models and mitigates training collapse of on-policy RLVR in challenging scenarios.
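As a rough illustration of the two proxy metrics named above, the sketch below computes a per-question rollout correctness and a per-trajectory entropy. The function names, the "reward > 0 means correct" convention, and averaging entropy over tokens are assumptions made for this example; the paper and the code in experience_helpers.py define the exact quantities.

```python
import math
from typing import Sequence


def rollout_correctness(rewards: Sequence[float]) -> float:
    """Fraction of rollouts for one question whose verifiable reward is positive."""
    if not rewards:
        raise ValueError("need at least one rollout")
    return sum(1 for r in rewards if r > 0) / len(rewards)


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def trajectory_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over a trajectory (averaging is an assumption here)."""
    if not step_distributions:
        raise ValueError("need at least one token distribution")
    total = sum(token_entropy(d) for d in step_distributions)
    return total / len(step_distributions)
```

For example, a question with rewards [1, 0, 1, 0] has a rollout correctness of 0.5, which would mark it as medium difficulty under this proxy.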

🚀Getting Started

Installation

You can install dependencies by running the following commands:

conda create -n exgrpo python=3.10
conda activate exgrpo
cd exgrpo
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .

Note: If you encounter issues caused by the pyairports library, please refer to this hot-fix solution.

For the flash-attn library, we use the v2.7.4.post1 release and recommend installing it from the pre-built wheel. Adjust the wheel to match the CUDA, PyTorch, and Python versions in your environment.

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
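If your environment differs from the one above, the wheel filename changes accordingly. The helper below is a small sketch that assembles a filename following the pattern of the wheel shown above; `flash_attn_wheel_name` is a hypothetical name, and the pattern is inferred from that single filename, so check the result against the flash-attention release page before downloading.

```python
import sys
from typing import Optional


def flash_attn_wheel_name(
    version: str,
    cuda_major: str,
    torch_version: str,
    cxx11_abi: bool,
    py_tag: Optional[str] = None,
) -> str:
    """Build a Linux x86_64 wheel filename in the style of the release above."""
    if py_tag is None:
        # Default to the running interpreter, e.g. "cp310" for Python 3.10.
        py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    abi = "TRUE" if cxx11_abi else "FALSE"
    return (
        f"flash_attn-{version}+cu{cuda_major}torch{torch_version}"
        f"cxx11abi{abi}-{py_tag}-{py_tag}-linux_x86_64.whl"
    )
```

Calling `flash_attn_wheel_name("2.7.4.post1", "12", "2.4", False, "cp310")` reproduces the filename used in the commands above.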

ExGRPO Plug-and-Play Modules Structure

ExGRPO extends the verl framework with plug-and-play experience modules, following a design similar to that of LUFFY. The additions are concentrated in the experience/ submodule and the trainer mix_trainer_experience.py, which together enable dynamic integration of on-policy data with collected experiences. The key modules are structured as follows:

exgrpo/verl/verl/mix_src
├── ...
├── experience
│   ├── experience_bucket_manager.py    # Abstraction of experience bucket; stats & maintenance
│   ├── weighted_bucket_sampler.py      # Probabilistic experience sampler (across/within buckets)
│   ├── experience_collate_fn.py        # Mix fresh on-policy data with experience per batch
│   ├── experience_helpers.py           # Sampling, metric computation, sample builders used by collate_fn
│   ├── experience_trainer_ops.py       # Trainer-side experience management operations
│   └── rl_dataset_with_experience.py   # Dataset class for ExGRPO training
├── ...
├── mix_trainer_experience.py           # ExGRPO Trainer
└── ...

Note: Additional training and runtime modules are largely similar to those in LUFFY, with minor modifications to components such as the rollout mechanism, checkpoint manager, and FSDP worker to better align with the requirements of ExGRPO.
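To make the "across/within buckets" sampling idea more concrete, here is a minimal sketch and not the code in weighted_bucket_sampler.py. It assumes buckets are keyed by a question's correctness rate, that buckets are weighted by a Gaussian centered on 0.5 (suggested by the `normal` default), and that the lowest-entropy trajectory is chosen within a question (suggested by the `ent` default). All names, the data layout, and the `mu`/`sigma` values are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Trajectory = Tuple[str, float]             # (response text, trajectory entropy)
Question = Tuple[str, List[Trajectory]]    # (question id, stored trajectories)
Buckets = Dict[float, List[Question]]      # correctness rate -> questions


def bucket_weights(buckets: Buckets, mu: float = 0.5, sigma: float = 1.0) -> Dict[float, float]:
    """Normalized Gaussian weight for every non-empty bucket."""
    raw = {
        acc: math.exp(-((acc - mu) ** 2) / (2 * sigma ** 2))
        for acc, questions in buckets.items()
        if questions
    }
    if not raw:
        raise ValueError("all buckets are empty")
    total = sum(raw.values())
    return {acc: w / total for acc, w in raw.items()}


def sample_experience(buckets: Buckets, rng: random.Random,
                      mu: float = 0.5, sigma: float = 1.0) -> Tuple[str, Trajectory]:
    """Pick a bucket by weight, a question uniformly, then its lowest-entropy trajectory."""
    weights = bucket_weights(buckets, mu, sigma)
    accs = list(weights)
    acc = rng.choices(accs, weights=[weights[a] for a in accs], k=1)[0]
    question_id, trajectories = rng.choice(buckets[acc])
    best = min(trajectories, key=lambda t: t[1])
    return question_id, best
```

Under these assumptions, buckets near a correctness rate of 0.5 are sampled most often, which favors medium-difficulty questions over ones that are almost always or almost never solved.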

🔧Usage

Data Preparation

First, run the data preparation script to obtain the training data in Parquet format.

cd data
python prepare_train.py --dataset_name Elliott/Openr1-Math-46k-8192 --output_file openr1.parquet

Note: Although we utilize the OpenR1 data, only the question field is used in RLVR. The ExGRPO data processing pipeline does not incorporate the external R1 trajectory during training.

Training

We provide an example script for training ExGRPO on the 46k subset of OpenR1-Math-220k. Run the following commands to start training:

  cd exp_scripts
  bash run_exgrpo.sh

For the Qwen2.5-Math-7B backbone model, we use this version. Other Qwen backbone models follow the same prompt template.

Configuration Quick Reference

Key fields read by the ExGRPO components (names reflect usage in the training scripts):

trainer.experience (bool): Enable ExGRPO training.
trainer.experience_ratio (float): Fraction of each batch taken from the experience pool in mixed training.
trainer.exp_metric (str): Metric for trajectory selection. Default: ent.
exp_bucket_manager (str|bool): Probabilistic bucket sampling method. Default: normal.
exp_is_correct (bool): Enable importance sampling correction for experiential trajectories.
experience_lbound / experience_rbound (int): Eligibility bounds on number of successes recorded per question (lbound, rbound].
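The sketch below illustrates how two of these fields could be interpreted: the half-open eligibility interval (lbound, rbound] and the split of a batch according to experience_ratio. The helper names are hypothetical, and rounding down the experience share is an assumption; the trainer may round differently or fall back to on-policy data when the experience pool is too small.

```python
from typing import Tuple


def is_eligible(num_successes: int, lbound: int, rbound: int) -> bool:
    """True if the success count lies in the half-open interval (lbound, rbound]."""
    return lbound < num_successes <= rbound


def split_batch(batch_size: int, experience_ratio: float) -> Tuple[int, int]:
    """Return (on_policy_count, experience_count) for one mixed batch."""
    if not 0.0 <= experience_ratio <= 1.0:
        raise ValueError("experience_ratio must be in [0, 1]")
    # Assumption: the experience share is rounded down to a whole sample.
    n_experience = int(batch_size * experience_ratio)
    return batch_size - n_experience, n_experience
```

With lbound=0 and rbound=6, a question with 0 recorded successes is excluded while one with exactly 6 is included, and `split_batch(128, 0.5)` yields 64 on-policy and 64 experience samples.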

📊Evaluation

Reproducing the Results

We currently support automated evaluation on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro).

You can reproduce our results by running the following commands:

ROOT= # Your Root Path
TEMPLATE=own
MODEL_PATH= # Your checkpoint Path
OUTPUT_DIR=results/

DATA=$ROOT/data/valid.id.parquet
MODEL_NAME=exgrpo+testid

mkdir -p $OUTPUT_DIR

python generate_vllm.py \
  --model_path $MODEL_PATH \
  --input_file $DATA \
  --remove_system True \
  --add_oat_evaluate True \
  --output_file $OUTPUT_DIR/$MODEL_NAME.jsonl \
  --template $TEMPLATE > $OUTPUT_DIR/$MODEL_NAME.log

Main Results

Zero RLVR on Qwen2.5-Math-7B & Continual RLVR on LUFFY

Zero RLVR on Llama3.1-8B (Base, Instruct), Qwen2.5-Math 1.5B Base, Qwen2.5-7B Instruct

Released Models

Model	Hugging Face	Base Model
ExGRPO-Qwen2.5-Math-7B-Zero	https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero	Qwen2.5-Math-7B
ExGRPO-LUFFY-7B-Continual	https://huggingface.co/rzzhan/ExGRPO-LUFFY-7B-Continual	LUFFY-Qwen-Math-7B-Zero
ExGRPO-Qwen2.5-7B-Instruct	https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-7B-Instruct	Qwen2.5-7B Instruct
ExGRPO-Qwen2.5-Math-1.5B-Zero	https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-1.5B-Zero	Qwen2.5-Math-1.5B
ExGRPO-Llama3.1-8B-Zero	https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Zero	Llama3.1-8B
ExGRPO-Llama3.1-8B-Instruct	https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Instruct	Llama3.1-8B Instruct

✨Acknowledgement

ExGRPO builds upon LUFFY, veRL, and deepscaler, and uses vLLM for inference. We use Math-Verify for the RLVR reward model. We thank the open-source community for datasets and backbones, including NuminaMath, OpenR1-Math-220k, OpenR1-Math-46k, Qwen-2.5-Math, Qwen-2.5, and the Llama-3.1 models.

📬Contact

For questions, feedback, or collaboration opportunities, feel free to reach out:

Runzhe Zhan: nlp2ct.runzhe@gmail.com
Yafu Li: yafuly@gmail.com

📝Citation

If you find our model, data, or evaluation code useful, please kindly cite our paper:

@article{zhan2025exgrpo,
      title={ExGRPO: Learning to Reason from Experience}, 
      author={Runzhe Zhan and Yafu Li and Zhi Wang and Xiaoye Qu and Dongrui Liu and Jing Shao and Derek F. Wong and Yu Cheng},
      year={2025},
      journal = {ArXiv preprint},
      volume = {2510.02245},
      url={https://arxiv.org/abs/2510.02245}, 
}
