# G²RPO: Granular GRPO for Precise Reward in Flow Models
Yibin Wang³,⁵, Yuhang Zang⁴, Jiaqi Wang⁴,⁵†, Li Niu¹†, Guangtao Zhai¹
³Fudan University  ⁴Shanghai AI Laboratory  ⁵Shanghai Innovation Institute
This model is presented in the paper *G²RPO: Granular GRPO for Precise Reward in Flow Models*. Project page: https://bujiazi.github.io/g2rpo.github.io/
## Abstract
The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDEs) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO (G²RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G²RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
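To make the advantage computation described above a bit more concrete, the sketch below shows GRPO-style group-normalized advantages being averaged across several denoising granularities. This is a minimal conceptual sketch written for this card under assumed shapes and names (e.g. `rewards_per_granularity`), not the authors' implementation; see the GitHub repository below for the actual code.

```python
import torch

def group_normalized_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage: normalize each rollout's reward by the
    mean/std of its group (rewards has shape [group_size])."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def multi_granularity_advantage(rewards_per_granularity: list[torch.Tensor]) -> torch.Tensor:
    """Conceptual multi-granularity integration: compute the group-normalized
    advantage at each denoising granularity, then average the per-granularity
    advantages into a single signal per rollout."""
    advantages = [group_normalized_advantage(r) for r in rewards_per_granularity]
    return torch.stack(advantages, dim=0).mean(dim=0)

# Toy example: rewards for a group of 4 rollouts scored at 3 granularities.
rewards = [torch.tensor([0.62, 0.71, 0.55, 0.68]),
           torch.tensor([0.60, 0.74, 0.52, 0.66]),
           torch.tensor([0.65, 0.70, 0.58, 0.69])]
print(multi_granularity_advantage(rewards))
```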
## Model
The `diffusion_pytorch_model.safetensors` checkpoint is FLUX.1-dev fine-tuned with our G²RPO framework.
The checkpoint is trained jointly with multiple reward models; see the GitHub repository below for the full training recipe.
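Purely as a hedged illustration of what joint training against several reward models involves, the snippet below combines multiple reward scores into a single scalar via a weighted sum. The model names and weights here are placeholders and do not describe the exact recipe used for this checkpoint.

```python
# Hypothetical illustration only: the reward-model names and weights are
# placeholders, not the configuration used to train this checkpoint.
def combined_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-reward-model scores for a single image."""
    return sum(weights[name] * scores[name] for name in weights)

scores = {"reward_model_a": 0.81, "reward_model_b": 0.64}
weights = {"reward_model_a": 0.5, "reward_model_b": 0.5}
print(combined_reward(scores, weights))
```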
## GitHub Link
https://github.com/bcmi/Granular-GRPO
## Inference
```python
import torch
from diffusers import FluxPipeline
from safetensors.torch import load_file

device = "cuda:0"
model_path = "ckpt/g2rpo/diffusion_pytorch_model.safetensors"  # G²RPO transformer weights
flux_path = "ckpt/flux"  # local FLUX.1-dev pipeline

# Load the FLUX.1-dev pipeline, then swap in the G²RPO-finetuned transformer weights.
pipe = FluxPipeline.from_pretrained(flux_path, use_safetensors=True, torch_dtype=torch.float16)
model_state_dict = load_file(model_path)
pipe.transformer.load_state_dict(model_state_dict, strict=True)
pipe = pipe.to(device)

prompt = "A public welfare poster has a clear dividing line in the middle of the picture. On the left is the dry and cracked land and withered trees, and on the right is the vibrant oasis and clear lake water"

image = pipe(
    prompt,
    guidance_scale=3.5,
    height=1024,
    width=1024,
    num_inference_steps=50,
    max_sequence_length=512,
).images[0]

save_path = "g2rpo.png"
image.save(save_path)
```
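If the full FP16 pipeline does not fit in GPU memory, Diffusers' standard model offloading can be enabled; this is a general Diffusers option rather than anything specific to this checkpoint.

```python
# Optional: reduce peak GPU memory by offloading idle sub-modules to the CPU.
# Use this instead of `pipe = pipe.to(device)` above; generation will be slower.
pipe.enable_model_cpu_offload()
```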
## Citation
If you find our work helpful for your research, please consider giving the repository a star and citing our paper:
```bibtex
@article{zhou2025g2rpo,
  title={G$^2$RPO: Granular GRPO for Precise Reward in Flow Models},
  author={Zhou, Yujie and Ling, Pengyang and Bu, Jiazi and Wang, Yibin and Zang, Yuhang and Wang, Jiaqi and Niu, Li and Zhai, Guangtao},
  journal={arXiv preprint arXiv:2510.01982},
  year={2025}
}
```
## Acknowledgement
Our code is built upon the repositories below; we thank all the contributors for open-sourcing their work.