Training details?
Would love to have more details on how you did the POLAR training part!
Basically just https://github.com/RowitZou/POLAR_RFT/blob/main/examples/ppo/qwen3-8b_hh-rlhf.sh on 8x H100, with the script edited to train on 7 GPUs while an lmdeploy server hosting POLAR-7B ran on the 8th GPU. I did 2 epochs on a subset of my roleplay data, then trained for another epoch with the rollout temperature set to 1.5 and min_p 0.01 (verl doesn't support setting min_p, so I had to add support for it myself; it was like a 2-line code change IIRC). Rough launch sketch below.
(That was after 2 epochs of SFT on the full data with Axolotl.)
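For anyone trying to reproduce it, here's a hand-written sketch of how the pieces fit together, not the actual edited script: the lmdeploy invocation, the checkpoint path, and the verl override keys are my assumptions from the upstream defaults, so double-check them against the repo.

```bash
# POLAR-7B reward server pinned to the 8th GPU (index 7)
CUDA_VISIBLE_DEVICES=7 lmdeploy serve api_server internlm/POLAR-7B \
    --server-port 30000 &

# PPO training on the remaining 7 GPUs (verl hydra-style overrides;
# the script also points the trainer at the reward server's URL -- key omitted here)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=/path/to/my-sft-checkpoint \
    actor_rollout_ref.rollout.temperature=1.0 \
    trainer.n_gpus_per_node=7 \
    trainer.total_epochs=2

# The extra epoch was a re-run with actor_rollout_ref.rollout.temperature=1.5
# plus min_p=0.01. min_p isn't a stock verl sampling option, so the patch just
# threads a min_p config value through to the rollout engine's SamplingParams
# (that's the "2-line change").
```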
Thanks, that's very helpful. I was a bit shocked that a 12B model required 8 H100s to train. I guess POLAR is extremely resource-demanding, huh?
Online RL in general is extremely demanding: you already need two instances of the model in memory (the trainable policy plus the rollout engine's copy) with all their associated costs, on top of gradients, optimizer states, and a frozen reference model for the KL penalty.
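For intuition on where the memory goes, some back-of-the-envelope arithmetic (round numbers of my own, not measurements from this run):

```bash
# Rough VRAM budget for a 12B-param policy trained with PPO (very rough)
P=12                                     # parameters, in billions
echo "policy weights (bf16):       $((P * 2)) GB"
echo "gradients (bf16):            $((P * 2)) GB"
echo "AdamW m+v states (fp32):     $((P * 8)) GB"
echo "frozen KL reference (bf16):  $((P * 2)) GB"
echo "rollout engine copy (bf16):  $((P * 2)) GB  (plus KV cache)"
echo "total:                       $((P * 16)) GB  before activations"
```

That's already a big chunk of 8x H100 (640 GB) before activations, KV cache, and the reward server, even though sharding and offload strategies shift the numbers around a lot in practice.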