Update README.md
README.md CHANGED
@@ -1,12 +1,15 @@
 ---
 license: llama3
 ---
+
+* **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (Published in TMLR, 2024)
+* **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
+* **Code**: https://github.com/RLHFlow/RLHF-Reward-Modeling/
+
 This preference model is trained from [LLaMA3-8B-it](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with the training script at [Reward Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/pm_dev/pair-pm).
 
 The dataset is RLHFlow/pair_preference_model_dataset. The model achieves Chat 98.6, Chat-Hard 65.8, Safety 89.6, and Reasoning 94.9 on RewardBench.
 
-See our paper [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/abs/2405.07863) for more details of this model.
-
 ## Serve the RM
 
 Here is an example of using the preference model to rank a pair of responses. For n > 2 responses, it is recommended to use a tournament-style ranking strategy to select the best response, so that the number of pairwise comparisons is linear in n.
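The diff cuts off before the usage example itself, so below is a minimal sketch of how a pairwise preference model like this one can be queried with `transformers`. The model id, the `[CONTEXT] ... [RESPONSE A] ... [RESPONSE B]` prompt template, and the comparison of the "A"/"B" next-token logits are assumptions based on the RLHFlow pair-pm setup, not the verbatim model-card code; consult the linked repository for the canonical prompt construction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model id assumed for illustration; replace with the actual repository id.
MODEL_NAME = "RLHFlow/pair-preference-model-LLaMA3-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Hypothetical prompt template: the context and the two candidate responses are
# packed into one prompt and the model is asked to emit "A" or "B".
PROMPT_TEMPLATE = (
    "[CONTEXT] {context} [RESPONSE A] {response_a} [RESPONSE B] {response_b} \n"
)
TOKEN_ID_A = tokenizer.encode("A", add_special_tokens=False)[0]
TOKEN_ID_B = tokenizer.encode("B", add_special_tokens=False)[0]


@torch.no_grad()
def prefer_a(context: str, response_a: str, response_b: str) -> bool:
    """Return True if the preference model favors response A over response B."""
    prompt = PROMPT_TEMPLATE.format(
        context=context, response_a=response_a, response_b=response_b
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]  # next-token logits
    return logits[TOKEN_ID_A].item() > logits[TOKEN_ID_B].item()


# Rank a single pair of candidate answers.
context = "What is the capital of France?"
print("A preferred" if prefer_a(context, "Paris.", "I am not sure.") else "B preferred")
```

Since the order in which the two responses appear in the prompt can bias the judgment, a common refinement is to score both orderings (A, B) and (B, A) and combine the results.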


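For n > 2 candidates, a sequential knockout is one simple tournament-style strategy that keeps the number of preference calls linear in n (n - 1 comparisons rather than all n(n - 1)/2 pairs). The sketch below is generic: `prefers_first` is a hypothetical stand-in for any pairwise judge, such as the `prefer_a` helper sketched above.

```python
from typing import Callable, Sequence


def tournament_best(
    context: str,
    responses: Sequence[str],
    prefers_first: Callable[[str, str, str], bool],
) -> str:
    """Pick the best of n responses with n - 1 pairwise preference calls.

    prefers_first(context, a, b) should return True when the preference
    model prefers response a over response b.
    """
    best = responses[0]
    for challenger in responses[1:]:
        # The running winner plays each remaining candidate once.
        if not prefers_first(context, best, challenger):
            best = challenger
    return best
```

Usage would look like `tournament_best(context, candidates, prefer_a)`, reusing the pairwise judge from the previous sketch.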