---
base_model: Qwen/Qwen2.5-0.5B
datasets: trl-lib/math_shepherd
library_name: transformers
model_name: Qwen2.5-0.5B-Math-Shepherd-PRM-0.2
tags:
- generated_from_trainer
- trl
- stepwise-reward-trainer
licence: license
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Model Card for Qwen2.5-0.5B-Math-Shepherd-PRM-0.2

This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) on the [trl-lib/math_shepherd](https://huggingface.co/datasets/trl-lib/math_shepherd) dataset.
It has been trained using [TRL](https://github.com/huggingface/trl).

## Quick start

Example 1)

```python
import os

# Set before importing the Hugging Face libraries so accelerated downloads take effect.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from datasets import load_dataset
from transformers import pipeline

model_name = "plaguss/Qwen2.5-0.5B-Math-Shepherd-PRM-0.2"

pipe = pipeline("token-classification", model=model_name, device="cuda")
dataset = load_dataset("trl-lib/math_shepherd")
example = dataset["test"][10]

sep = "\n"

print(sep.join((example["prompt"], *example["completions"])))
for idx in range(1, len(example["completions"]) + 1):
    text = sep.join((example["prompt"], *example["completions"][:idx])) + sep
    output = pipe(text)
    score = float(output[-1]["score"])
    pred = output[-1]["entity"] == "LABEL_1"
    print(f"Step {idx}\tPredicted (score): {pred} ({score:.2f})\tLabel: {example['labels'][idx-1]}")

# Grandma gave Bryce and Carter some raisins. Bryce received 6 more raisins than Carter, and Carter received half the number of raisins Bryce received. How many raisins did Bryce receive?
# Step 1: Let $b$ be the number of raisins Bryce received and $c$ be the number of raisins Carter received.
# Step 2: We are given that $b = c + 6$ and $c = \frac{1}{2}b$.
# Step 3: Substituting the second equation into the first equation, we get $b = c + 6 = \frac{1}{2}b + 6$.
# Step 4: Simplifying, we have $b = \frac{1}{2}b + 6$.
# Step 5: Subtracting $\frac{1}{2}b$ from both sides, we get $\frac{1}{2}b - b = 6$.
# Step 6: Simplifying further, we have $\frac{1}{2}b - 2b = 6$.
# Step 7: Combining like terms, we have $-\frac{1}{2}b = 6$.
# Step 8: Multiplying both sides by $-2$, we get $b = -12$.
# Step 9: Therefore, Bryce received $\boxed{-12}$ raisins.The answer is: -12
# Step 1 Predicted (score): True (0.99) Label: True
# Step 2 Predicted (score): True (0.99) Label: True
# Step 3 Predicted (score): True (0.94) Label: True
# Step 4 Predicted (score): True (0.82) Label: True
# Step 5 Predicted (score): True (0.58) Label: True
# Step 6 Predicted (score): False (0.62) Label: False
# Step 7 Predicted (score): False (0.77) Label: False
# Step 8 Predicted (score): False (0.91) Label: False
# Step 9 Predicted (score): False (0.97) Label: False
```
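
Note that in the pipeline output, `score` is the probability of whichever label was predicted, so for a `LABEL_0` (incorrect-step) prediction it is not directly the probability that the step is correct. A small hypothetical helper (`step_correct_prob` is not part of the pipeline API, just a sketch) normalizes a pipeline output entry into P(step is correct):

```python
def step_correct_prob(entry: dict) -> float:
    """Convert a token-classification output entry into the probability
    that the step is correct (LABEL_1), regardless of which label won."""
    score = float(entry["score"])
    return score if entry["entity"] == "LABEL_1" else 1.0 - score


# Entries shaped like the pipeline's per-token output dicts:
p_good = step_correct_prob({"entity": "LABEL_1", "score": 0.99})  # 0.99
p_bad = step_correct_prob({"entity": "LABEL_0", "score": 0.62})   # ~0.38
```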

Example 2)

```python
import os

# Set before importing the Hugging Face libraries so accelerated downloads take effect.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from datasets import load_dataset
from transformers import pipeline

model_name = "plaguss/Qwen2.5-0.5B-Math-Shepherd-PRM-0.2"

pipe = pipeline("token-classification", model=model_name, device="cuda")
dataset = load_dataset("trl-lib/math_shepherd")
example = dataset["test"][32]

sep = "\n"

print(sep.join((example["prompt"], *example["completions"])))
for idx in range(1, len(example["completions"]) + 1):
    text = sep.join((example["prompt"], *example["completions"][:idx])) + sep
    output = pipe(text)
    score = float(output[-1]["score"])
    pred = output[-1]["entity"] == "LABEL_1"
    print(f"Step {idx}\tPredicted (score): {pred} ({score:.2f})\tLabel: {example['labels'][idx-1]}")

# In the Golden State Team, each player earned points. Draymond earned 12 points, Curry earned twice the points as Draymond, Kelly earned 9, Durant earned twice the points as Kelly, Klay earned half the points as Draymond. How many points did the Golden States have in total?
# Step 1: Draymond earned 12 points, Curry earned twice the points as Draymond, which is 2*12 = 24 points.
# Step 2: Kelly earned 9 points, Durant earned twice the points as Kelly, which is 2*9 = 18 points.
# Step 3: Klay earned half the points as Draymond, which is 12/2 = <<12/2=6>>6 points.
# Step 4: The Golden State Team had 12+24+9+18+6 = <<12+24+9+18+6=51>>51 points. The answer is: 51
# Step 1 Predicted (score): True (1.00) Label: True
# Step 2 Predicted (score): True (1.00) Label: True
# Step 3 Predicted (score): True (1.00) Label: True
# Step 4 Predicted (score): False (0.96) Label: False
```
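
To score a whole solution rather than individual steps, per-step correctness probabilities are commonly aggregated, e.g. by taking their minimum or their product (both conventions appear in the process-reward literature; which works better is an empirical question). A minimal sketch, assuming the per-step probabilities have already been extracted as above:

```python
import math


def solution_score(step_probs: list[float], how: str = "min") -> float:
    """Aggregate per-step correctness probabilities into one solution score.

    "min"  - the solution is only as good as its weakest step.
    "prod" - probability that every step is correct, assuming independence.
    """
    if not step_probs:
        raise ValueError("need at least one step probability")
    if how == "min":
        return min(step_probs)
    if how == "prod":
        return math.prod(step_probs)
    raise ValueError(f"unknown aggregation: {how}")


# P(correct) per step from Example 2 above: the last step drags the score down.
probs = [1.00, 1.00, 1.00, 0.04]
low = solution_score(probs, "min")    # 0.04
joint = solution_score(probs, "prod")  # 0.04
```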

Example 3)

This example corresponds to the one shown in the [peiyi9979/math-shepherd-prm-7b](https://huggingface.co/peiyi9979/math-shepherd-mistral-7b-prm) model card:

```python
import os

# Set before importing the Hugging Face libraries so accelerated downloads take effect.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from transformers import pipeline

model_name = "plaguss/Qwen2.5-0.5B-Math-Shepherd-PRM-0.2"

pipe = pipeline("token-classification", model=model_name, device="cuda")

examples = [
    {
        "prompt": "Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        "completions": [
            "Step 1: Janet's ducks lay 16 eggs per day.",
            "Step 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left.",
            "Step 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left.",
            "Step 4: She sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $18 every day at the farmers' market. The answer is: 18",
        ],
        "labels": [True, True, True, True],
    },
    {
        "prompt": "Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        "completions": [
            "Step 1: Janet's ducks lay 16 eggs per day.",
            "Step 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left.",
            "Step 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left.",
            "Step 4: She sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $18 every day at the farmers' market. The answer is: 17",
        ],
        "labels": [True, True, True, False],
    },
]

sep = "\n"

for i, example in enumerate(examples):
    print(f"- Example {i}:")
    for idx in range(1, len(example["completions"]) + 1):
        text = sep.join((example["prompt"], *example["completions"][:idx])) + sep
        output = pipe(text)
        score = float(output[-1]["score"])
        pred = output[-1]["entity"] == "LABEL_1"
        print(f"Step {idx}\tPredicted (score): {pred} ({score:.2f})\tLabel: {example['labels'][idx-1]}")

# - Example 0:
# Step 1 Predicted (score): True (0.90) Label: True
# Step 2 Predicted (score): False (0.55) Label: True
# Step 3 Predicted (score): False (0.62) Label: True
# Step 4 Predicted (score): False (0.90) Label: True
# - Example 1:
# Step 1 Predicted (score): True (0.90) Label: True
# Step 2 Predicted (score): False (0.55) Label: True
# Step 3 Predicted (score): False (0.62) Label: True
# Step 4 Predicted (score): False (0.96) Label: False
```
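
A common use for a PRM like this one is best-of-N reranking: score each candidate solution step by step, aggregate the per-step correctness probabilities, and keep the best candidate. A minimal sketch with hypothetical names (`rerank` is illustrative, not part of any library), using minimum aggregation on probabilities shaped like the Example 3 outputs:

```python
def rerank(candidates: list[list[float]]) -> int:
    """Return the index of the candidate whose minimum per-step
    correctness probability is highest (min aggregation)."""
    scores = [min(steps) for steps in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# P(correct) per step for the two Example 3 candidates: the model is unsure
# about the middle steps, but still separates the two final answers.
candidates = [
    [0.90, 0.45, 0.38, 0.10],  # answer 18 (correct)
    [0.90, 0.45, 0.38, 0.04],  # answer 17 (wrong)
]
best = rerank(candidates)  # 0
```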

## Training procedure

[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/plaguss/huggingface/runs/obk416rg)

This model was trained with the Stepwise Reward method.

### Framework versions

- TRL: 0.13.0.dev0
- Transformers: 4.47.0
- PyTorch: 2.4.1
- Datasets: 3.0.1
- Tokenizers: 0.21.0

## Citations

Cite Stepwise Reward as:

```bibtex
@article{uesato2022solving,
    title        = {Solving Math Word Problems With Process- and Outcome-Based Feedback},
    author       = {Uesato, Jonathan and Kushman, Nate and Kumar, Ramana and Song, Francis and Siegel, Noah and Wang, Lisa and Creswell, Antonia and Irving, Geoffrey and Higgins, Irina},
    year         = 2022,
    journal      = {arXiv preprint arXiv:2211.14275}
}
```

Cite TRL as:

```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```