initial commit
README.md
CHANGED
@@ -194,7 +194,7 @@ We extended the [TinyZero](https://github.com/Jiayi-Pan/TinyZero) code repositor
@@ -202,9 +202,40 @@ CodeFu employs a 2-stage curriculum learning approach:
CodeFu employs a 2-stage curriculum learning approach:

| Stage | Data | Max response tokens | Batch size | Mini-batch size | # of rollouts | Reward | Focus | # of nodes |
|-------|------|-------|------------|--------------|---------|--------|-------|------------|
| **1** | *easy* problems | 28K | 8 | 8 | 5 | Exponential smoothing with public scores | Basic algorithmic reasoning | 1 |
| **2** | *hard* problems | 20K | 256 | 32 | 8 | Linear without public scores | Quality and robustness | 4 |

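For readers who want to script against these settings, the table can be captured as a small configuration object. This is only a sketch: the `StageConfig` class and its field names are hypothetical and are not taken from the CodeFu or TinyZero code. The `responses_per_step` property further assumes that the batch size counts prompts and that each prompt is sampled the listed number of times.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """One row of the curriculum table (hypothetical field names)."""

    stage: int
    data: str
    max_response_tokens_k: int  # "28K" in the table is stored as 28
    batch_size: int
    mini_batch_size: int
    rollouts: int
    reward: str
    num_nodes: int

    @property
    def responses_per_step(self) -> int:
        # Assumption: batch size counts prompts, each sampled `rollouts` times.
        return self.batch_size * self.rollouts


CURRICULUM = (
    StageConfig(1, "easy", 28, 8, 8, 5, "exp_smooth_with_public", 1),
    StageConfig(2, "hard", 20, 256, 32, 8, "linear_without_public", 4),
)

for cfg in CURRICULUM:
    print(f"stage {cfg.stage}: {cfg.responses_per_step} responses per step")
```

Under that assumption, Stage 2 generates far more responses per step than Stage 1, which is consistent with the move from 1 node to 4 nodes.
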

### Reward ###

**Stage 1 (Exponential smoothing with public scores)** - Scores each solution on both public and private test cases and raises the pass ratio to the power 1.5, giving clearer learning signals on easier problems.

**Stage 2 (Linear without public scores)** - Switches to a linear reward, the fraction of private test cases passed, to encourage robust problem-solving on harder problems.

Here is the pseudocode for the reward calculation across both training stages. The helper functions (`is_executable`, `compilation_failed`, `exceeds_time_limit`, `count_passed`) are placeholders:
```python
def compute_reward(code_output, public_tests, private_tests, stage):
    # Handle execution failures (same for both stages)
    if not is_executable(code_output):
        return -1

    if compilation_failed(code_output) or exceeds_time_limit(code_output):
        return 0

    # Stage-specific reward calculation for successful execution
    if stage == 1:
        # Exponential smoothing with public + private tests
        passed_public = count_passed(code_output, public_tests)
        passed_private = count_passed(code_output, private_tests)
        total_tests = len(public_tests) + len(private_tests)
        passed_tests = passed_public + passed_private

        pass_ratio = passed_tests / total_tests
        reward = pass_ratio ** 1.5

    elif stage == 2:
        # Linear reward with private tests only
        passed_private = count_passed(code_output, private_tests)
        total_private = len(private_tests)

        reward = passed_private / total_private

    else:
        # Only stages 1 and 2 are defined
        raise ValueError(f"unknown stage: {stage}")

    return reward
```

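To make the difference between the two reward shapes concrete, the following runnable sketch applies the same formulas to plain pass counts. The function name `reward_from_counts` and its parameters are illustrative assumptions, not part of the CodeFu code. The execution-failure branches of the pseudocode are left out because they depend on the sandbox.

```python
def reward_from_counts(passed_public, total_public,
                       passed_private, total_private,
                       stage, exponent=1.5):
    """Reward for a solution that executed, given test pass counts.

    Hypothetical helper: mirrors the stage-specific branches of the
    pseudocode, with explicit guards for empty test sets.
    """
    if stage == 1:
        # Stage 1: pass ratio over public + private tests, raised to a power.
        total = total_public + total_private
        if total == 0:
            raise ValueError("stage 1 needs at least one test case")
        return ((passed_public + passed_private) / total) ** exponent
    if stage == 2:
        # Stage 2: linear in the private pass ratio; public tests are ignored.
        if total_private == 0:
            raise ValueError("stage 2 needs at least one private test case")
        return passed_private / total_private
    raise ValueError(f"unknown stage: {stage}")


# A solution passing 1 of 2 public tests and 3 of 6 private tests.
r1 = reward_from_counts(1, 2, 3, 6, stage=1)
r2 = reward_from_counts(1, 2, 3, 6, stage=2)
print(f"stage 1: {r1:.3f}, stage 2: {r2:.3f}")
```

With a 50% pass ratio, the Stage 1 formula yields about 0.354 while the Stage 2 formula yields 0.5, so the exponent discounts partial passes relative to the linear reward. Both formulas give 1.0 for a solution that passes every test.
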

### Data Selection ###

Training data is sourced from CodeForces problems within the [DeepMind CodeContests](https://huggingface.co/datasets/deepmind/code_contests) dataset, chosen for their reliable CF rating system. Easy problems (CF rating 800-1000) are used in Stage 1 for basic algorithmic reasoning, while relatively hard problems (CF rating 1100-2200) are used in Stage 2 for intermediate to advanced challenges. The model was trained for approximately 2 epochs on each of the *easy* and *hard* datasets.
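
The rating-based split described above can be sketched as a simple bucketing step. This is an illustration under stated assumptions: each problem is taken to be a dict with a `cf_rating` field, the rating bands are treated as inclusive at both ends, and the sample records are made up. Filtering the dataset down to CodeForces problems is not shown.

```python
from typing import Dict, List, Optional

# Rating bands quoted in the text (assumed inclusive on both ends).
EASY_RANGE = (800, 1000)
HARD_RANGE = (1100, 2200)


def assign_stage(cf_rating: int) -> Optional[int]:
    """Return 1 for easy, 2 for hard, or None if the rating is in neither band."""
    if EASY_RANGE[0] <= cf_rating <= EASY_RANGE[1]:
        return 1
    if HARD_RANGE[0] <= cf_rating <= HARD_RANGE[1]:
        return 2
    return None


def split_by_rating(problems: List[dict]) -> Dict[int, List[dict]]:
    """Group problems by stage, dropping those outside both bands."""
    buckets: Dict[int, List[dict]] = {1: [], 2: []}
    for problem in problems:
        stage = assign_stage(problem["cf_rating"])  # assumed field name
        if stage is not None:
            buckets[stage].append(problem)
    return buckets


# Made-up records for illustration only.
sample = [
    {"name": "A", "cf_rating": 800},
    {"name": "B", "cf_rating": 1050},
    {"name": "C", "cf_rating": 1700},
    {"name": "D", "cf_rating": 2400},
]
buckets = split_by_rating(sample)
print([p["name"] for p in buckets[1]], [p["name"] for p in buckets[2]])
```

Problems rated between the two bands (for example 1050) or above 2200 fall into neither bucket in this sketch.
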