initial commit
README.md CHANGED
@@ -253,3 +253,13 @@ CodeFu is developed by the **AWS WWSO Prototyping** Team. If you find CodeFu hel
 }
 ```
 
+## References
+[1] - Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. (https://arxiv.org/pdf/1707.06347.pdf)
+
+[2] - Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., ... & Wang, M. (2025). DAPO: An open-source llm reinforcement learning system at scale.
+
+[3] - Hao, Y., Dong, L., Wu, X., Huang, S., Chi, Z., & Wei, F. (2025). On-Policy RL with Optimal Reward Baseline.
+
+[4] - Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., ... & Lin, M. Understanding r1-zero-like training: A critical perspective.
+
+[5] - Zheng, C., Liu, S., Li, M., Chen, X. H., Yu, B., Gao, C., ... & Lin, J. (2025). Group Sequence Policy Optimization.