initial readme
README.md
# Skywork-SWE

![Skywork-SWE](https://github.com/SkyworkAI/Skywork-SWE/blob/main/assets/skywork_logo.jpeg?raw=true)

📖 [Report]() 📰 [Blog](https://quixotic-sting-239.notion.site/eb17f379610040ceb54da5d5d24065bd)

## Model Introduction
***Skywork-SWE-32B*** is a code agent model developed by [Skywork.AI](https://skywork.ai/home), specifically designed for software engineering (SWE) tasks. It achieves state-of-the-art performance across several key metrics:
- Skywork-SWE-32B attains 38.0% pass@1 accuracy on the [SWE-bench Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified) benchmark, outperforming the previous open-source SoTA [Qwen2.5-Coder-32B-based](https://huggingface.co/Qwen/Qwen2.5-Coder-32B) LLMs built on the [OpenHands](https://www.all-hands.dev/) agent framework.
- When combined with test-time scaling techniques, performance further improves to 47.0% pass@1 accuracy, surpassing the previous SoTA results for sub-32B parameter models.
- We clearly demonstrate the data scaling law phenomenon for software engineering capabilities in LLMs, with no signs of saturation at 8,209 collected training trajectories.

We also introduce an efficient and automated pipeline for SWE data collection, culminating in the creation of the Skywork-SWE dataset: a large-scale, high-quality dataset featuring comprehensive executable runtime environments. Detailed descriptions are available on [arXiv](https://xxx).
### 🔧 Model Details

| Model Name | Backbone LLM | HuggingFace Link | Technical Report | Blog |
|---|---|---|---|---|
| Skywork-SWE-32B | [🤗 Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) | [🤗 Skywork-SWE-32B](https://huggingface.co/Skywork/Skywork-SWE-32B) | [arXiv]() | [blog]() |

## Evaluation


Data Scaling Law for Pass@1 Accuracy on Qwen2.5-Coder-32B-Based LLMs Using the OpenHands v0.32.0 Code Agent Framework. Skywork-SWE-32B establishes a new state-of-the-art (SoTA) among Qwen2.5-Coder-32B-based LLMs, achieving the highest pass@1 accuracy without using verifiers or multiple rollouts.



With the incorporation of test-time scaling techniques, Skywork-SWE-32B further improves to 47.0% pass@1 accuracy, surpassing the previous SoTA results for sub-32B parameter models.
## Performance Summary
- Skywork-SWE-32B:
```
Submission summary on SWE-bench verified split
==================================================
Resolved 190 instances (38.0%)
==================================================
Resolved by Repository
- astropy/astropy: 4/22 (18.18%)
- django/django: 99/231 (42.86%)
- matplotlib/matplotlib: 9/34 (26.47%)
- mwaskom/seaborn: 0/2 (0.0%)
- pallets/flask: 1/1 (100.0%)
- psf/requests: 4/8 (50.0%)
- pydata/xarray: 7/22 (31.82%)
- pylint-dev/pylint: 2/10 (20.0%)
- pytest-dev/pytest: 9/19 (47.37%)
- scikit-learn/scikit-learn: 17/32 (53.12%)
- sphinx-doc/sphinx: 13/44 (29.55%)
- sympy/sympy: 25/75 (33.33%)
==================================================
Resolved by Time
- 2013: 2/3 (66.67%)
- 2014: 2/2 (100.0%)
- 2015: 0/1 (0.0%)
- 2016: 2/2 (100.0%)
- 2017: 5/16 (31.25%)
- 2018: 7/24 (29.17%)
- 2019: 46/98 (46.94%)
- 2020: 43/108 (39.81%)
- 2021: 27/86 (31.4%)
- 2022: 35/102 (34.31%)
- 2023: 21/58 (36.21%)
```
- Skywork-SWE-32B + TTS:
```
Submission summary on SWE-bench verified split
==================================================
Resolved 235 instances (47.0%)
==================================================
Resolved by Repository
- astropy/astropy: 8/22 (36.36%)
- django/django: 115/231 (49.78%)
- matplotlib/matplotlib: 15/34 (44.12%)
- mwaskom/seaborn: 0/2 (0.0%)
- pallets/flask: 1/1 (100.0%)
- psf/requests: 3/8 (37.5%)
- pydata/xarray: 14/22 (63.64%)
- pylint-dev/pylint: 4/10 (40.0%)
- pytest-dev/pytest: 10/19 (52.63%)
- scikit-learn/scikit-learn: 22/32 (68.75%)
- sphinx-doc/sphinx: 12/44 (27.27%)
- sympy/sympy: 31/75 (41.33%)
==================================================
Resolved by Time
- 2013: 1/3 (33.33%)
- 2014: 1/2 (50.0%)
- 2015: 0/1 (0.0%)
- 2016: 2/2 (100.0%)
- 2017: 6/16 (37.5%)
- 2018: 9/24 (37.5%)
- 2019: 52/98 (53.06%)
- 2020: 48/108 (44.44%)
- 2021: 40/86 (46.51%)
- 2022: 46/102 (45.1%)
- 2023: 30/58 (51.72%)
```
## Usage
### Launch a server to deploy Skywork-SWE-32B
You can serve the model using vLLM or SGLang. Since our model has 32 billion parameters and supports a 32K context length, we recommend launching the model server with at least 2 GPUs equipped with sufficient VRAM to ensure efficient inference.
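
For example, a minimal vLLM launch on two GPUs might look like the sketch below (the local model path is a placeholder, and the port and API key are chosen to match the example config file further down; adjust the flags to your vLLM version and hardware):
```
# Serve the model through vLLM's OpenAI-compatible API on 2 GPUs.
# /path/to/Skywork-SWE-32B is a placeholder for your local checkpoint path.
vllm serve /path/to/Skywork-SWE-32B \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000 \
  --api-key vllm
```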
### Set up OpenHands framework
```
git clone https://github.com/All-Hands-AI/OpenHands.git
cd OpenHands
git checkout tags/0.32.0
make build
```
See the official documentation for more details: [SWE-Bench Evaluation with OpenHands SWE-Bench Docker Image](https://github.com/All-Hands-AI/OpenHands/tree/main/evaluation/benchmarks/swe_bench)

### Create the corresponding config file:
```
[core]
workspace_base="./workspace"

[llm.my-oss-model]
model = "openai//path/to/Skywork-SWE"
base_url = "http://0.0.0.0:8000/v1"
api_key="vllm"
max_message_chars=32768
max_input_tokens=32768
max_output_tokens=8192
log_completions=true
temperature=0.0
```
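The `base_url` and `api_key` values should match the model server launched in the previous step, and the path after the `openai/` prefix in `model` should be the name under which the server exposes the model.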

If you want to run the OpenHands agent with test-time scaling techniques (a Best-of-N method based on the critic model), please refer to the [blog](https://www.all-hands.dev/blog/sota-on-swe-bench-verified-with-inference-time-scaling-and-critic-model) for detailed instructions. You will need to switch to the [feature/llm-critic](https://github.com/All-Hands-AI/OpenHands/tree/feature/llm-critic) branch and deploy the [critic model](https://huggingface.co/all-hands/openhands-critic-32b-exp-20250417) accordingly. Additionally, you need to add the following parameters to the configuration file:
```
use_critic=true
critic_model="critic_model"
critic_base_url="**********"
critic_api_key="************"
critic_num_candidates=2
```
### Rollout on SWE-Bench Instances
```
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.my-oss-model HEAD CodeActAgent 500 100 1 \
  princeton-nlp/SWE-bench_Verified test
```
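In the example, `llm.my-oss-model` is the LLM config section defined above, `HEAD` is the git version, `CodeActAgent` is the agent, `500` is the evaluation limit, `100` is the maximum number of iterations, `1` is the number of workers, and `princeton-nlp/SWE-bench_Verified` with the `test` split is the evaluation dataset.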
### Evaluate generated patches
```
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh \
  ./evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent/my-oss-model_maxiter_100_N_v0.32.0-no-hint-run_1/output.jsonl
```
## Acknowledgements
We would like to thank the contributors of the [OpenHands](https://www.all-hands.dev/) and [AllHands Critic](https://huggingface.co/all-hands/openhands-critic-32b-exp-20250417) repositories for their open research and valuable contributions.
## Citation
If you use Skywork-SWE in your research, please consider citing our work using the following BibTeX entry:
```
```