# Multi-SWE-Bench Evaluation with OpenHands

## LLM Setup
Please follow the instructions here to set up your LLM.
## Dataset Preparation

Please download the Multi-SWE-Bench dataset and convert it with the following script:
```bash
python evaluation/benchmarks/multi_swe_bench/scripts/data/data_change.py
```
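To verify the conversion succeeded, you can pretty-print the first record of the resulting JSONL file (the path below is a placeholder for wherever the script writes its output):

```bash
# Placeholder path; point it at the JSONL produced by data_change.py.
head -n 1 /path/to/converted_multi_swe_bench.jsonl | python -m json.tool
```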
## Docker Image Download

Please download the Multi-SWE-Bench Docker images from here.
## Generate Patches

Please edit the script as needed and run it:
```bash
bash evaluation/benchmarks/multi_swe_bench/infer.sh
```
Script variable explanation (an illustrative set of values follows the list):
- `models`, e.g. `llm.eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It can also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for the benchmark, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates all 500 issues, which does not exceed the size of the dataset.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By default, it is set to 50.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By default, it is set to 1.
- `language`, the language of the dataset you are evaluating.
- `dataset`, the absolute path of the dataset JSONL file.
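As a sketch, the variables might be set like this inside `infer.sh`; all values are illustrative, and the actual variable names in the script may differ:

```bash
# Illustrative values only; adjust to your own setup. The variable names
# mirror the explanations above and may not match the script exactly.
MODELS="llm.eval_gpt4_1106_preview"   # LLM config group from config.toml
GIT_VERSION="HEAD"                    # OpenHands commit hash or release tag
AGENT="CodeActAgent"                  # agent used for the benchmark
EVAL_LIMIT=10                         # evaluate only the first 10 instances
MAX_ITER=20                           # maximum agent iterations per instance
NUM_WORKERS=3                         # number of parallel workers
LANGUAGE="java"                       # language split of the dataset
DATASET="/abs/path/to/multi_swe_bench_java.jsonl"  # absolute path to the dataset JSONL
```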
The results will be generated in `evaluation/evaluation_outputs/outputs/XXX/CodeActAgent/YYY/output.jsonl`; you can refer to the example.
## Running Evaluation

First, install multi-swe-bench:
```bash
pip install multi-swe-bench
```
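You can confirm the package is importable before running the evaluation; the module name `multi_swe_bench` matches the harness invocation used in the final step below:

```bash
# Verify the installed package can be imported.
python -c "import multi_swe_bench; print(multi_swe_bench.__name__)"
```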
Second, convert `output.jsonl` to `patch.jsonl` with the following script; you can refer to the example:
```bash
python evaluation/benchmarks/multi_swe_bench/scripts/eval/convert.py
```
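As a rough sanity check (paths are placeholders, and instances whose agent produced no patch may not carry over), the number of records in `patch.jsonl` should roughly match the number of records in `output.jsonl`:

```bash
# Placeholder paths; each line in these JSONL files is one instance.
wc -l evaluation/evaluation_outputs/outputs/XXX/CodeActAgent/YYY/output.jsonl patch.jsonl
```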
Finally, evaluate with multi-swe-bench. For the config file `config.json`, you can refer to the example or to GitHub:
```bash
python -m multi_swe_bench.harness.run_evaluation --config config.json
```