---
title: SWE-Model-Arena
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
hf_oauth: true
pinned: false
short_description: Chatbot arena for software engineering tasks
---
# SWE-Model-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering
Welcome to **SWE-Model-Arena**, an open-source platform for evaluating foundation models (FMs), particularly large language models (LLMs), on software engineering tasks. SWE-Model-Arena benchmarks models in the iterative, context-rich workflows characteristic of software engineering (SE).
## Key Features
- **Multi-Round Conversational Workflows**: Evaluate models through extended, context-dependent interactions that mirror real-world SE processes.
- **RepoChat Integration**: Automatically inject repository context (issues, commits, PRs) into conversations for more realistic evaluations.
- **Advanced Evaluation Metrics**: Assess models using a comprehensive suite of metrics (see the metric sketch after this list), including:
- Traditional metrics: Elo score and average win rate
- Network-based metrics: Eigenvector centrality, PageRank score
- Community detection: Newman modularity score
- Consistency score: Quantify model determinism and reliability through self-play matches
- **Transparent, Open-Source Leaderboard**: View real-time model rankings across diverse SE workflows with full transparency.
- **Intelligent Request Filtering**: Employ `GPT-OSS-20B` as a guardrail to automatically filter out non-software-engineering requests, keeping evaluations focused and relevant (a guardrail sketch also follows this list).
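
The sketch below illustrates how pairwise votes could be turned into these leaderboard metrics using `networkx`. The vote format, the initial Elo rating of 1000, and the K-factor of 32 are illustrative assumptions, not the platform's exact implementation.

```python
from collections import Counter

import networkx as nx
from networkx.algorithms import community

# Hypothetical vote log: (winner, loser) pairs from head-to-head matches.
# Self-play matches (same model on both sides) would instead feed the
# consistency score, which is omitted here.
votes = [
    ("model_a", "model_b"),
    ("model_b", "model_c"),
    ("model_a", "model_c"),
    ("model_c", "model_b"),
]

# --- Traditional metrics: Elo rating and average win rate ---
elo, K = {}, 32  # initial rating of 1000 and K=32 are assumptions
for winner, loser in votes:
    rw, rl = elo.get(winner, 1000.0), elo.get(loser, 1000.0)
    expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
    elo[winner] = rw + K * (1.0 - expected_w)
    elo[loser] = rl - K * (1.0 - expected_w)

wins, games = Counter(), Counter()
for winner, loser in votes:
    wins[winner] += 1
    games[winner] += 1
    games[loser] += 1
win_rate = {model: wins[model] / games[model] for model in games}

# --- Network-based metrics on a directed "lost-to" graph ---
# An edge loser -> winner makes centrality flow toward stronger models.
G = nx.DiGraph()
for winner, loser in votes:
    G.add_edge(loser, winner)

pagerank = nx.pagerank(G)
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)

# --- Community detection: Newman modularity on the undirected graph ---
partition = community.greedy_modularity_communities(G.to_undirected())
modularity = community.modularity(G.to_undirected(), partition)

print(elo, win_rate, pagerank, eigenvector, modularity)
```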
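
For the guardrail, here is a minimal sketch assuming `GPT-OSS-20B` is reachable through the Hugging Face Inference API via `huggingface_hub`; the classification prompt is an illustrative assumption.

```python
from huggingface_hub import InferenceClient

# Assumes the openai/gpt-oss-20b checkpoint is served via the HF Inference API.
client = InferenceClient("openai/gpt-oss-20b")

def is_se_request(user_prompt: str) -> bool:
    """Ask the guardrail model whether a request is a software engineering task."""
    response = client.chat_completion(
        messages=[
            {
                "role": "system",
                "content": "Answer YES or NO: is the user's request a software engineering task?",
            },
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

# Requests classified NO are rejected before any arena models are invoked.
```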
## Why SWE-Model-Arena?
Existing evaluation frameworks (e.g., [LMArena](https://lmarena.ai)) often fail to address the complex, iterative nature of SE tasks. SWE-Model-Arena fills critical gaps by:
- Supporting context-rich, multi-turn evaluations to capture iterative workflows
- Integrating repository-level context through RepoChat to simulate real-world development scenarios (see the sketch after this list)
- Providing multidimensional metrics for nuanced model comparisons
- Focusing on the full breadth of SE tasks beyond just code generation
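
The sketch below shows one way RepoChat-style context injection could work, assuming the public GitHub REST API as the context source; the endpoints, prompt template, and `build_repo_context` helper are illustrative assumptions rather than the platform's actual pipeline.

```python
import requests

def build_repo_context(repo_url: str, limit: int = 5) -> str:
    """Fetch recent issues/PRs and commits and format them as prompt context."""
    owner_repo = repo_url.rstrip("/").removeprefix("https://github.com/")
    api = f"https://api.github.com/repos/{owner_repo}"

    # GitHub's issues endpoint also returns pull requests.
    issues = requests.get(f"{api}/issues", params={"per_page": limit}, timeout=10).json()
    commits = requests.get(f"{api}/commits", params={"per_page": limit}, timeout=10).json()

    lines = ["Repository context:"]
    lines += [f"- Issue #{i['number']}: {i['title']}" for i in issues]
    lines += [
        f"- Commit {c['sha'][:7]}: {c['commit']['message'].splitlines()[0]}"
        for c in commits
    ]
    return "\n".join(lines)

# The assembled context is prepended to the user's SE task before the
# conversation is sent to both anonymous models.
prompt = build_repo_context("https://github.com/SE-Arena/SWE-Model-Arena")
prompt += "\n\nTask: summarize the open issues."
```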
## How It Works
1. **Submit a Prompt**: Sign in and input your SE-related task (optional: include a repository URL for RepoChat context)
2. **Compare Responses**: Two anonymous models provide responses to your query
3. **Continue the Conversation**: Test contextual understanding over multiple rounds
4. **Vote**: Choose the better model at any point, with the ability to reassess after multiple turns (a data-model sketch follows)
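
As a minimal sketch, a multi-round match could be represented as below, so that a vote can be cast or revised after any turn; the class and field names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ArenaMatch:
    model_a: str  # identities stay anonymous to the voter until after the vote
    model_b: str
    turns: List[Tuple[str, str, str]] = field(default_factory=list)
    vote: Optional[str] = None  # "model_a", "model_b", or "tie"

    def add_turn(self, prompt: str, response_a: str, response_b: str) -> None:
        # Both models see the full shared history, so later turns probe
        # contextual understanding rather than single-shot quality.
        self.turns.append((prompt, response_a, response_b))

    def cast_vote(self, choice: str) -> None:
        # Voting is allowed after any turn and may overwrite an earlier
        # choice, supporting reassessment over multiple rounds.
        self.vote = choice
```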
## Getting Started
### Prerequisites
- A [Hugging Face](https://huggingface.co) account
- Basic understanding of software engineering workflows
### Usage
1. Navigate to the [SWE-Model-Arena platform](https://huggingface.co/spaces/SE-Arena/SWE-Model-Arena)
2. Sign in with your Hugging Face account
3. Enter your SE task prompt (optionally include a repository URL for RepoChat)
4. Engage in multi-round interactions and vote on model performance
## Contributing
We welcome contributions from the community! Here's how you can help:
1. **Submit SE Tasks**: Share your real-world SE problems to enrich our evaluation dataset
2. **Report Issues**: Found a bug or have a feature request? Open an issue in this repository
3. **Enhance the Codebase**: Fork the repository, make your changes, and submit a pull request
## Privacy Policy
Your interactions are anonymized and used solely for improving SWE-Model-Arena and FM benchmarking. By using SWE-Model-Arena, you agree to our Terms of Service.
## Future Plans
- **Analysis of Real-World SE Workloads**: Identify common patterns and challenges in user-submitted tasks
- **Multi-Round Evaluation Metrics**: Develop specialized metrics for assessing model adaptation over successive turns
- **Enhanced Community Engagement**: Enable broader participation through voting and contributions
- **Expanded FM Coverage**: Include domain-specific and multimodal foundation models
- **Advanced Context Compression**: Integrate techniques like [LongRoPE](https://github.com/microsoft/LongRoPE) and [SelfExtend](https://github.com/datamllab/LongLM) to manage long-term memory
## Contact
For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/SWE-Model-Arena/issues/new) in this repository. We welcome your contributions and suggestions!
## Citation
Made with ❤️ for SWE-Model-Arena. If this work is useful to you, please consider citing our vision paper:
```bibtex
@inproceedings{zhao2025se,
  title={SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering},
  author={Zhao, Zhimin},
  booktitle={2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge)},
  pages={78--81},
  year={2025},
  organization={IEEE}
}
``` |