---
title: AutoBench Leaderboard
emoji: π
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
license: mit
short_description: Multi-run AutoBench leaderboard with historical navigation
---
# AutoBench LLM Leaderboard
Interactive leaderboard for AutoBench, where Large Language Models (LLMs) evaluate and rank responses from other LLMs. This application supports multiple benchmark runs with seamless navigation between different time periods.
## Features
### Multi-Run Navigation
- **Run Selector**: Switch between different AutoBench runs using the dropdown menu
- **Historical Data**: View and compare results across different time periods
- **Reactive Interface**: All tabs and visualizations update automatically when switching runs (see the sketch after this list)
- **Enhanced Metrics**: Support for evaluation iterations and fail rates in newer runs
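A minimal sketch of how such a reactive run selector can be wired up in Gradio; the `load_run_summary` helper, the `runs/` layout, and the column handling are assumptions for illustration, not the app's actual implementation:

```python
from pathlib import Path

import gradio as gr
import pandas as pd


def load_run_summary(run_id: str) -> pd.DataFrame:
    """Hypothetical helper: load the main leaderboard table for one run."""
    return pd.read_csv(Path("runs") / run_id / "summary_data.csv")


run_ids = sorted(p.name for p in Path("runs").glob("run_*"))
default_run = run_ids[-1] if run_ids else None

with gr.Blocks() as demo:
    selector = gr.Dropdown(choices=run_ids, value=default_run, label="Benchmark run")
    table = gr.Dataframe(value=load_run_summary(default_run) if default_run else None)
    # Changing the dropdown re-renders the table with the selected run's data.
    selector.change(fn=load_run_summary, inputs=selector, outputs=table)

demo.launch()
```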
### Comprehensive Analysis
- **Overall Ranking**: Model performance with AutoBench scores, costs, latency, and reliability metrics
- **Benchmark Comparison**: Correlations with Chatbot Arena, AAI Index, and MMLU benchmarks
- **Performance Plots**: Interactive scatter plots showing cost vs. performance trade-offs
- **Cost & Latency Analysis**: Detailed breakdown by domain and response time percentiles
- **Domain Performance**: Model rankings across specific knowledge areas
### Dynamic Features
- **Benchmark Correlations**: Displays correlation percentages with other popular benchmarks
- **Cost Conversion**: Automatic conversion to cents for better readability
- **Performance Metrics**: Average and P99 latency measurements
- **Fail Rate Tracking**: Model reliability metrics (for supported runs)
- **Iteration Counts**: Number of evaluations per model (for supported runs)
## How to Use
### Navigation
1. **Select a Run**: Use the dropdown menu at the top to choose between available benchmark runs
2. **Explore Tabs**: Navigate through different analysis views using the tab interface
3. **Interactive Tables**: Sort and filter data by clicking on column headers
4. **Hover for Details**: Get additional information by hovering over chart elements
### Understanding the Data
- **AutoBench Score**: Higher scores indicate better performance
- **Cost**: Lower values are better (displayed in cents per response)
- **Latency**: Lower response times are better (average and P99 percentiles)
- **Fail Rate**: Lower percentages indicate more reliable models
- **Iterations**: Number of evaluation attempts per model
## Adding New Runs
### Directory Structure
```
runs/
├── run_YYYY-MM-DD/
│   ├── metadata.json        # Run information and metadata
│   ├── correlations.json    # Benchmark correlation data
│   ├── summary_data.csv     # Main leaderboard data
│   ├── domain_ranks.csv     # Domain-specific rankings
│   ├── cost_data.csv        # Cost breakdown by domain
│   ├── avg_latency.csv      # Average latency by domain
│   └── p99_latency.csv      # P99 latency by domain
```
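At startup the app scans this directory for runs. A minimal sketch of how that discovery could work, assuming the layout above (the `discover_runs` helper is illustrative, not the app's actual code):

```python
import json
from pathlib import Path
from typing import List


def discover_runs(runs_dir: str = "runs") -> List[dict]:
    """Hypothetical helper: collect metadata for every run directory."""
    runs = []
    for run_dir in sorted(Path(runs_dir).glob("run_*")):
        metadata_path = run_dir / "metadata.json"
        if not metadata_path.is_file():
            continue  # skip directories that are missing required metadata
        metadata = json.loads(metadata_path.read_text())
        metadata["path"] = str(run_dir)
        runs.append(metadata)
    # Sort by date so the newest run can serve as the default selection.
    return sorted(runs, key=lambda m: m.get("date", ""))
```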
### Required Files
#### 1. metadata.json
```json
{
  "run_id": "run_2025-08-14",
  "title": "AutoBench Run 3 - August 2025",
  "date": "2025-08-14",
  "description": "Latest AutoBench run with enhanced metrics",
  "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-3rd-run",
  "model_count": 34,
  "is_latest": true
}
```
#### 2. correlations.json
```json
{
  "correlations": {
    "Chatbot Arena": 82.51,
    "Artificial Analysis Intelligence Index": 83.74,
    "MMLU": 71.51
  },
  "description": "Correlation percentages between AutoBench scores and other benchmark scores"
}
```
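A short sketch of how these values could be read and formatted for display; the `load_correlations` helper is an assumption for illustration, not the app's actual code:

```python
import json
from pathlib import Path


def load_correlations(run_dir: str) -> dict:
    """Hypothetical helper: read correlations.json and format values as percentages."""
    data = json.loads((Path(run_dir) / "correlations.json").read_text())
    # Values are stored as percentages, e.g. 82.51 -> "82.5%".
    return {name: f"{value:.1f}%" for name, value in data["correlations"].items()}


# Example: load_correlations("runs/run_2025-08-14")
# -> {"Chatbot Arena": "82.5%", "Artificial Analysis Intelligence Index": "83.7%", "MMLU": "71.5%"}
```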
#### 3. summary_data.csv
Required columns:
- `Model`: Model name
- `AutoBench`: AutoBench score
- `Costs (USD)`: Cost per response in USD
- `Avg Answer Duration (sec)`: Average response time
- `P99 Answer Duration (sec)`: 99th percentile response time
Optional columns (for enhanced metrics; see the loading sketch after this list):
- `Iterations`: Number of evaluation iterations
- `Fail Rate %`: Percentage of failed responses
- `LMArena` or `Chatbot Ar.`: Chatbot Arena scores
- `MMLU-Pro` or `MMLU Index`: MMLU benchmark scores
- `AAI Index`: Artificial Analysis Intelligence Index scores
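A minimal sketch of loading this file with pandas, converting costs to cents, and tolerating missing optional columns; the helper name and exact behavior are assumptions, not the app's actual code:

```python
from pathlib import Path

import pandas as pd


def load_summary(run_dir: str) -> pd.DataFrame:
    """Hypothetical helper: load summary_data.csv and normalize it for display."""
    df = pd.read_csv(Path(run_dir) / "summary_data.csv")
    # Convert USD per response to cents for readability.
    df["Cost (cents)"] = df["Costs (USD)"] * 100
    # Enhanced metrics are optional; add placeholders so older runs still render.
    for optional in ("Iterations", "Fail Rate %"):
        if optional not in df.columns:
            df[optional] = pd.NA
    # Higher AutoBench scores rank first.
    return df.sort_values("AutoBench", ascending=False)
```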
### Adding a New Run
1. **Create Directory**: `mkdir runs/run_YYYY-MM-DD`
2. **Add Data Files**: Copy your CSV files to the new directory
3. **Create Metadata**: Add `metadata.json` with run information
4. **Add Correlations**: Create `correlations.json` with benchmark correlations
5. **Update Previous Run**: Set `"is_latest": false` in the previous latest run's metadata (see the sketch after this list)
6. **Restart App**: The new run will be automatically discovered
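A small sketch that scripts steps 3 and 5; the `register_run` helper, its arguments, and the field values are illustrative assumptions, not part of the app:

```python
import json
from pathlib import Path

RUNS_DIR = Path("runs")


def register_run(run_id: str, title: str, date: str, model_count: int) -> None:
    """Hypothetical helper: write metadata for a new run and demote the previous latest."""
    # Step 5: the previously latest run is no longer the default.
    for metadata_path in RUNS_DIR.glob("run_*/metadata.json"):
        metadata = json.loads(metadata_path.read_text())
        if metadata.get("is_latest"):
            metadata["is_latest"] = False
            metadata_path.write_text(json.dumps(metadata, indent=2))
    # Step 3: create metadata.json for the new run.
    run_dir = RUNS_DIR / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    new_metadata = {
        "run_id": run_id,
        "title": title,
        "date": date,
        "model_count": model_count,
        "is_latest": True,
    }
    (run_dir / "metadata.json").write_text(json.dumps(new_metadata, indent=2))


# Example: register_run("run_2025-08-14", "AutoBench Run 3 - August 2025", "2025-08-14", 34)
```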
### Column Compatibility
The application automatically adapts to different column structures:
- **Legacy Runs**: Support basic columns (Model, AutoBench, Cost, Latency)
- **Enhanced Runs**: Include additional metrics (Iterations, Fail Rate %)
- **Flexible Naming**: Handles variations in benchmark column names (see the sketch below)
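A minimal sketch of how such column-name variations could be resolved; the alias map and the `resolve_column` helper are assumptions for illustration:

```python
from typing import Optional

import pandas as pd

# Possible source names for each canonical benchmark column.
COLUMN_ALIASES = {
    "Chatbot Arena": ["LMArena", "Chatbot Ar."],
    "MMLU": ["MMLU-Pro", "MMLU Index"],
    "AAI Index": ["AAI Index"],
}


def resolve_column(df: pd.DataFrame, canonical: str) -> Optional[str]:
    """Hypothetical helper: return whichever alias of a benchmark column is present."""
    for alias in COLUMN_ALIASES.get(canonical, []):
        if alias in df.columns:
            return alias
    return None
```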
## Development
### Requirements
- Python 3.8+
- Gradio 5.27.0+
- Pandas
- Plotly
### Installation
```bash
pip install -r requirements.txt
```
### Running Locally
```bash
python app.py
```
### Killing All Python Processes
If a previous instance is still holding the port, you can force-stop all Python processes. Note that `taskkill` is Windows-only, and the `2>/dev/null || echo` fallback assumes a POSIX-style shell such as Git Bash:
```bash
taskkill /F /IM python.exe 2>/dev/null || echo "No Python processes to kill"
```
The app will automatically discover available runs and launch on a local port.
## Data Sources
AutoBench evaluations are conducted using LLM-generated questions across diverse domains, with responses ranked by evaluation LLMs. For more information about the methodology, visit the [AutoBench blog posts](https://huggingface.co/blog/PeterKruger/autobench).
## License
MIT License - see LICENSE file for details.
---
Check out the [Hugging Face Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference) for deployment options.