---
title: AutoBench Leaderboard
emoji: πŸ‘€
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
license: mit
short_description: Multi-run AutoBench leaderboard with historical navigation
---

# AutoBench LLM Leaderboard

Interactive leaderboard for AutoBench, where Large Language Models (LLMs) evaluate and rank responses from other LLMs. This application supports multiple benchmark runs with seamless navigation between different time periods.

## 🌟 Features

### Multi-Run Navigation
- **πŸ“Š Run Selector**: Switch between different AutoBench runs using the dropdown menu
- **πŸ• Historical Data**: View and compare results across different time periods
- **πŸ”„ Reactive Interface**: All tabs and visualizations update automatically when switching runs
- **πŸ“ˆ Enhanced Metrics**: Support for evaluation iterations and fail rates in newer runs

### Comprehensive Analysis
- **Overall Ranking**: Model performance with AutoBench scores, costs, latency, and reliability metrics
- **Benchmark Comparison**: Correlations with Chatbot Arena, AAI Index, and MMLU benchmarks
- **Performance Plots**: Interactive scatter plots showing cost vs. performance trade-offs
- **Cost & Latency Analysis**: Detailed breakdown by domain and response time percentiles
- **Domain Performance**: Model rankings across specific knowledge areas

### Dynamic Features
- **πŸ“Š Benchmark Correlations**: Displays correlation percentages with other popular benchmarks
- **πŸ’° Cost Conversion**: Automatic conversion to cents for better readability
- **⚑ Performance Metrics**: Average and P99 latency measurements
- **🎯 Fail Rate Tracking**: Model reliability metrics (for supported runs)
- **πŸ”’ Iteration Counts**: Number of evaluations per model (for supported runs)
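
The cost and latency displays above are straightforward transformations of the summary data. A minimal pandas sketch of what they imply is shown below; the column names come from `summary_data.csv` (see "Required Files" later in this README), while the run path and the derived `Costs (cents)` column name are illustrative assumptions.

```python
import pandas as pd

# Load the main leaderboard data for one run (example path from the metadata sample below).
df = pd.read_csv("runs/run_2025-08-14/summary_data.csv")

# Convert per-response cost from USD to cents for readability.
df["Costs (cents)"] = df["Costs (USD)"] * 100

# Average and P99 latency are reported as-is, in seconds.
display_cols = ["Model", "AutoBench", "Costs (cents)",
                "Avg Answer Duration (sec)", "P99 Answer Duration (sec)"]
print(df[display_cols].sort_values("AutoBench", ascending=False).head())
```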

## πŸš€ How to Use

### Navigation
1. **Select a Run**: Use the dropdown menu at the top to choose between available benchmark runs
2. **Explore Tabs**: Navigate through different analysis views using the tab interface
3. **Interactive Tables**: Sort and filter data by clicking on column headers
4. **Hover for Details**: Get additional information by hovering over chart elements

### Understanding the Data
- **AutoBench Score**: Higher scores indicate better performance
- **Cost**: Lower values are better (displayed in cents per response)
- **Latency**: Lower response times are better (average and P99 percentiles)
- **Fail Rate**: Lower percentages indicate more reliable models
- **Iterations**: Number of evaluation attempts per model

## πŸ”§ Adding New Runs

### Directory Structure
```
runs/
β”œβ”€β”€ run_YYYY-MM-DD/
β”‚   β”œβ”€β”€ metadata.json          # Run information and metadata
β”‚   β”œβ”€β”€ correlations.json      # Benchmark correlation data
β”‚   β”œβ”€β”€ summary_data.csv       # Main leaderboard data
β”‚   β”œβ”€β”€ domain_ranks.csv       # Domain-specific rankings
β”‚   β”œβ”€β”€ cost_data.csv          # Cost breakdown by domain
β”‚   β”œβ”€β”€ avg_latency.csv        # Average latency by domain
β”‚   └── p99_latency.csv        # P99 latency by domain
```

### Required Files

#### 1. metadata.json
```json
{
  "run_id": "run_2025-08-14",
  "title": "AutoBench Run 3 - August 2025",
  "date": "2025-08-14",
  "description": "Latest AutoBench run with enhanced metrics",
  "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-3rd-run",
  "model_count": 34,
  "is_latest": true
}
```

#### 2. correlations.json
```json
{
  "correlations": {
    "Chatbot Arena": 82.51,
    "Artificial Analysis Intelligence Index": 83.74,
    "MMLU": 71.51
  },
  "description": "Correlation percentages between AutoBench scores and other benchmark scores"
}
```

#### 3. summary_data.csv
Required columns:
- `Model`: Model name
- `AutoBench`: AutoBench score
- `Costs (USD)`: Cost per response in USD
- `Avg Answer Duration (sec)`: Average response time
- `P99 Answer Duration (sec)`: 99th percentile response time

Optional columns (for enhanced metrics):
- `Iterations`: Number of evaluation iterations
- `Fail Rate %`: Percentage of failed responses
- `LMArena` or `Chatbot Ar.`: Chatbot Arena scores
- `MMLU-Pro` or `MMLU Index`: MMLU benchmark scores
- `AAI Index`: Artificial Analysis Intelligence Index scores
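
A quick way to sanity-check a new `summary_data.csv` against these column lists is sketched below; the `validate_summary` helper is hypothetical, but the required and optional column names match the lists above.

```python
import pandas as pd

REQUIRED = [
    "Model",
    "AutoBench",
    "Costs (USD)",
    "Avg Answer Duration (sec)",
    "P99 Answer Duration (sec)",
]
OPTIONAL = ["Iterations", "Fail Rate %", "LMArena", "Chatbot Ar.",
            "MMLU-Pro", "MMLU Index", "AAI Index"]

def validate_summary(path):
    """Load a summary CSV and fail loudly if required columns are missing."""
    df = pd.read_csv(path)
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"{path} is missing required columns: {missing}")
    present_optional = [c for c in OPTIONAL if c in df.columns]
    print(f"Optional metrics found: {present_optional or 'none'}")
    return df

# Example: validate_summary("runs/run_2025-08-14/summary_data.csv")
```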

### Adding a New Run

1. **Create Directory**: `mkdir runs/run_YYYY-MM-DD`
2. **Add Data Files**: Copy your CSV files to the new directory
3. **Create Metadata**: Add `metadata.json` with run information
4. **Add Correlations**: Create `correlations.json` with benchmark correlations
5. **Update Previous Run**: Set `"is_latest": false` in the previous latest run's metadata
6. **Restart App**: The new run will be automatically discovered
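
Steps 1, 3, and 5 can be scripted. The helper below is a hypothetical sketch, not part of the app; it follows the directory layout and `metadata.json` fields shown above, and you would still copy the CSV files and `correlations.json` in manually.

```python
import json
from pathlib import Path

def scaffold_run(run_date, title, runs_dir="runs"):
    """Create a new run directory, demote the previous latest run, and write metadata."""
    new_dir = Path(runs_dir) / f"run_{run_date}"
    new_dir.mkdir(parents=True, exist_ok=True)

    # Step 5: mark every existing run as no longer latest.
    for meta_path in Path(runs_dir).glob("run_*/metadata.json"):
        meta = json.loads(meta_path.read_text())
        meta["is_latest"] = False
        meta_path.write_text(json.dumps(meta, indent=2))

    # Step 3: write metadata for the new run (description left as a placeholder).
    metadata = {
        "run_id": f"run_{run_date}",
        "title": title,
        "date": run_date,
        "description": "",
        "model_count": 0,
        "is_latest": True,
    }
    (new_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return new_dir

# scaffold_run("2025-08-14", "AutoBench Run 3 - August 2025")
```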

### Column Compatibility

The application automatically adapts to different column structures:
- **Legacy Runs**: Support basic columns (Model, AutoBench, Cost, Latency)
- **Enhanced Runs**: Include additional metrics (Iterations, Fail Rate %)
- **Flexible Naming**: Handles variations in benchmark column names
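
The sketch below illustrates the kind of column aliasing this flexible naming implies; the mapping is illustrative, not the app's actual implementation.

```python
import pandas as pd

# Map known benchmark column variants to one canonical name.
COLUMN_ALIASES = {
    "LMArena": "Chatbot Arena",
    "Chatbot Ar.": "Chatbot Arena",
    "MMLU-Pro": "MMLU",
    "MMLU Index": "MMLU",
}

def normalize_columns(df):
    """Rename benchmark column variants so downstream code sees one name."""
    return df.rename(columns={c: COLUMN_ALIASES[c]
                              for c in df.columns if c in COLUMN_ALIASES})

# Legacy runs simply lack optional columns such as "Iterations" or
# "Fail Rate %", so downstream code should check for them before use.
```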

## πŸ› οΈ Development

### Requirements
- Python 3.8+
- Gradio 5.27.0+
- Pandas
- Plotly

### Installation
```bash
pip install -r requirements.txt
```
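
The `requirements.txt` referenced here is expected to cover the packages listed above; a minimal version (exact contents are an assumption) might look like:

```
gradio>=5.27.0
pandas
plotly
```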

### Running Locally
```bash
python app.py
```

### Killing All Python Processes
```bash
# Windows command (run from a POSIX-style shell such as Git Bash);
# on Linux/macOS, use `pkill -f python` instead.
taskkill /F /IM python.exe 2>/dev/null || echo "No Python processes to kill"
```

The app will automatically discover available runs and launch on a local port.
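
Run discovery likely amounts to scanning the `runs/` directory for `metadata.json` files, as in the rough sketch below; the actual logic lives in `app.py` and may differ.

```python
import json
from pathlib import Path

def discover_runs(runs_dir="runs"):
    """Collect metadata for every run directory under runs/."""
    runs = []
    for meta_path in sorted(Path(runs_dir).glob("run_*/metadata.json")):
        meta = json.loads(meta_path.read_text())
        meta["path"] = str(meta_path.parent)
        runs.append(meta)
    # Newest run first, so the dropdown can default to the latest data.
    return sorted(runs, key=lambda m: m["date"], reverse=True)
```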

## πŸ“Š Data Sources

AutoBench evaluations are conducted using LLM-generated questions across diverse domains, with responses ranked by evaluation LLMs. For more information about the methodology, visit the [AutoBench blog posts](https://huggingface.co/blog/PeterKruger/autobench).

## πŸ“„ License

MIT License - see LICENSE file for details.

---

Check out the [Hugging Face Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference) for deployment options.