Open LLM Leaderboard
Track, rank and evaluate open LLMs and chatbots
Embedding Leaderboard
Explore hardware performance for LLMs
Display and request speech recognition model benchmarks
Submit code models for evaluation and view leaderboard
Display LMArena Leaderboard
Request evaluation for a new model
Display leaderboard of language models
Submit model evaluation results to leaderboard
Browse and compare AI model evaluations
View and submit LLM evaluations
Submit and view model evaluations
Compare model answers to questions in French
Explore energy consumption of GenAI models
Explore and filter LLM benchmark results
Upload and evaluate video models
Generate interactive web apps with Streamlit
Evaluate LLMs' cybersecurity risks and capabilities
Submit and evaluate models for contextual understanding tasks
Search for model performance across languages and benchmarks
Explore and submit LLM benchmarks
VLMEvalKit Evaluation Results Collection
Display and analyze reward model evaluation results
Attempt jailbreaks against LLM and privacy guardrails
Filter data on contamination in datasets and models
Track, rank and evaluate open Arabic LLMs and chatbots
Explore and compare QA and long doc benchmarks
Submit and evaluate model results on MM-UPD benchmarks
Explore and analyze code completion benchmarks
Evaluate open LLMs in the languages of LATAM and Spain
Evaluate LLMs on multilingual multimodal financial tasks