Open LLM Leaderboard
Track, rank and evaluate open LLMs and chatbots
Embedding Leaderboard
Explore hardware performance for LLMs
Display and request speech recognition model benchmarks
Submit code models for evaluation and view leaderboard
Display LMArena Leaderboard
Request evaluation for a new model
Display leaderboard of language models
Submit model evaluation results to leaderboard
Browse and compare AI model evaluations
View and submit LLM evaluations
Submit and view model evaluations
Compare model answers to questions in French
Explore energy consumption of GenAI models
Explore and filter LLM benchmark results
Upload and evaluate video models
Generate interactive web apps with Streamlit
Evaluate LLMs' cybersecurity risks and capabilities
Submit and evaluate models for contextual understanding tasks
Search for model performance across languages and benchmarks
Explore and submit LLM benchmarks
VLMEvalKit Evaluation Results Collection
Display and analyze reward model evaluation results
Attempt jailbreaks against LLM and privacy guardrails
Filter data on contamination in datasets and models
Track, rank and evaluate open Arabic LLMs and chatbots
Explore and compare QA and long doc benchmarks
Submit and evaluate model results on MM-UPD benchmarks
Explore and analyze code completion benchmarks
Evaluate open LLMs in the languages of LATAM and Spain
Evaluate LLMs on multilingual multimodal financial tasks