add description
app.py
CHANGED
@@ -42,3 +42,34 @@ styled_data = (
 
 
 st.dataframe(styled_data, use_container_width=True, height=800, hide_index=True)
+
+st.text("\n\n")
+st.markdown(
+    r"""
+This leaderboard measures the **system-level performance and behavior of LLM judges**, and was created as part of the **[JuStRank paper](https://www.arxiv.org/abs/2412.09569)** from ACL 2025.
+
+Judges are sorted according to their *Ranking Agreement* with humans, i.e., by comparing how the judges rank different systems (generative models) relative to how humans rank those systems on [LMSys Arena](https://lmarena.ai/leaderboard/text/hard-prompts-english).
+
+We also compare judges in terms of the *Decisiveness* and *Bias* reflected in their judgment behaviors (refer to the paper for details).
+
+In our research, we tested 10 **LLM judges** and 8 **reward models**, and asked them to score the [responses](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto/tree/main/data/arena-hard-v0.1/model_answer) of 63 systems to the 500 questions from Arena Hard v0.1.
+For each LLM judge we tried 4 different _realizations_, i.e., different combinations of prompt and scoring method used with that judge.
+
+In total, the judge ranking is derived from **[1.5 million raw judgment scores](https://huggingface.co/datasets/ibm-research/justrank_judge_scores)** (48 judge realizations × 63 target systems × 500 instances).
+
+If you find this useful, please cite our work 🤗
+
+```bibtex
+@inproceedings{gera2025justrank,
+    title={JuStRank: Benchmarking LLM Judges for System Ranking},
+    author={Gera, Ariel and Boni, Odellia and Perlitz, Yotam and Bar-Haim, Roy and Eden, Lilach and Yehudai, Asaf},
+    booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+    month={July},
+    address={Vienna, Austria},
+    year={2025},
+    url={https://www.arxiv.org/abs/2412.09569},
+}
+```
+"""
+)
+
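For context on the description added above: *Ranking Agreement* compares the system ranking induced by a judge's scores with the human ranking from the Arena leaderboard. Below is a minimal sketch of such a comparison, assuming Kendall's tau as the agreement measure and mean per-system score aggregation, with made-up scores; the paper defines the exact metric and tie handling.

```python
# Sketch of system-level ranking agreement (assumption: Kendall's tau over
# per-system aggregate scores; the JuStRank paper defines the exact metric).
from scipy.stats import kendalltau

# Hypothetical aggregates: a judge's mean score per system (over all of its
# instance-level judgments) and the human Arena score for the same systems.
# In JuStRank, 48 realizations x 63 systems x 500 instances = 1,512,000
# raw judgment scores (~1.5 million) feed these per-system aggregates.
judge_scores = {"sys_a": 0.81, "sys_b": 0.64, "sys_c": 0.72, "sys_d": 0.55}
human_scores = {"sys_a": 1250.0, "sys_b": 1175.0, "sys_c": 1210.0, "sys_d": 1190.0}

systems = sorted(judge_scores)  # fixed order so both score lists align by system
tau, _ = kendalltau(
    [judge_scores[s] for s in systems],
    [human_scores[s] for s in systems],
)
print(f"Ranking agreement (Kendall's tau): {tau:.2f}")
```

Kendall's tau on the raw per-system scores is equivalent to comparing the two induced rankings pair by pair, which is why no explicit sort of either list is needed.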