Corey Morris
commited on
Commit
·
4fbdb10
1
Parent(s):
6d0ce25
Added citation
Browse files
app.py
CHANGED
|
@@ -145,3 +145,23 @@ st.plotly_chart(fig)
|
|
| 145 |
|
| 146 |
fig = px.histogram(filtered_data, x="MMLU_moral_disputes", marginal="rug", hover_data=filtered_data.columns)
|
| 147 |
st.plotly_chart(fig)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 145 |
|
| 146 |
fig = px.histogram(filtered_data, x="MMLU_moral_disputes", marginal="rug", hover_data=filtered_data.columns)
|
| 147 |
st.plotly_chart(fig)
|
| 148 |
+
|
| 149 |
+
|
| 150 |
+
st.markdown("**Thank you to hugging face for running the evaluations and supplying the data as well as the original authors of the evaluations**")
|
| 151 |
+
|
| 152 |
+
|
| 153 |
+
st.markdown("""
|
| 154 |
+
# References
|
| 155 |
+
|
| 156 |
+
1. Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, Thomas Wolf. (2023). *Open LLM Leaderboard*. Hugging Face. [link](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
|
| 157 |
+
|
| 158 |
+
2. Gao, Leo et al. (2021). *A framework for few-shot language model evaluation*. Zenodo. [link](https://doi.org/10.5281/zenodo.5371628)
|
| 159 |
+
|
| 160 |
+
3. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord. (2018). *Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge*. arXiv. [link](https://arxiv.org/abs/1803.05457)
|
| 161 |
+
|
| 162 |
+
4. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi. (2019). *HellaSwag: Can a Machine Really Finish Your Sentence?*. arXiv. [link](https://arxiv.org/abs/1905.07830)
|
| 163 |
+
|
| 164 |
+
5. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. (2021). *Measuring Massive Multitask Language Understanding*. arXiv. [link](https://arxiv.org/abs/2009.03300)
|
| 165 |
+
|
| 166 |
+
6. Stephanie Lin, Jacob Hilton, Owain Evans. (2022). *TruthfulQA: Measuring How Models Mimic Human Falsehoods*. arXiv. [link](https://arxiv.org/abs/2109.07958)
|
| 167 |
+
""")
|