update about page
src/display/about.py (+8 -3)
@@ -28,7 +28,6 @@ This leaderboard is specifically designed to evaluate large language models (LLM
 For additional details such as datasets, evaluation criteria, and reproducibility, please refer to the "About" tab.

 Stay tuned for the *SeaBench leaderboard* - focusing on evaluating the model's ability to respond to general human instructions in real-world multi-turn settings.
-
 """

 # Which evaluations are you running? how can people reproduce what you have?
@@ -42,7 +41,7 @@ The benchmark data can be found in the [SeaExam dataset](https://huggingface.co/
 - [**MMLU**](https://arxiv.org/abs/2009.03300): a test to measure a text model's multitask accuracy in English. The test covers 57 tasks. We sample 50 questions from each task and translate the data into the other 4 languages with Google Translate.

 ## Evaluation Criteria
-We evaluate the models with the accuracy score.
+We evaluate the models with the accuracy score.

 We have the following settings for evaluation:
 - **few-shot**: the default setting is few-shot (3-shot). All open-source models are evaluated with 3-shot.
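As context for this hunk: the criterion above is plain accuracy under a 3-shot prompt. The actual prompting and scoring code lives in the SeaExam repo; the sketch below is only an illustration, and the field names (`question`, `choices`, `answer`) and prompt layout are assumptions, not the repo's real API.

```python
# Illustrative sketch of 3-shot prompting and the accuracy score described above.
# Field names and prompt layout are assumptions, not the SeaExam repo's actual code.

def build_3shot_prompt(example, shots):
    """Prepend three solved examples, then pose the new question."""
    blocks = [
        f"Q: {s['question']}\nChoices: {s['choices']}\nA: {s['answer']}" for s in shots[:3]
    ]
    blocks.append(f"Q: {example['question']}\nChoices: {example['choices']}\nA:")
    return "\n\n".join(blocks)

def accuracy(predictions, references):
    """Accuracy (%): share of questions whose predicted choice matches the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```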
@@ -50,7 +49,11 @@ We have the following settings for evaluation:


 ## Results
-
+How to interpret the leaderboard?
+* Each numerical value represents the accuracy (%).
+* The "M3Exam" and "MMLU" pages show the performance of each model for that dataset.
+* The "Overall" page shows the average results of "M3Exam" and "MMLU".
+* The leaderboard is sorted by avg_sea, the average score across the SEA languages (id, th, and vi).

 ## Reproducibility
 To reproduce our results, use the script in [this repo](https://github.com/DAMO-NLP-SG/SeaExam/tree/main). The script will download the model and tokenizer, and evaluate the model on the benchmark data.
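The ranking rule added in this hunk (avg_sea as the mean over id, th, and vi, with "Overall" averaging M3Exam and MMLU) reduces to a plain mean. A minimal sketch follows; all numbers are placeholders, not actual leaderboard results.

```python
# Sketch of the ranking described in the added bullets; all numbers are placeholders.
SEA_LANGS = ("id", "th", "vi")

m3exam = {"id": 55.0, "th": 49.3, "vi": 52.8}   # hypothetical accuracies (%)
mmlu   = {"id": 58.1, "th": 50.6, "vi": 54.0}   # hypothetical accuracies (%)

def avg_sea(scores):
    """Average accuracy over the SEA languages; the leaderboard's sort key."""
    return sum(scores[lang] for lang in SEA_LANGS) / len(SEA_LANGS)

# The "Overall" page averages the two datasets' results.
overall_avg_sea = (avg_sea(m3exam) + avg_sea(mmlu)) / 2
```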
@@ -60,6 +63,8 @@ python scripts/main.py --model $model_name_or_path

 """

+# You can find the detailed numerical results in the results Hugging Face dataset: https://huggingface.co/datasets/SeaLLMs/SeaExam-results
+
 EVALUATION_QUEUE_TEXT = """
 ## Some good practices before submitting a model

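The hunk header above carries the repo's reproduction command, `python scripts/main.py --model $model_name_or_path`. A hedged sketch of looping it over several checkpoints; only the `--model` flag is taken from the diff, and the model list is a placeholder.

```python
# Loop the reproduction command over several models; names below are placeholders.
import subprocess

models = ["your-org/your-model-7b"]  # replace with the checkpoints to evaluate

for name in models:
    # Mirrors: python scripts/main.py --model $model_name_or_path
    subprocess.run(["python", "scripts/main.py", "--model", name], check=True)
```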