update about page
src/display/about.py (+8 -3)
@@ -28,7 +28,6 @@ This leaderboard is specifically designed to evaluate large language models (LLM
 For additional details such as datasets, evaluation criteria, and reproducibility, please refer to the "About" tab.

 Stay tuned for the *SeaBench leaderboard* - focusing on evaluating the model's ability to respond to general human instructions in real-world multi-turn settings.
-
 """

 # Which evaluations are you running? how can people reproduce what you have?
@@ -42,7 +41,7 @@ The benchmark data can be found in the [SeaExam dataset](https://huggingface.co/
 - [**MMLU**](https://arxiv.org/abs/2009.03300): a test to measure a text model's multitask accuracy in English. The test covers 57 tasks. We sample 50 questions from each task and translate the data into the other 4 languages with Google Translate.

 ## Evaluation Criteria
-We evaluate the models with the accuracy score.
+We evaluate the models with the accuracy score.

 We have the following settings for evaluation:
 - **few-shot**: the default setting is few-shot (3-shot). All open-source models are evaluated with 3-shot.
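As context for this hunk: the criterion above is plain accuracy under a 3-shot prompt. The actual prompting and scoring code lives in the SeaExam repo; the sketch below is only an illustration, and the field names (`question`, `choices`, `answer`) and prompt layout are assumptions, not the repo's real API.

```python
# Illustrative sketch of 3-shot prompting and the accuracy score described above.
# Field names and prompt layout are assumptions, not the SeaExam repo's actual code.

def build_3shot_prompt(example, shots):
    """Prepend three solved examples, then pose the new question."""
    blocks = [
        f"Q: {s['question']}\nChoices: {s['choices']}\nA: {s['answer']}" for s in shots[:3]
    ]
    blocks.append(f"Q: {example['question']}\nChoices: {example['choices']}\nA:")
    return "\n\n".join(blocks)

def accuracy(predictions, references):
    """Accuracy (%): share of questions whose predicted choice matches the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```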
@@ -50,7 +49,11 @@ We have the following settings for evaluation:


 ## Results
-
+How to interpret the leaderboard?
+* Each numerical value represents the accuracy (%).
+* The "M3Exam" and "MMLU" pages show the performance of each model for that dataset.
+* The "Overall" page shows the average results of "M3Exam" and "MMLU".
+* The leaderboard is sorted by avg_sea, the average score across the SEA languages (id, th, and vi).

 ## Reproducibility
 To reproduce our results, use the script in [this repo](https://github.com/DAMO-NLP-SG/SeaExam/tree/main). The script will download the model and tokenizer, and evaluate the model on the benchmark data.
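The ranking rule added in this hunk (avg_sea as the mean over id, th, and vi, with "Overall" averaging M3Exam and MMLU) reduces to a plain mean. A minimal sketch follows; all numbers are placeholders, not actual leaderboard results.

```python
# Sketch of the ranking described in the added bullets; all numbers are placeholders.
SEA_LANGS = ("id", "th", "vi")

m3exam = {"id": 55.0, "th": 49.3, "vi": 52.8}   # hypothetical accuracies (%)
mmlu   = {"id": 58.1, "th": 50.6, "vi": 54.0}   # hypothetical accuracies (%)

def avg_sea(scores):
    """Average accuracy over the SEA languages; the leaderboard's sort key."""
    return sum(scores[lang] for lang in SEA_LANGS) / len(SEA_LANGS)

# The "Overall" page averages the two datasets' results.
overall_avg_sea = (avg_sea(m3exam) + avg_sea(mmlu)) / 2
```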
@@ -60,6 +63,8 @@ python scripts/main.py --model $model_name_or_path

 """

+# You can find the detailed numerical results in the results Hugging Face dataset: https://huggingface.co/datasets/SeaLLMs/SeaExam-results
+
 EVALUATION_QUEUE_TEXT = """
 ## Some good practices before submitting a model

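The hunk header above carries the repo's reproduction command, `python scripts/main.py --model $model_name_or_path`. A hedged sketch of looping it over several checkpoints; only the `--model` flag is taken from the diff, and the model list is a placeholder.

```python
# Loop the reproduction command over several models; names below are placeholders.
import subprocess

models = ["your-org/your-model-7b"]  # replace with the checkpoints to evaluate

for name in models:
    # Mirrors: python scripts/main.py --model $model_name_or_path
    subprocess.run(["python", "scripts/main.py", "--model", name], check=True)
```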