from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")

NUM_FEWSHOT = 0  # Change to match your few-shot setting
# ---------------------------------------------------
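
# Illustrative sketch (an assumption, not part of the upstream template): the fields of each
# Task are typically read by the leaderboard code to label columns and to look up scores in
# submitted result files. The helper below only demonstrates that mapping.
def _example_task_columns() -> list[str]:
    """Return the display names of the configured tasks, e.g. ["ANLI", "LogiQA"]."""
    return [task.value.col_name for task in Tasks]
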
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">MageBench Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
MageBench is a reasoning-oriented multimodal intelligent agent benchmark introduced in the paper ["MageBench: Bridging Large Multimodal Models to Agents"](https://arxiv.org/abs/2412.04531).
The selected tasks meet the following criteria:
- a simple environment,
- a genuine test of reasoning ability,
- a high level of visual involvement.

In our paper, we demonstrate that our benchmark generalizes well to other scenarios.
We hope this work can empower future research on intelligent agents, robotics, and more.
"""
# Which evaluations are you running? How can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## How it works
This platform does not run your model; it only hosts the leaderboard.
You need to choose the preset that matches your results, run the evaluation in your local environment,
and then submit the results to us for approval. Once approved, we will make your results public.

## Reproducibility
Since we cannot reproduce submitters' results ourselves, to ensure their reliability
we require every submitter to provide either a link to a paper/blog/report that includes contact information or a link to an open-source GitHub repository that reproduces the results.
**Results that do not meet the above conditions, or that have other issues affecting fairness
(such as an incorrect setting category), will be removed.**
"""
EVALUATION_QUEUE_TEXT = """
# Instructions to submit results
- First, make sure you have read the "About" section.
- Test your model locally and submit your results using the form below.
- Upload **one** result at a time by filling in the form and clicking "Upload One Eval"; the result then appears in the "Uploaded results" section.
- Continue uploading until all results are in place, then click "Submit All". After the Space restarts, your results will appear on the leaderboard, marked as checking.
- If an uploaded result contains an error, click "Click Upload" and re-upload all results.
- If a submitted result contains an error, you can upload a replacement; we will use the most recently submitted results during our review.
- If a "checked" result contains an error, email us to withdraw it.

# Detailed settings
- **Score**: float, the corresponding evaluation number.
- **Name**: str, **fewer than 3 words**, an abbreviation representing your work; it can be a model name or paper keywords.
- **BaseModel**: str, the LMM used by the agent; we suggest using the unique HF model id.
- **Target-research**: (1) `Model-Eval-Online` and `Model-Eval-Global` are the standard settings proposed in our paper and are used to test model capability. (2) `Agent-Eval-Prompt`: any agent design that uses fixed model weights, including RAG, memory, etc. (3) `Agent-Eval-Finetune`: the model weights are changed, i.e. the model is trained on in-domain (same-environment) data.
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@article{zhang2024magebench,
  title={MageBench: Bridging Large Multimodal Models to Agents},
  author={Miaosen Zhang and Qi Dai and Yifan Yang and Jianmin Bao and Dongdong Chen and Kai Qiu and Chong Luo and Xin Geng and Baining Guo},
  journal={arXiv preprint arXiv:2412.04531},
  year={2024}
}
"""