from pathlib import Path
# Directory where model requests are stored
| DIR_OUTPUT_REQUESTS = Path("requested_models") | |
| EVAL_REQUESTS_PATH = Path("eval_requests") | |
| ########################## | |
| # Text definitions # | |
| ########################## | |
| banner_url = "https://huggingface.co/datasets/reach-vb/random-images/resolve/main/asr_leaderboard.png" | |
| BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>' | |
| TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> ๐ค Open Automatic Speech Recognition Leaderboard </b> </body> </html>" | |
| INTRODUCTION_TEXT = "๐ The ๐ค Open ASR Leaderboard ranks and evaluates speech recognition models \ | |
| on the Hugging Face Hub. \ | |
| \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (โฌ๏ธ lower the better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (โฌ๏ธ higher the better). Models are ranked based on their Average WER, from lowest to highest. Check the ๐ Metrics tab to understand how the models are evaluated. \ | |
| \nIf you want results for a model that is not listed here, you can submit a request for it to be included โ๏ธโจ. \ | |
| \nThe leaderboard includes both English ASR evaluation and multilingual benchmarks across the top European languages." | |
| CITATION_TEXT = """@misc{srivastav2025openasrleaderboardreproducible, | |
| title={Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation}, | |
| author={Vaibhav Srivastav and Steven Zheng and Eric Bezzam and Eustache Le Bihan and Nithin Koluguri and Piotr ลปelasko and Somshubra Majumdar and Adel Moumen and Sanchit Gandhi}, | |
| year={2025}, | |
| eprint={2510.06961}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CL}, | |
| url={https://arxiv.org/abs/2510.06961}, | |
| } | |
| """ | |
| METRICS_TAB_TEXT = """ | |
| Here you will find details about the speech recognition metrics and datasets reported in our leaderboard. | |
| ## Metrics | |
| Models are evaluated jointly using the Word Error Rate (WER) and Inverse Real Time Factor (RTFx) metrics. The WER metric | |
| is used to assess the accuracy of a system, and the RTFx the inference speed. Models are ranked in the leaderboard based | |
| on their WER, lowest to highest. | |
| Crucially, the WER and RTFx values are computed for the same inference run using a single script. The implication of this is two-fold: | |
| 1. The WER and RTFx values are coupled: for a given WER, one can expect to achieve the corresponding RTFx. This allows the proposer to trade-off lower WER for higher RTFx should they wish. | |
| 2. The WER and RTFx values are averaged over all audios in the benchmark (in the order of thousands of audios). | |
| For details on reproducing the benchmark numbers, refer to the [Open ASR GitHub repository](https://github.com/huggingface/open_asr_leaderboard#evaluate-a-model). | |
### Word Error Rate (WER)

Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It calculates the percentage
of words in the system's output that differ from the reference (correct) transcript. **A lower WER value indicates higher accuracy**.

Take the following example:

| Reference:  | the | cat | sat     | on  | the | mat |
|-------------|-----|-----|---------|-----|-----|-----|
| Prediction: | the | cat | **sit** | on  | the |     |
| Label:      | ✅  | ✅  | S       | ✅  | ✅  | D   |

Here, we have:
* 1 substitution ("sit" instead of "sat")
* 0 insertions
* 1 deletion ("mat" is missing)

This gives 2 errors in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our
reference (N), which for this example is 6:
```
WER = (S + I + D) / N = (1 + 0 + 1) / 6 = 0.333
```
This gives a WER of 0.33, or 33%. For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing are removed from the references and predictions. You can find the evaluation code on our [GitHub repository](https://github.com/huggingface/open_asr_leaderboard). To read more about how the WER is computed, refer to the [Audio Transformers Course](https://huggingface.co/learn/audio-course/chapter5/evaluation).
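
For illustration only (this is not the leaderboard's evaluation script), the example above can be reproduced with the Hugging Face [`evaluate`](https://huggingface.co/docs/evaluate) library, assuming `evaluate` and its `jiwer` backend are installed:

```python
import evaluate

# Load the WER metric (backed by jiwer)
wer_metric = evaluate.load("wer")

references = ["the cat sat on the mat"]
predictions = ["the cat sit on the"]

# 1 substitution + 1 deletion over 6 reference words -> 2 / 6 ≈ 0.333
wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER = {wer:.3f}")  # WER = 0.333
```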

### Inverse Real Time Factor (RTFx)

Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
model to process a given amount of speech. It is defined as:
```
RTFx = (number of seconds of audio inferred) / (compute time in seconds)
```
Therefore, an RTFx of 1 means a system processes speech as fast as it's spoken, while an RTFx of 2 means it takes half the time.
Thus, **a higher RTFx value indicates lower latency**.
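
As a minimal sketch of this definition (the durations and timings below are made up, not measured results):

```python
def rtfx(audio_seconds: float, compute_seconds: float) -> float:
    # Inverse real time factor: seconds of audio transcribed per second of compute
    return audio_seconds / compute_seconds

# e.g. 600 seconds of audio transcribed in 12 seconds of wall-clock time
print(rtfx(600.0, 12.0))  # 50.0
```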

## How to reproduce our results

The ASR Leaderboard will be a continued effort to benchmark open source/access speech recognition models where possible.
Along with the Leaderboard we're open-sourcing the codebase used for running these evaluations.
For more details head over to our repo at: https://github.com/huggingface/open_asr_leaderboard

P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! ♥️

## Benchmark datasets

Evaluating Speech Recognition systems is a hard problem. We use the multi-dataset benchmarking strategy proposed in the
[ESB paper](https://arxiv.org/abs/2210.13352) to obtain robust evaluation scores for each model.

ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad
set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains,
acoustic conditions, speaker styles, and transcription requirements. As such, it gives a better indication of how
a model is likely to perform on downstream ASR compared to evaluating it on one dataset alone.

The ESB score is calculated as a macro-average of the WER scores across the ESB datasets (a small worked example follows the table below). The models in the leaderboard
are ranked based on their average WER scores, from lowest to highest.

| Dataset                                                              | Domain                      | Speaking Style        | Train (h) | Dev (h) | Test (h) | Transcriptions     | License         |
|----------------------------------------------------------------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)       | Audiobook                   | Narrated              | 960       | 11      | 11       | Normalised         | CC-BY-4.0       |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli)      | European Parliament         | Oratory               | 523       | 5       | 5        | Punctuated         | CC0             |
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium)             | TED talks                   | Oratory               | 454       | 2       | 3        | Normalised         | CC-BY-NC-ND 3.0 |
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500      | 12      | 40       | Punctuated         | apache-2.0      |
| [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech)      | Financial meetings          | Oratory, spontaneous  | 4900      | 100     | 100      | Punctuated & Cased | User Agreement  |
| [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22)  | Financial meetings          | Oratory, spontaneous  | 105       | 5       | 5        | Punctuated & Cased | CC-BY-SA-4.0    |
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami)             | Meetings                    | Spontaneous           | 78        | 9       | 9        | Punctuated & Cased | CC-BY-4.0       |

For more details on the individual datasets and how models are evaluated to give the ESB score, refer to the [ESB paper](https://arxiv.org/abs/2210.13352).
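
For illustration only, the macro-average mentioned above is the unweighted mean of the per-dataset WERs; the numbers below are placeholders, not leaderboard results:

```python
# Placeholder per-dataset WERs (%), not real leaderboard numbers
per_dataset_wer = {
    "librispeech": 2.5,
    "voxpopuli": 7.1,
    "tedlium": 4.0,
    "gigaspeech": 10.2,
    "spgispeech": 3.3,
    "earnings22": 12.8,
    "ami": 15.4,
}

# Macro-average: every dataset contributes equally, regardless of its size
average_wer = sum(per_dataset_wer.values()) / len(per_dataset_wer)
print(f"Average WER = {average_wer:.2f}")  # Average WER = 7.90
```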
| """ | |
# Multilingual benchmark definitions
EU_LANGUAGES = {
    "de": {"name": "German", "flag": "🇩🇪", "datasets": ["mls", "fleurs", "covost"]},
    "fr": {"name": "French", "flag": "🇫🇷", "datasets": ["mls", "fleurs", "covost"]},
    "it": {"name": "Italian", "flag": "🇮🇹", "datasets": ["mls", "fleurs", "covost"]},
    "es": {"name": "Spanish", "flag": "🇪🇸", "datasets": ["mls", "fleurs", "covost"]},
    "pt": {"name": "Portuguese", "flag": "🇵🇹", "datasets": ["mls", "fleurs", "covost"]},
}
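
# Illustrative sketch only (not referenced elsewhere in this file): one way the mapping
# above could be consumed, e.g. to build the display labels shown in the multilingual table.
EU_LANGUAGE_LABELS = {code: f"{info['flag']} {info['name']}" for code, info in EU_LANGUAGES.items()}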
| MULTILINGUAL_TAB_TEXT = """ | |
| ## ๐ Multilingual ASR Evaluation | |
| """ | |
| LONGFORM_TAB_TEXT = """ | |
| ## ๐ Long-form ASR Evaluation | |
| """ | |
| LEADERBOARD_CSS = """ | |
| #leaderboard-table th .header-content { | |
| white-space: nowrap; | |
| } | |
| #multilingual-table th .header-content { | |
| white-space: nowrap; | |
| } | |
| #multilingual-table th:hover { | |
| background-color: var(--table-row-focus); | |
| } | |
| #longform-table th .header-content { | |
| white-space: nowrap; | |
| } | |
| #longform-table th:hover { | |
| background-color: var(--table-row-focus); | |
| } | |
| .language-detail-modal { | |
| background: var(--background-fill-primary); | |
| border: 1px solid var(--border-color-primary); | |
| border-radius: 8px; | |
| padding: 1rem; | |
| margin: 1rem 0; | |
| } | |
| """ | |