Update app.py
app.py CHANGED
@@ -16,20 +16,49 @@ leaderboard_data = [
 # Text for the metrics tab
 METRICS_TAB_TEXT = """
 ## Metrics
-Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.
-### UTMOS
-The **UTMOS** (Utterance Mean Opinion Score) metric evaluates the **quality** of speech synthesis models. A higher UTMOS score indicates better audio quality.
-
-
-
-
-
-The **Short-Time Objective Intelligibility (STOI)** is a metric used to evaluate the **intelligibility** of synthesized speech. Higher STOI values indicate clearer, more intelligible speech.
-
-###
-
+
+Models in the leaderboard are evaluated using several key metrics:
+* **UTMOS** (UTokyo-SaruLab Mean Opinion Score),
+* **WER** (Word Error Rate),
+* **STOI** (Short-Time Objective Intelligibility),
+* **PESQ** (Perceptual Evaluation of Speech Quality).
+
+Together, these metrics capture both the accuracy of a model (WER) and the perceptual quality and intelligibility of the speech it generates (UTMOS, STOI, PESQ).
+
+### UTMOS (UTokyo-SaruLab Mean Opinion Score)
+UTMOS is an automatic predictor of the mean opinion score (MOS) that human listeners would assign to speech generated by a TTS system. **A higher UTMOS indicates better quality** of the generated voice.
+
+### WER (Word Error Rate)
+WER is a common metric for evaluating speech recognition systems. It counts the substitutions (S), insertions (I), and deletions (D) needed to turn the generated transcript into the reference (correct) transcript, relative to the number of words N in the reference. **A lower WER indicates higher accuracy**.
+
+Example:
+| Reference  | the | cat | sat     | on | the | mat |
+|------------|-----|-----|---------|----|-----|-----|
+| Prediction | the | cat | **sit** | on | the |     |
+| Label      | ✅  | ✅  | S       | ✅ | ✅  | D   |
+
+The WER is then calculated as follows:
+
+
+```
+WER = (S + I + D) / N = (1 + 0 + 1) / 6 = 0.333
+```
+
+### STOI (Short-Time Objective Intelligibility)
+STOI measures the intelligibility of the synthesized speech signal compared to the original signal. **A higher STOI indicates better intelligibility**.
+
+### PESQ (Perceptual Evaluation of Speech Quality)
+PESQ is a perceptual metric that scores speech quality in a way that approximates human listener judgments. **A higher PESQ indicates better voice quality**.
+
+## How to Reproduce Our Results
+The leaderboard is an ongoing effort to benchmark open-source TTS models on the metrics described above. To reproduce our results, see our [GitHub repository](https://github.com/huggingface/open_asr_leaderboard).
+
+## Benchmark Datasets
+Model performance is evaluated on our test datasets, which cover a variety of domains and acoustic conditions to ensure a robust evaluation.
 """
 
+
+
 ####################################
 # Functions (static version)
 ####################################
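
The WER arithmetic in the new metrics text is easy to sanity-check. Below is a minimal sketch, not part of app.py (the function name is illustrative): a word-level edit distance that reproduces the `(1 + 0 + 1) / 6 = 0.333` worked example.

```python
# Illustrative only: WER = (S + I + D) / N via word-level edit distance.
def word_error_rate(reference: str, prediction: str) -> float:
    ref, hyp = reference.split(), prediction.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining predicted words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]           # match, no edit
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    return dp[-1][-1] / len(ref)

# The table's example: one substitution (sit/sat), one deletion (mat).
print(word_error_rate("the cat sat on the mat", "the cat sit on the"))  # 0.333...
```

In practice a library such as `jiwer` computes the same quantity and also handles text normalization.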
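STOI and PESQ can be computed similarly from a reference/synthesized signal pair. The diff does not say which tooling the Space uses; the `pystoi` and `pesq` PyPI packages below are assumptions chosen for illustration, not the leaderboard's actual pipeline.

```python
# Assumed tooling, for illustration only: pip install numpy pystoi pesq
# (the leaderboard's real evaluation code may use different libraries).
import numpy as np
from pystoi import stoi  # Short-Time Objective Intelligibility
from pesq import pesq    # Perceptual Evaluation of Speech Quality

def speech_scores(reference: np.ndarray, synthesized: np.ndarray, fs: int = 16000):
    """Score a synthesized utterance against a time-aligned reference.

    Both inputs are mono waveforms at sample rate fs; wideband PESQ
    requires fs == 16000. Higher is better for both metrics.
    """
    stoi_score = stoi(reference, synthesized, fs, extended=False)  # roughly 0..1
    pesq_score = pesq(fs, reference, synthesized, "wb")            # -0.5..4.5
    return stoi_score, pesq_score
```

Note that both metrics are intrusive: they require the reference recording, so they only apply where a ground-truth waveform exists for the synthesized text.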