---
license: other
datasets:
- openslr/librispeech_asr
language:
- en
metrics:
- wer
tags:
- transformers
- pytorch
- speech-to-text
- conformer
- embedded
- edgeAI
- ExecuTorch
- audioprocessing
- transformer
- Arm
- MCU
---
# ExecuTorch Conformer
<!-- Provide a quick summary of what the model is/does. -->
Conformer is a popular Transformer-based speech recognition network, suitable for low-cost embedded devices. This repository contains example FP32 trained weights and the associated tokenizer for an implementation of Conformer. We also include a quantized exported program produced with ExecuTorch, quantized for the ExecuTorch Ethos-U backend, allowing easy deployment on SoCs with an Arm® Ethos™-U NPU.
## Model Details
### Model Description
Conformer is a popular neural network for speech recognition. This repository contains trained weights for the Conformer implementation in https://github.com/sooftware/conformer/
- **Developed by:** Arm
- **Model type:** Transformer
- **Language(s) (NLP):** English
- **License:** BigScience OpenRAIL-M v1.1
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/sooftware/conformer/
- **Paper:** https://arxiv.org/abs/2005.08100
## Uses
You need to install ExecuTorch 1.0 with `pip install executorch`.
After downloading the quantized exported graph module, you can directly call ExecuTorch's `to_edge_transform_and_lower` API.
The `to_edge_transform_and_lower` API converts the quantized exported program into a backend-specific command stream for the Ethos-U.
The end result is a `.pte` file for your variant of the Ethos-U.
Below is an example script that produces a `.pte` file for the Ethos-U85 256 MAC configuration in the `Shared_Sram` memory mode.
```python
import torch

from executorch.backends.arm.ethosu import EthosUCompileSpec, EthosUPartitioner
from executorch.exir import (
    EdgeCompileConfig,
    ExecutorchBackendConfig,
    to_edge_transform_and_lower,
)
from executorch.extension.export_util.utils import save_pte_program


def main():
    # Load the quantized exported program downloaded from this repository
    quant_exported_program = torch.export.load(
        "Conformer_ArmQuantizer_quant_exported_program.pt2"
    )

    # Compile specification for the Ethos-U85 with 256 MACs in Shared_Sram mode
    compile_spec = EthosUCompileSpec(
        target="ethos-u85-256",
        system_config="Ethos_U85_SYS_Flash_High",
        memory_mode="Shared_Sram",
        extra_flags=["--output-format=raw", "--debug-force-regor"],
    )
    partitioner = EthosUPartitioner(compile_spec)

    print(
        "Calling to_edge_transform_and_lower - lowering to TOSA and compiling for the Ethos-U hardware"
    )
    # Lower the exported program to the Ethos-U backend
    edge_program_manager = to_edge_transform_and_lower(
        quant_exported_program,
        partitioner=[partitioner],
        compile_config=EdgeCompileConfig(
            _check_ir_validity=False,
        ),
    )
    executorch_program_manager = edge_program_manager.to_executorch(
        config=ExecutorchBackendConfig(extract_delegate_segments=False)
    )

    # Serialize the lowered program to a .pte file
    save_pte_program(executorch_program_manager, "conformer_quantized.pte")


if __name__ == "__main__":
    main()
```
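If you save the script above as, say, `export_conformer.py` (a file name we pick here for illustration), running `python export_conformer.py` in the directory containing the downloaded `.pt2` file should produce `conformer_quantized.pte`, ready for deployment to an Ethos-U target.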
## How to Get Started with the Model
You can directly download the quantized exported program for the Ethos-U backend (`Conformer_ArmQuantizer_quant_exported_program.pt2`) and call the `to_edge_transform_and_lower` ExecuTorch API.
This means you don't need to train the model from scratch, and you don't need to load and pre-process a representative dataset for calibration. You can focus on developing the application running on device.
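As a minimal sketch, the exported program can also be fetched programmatically with the `huggingface_hub` client; the `repo_id` below is a placeholder, substitute this model's actual repository id:
```python
from huggingface_hub import hf_hub_download

# Placeholder repo_id; replace with this model's actual repository id.
pt2_path = hf_hub_download(
    repo_id="<org>/<model-repo>",
    filename="Conformer_ArmQuantizer_quant_exported_program.pt2",
)
print(pt2_path)  # local path you can pass to torch.export.load(...)
```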
For an example end-to-end speech-to-text application running on an embedded platform, have a look at https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit/-/blob/experimental/executorch/docs/use_cases/asr.md
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
We used the LibriSpeech 960h dataset, which is composed of 460h of clean audio and 500h of noisier audio. We obtain the dataset through the PyTorch torchaudio library.
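As a rough sketch of how the 960h set maps onto torchaudio (the exact loading code used for training may differ), the three training subsets can be concatenated:
```python
import torch
import torchaudio

# LibriSpeech 960h = 100h + 360h of clean audio plus 500h of noisier audio.
subsets = ["train-clean-100", "train-clean-360", "train-other-500"]
train_set = torch.utils.data.ConcatDataset(
    [torchaudio.datasets.LIBRISPEECH("./data", url=u, download=True) for u in subsets]
)
```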
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
If you want to train the Conformer model from scratch, you can do so by following the instructions in https://github.com/Arm-Examples/ML-examples/tree/main/pytorch-conformer-train-quantize/training
We used an AWS g5.24xlarge instance to train the network.
#### Preprocessing
We first train a tokenizer on the LibriSpeech dataset. The tokenizer converts labels into tokens. For example, in English it is very common for words to end in 's; the tokenizer will identify that pattern and assign a dedicated token to the 's combination.
The code to obtain the tokenizer is available at https://github.com/Arm-Examples/ML-examples/blob/main/pytorch-conformer-train-quantize/training/build_sp_128_librispeech.py. The trained tokenizer is also available in this Hugging Face repository.
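The training script name (`build_sp_128_librispeech.py`) suggests a 128-token SentencePiece vocabulary, so as a hedged sketch (the tokenizer file name below is hypothetical), the trained tokenizer could be used like this:
```python
import sentencepiece as spm

# Hypothetical file name for the trained tokenizer model.
sp = spm.SentencePieceProcessor(model_file="librispeech_sp_128.model")

# Sub-word pieces such as "'s" typically receive their own token.
print(sp.encode("the dog's bone", out_type=str))
```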
We also apply a mel spectrogram to the input audio, as per the Conformer paper. The LibriSpeech dataset contains audio recordings sampled at 16 kHz. The Conformer paper recommends a 25 ms window length, corresponding to 400 samples (16000 × 0.025 = 400), and a stride of 10 ms, corresponding to 160 samples (16000 × 0.01 = 160). We use 80 filter banks as recommended by the paper, and an FFT size of 512.
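As a sketch of how these parameters map onto torchaudio's `MelSpectrogram` transform (an assumption; the actual training pipeline is in the linked repository):
```python
import torchaudio

# Mel spectrogram front end matching the parameters described above.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=512,        # FFT size
    win_length=400,   # 25 ms window at 16 kHz
    hop_length=160,   # 10 ms stride at 16 kHz
    n_mels=80,        # 80 mel filter banks
)
```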
#### Training Hyperparameters
- **Training regime:** The model is trained in FP32
- **Epochs:** 160
- **Batch size:** 96
- **Learning rate:** 0.0005
- **Weight decay:** 1e-6
- **Warmup epochs:** 2.0
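The card lists the learning rate and weight decay but not the optimizer; as an illustration only, with Adam assumed and a stand-in module in place of the Conformer model:
```python
import torch
from torch import nn

model = nn.Linear(80, 128)  # stand-in for the Conformer model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-6)
```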
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
We test the model on the LibriSpeech `test-clean` dataset and obtain a 7% word error rate (WER). The accuracy of the model may be improved by training with additional datasets and by data augmentation techniques such as time slicing.
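As a minimal sketch of the metric (using the `jiwer` package as an example; the actual evaluation code lives in the linked training repository):
```python
from jiwer import wer

refs = ["the quick brown fox"]
hyps = ["the quick brown box"]
print(wer(refs, hyps))  # 0.25: one substitution out of four words
```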