ExecuTorch Conformer
Conformer is a popular Transformer-based speech recognition network, suitable for low-cost embedded devices. This repository contains example FP32 trained weights and the associated tokenizer for an implementation of Conformer. We also include a quantized exported program produced with ExecuTorch, quantized for the ExecuTorch Ethos-U backend, allowing easy deployment on SoCs with an Arm® Ethos™-U NPU.
Model Details
Model Description
Conformer is a popular neural network for speech recognition. This repository contains trained weights for the Conformer implementation at https://github.com/sooftware/conformer/
- Developed by: Arm
- Model type: Transformer
- Language(s) (NLP): English
- License: BigScience OpenRAIL-M v1.1
The model contains 10M parameters. For an SoC with Cortex-M and Ethos-U85 in Shared_Sram memory mode, the memory usage is 5.7MB of SRAM to store the peak intermediate tensor and 10.8MB of read-only data living in external memory for the weights and biases.
Model Sources
- Repository: https://github.com/sooftware/conformer/
- Paper: https://arxiv.org/abs/2005.08100
Uses
You need to install ExecuTorch 1.0 with `pip install executorch`.
With the quantized exported graph module downloaded, you can directly call the to_edge_transform_and_lower API of ExecuTorch. This API converts the quantized exported program into a backend-specific command stream for the Ethos-U, and the end result is a .pte file for your variant of the Ethos-U.
Below is an example script that produces a .pte file for the Ethos-U85 256-MAC configuration in Shared_Sram memory mode.
```python
import torch

from executorch.backends.arm.ethosu import EthosUPartitioner, EthosUCompileSpec

# These quantizer imports are only needed if you quantize the model yourself;
# the downloaded exported program is already quantized.
from executorch.backends.arm.quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)
from executorch.exir import (
    EdgeCompileConfig,
    ExecutorchBackendConfig,
    to_edge_transform_and_lower,
)
from executorch.extension.export_util.utils import save_pte_program


def main():
    # Load the quantized exported program downloaded from this repository
    quant_exported_program = torch.export.load(
        "Conformer_ArmQuantizer_quant_exported_program.pt2"
    )

    # Compile specification for the Ethos-U85 with 256 MACs in Shared_Sram memory mode
    compile_spec = EthosUCompileSpec(
        target="ethos-u85-256",
        system_config="Ethos_U85_SYS_Flash_High",
        memory_mode="Shared_Sram",
        extra_flags=["--output-format=raw", "--debug-force-regor"],
    )
    partitioner = EthosUPartitioner(compile_spec)

    print(
        "Calling to_edge_transform_and_lower - lowering to TOSA and compiling for the Ethos-U hardware"
    )
    # Lower the exported program to the Ethos-U backend
    edge_program_manager = to_edge_transform_and_lower(
        quant_exported_program,
        partitioner=[partitioner],
        compile_config=EdgeCompileConfig(
            _check_ir_validity=False,
        ),
    )
    executorch_program_manager = edge_program_manager.to_executorch(
        config=ExecutorchBackendConfig(extract_delegate_segments=False)
    )
    save_pte_program(executorch_program_manager, "conformer_quantized.pte")


if __name__ == "__main__":
    main()
```
How to Get Started with the Model
You can directly download the quantized exported program for the Ethos-U backend (Conformer_ArmQuantizer_quant_exported_program.pt2) and call the to_edge_transform_and_lower ExecuTorch API.
This means you don't need to train the model from scratch, and you don't need to load and pre-process a representative dataset for calibration. You can focus on developing the application running on device.
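If the program is hosted on the Hugging Face Hub alongside this model card, a minimal sketch of fetching and loading it looks like the following; the repo_id below is a placeholder, so substitute this repository's actual id.

```python
# Hedged sketch: fetch the quantized exported program and load it with torch.export.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Arm/conformer",  # placeholder - use this repository's actual id
    filename="Conformer_ArmQuantizer_quant_exported_program.pt2",
)
quant_exported_program = torch.export.load(path)
print(quant_exported_program.graph_signature)  # inspect the program's inputs and outputs
```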
For an example end-to-end speech-to-text application running on an embedded platform, have a look at https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit/-/blob/experimental/executorch/docs/use_cases/asr.md
Training Details
Training Data
We used the LibriSpeech 960h dataset, composed of 460h of clean audio and 500h of noisier audio. We obtain the dataset through the PyTorch torchaudio library, as shown in the sketch below.
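As a sketch, the three training splits that together make up the 960h (460h clean, 500h other) can be fetched and combined through torchaudio; the download root is an assumption.

```python
# Hedged sketch: obtain the LibriSpeech 960h training set via torchaudio.
from torch.utils.data import ConcatDataset
from torchaudio.datasets import LIBRISPEECH

root = "./data"  # hypothetical download location
splits = ["train-clean-100", "train-clean-360", "train-other-500"]
train_set = ConcatDataset([LIBRISPEECH(root, url=s, download=True) for s in splits])

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
waveform, sample_rate, transcript, *_ = train_set[0]
print(sample_rate, transcript)
```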
Training Procedure
If you want to train the Conformer model from scratch, you can do so by following the instructions at https://github.com/Arm-Examples/ML-examples/tree/main/pytorch-conformer-train-quantize/training. We used an AWS g5.24xlarge instance to train the network.
Preprocessing
We first train a tokenizer on the LibriSpeech dataset. The tokenizer converts labels into tokens. For example, in English it is very common to have 's at the end of words; the tokenizer will identify that pattern and assign a dedicated token to the 's combination. The code to obtain the tokenizer is available at https://github.com/Arm-Examples/ML-examples/blob/main/pytorch-conformer-train-quantize/training/build_sp_128_librispeech.py. The trained tokenizer is also available in the Hugging Face repository.
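The build_sp_128_librispeech.py name suggests a SentencePiece model with a 128-token vocabulary. A minimal sketch of using such a tokenizer follows; the model filename here is hypothetical, so check the repository for the actual artifact name.

```python
# Hedged sketch: encode and decode text with a trained SentencePiece tokenizer.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="librispeech_sp_128.model")  # hypothetical filename
pieces = sp.encode("the cat's whiskers", out_type=str)
print(pieces)  # subword pieces; a frequent pattern like 's may map to a single token
ids = sp.encode("the cat's whiskers")  # integer token ids fed to the model
print(sp.decode(ids))  # round-trips back to the original text
```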
We also apply a MelSpectrogram transform to the input audio as per the Conformer paper. The LibriSpeech dataset contains audio recordings sampled at 16kHz. The Conformer paper recommends a 25ms window length, corresponding to 400 samples (16000 × 0.025 = 400), and a stride of 10ms, corresponding to 160 samples (16000 × 0.01 = 160). We use 80 filter banks as recommended by the paper and a 512-point FFT.
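These parameters map directly onto torchaudio's MelSpectrogram transform; a minimal sketch on dummy audio:

```python
# Feature extraction matching the parameters above: 16kHz audio, 25ms window
# (400 samples), 10ms stride (160 samples), 512-point FFT, 80 mel filter banks.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=512,
    win_length=400,  # 16000 * 0.025
    hop_length=160,  # 16000 * 0.01
    n_mels=80,
)
waveform = torch.randn(1, 16000)  # one second of dummy audio
features = mel(waveform)
print(features.shape)  # torch.Size([1, 80, 101]): 80 mel bins over ~100 frames
```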
Training Hyperparameters
- Training regime: The model is trained in FP32
- Epochs: 160
- Batch size: 96
- Learning rate: 0.0005
- Weight decay: 1e-6
- Warmup epochs: 2.0
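As a sketch only, these hyperparameters could be wired into a PyTorch optimizer with a linear warmup as below. The choice of Adam and of a linear warmup schedule is an assumption here; the training scripts linked above are authoritative.

```python
# Hedged sketch: optimizer and warmup schedule using the hyperparameters above.
import torch

model = torch.nn.Linear(80, 128)  # stand-in for the Conformer model
steps_per_epoch = 1000  # hypothetical; depends on dataset size and batch size

optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=1e-3,  # ramp up from near zero
    total_iters=int(2.0 * steps_per_epoch),  # 2.0 warmup epochs
)
```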
Evaluation
Testing Data, Factors & Metrics
Testing Data
We test the model on the LibriSpeech test-clean dataset and obtain a 7% Word Error Rate (WER). The accuracy of the model may be improved by training with additional datasets and through data augmentation techniques such as time slicing.
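For reference, Word Error Rate is the word-level edit distance between the reference and the decoded transcript, divided by the number of reference words; a minimal sketch using torchaudio:

```python
# Hedged sketch: compute WER for a single toy utterance.
import torchaudio.functional as F

reference = "he began a confused complaint".split()
hypothesis = "he begin a confused complaint".split()  # one substitution error
wer = F.edit_distance(reference, hypothesis) / len(reference)
print(f"WER: {wer:.1%}")  # 20.0% for this toy pair
```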