ExecuTorch Conformer
Conformer is a popular Transformer-based speech recognition network, suitable for low-cost embedded devices. This repository contains example FP32 trained weights and the associated tokenizer for an implementation of Conformer. We also include a quantized exported program produced with ExecuTorch, quantized for the ExecuTorch Ethos-U backend, allowing easy deployment on SoCs with an Arm® Ethos™-U NPU.
Model Details
Model Description
Conformer is a popular neural network for speech recognition. This repository contains trained weights for the Conformer implementation at https://github.com/sooftware/conformer/
- Developed by: Arm
- Model type: Transformer
- Language(s) (NLP): English
- License: BigScience OpenRAIL-M v1.1
The model contains 10M parameters. For an SoC with Cortex-M and Ethos-U85 in Shared_Sram memory mode, the memory usage is 5.7MB of SRAM to store the peak intermediate tensor and 10.8MB of read-only data living in external memory for the weights and biases.
Model Sources
- Repository: https://github.com/sooftware/conformer/
- Paper: https://arxiv.org/abs/2005.08100
Uses
You need to install ExecuTorch 1.0 with `pip install executorch`.
With the quantized exported graph module downloaded, you can directly call the to_edge_transform_and_lower API of ExecuTorch. This API converts the quantized exported program into a backend-specific command stream for the Ethos-U, and the end result is a .pte file for your variant of the Ethos-U.
Below is an example script that produces a .pte file for the Ethos-U85 256-MAC configuration in Shared_Sram memory mode.
```python
import torch

from executorch.backends.arm.ethosu import EthosUPartitioner, EthosUCompileSpec

# These quantizer imports are only needed if you quantize the model yourself;
# the downloaded exported program is already quantized.
from executorch.backends.arm.quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)
from executorch.exir import (
    EdgeCompileConfig,
    ExecutorchBackendConfig,
    to_edge_transform_and_lower,
)
from executorch.extension.export_util.utils import save_pte_program


def main():
    # Load the quantized exported program downloaded from this repository
    quant_exported_program = torch.export.load(
        "Conformer_ArmQuantizer_quant_exported_program.pt2"
    )

    # Compile specification for the Ethos-U85 with 256 MACs in Shared_Sram memory mode
    compile_spec = EthosUCompileSpec(
        target="ethos-u85-256",
        system_config="Ethos_U85_SYS_Flash_High",
        memory_mode="Shared_Sram",
        extra_flags=["--output-format=raw", "--debug-force-regor"],
    )
    partitioner = EthosUPartitioner(compile_spec)

    print(
        "Calling to_edge_transform_and_lower - lowering to TOSA and compiling for the Ethos-U hardware"
    )
    # Lower the exported program to the Ethos-U backend
    edge_program_manager = to_edge_transform_and_lower(
        quant_exported_program,
        partitioner=[partitioner],
        compile_config=EdgeCompileConfig(
            _check_ir_validity=False,
        ),
    )
    executorch_program_manager = edge_program_manager.to_executorch(
        config=ExecutorchBackendConfig(extract_delegate_segments=False)
    )
    save_pte_program(executorch_program_manager, "conformer_quantized.pte")


if __name__ == "__main__":
    main()
```
How to Get Started with the Model
You can directly download the quantized exported program for the Ethos-U backend (Conformer_ArmQuantizer_quant_exported_program.pt2) and call the to_edge_transform_and_lower ExecuTorch API.
This means you don't need to train the model from scratch, and you don't need to load and pre-process a representative dataset for calibration. You can focus on developing the application running on device.
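If the program is hosted on the Hugging Face Hub alongside this model card, a minimal sketch of fetching and loading it looks like the following; the repo_id below is a placeholder, so substitute this repository's actual id.

```python
# Hedged sketch: fetch the quantized exported program and load it with torch.export.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Arm/conformer",  # placeholder - use this repository's actual id
    filename="Conformer_ArmQuantizer_quant_exported_program.pt2",
)
quant_exported_program = torch.export.load(path)
print(quant_exported_program.graph_signature)  # inspect the program's inputs and outputs
```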
For an example end-to-end speech-to-text application running on an embedded platform, have a look at https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit/-/blob/experimental/executorch/docs/use_cases/asr.md
Training Details
Training Data
We used the LibriSpeech 960h dataset, composed of 460h of clean audio and 500h of noisier audio. We obtain the dataset through the PyTorch torchaudio library, as shown in the sketch below.
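As a sketch, the three training splits that together make up the 960h (460h clean, 500h other) can be fetched and combined through torchaudio; the download root is an assumption.

```python
# Hedged sketch: obtain the LibriSpeech 960h training set via torchaudio.
from torch.utils.data import ConcatDataset
from torchaudio.datasets import LIBRISPEECH

root = "./data"  # hypothetical download location
splits = ["train-clean-100", "train-clean-360", "train-other-500"]
train_set = ConcatDataset([LIBRISPEECH(root, url=s, download=True) for s in splits])

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
waveform, sample_rate, transcript, *_ = train_set[0]
print(sample_rate, transcript)
```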
Training Procedure
If you want to train the Conformer model from scratch, you can do so by following the instructions at https://github.com/Arm-Examples/ML-examples/tree/main/pytorch-conformer-train-quantize/training. We used an AWS g5.24xlarge instance to train the network.
Preprocessing
We first train a tokenizer on the LibriSpeech dataset. The tokenizer converts labels into tokens. For example, in English it is very common to have 's at the end of words; the tokenizer will identify that pattern and assign a dedicated token to the 's combination. The code to obtain the tokenizer is available at https://github.com/Arm-Examples/ML-examples/blob/main/pytorch-conformer-train-quantize/training/build_sp_128_librispeech.py. The trained tokenizer is also available in the Hugging Face repository.
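The build_sp_128_librispeech.py name suggests a SentencePiece model with a 128-token vocabulary. A minimal sketch of using such a tokenizer follows; the model filename here is hypothetical, so check the repository for the actual artifact name.

```python
# Hedged sketch: encode and decode text with a trained SentencePiece tokenizer.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="librispeech_sp_128.model")  # hypothetical filename
pieces = sp.encode("the cat's whiskers", out_type=str)
print(pieces)  # subword pieces; a frequent pattern like 's may map to a single token
ids = sp.encode("the cat's whiskers")  # integer token ids fed to the model
print(sp.decode(ids))  # round-trips back to the original text
```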
We also apply a MelSpectrogram transform to the input audio as per the Conformer paper. The LibriSpeech dataset contains audio recordings sampled at 16kHz. The Conformer paper recommends a 25ms window length, corresponding to 400 samples (16000 × 0.025 = 400), and a stride of 10ms, corresponding to 160 samples (16000 × 0.01 = 160). We use 80 filter banks as recommended by the paper and a 512-point FFT.
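These parameters map directly onto torchaudio's MelSpectrogram transform; a minimal sketch on dummy audio:

```python
# Feature extraction matching the parameters above: 16kHz audio, 25ms window
# (400 samples), 10ms stride (160 samples), 512-point FFT, 80 mel filter banks.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=512,
    win_length=400,  # 16000 * 0.025
    hop_length=160,  # 16000 * 0.01
    n_mels=80,
)
waveform = torch.randn(1, 16000)  # one second of dummy audio
features = mel(waveform)
print(features.shape)  # torch.Size([1, 80, 101]): 80 mel bins over ~100 frames
```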
Training Hyperparameters
- Training regime: The model is trained in FP32
- Epochs: 160
- Batch size: 96
- Learning rate: 0.0005
- Weight decay: 1e-6
- Warmup epochs: 2.0
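As a sketch only, these hyperparameters could be wired into a PyTorch optimizer with a linear warmup as below. The choice of Adam and of a linear warmup schedule is an assumption here; the training scripts linked above are authoritative.

```python
# Hedged sketch: optimizer and warmup schedule using the hyperparameters above.
import torch

model = torch.nn.Linear(80, 128)  # stand-in for the Conformer model
steps_per_epoch = 1000  # hypothetical; depends on dataset size and batch size

optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=1e-3,  # ramp up from near zero
    total_iters=int(2.0 * steps_per_epoch),  # 2.0 warmup epochs
)
```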
Evaluation
Testing Data, Factors & Metrics
Testing Data
We test the model on the LibriSpeech test-clean dataset and obtain a 7% Word Error Rate (WER). The accuracy of the model may be improved by training with additional datasets and through data augmentation techniques such as time slicing.
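For reference, Word Error Rate is the word-level edit distance between the reference and the decoded transcript, divided by the number of reference words; a minimal sketch using torchaudio:

```python
# Hedged sketch: compute WER for a single toy utterance.
import torchaudio.functional as F

reference = "he began a confused complaint".split()
hypothesis = "he begin a confused complaint".split()  # one substitution error
wer = F.edit_distance(reference, hypothesis) / len(reference)
print(f"WER: {wer:.1%}")  # 20.0% for this toy pair
```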