Paradox of the README

#11 opened by evewashere

In the README it says:

LFM2-Audio is an end-to-end multimodal speech and text language model, and as such does not require separate ASR and TTS components.

But then below that it says:
Audio encoder FastConformer (115M, canary-180m-flash)

Which is an automatic speech recognition model.

That seems contradictory. Either it uses an ASR component or it doesn't. Which one is it, Liquid?

Liquid AI org

We only use the audio encoder part of canary-180m-flash, which by itself does not constitute an ASR model (the encoder alone cannot produce text; it only converts audio into latent vectors).

When we say we don't use an ASR component, we mean that we don't pass the audio through an ASR model to transcribe the speech into text before feeding it to the LFM2 backbone as text. Instead, we encode the audio into latent vectors, and feed those directly (as "audio vectors", not text) into the LFM2 backbone.
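Schematically, the distinction looks something like this. The sketch below uses toy PyTorch stand-ins (`FastConformerEncoder` and `LFM2Backbone` here are hypothetical placeholders, not our actual classes or shapes), but it shows where the ASR step would sit in a cascaded pipeline and why there is none in ours:

```python
import torch
import torch.nn as nn

class FastConformerEncoder(nn.Module):
    """Toy stand-in for the canary-180m-flash audio encoder:
    maps audio features to latent vectors, never to text."""
    def __init__(self, n_mels=80, d_model=512):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)

    def forward(self, mel_frames):       # (batch, time, n_mels)
        return self.proj(mel_frames)     # (batch, time, d_model) latents

class LFM2Backbone(nn.Module):
    """Toy stand-in for the language backbone: consumes a sequence of
    embeddings, regardless of whether they came from text or audio."""
    def __init__(self, d_model=512):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, embeddings):       # (batch, seq, d_model)
        return self.layer(embeddings)

encoder, backbone = FastConformerEncoder(), LFM2Backbone()
audio = torch.randn(1, 200, 80)          # 200 mel frames of audio

# Cascaded ASR pipeline (what LFM2-Audio does NOT do):
#   text = asr_model.transcribe(audio)   # audio -> text string
#   out  = backbone(embed_text(text))    # text  -> backbone

# End-to-end approach (what we describe above):
audio_vectors = encoder(audio)           # audio -> latent vectors
out = backbone(audio_vectors)            # latents fed directly to backbone
print(out.shape)                         # torch.Size([1, 200, 512])
```

The backbone never sees a transcript; it only ever sees sequences of vectors.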

I suppose one could say that the encoder is a "component identical to one from an ASR model", but similar Conformers are also used in a multitude of non-ASR audio models as general-purpose audio encoders. We also tune the weights of the encoder during training, so while the architecture of the encoder is the same as canary-180m-flash's, the weights no longer are.
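To make that last point concrete, continuing the toy sketch above: tuning the encoder jointly with the backbone just means its parameters receive gradients too (again a hypothetical illustration, not our actual training code):

```python
import torch

# Encoder starts from the canary-180m-flash checkpoint but is trained
# jointly with the backbone, so its weights drift from the original.
for p in encoder.parameters():
    p.requires_grad = True  # encoder is tuned, not frozen

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(backbone.parameters()), lr=1e-4
)
```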

haerski changed discussion status to closed
