Paradox of the README

#11 opened by evewashere

In the README it says:

LFM2-Audio is an end-to-end multimodal speech and text language model, and as such does not require separate ASR and TTS components.

But then below that it says:
Audio encoder FastConformer (115M, canary-180m-flash)

Which is an automatic speech recognition model.

That seems contradictory. Either it uses an ASR component or it doesn't. Which one is it, Liquid?

Liquid AI org

We only use the audio encoder part of canary-180m-flash, which by itself does not constitute an ASR model (the encoder alone cannot produce text; it only converts audio into latent vectors).

When we say we don't use an ASR component, we mean that we don't pass the audio through an ASR model to transcribe the speech into text before feeding it to the LFM2 backbone as text. Instead, we encode the audio into latent vectors, and feed those directly (as "audio vectors", not text) into the LFM2 backbone.
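Schematically, the distinction looks something like this. The sketch below uses toy PyTorch stand-ins (`FastConformerEncoder` and `LFM2Backbone` here are hypothetical placeholders, not our actual classes or shapes), but it shows where the ASR step would sit in a cascaded pipeline and why there is none in ours:

```python
import torch
import torch.nn as nn

class FastConformerEncoder(nn.Module):
    """Toy stand-in for the canary-180m-flash audio encoder:
    maps audio features to latent vectors, never to text."""
    def __init__(self, n_mels=80, d_model=512):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)

    def forward(self, mel_frames):       # (batch, time, n_mels)
        return self.proj(mel_frames)     # (batch, time, d_model) latents

class LFM2Backbone(nn.Module):
    """Toy stand-in for the language backbone: consumes a sequence of
    embeddings, regardless of whether they came from text or audio."""
    def __init__(self, d_model=512):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, embeddings):       # (batch, seq, d_model)
        return self.layer(embeddings)

encoder, backbone = FastConformerEncoder(), LFM2Backbone()
audio = torch.randn(1, 200, 80)          # 200 mel frames of audio

# Cascaded ASR pipeline (what LFM2-Audio does NOT do):
#   text = asr_model.transcribe(audio)   # audio -> text string
#   out  = backbone(embed_text(text))    # text  -> backbone

# End-to-end approach (what we describe above):
audio_vectors = encoder(audio)           # audio -> latent vectors
out = backbone(audio_vectors)            # latents fed directly to backbone
print(out.shape)                         # torch.Size([1, 200, 512])
```

The backbone never sees a transcript; it only ever sees sequences of vectors.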

I suppose one could say that the encoder is a "component identical to one from an ASR model", but similar Conformers are also used in a multitude of non-ASR audio models as general-purpose audio encoders. We also tune the weights of the encoder during training, so while the architecture of the encoder is the same as canary-180m-flash's, the weights no longer are.
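To make that last point concrete, continuing the toy sketch above: tuning the encoder jointly with the backbone just means its parameters receive gradients too (again a hypothetical illustration, not our actual training code):

```python
import torch

# Encoder starts from the canary-180m-flash checkpoint but is trained
# jointly with the backbone, so its weights drift from the original.
for p in encoder.parameters():
    p.requires_grad = True  # encoder is tuned, not frozen

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(backbone.parameters()), lr=1e-4
)
```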

haerski changed discussion status to closed
