---
language:
- en
tags:
- liquid
- lfm2
- audio
- lfm2-audio
- speech-to-speech
- liquid-audio
license: other
license_name: lfm1.0
license_link: LICENSE
library_name: liquid-audio
pipeline_tag: audio-to-audio
base_model:
- LiquidAI/LFM2-1.2B
---
# LFM2-Audio-1.5B

LFM2-Audio-1.5B is [Liquid AI](https://www.liquid.ai/)'s first end-to-end audio foundation model. Designed with low latency and real-time conversation in mind, LFM2-Audio enables seamless conversational interaction at only 1.5 billion parameters, achieving capabilities on par with much larger models.

LFM2-Audio is an end-to-end multimodal speech and text language model, and as such does not require separate ASR and TTS components. Our model consists of a pretrained LFM2 model as its multimodal backbone, along with a FastConformer-based audio encoder to handle continuous audio inputs, and an RQ-transformer that generates discrete Mimi tokens as audio output.

LFM2-Audio supports two distinct generation routines, each suited to a different set of tasks. Interleaved generation enables real-time speech-to-speech conversational chatbot capabilities, where audio generation latency is key. Sequential generation is suited for non-conversational tasks such as ASR or TTS, and allows the model to switch the generated modality on the fly.

## 📄 Model details

| Property | Value |
|---|---:|
| **Parameters (LM only)** | 1.2B |
| **Audio encoder** | FastConformer (115M, [canary-180m-flash](https://huggingface.co/nvidia/canary-180m-flash)) |
| **Backbone layers** | hybrid conv+attention |
| **Audio tokenizer** | [Mimi](https://huggingface.co/kyutai/mimi), using 8 codebooks |
| **Context** | 32,768 tokens |
| **Vocab size** | 65,536 (text) / 2049 × 8 (audio) |
| **Precision** | bfloat16 |
| **License** | LFM Open License v1.0 |

**Supported languages:** English

## 🏃 How to run LFM2-Audio

Install the `liquid-audio` package via `pip`:

```bash
pip install liquid-audio
pip install "liquid-audio[demo]"             # optional, to install demo dependencies
pip install flash-attn --no-build-isolation  # optional, to use FlashAttention 2; falls back to torch SDPA if not installed
```

## Gradio demo

The simplest way to get started is by running the Gradio demo interface. After installation, run the command

```
liquid-audio-demo
```

This starts a web server on port 7860. The interface can then be accessed at http://localhost:7860/.

## Multi-turn, multi-modal chat

The `liquid-audio` package provides a lower-level interface to the model and its generation routines, ideal for custom use cases. We demonstrate this with a simple multi-turn chat, where the first turn is given as audio and the second turn as text.

For multi-turn chat with text and audio output, we use interleaved generation. The system prompt should be set to `Respond with interleaved text and audio.`. Here we use audio as the first user turn, and text as the second one.

```python
import torch
import torchaudio

from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality

# Load models
HF_REPO = "LiquidAI/LFM2-Audio-1.5B"

processor = LFM2AudioProcessor.from_pretrained(HF_REPO).eval()
model = LFM2AudioModel.from_pretrained(HF_REPO).eval()

# Set up inputs for the model
chat = ChatState(processor)

chat.new_turn("system")
chat.add_text("Respond with interleaved text and audio.")
chat.end_turn()

chat.new_turn("user")
wav, sampling_rate = torchaudio.load("assets/question.wav")
chat.add_audio(wav, sampling_rate)
chat.end_turn()

chat.new_turn("assistant")

# Generate text and audio tokens.
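# Each tensor `t` yielded by `generate_interleaved` below is either a single
# text token (t.numel() == 1), decoded and printed as it streams in, or one
# audio frame of 8 Mimi codebook indices (see "Audio tokenizer" above),
# collected here and detokenized after the loop.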
text_out: list[torch.Tensor] = []
audio_out: list[torch.Tensor] = []
modality_out: list[LFMModality] = []

for t in model.generate_interleaved(**chat, max_new_tokens=512, audio_temperature=1.0, audio_top_k=4):
    if t.numel() == 1:
        print(processor.text.decode(t), end="", flush=True)
        text_out.append(t)
        modality_out.append(LFMModality.TEXT)
    else:
        audio_out.append(t)
        modality_out.append(LFMModality.AUDIO_OUT)
# output: Sure! How about "Handcrafted Woodworking, Precision Made for You"? Another option could be "Quality Woodworking, Quality Results." If you want something more personal, you might try "Your Woodworking Needs, Our Expertise."

# Detokenize audio, removing the last "end-of-audio" codes.
# Mimi returns audio at 24 kHz.
mimi_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
with torch.no_grad():
    waveform = processor.mimi.decode(mimi_codes)[0]
torchaudio.save("answer1.wav", waveform.cpu(), 24_000)

# Append newly generated tokens to the chat history
chat.append(
    text=torch.stack(text_out, 1),
    audio_out=torch.stack(audio_out, 1),
    modality_flag=torch.tensor(modality_out),
)
chat.end_turn()

# Start a new turn
chat.new_turn("user")
chat.add_text("My business specializes in chairs, can you give me something related to that?")
chat.end_turn()

chat.new_turn("assistant")

# Generate second-turn text and audio tokens.
audio_out: list[torch.Tensor] = []

for t in model.generate_interleaved(**chat, max_new_tokens=512, audio_temperature=1.0, audio_top_k=4):
    if t.numel() == 1:
        print(processor.text.decode(t), end="", flush=True)
    else:
        audio_out.append(t)
# output: Sure thing! How about “Comfortable Chairs, Crafted with Care” or “Elegant Seats, Handcrafted for You”? Let me know if you’d like a few more options.

# Detokenize second-turn audio, removing the last "end-of-audio" codes
mimi_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
with torch.no_grad():
    waveform = processor.mimi.decode(mimi_codes)[0]
torchaudio.save("answer2.wav", waveform.cpu(), 24_000)
```

### ASR, TTS, additional information

Please visit the `liquid-audio` [package repository](https://github.com/Liquid4All/liquid-audio) for additional examples and sample audio snippets.
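For non-conversational tasks, the sequential routine mentioned above can be driven through the same chat interface. The sketch below outlines an ASR-style call; the routine name `generate_sequential` and the transcription system prompt are assumptions for illustration rather than the package's confirmed API, so consult the repository linked above for the exact calls.

```python
import torchaudio

from liquid_audio import ChatState, LFM2AudioModel, LFM2AudioProcessor

HF_REPO = "LiquidAI/LFM2-Audio-1.5B"

processor = LFM2AudioProcessor.from_pretrained(HF_REPO).eval()
model = LFM2AudioModel.from_pretrained(HF_REPO).eval()

# Prompt the model for a transcription instead of a conversational reply.
# NOTE: this system prompt wording is an assumption, not a documented recipe.
chat = ChatState(processor)

chat.new_turn("system")
chat.add_text("Transcribe the following audio.")
chat.end_turn()

chat.new_turn("user")
wav, sampling_rate = torchaudio.load("assets/question.wav")
chat.add_audio(wav, sampling_rate)
chat.end_turn()

chat.new_turn("assistant")

# `generate_sequential` is a hypothetical name for the sequential routine
# described earlier; the real API may differ. Text tokens are printed as
# they stream in, mirroring the interleaved example above.
for t in model.generate_sequential(**chat, max_new_tokens=256):
    if t.numel() == 1:
        print(processor.text.decode(t), end="", flush=True)
```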
## 📈 Performance

### VoiceBench (audio input)

Higher is better. AlpacaEval, CommonEval and WildVoice are scored out of 5.

| Model | Components & Size | AlpacaEval | CommonEval | WildVoice | SD-QA | MMSU | OBQA | BBH | IFEval | ADVBench | Overall |
| --------------- | ----------------- | ---------- | ---------- | --------- | ----- | ----- | ----- | ----- | ------ | -------- | ------- |
| LFM2-Audio-1.5B | 1.5B parameters | 3.71 | 3.49 | 3.17 | 30.56 | 31.95 | 44.40 | 30.54 | 98.85 | 67.33 | 56.78 |
| Moshi | 7B parameters | 2.01 | 1.60 | 1.30 | 15.64 | 24.04 | 25.93 | 47.40 | 10.12 | 44.23 | 29.51 |
| Qwen2.5-Omni-3B | 5B parameters | 3.72 | 3.51 | 3.42 | 44.94 | 55.29 | 76.26 | 61.30 | 32.90 | 88.46 | 63.57 |
| Mini-Omni2 | 0.6B parameters | 2.32 | 2.18 | 1.79 | 9.31 | 24.27 | 26.59 | 46.40 | 11.56 | 57.50 | 33.49 |

### ASR

Word Error Rate (WER), lower is better.

| Model | Components & Size | Audio output | Open | AMI | GigaSpeech | LibriSpeech-clean | LibriSpeech-other | TED-LIUM | Average |
| -------------------- | ----------------- | ------------- | ---- | ----- | ---------- | ----------------- | ----------------- | -------- | ------- |
| LFM2-Audio-1.5B | 1.5B parameters | Yes | Yes | 15.58 | 10.67 | 2.01 | 4.39 | 3.56 | 7.24 |
| Qwen2.5-Omni-3B | 5B parameters | Yes | Yes | 15.95 | 10.02 | 2.01 | 3.91 | 3.86 | 7.15 |
| Whisper-large-V3 | 1.5B parameters | No — ASR only | Yes | 16.73 | 10.76 | 2.73 | 5.54 | 3.91 | 7.93 |
| elevenlabs/scribe_v1 | unknown | No — ASR only | No | 14.43 | 9.66 | 1.79 | 3.31 | 3.17 | 6.47 |

## 📬 Contact

If you are interested in custom solutions with edge deployment, please contact [our sales team](https://www.liquid.ai/contact).

## License

The code in this repository and the associated weights are licensed under the [LFM Open License v1.0](LICENSE).

The code for the audio encoder is based on [NVIDIA NeMo](https://github.com/NVIDIA-NeMo/NeMo/tree/main), licensed under [Apache 2.0](https://github.com/NVIDIA-NeMo/NeMo/blob/294ddff187f68c055d87ffe9400e65975b38693d/LICENSE), and the [canary-180m-flash](https://huggingface.co/nvidia/canary-180m-flash) checkpoint, licensed under [CC-BY 4.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/cc-by-4.0.md).

To simplify dependency resolution, we also ship the Python code of [Kyutai Mimi](https://github.com/kyutai-labs/moshi), licensed under the [MIT License](https://github.com/kyutai-labs/moshi/blob/aee53fc0fc0119e4d7343e5ea4dd6ddafd7f09c4/LICENSE-MIT). We also redistribute weights for [Kyutai Mimi](https://huggingface.co/kyutai/moshiko-pytorch-bf16), licensed under [CC-BY 4.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/cc-by-4.0.md).