MiDashengLM-7B-1021
MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment. It reaches state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, delivering a 3.2× throughput speedup and supporting batch sizes up to 512.
📖 For a more detailed introduction and the technical report, please visit our GitHub repository.
Note that for most applications, we strongly recommend using the BF16 version (mispeech/midashenglm-7b-1021-bf16) for optimal performance and efficiency.
Usage
Load Model
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
model_id = "mispeech/midashenglm-7b-1021-fp32" # Only for exact reproduction; otherwise strongly recommend "mispeech/midashenglm-7b-1021-bf16"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
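The note above recommends the BF16 checkpoint for most applications. The following is a minimal sketch of how the choice between the two checkpoints could be wrapped in a helper. The function names `select_checkpoint` and `load_midashenglm` are hypothetical, and passing `torch_dtype` to `from_pretrained` is an assumption about how one might want to load the weights (recent `transformers` releases also accept `dtype`); it is not taken from the official instructions.

```python
def select_checkpoint(exact_reproduction: bool = False) -> str:
    """Return the checkpoint name following the recommendation above."""
    if exact_reproduction:
        # FP32 weights: only needed to reproduce the reported numbers exactly.
        return "mispeech/midashenglm-7b-1021-fp32"
    return "mispeech/midashenglm-7b-1021-bf16"


def load_midashenglm(exact_reproduction: bool = False):
    """Load model, tokenizer and processor for the selected checkpoint."""
    # Imported lazily so select_checkpoint() works without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

    model_id = select_checkpoint(exact_reproduction)
    # Assumption: load the BF16 checkpoint in bfloat16 and the FP32 one in float32.
    dtype = torch.float32 if exact_reproduction else torch.bfloat16
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, torch_dtype=dtype
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, tokenizer, processor
```

Calling `load_midashenglm()` would download the BF16 weights on first use; `load_midashenglm(exact_reproduction=True)` selects the FP32 checkpoint instead.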
Construct Prompt
user_prompt = "Caption the audio." # You may try any other prompt
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
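The message structure above accepts the audio as a local path, a URL, or an in-memory array. As a rough sketch, the same structure can be produced by a small helper that picks the key from the type of the input. The helper name `build_messages` and the rule "strings starting with http:// or https:// are URLs" are assumptions made for illustration.

```python
from typing import Any, Dict, List

DEFAULT_SYSTEM_PROMPT = "You are a helpful language and speech assistant."


def build_messages(
    user_prompt: str,
    audio_source: Any,
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
) -> List[Dict[str, Any]]:
    """Build a chat message list with one text prompt and one audio input."""
    if isinstance(audio_source, str):
        # Assumption: strings are either URLs or local file paths.
        is_url = audio_source.startswith(("http://", "https://"))
        audio_key = "url" if is_url else "path"
    else:
        # Anything else (for example a NumPy waveform) is passed as raw audio.
        audio_key = "audio"

    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "audio", audio_key: audio_source},
            ],
        },
    ]


messages = build_messages("Caption the audio.", "/path/to/example.wav")
```

Keeping the construction in one function makes it easier to swap prompts or audio sources without retyping the nested structure.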
Generate Output
import torch
with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
Results
The following evaluation results are based on the model version: mispeech/midashenglm-7b-1021-fp32.
Audio Captioning Results
| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|---|---|---|---|---|
| Music | MusicCaps | 59.11 | 43.71 | 35.43 |
| Music | Songdescriber | 46.42 | 45.31 | 44.63 |
| Sound | AudioCaps | 62.13 | 60.79 | 49.00 |
| Sound | ClothoV2 | 49.35 | 47.55 | 48.01 |
| Sound | AutoACD | 67.13 | 55.93 | 44.76 |
Metrics: FENSE (higher is better).
Audio and Paralinguistic Classification
| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|---|---|---|---|---|
| VoxCeleb1 | ACC↑ | 92.66 | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | 93.72 | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 97.72 | 99.82 | 99.69 |
| VGGSound | ACC↑ | 52.19 | 0.97 | 2.20 |
| Cochlscene | ACC↑ | 75.81 | 23.88 | 18.34 |
| NSynth | ACC↑ | 80.32 | 60.45 | 38.09 |
| FMA | ACC↑ | 62.94 | 66.77 | 27.91 |
| FSDKaggle2018 | ACC↑ | 73.38 | 31.38 | 24.75 |
| AudioSet | mAP↑ | 9.90 | 6.48 | 3.47 |
| FSD50K | mAP↑ | 38.10 | 23.87 | 27.23 |
ASR Performance
| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|---|---|---|---|---|
| LibriSpeech test-clean | English | 3.6 | 1.7 | 1.3 |
| LibriSpeech test-other | English | 5.9 | 3.4 | 2.4 |
| People's Speech | English | 26.12 | 28.6 | 22.3 |
| AISHELL2 Mic | Chinese | 3.2 | 2.5 | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | 2.6 | 2.6 |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | 2.6 |
| GigaSpeech2 | Indonesian | 22.3 | 21.2 | >100 |
| GigaSpeech2 | Thai | 38.4 | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | 17.7 | 18.6 | >100 |
Metrics: WER/CER (lower is better).
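For readers unfamiliar with the metric, the sketch below shows how a word or character error rate is commonly computed: the edit distance between reference and hypothesis, divided by the reference length. It is an illustration only and not the evaluation script behind the table; the function names are hypothetical, and text normalization (casing, punctuation) is left out. Because insertions count as errors, the rate is not capped at 100, which is how entries such as ">100" can arise.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between token lists."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference tokens so far
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Return WER (unit="word") or CER (unit="char") in percent."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        # Whitespace is ignored for character-level scoring.
        ref = list("".join(reference.split()))
        hyp = list("".join(hypothesis.split()))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(error_rate("an engine is idling", "an engine idling"))  # one deletion out of four words
```

Character-level scoring (`unit="char"`) is the usual choice for languages written without spaces between words, such as Chinese.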
Question Answering Results
| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|---|---|---|---|---|---|
| MMAU-Pro | IF | ACC↑ | 37.93 | 61.30 | 42.30 |
| MMAU-Pro | Multi-Audio | ACC↑ | 42.33 | 24.30 | 17.20 |
| MMAU-Pro | Music | ACC↑ | 62.20 | 61.50 | 57.60 |
| MMAU-Pro | Open-ended | ACC↑ | 63.21 | 52.30 | 34.50 |
| MMAU-Pro | Sound | ACC↑ | 58.36 | 47.60 | 46.00 |
| MMAU-Pro | Sound–Music | ACC↑ | 42.00 | 40.00 | 46.00 |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | 71.43 | 28.50 | 42.80 |
| MMAU-Pro | Spatial | ACC↑ | 18.77 | 41.20 | 43.70 |
| MMAU-Pro | Speech | ACC↑ | 61.17 | 57.40 | 52.20 |
| MMAU-Pro | Speech–Music | ACC↑ | 58.70 | 53.20 | 54.30 |
| MMAU-Pro | Speech–Sound | ACC↑ | 51.14 | 60.20 | 48.90 |
| MMAU-Pro | Voice | ACC↑ | 54.83 | 60.00 | 50.60 |
| MMAU-Pro | Average | ACC↑ | 55.92 | 52.20 | 46.60 |
| MMAU-v05.15.25 | Sound | ACC↑ | 77.48 | 78.10 | 75.68 |
| MMAU-v05.15.25 | Music | ACC↑ | 70.96 | 65.90 | 66.77 |
| MMAU-v05.15.25 | Speech | ACC↑ | 76.28 | 70.60 | 62.16 |
| MMAU-v05.15.25 | Average | ACC↑ | 74.90 | 71.50 | 68.20 |
| MuChoMusic | | ACC↑ | 73.04 | 64.79 | 67.40 |
| MusicQA | | FENSE↑ | 61.56 | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | 54.20 | 53.28 | 47.34 |
Metrics: Higher is better.
Citation
MiDashengLM is released under the Apache License 2.0, and we encourage its use in both research and business applications.
If you find MiDashengLM useful in your research, please consider citing our work:
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
Base model of mispeech/midashenglm-7b-1021-fp32: Qwen/Qwen2.5-Omni-7B