MiDashengLM-7B-1021

MiDashengLM is an efficient audio-language model that uses caption-based alignment for holistic audio understanding. It achieves state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, delivering a 3.2× throughput speedup and supporting batch sizes up to 512.

📖 For a more detailed introduction and the technical report, please visit our GitHub repository.

Note that for most applications, we strongly recommend using the BF16 version (mispeech/midashenglm-7b-1021-bf16) for optimal performance and efficiency.

Usage

Load Model

from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-fp32"  # Only for exact reproduction; otherwise strongly recommend "mispeech/midashenglm-7b-1021-bf16"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
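
The checkpoint recommendation above can be isolated in a small helper, shown here as a minimal sketch. The names `select_checkpoint` and `load_midashenglm` are hypothetical, and passing `torch_dtype="auto"` (so the weights keep the dtype stored in the checkpoint) is an assumption rather than something this model card prescribes.

```python
FP32_ID = "mispeech/midashenglm-7b-1021-fp32"
BF16_ID = "mispeech/midashenglm-7b-1021-bf16"


def select_checkpoint(exact_reproduction: bool = False) -> str:
    """Return the FP32 checkpoint only when exact reproduction is needed."""
    return FP32_ID if exact_reproduction else BF16_ID


def load_midashenglm(model_id: str):
    """Load the model, tokenizer, and processor for a checkpoint id."""
    # Imported lazily so that choosing a checkpoint does not require transformers.
    from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        torch_dtype="auto",  # assumption: keep the dtype stored in the checkpoint
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, tokenizer, processor
```

Calling `load_midashenglm(select_checkpoint())` would then load the BF16 variant by default, while `select_checkpoint(exact_reproduction=True)` falls back to FP32.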

Construct Prompt

user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
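
When the same prompt is applied to many audio files, the message structure above could be produced by a helper. The sketch below is illustrative only: `build_messages` is a hypothetical function, not part of the model's API, and it simply mirrors the three audio keys shown in the example (`path`, `url`, `audio`).

```python
from typing import Any, Dict, List

DEFAULT_SYSTEM_PROMPT = "You are a helpful language and speech assistant."


def build_messages(
    user_prompt: str,
    audio: Any,
    source: str = "path",
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
) -> List[Dict[str, Any]]:
    """Build a chat message list with one text prompt and one audio input.

    `source` selects the key used for the audio entry: "path" for a local
    file, "url" for a remote file, or "audio" for a raw waveform array.
    """
    if source not in {"path", "url", "audio"}:
        raise ValueError(f"unsupported audio source: {source!r}")
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "audio", source: audio},
            ],
        },
    ]


messages = build_messages("Caption the audio.", "/path/to/example.wav")
```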

Generate Output

import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]

Results

The following evaluation results are based on the model version: mispeech/midashenglm-7b-1021-fp32.

Audio Captioning Results

Domain	Dataset	MiDashengLM	Qwen2.5-Omni-7B	Kimi-Audio-Instruct
Music	MusicCaps	59.11	43.71	35.43
Music	Songdescriber	46.42	45.31	44.63
Sound	AudioCaps	62.13	60.79	49.00
Sound	ClothoV2	49.35	47.55	48.01
Sound	AutoACD	67.13	55.93	44.76

Metric: FENSE (higher is better).

Audio and Paralinguistic Classification

Dataset	Metric	MiDashengLM	Qwen2.5-Omni-7B	Kimi-Audio-Instruct
VoxCeleb1	ACC↑	92.66	59.71	82.72
VoxLingua107	ACC↑	93.72	51.03	73.65
VoxCeleb-Gender	ACC↑	97.72	99.82	99.69
VGGSound	ACC↑	52.19	0.97	2.20
Cochlscene	ACC↑	75.81	23.88	18.34
NSynth	ACC↑	80.32	60.45	38.09
FMA	ACC↑	62.94	66.77	27.91
FSDKaggle2018	ACC↑	73.38	31.38	24.75
AudioSet	mAP↑	9.90	6.48	3.47
FSD50K	mAP↑	38.10	23.87	27.23
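
For readers unfamiliar with the mAP metric used for the multi-label datasets, the following is a minimal pure-Python sketch of mean average precision. It is not the evaluation code behind the numbers above; the function names are hypothetical, and skipping classes that have no positive example is an assumption. The result is a fraction in [0, 1]; multiply by 100 for a percentage.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision for one class: mean precision at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    score_matrix: Sequence[Sequence[float]],
    label_matrix: Sequence[Sequence[int]],
) -> float:
    """Mean of per-class AP; rows are clips, columns are classes."""
    num_classes = len(score_matrix[0])
    aps: List[float] = []
    for c in range(num_classes):
        col_labels = [row[c] for row in label_matrix]
        if not any(col_labels):
            continue  # assumption: classes without positives are skipped
        col_scores = [row[c] for row in score_matrix]
        aps.append(average_precision(col_scores, col_labels))
    return sum(aps) / len(aps) if aps else 0.0
```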

ASR Performance

Dataset	Language	MiDashengLM	Qwen2.5-Omni-7B	Kimi-Audio-Instruct
LibriSpeech test-clean	English	3.6	1.7	1.3
LibriSpeech test-other	English	5.9	3.4	2.4
People's Speech	English	26.12	28.6	22.3
AISHELL2 Mic	Chinese	3.2	2.5	2.7
AISHELL2 iOS	Chinese	2.9	2.6	2.6
AISHELL2 Android	Chinese	3.1	2.7	2.6
GigaSpeech2	Indonesian	22.3	21.2	>100
GigaSpeech2	Thai	38.4	53.8	>100
GigaSpeech2	Vietnamese	17.7	18.6	>100

Metrics: WER/CER (lower is better).
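
As a reminder of how these metrics are defined, the sketch below computes word and character error rate as the Levenshtein distance between reference and hypothesis divided by the reference length, expressed as a percentage. It is illustrative only: the function names are hypothetical, and the whitespace tokenization and lack of text normalization are simplifying assumptions, not the evaluation pipeline used for the table.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Return WER (unit="word") or CER (unit="char") as a percentage."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 100, which is how entries such as ">100" can arise.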

Question Answering Results

Dataset	Subset	Metric	MiDashengLM	Qwen2.5-Omni-7B	Kimi-Audio-Instruct
MMAU-Pro	IF	ACC↑	37.93	61.30	42.30
MMAU-Pro	Multi-Audio	ACC↑	42.33	24.30	17.20
MMAU-Pro	Music	ACC↑	62.20	61.50	57.60
MMAU-Pro	Open-ended	ACC↑	63.21	52.30	34.50
MMAU-Pro	Sound	ACC↑	58.36	47.60	46.00
MMAU-Pro	Sound–Music	ACC↑	42.00	40.00	46.00
MMAU-Pro	Sound–Music–Speech	ACC↑	71.43	28.50	42.80
MMAU-Pro	Spatial	ACC↑	18.77	41.20	43.70
MMAU-Pro	Speech	ACC↑	61.17	57.40	52.20
MMAU-Pro	Speech–Music	ACC↑	58.70	53.20	54.30
MMAU-Pro	Speech–Sound	ACC↑	51.14	60.20	48.90
MMAU-Pro	Voice	ACC↑	54.83	60.00	50.60
MMAU-Pro	Average	ACC↑	55.92	52.20	46.60
MMAU-v05.15.25	Sound	ACC↑	77.48	78.10	75.68
MMAU-v05.15.25	Music	ACC↑	70.96	65.90	66.77
MMAU-v05.15.25	Speech	ACC↑	76.28	70.60	62.16
MMAU-v05.15.25	Average	ACC↑	74.90	71.50	68.20
MuChoMusic		ACC↑	73.04	64.79	67.40
MusicQA		FENSE↑	61.56	60.60	40.00
AudioCaps-QA		FENSE↑	54.20	53.28	47.34

Metrics: Higher is better.

Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in both research and commercial applications.

If you find MiDashengLM useful in your research, please consider citing our work:

@techreport{midashenglm7b,
  title      = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author     = {{Horizon Team, MiLM Plus}},
  institution= {Xiaomi Inc.},
  year       = {2025},
  note       = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url        = {https://arxiv.org/abs/2508.03983},
  eprint     = {2508.03983},
}