# MiDashengLM-7B-1021

MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment. It delivers state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency: a 3.2× throughput speedup and support for batch sizes up to 512.

📖 For a more detailed introduction and the technical report, please visit our GitHub repository.

Note: for most applications, we strongly recommend the BF16 version (`mispeech/midashenglm-7b-1021-bf16`) for better performance and efficiency; this FP32 checkpoint is intended for exact reproduction of the reported results.

## Usage

### Load Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-fp32"  # Only for exact reproduction; otherwise we strongly recommend "mispeech/midashenglm-7b-1021-bf16"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
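
If you follow the recommendation above and use the BF16 checkpoint, loading it in its native dtype roughly halves memory use relative to FP32. A minimal sketch using the standard `torch_dtype` argument of `from_pretrained` (generic `transformers` usage, not a model-specific API):

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

bf16_model_id = "mispeech/midashenglm-7b-1021-bf16"
model = AutoModelForCausalLM.from_pretrained(
    bf16_model_id,
    torch_dtype=torch.bfloat16,  # keep weights in their native BF16 precision
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(bf16_model_id)
processor = AutoProcessor.from_pretrained(bf16_model_id, trust_remote_code=True)
```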

### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
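
Instead of a `path` or `url`, a raw waveform can be supplied under the `audio` key, for example loaded with the `soundfile` package. A minimal sketch; the 16 kHz sampling rate is an assumption based on the `np.random.randn(16000)` example above, so resample if your file differs:

```python
import soundfile as sf

waveform, sample_rate = sf.read("/path/to/example.wav")  # numpy array, shape (num_samples,)
assert sample_rate == 16000, "resample first if the model expects 16 kHz input (assumption)"

# Replace the audio entry in the user turn with the in-memory waveform.
messages[1]["content"][1] = {"type": "audio", "audio": waveform}
```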

### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
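
`model.generate` accepts the standard `transformers` generation arguments, so output length and decoding strategy can be adjusted; the values below are illustrative, not settings tuned for this model:

```python
with torch.no_grad():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=128,  # cap the length of the generated response
        do_sample=True,      # sample instead of greedy decoding
        temperature=0.7,     # illustrative value, not tuned for this model
    )
output = tokenizer.batch_decode(generation, skip_special_tokens=True)
```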

## Results

The following evaluation results are based on the model version `mispeech/midashenglm-7b-1021-fp32`.

### Audio Captioning Results

| Domain | Dataset       | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|--------|---------------|-------------|-----------------|---------------------|
| Music  | MusicCaps     | 59.11       | 43.71           | 35.43               |
| Music  | Songdescriber | 46.42       | 45.31           | 44.63               |
| Sound  | AudioCaps     | 62.13       | 60.79           | 49.00               |
| Sound  | ClothoV2      | 49.35       | 47.55           | 48.01               |
| Sound  | AutoACD       | 67.13       | 55.93           | 44.76               |

Metrics: FENSE (higher is better).
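
FENSE scores caption quality by combining sentence-BERT semantic similarity with a fluency-error penalty. If you want to reproduce the metric, the reference `fense` package can be used; the sketch below reflects our understanding of its `Evaluator` API (the constructor arguments and the `sentence_score` method are assumptions to verify against that package's documentation):

```python
from fense.evaluator import Evaluator  # pip install fense; API assumed, verify upstream

# Model names below are the defaults documented in the fense repository (assumption).
evaluator = Evaluator(
    device="cpu",
    sbert_model="paraphrase-TinyBERT-L6-v2",
    echecker_model="echecker_clotho_audiocaps_base",
)

# Score one candidate caption against a list of reference captions.
score = evaluator.sentence_score("an engine is idling", ["a motor idles nearby"])
print(score)
```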

### Audio and Paralinguistic Classification

| Dataset         | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|-----------------|--------|-------------|-----------------|---------------------|
| VoxCeleb1       | ACC↑   | 92.66       | 59.71           | 82.72               |
| VoxLingua107    | ACC↑   | 93.72       | 51.03           | 73.65               |
| VoxCeleb-Gender | ACC↑   | 97.72       | 99.82           | 99.69               |
| VGGSound        | ACC↑   | 52.19       | 0.97            | 2.20                |
| Cochlscene      | ACC↑   | 75.81       | 23.88           | 18.34               |
| NSynth          | ACC↑   | 80.32       | 60.45           | 38.09               |
| FMA             | ACC↑   | 62.94       | 66.77           | 27.91               |
| FSDKaggle2018   | ACC↑   | 73.38       | 31.38           | 24.75               |
| AudioSet        | mAP↑   | 9.90        | 6.48            | 3.47                |
| FSD50K          | mAP↑   | 38.10       | 23.87           | 27.23               |

### ASR Performance

| Dataset                | Language   | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|------------------------|------------|-------------|-----------------|---------------------|
| LibriSpeech test-clean | English    | 3.6         | 1.7             | 1.3                 |
| LibriSpeech test-other | English    | 5.9         | 3.4             | 2.4                 |
| People's Speech        | English    | 26.12       | 28.6            | 22.3                |
| AISHELL2 Mic           | Chinese    | 3.2         | 2.5             | 2.7                 |
| AISHELL2 iOS           | Chinese    | 2.9         | 2.6             | 2.6                 |
| AISHELL2 Android       | Chinese    | 3.1         | 2.7             | 2.6                 |
| GigaSpeech2            | Indonesian | 22.3        | 21.2            | >100                |
| GigaSpeech2            | Thai       | 38.4        | 53.8            | >100                |
| GigaSpeech2            | Vietnamese | 17.7        | 18.6            | >100                |

Metrics: WER/CER (lower is better).
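
For reference, WER and CER are edit-distance-based error rates over words and characters respectively; a generic sketch using the `jiwer` package (illustrative only, not the exact evaluation pipeline behind these numbers):

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print(jiwer.wer(reference, hypothesis))  # word error rate, used for e.g. English
print(jiwer.cer(reference, hypothesis))  # character error rate, typically used for Chinese
```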

### Question Answering Results

| Dataset         | Subset             | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|-----------------|--------------------|--------|-------------|-----------------|---------------------|
| MMAU-Pro        | IF                 | ACC↑   | 37.93       | 61.30           | 42.30               |
| MMAU-Pro        | Multi-Audio        | ACC↑   | 42.33       | 24.30           | 17.20               |
| MMAU-Pro        | Music              | ACC↑   | 62.20       | 61.50           | 57.60               |
| MMAU-Pro        | Open-ended         | ACC↑   | 63.21       | 52.30           | 34.50               |
| MMAU-Pro        | Sound              | ACC↑   | 58.36       | 47.60           | 46.00               |
| MMAU-Pro        | Sound–Music        | ACC↑   | 42.00       | 40.00           | 46.00               |
| MMAU-Pro        | Sound–Music–Speech | ACC↑   | 71.43       | 28.50           | 42.80               |
| MMAU-Pro        | Spatial            | ACC↑   | 18.77       | 41.20           | 43.70               |
| MMAU-Pro        | Speech             | ACC↑   | 61.17       | 57.40           | 52.20               |
| MMAU-Pro        | Speech–Music       | ACC↑   | 58.70       | 53.20           | 54.30               |
| MMAU-Pro        | Speech–Sound       | ACC↑   | 51.14       | 60.20           | 48.90               |
| MMAU-Pro        | Voice              | ACC↑   | 54.83       | 60.00           | 50.60               |
| MMAU-Pro        | Average            | ACC↑   | 55.92       | 52.20           | 46.60               |
| MMAU-v05.15.25  | Sound              | ACC↑   | 77.48       | 78.10           | 75.68               |
| MMAU-v05.15.25  | Music              | ACC↑   | 70.96       | 65.90           | 66.77               |
| MMAU-v05.15.25  | Speech             | ACC↑   | 76.28       | 70.60           | 62.16               |
| MMAU-v05.15.25  | Average            | ACC↑   | 74.90       | 71.50           | 68.20               |
| MuChoMusic      | –                  | ACC↑   | 73.04       | 64.79           | 67.40               |
| MusicQA         | –                  | FENSE↑ | 61.56       | 60.60           | 40.00               |
| AudioCaps-QA    | –                  | FENSE↑ | 54.20       | 53.28           | 47.34               |

Metrics: ACC and FENSE (higher is better).

## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in both research and commercial applications.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
```