---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---

# MiDashengLM-7B-1021

MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment. It achieves state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, delivering a 3.2× throughput speedup and supporting batch sizes up to 512.

📖 For a more detailed introduction and the technical report, please visit our [GitHub repository](https://github.com/xiaomi-research/dasheng-lm).

Note that for most applications, we strongly recommend the BF16 version ([mispeech/midashenglm-7b-1021-bf16](https://huggingface.co/mispeech/midashenglm-7b-1021-bf16)) for optimal performance and efficiency.

## Usage

### Load Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-fp32"  # Only for exact reproduction; otherwise we strongly recommend "mispeech/midashenglm-7b-1021-bf16"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```

### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)  (a raw waveform array; requires `import numpy as np`)
            },
        ],
    },
]
```

### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
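Since the BF16 checkpoint is recommended for everything except exact reproduction, a minimal loading sketch for it is shown below. The `torch_dtype` argument and the optional CUDA move are our assumptions for illustration, not part of the official example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

# Recommended checkpoint for most applications (see note above).
model_id = "mispeech/midashenglm-7b-1021-bf16"

# Assumption: torch_dtype=torch.bfloat16 keeps the weights in BF16 at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Optional: move to GPU if available. The generation snippet above already
# casts inputs to model.device and model.dtype, so it works unchanged.
if torch.cuda.is_available():
    model = model.cuda()
```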
## Results

The following evaluation results are based on the model version `mispeech/midashenglm-7b-1021-fp32`.

### Audio Captioning Results

| Domain | Dataset       | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------:|:-------------:|:-----------:|:---------------:|:-------------------:|
| Music  | MusicCaps     | **59.11**   | 43.71           | 35.43               |
| Music  | Songdescriber | **46.42**   | 45.31           | 44.63               |
| Sound  | AudioCaps     | **62.13**   | 60.79           | 49.00               |
| Sound  | ClothoV2      | **49.35**   | 47.55           | 48.01               |
| Sound  | AutoACD       | **67.13**   | 55.93           | 44.76               |

*Metrics: FENSE (higher is better).*

### Audio and Paralinguistic Classification

| Dataset         | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:---------------:|:------:|:-----------:|:---------------:|:-------------------:|
| VoxCeleb1       | ACC↑   | **92.66**   | 59.71           | 82.72               |
| VoxLingua107    | ACC↑   | **93.72**   | 51.03           | 73.65               |
| VoxCeleb-Gender | ACC↑   | 97.72       | **99.82**       | 99.69               |
| VGGSound        | ACC↑   | **52.19**   | 0.97            | 2.20                |
| Cochlscene      | ACC↑   | **75.81**   | 23.88           | 18.34               |
| NSynth          | ACC↑   | **80.32**   | 60.45           | 38.09               |
| FMA             | ACC↑   | 62.94       | **66.77**       | 27.91               |
| FSDKaggle2018   | ACC↑   | **73.38**   | 31.38           | 24.75               |
| AudioSet        | mAP↑   | **9.90**    | 6.48            | 3.47                |
| FSD50K          | mAP↑   | **38.10**   | 23.87           | 27.23               |

### ASR Performance

| Dataset                | Language   | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------------:|:----------:|:-----------:|:---------------:|:-------------------:|
| LibriSpeech test-clean | English    | 3.6         | 1.7             | **1.3**             |
| LibriSpeech test-other | English    | 5.9         | 3.4             | **2.4**             |
| People's Speech        | English    | 26.12       | 28.6            | **22.3**            |
| AISHELL2 Mic           | Chinese    | 3.2         | **2.5**         | 2.7                 |
| AISHELL2 iOS           | Chinese    | 2.9         | **2.6**         | **2.6**             |
| AISHELL2 Android       | Chinese    | 3.1         | 2.7             | **2.6**             |
| GigaSpeech2            | Indonesian | 22.3        | **21.2**        | >100                |
| GigaSpeech2            | Thai       | **38.4**    | 53.8            | >100                |
| GigaSpeech2            | Vietnamese | **17.7**    | 18.6            | >100                |

*Metrics: WER/CER (lower is better).*

### Question Answering Results

| Dataset        | Subset             | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------------:|:------------------:|:------:|:-----------:|:---------------:|:-------------------:|
| MMAU-Pro       | IF                 | ACC↑   | 37.93       | **61.30**       | 42.30               |
| MMAU-Pro       | Multi-Audio        | ACC↑   | **42.33**   | 24.30           | 17.20               |
| MMAU-Pro       | Music              | ACC↑   | **62.20**   | 61.50           | 57.60               |
| MMAU-Pro       | Open-ended         | ACC↑   | **63.21**   | 52.30           | 34.50               |
| MMAU-Pro       | Sound              | ACC↑   | **58.36**   | 47.60           | 46.00               |
| MMAU-Pro       | Sound–Music        | ACC↑   | 42.00       | 40.00           | **46.00**           |
| MMAU-Pro       | Sound–Music–Speech | ACC↑   | **71.43**   | 28.50           | 42.80               |
| MMAU-Pro       | Spatial            | ACC↑   | 18.77       | 41.20           | **43.70**           |
| MMAU-Pro       | Speech             | ACC↑   | **61.17**   | 57.40           | 52.20               |
| MMAU-Pro       | Speech–Music       | ACC↑   | **58.70**   | 53.20           | 54.30               |
| MMAU-Pro       | Speech–Sound       | ACC↑   | 51.14       | **60.20**       | 48.90               |
| MMAU-Pro       | Voice              | ACC↑   | 54.83       | **60.00**       | 50.60               |
| MMAU-Pro       | Average            | ACC↑   | **55.92**   | 52.20           | 46.60               |
| MMAU-v05.15.25 | Sound              | ACC↑   | 77.48       | **78.10**       | 75.68               |
| MMAU-v05.15.25 | Music              | ACC↑   | **70.96**   | 65.90           | 66.77               |
| MMAU-v05.15.25 | Speech             | ACC↑   | **76.28**   | 70.60           | 62.16               |
| MMAU-v05.15.25 | Average            | ACC↑   | **74.90**   | 71.50           | 68.20               |
| MuChoMusic     |                    | ACC↑   | **73.04**   | 64.79           | 67.40               |
| MusicQA        |                    | FENSE↑ | **61.56**   | 60.60           | 40.00               |
| AudioCaps-QA   |                    | FENSE↑ | **54.20**   | 53.28           | 47.34               |

*Metrics: Higher is better.*
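For reference, the WER/CER numbers in the ASR table follow the standard edit-distance definitions. The sketch below illustrates the computation with the third-party `jiwer` package; this is an assumption for illustration, not necessarily the evaluation toolkit behind the numbers above:

```python
# pip install jiwer
from jiwer import cer, wer

reference = "an engine is idling"
hypothesis = "an engine is idle"

# WER: word-level edit distance divided by the reference word count.
print(f"WER: {wer(reference, hypothesis):.3f}")  # 0.250 (1 substitution / 4 words)

# CER: the same ratio computed over characters, as typically reported
# for Chinese ASR benchmarks such as AISHELL2.
print(f"CER: {cer(reference, hypothesis):.3f}")
```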
## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
```