Ling-mini-2.0: Mini-Sized, Maximum Efficiency
Foreword
On September 10, 2025, Ant Group officially began open-sourcing Ling 2.0, a series of MoE (Mixture-of-Experts) Large Language Models (LLMs) that combine State-of-the-Art (SOTA) performance with high efficiency. It is the latest open-source LLM series from inclusionAI, an AI research initiative backed by Ant Group, the parent company of Alipay.
The first released version, Ling-mini-2.0, is compact yet powerful. It has 16B total parameters, of which only 1.4B are activated per input token (789M non-embedding). It was pre-trained on over 20T tokens of high-quality data, and its complex-reasoning and instruction-following capabilities were further strengthened through multi-stage supervised fine-tuning and reinforcement learning. As a result, despite activating only 1.4B parameters, it reaches the top performance level of dense LLMs under 10B as well as MoE models of comparable or larger scale.
Powerful General and Specialized Reasoning Capabilities
We tested the model on high-difficulty general reasoning tasks, such as Code (LiveCodeBench, CodeForces) and Mathematics (AIME 2025, HMMT 2025), as well as knowledge-based reasoning tasks covering multiple specialized disciplines (MMLU-Pro, Humanity's Last Exam). Compared to dense models under 10B (e.g., Qwen3-4B-instruct-2507, Qwen3-8B-nothinking) and MoE models of comparable or larger scale (Ernie-4.5-21B-A3B-PT, GPT-OSS-20B/low), Ling-mini-2.0 demonstrated outstanding comprehensive reasoning capabilities.
7x or More Performance Leverage Over Dense Architectures
Guided by the Ling Scaling Laws, Ling 2.0 adopts an MoE architecture with a 1/32 activation ratio. We derived empirically optimal configurations across a range of design choices, including expert granularity, shared-expert ratio, attention ratio, an aux-loss-free + sigmoid routing balancing strategy, the MTP (Multi-Token Prediction) layer, QK-Norm, and half RoPE. This enables the small-activation MoE to achieve a performance leverage of 7x or more over dense architectures. In other words, Ling-mini-2.0, with only 1.4B activated parameters (789M non-embedding), can match the performance of a dense model of roughly 7-8B parameters.
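To make the 1/32 activation ratio and the sigmoid routing idea concrete, here is a minimal sketch of a top-k router with sigmoid gating and a load-balancing bias term. The dimensions, expert count, and bias handling are illustrative assumptions for this example, not the released Ling architecture code.

```python
# Sketch of sigmoid-gated top-k MoE routing with an aux-loss-free-style bias.
# 8 of 256 experts per token gives a 1/32 activation ratio; all sizes are
# illustrative assumptions, not the actual Ling-mini-2.0 configuration.
import torch

hidden, n_experts, top_k = 2048, 256, 8   # 8/256 = 1/32 activation ratio
router = torch.nn.Linear(hidden, n_experts, bias=False)
expert_bias = torch.zeros(n_experts)      # adjusted online to balance expert load

def route(x: torch.Tensor):
    """Return (expert indices, gate weights) for each token in x."""
    scores = torch.sigmoid(router(x))             # sigmoid gating, not softmax
    biased = scores + expert_bias                 # bias influences selection only
    top_idx = biased.topk(top_k, dim=-1).indices  # pick k experts per token
    gates = torch.gather(scores, -1, top_idx)     # weights come from raw scores
    return top_idx, gates / gates.sum(-1, keepdim=True)

tokens = torch.randn(4, hidden)
idx, weights = route(tokens)
print(idx.shape, weights.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

The general idea behind aux-loss-free balancing is that expert selection uses the bias-adjusted scores while the output weights use the raw gating scores, so load can be balanced without adding an auxiliary loss term to the training objective.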
High-Speed Generation of 300+ tokens/s
The highly sparse, small-activation MoE architecture delivers a significant advantage in both training and inference. In simple Q&A scenarios with outputs under 2,000 tokens, Ling-mini-2.0 generates at 300+ tokens/s (deployed on H20), more than twice as fast as 8B dense models. With YaRN extrapolation, Ling-mini-2.0 supports a 128K context window, and as output length grows the relative speedup can exceed 7x.
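For readers who want to experiment with the long-context setting, the sketch below shows how YaRN-style RoPE scaling is commonly expressed as a Hugging Face config override. The scaling factor, native context length, and field names here are assumptions following the usual convention; the released model config should be treated as the source of truth.

```python
# Hedged sketch: enabling a YaRN-style long-context extrapolation via a
# rope_scaling override. The factor and the assumed native context length are
# illustrative, not values taken from the released Ling-mini-2.0 config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("inclusionAI/Ling-mini-2.0", trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                               # e.g. 32K native -> 128K extrapolated
    "original_max_position_embeddings": 32768,   # assumed native context length
}
config.max_position_embeddings = 131072          # 128K target window
```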
First Open-Sourced FP8 High-Efficiency Training Solution
Ling 2.0 utilizes FP8 mixed-precision training throughout the entire process. A comparison against BF16 over more than 1T training tokens showed nearly identical results, both in the loss curve and across dozens of downstream benchmarks. To help the community efficiently continue pre-training and fine-tuning with limited computing resources, we are concurrently open-sourcing the FP8 training solution. Building on tile/blockwise FP8 scaling, it further introduces technologies such as an FP8 optimizer, FP8 on-demand transposed weights, and an FP8 padding routing map to maximize GPU memory utilization. In throughput tests on 8/16/32 80GB GPUs, Ling-mini-2.0 with the MTP layer enabled achieved a 30-60% throughput improvement over LLaMA 3.1 8B and Qwen3 8B; with MTP disabled, the improvement reached 90-120%.
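As a rough illustration of what tile/blockwise FP8 scaling means, the sketch below quantizes a tensor block by block, giving each block its own scale so its largest value maps onto the FP8 E4M3 range. The block size and the round-trip structure are illustrative assumptions, not the exact recipe of the released training solution.

```python
# Illustrative blockwise FP8 scaling: one scale per 128-element block so that
# each block's max magnitude maps to the float8 E4M3 range. Assumes the tensor
# length is divisible by the block size; requires PyTorch with float8 support.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def blockwise_fp8_quant(x: torch.Tensor, block: int = 128):
    """Quantize a 1-D tensor to FP8 with one scale per block; return (q, scales)."""
    x = x.reshape(-1, block)
    scales = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    q = (x / scales).to(torch.float8_e4m3fn)
    return q, scales

def blockwise_fp8_dequant(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Reverse the quantization for inspection of the reconstruction error."""
    return (q.to(torch.float32) * scales).reshape(-1)

w = torch.randn(1024)
q, s = blockwise_fp8_quant(w)
err = (blockwise_fp8_dequant(q, s) - w).abs().max()
print(f"max abs reconstruction error: {err:.4e}")
```

Per-block scales keep outlier values in one block from forcing a coarse scale onto the rest of the tensor, which is why blockwise scaling preserves accuracy better than a single per-tensor scale.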
More Open Model Releases
We believe Ling-mini-2.0 is an ideal starting point for MoE model research. It is the first small-scale model to integrate 1/32 sparsity, an MTP layer, and FP8 training while delivering exceptional results, making it a strong candidate to become the preferred choice for small-sized language models. To foster community development, in addition to the post-trained version and the open-sourced FP8 high-efficiency training solution, we are also releasing five pre-training versions: Ling-mini-2.0-base, the base model before post-training, and four phase-based checkpoints trained on 5T, 10T, 15T, and 20T tokens. This will facilitate more in-depth research and application within the community.
We welcome everyone to visit our open-source repository to download and use the models, or to try them directly on HuggingFace Spaces. Under the Ling 2.0 architecture, we will keep releasing larger, faster, and better language and multimodal models, and we look forward to introducing them in future releases!
So, where to find the models?
Try it online: https://huggingface.co/spaces/inclusionAI/ling-mini-2.0
Download the model: https://huggingface.co/inclusionAI/Ling-mini-2.0
Code repository: https://github.com/inclusionAI/Ling-V2
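For a quick start, here is a minimal sketch of loading and querying the model with Hugging Face transformers. The generation settings are illustrative, and the need for trust_remote_code is an assumption for the custom MoE architecture; see the repository's instructions for the supported usage.

```python
# Minimal sketch: load Ling-mini-2.0 with transformers and run one chat turn.
# Dtype, device placement, and generation settings are illustrative choices.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-mini-2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Summarize mixture-of-experts models in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```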


