tencent
/

Hunyuan-A13B-Instruct

+---
+license: other
+license_name: tencent-hunyuan-a13b
+license_link: LICENSE
+---
+<p align="center">
+ <img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br>
+</p><p></p>
+<p align="center">
+    &nbsp;<a href="https://github.com/Tencent/Hunyuan-A13B"><b>GITHUB</b></a>&nbsp;&nbsp;
+</p>
+## Model Introduction
+The Tencent Hunyuan A13B models in this release, [Tencent-Hunyuan-A13B-Pretrain](https://huggingface.co/tencent/Hunyuan-A13B-Pretrain), [Tencent-Hunyuan-A13B-Instruct](https://huggingface.co/tencent/Hunyuan-A13B-Instruct), and [Tencent-Hunyuan-A13B-Instruct-FP8](https://huggingface.co/tencent/Tencent-Hunyuan-A13B-Instruct-FP8), benefit from improved data allocation and training, deliver strong performance, and strike a good balance between compute cost and quality. Hunyuan-A13B is currently one of the strongest Chinese Mixture of Experts (MoE) models, with 80 billion total parameters and 13 billion active parameters.
+### Introduction to Technical Advantages
+**Model**
+- **High-Quality Synthetic Data**: By enhancing training with synthetic data, Hunyuan-A13B is able to learn richer representations, handle long-context inputs, and generalize better to unseen data.
+- **KV Cache Compression**: Utilizing Grouped Query Attention (GQA) and Cross-Layer Attention (CLA) strategies, it significantly reduces memory usage and computational overhead of the KV cache, thereby improving inference throughput.
+- **Expert-Specific Learning Rate Scaling**: Different learning rates are assigned to different experts, ensuring that each sub-model can effectively learn from the data and contribute to overall performance.
+- **Long-Context Processing Capability**: Both the pre-trained model and the instruction-tuned model support text sequences of up to 256K tokens, significantly enhancing the ability to handle long-context tasks.
+- **Extensive Benchmarking**: Extensive experiments across multiple languages and tasks have validated the practical effectiveness and safety of Hunyuan-A13B.
+- **Hybrid Reasoning Capability**: It supports both fast thinking and slow thinking inference modes.
+**Architecture**
+Hunyuan-A13B adopts a Fine-grained Mixture of Experts (Fine-grained MoE) architecture, comprising a total of 80 billion parameters with 13 billion active parameters. The model has been trained on over 20 trillion tokens. It supports a context length of up to 256K tokens. The following are the detailed specifications of the model architecture:
+- **Total Parameters**: 80B
+- **Active Parameters**: 13B
+- **Number of Layers**: 32
+- **Attention Heads**: 32
+- **Number of Shared Experts**: 1
+- **Number of Non-Shared Experts**: 64
+- **Routing Strategy**: Top-8
+- **Activation Function**: SwiGLU
+- **Hidden Layer Dimension**: 4096
+- **Expert Hidden Layer Dimension**: 3072
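As a rough illustration of the figures listed above, the following sketch works out how many experts process each token in an MoE layer and what share of the parameters is active. It assumes that Top-8 routing selects 8 of the 64 non-shared experts per token and that the shared expert always runs. The constant and variable names are illustrative and are not taken from the model code.

```python
# Figures copied from the architecture list above.
TOTAL_PARAMS_B = 80      # total parameters, in billions
ACTIVE_PARAMS_B = 13     # active parameters per token, in billions
NUM_SHARED_EXPERTS = 1
NUM_ROUTED_EXPERTS = 64  # non-shared experts
TOP_K = 8                # routing strategy: Top-8

# Assumption: the shared expert always runs, and the router
# selects TOP_K of the non-shared experts for each token.
experts_per_token = NUM_SHARED_EXPERTS + TOP_K
routed_fraction = TOP_K / NUM_ROUTED_EXPERTS
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"experts used per token in an MoE layer: {experts_per_token}")
print(f"share of non-shared experts used per token: {routed_fraction:.1%}")
print(f"share of parameters active per token: {active_fraction:.2%}")
```

Under these assumptions, only a small part of the network runs for each token. That is the balance between compute cost and quality that the introduction describes.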
+&nbsp;
+## Related News
+* 2025.6.27 We have open-sourced **Hunyuan-A13B-Pretrain**, **Hunyuan-A13B-Instruct**, and **Hunyuan-A13B-Instruct-FP8** on Hugging Face.
+<br>
+## Benchmark
+Note: The following benchmarks were evaluated with the TensorRT-LLM backend.
+| Model      | Hunyuan-Large | Qwen2.5-72B | Qwen3-32B | Qwen3-A22B | Hunyuan-A13B |
+|------------|---------------|-------------|-----------|------------|--------------|
+| MMLU       | 88.4          | 86.1        | 83.61     | 87.81      | 88.17        |
+| MMLU-Pro   | 60.20         | 58.10       | 65.54     | 68.18      | 67.23        |
+| MMLU-Redux | 87.47         | 83.90       | 83.41     | 87.40      | 87.67        |
+| BBH        | 86.30         | 85.8        | 87.38     | 88.87      | 87.56        |
+| SuperGPQA  | 38.90         | 37.84 *     | 39.78     | 44.06      | 41.32        |
+| EvalPlus   | 75.69         | 66.05       | 72.05     | 77.60      | 78.64        |
+| MultiPL-E  | 59.13         | 61.00       | 67.06     | 65.94      | 69.33        |
+| MBPP       | 72.60         | 84.70       | 78.20     | 81.40      | 83.86        |
+| CRUX-O     | 60.63         | 56.00 *     | 72.50     | 79.00      | 77.00        |
+| MATH       | 69.80         | 62.1        | 61.62     | 71.84      | 72.35        |
+| GSM8k      | 92.80         | 91.5        | 93.40     | 94.39      | 91.83        |
+| GPQA       | -             | 45.9        | 47.97     | 47.47      | 43.44        |
+| INCLUDE    | 66.48         | 76.98 *     | 67.97     | 73.46      | 74.90        |
+| MGSM       | 67.52         | 79.53 *     | 82.68     | 83.53      | 76.00        |
+| MMMLU      | 76.89         | 79.28 *     | 83.83     | 86.70      | 84.68        |
+&nbsp;
+| Topic               | Bench                         | OpenAI-o1-1217 | DeepSeek R1 | Qwen3-A22B | Hunyuan-A13B-Instruct |
+|:-------------------:|:-----------------------------:|:-------------:|:------------:|:-----------:|:---------------------:|
+| **Mathematics**     | AIME 2024<br>AIME 2025<br>MATH | 74.3<br>79.2<br>96.4 | 79.8<br>70<br>94.9 | 85.7<br>81.5<br>94.0 | 87.3<br>76.8<br>94.3 |
+| **Science**         | GPQA-Diamond<br>OlympiadBench | 78<br>83.1 | 71.5<br>82.4 | 71.1<br>85.7 | 71.2<br>82.7 |
+| **Coding**          | LiveCodeBench<br>FullStackBench<br>ArtifactsBench | 63.9<br>64.6<br>38.6 | 65.9<br>71.6<br>44.6 | 70.7<br>65.6<br>44.6 | 63.9<br>67.8<br>43 |
+| **Reasoning**       | BBH<br>DROP<br>ZebraLogic    | 80.4<br>90.2<br>81 | 83.7<br>92.2<br>78.7 | 88.9<br>90.3<br>80.3 | 89.1<br>91.1<br>84.7 |
+| **Instruction<br>Following** | IF-Eval<br>SysBench  | 91.8<br>82.5 | 88.3<br>77.7 | 83.4<br>74.2 | 84.7<br>76.1 |
+| **Text<br>Creation**| LengthCtrl<br>InsCtrl       | 60.1<br>74.8 | 55.9<br>69 | 53.3<br>73.7 | 55.4<br>71.9 |
+| **NLU**             | ComplexNLU<br>Word-Task     | 64.7<br>67.1 | 64.5<br>81.8 | 59.8<br>56.4 | 61.2<br>62.9 |
+| **Agent**           | BFCL v3<br> $\tau$-bench<br>ComplexFuncBench<br> $C^3$-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 63.8<br>58.7<br>n/a<br>55.3 | 70.8<br>46.7<br>n/a<br>51.7 | 78.3<br>54.7<br>51.2<br>63.5 |
+| **Average**         | -                            | n/a | n/a | n/a | n/a |
+## Quick Start
+See the [Hunyuan-A13B](https://github.com/Tencent-Hunyuan/Hunyuan-A13B) repository on GitHub to get started quickly. It provides the training and inference code.
+### Transformers
+```python
+import os
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+
+def main():
+    # Local directory or Hugging Face repo ID of the model, read from the environment.
+    model_name_or_path = os.environ["MODEL_PATH"]
+    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
+    # device_map="auto" places the weights on the available devices.
+    # You may also want to load the weights in bfloat16 to reduce memory usage.
+    model = AutoModelForCausalLM.from_pretrained(
+        model_name_or_path, device_map="auto", trust_remote_code=True
+    )
+
+    # Optional sanity check: print the name and shape of every parameter.
+    for name, param in model.named_parameters():
+        print(f"{name}: {param.size()}")
+
+    messages = [
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Write a short summary of the benefits of regular exercise."},
+    ]
+    # Apply the model's chat template and tokenize the conversation.
+    tokenized_chat = tokenizer.apply_chat_template(
+        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+    )
+    outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=100, do_sample=True)
+    # The decoded text contains the prompt followed by the generated reply.
+    print(tokenizer.decode(outputs[0]))
+
+
+if __name__ == "__main__":
+    main()
+```
+## Deployment
+For deployment, you can use frameworks such as *vLLM*, *SGLang*, or *TensorRT-LLM* to serve the model and create an OpenAI-compatible API endpoint.
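Once one of these servers is running, any OpenAI-compatible client should be able to query it. The sketch below builds a chat completion request using only the Python standard library. The base URL (port 30000, as in the SGLang command later in this document), the `/v1/chat/completions` path, and the model name are assumptions; adjust them to match your deployment. The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

# Assumptions: adjust these to match your own deployment.
BASE_URL = "http://localhost:30000/v1"
MODEL_NAME = "hunyuan/huanyuan_A13B"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_NAME, max_tokens=100):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Write a short summary of the benefits of regular exercise.")` returns the reply text, provided a server is listening at `BASE_URL`.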
+### vLLM
+#### Docker Image
+We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. Official upstream support is still under development.
+To get started:
+- Pull the Docker image:
+```
+docker pull xxx
+```
+- Start the API server:
+```
+docker start xxx
+```
+#### Source Code
+Support for this model has been added to the vLLM project via [this PR](https://github.com/vllm-project/vllm/pull/20114).
+You can build and run vLLM from source after merging this pull request into your local repository.
+After applying the changes, you can start the API server by following the standard vLLM setup instructions.
+### SGLang
+#### Docker Image
+We also provide a pre-built Docker image based on the latest version of SGLang.
+To get started:
+- Pull the Docker image
+```
+docker pull xxx
+```
+- Start the API server:
+```
+docker run --gpus all \
+    --shm-size 32g \
+    -p 30000:30000 \
+    --ipc=host \
+    xxx \
+    python3 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
+```
+#### Source Code
+The necessary integration has already been merged into the main branch via [this PR](https://github.com/sgl-project/sglang/pull/7549).
+Once you have cloned or updated your local SGLang repository, build it using the standard SGLang setup process, then start the API server:
+```
+python3 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
+```
+### TensorRT-LLM
+#### Docker Image
+We also provide a pre-built Docker image based on the latest version of TensorRT-LLM.
+To get started:
+- Pull the Docker image
+```
+docker pull xxx
+```
+- Start the API server:
+```
+docker run --gpus all \
+    --shm-size 32g \
+    -p 30000:30000 \
+    --ipc=host \
+    xxx \
+    python3 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
+```
+#### Source Code
+The necessary integration has already been merged into the main branch via this PR (xxx).
+Once you have cloned or updated your local TensorRT-LLM repository, you can build and run the API server using the standard TensorRT-LLM setup process.
+## Inference Performance
+This section presents the efficiency test results of deploying various models using vLLM, including inference speed (tokens/s) under different batch sizes.
+Evaluation Script:
+```bash
+python3 benchmark_throughput.py --backend vllm \
+         --input-len 2048 \
+         --output-len 14336 \
+         --model $MODEL_PATH \
+         --tensor-parallel-size $TP \
+         --use-v2-block-manager \
+         --async-engine \
+         --trust-remote-code \
+         --num_prompts $BATCH_SIZE \
+         --max-num-seqs $BATCH_SIZE
+```
+| Inference Framework | Model                                    | Number of GPUs (GPU productA) | input_length | batch=1 | batch=16 | batch=32 |
+|---------------------|------------------------------------------|-------------------------------|--------------|---------|----------|----------|
+| vLLM                | Hunyuan-A13B-Instruct                    | 8                             | 2048         | 190.84  | 1246.54  | 1981.99  |
+| vLLM                | Hunyuan-A13B-Instruct                    | 4                             | 2048         | 158.90  | 779.10   | 1301.75  |
+| vLLM                | Hunyuan-A13B-Instruct                    | 2                             | 2048         | 111.72  | 327.31   | 346.54   |
+| vLLM                | Hunyuan-A13B-Instruct (int8 weight only) | 2                             | 2048         | 109.10  | 444.17   | 721.93   |
+| vLLM                | Hunyuan-A13B-Instruct (W8A8C8-FP8)       | 2                             | 2048         | 91.83   | 372.01   | 617.70   |
+| vLLM                | Hunyuan-A13B-Instruct (W8A8C8-FP8)       | 1                             | 2048         | 60.07   | 148.80   | 160.41   |
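When reading the table, it can help to compare how throughput grows with batch size. The sketch below copies the first three rows (Hunyuan-A13B-Instruct on 8, 4, and 2 GPUs) and derives two figures: the speedup of batch=32 over batch=1, and the average throughput each sequence in the batch receives. The helper names are illustrative, and the derived figures are simple arithmetic on the published numbers, not additional measurements.

```python
# Throughput in tokens/s, copied from the first three rows of the table above.
# Keys are GPU counts; values map batch size to measured throughput.
THROUGHPUT = {
    8: {1: 190.84, 16: 1246.54, 32: 1981.99},
    4: {1: 158.90, 16: 779.10, 32: 1301.75},
    2: {1: 111.72, 16: 327.31, 32: 346.54},
}


def batch_speedup(row, batch):
    """Aggregate throughput at `batch` relative to batch=1."""
    return row[batch] / row[1]


def per_sequence_throughput(row, batch):
    """Average tokens/s that each sequence in the batch receives."""
    return row[batch] / batch


for gpus, row in THROUGHPUT.items():
    print(
        f"{gpus} GPUs: batch=32 is {batch_speedup(row, 32):.2f}x batch=1, "
        f"{per_sequence_throughput(row, 32):.2f} tokens/s per sequence"
    )
```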
+## Contact Us
+If you would like to leave a message for our R&D and product teams, you are welcome to contact our open-source team. You can also reach us by email at hunyuan_opensource@tencent.com.