Improve model card with detailed quick start, system requirements, and paper link clarification (#20)
(commit aa08a0da0888d6ba56175389d89b90eec6dc8ebe)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
    	
README.md CHANGED
---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
---

# GLM-4.5

<div align="center">
<img src=https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg width="15%"/>
</div>
<p align="center">
    👋 Join our <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community.
    <br>
    📖 Check out the GLM-4.5 <a href="https://z.ai/blog/glm-4.5" target="_blank">technical blog</a>, <a href="https://arxiv.org/abs/2508.06471" target="_blank">technical report</a>, and <a href="https://zhipu-ai.feishu.cn/wiki/Gv3swM0Yci7w7Zke9E0crhU7n7D" target="_blank">Zhipu AI technical documentation</a>.
    <br>
    📍 Use GLM-4.5 API services on <a href="https://docs.z.ai/guides/llm/glm-4.5">Z.ai API Platform (Global)</a> or <br> <a href="https://docs.bigmodel.cn/cn/guide/models/text/glm-4.5">Zhipu AI Open Platform (Mainland China)</a>.
    <br>
    👉 One click to <a href="https://chat.z.ai">GLM-4.5</a>.
</p>

## Model Introduction

The **GLM-4.5** series models are foundation models designed for intelligent agents. GLM-4.5 has **355** billion total parameters with **32** billion active parameters, while GLM-4.5-Air adopts a more compact design with **106** billion total parameters and **12** billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses.

We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development.

In our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of **63.2**, ranking **3rd** among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at **59.8** while maintaining superior efficiency.

For more evaluation results, showcases, and technical details, please visit our [technical blog](https://z.ai/blog/glm-4.5) or [technical report](https://arxiv.org/abs/2508.06471).

The model code, tool parser, and reasoning parser can be found in the implementations of [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4_moe), [vLLM](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/glm4_moe_mtp.py), and [SGLang](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/glm4_moe.py).

## Model Downloads

You can try the model directly on [Hugging Face](https://huggingface.co/spaces/zai-org/GLM-4.5-Space) or [ModelScope](https://modelscope.cn/studios/ZhipuAI/GLM-4.5-Demo), or download the weights from the links below (or fetch them programmatically, as sketched after the table).

| Model            | Download Links                                                                                                                                  | Model Size | Precision |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|------------|-----------|
| GLM-4.5          | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5)                   | 355B-A32B  | BF16      |
| GLM-4.5-Air      | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air)           | 106B-A12B  | BF16      |
| GLM-4.5-FP8      | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-FP8)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-FP8)           | 355B-A32B  | FP8       |
| GLM-4.5-Air-FP8  | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air-FP8)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air-FP8)   | 106B-A12B  | FP8       |
| GLM-4.5-Base     | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Base)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Base)         | 355B-A32B  | BF16      |
| GLM-4.5-Air-Base | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air-Base)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air-Base) | 106B-A12B  | BF16      |
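To fetch a checkpoint programmatically, a minimal sketch using `huggingface_hub` (assuming the package is installed; substitute any repo id from the table):

```python
from huggingface_hub import snapshot_download

# Download all weight shards into the local Hugging Face cache
# and return the resolved directory path.
local_dir = snapshot_download(repo_id="zai-org/GLM-4.5-Air")
print(local_dir)
```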

## System Requirements

### Inference

We provide minimum and recommended configurations for "full-featured" model inference. The data in the tables below is based on the following conditions:

1. All models use MTP layers and specify `--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4` to ensure competitive inference speed.
2. The `cpu-offload` parameter is not used.
3. The inference batch size does not exceed `8`.
4. All commands are executed on devices that natively support FP8 inference, ensuring both weights and cache are in FP8 format.
5. Server memory must exceed `1T` to ensure normal model loading and operation.

The models can run under the configurations in the table below:

| Model       | Precision | GPU Type and Count   | Test Framework |
|-------------|-----------|----------------------|----------------|
| GLM-4.5     | BF16      | H100 x 16 / H200 x 8 | sglang         |
| GLM-4.5     | FP8       | H100 x 8 / H200 x 4  | sglang         |
| GLM-4.5-Air | BF16      | H100 x 4 / H200 x 2  | sglang         |
| GLM-4.5-Air | FP8       | H100 x 2 / H200 x 1  | sglang         |

Under the configurations in the table below, the models can utilize their full 128K context length:

| Model       | Precision | GPU Type and Count    | Test Framework |
|-------------|-----------|-----------------------|----------------|
| GLM-4.5     | BF16      | H100 x 32 / H200 x 16 | sglang         |
| GLM-4.5     | FP8       | H100 x 16 / H200 x 8  | sglang         |
| GLM-4.5-Air | BF16      | H100 x 8 / H200 x 4   | sglang         |
| GLM-4.5-Air | FP8       | H100 x 4 / H200 x 2   | sglang         |

### Fine-tuning

The code can run under the configurations in the table below using [Llama Factory](https://github.com/hiyouga/LLaMA-Factory):

| Model       | GPU Type and Count | Strategy | Batch Size (per GPU) |
|-------------|--------------------|----------|----------------------|
| GLM-4.5     | H100 x 16          | LoRA     | 1                    |
| GLM-4.5-Air | H100 x 4           | LoRA     | 1                    |

The code can run under the configurations in the table below using [Swift](https://github.com/modelscope/ms-swift):

| Model       | GPU Type and Count | Strategy | Batch Size (per GPU) |
|-------------|--------------------|----------|----------------------|
| GLM-4.5     | H20 (96GiB) x 16   | LoRA     | 1                    |
| GLM-4.5-Air | H20 (96GiB) x 4    | LoRA     | 1                    |
| GLM-4.5     | H20 (96GiB) x 128  | SFT      | 1                    |
| GLM-4.5-Air | H20 (96GiB) x 32   | SFT      | 1                    |
| GLM-4.5     | H20 (96GiB) x 128  | RL       | 1                    |
| GLM-4.5-Air | H20 (96GiB) x 32   | RL       | 1                    |

## Quick Start

Please install the required packages according to `requirements.txt`:

```shell
pip install -r requirements.txt
```

### transformers

Please refer to the `trans_infer_cli.py` code in the `inference` folder.
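As a complement to that script, here is a minimal sketch of chat inference with `transformers` (the prompt and generation length are illustrative, and it assumes enough GPU memory for the chosen checkpoint):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "zai-org/GLM-4.5-Air"  # any checkpoint from the table above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard the weights across available GPUs
)

messages = [{"role": "user", "content": "Briefly introduce GLM-4.5."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```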

### vLLM

+ Both BF16 and FP8 can be started with the following command:

```shell
vllm serve zai-org/GLM-4.5-Air \
    --tensor-parallel-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.5-air
```

If you're using 8x H100 GPUs and encounter insufficient memory when running the GLM-4.5 model, you'll need to add `--cpu-offload-gb 16` (only applicable to vLLM).

If you encounter FlashInfer issues, set `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary workaround. You can also specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to use FlashInfer (different GPUs require different `TORCH_CUDA_ARCH_LIST` values; please check accordingly).
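Once the server is up, a quick smoke test against its OpenAI-compatible endpoint (a sketch assuming the default address `http://localhost:8000/v1`; the API key is an arbitrary placeholder):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; any string works as the key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.5-air",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```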

### SGLang

+ BF16

```shell
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.5-Air \
  --tp-size 8 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.7 \
  --served-model-name glm-4.5-air \
  --host 0.0.0.0 \
  --port 8000
```

+ FP8

```shell
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.5-Air-FP8 \
  --tp-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.7 \
  --disable-shared-experts-fusion \
  --served-model-name glm-4.5-air-fp8 \
  --host 0.0.0.0 \
  --port 8000
```

### Request Parameter Instructions

+ When using `vLLM` and `SGLang`, thinking mode is enabled by default for incoming requests. To disable it, add the `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` parameter.
+ Both frameworks support tool calling. Please use the OpenAI-style tool description format, as in the sketch below.
+ For complete code, please refer to `api_request.py` in the `inference` folder.
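For illustration, a hedged sketch combining both points against a server started as above (the `get_weather` tool and its schema are made up for this example):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# OpenAI-style tool description; get_weather is a hypothetical tool.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.5-air",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    # Disable thinking mode for this request:
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.tool_calls)
```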
         | 

