# MiniMax M2 Model vLLM Deployment Guide
[English Version](./vllm_deploy_guide.md) | [Chinese Version](./vllm_deploy_guide_cn.md)
We recommend using [vLLM](https://docs.vllm.ai/en/stable/) to deploy the [MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) model. vLLM is a high-performance inference engine with excellent serving throughput, efficient and intelligent memory management, powerful batch request processing capabilities, and deeply optimized underlying performance. We recommend reviewing vLLM's official documentation to check hardware compatibility before deployment.
## Applicable Models
This document applies to the following models. You only need to change the model name during deployment.
- [MiniMaxAI/MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2)
The deployment process is illustrated below using MiniMax-M2 as an example.
## System Requirements
- OS: Linux
- Python: 3.9 - 3.12
- GPU:
  - Compute capability 7.0 or higher
  - Memory requirements: 220 GB for weights, plus 240 GB per 1M context tokens
The following are recommended configurations; actual requirements should be adjusted based on your use case:
- 4x 96GB GPUs: supports a context length of up to 400K tokens.
- 8x 144GB GPUs: supports a context length of up to 3M tokens.
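As a back-of-the-envelope check of these recommendations (assuming memory scales linearly with context length and ignoring activation memory and vLLM overhead), the arithmetic works out as follows:
```bash
# Rough sizing check based on the figures above (approximate).
WEIGHTS_GB=220          # model weights
KV_GB_PER_M_TOKENS=240  # memory per 1M context tokens

# 4 x 96 GB GPUs with a 400K-token context:
echo "$(( 4 * 96 - WEIGHTS_GB )) GB left after weights"             # 164 GB
echo "$(( 400 * KV_GB_PER_M_TOKENS / 1000 )) GB needed for context" # 96 GB -> fits

# 8 x 144 GB GPUs with a 3M-token context:
echo "$(( 8 * 144 - WEIGHTS_GB )) GB left after weights"            # 932 GB
echo "$(( 3 * KV_GB_PER_M_TOKENS )) GB needed for context"          # 720 GB -> fits
```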
## Deployment with Python
It is recommended to use a virtual environment (such as **venv**, **conda**, or **uv**) to avoid dependency conflicts.
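For example, a fresh environment can be created with uv (the environment name `vllm-m2` and Python version below are only examples):
```bash
# Create and activate an isolated environment
uv venv vllm-m2 --python 3.12
source vllm-m2/bin/activate
```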
We recommend installing vLLM in a fresh Python environment:
```bash
uv pip install 'triton-kernels @ git+https://github.com/triton-lang/triton.git@v3.5.0#subdirectory=python/triton_kernels' vllm --extra-index-url https://wheels.vllm.ai/nightly --prerelease=allow
```
Run one of the following commands to start the vLLM server. On first run, vLLM will automatically download and cache the MiniMax-M2 model from Hugging Face.
4-GPU deployment command:
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
MiniMaxAI/MiniMax-M2 --trust-remote-code \
--tensor-parallel-size 4 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
```
8-GPU deployment command:
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
MiniMaxAI/MiniMax-M2 --trust-remote-code \
--enable-expert-parallel --tensor-parallel-size 8 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
```
## Testing Deployment
After startup, you can test the vLLM OpenAI-compatible API with the following command:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M2",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
```
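You can also quickly confirm the server is up by listing the models it serves; vLLM exposes the standard OpenAI-compatible `/v1/models` endpoint:
```bash
# Should return a model list containing "MiniMaxAI/MiniMax-M2"
curl http://localhost:8000/v1/models
```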
## Common Issues
### Hugging Face Network Issues
If you encounter network issues when pulling the model, you can point the Hugging Face client at a mirror endpoint before starting the server:
```bash
export HF_ENDPOINT=https://hf-mirror.com
```
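If the mirror alone is not sufficient, another option is to download the weights ahead of time and pass the local path to `vllm serve` instead of the Hub model name (the target directory below is only an example):
```bash
export HF_ENDPOINT=https://hf-mirror.com
# Pre-download the weights to a local directory, then start the server with
# `vllm serve ./MiniMax-M2 ...` using that path.
huggingface-cli download MiniMaxAI/MiniMax-M2 --local-dir ./MiniMax-M2
```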
### MiniMax-M2 model is not currently supported
This error means the installed vLLM version is too old to recognize MiniMax-M2. Upgrade to the latest version.
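For example, you can check the installed version and upgrade using the same nightly index as the install command above:
```bash
# Check the currently installed vLLM version
python -c "import vllm; print(vllm.__version__)"
# Upgrade to the latest build from the same index used during installation
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly --prerelease=allow
```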
### torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Add `--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}"` to the startup parameters to resolve this issue. For example:
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
MiniMaxAI/MiniMax-M2 --trust-remote-code \
--enable-expert-parallel --tensor-parallel-size 8 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}"
```
## Getting Support
If you encounter any issues while deploying the MiniMax model:
- Contact our technical support team through official channels such as email at [model@minimax.io](mailto:model@minimax.io)
- Submit an issue on our [GitHub](https://github.com/MiniMax-AI) repository
We continuously optimize the deployment experience for our models. Feedback is welcome!