Great Quant!
This is one of the better quants of this model that I've seen - and I've tried many. Do you mind sharing what your approach was?
You can see the reproduction instructions in the model card!
Thanks for the nice message! Glad to hear it!
For others who are interested, here's how I'm running it with Docker Compose:
```yaml
version: '3.8'
services:
  vllm-awq:
    image: vllm/vllm-openai:nightly
    container_name: vllm-server-awq
    ports:
      - "8004:8000"  # Different port for AWQ
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - VLLM_USE_TRTLLM_ATTENTION=1  # Massive prompt processing speedup
      - HF_TOKEN=${HF_TOKEN}  # Set your Hugging Face token in .env file
    volumes:
      - ./models:/models
      - ./cache:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: [
      "--dtype", "half",                     # Required for AWQ
      "--enable-auto-tool-choice",           # Required for tool calling to work
      "--gpu-memory-utilization", "0.9",     # Higher utilization for better performance
      "--host", "0.0.0.0",
      "--kv-cache-dtype", "fp8",             # FP8 KV cache for memory optimization
      "--max-model-len", "200000",           # 200K context
      "--max-num-batched-tokens", "16384",   # Higher batched token processing
      "--max-num-seqs", "6",                 # Should be enough to saturate GPU
      "--model", "nm-testing/Qwen3-Coder-30B-A3B-Instruct-W4A16-awq",  # Coding-optimized 30B compressed-tensors model
      "--port", "8000",
      "--quantization", "compressed-tensors",  # Model uses compressed-tensors quantization
      "--served-model-name", "gpt-4",        # Override model name for API compatibility
      "--tool-call-parser", "qwen3_coder"    # Tool call parser for function calling
    ]
    restart: unless-stopped
    shm_size: '2gb'
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
```
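To sanity-check the endpoint once the container is up, something like this works from Python. The base URL and the `gpt-4` model name come straight from the port mapping and `--served-model-name` above; the API key, prompt, and example tool are just placeholders:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server from the compose file above.
# Port 8004 and model name "gpt-4" match the config; the API key, prompt, and
# example tool definition are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8004/v1",  # host port from the compose mapping
    api_key="not-needed",                 # vLLM only checks this if --api-key is set
)

# Illustrative tool to exercise --enable-auto-tool-choice and the qwen3_coder parser.
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the workspace",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4",  # matches --served-model-name
    messages=[{"role": "user", "content": "Open README.md and summarize it."}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```

If the tool-call parser is doing its job, the reply should come back as structured `tool_calls` rather than raw text.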
I'm getting about 40-80 t/s per request with up to 8 concurrent calls, around 300 t/s aggregate. That's at lower context, of course; my 5090 only supplies enough KV cache for two ~100K-token threads at a time. This is perfect for my various programming needs. So far I haven't seen any of the odd behaviors I ran into with the Q4_K_M variants from Unsloth and friends, where they get stuck in a tool-call loop and/or spit out duplicate lines as part of a tool call. This will probably be my daily driver for a while.
Many thanks!
Just updated the Docker Compose file above. I did some testing and was able to get over 4K generated tokens/second with 100 concurrent sequences (at a couple hundred tokens of context each), and it can process over 40K prompt tokens per second. After some tuning (included in the changes above), on an RTX 5090 I'm getting about 120 tokens/second in a single session and up to 15K prompt-processing tokens/second.
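For anyone who wants to reproduce this kind of measurement on their own hardware, a rough sketch along these lines is enough to get an aggregate number. The request count, prompt, and token limit here are arbitrary placeholders, not the exact settings behind the numbers above:

```python
# Rough throughput sketch: fire N concurrent requests at the server and report
# aggregate completion tokens/second. Request count, prompt, and max_tokens are
# arbitrary placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8004/v1", api_key="not-needed")

NUM_REQUESTS = 32   # concurrent sequences to keep in flight
MAX_TOKENS = 256    # completion length per request


async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="gpt-4",  # matches --served-model-name
        messages=[{"role": "user", "content": "Write a short Python function that reverses a string."}],
        max_tokens=MAX_TOKENS,
    )
    return resp.usage.completion_tokens


async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} completion tokens in {elapsed:.1f}s -> {total / elapsed:.0f} tok/s aggregate")


asyncio.run(main())
```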
Running this in Qwen Code (CLI) is a dream; working with task agents is exceptionally effective. I would consider this setup very close to a Claude Code replacement. The main difference is that you really have to watch it and make sure it's not lying to you - which Claude does as well, just not as much.
@Bellesteck thank you for the information and kind words! We liked it so much we made a tweet out of it -- https://x.com/RedHat_AI/status/1972730630288105642