Great Quant!
This is one of the better quants of this model that I've seen - and I've tried many. Do you mind sharing what your approach was?
You can see the reproduction instructions in the model card!
Thanks for the nice message! Glad to hear it!
For others who are interested, here's how I'm running it with Docker Compose:
```yaml
version: '3.8'
services:
  vllm-awq:
    image: vllm/vllm-openai:nightly
    container_name: vllm-server-awq
    ports:
      - "8004:8000"  # Different port for AWQ
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - VLLM_USE_TRTLLM_ATTENTION=1  # Massive prompt processing speedup
      - HF_TOKEN=${HF_TOKEN}  # Set your Hugging Face token in .env file
    volumes:
      - ./models:/models
      - ./cache:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: [
      "--dtype", "half",                     # Required for AWQ
      "--enable-auto-tool-choice",           # Required for tool calling to work
      "--gpu-memory-utilization", "0.9",     # Higher utilization for better performance
      "--host", "0.0.0.0",
      "--kv-cache-dtype", "fp8",             # FP8 KV cache for memory optimization
      "--max-model-len", "200000",           # 200K context
      "--max-num-batched-tokens", "16384",   # Higher batched token processing
      "--max-num-seqs", "6",                 # Should be enough to saturate GPU
      "--model", "nm-testing/Qwen3-Coder-30B-A3B-Instruct-W4A16-awq",  # Coding-optimized 30B compressed-tensors model
      "--port", "8000",
      "--quantization", "compressed-tensors",  # Model uses compressed-tensors quantization
      "--served-model-name", "gpt-4",        # Override model name for API compatibility
      "--tool-call-parser", "qwen3_coder"    # Tool call parser for function calling
    ]
    restart: unless-stopped
    shm_size: '2gb'
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
```
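To sanity-check the endpoint once the container is up, something like this works from Python. The base URL and the `gpt-4` model name come straight from the port mapping and `--served-model-name` above; the API key, prompt, and example tool are just placeholders:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server from the compose file above.
# Port 8004 and model name "gpt-4" match the config; the API key, prompt, and
# example tool definition are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8004/v1",  # host port from the compose mapping
    api_key="not-needed",                 # vLLM only checks this if --api-key is set
)

# Illustrative tool to exercise --enable-auto-tool-choice and the qwen3_coder parser.
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the workspace",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4",  # matches --served-model-name
    messages=[{"role": "user", "content": "Open README.md and summarize it."}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```

If the tool-call parser is doing its job, the reply should come back as structured `tool_calls` rather than raw text.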
I'm getting about 40-80 t/s per request with up to 8 concurrent calls, around 300 t/s aggregate. That's at lower context, of course; my 5090 only supplies enough KV cache for two ~100K-token threads at a time. This is perfect for my various programming needs. So far I haven't seen any of the odd behaviors I ran into with the Q4_K_M variants from Unsloth and friends, where they get stuck in a tool-call loop and/or spit out duplicate lines as part of a tool call. This will probably be my daily driver for a while.
Many thanks!
Just updated the Docker Compose file above. I did some testing and was able to get over 4K generated tokens/second with 100 concurrent sequences (at a couple hundred tokens of context each), and it can process over 40K prompt tokens per second. After some tuning (included in the changes above), on an RTX 5090 I'm getting about 120 tokens/second in a single session and up to 15K prompt-processing tokens/second.
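For anyone who wants to reproduce this kind of measurement on their own hardware, a rough sketch along these lines is enough to get an aggregate number. The request count, prompt, and token limit here are arbitrary placeholders, not the exact settings behind the numbers above:

```python
# Rough throughput sketch: fire N concurrent requests at the server and report
# aggregate completion tokens/second. Request count, prompt, and max_tokens are
# arbitrary placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8004/v1", api_key="not-needed")

NUM_REQUESTS = 32   # concurrent sequences to keep in flight
MAX_TOKENS = 256    # completion length per request


async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="gpt-4",  # matches --served-model-name
        messages=[{"role": "user", "content": "Write a short Python function that reverses a string."}],
        max_tokens=MAX_TOKENS,
    )
    return resp.usage.completion_tokens


async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} completion tokens in {elapsed:.1f}s -> {total / elapsed:.0f} tok/s aggregate")


asyncio.run(main())
```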
Running this in Qwen Code (CLI) is a dream; working with task agents is exceptionally effective. I would consider this setup very close to a Claude Code replacement. The main difference is that you really have to watch it and make sure it's not lying to you - which Claude does as well, just not as much.
@Bellesteck thank you for the information and kind words! We liked it so much we made a tweet out of it -- https://x.com/RedHat_AI/status/1972730630288105642