Performance report on RTX 4090D with 48GB VRAM: 16 t/s

#1
by SlavikF - opened

I'm running this model on vLLM 0.11.0 + OpenWebUI.
GPU: Nvidia RTX 4090D 48GB VRAM

Running on Ubuntu 24 with this docker-compose file:

services:
  qwen3vl-32b:
    image: vllm/vllm-openai:v0.11.0
    container_name: qwen3vl-32b-4090D
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']
    ports:
      - "36000:8000"
    environment:
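      # 8.9 = compute capability of Ada Lovelace cards (RTX 4090 / 4090D)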
      TORCH_CUDA_ARCH_LIST: "8.9"
    volumes:
      - /home/slavik/.cache:/root/.cache
    ipc: host
    command:
      - "--model"
      - "Qwen/Qwen3-VL-32B-Thinking-FP8"
      - "--max-model-len"
      - "26112"
      - "--served-model-name"
      - "local-qwen3vl-32b"
      - "--dtype"
      - "float16"
      - "--gpu-memory-utilization"
      - "0.99"
      - "--max-num-seqs"
      - "2"
      - "--reasoning-parser"
      - "deepseek_r1"

Takes 3-4 minutes to start.
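
Before pointing OpenWebUI at it, a quick request to the models endpoint confirms the server is up and serving the expected name (minimal Python sketch; the port and served model name are taken from the compose file above):

# Check that the vLLM OpenAI-compatible server is up and serving the expected model.
import requests

r = requests.get("http://localhost:36000/v1/models")
print(r.json())  # should list "local-qwen3vl-32b"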

nvtop shows 42 GB of VRAM in use.
I can only use a 26112-token context; I get OOM with any higher value.
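
A rough KV-cache calculation shows why ~26k tokens is about the ceiling once the FP8 weights are loaded (just a sketch; the layer/head numbers below are my assumptions for a 32B dense model, not values read from the config):

# Back-of-the-envelope KV-cache size. All architecture numbers here are
# assumed for illustration, not taken from the actual model config.
num_layers     = 64     # assumed transformer layers
num_kv_heads   = 8      # assumed GQA key/value heads
head_dim       = 128    # assumed per-head dimension
kv_dtype_bytes = 2      # FP16 KV cache

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes  # K + V
context_len = 26112
print(f"~{bytes_per_token * context_len / 1024**3:.1f} GiB KV cache for {context_len} tokens")

If those assumptions are in the right ballpark, the KV cache alone needs roughly 12-13 GB at that context length, which is consistent with longer contexts running out of memory.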

Prompt Processing: 2000 t/s
Token Generation: 16 t/s
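
For anyone wanting to reproduce the generation number, timing a streamed completion against the same endpoint gives a rough figure (sketch only; it counts streamed chunks as tokens, which is approximate):

# Rough token-generation throughput: time a streamed completion and count chunks.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:36000/v1", api_key="none")

start = time.time()
n_chunks = 0
stream = client.chat.completions.create(
    model="local-qwen3vl-32b",
    messages=[{"role": "user", "content": "Write a few paragraphs about GPUs."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        n_chunks += 1
print(f"~{n_chunks / (time.time() - start):.1f} tokens/s (includes prompt-processing time)")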

When I run Qwen3-VL-30B-A3B-Thinking-FP8, I get 90 t/s token generation.

So, in my case, the dense model is 5-6x slower than the MoE, and it fits a much smaller context.
