Performance report on RTX 4090D with 48GB VRAM: 16 t/s
#1 opened by SlavikF
I'm running this model on vLLM 0.11.0 + OpenWebUI.
GPU: Nvidia RTX 4090D 48GB VRAM
Running on Ubuntu 24 with this Docker Compose file:
services:
  qwen3vl-32b:
    image: vllm/vllm-openai:v0.11.0
    container_name: qwen3vl-32b-4090D
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']
    ports:
      - "36000:8000"
    environment:
      TORCH_CUDA_ARCH_LIST: "8.9"
    volumes:
      - /home/slavik/.cache:/root/.cache
    ipc: host
    command:
      - "--model"
      - "Qwen/Qwen3-VL-32B-Thinking-FP8"
      - "--max-model-len"
      - "26112"
      - "--served-model-name"
      - "local-qwen3vl-32b"
      - "--dtype"
      - "float16"
      - "--gpu-memory-utilization"
      - "0.99"
      - "--max-num-seqs"
      - "2"
      - "--reasoning-parser"
      - "deepseek_r1"
It takes 3-4 minutes to start.
nvtop shows about 42 GB of VRAM in use.
I can only use a 26112-token context; I get OOM with any higher value.
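For a rough sense of where the memory goes, here is a back-of-the-envelope KV-cache estimate. The architecture numbers are assumptions based on the Qwen3-32B text config (64 layers, 8 KV heads, head dim 128), so check the model's config.json before trusting them; vLLM keeps the KV cache in the model --dtype (float16 here) unless --kv-cache-dtype is set.

```python
# Rough KV-cache sizing sketch. The architecture numbers below are
# assumptions (Qwen3-32B-style GQA); verify against the model's config.json.
num_layers     = 64   # assumed num_hidden_layers
num_kv_heads   = 8    # assumed num_key_value_heads (GQA)
head_dim       = 128  # assumed head dimension
kv_dtype_bytes = 2    # float16 KV cache (the --dtype above)

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes  # K and V
context_len = 26112

kv_gib = bytes_per_token * context_len / 1024**3
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")
print(f"KV cache for {context_len} tokens: {kv_gib:.1f} GiB")
# ~0.25 MiB/token -> ~6.4 GiB for 26112 tokens, on top of roughly 33 GB of
# FP8 weights plus the vision encoder and runtime overhead, which lines up
# with the ~42 GB reported by nvtop.
```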
Prompt Processing: 2000 t/s
Token Generation: 16 t/s
When I run Qwen3-VL-30B-A3B-Thinking-FP8, I get ~90 t/s token generation.
So, in my case, the dense model is 5-6x slower than the MoE model, and I can fit a much smaller context.
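In case anyone wants to reproduce the TG number against this setup, below is a minimal sketch that hits the vLLM OpenAI-compatible endpoint from the compose file above and divides generated tokens by wall time. The port (36000) and served model name come from the config; the prompt and max_tokens are just placeholders.

```python
# Crude generation-throughput check against the vLLM OpenAI-compatible
# endpoint defined above. Assumes the compose file's port mapping (36000)
# and --served-model-name; the prompt is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:36000/v1", api_key="unused")

t0 = time.time()
resp = client.chat.completions.create(
    model="local-qwen3vl-32b",
    messages=[{"role": "user", "content": "Describe this setup in 300 words."}],
    max_tokens=512,
)
elapsed = time.time() - t0

# completion_tokens includes the thinking tokens emitted before the answer.
gen_tokens = resp.usage.completion_tokens
print(f"{gen_tokens} tokens in {elapsed:.1f}s -> {gen_tokens / elapsed:.1f} t/s")
# Note: elapsed also includes prompt processing, so with a short prompt this
# lands slightly below the pure token-generation rate.
```

For multimodal tests you would send an image_url content part instead, but a text-only prompt is enough to get a generation-rate number.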