FP8 please
Any chance of an FP8 variant, like you did for the Qwen3-30B-A3B models? It would be much appreciated.
Super!
Is there any recipe for producing a correct FP8 quant of Qwen3-Next that preserves the MTP draft model? llmcompressor? It looks like FP8 Qwen3-Next-80B-A3B produces no accepted tokens:
[metrics.py:96] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 69.20 tokens/s, Accepted: 0 tokens, Drafted: 692 tokens, Per-position acceptance rate: 0.000, 0.000, Avg Draft acceptance rate: 0.0%
I'm using this model: https://huggingface.co/DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic
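For context, I launch it roughly like this (offline-API sketch; the `qwen3_next_mtp` method and two speculative tokens follow the Qwen3-Next model card, and the tensor parallel size is just what fits my GPUs):

```python
from vllm import LLM, SamplingParams

# Rough equivalent of my serve setup; adjust tensor_parallel_size to your hardware.
llm = LLM(
    model="DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic",
    tensor_parallel_size=4,
    # MTP-based speculative decoding, as recommended in the Qwen3-Next model card.
    speculative_config={"method": "qwen3_next_mtp", "num_speculative_tokens": 2},
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

With this setup the SpecDecoding metrics above show 0 accepted tokens.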
Is it possible to support correct FP8 quantization of the draft model for spec decoding? Any recipe for llmcompressor would be highly appreciated!
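For reference, this is the kind of llmcompressor recipe I have in mind — just a sketch: the `ignore` pattern for the MTP/draft layers is a guess and needs to be checked against the actual module names in the checkpoint, and MoE router/gate layers may also need to stay in higher precision:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Data-free FP8 dynamic quantization of the Linear layers, skipping lm_head and
# (hopefully) the MTP draft module. The "re:.*mtp.*" pattern is a guess --
# inspect model.named_modules() for the real names before running this.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mtp.*"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

I don't know whether keeping the MTP layers unquantized is enough to restore a non-zero acceptance rate, but that's what I'd like to try.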
Currently Qwen3-Next FP8 support is not good; there are lots of bugs.
Maybe we should wait for the official FP8 release.
Unfortunately, it doesn't work with vLLM when CPU offloading is enabled.
RuntimeError: Worker failed with error 'Cannot re-initialize the input batch when CPU weight offloading is enabled. See https://github.com/vllm-project/vllm/pull/18298 for more details.'
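To clarify what I mean by CPU offloading — roughly this kind of setup (the offload size is just an example; the error appears whenever offloading is enabled):

```python
from vllm import LLM

# Offloading part of the weights to host RAM via cpu_offload_gb is what
# triggers the input-batch re-initialization error above.
llm = LLM(
    model="DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic",
    cpu_offload_gb=16,  # example value
)
```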
OK, I'll wait for llama.cpp to add Qwen3-Next support and then for quantized GGUFs to become available.
In any case, thanks to the Qwen team for their models!
vLLM v0.10.2 does not support Qwen3-Next FP8.
Maybe the latest version can run it.
I ran it with the latest nightly (pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly), per the recommendations in the model description. vLLM v0.10.2 indeed fails earlier with a quantization error.
