too slow on cpu
1. enter your project folder and venv
cd ~/qwen-demo
source .venv/bin/activate
2. drop the fixed streaming script into place
cat > chat_stream_cpu.py << 'EOF'
#!/usr/bin/env python3
import os, sys, signal

os.environ["CUDA_VISIBLE_DEVICES"] = ""         # stay on CPU
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # mute HF warning

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

MODEL = "Intel/Qwen3-Next-80B-A3B-Instruct-int4-mixed-AutoRound"

print("Loading tokenizer …")
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

print("Loading model (this will take a while on CPU) …")
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype="auto",
    device_map="cpu",
    trust_remote_code=True
)

history = []

def one_turn(user: str):
    history.append({"role": "user", "content": user})
    prompt = tok.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    inputs = tok([prompt], return_tensors="pt")
    streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    print("Assistant: ", end="", flush=True)
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tok.eos_token_id,
        streamer=streamer
    )
    # keep only the newly generated tokens for the history entry
    text = tok.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
    history.append({"role": "assistant", "content": text})
    print()
    if len(history) > 20:  # keep last 10 exchanges
        del history[:2]

def main():
    print("Type 'quit' or Ctrl-D to exit.\n" + "-"*50)
    while True:
        try:
            user = input("\nYou: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nGood-bye!"); break
        if user.lower() in {"quit", "exit", "q"}:
            print("Good-bye!"); break
        if not user:
            continue
        one_turn(user)

if __name__ == "__main__":
    signal.signal(signal.SIGINT, lambda *_: sys.exit(0))
    main()
EOF
3. make it executable and run
chmod +x chat_stream_cpu.py
./chat_stream_cpu.py
i was running it like this but i am getting 0.1 t/sec, am i doing anything wrong?
How much RAM do you have? It looks like you need 64GB minimum just for the weights, and even then you might be paging to disk with context + system stuff.
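A quick way to check whether it is actually paging to disk while generating (standard Linux tools, nothing model-specific):

free -h     # watch the "Swap: used" column while a reply is streaming
vmstat 1    # non-zero si/so columns mean pages are moving to/from swap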
256GB with dual Xeon E5-2680 v4 CPUs and a 12GB RTX card
Hey what are you two doing over here? Get back to the ik_llama.cpp GGUFs! ;p lol... (just kidding!)
I too am trying to figure out if vLLM/sglang supports hybrid CPU+GPU inference and which Qwen3-Next-80B-A3B quantization will work best for that. I assume int4 might be good for CPU assuming good SIMD implementations exist, but honestly no idea.
Supposedly there is some --cpu_offload_gb 24.0 thing to put some of the weights on CPU but leave the rest in VRAM?
Given it will likely be a little while before any ik/llama.cpp support lands for GGUFs, might as well try to figure this out now.
Honestly I have not tried either vllm/sglang for anything but small models that fit in vram batched, heh.
My memory is that vllm is funny with quantization. FP8 and BF16 are its first class citizens since they're fastest. But you might have more luck with Aphrodite, a vllm fork more tailored to consumer GPUs:
https://aphrodite.pygmalion.chat/installation/installation-cpu/
https://aphrodite.pygmalion.chat/usage/openai/#command-line-arguments-for-the-server
Though so far I only see pure CPU engines, or more primitive shuffling of weights between CPU and GPU.
@gopi87 You might have a NUMA issue? Try limiting the affinity to just one CPU.
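For example, something like this pins both the threads and the memory allocations to a single NUMA node (node 0 is only an assumption here; check your layout first):

numactl --hardware                                        # show nodes and per-node free memory
numactl --cpunodebind=0 --membind=0 ./chat_stream_cpu.py  # keep threads and allocations on node 0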
thanks guys
i am just playing around with vllm, trying to run it with 12GB on the GPU and 180GB on the CPU, using Kimi K2 to patch it
even it has started to troll me:
"If the gods are kind you will now get
INFO: Started server process [pid]
INFO: Application startup complete".
#install vllm
pip install uv
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install bitsandbytes
MAX_JOBS=16 uv pip install git+https://github.com/vllm-project/vllm.git
#patch work
# run this inside the same venv you start vllm from
python - <<'PY'
# strip the "cpu_offload_gb must be 0" assert out of the v1 GPU model runner
import site, os, re
site_pkg = site.getsitepackages()[0]
f = os.path.join(site_pkg, "vllm", "v1", "worker", "gpu_model_runner.py")
with open(f) as fh: s = fh.read()
s = re.sub(r'assert self\.cache_config\.cpu_offload_gb == 0,.*?\)', '', s, flags=re.S)
with open(f, "w") as fh: fh.write(s)
print("patched →", f)
PY
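A quick way to confirm the patch took (the site-packages path below is just an example; it depends on your Python version):

grep -n "cpu_offload_gb == 0" .venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py || echo "assert removed"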
#run this model
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve \
  Qwen/Qwen3-Next-80B-A3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --dtype float16 \
  --cpu-offload-gb 180 \
  --enforce-eager \
  --max-num-seqs 2 \
  --max-num-batched-tokens 4096
#result
on 12GB VRAM and 256GB RAM, getting around 0.6 t/sec
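once the startup-complete lines appear, a quick smoke test against the OpenAI-compatible endpoint might look like this (port and model name as served above):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
        "messages": [{"role": "user", "content": "Say hi in one sentence."}],
        "max_tokens": 64
      }'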
Yeah, it must be shuffling weights. You will have to use vllm's or Aphrodite's CPU-only engine.
this one is working very well at f16, still working on optimizing the speed
hey, you went from native pytorch at 0.1 tok/sec up to vllm at 0.6 tok/sec, so that is already 6x the speed!
more action over on this hf repo quant suggests --cpu-offload-gb is not actually working with vLLM, so maybe no hybrid CPU+GPU is possible yet?
i am using the hybrid, it is possible in vllm with the patch work i mentioned above.
The model running with 512 Experts requires approximately 320 GB of memory.
Too big for me, heh.
I'm not familiar with KTransformers, but looks like they typically rely on GGUF for quantization.
has anyone tried this?
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Qwen3-Next.md
i will try this with a 50GB swap file
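in case the swap route is needed, creating the file on Linux usually looks something like this (path and size are just an example):

sudo fallocate -l 50G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile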
last time I used ktransformers (my old guide is still up here, but very outdated), it supported mmap() similar to ik/llama.cpp.
so you might be able to load it without an explicit swap file, as mmap() will be Read-Only and not accidentally make a bunch of writes to your SSD. i have an old video using this "troll rig" technique shown here: https://www.youtube.com/watch?v=4ucmn3b44x4
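to illustrate the idea (toy sketch only, not ktransformers internals): a read-only mapping lets the kernel page weights in from disk on demand and drop clean pages under memory pressure, so nothing ever gets written back to the SSD:

import mmap

# "model.safetensors" is a placeholder path; any large read-only file works
with open("model.safetensors", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # read-only mapping (Unix)
    first_bytes = mm[:8]  # touching bytes faults the page in from disk; dropping it later costs no write
    mm.close()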
good luck getting it to run!
thanks for sharing this sir.
i just tested with a tiny one, will try the big one now.
Could you guys tell me how you were able to run it fully on CPU without relying on the GPU? I only have 128GB DDR5.