too slow on cpu

#2
by gopi87 - opened

1. enter your project folder and venv

cd ~/qwen-demo
source .venv/bin/activate

2. drop the fixed streaming script into place

cat > chat_stream_cpu.py << 'EOF'
#!/usr/bin/env python3
import os, sys, signal

os.environ["CUDA_VISIBLE_DEVICES"] = ""        # stay on CPU
os.environ["TOKENIZERS_PARALLELISM"] = "false" # mute HF warning

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

MODEL = "Intel/Qwen3-Next-80B-A3B-Instruct-int4-mixed-AutoRound"

print("Loading tokenizer …")
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

print("Loading model (this will take a while on CPU) …")
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype="auto",
    device_map="cpu",
    trust_remote_code=True,
)

history = []

def one_turn(user: str):
    history.append({"role": "user", "content": user})
    prompt = tok.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    inputs = tok([prompt], return_tensors="pt")
    streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)

    print("Assistant: ", end="", flush=True)
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tok.eos_token_id,
        streamer=streamer,
    )
    text = tok.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
    history.append({"role": "assistant", "content": text})
    print()
    if len(history) > 20:  # keep last 10 exchanges
        del history[:2]

def main():
    print("Type 'quit' or Ctrl-D to exit.\n" + "-" * 50)
    while True:
        try:
            user = input("\nYou: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nGood-bye!")
            break
        if user.lower() in {"quit", "exit", "q"}:
            print("Good-bye!")
            break
        if not user:
            continue
        one_turn(user)

if __name__ == "__main__":
    signal.signal(signal.SIGINT, lambda *_: sys.exit(0))
    main()
EOF

3. make it executable and run

chmod +x chat_stream_cpu.py
./chat_stream_cpu.py

I was running it like this, but I am getting 0.1 t/sec. Am I doing anything wrong?
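
(For reference, a rough way to measure that tokens/sec figure is to time the generate() call; this is just a sketch that reuses the tok / model / inputs / streamer variables from chat_stream_cpu.py above.)

import time

start = time.perf_counter()
generated_ids = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tok.eos_token_id,
    streamer=streamer,
)
elapsed = time.perf_counter() - start
new_tokens = generated_ids.shape[-1] - inputs.input_ids.shape[-1]
print(f"\n{new_tokens} new tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.2f} tok/s")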

How much RAM do you have? It looks like you need 64GB minimum just for the weights, and even then you might be paging to disk with context + system stuff.
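
(A quick way to check whether a run is actually dipping into swap is to watch memory while it generates. This is just a sketch; the psutil package is an assumption, and running free -h in another terminal works just as well.)

import time
import psutil  # assumption: pip install psutil

# poll RAM and swap usage every few seconds while the model is generating
while True:
    vm, sw = psutil.virtual_memory(), psutil.swap_memory()
    print(f"RAM {vm.used / 2**30:5.1f}/{vm.total / 2**30:5.1f} GiB | "
          f"swap {sw.used / 2**30:5.1f}/{sw.total / 2**30:5.1f} GiB")
    time.sleep(5)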

256 GB, with dual Xeon E5-2680 v4 CPUs and a 12 GB RTX card.

@gopi87 @Downtown-Case

Hey what are you two doing over here? Get back to the ik_llama.cpp GGUFs! ;p lol... (just kidding!)

I too am trying to figure out if vLLM/sglang supports hybrid CPU+GPU inference and which Qwen3-Next-80B-A3B quantization will work best for that. I assume int4 might be good for CPU assuming good SIMD implementations exist, but honestly no idea.

Supposedly there is some --cpu_offload_gb 24.0 thing to put some of the weights on CPU but leave the rest in VRAM?

Given it will likely be a little while before any ik/llama.cpp support lands for GGUFs, might as well try to figure this out now.

Honestly I have not tried either vLLM or sglang for anything but small models that fit in VRAM for batched inference, heh.

My memory is that vllm is funny with quantization. FP8 and BF16 are its first class citizens since they’re fastest. But you might have more luck with Aphrodite, a vllm fork more tailored to consumer GPUs:

https://aphrodite.pygmalion.chat/installation/installation-cpu/

https://aphrodite.pygmalion.chat/usage/openai/#command-line-arguments-for-the-server

Though so far I only see pure CPU engines, or more primitive shuffling of weights between CPU and GPU.

@gopi87 You might have a NUMA issue? Try limiting the affinity to just one CPU.
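
(For example, something like this at the top of the script before loading the model; just a sketch: the 0-13 core range assumes one 14-core E5-2680 v4 socket, so check lscpu for your actual layout. Running under numactl --cpunodebind=0 --membind=0 is the more usual way to do the same thing.)

import os

# Linux-only: pin this process to the cores of one NUMA node so the weights
# and the threads touching them stay on the same socket's memory
os.sched_setaffinity(0, range(0, 14))  # cores 0-13 = node 0 on this box (assumption)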

@Downtown-Case @ubergarm

thanks guys

I am just playing around with vLLM, trying to run with 12 GB on the GPU and 180 GB offloaded to CPU, using Kimi K2 to patch it.

Even it has started to troll me 😁

"If the gods are kind you will now get
INFO: Started server process [pid]
INFO: Application startup complete".

#install vllm

pip install uv
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install bitsandbytes
MAX_JOBS=16 uv pip install git+https://github.com/vllm-project/vllm.git

#patch work

# run this inside the same venv you start vllm from
python - <<'PY'
import site, os, re

# locate vllm's installed gpu_model_runner.py and strip the assert that
# blocks cpu_offload_gb in the v1 engine
site_pkg = site.getsitepackages()[0]
f = os.path.join(site_pkg, "vllm", "v1", "worker", "gpu_model_runner.py")
with open(f) as fh:
    s = fh.read()
s = re.sub(r'assert self\.cache_config\.cpu_offload_gb == 0,.*?\)', '', s, flags=re.S)
with open(f, "w") as fh:
    fh.write(s)
print("patched →", f)
PY

#run this model
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve \
  Qwen/Qwen3-Next-80B-A3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --dtype float16 \
  --cpu-offload-gb 180 \
  --enforce-eager \
  --max-num-seqs 2 \
  --max-num-batched-tokens 4096

#result
On 12 GB VRAM and 256 GB RAM, getting around 0.6 t/sec.
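
(Once the server reports startup complete, a quick sanity check against the OpenAI-compatible endpoint could look like this; just a sketch, the openai client package is an assumption, and the port and model name match the serve command above.)

from openai import OpenAI  # assumption: pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)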

Yeah, it must be shuffling weights. You will have to use vllm's or Aphrodite's CPU-only engine.

This one is working very well at f16; still working on optimizing the speed.

@gopi87

This one is working very well at f16; still working on optimizing the speed.

hey, you went from native PyTorch at 0.1 tok/sec up to vLLM at 0.6 tok/sec, so that is already a 6x improvement!

more activity over on this HF quant repo suggests --cpu-offload-gb is not actually working with vLLM, so maybe no hybrid CPU+GPU is possible yet?

I am using the hybrid setup; it is possible in vLLM with the patch work I mentioned above.

The model running with 512 Experts requires approximately 320 GB of memory.

Too big for me, heh.

I'm not familiar with KTransformers, but it looks like they typically rely on GGUF for quantization.
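
(For scale on that 320 GB figure: ~80B parameters is roughly 150 GiB at bf16 and roughly 300 GiB at fp32, so it is in the same ballpark as full-precision weights; that reading is a guess, but the arithmetic below is easy to check.)

# back-of-envelope weight sizes for an 80B-parameter model
params = 80e9
for name, bytes_per_param in [("fp32", 4), ("bf16", 2), ("int4", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 2**30:6.0f} GiB")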

Has anyone tried this?
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Qwen3-Next.md

I will try this with a 50 GB swap file.

@gopi87

last time I used ktransformers (my old guide is still up here, but very outdated), it supported mmap() similar to ik/llama.cpp.

so you might be able to load it without an explicit swap file, as mmap() will be Read-Only and not accidentally make a bunch of writes to your SSD. i have an old video using this "troll rig" technique shown here: https://www.youtube.com/watch?v=4ucmn3b44x4
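
(to illustrate the idea, a tiny read-only mmap sketch in Python; the filename is just a placeholder, not anything from ktransformers. pages get faulted in on demand and the kernel can drop them again under memory pressure without ever writing to the SSD.)

import mmap, os

path = "model-weights.bin"  # placeholder file, not a real ktransformers artifact
fd = os.open(path, os.O_RDONLY)
size = os.fstat(fd).st_size
mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)  # read-only mapping, no swap writes
header = mm[:16]  # touching bytes lazily pages them in from disk
mm.close()
os.close(fd)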

good luck getting it to run!

thanks for sharing this sir.

I just tested with a tiny one; will try the big one now.

Could you guys tell me how you were able to run it fully on CPU without relying on the GPU? I have 128 GB DDR5.
