Inference with llama.cpp + Open WebUI gives repeating `?`

#1
by whoisjeremylam - opened

Is there a specific build of llama.cpp that should be used to support AutoRound?

This is the command:

CUDA_VISIBLE_DEVICES=1 \
~/llama.cpp/build/bin/llama-server \
  -t 23 \
  -m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
  --alias Ling-flash \
  --no-mmap \
  --host 0.0.0.0 \
  --port 5000 \
  -c 13056 \
  -ngl 999 \
  -ub 4096 -b 4096

llama.cpp build from main:

$ git rev-parse --short HEAD
6de8ed751


Same here.
Latest llama.cpp (GitHub master), freshly built on Ubuntu + CUDA, using the llama.cpp built-in UI.
It returns a repeating '?' no matter what the prompt is.
Otherwise it works fine with other models.

CPU works fine, but CUDA has issues; we’re investigating the root cause.
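
For anyone who wants to check the CPU path themselves, a minimal sketch (same model path as the command above; -ngl 0 keeps all layers on the CPU, and clearing CUDA_VISIBLE_DEVICES is just one way to hide the GPU):

CUDA_VISIBLE_DEVICES="" \
~/llama.cpp/build/bin/llama-server \
  -m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
  --alias Ling-flash \
  --host 0.0.0.0 \
  --port 5000 \
  -c 13056 \
  -ngl 0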

Confirmed :( With -ot exps=CPU it works as expected.
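
For reference, here is that workaround applied to the original command; a sketch assuming the same paths. The -ot exps=CPU override keeps the expert tensors on the CPU buffer while the rest stays offloaded to CUDA:

CUDA_VISIBLE_DEVICES=1 \
~/llama.cpp/build/bin/llama-server \
  -t 23 \
  -m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
  --alias Ling-flash \
  --no-mmap \
  --host 0.0.0.0 \
  --port 5000 \
  -c 13056 \
  -ngl 999 \
  -ub 4096 -b 4096 \
  -ot exps=CPU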
