Inference with llama.cpp + Open WebUI gives repeating `?`

#1
by whoisjeremylam - opened

Is there a specific build of llama.cpp that should be used to support AutoRound?

This is the command:

CUDA_VISIBLE_DEVICES=1 \
~/llama.cpp/build/bin/llama-server \
  -t 23 \
  -m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
  --alias Ling-flash \
  --no-mmap \
  --host 0.0.0.0 \
  --port 5000 \
  -c 13056 \
  -ngl 999 \
  -ub 4096 -b 4096

llama.cpp build from main:

$ git rev-parse --short HEAD
6de8ed751


Same here.
Latest llama.cpp (GitHub master), freshly built on Ubuntu + CUDA, using the llama.cpp built-in UI.
It returns a repeating '?' no matter what the prompt is.
Otherwise it works fine with other models.

CPU works fine, but CUDA has issues; we’re investigating the root cause.
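
For anyone who wants to check the CPU path themselves, a minimal sketch (same model path as the command above; -ngl 0 keeps all layers on the CPU, and clearing CUDA_VISIBLE_DEVICES is just one way to hide the GPU):

CUDA_VISIBLE_DEVICES="" \
~/llama.cpp/build/bin/llama-server \
  -m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
  --alias Ling-flash \
  --host 0.0.0.0 \
  --port 5000 \
  -c 13056 \
  -ngl 0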

Confirmed :( With -ot exps=CPU it works as expected.
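
For reference, here is that workaround applied to the original command; a sketch assuming the same paths. The -ot exps=CPU override keeps the expert tensors on the CPU buffer while the rest stays offloaded to CUDA:

CUDA_VISIBLE_DEVICES=1 \
~/llama.cpp/build/bin/llama-server \
  -t 23 \
  -m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
  --alias Ling-flash \
  --no-mmap \
  --host 0.0.0.0 \
  --port 5000 \
  -c 13056 \
  -ngl 999 \
  -ub 4096 -b 4096 \
  -ot exps=CPU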
