IQ2_KS

#1
by gghfez - opened

Thanks for doing these, I'm looking forward to trying this model!
Are you doing an IQ2_KS for this one?
(I'm using your IQ2_KS for the previous release with 256GB RAM + 6x24GB VRAM)

Is IQ2_KS good enough for you in terms of quality?

I'll raise my hand for IQ2_KS as well. :-)

Is IQ2_KS good enough for you in terms of quality?

I don't know yet for this one, but for K2, yes. Specifically, ubergarm's IQ2_KS is the only way I can run it locally without it being obviously lobotomized.

That quant/model is able to find logic issues in my fairly bespoke coding projects that Opus 4.1 misses and it's my favorite model for creative writing.

I just tried out the unsloth IQ2_XXS by regenerating the last response in my K2 chats, and it's a lot worse: it misses bugs K2 found, is inattentive for creative writing, etc. It also uses more memory, so I have to place more tensors on CPU.

Hopefully an IQ2_KS will be as great as the K2 one.

Owner

Dealing with some hardware stuff, but I got the imatrix uploaded. I'll prioritize cooking the IQ2_KS first and then do some other sizes.

Thanks and appreciate the feedback!

Owner

Also heads up @Thireus - the new imatrix is up as you saw already, but while using it now I notice it is missing importance weights for the first dense layer's ffn_(gate|down|up) tensors (blk 0 only on Kimi-K2) as well as the shared expert ffn_(gate|down|up)_shexp tensors. Given that, I'll be leaving all of those at full q8_0 for this round, and I'll probably leave the attn tensors at q8_0 as well, since they are a small percentage of the overall weights and the original model seemed quite sensitive to quantization there.

example messages during quantizing:

====== llama_model_quantize_internal: did not find weights for blk.0.ffn_gate.weight
...
====== llama_model_quantize_internal: did not find weights for blk.56.ffn_up_shexp.weight

It seems to have everything it needs for the routed experts, which are the most important ones given we're quantizing those the most.
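
For anyone curious what that translates to in practice, here is a rough sketch of the kind of llama-quantize recipe implied above. The --custom-q syntax, the lowercase iq2_ks type name, and the paths are written from memory of ik_llama.cpp and are illustrative only, so double-check against your build's --help before using it:

# Rough sketch: keep attn, the first dense ffn layer (blk 0), and the shared
# experts at q8_0; quantize only the routed experts down to IQ2_KS.
custom="blk\..*\.attn_.*=q8_0"
custom+=",blk\.0\.ffn_(gate|down|up)\.weight=q8_0"
custom+=",blk\..*\.ffn_(gate|down|up)_shexp\.weight=q8_0"
custom+=",blk\..*\.ffn_(gate|down|up)_exps\.weight=iq2_ks"

./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$custom" \
    /path/to/model-BF16.gguf \
    /path/to/model-IQ2_KS.gguf \
    IQ2_KS 24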

Also I was unable to run imatrix with --layer-importance as it gave this error:

llama_kv_cache_init:        CPU KV buffer size =    34.31 MiB
llama_new_context_with_model: KV self size  =   34.31 MiB, c^KV (f16):   34.31 MiB, kv^T: not used
llama_new_context_with_model:        CPU  output buffer size =     0.63 MiB
llama_new_context_with_model:        CPU compute buffer size =   334.00 MiB
llama_new_context_with_model: graph nodes  = 3340
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 192 (n_threads_batch = 384) / 768 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 551.937 ms
compute_imatrix: computing over 826 chunks with batch_size 512
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
======================================= HAVE_FANCY_SIMD is defined
Oops, inconsistent ffn vs last_input size

This Oops may be related to the missing importance weights above, but I didn't have time to debug it further.

fwiw, I used the triton-cpu method to cast the fp8 safetensors to bf16. Then I used mainline llama.cpp's convert_hf_to_gguf.py, and then switched over to ik_llama.cpp for quantizing the pure q8_0, computing the imatrix from it, and now quantizing the rest from the bf16 GGUF.
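
For anyone wanting to reproduce a similar flow, here's a condensed sketch of that pipeline. Script names, flags, and paths are illustrative (the fp8 cast script in particular is assumed to be a triton-cpu build of the usual fp8_cast_bf16.py), so treat it as a rough map rather than exact commands:

# 1. Cast the original fp8 safetensors to bf16 on CPU (triton-cpu backed cast script)
python fp8_cast_bf16.py \
    --input-fp8-hf-path /models/original-fp8 \
    --output-bf16-hf-path /models/model-bf16

# 2. Convert the bf16 safetensors to a bf16 GGUF with mainline llama.cpp
python llama.cpp/convert_hf_to_gguf.py /models/model-bf16 \
    --outtype bf16 --outfile /models/model-BF16.gguf

# 3. Switch to ik_llama.cpp: make a pure q8_0 to compute the imatrix against
./ik_llama.cpp/build/bin/llama-quantize \
    /models/model-BF16.gguf /models/model-Q8_0.gguf Q8_0

# 4. Compute the imatrix from the q8_0, then quantize the final mixes from the bf16 GGUF
./ik_llama.cpp/build/bin/llama-imatrix \
    -m /models/model-Q8_0.gguf -f calibration_data.txt -o imatrix.dat
./ik_llama.cpp/build/bin/llama-quantize \
    --imatrix imatrix.dat /models/model-BF16.gguf /models/model-IQ2_KS.gguf IQ2_KS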

@ubergarm , thanks for the heads up!

Owner

@gghfez @mtcl @whoisjeremylam

Okie folks, first one is uploaded: IQ2_KS 289.820 GiB (2.425 BPW) !!!

It is a bit heavy for VRAM given the attn/first dense layer/shared expert tensors are all full Q8_0, but that should give the best quality despite the smaller routed experts. I'll have a few more sizes available later today if all goes well.
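
If it helps with sizing, here's a hedged sketch of the sort of ik_llama.cpp launch I'd expect for a mixed VRAM+RAM rig with this quant. The flags are written from memory (and the -ot regex in particular will vary with how many GPUs you split across), so check --help on your build:

# Sketch only: keep attention/dense/shared-expert layers on GPU and push the
# big routed-expert tensors to system RAM via an override-tensor rule.
./build/bin/llama-server \
    --model /path/to/IQ2_KS/model-00001-of-000XX.gguf \
    --ctx-size 32768 \
    -fa -fmoe -mla 3 -amb 512 \
    -ngl 99 \
    -ot exps=CPU \
    --threads 48 \
    --host 127.0.0.1 --port 8080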

Cheers!

With a name like terminus it sounds like this could be the final iteration of the v3 family…

Because V4 or R2 is coming, right, right?!

Because V4 or R2 is coming, right, right?!
I'm hoping it's smaller, or at least not bigger than V3

I was just wondering if anyone had tried running the Moonshot K2 Vendor Verifier (https://github.com/MoonshotAI/K2-Vendor-Verfier) that was just recently released?

ik_llama.cpp gets a strange error -

(I've clipped part of the log from the request)
alternative-1-response-kv ::= ""response"" space ":" space string
char ::= [^"\\x7F\x00-\x1F] | [\] (["\bfnrt] | "u" [0-9a-fA-F]{4})
root ::= alternative-0 | alternative-1
space ::= | " " | "\n"{1,2} [ \t]{0,20}
string ::= """ char* """ space
Grammar lazy: false
Chat format: Generic
INFO [ launch_slot_with_task] slot is processing task | tid="129805448495104" timestamp=1758845652 id_slot=0 id_task=0
INFO [ update_slots] kv cache rm [p0, end) | tid="129805448495104" timestamp=1758845652 id_slot=0 id_task=0 p0=0
terminate called after throwing an instance of 'std::runtime_error'
what(): Invalid diff: '{"response": "根据现有搜索结果,尚未找到关于“工作负载自动化”“CORBA集成”“JCL管理”等类别在大型机环境下具体年度订阅或维护支出的公开数据。因此,无法直接估算每个类别的等值年度支出。\n\n不过,若贵公司整体大型机软件年度开支为1600万美元,且签订的是多年期合同,可按以 (I've clipped the output)

@whoisjeremylam

Interesting, I saw a post on r/LocalLLaMA showing a comparison of API providers serving various "versions" of Kimi-K2 and their performance on this new tool provided by Moonshot.

A few thoughts:

  1. How are you running llama-server, especially regarding stuff like --jinja or not, whether you are using your own myCustomTemplate.jinja chat template, and any other args like --reasoning-format and --reasoning-budget? If you're not using any of those, do the llama-server debug logs show the correct expected template?

  2. How are you calling the Python tool, especially making sure to hit the correct API endpoints (e.g. the /v1/* routes are the OpenAI API compliant ones, pretty sure, and the older endpoints are not behind /v1/*, if I understand the recent ik_llama.cpp changes correctly)? There's a quick curl sanity check sketched below this list.

This is what the GitHub repo you linked shows; if you give me your exact command I might be able to try it:

python tool_calls_eval.py samples.jsonl \
    --model kimi-k2-0905-preview \
    --base-url https://openrouter.ai/api/v1 \
    --api-key YOUR_OPENROUTER_API_KEY \
    --concurrency 5 \
    --extra-body '{"provider": {"only": ["YOUR_DESIGNATED_PROVIDER"]}}'
  3. Does this test expect the model to actually do something with the tool calls, or just set them up correctly? If it just sets them up, then I guess it should be able to test ik_llama.cpp without additional tool-call framework stuff?
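
Re: point 2, a quick way to confirm the server really is answering on the OpenAI-compatible route the verifier will hit (host/port are placeholders, assuming defaults otherwise):

curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "kimi-k2-0905", "messages": [{"role": "user", "content": "Say hi"}]}'
# Getting a normal JSON body with choices[0].message.content back means the /v1
# endpoint and chat template are at least wired up before pointing tool_calls_eval.py at it.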

Would be curious to see how these quants perform on that!

EDIT: A new PR about tool-calling just came in if you want to apply and test and report back on that gh issue thread: https://github.com/ikawrakow/ik_llama.cpp/pull/799

@whoisjeremylam If you get that to run, please post your results here. I'm curious as well.

Interesting, I saw a post on r/LocalLLaMA showing a comparison of API providers serving various "versions" of Kimi-K2 and their performance on this new tool provided by Moonshot.

Yeah that was weird yesterday, it felt like deja vu as it got posted like 3 or 4 times lol.

I don't buy the theory that providers are "running Q2 GGUFs" etc., as there's no way they'd be running llama.cpp at scale.

P.S. has anyone tried adding the new -ooae (offload only activated experts) flag?

Oh, I just realized that the OpenAI API has the newer "Responses API" and the older "Chat Completions API", but they are different, with different behaviors and JSON responses: https://github.com/openai/openai-python?tab=readme-ov-file#usage. This might be important if you are trying to get tool use working with your client.
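
Roughly, the difference on the wire looks like this (a sketch against the hosted OpenAI API; llama.cpp-style servers generally only implement the Chat Completions shape):

# Chat Completions API: POST /v1/chat/completions with a "messages" array;
# the reply comes back under choices[0].message.
curl -s https://api.openai.com/v1/chat/completions \
    -H "Authorization: Bearer $OPENAI_API_KEY" -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hello"}]}'

# Responses API: POST /v1/responses with an "input" field;
# the reply comes back as an "output" list instead.
curl -s https://api.openai.com/v1/responses \
    -H "Authorization: Bearer $OPENAI_API_KEY" -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o-mini", "input": "hello"}'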

-ooae (offload only activated experts) flag?

I've never tried it myself, but the PR suggests it can give a good speedup for some models while slowing down others, depending on how the routed experts are used: https://github.com/ikawrakow/ik_llama.cpp/pull/698
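
If anyone wants to A/B it, the simplest thing is probably to run the same benchmark with and without the flag. A sketch using ik_llama.cpp's sweep bench (binary name and flags from memory, model path a placeholder):

# Same invocation twice, with and without offload-only-activated-experts:
./build/bin/llama-sweep-bench -m /path/to/model.gguf -c 8192 -ngl 99 -ot exps=CPU -fmoe
./build/bin/llama-sweep-bench -m /path/to/model.gguf -c 8192 -ngl 99 -ot exps=CPU -fmoe -ooae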

Right, seemed to have no effect for Kimi-K2 IQ2_KS.

Sorry @ubergarm ! I've been away and then got very busy with work.

  1. How are you running llama-server, especially regarding stuff like --jinja or not, whether you are using your own myCustomTemplate.jinja chat template, and any other args like --reasoning-format and --reasoning-budget? If you're not using any of those, do the llama-server debug logs show the correct expected template?

I'm using --jinja with the built-in chat template.

  2. How are you calling the Python tool, especially making sure to hit the correct API endpoints (e.g. the /v1/* routes are the OpenAI API compliant ones, pretty sure, and the older endpoints are not behind /v1/*, if I understand the recent ik_llama.cpp changes correctly)?

Sure, here is the command that I am using:

python tool_calls_eval.py samples.jsonl \
    --model kimi-k2-0905 \
    --base-url http://192.168.100.200:5000/v1 \
    --api-key not_used \
    --concurrency 1 \
    --output results.jsonl \
    --summary summary.json
  3. Does this test expect the model to actually do something with the tool calls, or just set them up correctly? If it just sets them up, then I guess it should be able to test ik_llama.cpp without additional tool-call framework stuff?

I'm not sure to be honest... I haven't looked into what the script is actually doing...

Would be curious to see how these quants perform on that!

That's a great question that I thought might be answered by running this script!

EDIT: A new PR about tool-calling just came in if you want to apply and test and report back on that gh issue thread: https://github.com/ikawrakow/ik_llama.cpp/pull/799

I did just try today and unfortunately I got a core dump. I've raised an issue, since core dumping presumably isn't good: https://github.com/ikawrakow/ik_llama.cpp/issues/865
