Playing with ik_llama params for IQ4_KS_R4 on RTX 5090

#17 · opened by sousekd

In response to the discussion here, I have been experimenting a bit with some ik_llama params. Sharing for discussion and for RTX 5090 owners chasing the best performance.

All tests were done on an Epyc 9355 with a single RTX 5090 running Windows, on a fairly recent build of ik_llama:

PS>  .\bin\llama-server --version
version: 3772 (5236c98b)
built with Clang 19.1.5

All tests use -amb 512 and -ctk f16, as lowering these brought neither a meaningful performance gain nor the ability to offload more layers to the GPU. Also, I use --threads 28 for -ub 512 and --threads 32 for higher -ub values, as that seems to be the best match on my system.
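For anyone who wants to reproduce the thread-count comparison, here is a minimal sketch of the kind of loop I run (the thread values and the short -c 4096 context are only illustrative; the remaining flags match the baseline below):

# Quick thread-count sweep (illustrative values; not the exact script I used).
# Pick the --threads value with the best S_TG t/s column for your CPU.
foreach ($t in 24, 28, 32) {
    Write-Host "=== --threads $t ==="
    .\bin\llama-sweep-bench.exe `
        --model $ModelPath `
        --no-mmap `
        -mla 3 -fa -fmoe `
        -amb 512 -b 4096 `
        -ctk f16 `
        -c 4096 `
        -ngl 63 `
        -ot exps=CPU `
        --parallel 1 `
        --threads $t `
        --threads-batch $t `
        --warmup-batch
}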

First, the baseline:

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 28 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
|  512 |  128 |     0 |  3.259 |   157.12 |  7.645 |    16.74 |
|  512 |  128 |   512 |  3.151 |   162.50 |  7.650 |    16.73 |
|  512 |  128 |  1024 |  3.195 |   160.26 |  7.697 |    16.63 |
|  512 |  128 |  1536 |  3.230 |   158.53 |  7.664 |    16.70 |
|  512 |  128 |  2048 |  3.325 |   154.00 |  7.669 |    16.69 |
|  512 |  128 |  2560 |  3.314 |   154.49 |  7.726 |    16.57 |
|  512 |  128 |  3072 |  3.362 |   152.28 |  7.711 |    16.60 |
|  512 |  128 |  3584 |  3.399 |   150.62 |  7.744 |    16.53 |
|  512 |  128 |  4096 |  3.475 |   147.32 |  7.714 |    16.59 |
|  512 |  128 |  4608 |  3.487 |   146.84 |  7.745 |    16.53 |
|  512 |  128 |  5120 |  3.520 |   145.44 |  7.817 |    16.37 |
|  512 |  128 |  5632 |  3.598 |   142.29 |  7.804 |    16.40 |
|  512 |  128 |  6144 |  3.593 |   142.49 |  7.869 |    16.27 |
|  512 |  128 |  6656 |  3.841 |   133.31 |  7.892 |    16.22 |
|  512 |  128 |  7168 |  3.671 |   139.46 |  7.931 |    16.14 |
|  512 |  128 |  7680 |  3.700 |   138.36 |  7.925 |    16.15 |
|  512 |  128 |  8192 |  3.766 |   135.94 |  7.920 |    16.16 |
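(Side note: to compare runs at a glance, I just average the S_PP column of a saved table. A minimal PowerShell sketch, assuming the sweep-bench table above was saved to baseline.md, which is only an example filename:)

# Average the S_PP t/s column (index 5 after splitting on '|') of a saved
# llama-sweep-bench markdown table, to compare runs at a glance.
Get-Content .\baseline.md |
    Where-Object { $_ -match '^\|\s*\d' } |
    ForEach-Object { [double](($_ -split '\|')[5].Trim()) } |
    Measure-Object -Average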

For all @ubergarm models I have tested so far, I have always used the default -ub 512, as higher values resulted in a significant S_PP t/s drop on this machine, as shown here:

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 -ub 2048 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 28 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 18.917 |   108.26 | 30.285 |    16.91 |
| 2048 |  512 |  2048 | 19.237 |   106.46 | 30.370 |    16.86 |
| 2048 |  512 |  4096 | 19.540 |   104.81 | 30.479 |    16.80 |
| 2048 |  512 |  6144 | 19.852 |   103.17 | 30.898 |    16.57 |
| 2048 |  512 |  8192 | 19.688 |   104.02 | 31.020 |    16.51 |

Interestingly, in the linked discussion @Kebob discovered that passing -op 26,0,27,0,29,0 makes it possible to increase -ub without the performance penalty. So I tried:

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 -ub 2048 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -op 26,0,27,0,29,0 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 32 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 11.815 |   173.34 | 29.642 |    17.27 |
| 2048 |  512 |  2048 | 11.708 |   174.92 | 29.718 |    17.23 |
| 2048 |  512 |  4096 | 12.538 |   163.34 | 30.150 |    16.98 |
| 2048 |  512 |  6144 | 12.409 |   165.04 | 30.589 |    16.74 |
| 2048 |  512 |  8192 | 12.902 |   158.73 | 30.919 |    16.56 |

Looks good to me! Increasing further to -ub 4096 is possible, but the increased VRAM usage means there is no longer enough free VRAM to offload that single expert layer (a quick VRAM check is sketched after the table below). When I tried, the performance was about the same:

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 -ub 4096 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -op 26,0,27,0,29,0 `
    -ot exps=CPU `
    --parallel 1 `
    --threads 32 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 |     0 | 23.598 |   173.57 | 59.926 |    17.09 |
| 4096 | 1024 |  4096 | 24.347 |   168.23 | 59.955 |    17.08 |
| 4096 | 1024 |  8192 | 25.489 |   160.70 | 62.189 |    16.47 |
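For reference, when juggling -ub against that last offloaded ffn layer, I simply watch free VRAM while the model loads. This is plain nvidia-smi in a second terminal, nothing ik_llama-specific:

# Poll VRAM usage once per second while llama-server / llama-sweep-bench loads,
# to see whether there is still headroom for -ot "blk\.(3)\.ffn_.*=CUDA0".
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1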

That's it. Just a tip to try, something for @ubergarm to think about, and for @Kebob to know he is not alone in seeing this behaviour :). Many thanks to both of you!

Awesome! And glad to hear I wasn't alone. Out of curiosity, how much RAM do you have? I assume you have 12 channels?

I'm wondering if this issue has something to do with the new mla. Have you tried with -mla 2 and without -op 26,0,27,0,29,0?

> I'm wondering if this issue has something to do with the new mla. Have you tried with -mla 2 and without -op 26,0,27,0,29,0?

I did try this, and it didn't make a difference. The only thing that made a difference was either using your Q4_K_R4 quants or disabling offload for those ops.

I always meant to test re-enabling the ops one at a time, so I just gave it a quick try. With -op 27,0,29,0 (re-enabling op 26), I get slightly better performance. You may want to test the same, @sousekd.
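If anyone wants to repeat the elimination test on a similar Windows setup, here is a rough sketch of how such a loop could look (hypothetical; the listed combinations and the shorter -c are only examples, and the op numbers are the ones valid for this build):

# Benchmark each offload-policy combination with a short sweep,
# then compare the S_PP / S_TG columns between runs.
foreach ($op in '26,0,27,0,29,0', '27,0,29,0', '26,0,29,0', '26,0,27,0') {
    Write-Host "=== -op $op ==="
    .\bin\llama-sweep-bench.exe `
        --model $ModelPath `
        --no-mmap `
        -mla 3 -fa -fmoe `
        -amb 512 -b 4096 -ub 2048 `
        -ctk f16 `
        -c 8192 `
        -ngl 63 `
        -op $op `
        -ot exps=CPU `
        --threads 32 `
        --threads-batch 32 `
        --warmup-batch
}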

> Awesome! And glad to hear I wasn't alone. Out of curiosity, how much RAM do you have? I assume you have 12 channels?

12 channels of DDR5-6400 on a single socket, 768 GB. I guess that is the reason for the quite nice TG t/s.

> I'm wondering if this issue has something to do with the new mla. Have you tried with -mla 2 and without -op 26,0,27,0,29,0?

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 2 -fa -fmoe `
    -amb 512 -b 4096 -ub 2048 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 32 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 18.770 |   109.11 | 29.862 |    17.15 |
| 2048 |  512 |  2048 | 19.417 |   105.48 | 30.522 |    16.77 |

@anikifoss Same issue. I thought it was caused by Windows... or by how I compile and build ik_llama, but I assume @Kebob is on Linux.
Anyway, I just finished downloading your GGUF. I'll give it a spin tomorrow :).
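For the record, my build is nothing exotic, roughly the following (a sketch from memory; the exact CMake options and the Clang/CUDA toolchain selection on Windows may differ between ik_llama versions):

# Rough sketch of a CUDA-enabled ik_llama build; the GGML_CUDA option is an
# assumption based on the upstream-style CMake flags, adjust for your toolchain.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j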

> I always meant to test re-enabling the ops one at a time, so I just gave it a quick try. With -op 27,0,29,0 (re-enabling op 26), I get slightly better performance. You may want to test the same, @sousekd.

Hmm, doesn't seem to work for me:

XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF

main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 32, n_threads_batch = 32

|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 16.408 |   124.82 | 29.834 |    17.16 |
| 2048 |  512 |  2048 | 16.854 |   121.51 | 29.793 |    17.19 |

Sanity check:

XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF

main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 32, n_threads_batch = 32

|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 11.702 |   175.02 | 29.797 |    17.18 |
| 2048 |  512 |  2048 | 11.926 |   171.72 | 29.956 |    17.09 |

@Kebob where did you get the idea, BTW? I could not find anything about it when I looked :).

> @Kebob where did you get the idea, BTW? I could not find anything about it when I looked :).

It was just an idea I had. I'll need to do more testing as I originally tested it on https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF. I just re-tested against the IQ4_KS_R4 quants in this repo and I see no difference.

> It was just an idea I had.

I wish I had ideas like that! :)

Owner

Thanks for sharing all your tips for the 5090 club! I am wondering whether it has something to do with the _r4 flavors I have been using, and not so much with the ks vs k quants, but I haven't had a chance to dig into it more closely.

fwiw I'm currently uploading a new iq3_ks that is not _r4, which could be interesting for y'all to try out. I might make larger and smaller versions of it eventually too if there is interest.

https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF

> Hmm, doesn't seem to work for me:

Okay, taking this one back. It was late yesterday and I may have made a mistake. Also, it seems to me that the server (or ik_llama) has its moods, giving quite different results from time to time. Anyway, in today's mood the results seem to favor -op 27,0,29,0, especially at longer context, and the hit at shorter context is not as hard as it seemed when I posted the previous results.

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 -ub 2048 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -op 26,0,27,0,29,0 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 32 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 11.748 |   174.32 | 29.767 |    17.20 |
| 2048 |  512 |  2048 | 11.970 |   171.10 | 29.836 |    17.16 |
| 2048 |  512 |  4096 | 12.098 |   169.28 | 29.980 |    17.08 |
| 2048 |  512 |  6144 | 12.752 |   160.60 | 30.359 |    16.87 |
| 2048 |  512 |  8192 | 14.013 |   146.15 | 30.864 |    16.59 |
| 2048 |  512 | 10240 | 14.120 |   145.04 | 32.726 |    15.65 |
| 2048 |  512 | 12288 | 15.252 |   134.28 | 34.555 |    14.82 |
| 2048 |  512 | 14336 | 14.223 |   143.99 | 34.252 |    14.95 |
| 2048 |  512 | 16384 | 14.072 |   145.53 | 37.130 |    13.79 |
| 2048 |  512 | 18432 | 14.353 |   142.69 | 44.312 |    11.55 |
| 2048 |  512 | 20480 | 14.689 |   139.42 | 43.282 |    11.83 |
| 2048 |  512 | 22528 | 15.710 |   130.36 | 44.196 |    11.58 |
| 2048 |  512 | 24576 | 15.722 |   130.26 | 44.062 |    11.62 |
| 2048 |  512 | 26624 | 15.772 |   129.85 | 44.526 |    11.50 |
| 2048 |  512 | 28672 | 16.337 |   125.36 | 44.039 |    11.63 |
| 2048 |  512 | 30720 | 16.880 |   121.33 | 44.705 |    11.45 |

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 -ub 2048 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -op 27,0,29,0 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 32 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 12.402 |   165.13 | 29.459 |    17.38 |
| 2048 |  512 |  2048 | 12.842 |   159.48 | 30.073 |    17.03 |
| 2048 |  512 |  4096 | 13.830 |   148.08 | 30.314 |    16.89 |
| 2048 |  512 |  6144 | 13.118 |   156.12 | 30.838 |    16.60 |
| 2048 |  512 |  8192 | 13.118 |   156.13 | 30.962 |    16.54 |
| 2048 |  512 | 10240 | 13.574 |   150.87 | 31.037 |    16.50 |
| 2048 |  512 | 12288 | 14.502 |   141.22 | 31.698 |    16.15 |
| 2048 |  512 | 14336 | 13.952 |   146.79 | 31.598 |    16.20 |
| 2048 |  512 | 16384 | 14.894 |   137.50 | 32.068 |    15.97 |
| 2048 |  512 | 18432 | 15.149 |   135.19 | 33.219 |    15.41 |
| 2048 |  512 | 20480 | 16.170 |   126.65 | 34.629 |    14.79 |
| 2048 |  512 | 22528 | 15.486 |   132.25 | 35.577 |    14.39 |
| 2048 |  512 | 24576 | 16.883 |   121.31 | 35.522 |    14.41 |
| 2048 |  512 | 26624 | 15.762 |   129.94 | 35.570 |    14.39 |
| 2048 |  512 | 28672 | 16.430 |   124.65 | 35.937 |    14.25 |
| 2048 |  512 | 30720 | 16.625 |   123.19 | 36.151 |    14.16 |

> fwiw I'm currently uploading a new iq3_ks that is not _r4, which could be interesting for y'all to try out. I might make larger and smaller versions of it eventually too if there is interest.

Oh, that's great! I definitely have an interest in a larger version offering the best possible quality, even at the expense of some speed - that is the whole point of building this server :). The plan is to use smaller models where speed is important, but to have a model as smart and knowledgeable as possible, even on obscure topics and languages, where and when needed. Thank you very much for all your work!

Sorry to resurrect this old thread, but I had a question:
Is -op 26,0,27,0,29,0 still useful, or is it a thing of the past?

@anikifoss The numbers (26, 27, 29) have changed. Based on @ikawrakow's comments, it is probably still a thing for systems with slow PCIe. See a related discussion here. I might do some tests later today with your DS-3.1-Terminus (downloading...).

Yes, the ops have changed. This is the current set of operations that can lead to large tensor offloads from CPU to GPU:

27:  MUL_MAT
28:  MUL_MAT_ID
30:  FUSED_UP_GATE
31:  MOE_FUSED_UP_GATE

Sorry about that. The change resulted from adding an op that was needed for a model: instead of putting the new op at the end, which would have left the -op command line argument unaffected, I put it in its more logical place without realizing the consequence. That is how we went from (26,27,29) to (27,28,30). Then I added the fused ffn_up+ffn_gate operation, so it became (27,28,30,31).

@anikifoss So on my machine, -op 27,0,28,0,30,0,31,0 speeds up TG by 1-2 t/s but slows down PP by 20% (HQ4 Terminus).
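For anyone landing here later, the two generations of the flag side by side (a minimal sketch; whether it helps at all still depends on the system, as the numbers in this thread show):

# Older builds, as benchmarked earlier in this thread:
#   -op 26,0,27,0,29,0
# Current builds, after the renumbering plus the fused ffn_up+ffn_gate op
# (27 MUL_MAT, 28 MUL_MAT_ID, 30 FUSED_UP_GATE, 31 MOE_FUSED_UP_GATE):
$OffloadPolicy = '27,0,28,0,30,0,31,0'
# ...then pass "-op $OffloadPolicy" to llama-server or llama-sweep-bench as before.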
