Playing with ik_llama params for IQ4_KS_R4 on RTX 5090
In response to the discussion here, I have been experimenting a bit with some ik_llama params. Sharing for discussion, and for RTX 5090 owners chasing the best performance.
All tests have been done on an Epyc 9355 with a single RTX 5090 running Windows, on a fairly recent build of ik_llama:
PS> .\bin\llama-server --version
version: 3772 (5236c98b)
built with Clang 19.1.5
All tests use -amb 512 and -ctk f16, as lowering these brought neither a meaningful performance gain nor the ability to offload more layers to the GPU. Also, I use --threads 28 for -ub 512 and --threads 32 for higher -ub, as that seems to be the best match on my system.
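The commands below reference $ModelAlias and $ModelPath, which I set up front; a minimal sketch with illustrative values (the paths are assumptions, not the exact ones I use):

```powershell
# Illustrative values only - point these at your own quant.
$ModelAlias = "DeepSeek-IQ4_KS_R4"
$ModelPath  = "D:\models\DeepSeek-IQ4_KS_R4\model-00001-of-00009.gguf"
```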
First, the baseline:
PS> .\bin\llama-sweep-bench.exe `
--alias $ModelAlias `
--model $ModelPath `
--no-mmap `
-mla 3 -fa -fmoe `
-amb 512 -b 4096 `
-ctk f16 `
-c 32768 `
-ngl 63 `
-ot "blk\.(3)\.ffn_.*=CUDA0" `
-ot exps=CPU `
--parallel 1 `
--threads 28 `
--threads-batch 32 `
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 3.259 | 157.12 | 7.645 | 16.74 |
| 512 | 128 | 512 | 3.151 | 162.50 | 7.650 | 16.73 |
| 512 | 128 | 1024 | 3.195 | 160.26 | 7.697 | 16.63 |
| 512 | 128 | 1536 | 3.230 | 158.53 | 7.664 | 16.70 |
| 512 | 128 | 2048 | 3.325 | 154.00 | 7.669 | 16.69 |
| 512 | 128 | 2560 | 3.314 | 154.49 | 7.726 | 16.57 |
| 512 | 128 | 3072 | 3.362 | 152.28 | 7.711 | 16.60 |
| 512 | 128 | 3584 | 3.399 | 150.62 | 7.744 | 16.53 |
| 512 | 128 | 4096 | 3.475 | 147.32 | 7.714 | 16.59 |
| 512 | 128 | 4608 | 3.487 | 146.84 | 7.745 | 16.53 |
| 512 | 128 | 5120 | 3.520 | 145.44 | 7.817 | 16.37 |
| 512 | 128 | 5632 | 3.598 | 142.29 | 7.804 | 16.40 |
| 512 | 128 | 6144 | 3.593 | 142.49 | 7.869 | 16.27 |
| 512 | 128 | 6656 | 3.841 | 133.31 | 7.892 | 16.22 |
| 512 | 128 | 7168 | 3.671 | 139.46 | 7.931 | 16.14 |
| 512 | 128 | 7680 | 3.700 | 138.36 | 7.925 | 16.15 |
| 512 | 128 | 8192 | 3.766 | 135.94 | 7.920 | 16.16 |
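A note on the two -ot flags for anyone adapting this: blk\.(3)\.ffn_.*=CUDA0 pins the FFN tensors of layer 3 to the GPU, while exps=CPU keeps the MoE expert tensors in system RAM. If you have VRAM to spare, the pattern can be widened; a sketch, where the (3|4|5) layer choice is just an assumption to illustrate the syntax:

```powershell
# Pin the FFN tensors of a few more layers to the GPU when VRAM allows;
# widen or narrow the (3|4|5) alternation to fit your VRAM budget.
.\bin\llama-sweep-bench.exe `
  --model $ModelPath --no-mmap `
  -mla 3 -fa -fmoe -amb 512 -b 4096 -ctk f16 -c 32768 -ngl 63 `
  -ot "blk\.(3|4|5)\.ffn_.*=CUDA0" `
  -ot exps=CPU `
  --threads 28 --threads-batch 32 --warmup-batch
```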
For all @ubergarm models I tested so far, I have always used the default -ub 512, as higher values resulted in a significant S_PP t/s drop on this machine, as shown here:
PS> .\bin\llama-sweep-bench.exe `
--alias $ModelAlias `
--model $ModelPath `
--no-mmap `
-mla 3 -fa -fmoe `
-amb 512 -b 4096 -ub 2048 `
-ctk f16 `
-c 32768 `
-ngl 63 `
-ot "blk\.(3)\.ffn_.*=CUDA0" `
-ot exps=CPU `
--parallel 1 `
--threads 28 `
--threads-batch 32 `
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 18.917 | 108.26 | 30.285 | 16.91 |
| 2048 | 512 | 2048 | 19.237 | 106.46 | 30.370 | 16.86 |
| 2048 | 512 | 4096 | 19.540 | 104.81 | 30.479 | 16.80 |
| 2048 | 512 | 6144 | 19.852 | 103.17 | 30.898 | 16.57 |
| 2048 | 512 | 8192 | 19.688 | 104.02 | 31.020 | 16.51 |
Interestingly, in the linked discussion @Kebob discovered that passing -op 26,0,27,0,29,0 makes it possible to increase -ub without the performance penalty. So I tried:
PS> .\bin\llama-sweep-bench.exe `
--alias $ModelAlias `
--model $ModelPath `
--no-mmap `
-mla 3 -fa -fmoe `
-amb 512 -b 4096 -ub 2048 `
-ctk f16 `
-c 32768 `
-ngl 63 `
-op 26,0,27,0,29,0 `
-ot "blk\.(3)\.ffn_.*=CUDA0" `
-ot exps=CPU `
--parallel 1 `
--threads 32 `
--threads-batch 32 `
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 11.815 | 173.34 | 29.642 | 17.27 |
| 2048 | 512 | 2048 | 11.708 | 174.92 | 29.718 | 17.23 |
| 2048 | 512 | 4096 | 12.538 | 163.34 | 30.150 | 16.98 |
| 2048 | 512 | 6144 | 12.409 | 165.04 | 30.589 | 16.74 |
| 2048 | 512 | 8192 | 12.902 | 158.73 | 30.919 | 16.56 |
Looks good to me! Increasing further to -ub 4096 is possible, but the higher VRAM usage leaves no room to offload that single expert layer. When I tried, the performance was about the same:
PS> .\bin\llama-sweep-bench.exe `
--alias $ModelAlias `
--model $ModelPath `
--no-mmap `
-mla 3 -fa -fmoe `
-amb 512 -b 4096 -ub 4096 `
-ctk f16 `
-c 32768 `
-ngl 63 `
-op 26,0,27,0,29,0 `
-ot exps=CPU `
--parallel 1 `
--threads 32 `
--threads-batch 32 `
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 23.598 | 173.57 | 59.926 | 17.09 |
| 4096 | 1024 | 4096 | 24.347 | 168.23 | 59.955 | 17.08 |
| 4096 | 1024 | 8192 | 25.489 | 160.70 | 62.189 | 16.47 |
That's it. Just a tip to try, something for @ubergarm to think about, and for @Kebob to know he is not alone in seeing this behaviour :). Many thanks to both of you!
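P.S. For readers new to the -op flag: it takes op-index,on/off pairs, so -op 26,0,27,0,29,0 disables CPU->GPU offload for ops 26, 27 and 29 (MUL_MAT, MUL_MAT_ID and MOE_FUSED_UP_GATE on this build, as the log lines further down confirm). A sketch for sweeping -ub values back to back under that policy, reusing the variables from above:

```powershell
# Sweep -ub back to back under the same offload policy; all other flags
# match the runs above, so the numbers should line up with the tables.
foreach ($ub in 512, 1024, 2048) {
  "`n=== -ub $ub ==="
  .\bin\llama-sweep-bench.exe `
    --model $ModelPath --no-mmap `
    -mla 3 -fa -fmoe -amb 512 -b 4096 -ub $ub -ctk f16 -c 32768 -ngl 63 `
    -op 26,0,27,0,29,0 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" -ot exps=CPU `
    --threads 32 --threads-batch 32 --warmup-batch
}
```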
Awesome! And glad to hear I wasn't alone. Out of curiosity, how much RAM do you have? I assume you have 12 channels?
I'm wondering if this issue has something to do with the new MLA. Have you tried with -mla 2 and without -op 26,0,27,0,29,0?
> I'm wondering if this issue has something to do with the new MLA. Have you tried with -mla 2 and without -op 26,0,27,0,29,0?
I did try this, and it didn't make a difference. The only thing that made a difference was either using your Q4_K_R4 quants or disabling those tensors.
I always meant to test re-enabling one of the tensors at a time, so I just gave it a quick try. With -op 27,0,29,0 (re-enabling op 26), I get a little bit better performance. You may want to test the same, @sousekd.
> Awesome! And glad to hear I wasn't alone. Out of curiosity, how much RAM do you have? I assume you have 12 channels?
12 channels of DDR5-6400 on a single socket, 768 GB. I guess that is the reason for the quite nice TG t/s.
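For context, a back-of-envelope sketch of what that memory setup should allow for TG; the active-parameter count and bits-per-weight are my assumptions, not measured values:

```powershell
# Rough TG ceiling from memory bandwidth alone (assumptions: ~37B active
# params per token for DeepSeek, ~4.5 bits/weight for IQ4_KS_R4 in RAM).
$peakGBs    = 12 * 8 * 6.4            # 12 channels x 8 B x 6.4 GT/s = 614.4 GB/s
$gbPerToken = 37e9 * 4.5 / 8 / 1e9    # ~20.8 GB of weights read per token
"{0:N1} t/s ceiling" -f ($peakGBs / $gbPerToken)   # ~29.5 t/s vs ~17 observed
```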
> I'm wondering if this issue has something to do with the new MLA. Have you tried with -mla 2 and without -op 26,0,27,0,29,0?
PS> .\bin\llama-sweep-bench.exe `
--alias $ModelAlias `
--model $ModelPath `
--no-mmap `
-mla 2 -fa -fmoe `
-amb 512 -b 4096 -ub 2048 `
-ctk f16 `
-c 32768 `
-ngl 63 `
-ot "blk\.(3)\.ffn_.*=CUDA0" `
-ot exps=CPU `
--parallel 1 `
--threads 32 `
--threads-batch 32 `
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 18.770 | 109.11 | 29.862 | 17.15 |
| 2048 | 512 | 2048 | 19.417 | 105.48 | 30.522 | 16.77 |
@anikifoss Same issue. I thought it was caused by Windows... or by how I compile and build ik_llama, but I assume @Kebob is on Linux.
Anyway, I just finished downloading your GGUF. I'll take it for a ride tomorrow :).
> I always meant to test re-enabling one of the tensors at a time, so I just gave it a quick try. With -op 27,0,29,0 (re-enabling op 26), I get a little bit better performance. You may want to test the same, @sousekd.
Hmm, doesn't seem to work for me:
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 32, n_threads_batch = 32
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 16.408 | 124.82 | 29.834 | 17.16 |
| 2048 | 512 | 2048 | 16.854 | 121.51 | 29.793 | 17.19 |
Sanity check:
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 32, n_threads_batch = 32
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 11.702 | 175.02 | 29.797 | 17.18 |
| 2048 | 512 | 2048 | 11.926 | 171.72 | 29.956 | 17.09 |
@Kebob where did you get the idea, BTW? I could not find anything about it when I tried :).
> @Kebob where did you get the idea, BTW? I could not find anything about it when I tried :).
It was just an idea I had. I'll need to do more testing, as I originally tested it on https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF. I just re-tested against the IQ4_KS_R4 quants in this repo and I see no difference.
> It was just an idea I had.
I wish I had ideas like that! :)
Thanks for sharing all your tips for the 5090 club! I am wondering whether it has something to do with the _r4 flavors I have been using, rather than with ks vs k quants, but I haven't had a chance to dig into it more closely.
fwiw I'm currently uploading a new iq3_ks that is not _r4 that could be interesting for y'all to try out. I might make a larger and smaller version of it eventually too if there is interest.
https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF
> Hmm, doesn't seem to work for me:
Okay, I am taking that one back. It was late yesterday and I may have made a mistake. Also, it seems to me that the server (or ik_llama) has its moods, giving quite different results from time to time. Anyway, in today's mood the results seem to be in favor of -op 27,0,29,0, especially at longer context, and the hit at shorter context is not as hard as it seemed when I posted the previous results.
PS> .\bin\llama-sweep-bench.exe `
--alias $ModelAlias `
--model $ModelPath `
--no-mmap `
-mla 3 -fa -fmoe `
-amb 512 -b 4096 -ub 2048 `
-ctk f16 `
-c 32768 `
-ngl 63 `
-op 26,0,27,0,29,0 `
-ot "blk\.(3)\.ffn_.*=CUDA0" `
-ot exps=CPU `
--parallel 1 `
--threads 32 `
--threads-batch 32 `
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 11.748 | 174.32 | 29.767 | 17.20 |
| 2048 | 512 | 2048 | 11.970 | 171.10 | 29.836 | 17.16 |
| 2048 | 512 | 4096 | 12.098 | 169.28 | 29.980 | 17.08 |
| 2048 | 512 | 6144 | 12.752 | 160.60 | 30.359 | 16.87 |
| 2048 | 512 | 8192 | 14.013 | 146.15 | 30.864 | 16.59 |
| 2048 | 512 | 10240 | 14.120 | 145.04 | 32.726 | 15.65 |
| 2048 | 512 | 12288 | 15.252 | 134.28 | 34.555 | 14.82 |
| 2048 | 512 | 14336 | 14.223 | 143.99 | 34.252 | 14.95 |
| 2048 | 512 | 16384 | 14.072 | 145.53 | 37.130 | 13.79 |
| 2048 | 512 | 18432 | 14.353 | 142.69 | 44.312 | 11.55 |
| 2048 | 512 | 20480 | 14.689 | 139.42 | 43.282 | 11.83 |
| 2048 | 512 | 22528 | 15.710 | 130.36 | 44.196 | 11.58 |
| 2048 | 512 | 24576 | 15.722 | 130.26 | 44.062 | 11.62 |
| 2048 | 512 | 26624 | 15.772 | 129.85 | 44.526 | 11.50 |
| 2048 | 512 | 28672 | 16.337 | 125.36 | 44.039 | 11.63 |
| 2048 | 512 | 30720 | 16.880 | 121.33 | 44.705 | 11.45 |
PS> .\bin\llama-sweep-bench.exe `
--alias $ModelAlias `
--model $ModelPath `
--no-mmap `
-mla 3 -fa -fmoe `
-amb 512 -b 4096 -ub 2048 `
-ctk f16 `
-c 32768 `
-ngl 63 `
-op 27,0,29,0 `
-ot "blk\.(3)\.ffn_.*=CUDA0" `
-ot exps=CPU `
--parallel 1 `
--threads 32 `
--threads-batch 32 `
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 12.402 | 165.13 | 29.459 | 17.38 |
| 2048 | 512 | 2048 | 12.842 | 159.48 | 30.073 | 17.03 |
| 2048 | 512 | 4096 | 13.830 | 148.08 | 30.314 | 16.89 |
| 2048 | 512 | 6144 | 13.118 | 156.12 | 30.838 | 16.60 |
| 2048 | 512 | 8192 | 13.118 | 156.13 | 30.962 | 16.54 |
| 2048 | 512 | 10240 | 13.574 | 150.87 | 31.037 | 16.50 |
| 2048 | 512 | 12288 | 14.502 | 141.22 | 31.698 | 16.15 |
| 2048 | 512 | 14336 | 13.952 | 146.79 | 31.598 | 16.20 |
| 2048 | 512 | 16384 | 14.894 | 137.50 | 32.068 | 15.97 |
| 2048 | 512 | 18432 | 15.149 | 135.19 | 33.219 | 15.41 |
| 2048 | 512 | 20480 | 16.170 | 126.65 | 34.629 | 14.79 |
| 2048 | 512 | 22528 | 15.486 | 132.25 | 35.577 | 14.39 |
| 2048 | 512 | 24576 | 16.883 | 121.31 | 35.522 | 14.41 |
| 2048 | 512 | 26624 | 15.762 | 129.94 | 35.570 | 14.39 |
| 2048 | 512 | 28672 | 16.430 | 124.65 | 35.937 | 14.25 |
| 2048 | 512 | 30720 | 16.625 | 123.19 | 36.151 | 14.16 |
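Given those moods, one way to make the comparison fairer is to repeat the same sweep a few times and keep the logs; a sketch, reusing the variables from above:

```powershell
# Repeat the same sweep three times, keeping each log for later comparison.
1..3 | ForEach-Object {
  .\bin\llama-sweep-bench.exe `
    --model $ModelPath --no-mmap `
    -mla 3 -fa -fmoe -amb 512 -b 4096 -ub 2048 -ctk f16 -c 32768 -ngl 63 `
    -op 27,0,29,0 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" -ot exps=CPU `
    --threads 32 --threads-batch 32 --warmup-batch |
    Tee-Object -FilePath ("sweep_run{0}.log" -f $_)
}
```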
> fwiw I'm currently uploading a new iq3_ks that is not _r4 that could be interesting for y'all to try out. I might make a larger and smaller version of it eventually too if there is interest.
Oh that's great! I definitely have an interest in the larger version offering the best possible quality, even at the expense of some speed - the whole point of building this server :). The plan is to use smaller models where speed is important, but to have a model as smart and knowledgeable as possible, even in obscure topics and languages, where and when needed. Thank you very much for all your work!
Sorry to resurrect the old thread, but I had a question:
Is -op 26,0,27,0,29,0 still useful, or is it a thing of the past?
@anikifoss The numbers (26, 27, 29) have changed. Based on @ikawrakow's comments, it is probably still a thing for systems with slow PCIe. See the related discussion here. I might do some tests later today with your DS-3.1-Terminus (downloading...).
Yes, the ops have changed. This is the current set of operations that can lead to large tensor offloads from CPU to GPU:
27: MUL_MAT
28: MUL_MAT_ID
30: FUSED_UP_GATE
31: MOE_FUSED_UP_GATE
Sorry about that. The change resulted from me adding an op that was necessary for a new model and putting it at its logical place in the enum instead of at the end, without realizing that this would shift the values used by the -op command-line argument. That took us from (26, 27, 29) to (27, 28, 30). Then I added the fused ffn_up+ffn_gate operation, so it became (27, 28, 30, 31).
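Since the indices shift between builds, one way to check the numbering for your exact checkout is to count entries in the ggml_op enum; a sketch, where the header path is an assumption that may differ between versions:

```powershell
# List op indices by counting ggml_op enum entries (GGML_OP_NONE is 0).
# The header path is an assumption; it has moved around between versions.
$src  = Get-Content .\ggml\include\ggml.h -Raw
$body = [regex]::Match($src, '(?s)enum ggml_op\s*\{(.*?)\}').Groups[1].Value
$i = 0
foreach ($m in [regex]::Matches($body, 'GGML_OP_(\w+)')) {
  '{0,3}: {1}' -f $i, $m.Groups[1].Value
  $i++
}
```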
@anikifoss So on my machine, -op 27,0,28,0,30,0,31,0 speeds up TG by 1-2 t/s but slows down PP by 20% (HQ4 Terminus).
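For completeness, a sketch of what the earlier runs look like on current builds with the op numbers updated (everything else unchanged, same variables as above):

```powershell
# Same run as earlier in the thread, with op numbers updated for current
# builds (27/28/30/31 per @ikawrakow's list above).
.\bin\llama-sweep-bench.exe `
  --model $ModelPath --no-mmap `
  -mla 3 -fa -fmoe -amb 512 -b 4096 -ub 2048 -ctk f16 -c 32768 -ngl 63 `
  -op 27,0,28,0,30,0,31,0 `
  -ot "blk\.(3)\.ffn_.*=CUDA0" -ot exps=CPU `
  --threads 32 --threads-batch 32 --warmup-batch
```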