Playing with ik_llama params for IQ4_KS_R4 on RTX 5090

#17 · opened by sousekd

In response to the discussion here, I have been experimenting a bit with some ik_llama params. Sharing for discussion and for RTX 5090 owners chasing the best performance.

All tests were done on an Epyc 9355 with a single RTX 5090 running Windows, on a fairly recent build of ik_llama:

PS>  .\bin\llama-server --version
version: 3772 (5236c98b)
built with Clang 19.1.5

All tests use -amb 512 and -ctk f16, as lowering these brought neither a meaningful performance gain nor the ability to offload more layers to the GPU. Also, I use --threads 28 for -ub 512 and --threads 32 for higher -ub values, as that seems to be the best match on my system.
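For anyone who wants to reproduce the thread-count comparison, here is a minimal sketch of the kind of loop I run (the thread values and the short -c 4096 context are only illustrative; the remaining flags match the baseline below):

# Quick thread-count sweep (illustrative values; not the exact script I used).
# Pick the --threads value with the best S_TG t/s column for your CPU.
foreach ($t in 24, 28, 32) {
    Write-Host "=== --threads $t ==="
    .\bin\llama-sweep-bench.exe `
        --model $ModelPath `
        --no-mmap `
        -mla 3 -fa -fmoe `
        -amb 512 -b 4096 `
        -ctk f16 `
        -c 4096 `
        -ngl 63 `
        -ot exps=CPU `
        --parallel 1 `
        --threads $t `
        --threads-batch $t `
        --warmup-batch
}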

First, the baseline:

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 28 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
|  512 |  128 |     0 |  3.259 |   157.12 |  7.645 |    16.74 |
|  512 |  128 |   512 |  3.151 |   162.50 |  7.650 |    16.73 |
|  512 |  128 |  1024 |  3.195 |   160.26 |  7.697 |    16.63 |
|  512 |  128 |  1536 |  3.230 |   158.53 |  7.664 |    16.70 |
|  512 |  128 |  2048 |  3.325 |   154.00 |  7.669 |    16.69 |
|  512 |  128 |  2560 |  3.314 |   154.49 |  7.726 |    16.57 |
|  512 |  128 |  3072 |  3.362 |   152.28 |  7.711 |    16.60 |
|  512 |  128 |  3584 |  3.399 |   150.62 |  7.744 |    16.53 |
|  512 |  128 |  4096 |  3.475 |   147.32 |  7.714 |    16.59 |
|  512 |  128 |  4608 |  3.487 |   146.84 |  7.745 |    16.53 |
|  512 |  128 |  5120 |  3.520 |   145.44 |  7.817 |    16.37 |
|  512 |  128 |  5632 |  3.598 |   142.29 |  7.804 |    16.40 |
|  512 |  128 |  6144 |  3.593 |   142.49 |  7.869 |    16.27 |
|  512 |  128 |  6656 |  3.841 |   133.31 |  7.892 |    16.22 |
|  512 |  128 |  7168 |  3.671 |   139.46 |  7.931 |    16.14 |
|  512 |  128 |  7680 |  3.700 |   138.36 |  7.925 |    16.15 |
|  512 |  128 |  8192 |  3.766 |   135.94 |  7.920 |    16.16 |
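(Side note: to compare runs at a glance, I just average the S_PP column of a saved table. A minimal PowerShell sketch, assuming the sweep-bench table above was saved to baseline.md, which is only an example filename:)

# Average the S_PP t/s column (index 5 after splitting on '|') of a saved
# llama-sweep-bench markdown table, to compare runs at a glance.
Get-Content .\baseline.md |
    Where-Object { $_ -match '^\|\s*\d' } |
    ForEach-Object { [double](($_ -split '\|')[5].Trim()) } |
    Measure-Object -Average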

For all @ubergarm models I have tested so far, I have always used the default -ub 512, as higher values resulted in a significant S_PP t/s drop on this machine, as shown here:

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 -ub 2048 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 28 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 18.917 |   108.26 | 30.285 |    16.91 |
| 2048 |  512 |  2048 | 19.237 |   106.46 | 30.370 |    16.86 |
| 2048 |  512 |  4096 | 19.540 |   104.81 | 30.479 |    16.80 |
| 2048 |  512 |  6144 | 19.852 |   103.17 | 30.898 |    16.57 |
| 2048 |  512 |  8192 | 19.688 |   104.02 | 31.020 |    16.51 |

Interestingly, in the linked discussion @Kebob discovered that passing -op 26,0,27,0,29,0 makes it possible to increase -ub without the performance penalty. So I tried:

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 -ub 2048 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -op 26,0,27,0,29,0 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 32 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 11.815 |   173.34 | 29.642 |    17.27 |
| 2048 |  512 |  2048 | 11.708 |   174.92 | 29.718 |    17.23 |
| 2048 |  512 |  4096 | 12.538 |   163.34 | 30.150 |    16.98 |
| 2048 |  512 |  6144 | 12.409 |   165.04 | 30.589 |    16.74 |
| 2048 |  512 |  8192 | 12.902 |   158.73 | 30.919 |    16.56 |

Looks good to me! Increasing further to -ub 4096 is possible, but the increased VRAM usage means there is no longer enough free VRAM to offload that single expert layer (a quick VRAM check is sketched after the table below). When I tried, the performance was about the same:

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 -ub 4096 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -op 26,0,27,0,29,0 `
    -ot exps=CPU `
    --parallel 1 `
    --threads 32 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 |     0 | 23.598 |   173.57 | 59.926 |    17.09 |
| 4096 | 1024 |  4096 | 24.347 |   168.23 | 59.955 |    17.08 |
| 4096 | 1024 |  8192 | 25.489 |   160.70 | 62.189 |    16.47 |
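For reference, when juggling -ub against that last offloaded ffn layer, I simply watch free VRAM while the model loads. This is plain nvidia-smi in a second terminal, nothing ik_llama-specific:

# Poll VRAM usage once per second while llama-server / llama-sweep-bench loads,
# to see whether there is still headroom for -ot "blk\.(3)\.ffn_.*=CUDA0".
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1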

That's it. Just a tip to try, something for @ubergarm to think about, and for @Kebob to know he is not alone in seeing this behaviour :). Many thanks to both of you!

Awesome! And glad to hear I wasn't alone. Out of curiosity, how much RAM do you have? I assume you have 12 channels?

I'm wondering if this issue has something to do with the new mla. Have you tried with -mla 2 and without -op 26,0,27,0,29,0?

> I'm wondering if this issue has something to do with the new mla. Have you tried with -mla 2 and without -op 26,0,27,0,29,0?

I did try this, and it didn't make a difference. The only thing that made a difference was either using your Q4_K_R4 quants or disabling offload for those ops.

I always meant to test re-enabling the ops one at a time, so I just gave it a quick try. With -op 27,0,29,0 (re-enabling op 26), I get slightly better performance. You may want to test the same, @sousekd.
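If anyone wants to repeat the elimination test on a similar Windows setup, here is a rough sketch of how such a loop could look (hypothetical; the listed combinations and the shorter -c are only examples, and the op numbers are the ones valid for this build):

# Benchmark each offload-policy combination with a short sweep,
# then compare the S_PP / S_TG columns between runs.
foreach ($op in '26,0,27,0,29,0', '27,0,29,0', '26,0,29,0', '26,0,27,0') {
    Write-Host "=== -op $op ==="
    .\bin\llama-sweep-bench.exe `
        --model $ModelPath `
        --no-mmap `
        -mla 3 -fa -fmoe `
        -amb 512 -b 4096 -ub 2048 `
        -ctk f16 `
        -c 8192 `
        -ngl 63 `
        -op $op `
        -ot exps=CPU `
        --threads 32 `
        --threads-batch 32 `
        --warmup-batch
}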

> Awesome! And glad to hear I wasn't alone. Out of curiosity, how much RAM do you have? I assume you have 12 channels?

12 channels of DDR5-6400 on a single socket, 768 GB. I guess that is the reason for the quite nice TG t/s.

> I'm wondering if this issue has something to do with the new mla. Have you tried with -mla 2 and without -op 26,0,27,0,29,0?

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 2 -fa -fmoe `
    -amb 512 -b 4096 -ub 2048 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 32 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 18.770 |   109.11 | 29.862 |    17.15 |
| 2048 |  512 |  2048 | 19.417 |   105.48 | 30.522 |    16.77 |

@anikifoss Same issue. I thought it was caused by Windows... or by how I compile and build ik_llama, but I assume @Kebob is on Linux.
Anyway, I just finished downloading your GGUF. I'll give it a spin tomorrow :).
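For the record, my build is nothing exotic, roughly the following (a sketch from memory; the exact CMake options and the Clang/CUDA toolchain selection on Windows may differ between ik_llama versions):

# Rough sketch of a CUDA-enabled ik_llama build; the GGML_CUDA option is an
# assumption based on the upstream-style CMake flags, adjust for your toolchain.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j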

> I always meant to test re-enabling the ops one at a time, so I just gave it a quick try. With -op 27,0,29,0 (re-enabling op 26), I get slightly better performance. You may want to test the same, @sousekd.

Hmm, doesn't seem to work for me:

XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF

main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 32, n_threads_batch = 32

|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 16.408 |   124.82 | 29.834 |    17.16 |
| 2048 |  512 |  2048 | 16.854 |   121.51 | 29.793 |    17.19 |

Sanity check:

XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF

main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 32, n_threads_batch = 32

|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 11.702 |   175.02 | 29.797 |    17.18 |
| 2048 |  512 |  2048 | 11.926 |   171.72 | 29.956 |    17.09 |

@Kebob where did you get the idea, BTW? I could not find anything about it when I looked :).

> @Kebob where did you get the idea, BTW? I could not find anything about it when I looked :).

It was just an idea I had. I'll need to do more testing as I originally tested it on https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF. I just re-tested against the IQ4_KS_R4 quants in this repo and I see no difference.

> It was just an idea I had.

I wish I had ideas like that! :)

Owner

Thanks for sharing all your tips for the 5090 club! I am wondering whether it has something to do with the _r4 flavors I have been using, and not so much with the ks vs k quants, but I haven't had a chance to dig into it more closely.

fwiw I'm currently uploading a new iq3_ks that is not _r4, which could be interesting for y'all to try out. I might make larger and smaller versions of it eventually too if there is interest.

https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF

> Hmm, doesn't seem to work for me:

Okay, taking this one back. It was late yesterday and I may have made a mistake. Also, it seems to me that the server (or ik_llama) has its moods, giving quite different results from time to time. Anyway, in today's mood the results seem to favor -op 27,0,29,0, especially at longer context, and the hit at shorter context is not as hard as it seemed when I posted the previous results.

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 -ub 2048 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -op 26,0,27,0,29,0 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 32 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 11.748 |   174.32 | 29.767 |    17.20 |
| 2048 |  512 |  2048 | 11.970 |   171.10 | 29.836 |    17.16 |
| 2048 |  512 |  4096 | 12.098 |   169.28 | 29.980 |    17.08 |
| 2048 |  512 |  6144 | 12.752 |   160.60 | 30.359 |    16.87 |
| 2048 |  512 |  8192 | 14.013 |   146.15 | 30.864 |    16.59 |
| 2048 |  512 | 10240 | 14.120 |   145.04 | 32.726 |    15.65 |
| 2048 |  512 | 12288 | 15.252 |   134.28 | 34.555 |    14.82 |
| 2048 |  512 | 14336 | 14.223 |   143.99 | 34.252 |    14.95 |
| 2048 |  512 | 16384 | 14.072 |   145.53 | 37.130 |    13.79 |
| 2048 |  512 | 18432 | 14.353 |   142.69 | 44.312 |    11.55 |
| 2048 |  512 | 20480 | 14.689 |   139.42 | 43.282 |    11.83 |
| 2048 |  512 | 22528 | 15.710 |   130.36 | 44.196 |    11.58 |
| 2048 |  512 | 24576 | 15.722 |   130.26 | 44.062 |    11.62 |
| 2048 |  512 | 26624 | 15.772 |   129.85 | 44.526 |    11.50 |
| 2048 |  512 | 28672 | 16.337 |   125.36 | 44.039 |    11.63 |
| 2048 |  512 | 30720 | 16.880 |   121.33 | 44.705 |    11.45 |

PS>  .\bin\llama-sweep-bench.exe `
    --alias $ModelAlias `
    --model $ModelPath `
    --no-mmap `
    -mla 3 -fa -fmoe `
    -amb 512 -b 4096 -ub 2048 `
    -ctk f16 `
    -c 32768 `
    -ngl 63 `
    -op 27,0,29,0 `
    -ot "blk\.(3)\.ffn_.*=CUDA0" `
    -ot exps=CPU `
    --parallel 1 `
    --threads 32 `
    --threads-batch 32 `
    --warmup-batch
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 2048 |  512 |     0 | 12.402 |   165.13 | 29.459 |    17.38 |
| 2048 |  512 |  2048 | 12.842 |   159.48 | 30.073 |    17.03 |
| 2048 |  512 |  4096 | 13.830 |   148.08 | 30.314 |    16.89 |
| 2048 |  512 |  6144 | 13.118 |   156.12 | 30.838 |    16.60 |
| 2048 |  512 |  8192 | 13.118 |   156.13 | 30.962 |    16.54 |
| 2048 |  512 | 10240 | 13.574 |   150.87 | 31.037 |    16.50 |
| 2048 |  512 | 12288 | 14.502 |   141.22 | 31.698 |    16.15 |
| 2048 |  512 | 14336 | 13.952 |   146.79 | 31.598 |    16.20 |
| 2048 |  512 | 16384 | 14.894 |   137.50 | 32.068 |    15.97 |
| 2048 |  512 | 18432 | 15.149 |   135.19 | 33.219 |    15.41 |
| 2048 |  512 | 20480 | 16.170 |   126.65 | 34.629 |    14.79 |
| 2048 |  512 | 22528 | 15.486 |   132.25 | 35.577 |    14.39 |
| 2048 |  512 | 24576 | 16.883 |   121.31 | 35.522 |    14.41 |
| 2048 |  512 | 26624 | 15.762 |   129.94 | 35.570 |    14.39 |
| 2048 |  512 | 28672 | 16.430 |   124.65 | 35.937 |    14.25 |
| 2048 |  512 | 30720 | 16.625 |   123.19 | 36.151 |    14.16 |

> fwiw I'm currently uploading a new iq3_ks that is not _r4, which could be interesting for y'all to try out. I might make larger and smaller versions of it eventually too if there is interest.

Oh, that's great! I definitely have an interest in a larger version offering the best possible quality, even at the expense of some speed - that is the whole point of building this server :). The plan is to use smaller models where speed is important, but to have a model as smart and knowledgeable as possible, even on obscure topics and languages, where and when needed. Thank you very much for all your work!

Sorry to resurrect this old thread, but I had a question:
Is -op 26,0,27,0,29,0 still useful, or is it a thing of the past?

@anikifoss The numbers (26, 27, 29) have changed. Based on @ikawrakow's comments, it is probably still a thing for systems with slow PCIe. See a related discussion here. I might do some tests later today with your DS-3.1-Terminus (downloading...).

Yes, the ops have changed. This is the current set of operations that can lead to large tensor offloads from CPU to GPU:

27:  MUL_MAT
28:  MUL_MAT_ID
30:  FUSED_UP_GATE
31:  MOE_FUSED_UP_GATE

Sorry about that. The change resulted from adding an op that was needed for a model: instead of putting the new op at the end, which would have left the -op command line argument unaffected, I put it in its more logical place without realizing the consequence. That is how we went from (26,27,29) to (27,28,30). Then I added the fused ffn_up+ffn_gate operation, so it became (27,28,30,31).

@anikifoss So on my machine, -op 27,0,28,0,30,0,31,0 speeds up TG by 1-2 t/s but slows down PP by 20% (HQ4 Terminus).
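For anyone landing here later, the two generations of the flag side by side (a minimal sketch; whether it helps at all still depends on the system, as the numbers in this thread show):

# Older builds, as benchmarked earlier in this thread:
#   -op 26,0,27,0,29,0
# Current builds, after the renumbering plus the fused ffn_up+ffn_gate op
# (27 MUL_MAT, 28 MUL_MAT_ID, 30 FUSED_UP_GATE, 31 MOE_FUSED_UP_GATE):
$OffloadPolicy = '27,0,28,0,30,0,31,0'
# ...then pass "-op $OffloadPolicy" to llama-server or llama-sweep-bench as before.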
