ubergarm committed
Commit 84d15ba · 1 parent: 3bdf2b0

Add IQ3_KS quant perplexity and details

Files changed (1): README.md (+82, -0)
README.md CHANGED
@@ -43,6 +43,10 @@ So far these are my best recipes offering the lowest perplexity per GiB models s
  * `DeepSeek-R1-0528-IQ3_K_R4` 301GiB
    - `Final estimate: PPL = 3.2730 +/- 0.01738`
    - Fits 32k context in under 24GiB VRAM
+ * `DeepSeek-R1-0528-IQ3_KS` 282GiB
+   - `Final estimate: PPL = 3.2983 +/- 0.01759`
+   - Fits 32k context in under 16GiB VRAM
+   - Fits 64k context in under 24GiB VRAM
  * `DeepSeek-R1-0528-IQ2_K_R4` 220GiB
    - `Final estimate: PPL = 3.5069 +/- 0.01893`
    - Fits 32k context in under 16GiB VRAM
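
The `Final estimate: PPL` figures above are the closing line printed by `llama-perplexity`. A minimal sketch of such a run follows, assuming the usual wikitext-2 test file; the model path, corpus, context size, and offload flags are illustrative and not taken from this commit:

```bash
# Illustrative invocation only: paths and flags here are assumptions, not from this repo.
./build/bin/llama-perplexity \
    -m /path/to/DeepSeek-R1-0528-IQ3_KS.gguf \
    -f wiki.test.raw \
    --ctx-size 512
# ...prints per-chunk perplexity, then ends with a line like:
#   Final estimate: PPL = 3.2983 +/- 0.01759
```
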
@@ -228,6 +232,84 @@ custom=$(

  </details>

+ #### `IQ3_KS` 281.463 GiB (3.598 BPW)
+ Special mix with the all-new `IQ3_KS` for `ffn_(gate|up)_exps` and `IQ4_KS` for `ffn_down_exps` routed experts. Mostly `iq5_ks`/`iq4_ks` for attn and shared expert layers. `iq5_k` `token_embd` and `iq6_k` `output` "head".
+
+ <details>
+
+ <summary>👈 Secret Recipe</summary>
+
+ ```bash
+ #!/usr/bin/env bash
+
+ custom="
+ # First 3 dense layers (0-2) (GPU)
+ # Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
+ blk\.[0-2]\.attn_k_b.*=q5_0
+ blk\.[0-2]\.attn_.*=iq5_ks
+ blk\.[0-2]\.ffn_down.*=iq5_ks
+ blk\.[0-2]\.ffn_(gate|up).*=iq4_ks
+ blk\.[0-2]\..*=iq5_ks
+
+ # All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
+ # Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
+ blk\.[3-9]\.attn_k_b.*=q5_0
+ blk\.[1-5][0-9]\.attn_k_b.*=q5_0
+ blk\.60\.attn_k_b.*=q5_0
+
+ blk\.[3-9]\.attn_.*=iq5_ks
+ blk\.[1-5][0-9]\.attn_.*=iq5_ks
+ blk\.60\.attn_.*=iq5_ks
+
+ #blk\.[3-9]\.ffn_norm\.weight=iq5_ks
+ #blk\.[1-5][0-9]\.ffn_norm\.weight=iq5_ks
+ #blk\.60\.ffn_norm\.weight=iq5_ks
+
+ #blk\.[3-9]\.exp_probs_b\.bias=iq5_ks
+ #blk\.[1-5][0-9]\.exp_probs_b\.bias=iq5_ks
+ #blk\.60\.exp_probs_b\.bias=iq5_ks
+
+ # Shared Experts (3-60) (GPU)
+ blk\.[3-9]\.ffn_down_shexp\.weight=iq5_ks
+ blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_ks
+ blk\.60\.ffn_down_shexp\.weight=iq5_ks
+
+ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
+ blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
+ blk\.60\.ffn_(gate|up)_shexp\.weight=iq4_ks
+
+ # Routed Experts (3-60) (CPU)
+ blk\.[3-9]\.ffn_down_exps\.weight=iq4_ks
+ blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq4_ks
+ blk\.60\.ffn_down_exps\.weight=iq4_ks
+
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+ blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+ blk\.60\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+ # Put these last so output.weight doesn't catch all the attn rules above
+ # Token embedding and output tensors (GPU)
+ # Note: token_embd cannot be a repacked quant type
+ token_embd\.weight=iq5_k
+ output\.weight=iq6_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/imatrix-DeepSeek-R1-0528.dat \
+     /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-256x21B-0528-BF16-00001-of-00030.gguf \
+     /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_KS.gguf \
+     IQ3_KS \
+     24
+ ```
+
+ </details>
+
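
As an aside, the `grep`/`sed` pipeline in the recipe above only strips the `#` comment lines and joins the remaining `regex=type` rules with commas, which is the form the `--custom-q` argument is given. A rough sketch of what the flattened string ends up looking like (reconstructed from the recipe, middle entries elided, not captured from an actual run):

```bash
# Roughly what $custom holds after the grep/sed pipeline (reconstructed, middle elided).
echo "$custom"
# blk\.[0-2]\.attn_k_b.*=q5_0,blk\.[0-2]\.attn_.*=iq5_ks,...,token_embd\.weight=iq5_k,output\.weight=iq6_k
```
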
  #### `IQ2_K_R4` 2.799 BPW (220GiB)
  Special mix of `IQ3_K_R4` `ffn_down` and `IQ2_K_R4` `ffn_(up|gate)` routed experts. All other layers *roughly* `iq5_ks` for CPU+GPU offload. For max speed on *CPU-only* rigs use `--run-time-repack`, or manually offline repack if you want to mmap() off disk.
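
For the `--run-time-repack` path mentioned above, a minimal sketch of a CPU-only launch; the binary choice, paths, and every flag other than `--run-time-repack` are illustrative assumptions, not settings from this commit:

```bash
# Illustrative CPU-only launch; only --run-time-repack is taken from the note above.
./build/bin/llama-server \
    --model /path/to/DeepSeek-R1-0528-IQ2_K_R4.gguf \
    --ctx-size 32768 \
    --threads 32 \
    --run-time-repack
# Repacking at load time trades away mmap()'ing off disk; repack offline instead if you need that.
```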