ubergarm/DeepSeek-R1-0528-GGUF

@@ -43,6 +43,10 @@ So far these are my best recipes offering the lowest perplexity per GiB models s
 * `DeepSeek-R1-0528-IQ3_K_R4` 301GiB
   - `Final estimate: PPL = 3.2730 +/- 0.01738`
   - Fits 32k context in under 24GiB VRAM
+* `DeepSeek-R1-0528-IQ3_KS` 282 GiB
+  - Final estimate: PPL = 3.2983 +/- 0.01759
+  - Fits 32k context in under 16GiB VRAM
+  - Fits 64k context in under 24GiB VRAM
 * `DeepSeek-R1-0528-IQ2_K_R4` 220GiB
   - `Final estimate: PPL = 3.5069 +/- 0.01893`
   - Fits 32k context in under 16GiB VRAM
@@ -228,6 +232,84 @@ custom=$(
 </details>
+#### `IQ3_KS` 281.463 GiB (3.598 BPW)
+Special mix with all new `IQ3_KS` `ffn_(gate|up)_exps` and `IQ4_KS` `ffn_down_exps` routed experts. Mostly `iq5_ks/iq4_ks` for attn and shared expert. `iq5_k` `token_embd` and `iq6_k` `output` "head".
+<details>
+<summary>👈 Secret Recipe</summary>
+```bash
+#!/usr/bin/env bash
+custom="
+# First 3 dense layers (0-2) (GPU)
+# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
+blk\.[0-2]\.attn_k_b.*=q5_0
+blk\.[0-2]\.attn_.*=iq5_ks
+blk\.[0-2]\.ffn_down.*=iq5_ks
+blk\.[0-2]\.ffn_(gate|up).*=iq4_ks
+blk\.[0-2]\..*=iq5_ks
+# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
+# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
+blk\.[3-9]\.attn_k_b.*=q5_0
+blk\.[1-5][0-9]\.attn_k_b.*=q5_0
+blk\.60\.attn_k_b.*=q5_0
+blk\.[3-9]\.attn_.*=iq5_ks
+blk\.[1-5][0-9]\.attn_.*=iq5_ks
+blk\.60\.attn_.*=iq5_ks
+#blk\.[3-9]\.ffn_norm\.weight=iq5_ks
+#blk\.[1-5][0-9]\.ffn_norm\.weight=iq5_ks
+#blk\.60\.ffn_norm\.weight=iq5_ks
+#blk\.[3-9]\.exp_probs_b\.bias=iq5_ks
+#blk\.[1-5][0-9]\.exp_probs_b\.bias=iq5_ks
+#blk\.60\.exp_probs_b\.bias=iq5_ks
+# Shared Experts (3-60) (GPU)
+blk\.[3-9]\.ffn_down_shexp\.weight=iq5_ks
+blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_ks
+blk\.60\.ffn_down_shexp\.weight=iq5_ks
+blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
+blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
+blk\.60\.ffn_(gate|up)_shexp\.weight=iq4_ks
+# Routed Experts (3-60) (CPU)
+blk\.[3-9]\.ffn_down_exps\.weight=iq4_ks
+blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq4_ks
+blk\.60\.ffn_down_exps\.weight=iq4_ks
+blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+blk\.60\.ffn_(gate|up)_exps\.weight=iq3_ks
+# put last so output weight doesn't catch all the attn ones
+# Token embedding and output tensors (GPU)
+# note token_embd cannot be repacked quant type
+token_embd\.weight=iq5_k
+output\.weight=iq6_k
+"
+custom=$(
+  echo "$custom" | grep -v '^#' | \
+  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+./build/bin/llama-quantize \
+    --custom-q "$custom" \
+    --imatrix /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/imatrix-DeepSeek-R1-0528.dat \
+    /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-256x21B-0528-BF16-00001-of-00030.gguf \
+    /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_KS.gguf \
+    IQ3_KS \
+    24
+```
+</details>
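The `custom=$( ... )` step in the recipe turns the commented, one-rule-per-line string into the single comma-separated value passed to `--custom-q`. A minimal sketch of that step on a three-rule sample follows; the `sample` and `flat` variable names are made up for illustration, `printf` stands in for the recipe's `echo` so backslashes survive in any shell, and GNU `sed` is assumed for the `-z` option.

```bash
#!/usr/bin/env bash
# Illustrative sample only, not the full recipe.
sample="
# comment lines like this one are dropped
blk\.[0-2]\.attn_.*=iq5_ks

token_embd\.weight=iq5_k
output\.weight=iq6_k
"

# 1. grep -v '^#' removes the comment lines.
# 2. sed -z reads the whole input as one record, so runs of newlines
#    (including blank lines) can be collapsed into single commas;
#    the leading and trailing commas are then stripped.
flat=$(
  printf '%s\n' "$sample" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$flat"
```

This should print the three rules joined by commas on one line, in the same order they appear in the string. The order is kept deliberately: the recipe's own comments ("put last so output weight doesn't catch all the attn ones") indicate that it matters which pattern is listed first.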
 #### `IQ2_K_R4` 2.799 BPW (220GiB)
 Special mix `IQ3_K_R4` `ffn_down` and `IQ2_K_R4` `ffn_(up|gate)` routed experts. All other layers *roughly* `iq5_ks` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack` or manually offline repack if you want to mmap() off disk.
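
The size and BPW figures in these headings can be cross-checked against each other, since file size in bits divided by bits per weight gives an approximate weight count. Here is a rough sketch using the `IQ3_KS` numbers (281.463 GiB, 3.598 BPW). It assumes BPW is averaged over every tensor in the file and ignores metadata overhead; the variable names are illustrative.

```bash
#!/usr/bin/env bash
# Figures taken from the IQ3_KS heading.
size_gib=281.463
bpw=3.598

# weights ~= size in bits / bits per weight
# (1 GiB = 1024^3 bytes, 8 bits per byte); result reported in billions.
params_b=$(awk -v s="$size_gib" -v b="$bpw" \
  'BEGIN { printf "%.1f", s * 1024 * 1024 * 1024 * 8 / b / 1e9 }')

echo "approx. ${params_b} billion weights"
```

The result comes out at roughly 672 billion. Repeating the calculation with the rounded `IQ2_K_R4` figures (220 GiB, 2.799 BPW) gives a slightly different number, because that heading states the size to the nearest GiB.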