Update README.md

README.md CHANGED

```diff
@@ -219,7 +219,7 @@ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-h
 | Benchmark | | | |
 |----------------------------------|------------------------|-----------------------------|---------------------------------|
 | | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
-
+| professional_law | TODO | 54.24 | TODO |
 
 
 <details>
@@ -250,7 +250,7 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 | Benchmark | | | |
 |----------------------------------|------------------------|-----------------------------|---------------------------------|
 | | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
-| Peak Memory (GB) | 24.50 | 8.
+| Peak Memory (GB) | 24.50 | 8.57 (65% reduction) | 12.71 (48% reduction) |
 
 
 <details>
@@ -308,12 +308,14 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 ## Results (H100 machine)
 
 
-| Benchmark (Latency) | |
-
-| | google/gemma-3-12b-it |
-| latency (batch_size=1) | 3.73s |
-| latency (batch_size=256) |
+| Benchmark (Latency) | | | |
+|----------------------------------|------------------------|--------------------------------|---------------------------------|
+| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+| latency (batch_size=1) | 3.73s | 2.76s (1.35x speedup) | 2.76s (1.35x speedup) |
+| latency (batch_size=256) | 13.63s | 14.32s (0.95x speedup) | 14.30s (0.95x speedup) |
+
 
+Note: jerryzh168/gemma-3-12b-it-INT4 is the H100-optimized INT4 checkpoint.
 <details>
 <summary> Reproduce Model Performance Results </summary>
 
```
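The "reduction" and "speedup" percentages added in this diff are derived from the raw numbers in the same rows. As a sanity check, here is a minimal sketch of that arithmetic, assuming the conventions the tables imply: memory reduction is `(baseline - quantized) / baseline`, and speedup is `baseline_latency / quantized_latency` (the helper names are illustrative, not from the README):

```python
def reduction_pct(baseline_gb: float, quantized_gb: float) -> int:
    """Percent peak-memory reduction relative to the baseline checkpoint."""
    return round(100 * (baseline_gb - quantized_gb) / baseline_gb)

def speedup(baseline_s: float, quantized_s: float) -> float:
    """Latency speedup of the quantized checkpoint over the baseline."""
    return round(baseline_s / quantized_s, 2)

# Values taken from the tables above, rounded to the precision shown there.
print(reduction_pct(24.50, 8.57))   # INT4 peak memory -> 65
print(reduction_pct(24.50, 12.71))  # AWQ-INT4 peak memory -> 48
print(speedup(3.73, 2.76))          # batch_size=1 -> 1.35
print(speedup(13.63, 14.32))        # batch_size=256 -> 0.95
```

Note that speedups below 1.0, as in the batch_size=256 row, mean the quantized checkpoint is slightly slower than the baseline at that batch size.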