Update README.md
README.md
CHANGED
@@ -251,10 +251,10 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 
 ## Results
 
-| Benchmark | | |
-
-| | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
-| Peak Memory (GB) | 8.91 | 3.95 (55.67% reduction)
+| Benchmark | | | |
+|------------------|----------------|--------------------------------|--------------------------------|
+| | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
+| Peak Memory (GB) | 8.91 | 2.98 (67% reduction) | 3.95 (55.67% reduction) |
 
 
 
@@ -311,15 +311,15 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 # Model Performance
 
 ## Results (H100 machine)
-| Benchmark (Latency) | | |
-
-| | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-AWQ-INT4
-| latency (batch_size=1) | 1.60s | 1.37s (1.17x speedup)
-| latency (batch_size=256) | 5.47s | 5.55s (0.98x speedup)
-
+| Benchmark (Latency) | | | |
+|----------------------------------|----------------|--------------------------|--------------------------|
+| | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4
+| latency (batch_size=1) | 1.60s | TODO | 1.37s (1.17x speedup) |
+| latency (batch_size=256) | 5.47s | TODO | 5.55s (0.98x speedup) |
 
 Note: it's expected that the awq-int4 checkpoint is slower when batch size is 256 since the problem is not memory bound but becomes compute bound when batch size is larger, while
 int4 weight only checkpoint is only expected to have speedup for memory bound situations.
+Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4 which is a checkpoint for H100, since the AWQ-INT4 is for the new INT4 config that's optimized for H100.
 
 <details>
 <summary> Reproduce Model Performance Results </summary>
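The percentage reductions and speedups quoted in the tables can be recomputed from the raw measurements. The sketch below is not part of the README or the commit. The helper names `percent_reduction` and `speedup` are hypothetical, and the inputs are the rounded values shown in the tables, so the last digit may differ from figures derived from unrounded timings.

```python
def percent_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is smaller than `baseline`."""
    return (1.0 - value / baseline) * 100.0


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Ratio baseline/candidate; a result above 1.0 means the candidate is faster."""
    return baseline_s / candidate_s


# Rounded figures copied from the tables (assumption: these are the only inputs available).
BASELINE_MEM_GB = 8.91
INT4_MEM_GB = 2.98
AWQ_INT4_MEM_GB = 3.95

print(f"INT4 peak memory reduction:     {percent_reduction(BASELINE_MEM_GB, INT4_MEM_GB):.2f}%")
print(f"AWQ-INT4 peak memory reduction: {percent_reduction(BASELINE_MEM_GB, AWQ_INT4_MEM_GB):.2f}%")

# Latency in seconds: (baseline, AWQ-INT4) per batch size.
latencies = {1: (1.60, 1.37), 256: (5.47, 5.55)}
for batch_size, (base_s, awq_s) in latencies.items():
    print(f"batch_size={batch_size}: {speedup(base_s, awq_s):.3f}x speedup")
```

With these rounded inputs the batch_size=256 ratio comes out at roughly 0.986, which the table reports as 0.98x; the small gap is presumably due to rounding or truncation of the underlying timings.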