pytorch/Phi-4-mini-instruct-AWQ-INT4

jerryzh168 committed on Sep 14

Commit 32d06e3 · verified · 1 Parent(s): 428e083

Update README.md

Files changed (1)

README.md +10 -10

@@ -251,10 +251,10 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 ## Results
-| Benchmark        |                |                                |
-|------------------|----------------|--------------------------------|
-|                  | microsoft/Phi-4-mini-instruct   | pytorch/Phi-4-mini-instruct-AWQ-INT4              |
-| Peak Memory (GB) | 8.91   | 3.95 (55.67% reduction)    |
+| Benchmark        |                |                                |                                |
+|------------------|----------------|--------------------------------|--------------------------------|
+|                  | microsoft/Phi-4-mini-instruct   | pytorch/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4              |
+| Peak Memory (GB) | 8.91   | 2.98 (67% reduction)    | 3.95 (55.67% reduction) |
@@ -311,15 +311,15 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 # Model Performance
 ## Results (H100 machine)
-| Benchmark (Latency)              |                |                          |
-|----------------------------------|----------------|--------------------------|
-|                                  | microsoft/Phi-4-mini-instruct   | pytorch/Phi-4-mini-instruct-AWQ-INT4        |
-| latency (batch_size=1)           | 1.60s          | 1.37s (1.17x speedup)    |
-| latency (batch_size=256)         | 5.47s          | 5.55s (0.98x speedup)    |
+| Benchmark (Latency)              |                |                          |                          |
+|----------------------------------|----------------|--------------------------|--------------------------|
+|                                  | microsoft/Phi-4-mini-instruct   | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4
+| latency (batch_size=1)           | 1.60s          | TODO    | 1.37s (1.17x speedup) |
+| latency (batch_size=256)         | 5.47s          | TODO    | 5.55s (0.98x speedup) |
 Note: it's expected that the awq-int4 checkpoint is slower when batch size is 256 since the problem is not memory bound but becomes compute bound when batch size is larger, while
 int4 weight only checkpoint is only expected to have speedup for memory bound situations.
+Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4 which is a checkpoint for H100, since the AWQ-INT4 is for the new INT4 config that's optimized for H100.
 <details>
 <summary> Reproduce Model Performance Results </summary>
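
For readers checking the tables in this diff, the "reduction" and "speedup" annotations can be re-derived from the raw figures. The following is a minimal sketch, not code from the README: the helper names `percent_reduction` and `speedup` are hypothetical, and the inputs are the rounded values as printed in the tables.

```python
def percent_reduction(baseline_gb: float, quantized_gb: float) -> float:
    """Share of the baseline peak memory saved by the quantized checkpoint, in percent."""
    return (baseline_gb - quantized_gb) / baseline_gb * 100.0


def speedup(baseline_s: float, quantized_s: float) -> float:
    """Baseline latency divided by quantized latency; values below 1 mean a slowdown."""
    return baseline_s / quantized_s


# Peak memory (GB), taken from the "Results" table.
baseline_mem = 8.91
print(f"AWQ-INT4 memory reduction: {percent_reduction(baseline_mem, 3.95):.2f}%")
print(f"INT4 memory reduction:     {percent_reduction(baseline_mem, 2.98):.0f}%")

# Latency (seconds), taken from the "Results (H100 machine)" table.
print(f"batch_size=1 speedup:   {speedup(1.60, 1.37):.2f}x")
print(f"batch_size=256 speedup: {speedup(5.47, 5.55):.2f}x")
```

With these rounded inputs the batch_size=256 ratio comes out near 0.99 rather than the 0.98x shown in the table; the table value was presumably computed from unrounded timings, which is an assumption. Either way the ratio is below 1, consistent with the note about the compute-bound case.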