Update README.md
README.md CHANGED
@@ -20,7 +20,8 @@ tags:
This repository hosts the **Phi4-mini-instruct** model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
using int4 weight-only quantization and the [awq](https://arxiv.org/abs/2306.00978) algorithm.
This work is brought to you by the PyTorch team. This model can be used directly or served using [vLLM](https://docs.vllm.ai/en/latest/) for 56% VRAM reduction (3.95 GB needed)
-and 1.17x speedup on H100 GPUs. The model is calibrated with 2 samples from `mmlu_pro` task to recover the accuracy for `mmlu_pro` specifically.
+and 1.17x speedup on H100 GPUs. The model is calibrated with 2 samples from the `mmlu_pro` task to recover accuracy on `mmlu_pro` specifically. This recovers `mmlu_pro` accuracy
+from 36.74 (plain INT4 checkpoint) to 43.13, while the bfloat16 baseline is 46.43.

# Inference with vLLM
Install vllm nightly and torchao nightly to get some recent changes:
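For reference, a minimal sketch of what offline inference with this checkpoint could look like under the vLLM setup described above; the prompt and sampling settings are illustrative assumptions and not part of the original card:

```python
# Hypothetical example: offline generation with the AWQ-INT4 checkpoint via vLLM.
# Assumes recent vllm and torchao nightlies, as suggested above.
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-AWQ-INT4")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Generate a completion for a single prompt and print the text.
outputs = llm.generate(["Why is the sky blue?"], sampling_params)
print(outputs[0].outputs[0].text)
```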
@@ -253,8 +254,8 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -

| Benchmark        |                |                                |                                |
|------------------|----------------|--------------------------------|--------------------------------|
-|                  | microsoft/Phi-4-mini-instruct |
-| Peak Memory (GB) | 8.91           |
+|                  | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
+| Peak Memory (GB) | 8.91           | 3.02 (66% reduction)           | 3.95 (55.67% reduction)        |

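For reference, a hedged sketch of how a peak-memory comparison like the table above could be reproduced; the prompt and generation settings are assumptions, and the final print mirrors the `Peak Memory Usage` line visible in the next hunk's context:

```python
# Hypothetical example: measure peak GPU memory while generating with a given checkpoint.
# Swap model_id between the baseline and the quantized checkpoints to compare.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-AWQ-INT4"  # or "microsoft/Phi-4-mini-instruct"

torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)

mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```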
@@ -314,12 +315,12 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
| Benchmark (Latency)              |                |                          |                          |
|----------------------------------|----------------|--------------------------|--------------------------|
|                                  | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4
-| latency (batch_size=1)           | 1.61s          | 1.
-| latency (batch_size=256)         | 5.31s          |
+| latency (batch_size=1)           | 1.61s          | 1.33s (1.21x speedup)    | 1.37s (1.17x speedup)    |
+| latency (batch_size=256)         | 5.31s          | 5.38s (0.99x speedup)    | 5.44s (0.98x speedup)    |

Note: it's expected that the awq-int4 checkpoint is slower when batch size is 256, since the problem is not memory bound but becomes compute bound at larger batch sizes, while the
int4 weight-only checkpoint is only expected to have speedup in memory-bound situations.
-Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4 which is a checkpoint for H100, since the AWQ-INT4 is using the new INT4 config that's optimized for H100. It's possible to generate
+Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4, which is a checkpoint for H100, since AWQ-INT4 uses the new INT4 config that's optimized for H100 and doesn't regress performance at batch size 256. It's possible to generate
AWQ-INT4 for A100 as well using `Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")`.

<details>
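For reference, a hedged sketch of how the A100-oriented config from the note above might be passed through transformers' torchao integration; it omits the AWQ calibration step described earlier in the card, and everything other than the quoted `Int4WeightOnlyConfig` arguments is an assumption:

```python
# Hypothetical example: build the A100-oriented base INT4 config quoted above and use it
# to quantize the bfloat16 model via transformers' TorchAoConfig. The AWQ calibration on
# `mmlu_pro` samples described earlier in this card is not shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "microsoft/Phi-4-mini-instruct"

# Base config from the note above (A100-friendly packing and HQQ qparams selection).
base_config = Int4WeightOnlyConfig(
    group_size=128,
    int4_packing_format="tile_packed_to_4d",
    int4_choose_qparams_algorithm="hqq",
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=TorchAoConfig(quant_type=base_config),
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```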