Update README.md
README.md CHANGED
@@ -20,7 +20,8 @@ tags:
This repository hosts the **Phi4-mini-instruct** model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
using int4 weight-only quantization and the [awq](https://arxiv.org/abs/2306.00978) algorithm.
This work is brought to you by the PyTorch team. This model can be used directly or served using [vLLM](https://docs.vllm.ai/en/latest/) for 56% VRAM reduction (3.95 GB needed)
-and 1.17x speedup on H100 GPUs. The model is calibrated with 2 samples from `mmlu_pro` task to recover the accuracy for `mmlu_pro` specifically.
+and 1.17x speedup on H100 GPUs. The model is calibrated with 2 samples from the `mmlu_pro` task to recover accuracy on `mmlu_pro` specifically. This recovers `mmlu_pro` accuracy
+from 36.74 (plain INT4 checkpoint) to 43.13, while the bfloat16 baseline is 46.43.

# Inference with vLLM
Install vllm nightly and torchao nightly to get some recent changes:
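For reference, a minimal sketch of what offline inference with this checkpoint could look like under the vLLM setup described above; the prompt and sampling settings are illustrative assumptions and not part of the original card:

```python
# Hypothetical example: offline generation with the AWQ-INT4 checkpoint via vLLM.
# Assumes recent vllm and torchao nightlies, as suggested above.
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-AWQ-INT4")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Generate a completion for a single prompt and print the text.
outputs = llm.generate(["Why is the sky blue?"], sampling_params)
print(outputs[0].outputs[0].text)
```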
@@ -253,8 +254,8 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -

| Benchmark        |                |                                |                                |
|------------------|----------------|--------------------------------|--------------------------------|
-|                  | microsoft/Phi-4-mini-instruct |
-| Peak Memory (GB) | 8.91           |
+|                  | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
+| Peak Memory (GB) | 8.91           | 3.02 (66% reduction)           | 3.95 (55.67% reduction)        |

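For reference, a hedged sketch of how a peak-memory comparison like the table above could be reproduced; the prompt and generation settings are assumptions, and the final print mirrors the `Peak Memory Usage` line visible in the next hunk's context:

```python
# Hypothetical example: measure peak GPU memory while generating with a given checkpoint.
# Swap model_id between the baseline and the quantized checkpoints to compare.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-AWQ-INT4"  # or "microsoft/Phi-4-mini-instruct"

torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)

mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```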
@@ -314,12 +315,12 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
| Benchmark (Latency)              |                |                          |                          |
|----------------------------------|----------------|--------------------------|--------------------------|
|                                  | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4
-| latency (batch_size=1)           | 1.61s          | 1.
-| latency (batch_size=256)         | 5.31s          |
+| latency (batch_size=1)           | 1.61s          | 1.33s (1.21x speedup)    | 1.37s (1.17x speedup)    |
+| latency (batch_size=256)         | 5.31s          | 5.38s (0.99x speedup)    | 5.44s (0.98x speedup)    |

Note: it's expected that the awq-int4 checkpoint is slower when batch size is 256, since the problem is not memory bound but becomes compute bound at larger batch sizes, while the
int4 weight-only checkpoint is only expected to have speedup in memory-bound situations.
-Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4 which is a checkpoint for H100, since the AWQ-INT4 is using the new INT4 config that's optimized for H100. It's possible to generate
+Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4, which is a checkpoint for H100, since AWQ-INT4 uses the new INT4 config that's optimized for H100 and doesn't regress performance at batch size 256. It's possible to generate
AWQ-INT4 for A100 as well using `Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")`.

<details>
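For reference, a hedged sketch of how the A100-oriented config from the note above might be passed through transformers' torchao integration; it omits the AWQ calibration step described earlier in the card, and everything other than the quoted `Int4WeightOnlyConfig` arguments is an assumption:

```python
# Hypothetical example: build the A100-oriented base INT4 config quoted above and use it
# to quantize the bfloat16 model via transformers' TorchAoConfig. The AWQ calibration on
# `mmlu_pro` samples described earlier in this card is not shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "microsoft/Phi-4-mini-instruct"

# Base config from the note above (A100-friendly packing and HQQ qparams selection).
base_config = Int4WeightOnlyConfig(
    group_size=128,
    int4_packing_format="tile_packed_to_4d",
    int4_choose_qparams_algorithm="hqq",
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=TorchAoConfig(quant_type=base_config),
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```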