jerryzh168 committed · verified
Commit 4a9158a · Parent(s): a2f67ff

Update README.md

Files changed (1): README.md (+7 −6)
README.md CHANGED
@@ -20,7 +20,8 @@ tags:
  This repository hosts the **Phi4-mini-instruct** model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
  using int4 weight-only quantization and the [awq](https://arxiv.org/abs/2306.00978) algorithm.
  This work is brought to you by the PyTorch team. This model can be used directly or served using [vLLM](https://docs.vllm.ai/en/latest/) for 56% VRAM reduction (3.95 GB needed)
- and 1.17x speedup on H100 GPUs. The model is calibrated with 2 samples from the `mmlu_pro` task to recover accuracy on `mmlu_pro` specifically.
+ and 1.17x speedup on H100 GPUs. The model is calibrated with 2 samples from the `mmlu_pro` task to recover accuracy on `mmlu_pro` specifically. This recovers `mmlu_pro` accuracy
+ from 36.74 (INT4 checkpoint) to 43.13, while the bfloat16 accuracy is 46.43.

  # Inference with vLLM
  Install vllm nightly and torchao nightly to get some recent changes:
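The "Inference with vLLM" section that follows in the README is elided from this diff. Below is a minimal sketch of the offline serving flow it describes, using vLLM's Python API; the model id is this repo's checkpoint, while the prompt and sampling settings are placeholders:

```python
# Minimal sketch of offline inference with vLLM; assumes a recent vllm
# nightly with torchao quantization support, per the README's note above.
from vllm import LLM, SamplingParams

# Model id is this repo's checkpoint; prompt and sampling values are placeholders.
llm = LLM(model="pytorch/Phi-4-mini-instruct-AWQ-INT4")
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Briefly explain what AWQ quantization does."], params)
print(outputs[0].outputs[0].text)
```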
@@ -253,8 +254,8 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -

  | Benchmark | | | |
  |------------------|----------------|--------------------------------|--------------------------------|
- | | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
- | Peak Memory (GB) | 8.91 | 2.98 (67% reduction) | 3.95 (55.67% reduction) |
+ | | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
+ | Peak Memory (GB) | 8.91 | 3.02 (66% reduction) | 3.95 (55.67% reduction) |

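The peak-memory figures above appear to come from the measurement snippet referenced in the next hunk header (`print(f"Peak Memory Usage: {mem:.02f} GB")`). A minimal sketch of one way to reproduce that measurement, assuming the statistic used is `torch.cuda.max_memory_reserved` (an assumption; `max_memory_allocated` would be equally plausible):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-AWQ-INT4"  # this repo's checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda:0", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

torch.cuda.reset_peak_memory_stats()  # start peak tracking from a clean slate
inputs = tokenizer("What is 2 + 2?", return_tensors="pt").to("cuda:0")
model.generate(**inputs, max_new_tokens=32)

mem = torch.cuda.max_memory_reserved() / 1e9  # bytes -> GB
print(f"Peak Memory Usage: {mem:.02f} GB")
```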
 
@@ -314,12 +315,12 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
  | Benchmark (Latency) | | | |
  |----------------------------------|----------------|--------------------------|--------------------------|
  | | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
- | latency (batch_size=1) | 1.61s | 1.09s (1.48x speedup) | 1.37s (1.17x speedup) |
- | latency (batch_size=256) | 5.31s | 32.33s | 5.44s (0.98x speedup) |
+ | latency (batch_size=1) | 1.61s | 1.33s (1.21x speedup) | 1.37s (1.17x speedup) |
+ | latency (batch_size=256) | 5.31s | 5.38s (0.99x speedup) | 5.44s (0.98x speedup) |

  Note: it's expected that the awq-int4 checkpoint is slower at batch size 256, since the problem is no longer memory bound but becomes compute bound at larger batch sizes, while the
  int4 weight-only checkpoint is only expected to show a speedup in memory-bound situations.
- Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4, which is a checkpoint for H100, since the AWQ-INT4 checkpoint uses the new INT4 config that's optimized for H100. It's possible to generate
+ Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4, which is a checkpoint for H100, since the AWQ-INT4 checkpoint uses the new INT4 config that's optimized for H100 and doesn't regress performance at batch size 256. It's possible to generate
  an AWQ-INT4 checkpoint for A100 as well using `Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")`.

  <details>
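A minimal sketch of the A100 recipe hinted at in the note above. The `Int4WeightOnlyConfig` arguments are quoted from the README; wrapping the config in transformers' `TorchAoConfig(quant_type=...)`, the nightly-version requirement, and the destination repo name are assumptions, and the AWQ calibration step is omitted entirely:

```python
# Sketch of building an A100-friendly INT4 checkpoint with the config quoted
# above. Assumes recent torchao/transformers nightlies; the AWQ calibration
# step from the paper is omitted here, so this is plain int4 weight-only.
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

base_config = Int4WeightOnlyConfig(
    group_size=128,
    int4_packing_format="tile_packed_to_4d",
    int4_choose_qparams_algorithm="hqq",
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    device_map="auto",
    torch_dtype="auto",
    quantization_config=TorchAoConfig(quant_type=base_config),
)
# "your-username/..." is a placeholder destination repo.
model.push_to_hub("your-username/Phi-4-mini-instruct-INT4-A100")
```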
 