Update README.md
README.md CHANGED
@@ -31,7 +31,7 @@ On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 MB of memory.
# Quantization Recipe

First, we need to install the required packages:
-```
+```Shell
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
```
@@ -39,7 +39,7 @@ pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/c
## Untie Embedding Weights
Before quantization, we first need to untie the model: the input embedding and unembedding (lm_head) layers are tied, but we want to quantize them separately.

-```
+```Py
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
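The hunk stops at the imports, so the body of the untying step is elided by the diff context. As a rough sketch only (the authoritative code is in the README itself; the destination repo name and the weight-copy details below are assumptions, while `save_to` and the `push_to_hub` calls are visible in the next hunk header), the step amounts to:

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Give lm_head its own copy of the shared embedding weights, then record
# in the config that the weights are no longer tied.
model.lm_head.weight = torch.nn.Parameter(model.lm_head.weight.clone())
model.config.tie_word_embeddings = False

# Hypothetical destination repo; the README pushes both model and tokenizer
# to `save_to` so the quantization step can load the untied checkpoint.
save_to = "your-username/Phi-4-mini-instruct-untied"
model.push_to_hub(save_to)
tokenizer.push_to_hub(save_to)
```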
@@ -73,7 +73,7 @@ tokenizer.push_to_hub(save_to)

We used the following code to get the quantized model:

-```
+```Py
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
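Again the hunk is cut off at the imports. A minimal sketch of the 8da4w step, assuming torchao's `Int8DynamicActivationInt4WeightConfig` passed through transformers' `TorchAoConfig` integration (the group size and repo names here are assumptions, not the README's exact values):

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int8DynamicActivationInt4WeightConfig

# int8 dynamic activations + int4 grouped weights (8da4w), applied at load time.
quant_config = TorchAoConfig(
    quant_type=Int8DynamicActivationInt4WeightConfig(group_size=32)  # group size is an assumption
)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/Phi-4-mini-instruct-untied",  # untied checkpoint from the previous step
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct")

# Publish the quantized artifacts; the README's published copy is
# pytorch/Phi-4-mini-instruct-8da4w.
save_to = "your-username/Phi-4-mini-instruct-8da4w"
model.push_to_hub(save_to)
tokenizer.push_to_hub(save_to)
```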
@@ -156,12 +156,12 @@ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-h
lm-eval needs to be installed from source: https://github.com/EleutherAI/lm-evaluation-harness#install

## baseline
-```
+```Shell
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 64
```

## int8 dynamic activation and int4 weight quantization (8da4w)
-```
+```Shell
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-8da4w --tasks hellaswag --device cuda:0 --batch_size 64
```
@@ -195,13 +195,13 @@ Once ExecuTorch is [set-up](https://pytorch.org/executorch/main/getting-started.

We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/pytorch_model.bin) to the format that ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
The following script does this for you. We have uploaded the converted checkpoint [phi4-mini-8da4w-converted.bin](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w-converted.bin) for convenience.
-```
+```Shell
python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin phi4-mini-8da4w-converted.bin
```

Once the checkpoint is converted, we can export to ExecuTorch's PTE format with the XNNPACK delegate.

-```
+```Shell
PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
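The diff context cuts the export command off after its first flag. For orientation only, a fuller invocation might look like the sketch below; every flag past `--model` is an assumption drawn from ExecuTorch's llama export examples, so consult the script's `--help` output for the real interface:

```Shell
PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint "phi4-mini-8da4w-converted.bin" \
  --params "$PARAMS" \
  -kv \
  -X \
  --output_name "phi4-mini-8da4w.pte"
```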