Update README.md

README.md CHANGED

@@ -17,14 +17,14 @@ base_model:
 pipeline_tag: text-generation
 ---
 
-[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (8da4w).
+[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (INT8-INT4).
 The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
 
-We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) for direct use in ExecuTorch.
+We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/phi4-mini-8da4w.pte) for direct use in ExecuTorch.
 (The provided pte file is exported with the default max_seq_length/max_context_length of 128; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
 
 # Running in a mobile app
-The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
+The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 Mb of memory.
 
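For readers following along, here is a minimal sketch of how an 8-bit-dynamic-activation, 4-bit-weight (8da4w-style) quantization can be produced with torchao. This is an illustration under stated assumptions, not the exact recipe behind this checkpoint (which also quantizes embeddings to 8 bits, handled by a separate config):

```python
# Sketch only: quantize nn.Linear modules to int8 dynamic activations +
# int4 grouped weights with torchao. The exact config used for this
# checkpoint (including its 8-bit embedding quantization) may differ,
# and group_size=32 is an assumed value.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8DynamicActivationInt4WeightConfig, quantize_

model_id = "microsoft/Phi-4-mini-instruct"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# In-place quantization of the model's linear layers.
quantize_(quantized_model, Int8DynamicActivationInt4WeightConfig(group_size=32))
```
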
@@ -128,7 +128,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 # Push to hub
 MODEL_NAME = model_id.split("/")[-1]
-save_to = f"{USER_ID}/{MODEL_NAME}-8da4w"
+save_to = f"{USER_ID}/{MODEL_NAME}-INT8-INT4"
 quantized_model.push_to_hub(save_to, safe_serialization=False)
 tokenizer.push_to_hub(save_to)
 
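Once pushed, the checkpoint loads back like any Transformers model (torchao must be installed to deserialize the quantized weights); a small usage sketch, with the repo name assumed:

```python
# Sketch: load the quantized checkpoint from the Hub and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "pytorch/Phi-4-mini-instruct-INT8-INT4"  # or your own f"{USER_ID}/{MODEL_NAME}-INT8-INT4"
quantized_model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(repo)

inputs = tokenizer("What is 2 + 2?", return_tensors="pt").to(quantized_model.device)
output = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
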
@@ -171,7 +171,7 @@ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
 
 | Benchmark                        |                |                               |
 |----------------------------------|----------------|-------------------------------|
-|                                  | Phi-4-mini-ins | Phi-4-mini-instruct-8da4w     |
+|                                  | Phi-4-mini-ins | Phi-4-mini-instruct-INT8-INT4 |
 | **Popular aggregated benchmark** |                |                               |
 | mmlu (0 shot)                    | 66.73          | 60.75                         |
 | mmlu_pro (5-shot)                | 46.43          | 11.75                         |
@@ -201,9 +201,9 @@ Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness
 lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
 ```
 
-## int8 dynamic activation and int4 weight quantization (8da4w)
+## int8 dynamic activation and int4 weight quantization (INT8-INT4)
 ```Shell
-lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-8da4w --tasks hellaswag --device cuda:0 --batch_size 8
+lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-INT8-INT4 --tasks hellaswag --device cuda:0 --batch_size 8
 ```
 </details>
 
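The harness can also be driven from Python via `lm_eval.simple_evaluate`, which is handy for scripting the same comparison; a hedged sketch:

```python
# Sketch: the same hellaswag eval through lm-evaluation-harness's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-INT8-INT4",
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
print(results["results"]["hellaswag"])
```
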
@@ -212,8 +212,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-8da4w --tasks hellaswag --device cuda:0 --batch_size 8
 We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
 Once ExecuTorch is [set-up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
 
-We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/pytorch_model.bin) to one that ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
-The following script does this for you. We have uploaded the converted checkpoint [pytorch_model_converted.bin](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/pytorch_model_converted.bin) for convenience.
+We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/pytorch_model.bin) to one that ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
+The following script does this for you. We have uploaded the converted checkpoint [pytorch_model_converted.bin](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/pytorch_model_converted.bin) for convenience.
 ```Shell
 python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin pytorch_model_converted.bin
 ```
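The conversion itself is just a state-dict key rename; a schematic sketch of the idea (the real mapping lives in `executorch.examples.models.phi_4_mini.convert_weights`, and the rename rule below is an illustrative placeholder, not the actual mapping):

```python
# Schematic only: rename checkpoint keys into the layout ExecuTorch's LLM
# export script expects. The real mapping is implemented in
# executorch.examples.models.phi_4_mini.convert_weights; the rule below is
# a placeholder for illustration.
import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
converted = {}
for key, value in state_dict.items():
    # e.g. strip the HF "model." prefix (illustrative, not the actual rule)
    converted[key.removeprefix("model.")] = value
torch.save(converted, "pytorch_model_converted.bin")
```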