Update README.md
README.md CHANGED
@@ -24,7 +24,7 @@ We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/model.pte)
(The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a mobile app
-The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/
+The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/model.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 MB of memory.

![image](image)
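As a side note, if you want to fetch the exported pte file programmatically before bundling it into the app, here is a minimal sketch using `huggingface_hub`; the repo id and filename are taken from the pte link above.

```python
# Sketch: download the exported pte from the Hugging Face Hub.
# Repo id and filename come from the pte link above.
from huggingface_hub import hf_hub_download

pte_path = hf_hub_download(
    repo_id="pytorch/Phi-4-mini-instruct-INT8-INT4",
    filename="model.pte",
)
print(pte_path)  # local cache path to hand to the demo app or runner
```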
@@ -120,10 +120,10 @@ linear_config = Int8DynamicActivationIntxWeightConfig(
weight_scale_dtype=torch.bfloat16,
)
quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
-quantization_config = TorchAoConfig(quant_type=quant_config,
+quantization_config = TorchAoConfig(quant_type=quant_config, include_input_output_embeddings=True, modules_to_not_convert=[])

# either use `untied_model_id` or `untied_model_local_path`
-quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id,
+quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
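Before pushing to the hub, a quick end-to-end generation is a useful sanity check. A minimal sketch, assuming the `quantized_model` and `tokenizer` objects from the snippet above (the prompt is arbitrary):

```python
import torch

# Assumes `quantized_model` and `tokenizer` from the quantization snippet above.
messages = [{"role": "user", "content": "What is 15 * 7?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(quantized_model.device)

with torch.no_grad():
    output = quantized_model.generate(input_ids, max_new_tokens=32)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```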
@@ -212,15 +212,15 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-INT8-INT4
We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.

+ExecuTorch's LLM export scripts require the checkpoint keys and parameters to have certain names, which differ from those used in Hugging Face.
+So we first use a conversion script that converts the Hugging Face checkpoint key names to the ones ExecuTorch expects:
```Shell
-python -m executorch.examples.models.phi_4_mini.convert_weights $HF_MODEL_DIR pytorch_model_converted.bin
+python -m executorch.examples.models.phi_4_mini.convert_weights $(hf download pytorch/Phi-4-mini-instruct-INT8-INT4) pytorch_model_converted.bin
```

-Once the checkpoint
+Once we have the checkpoint, we export it to ExecuTorch with the XNNPACK backend, using a max_seq_length/max_context_length of 1024, as follows.
+
+(Note: the ExecuTorch LLM export script requires config.json to have certain key names. The correct config to use for the LLM export script is located at examples/models/phi_4_mini/config/config.json within the ExecuTorch repo.)

```Shell
python -m executorch.examples.models.llama.export_llama \