Update README.md

README.md CHANGED

@@ -17,14 +17,14 @@ base_model:
 pipeline_tag: text-generation
 ---
 
-[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (8da4w).
+[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (INT8-INT4).
 The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
 
-We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) for direct use in ExecuTorch.
+We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/phi4-mini-8da4w.pte) for direct use in ExecuTorch.
 (The provided pte file is exported with the default max_seq_length/max_context_length of 128; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
 
 # Running in a mobile app
-The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
+The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 Mb of memory.
 
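For readers following along, here is a minimal sketch of how an 8-bit-dynamic-activation, 4-bit-weight (8da4w-style) quantization can be produced with torchao. This is an illustration under stated assumptions, not the exact recipe behind this checkpoint (which also quantizes embeddings to 8 bits, handled by a separate config):

```python
# Sketch only: quantize nn.Linear modules to int8 dynamic activations +
# int4 grouped weights with torchao. The exact config used for this
# checkpoint (including its 8-bit embedding quantization) may differ,
# and group_size=32 is an assumed value.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8DynamicActivationInt4WeightConfig, quantize_

model_id = "microsoft/Phi-4-mini-instruct"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# In-place quantization of the model's linear layers.
quantize_(quantized_model, Int8DynamicActivationInt4WeightConfig(group_size=32))
```
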
@@ -128,7 +128,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 # Push to hub
 MODEL_NAME = model_id.split("/")[-1]
-save_to = f"{USER_ID}/{MODEL_NAME}-8da4w"
+save_to = f"{USER_ID}/{MODEL_NAME}-INT8-INT4"
 quantized_model.push_to_hub(save_to, safe_serialization=False)
 tokenizer.push_to_hub(save_to)
 
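Once pushed, the checkpoint loads back like any Transformers model (torchao must be installed to deserialize the quantized weights); a small usage sketch, with the repo name assumed:

```python
# Sketch: load the quantized checkpoint from the Hub and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "pytorch/Phi-4-mini-instruct-INT8-INT4"  # or your own f"{USER_ID}/{MODEL_NAME}-INT8-INT4"
quantized_model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(repo)

inputs = tokenizer("What is 2 + 2?", return_tensors="pt").to(quantized_model.device)
output = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
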
@@ -171,7 +171,7 @@ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
 
 | Benchmark                        |                |                               |
 |----------------------------------|----------------|-------------------------------|
-|                                  | Phi-4-mini-ins | Phi-4-mini-instruct-8da4w     |
+|                                  | Phi-4-mini-ins | Phi-4-mini-instruct-INT8-INT4 |
 | **Popular aggregated benchmark** |                |                               |
 | mmlu (0 shot)                    | 66.73          | 60.75                         |
 | mmlu_pro (5-shot)                | 46.43          | 11.75                         |
@@ -201,9 +201,9 @@ Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness
 lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
 ```
 
-## int8 dynamic activation and int4 weight quantization (8da4w)
+## int8 dynamic activation and int4 weight quantization (INT8-INT4)
 ```Shell
-lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-8da4w --tasks hellaswag --device cuda:0 --batch_size 8
+lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-INT8-INT4 --tasks hellaswag --device cuda:0 --batch_size 8
 ```
 </details>
 
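The harness can also be driven from Python via `lm_eval.simple_evaluate`, which is handy for scripting the same comparison; a hedged sketch:

```python
# Sketch: the same hellaswag eval through lm-evaluation-harness's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-INT8-INT4",
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
print(results["results"]["hellaswag"])
```
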
@@ -212,8 +212,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-8da4w --tasks hellaswag --device cuda:0 --batch_size 8
 We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
 Once ExecuTorch is [set-up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
 
-We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/pytorch_model.bin) to one that ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
-The following script does this for you. We have uploaded the converted checkpoint [pytorch_model_converted.bin](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/pytorch_model_converted.bin) for convenience.
+We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/pytorch_model.bin) to one that ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
+The following script does this for you. We have uploaded the converted checkpoint [pytorch_model_converted.bin](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/pytorch_model_converted.bin) for convenience.
 ```Shell
 python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin pytorch_model_converted.bin
 ```
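The conversion itself is just a state-dict key rename; a schematic sketch of the idea (the real mapping lives in `executorch.examples.models.phi_4_mini.convert_weights`, and the rename rule below is an illustrative placeholder, not the actual mapping):

```python
# Schematic only: rename checkpoint keys into the layout ExecuTorch's LLM
# export script expects. The real mapping is implemented in
# executorch.examples.models.phi_4_mini.convert_weights; the rule below is
# a placeholder for illustration.
import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
converted = {}
for key, value in state_dict.items():
    # e.g. strip the HF "model." prefix (illustrative, not the actual rule)
    converted[key.removeprefix("model.")] = value
torch.save(converted, "pytorch_model_converted.bin")
```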