DEPLOY_TEXT = f"""
# 🚀 Deployment Tips

A collection of powerful models is valuable, but ultimately, you need to be able to use them effectively.
This tab is dedicated to providing guidance and code snippets for performing inference with leaderboard models on Intel platforms.

Below, you'll find a table of open-source software options for inference, along with the supported Intel hardware platforms.
A 🚀 indicates that inference with the associated software package is supported on the hardware. We hope this information
helps you choose the best option for your specific use case. Happy building!
<div style="display: flex; justify-content: center;">
<table border="1">
  <tr>
    <th>Inference Software</th>
    <th>Gaudi</th>
    <th>Xeon</th>
    <th>GPU Max</th>
    <th>Arc GPU</th>
    <th>Core Ultra</th>
  </tr>
  <tr>
    <td>Optimum Habana</td>
    <td>🚀</td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>Intel Extension for PyTorch</td>
    <td></td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
    <td></td>
  </tr>
  <tr>
    <td>Intel Extension for Transformers</td>
    <td></td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
    <td></td>
  </tr>
  <tr>
    <td>OpenVINO</td>
    <td></td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
  </tr>
  <tr>
    <td>BigDL</td>
    <td></td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
  </tr>
  <tr>
    <td>NPU Acceleration Library</td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td>🚀</td>
  </tr>
  <tr>
    <td>PyTorch</td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
  </tr>
  <tr>
    <td>TensorFlow</td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
    <td>🚀</td>
  </tr>
</table>
</div>

<hr>
# Intel® Gaudi Accelerators

Habana's SDK, Intel Gaudi Software, supports PyTorch and DeepSpeed for accelerating LLM training and inference.
The Intel Gaudi Software graph compiler optimizes the execution of the operations accumulated in the graph
(e.g. operator fusion, data layout management, parallelization, pipelining and memory management,
and graph-level optimizations).

Optimum Habana provides convenient functionality for various tasks. Below you'll find the command-line
snippet you would run to perform inference on Gaudi with meta-llama/Llama-2-7b-hf.
The "run_generation.py" script below can be found [here](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation).
```bash
python run_generation.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--use_hpu_graphs \
--use_kv_cache \
--max_new_tokens 100 \
--do_sample \
--batch_size 2 \
--prompt "Hello world" "How are you?"
```
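Because Intel Gaudi Software supports PyTorch directly, the same checkpoint can also be run from plain Python on an HPU device. The snippet below is only a minimal sketch, assuming a Gaudi machine with the Gaudi software stack (`habana_frameworks`) installed; the model and generation settings are illustrative.

```python
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example model, same as the CLI snippet above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.eval().to("hpu")  # move the model to the Gaudi accelerator

inputs = tokenizer("Hello world", return_tensors="pt").to("hpu")
with torch.inference_mode():
    # the graph compiler described above optimizes the accumulated ops transparently
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```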
<hr>

# Intel® Max Series GPU
### INT4 Inference (GPU)
```python
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer

device_map = "xpu"
model_name = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "When winter becomes spring, the flowers..."
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             device_map=device_map, load_in_4bit=True)
model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device=device_map)

output = model.generate(inputs)
```
<hr>

# Intel® Xeon CPUs
### Intel Extension for PyTorch - Optimum Intel (no quantization)
Requires installing/updating optimum: `pip install --upgrade-strategy eager optimum[ipex]`
```python
from optimum.intel import IPEXModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "Intel/neural-chat-7b-v3-1"  # example; any supported causal LM on the Hub works
model = IPEXModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
results = pipe("A fisherman at sea...")
```
### Intel® Extension for PyTorch - Mixed Precision (fp32 and bf16)
```python
import torch
import intel_extension_for_pytorch as ipex
import transformers

# model_name_or_path: local path or Hub id of the model to load
model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()
dtype = torch.float  # or torch.bfloat16
model = ipex.llm.optimize(model, dtype=dtype)

# generation inference loop
with torch.inference_mode():
    model.generate()
```
### Intel® Extension for Transformers - INT4 Inference (CPU)
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "When winter becomes spring, the flowers..."
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
```
<hr>

# Intel® Core Ultra (NPUs and iGPUs)
### Intel® NPU Acceleration Library
```python
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import intel_npu_acceleration_library
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

query = input("Ask something: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
)

print("Run inference")
_ = model.generate(**generation_kwargs)
```
### OpenVINO Toolkit with Optimum Intel
```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "helenai/gpt2-ov"
model = OVModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("In the spring, beautiful flowers bloom...")
```
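The snippet above loads a checkpoint that has already been converted to OpenVINO IR. If the model you want is still in its original PyTorch format, Optimum Intel can usually convert it at load time with `export=True`. The sketch below assumes that workflow and uses `gpt2` purely as a placeholder model id.

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "gpt2"  # placeholder: any supported causal LM in PyTorch format
model = OVModelForCausalLM.from_pretrained(model_id, export=True)  # convert to OpenVINO IR on the fly
model.save_pretrained("gpt2-ov")  # optional: cache the converted model for later runs

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("In the spring, beautiful flowers bloom...")
```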
<hr>

# Intel® Arc GPUs
Coming Soon!
| """ |