Commit b1b51d0
Parent(s): 94d995f
Update README.md
README.md CHANGED
@@ -10,7 +10,7 @@ pinned: false
 Text-Generation-Inference is an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative, and implements optimizations for all supported model architectures, including:

 - Tensor Parallelism and custom cuda kernels
--
+- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
 - Quantization with bitsandbytes or gptq
 - Continuous batching of incoming requests for increased total throughput
 - Accelerated weight loading (start-up time) with safetensors
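For context on the features listed above, the snippet below is a minimal sketch (not part of this commit) of how a running TGI server might be queried over its `/generate` HTTP endpoint. The host/port, model id, and launcher flags shown in the comments are illustrative assumptions, not values taken from this README.

```python
# Minimal sketch. Assumes a TGI container is already serving on
# localhost:8080, e.g. started with something like:
#   docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference \
#     --model-id bigscience/bloom-560m --num-shard 2 --quantize bitsandbytes
# where --num-shard enables Tensor Parallelism and --quantize selects
# bitsandbytes or gptq, matching the features listed in the README.
import requests

# POST a prompt to TGI's /generate endpoint; requests arriving
# concurrently are continuously batched by the server.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 50},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```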