Update README.md
README.md CHANGED
@@ -8,25 +8,14 @@ model-index:
 tags:
 - alignment-handbook
 - generated_from_trainer
-
+license: apache-2.0
 ---
 
 # Zebra-Llama: Towards Extremely Efficient Hybrid Models
 Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that achieve Transformer-level accuracy with near-State Space Model (SSM) efficiency. While standard Transformers are limited by the quadratic complexity of self-attention and the large memory footprint of their key-value (KV) cache, Zebra-Llama offers a practical and scalable solution.
 
-
 This model, `Zebra-Llama-1B-4MLA-12M2`, is created by efficiently adapting the pre-trained Llama-3.2-1B-Instruct model. It composes efficient hybrid layers, combining Multi-head Latent Attention (MLA) for KV cache compression and Mamba2 (an SSM) for computational efficiency. This approach bypasses the need for costly pre-training from scratch.
 
-
-Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that achieve Transformer-level accuracy with near-State Space Model (SSM) efficiency. While standard Transformers are limited by the quadratic complexity of self-attention and the large memory footprint of their key-value (KV) cache, Zebra-Llama offers a practical and scalable solution.
-
-
-This model, Zebra-Llama-1B-4MLA-12M2, is created by efficiently adapting the pre-trained Llama-3.2-1B-Instruct model. It composes efficient hybrid layers, combining Multi-head Latent Attention (MLA) for KV cache compression and Mamba2 (an SSM) for computational efficiency. This approach bypasses the need for costly pre-training from scratch.
-
-
-
-
-
 The composition follows a three-stage pipeline to effectively transfer knowledge from the pre-trained Transformer.
 
 

@@ -136,4 +125,4 @@ If you find this model useful, please consider citing the original paper:
 journal={arXiv preprint arXiv:2503.11132},
 year={2025}
 }
-```
+```
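The README text above describes `Zebra-Llama-1B-4MLA-12M2` as a hybrid MLA/Mamba2 checkpoint adapted from Llama-3.2-1B-Instruct. For orientation, here is a minimal generation sketch, assuming the checkpoint is loadable through `transformers` with `trust_remote_code=True` and lives under a hypothetical `amd/Zebra-Llama-1B-4MLA-12M2` repo id; the actual loading path and repo id may differ.

```python
# Minimal generation sketch (illustrative; repo id and remote-code setup are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/Zebra-Llama-1B-4MLA-12M2"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # reduced precision for inference
    trust_remote_code=True,       # assumes the custom MLA/Mamba2 layer code ships with the checkpoint
).eval()

# Assumes the adapted model keeps the base Llama-Instruct chat template.
messages = [{"role": "user", "content": "Explain what a state space model is in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```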
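The README also mentions a three-stage pipeline for transferring knowledge from the pre-trained Transformer, without detailing the stages in this diff. As intuition only, the sketch below shows a generic logit-distillation loss of the kind such knowledge-transfer pipelines build on; it is not the paper's actual recipe, and the function name, temperature, and weighting are placeholders.

```python
# Generic logit-distillation step (illustration only; not the Zebra-Llama training recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend soft-target KL distillation with the usual next-token cross-entropy."""
    # Soft targets: match the teacher's temperature-smoothed token distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard language-modeling loss on the ground-truth tokens.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    return alpha * kd + (1.0 - alpha) * ce
```

In a setup like the one the README describes, the teacher would presumably be the pre-trained Llama-3.2-1B-Instruct model and the student the hybrid MLA/Mamba2 network; how the three stages combine initialization, intermediate supervision, and end-to-end training is defined in the cited paper.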