# Digging through tenets and time
## Introduction
However, both of these works were already pointing at some drawbacks, which have been iteratively addressed. [Transformers has gone modular](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file). If you're familiar with this, you can [skip this section](#^attention-classes) and go to the next one.
We amended the principle of [DRY*](#do-repeat-yourself) by progressively removing all pieces of code that were "copied from" another file.
It is explained in detail in the documentation above, but overall it works like this: you define a `modular_` file that can inherit from _any function across all other modeling, configuration and processor files_:
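A minimal sketch of such a file, with an invented model name and parent classes chosen purely for illustration:

```python
# modular_mymodel.py -- hypothetical modular file (illustrative names).
# It inherits pieces of an existing model; the library's codegen then expands
# it into a complete, standalone modeling_mymodel.py, so "One model, One file"
# still holds for readers of the generated file.
from transformers.models.llama.modeling_llama import LlamaMLP, LlamaRMSNorm


class MyModelRMSNorm(LlamaRMSNorm):
    pass  # unchanged: re-emitted verbatim under the new model's name


class MyModelMLP(LlamaMLP):
    # Only genuine architectural differences would be overridden here.
    pass
```

The generated `modeling_` file then contains the fully expanded classes, which is what readers and downstream tools actually consume.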
<summary>Auto-generated modeling code</summary>
It is a strength of the new attention interface that it can be plugged into various backends, because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system remains a [minimal user api](#minimal-user-api).
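As a rough sketch of what "inform, don't enforce" means in practice (the signature below is illustrative, not the exact one shipped by the library), an attention implementation can accept the common arguments explicitly and let backend-specific options travel through `**kwargs`:

```python
import torch

def sdpa_style_attention(module, query, key, value, attention_mask=None,
                         dropout=0.0, scaling=None, **kwargs):
    # Anything backend-specific (sliding windows, sink tokens, ...) arrives in
    # **kwargs: the caller is informed of what exists, but nothing is enforced.
    attn_output = torch.nn.functional.scaled_dot_product_attention(
        query, key, value,
        attn_mask=attention_mask,
        dropout_p=dropout,
        scale=scaling,
    )
    return attn_output, None
```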
To provide better _information_, we plan to use `python` features such as `Annotated`, for example, to inform users of what we typically expect in an argument. That way, higher-level information could be included directly in the type annotations, stating for instance the expected dimensions and contents of a tensor.
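Since this is a plan rather than shipped behaviour, the snippet below is only a sketch of the idea: `Annotated` is standard Python typing, but the shape strings and the `attention_forward` signature are hypothetical.

```python
from typing import Annotated

import torch

# Hypothetical shape annotations: the string is metadata that readers and
# tooling can inspect, while the runtime type remains a plain torch.Tensor.
HiddenStates = Annotated[torch.Tensor, "shape: (batch, seq_len, hidden_dim)"]
AttentionMask = Annotated[torch.Tensor, "shape: (batch, 1, seq_len, seq_len)"]

def attention_forward(hidden_states: HiddenStates,
                      attention_mask: AttentionMask) -> HiddenStates:
    ...
```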
## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism
We want to touch the modeling code as little as possible, and only modify it when _architectural changes_ are involved. For tensor parallelism, for instance, we now instead specify a simple `tp_plan`.
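The sketch below gives a feel for the idea; the module names and sharding styles are indicative rather than the library's authoritative schema. The plan maps parameter-name patterns to a partitioning strategy, and the modeling code itself stays untouched:

```python
# Indicative sketch of a tensor-parallel plan: a plain mapping from
# module-name patterns to a sharding style, kept outside the modeling code.
tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}
```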
## <a id="layers-attentions-caches"></a> Layers, attentions and caches
With th
## <a id="community-kernels"></a> Community Kernels
The same principle extends to normalization, activation, and other hot paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a [consistent public surface](#consistent-public-surface):
```python
@use_kernel_forward_from_hub("RMSNorm")
class GlmRMSNorm(nn.Module):
    ...
```
Plus, this opened another avenue of contribution for the community. People who are GPU whisperers can check out the [kernel community blog post](https://huggingface.co/blog/hello-hf-kernels) to learn more about it!
## The good modularity

## <a id="encoders-ftw"></a> The neverending stories of encoder models.
The models' popularity speaks for itself! This is because the usage of encoders obviously lies in embeddings. So we have to keep the encoders part viable, usable, and fine-tunable.
Adding a model to transformers means:
- having it immediately available to the community
- usable in vLLM, SGLang, and so on without additional code.
## Inner cooking: CUDA Warmup
Having a clean _external_ API allows us to work on the true inner workings of transformers. One recent addition was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved loading time by pre-allocating GPU memory up front and thus avoiding malloc bottlenecks during model loading.
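The sketch below illustrates the underlying trick rather than the library's actual implementation: request one large allocation per device before loading, so the CUDA caching allocator reserves the memory once and the many small per-tensor allocations that follow are served from its cache instead of repeated cudaMalloc calls.

```python
import torch

def warmup_caching_allocator(total_bytes: int, device: str = "cuda:0") -> None:
    """Illustrative sketch: reserve GPU memory once so later loads reuse it."""
    if not torch.cuda.is_available():
        return
    # One big allocation forces the caching allocator to grab the pages now...
    buffer = torch.empty(total_bytes, dtype=torch.uint8, device=device)
    # ...and deleting the tensor returns the memory to the allocator's cache
    # (not to the driver), so parameter allocations during loading avoid malloc.
    del buffer

# Hypothetical usage: warm up roughly the size of the checkpoint to be loaded.
warmup_caching_allocator(total_bytes=2 * 1024**3)
```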
{{{fragment-memory-profiler}}}