better tp

- content/article.md +20 -19
- webpack.config.js +1 -1
content/article.md
CHANGED

@@ -72,7 +72,7 @@ So, what are the principles of `transformers`? We will try to summarize the foun
 </li>
 <li class="tenet">
 <a id="modular-toolbox"></a>
-<strong>Modular Toolbox (Not Framework)</strong>
+<strong>Modular Toolbox (Not A Framework)</strong>
 <p>We ARE a toolbox. What we are not is a framework: you should not be FORCED to rewrite every modeling file, but it is <em>better</em> for your model to be able to inherit from PreTrainedModel and get TensorParallel, from_pretrained, sharding, push_to_hub and loss support, as well as PEFT/TRL/SGLang/vLLM compatibility.</p>
 <em>This is the largest change. Provide tools and utilities, but don't force users into a rigid framework.</em>
 </li>
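To make the tenet in the hunk above concrete, here is a minimal sketch; `TinyConfig` and `TinyModel` are invented names for illustration, not part of this commit. Subclassing `PreTrainedModel` is enough to pick up `save_pretrained`, `from_pretrained` and `push_to_hub` without adopting any framework scaffolding.

```python
# Hypothetical toy model: inherit from PreTrainedModel, get the toolbox for free.
import torch
from transformers import PretrainedConfig, PreTrainedModel


class TinyConfig(PretrainedConfig):
    model_type = "tiny-demo"  # invented identifier, for illustration only

    def __init__(self, hidden_size: int = 64, **kwargs):
        self.hidden_size = hidden_size
        super().__init__(**kwargs)


class TinyModel(PreTrainedModel):
    config_class = TinyConfig

    def __init__(self, config: TinyConfig):
        super().__init__(config)
        self.proj = torch.nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states):
        return self.proj(hidden_states)


model = TinyModel(TinyConfig())
model.save_pretrained("tiny-demo")            # serialization handled by the base class
reloaded = TinyModel.from_pretrained("tiny-demo")
```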
@@ -136,7 +136,21 @@ For better _information_, we plan to use `python` features such as `Annotated` f
 
 ## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism
 
-We want to touch minimally to the modeling code, and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we instead now specify a simple `tp_plan`.
+We want to touch the modeling code minimally, and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we now specify a simple `tp_plan` instead.
+
+It is written once in the config and passed to `.from_pretrained()`.
+
+The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires to the sharding implementations: `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
+
+{{{fragment-tp-plan}}}
+
+
+This allows a user to run with multiple processes per node, e.g. 4 GPUs:
+
+`torchrun --nproc-per-node 4 demo.py`
+
+Semantics stay in the model (a Linear stays a Linear); distribution is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks, "rowwise" splits rows, and packed variants shard fused weights. The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
+
 
 ## <a id="layers-attentions-caches"></a> Layers, attentions and caches
 With th
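As a rough illustration of the added lines above: a sketch under assumptions, using an illustrative Llama-style checkpoint and the `tp_plan` keyword described in the diff; the example plan keys in the comments are not copied from this commit.

```python
# Sketch: declarative tensor parallelism, no change to the modeling code.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # assumed checkpoint, any Llama-style model works similarly
    tp_plan="auto",             # resolve the plan declared alongside the config
)

# Conceptually, the plan is just a mapping of glob patterns to strategies:
# {
#     "layers.*.self_attn.q_proj": "colwise",
#     "layers.*.self_attn.o_proj": "rowwise",
#     "layers.*.mlp.down_proj":    "rowwise",
# }
```

Run under `torchrun --nproc-per-node 4 demo.py` so each process picks up its rank; the modeling code itself stays untouched.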
@@ -170,7 +184,6 @@ So what do we see? Llama is a basis for many models, and it shows.
 Radically different architectures such as mamba have spawned their own dependency subgraph.
 {{{fragment-d3-graph}}}
 
-
 
 But there is no similar miracle for VLMs across the board.
 As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed.
@@ -255,8 +268,7 @@ But this is _within_ the modeling file, not in the `PreTrainedModel` base class.
 So the question arises naturally: how can we modularize further?
 I took a similarity measure and looked again at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. It is understandable that [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
 
-
-
+{{fragment-space-embed}}
 
 ## <a id="encoders-ftw"></a> The neverending stories of encoder models.
 
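For readers who want the gist of the Jaccard measure mentioned in the hunk above, a toy version of comparing two modeling files is sketched below; this is illustrative only, not the Space's actual implementation.

```python
# Toy Jaccard similarity between two source files, measured on token sets.
from pathlib import Path


def jaccard_similarity(path_a: str, path_b: str) -> float:
    tokens_a = set(Path(path_a).read_text().split())
    tokens_b = set(Path(path_b).read_text().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# e.g. jaccard_similarity("models/llama/modeling_llama.py",
#                         "models/mistral/modeling_mistral.py")
```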
@@ -324,19 +336,8 @@ Adding a model to transformers means:
 
 Having a clean _external_ API allows us to work on the true inner workings of transformers. One of the few recent additions was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.
 
-{{{fragment-
-
-### Linkedin post (to remove)
-Linkedin post for videos:
-
-In transformers, how do we deal with cross-model dependencies, while supporting ~400 models? Maybe you've seen the same 200-lines functions in too many _modeling_file.py_? Duplication isn’t inevitable.
-
-The “one‑model/one‑file” rule keeps every model readable and runnable. It also means identical code is copied hundreds of times. Maintenance hurts, contributor PRs snowball, and vision–language models especially end up in siloed forks.
-
-modular_*.py fixes the trade‑off, by auto-generating the modeling file from a modular file, which can use inheritance.
-
-With a small analyser I’ve mapped which models already share modular pieces and which 100‑plus still repeat themselves. Red nodes in the graph = lowest‑hanging fruit for refactor; blue = already modular.
+{{{fragment-warmup_demo}}}
 
-
+## What is coming next
 
-
+It sounds dumb, but it's true: the future is very soon. One tenet will be broken when the next major version, v5, is released: [backwards compatibility](#backwards-compatibility) will be heavily broken. Instead, we aim to be much more of a [modular toolbox](#modular-toolbox), while maintaining a [consistent public surface](#consistent-public-surface).
webpack.config.js
CHANGED

@@ -123,7 +123,7 @@ module.exports = {
 const article = document.querySelector('d-article');
 const toc = document.querySelector('d-contents');
 if (toc) {
-const headings = article.querySelectorAll('h1, h2, h3, h4');
+const headings = [...article.querySelectorAll('h1, h2, h3, h4')].filter(h => !h.hasAttribute('data-no-toc'));
 let ToC = '<nav role="navigation" class="l-text figcaption">';
 ToC += '<div class="toc-header"><span class="toc-title">Table of Contents</span></div>';
 ToC += '<div class="toc-content">';