Molbap HF Staff committed on
Commit
b5d63a3
·
1 Parent(s): 8caaff8
Files changed (2)
  1. content/article.md +20 -19
  2. webpack.config.js +1 -1
content/article.md CHANGED
@@ -72,7 +72,7 @@ So, what are the principles of `transformers`? We will try to summarize the foun
72
  </li>
73
  <li class="tenet">
74
  <a id="modular-toolbox"></a>
75
- <strong>Modular Toolbox (Not Framework)</strong>
76
<p>We ARE a toolbox. What we are not is a framework: you should not be FORCED to rewrite every modeling file, but it is <em>better</em> for your model to inherit from PreTrainedModel and get TensorParallel, from_pretrained, sharding, push_to_hub and loss support enabled, as well as PEFT/TRL/SGLang/vLLM compatibility.</p>
77
  <em>This is the largest change. Provide tools and utilities, but don't force users into a rigid framework.</em>
78
  </li>
@@ -136,7 +136,21 @@ For better _information_, we plan to use `python` features such as `Annotated` f
136
 
137
  ## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism
138
 
139
- We want to touch the modeling code as little as possible, and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we instead now specify a simple `tp_plan`.
140
 
141
  ## <a id="layers-attentions-caches"></a> Layers, attentions and caches
142
  With th
@@ -170,7 +184,6 @@ So what do we see? Llama is a basis for many models, and it shows.
170
  Radically different architectures such as mamba have spawned their own dependency subgraph.
171
  {{{fragment-d3-graph}}}
172
 
173
- ![Graph showing modular related models](static/graph_modular_related_models.png)
174
 
175
  But there is no similar miracle for VLMs across the board.
176
As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed around Llama.
@@ -255,8 +268,7 @@ But this is _within_ the modeling file, not in the `PreTrainedModel` base class.
255
So the question naturally arises: how can we modularize more?
256
I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. It is understandable that [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
257
 
258
-
259
- ![Modular candidates analysis](static/modular_candidates.png)
260
 
261
  ## <a id="encoders-ftw"></a> The neverending stories of encoder models.
262
 
@@ -324,19 +336,8 @@ Adding a model to transformers means:
324
 
325
Having a clean _external_ API allows us to work on the true inner workings of transformers. One of the few recent additions was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.
326
 
327
- {{{fragment-memory-profiler}}}
328
-
329
- ### Linkedin post (to remove)
330
- Linkedin post for videos:
331
-
332
- In transformers, how do we deal with cross-model dependencies, while supporting ~400 models? Maybe you've seen the same 200-lines functions in too many _modeling_file.py_? Duplication isn’t inevitable.
333
-
334
- The “one‑model/one‑file” rule keeps every model readable and runnable. It also means identical code is copied hundreds of times. Maintenance hurts, contributor PRs snowball, and vision–language models especially end up in siloed forks.
335
-
336
- modular_*.py fixes the trade‑off, by auto-generating the modeling file from a modular file, which can use inheritance.
337
-
338
- With a small analyser I’ve mapped which models already share modular pieces and which 100‑plus still repeat themselves. Red nodes in the graph = lowest‑hanging fruit for refactor; blue = already modular.
339
 
340
- The result: contributors can focus on novel layers instead of boilerplate, reviews shrink from “new file diff” to “does this override make sense?”, and the codebase stays something you can actually open and read.
341
 
342
- If you maintain or ship models on top of Transformers, take a look at modular, in 2025 it’s how we keep shipping breadth without the bloat. 🛠️
 
72
  </li>
73
  <li class="tenet">
74
  <a id="modular-toolbox"></a>
75
+ <strong>Modular Toolbox (Not A Framework)</strong>
76
<p>We ARE a toolbox. What we are not is a framework: you should not be FORCED to rewrite every modeling file, but it is <em>better</em> for your model to inherit from PreTrainedModel and get TensorParallel, from_pretrained, sharding, push_to_hub and loss support enabled, as well as PEFT/TRL/SGLang/vLLM compatibility.</p>
77
  <em>This is the largest change. Provide tools and utilities, but don't force users into a rigid framework.</em>
78
  </li>
 
136
 
137
  ## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism
138
 
139
+ We want to touch the modeling code as little as possible, and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we instead now specify a simple `tp_plan`.
140
+
141
+ It is written once in the config and passed to `.from_pretrained()`.
142
+
143
+ The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires them to sharding implementations such as `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
144
+
145
+ {{{fragment-tp-plan}}}
146
+
147
+
148
+ This allows a user to run with multiple processes per node, e.g. 4 GPUs:
149
+
150
+ `torchrun --nproc-per-node 4 demo.py`
151
+
152
+ Semantics stay in the model (a Linear stays a Linear), while distribution is orthogonal and declared via strings: "colwise" splits the columns of weights/bias across ranks, "rowwise" splits the rows, and packed variants shard fused weights. The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
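To make this concrete, here is a hedged sketch of what such a plan and its use look like. The glob keys and strategy names below are illustrative of the patterns described above (the real plans ship inside each model's config), and the checkpoint name is only an example:

```python
# demo.py — illustrative sketch; run with `torchrun --nproc-per-node 4 demo.py`
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# What a config-level tp_plan looks like: glob patterns over module names,
# mapped to partitioning strategies. Keys and values here are illustrative.
illustrative_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}

model_id = "meta-llama/Llama-3.2-1B"  # example checkpoint whose config ships a tp_plan
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # picks up the plan stored in the config and shards across ranks
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Tensor parallelism keeps a Linear a Linear.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs.to(model.device))
```

The point of this design is that nothing in the snippet touches the modeling code: distribution is declared entirely from the outside.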
153
+
154
 
155
  ## <a id="layers-attentions-caches"></a> Layers, attentions and caches
156
  With th
 
184
  Radically different architectures such as mamba have spawned their own dependency subgraph.
185
  {{{fragment-d3-graph}}}
186
 
 
187
 
188
  But there is no similar miracle for VLMs across the board.
189
As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed around Llama.
 
268
So the question naturally arises: how can we modularize more?
269
I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. It is understandable that [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
270
 
271
+ {{{fragment-space-embed}}}
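For intuition, here is a minimal sketch of the simpler of the two measures, a Jaccard index over the identifier sets of two modeling files. The helper names and file paths are hypothetical, not the actual code behind the Space:

```python
# Minimal sketch of a Jaccard similarity between two modeling files.
import re
from pathlib import Path

def identifier_set(path: str) -> set[str]:
    """Collect the Python identifiers appearing in a source file."""
    text = Path(path).read_text(encoding="utf-8")
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))

def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection over union; 1.0 means identical identifier sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical usage inside a transformers checkout:
# score = jaccard(
#     identifier_set("src/transformers/models/llama/modeling_llama.py"),
#     identifier_set("src/transformers/models/mistral/modeling_mistral.py"),
# )
```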
 
272
 
273
  ## <a id="encoders-ftw"></a> The neverending stories of encoder models.
274
 
 
336
 
337
Having a clean _external_ API allows us to work on the true inner workings of transformers. One of the few recent additions was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.
338
 
339
+ {{{fragment-warmup_demo}}}
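For illustration, the idea behind such a warmup can be sketched as follows; this is a naive stand-in, not the actual `caching_allocator_warmup` implementation, and the byte count is an arbitrary example:

```python
# Naive sketch of a CUDA allocator warmup: reserve a large block once so the
# many small allocations made while loading weights hit a warm cache instead
# of triggering repeated cudaMalloc calls.
import torch

def naive_allocator_warmup(num_bytes: int, device: str = "cuda:0") -> None:
    if not torch.cuda.is_available():
        return
    block = torch.empty(num_bytes, dtype=torch.uint8, device=device)
    del block  # freed to PyTorch's caching allocator, which keeps it reserved
    torch.cuda.synchronize(device)

# e.g. reserve ~2 GiB before calling from_pretrained (size is illustrative):
# naive_allocator_warmup(2 * 1024**3)
```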
340
 
341
+ ## What is coming next
342
 
343
+ It sounds dumb, but it's true: the future is coming very soon. One tenet will be broken when the next major version, v5, is released: [backwards compatibility](#backwards-compatibility) will be heavily broken. Instead, we aim to be much more of a [modular toolbox](#modular-toolbox), while maintaining a [consistent public surface](#consistent-public-surface).
webpack.config.js CHANGED
@@ -123,7 +123,7 @@ module.exports = {
123
  const article = document.querySelector('d-article');
124
  const toc = document.querySelector('d-contents');
125
  if (toc) {
126
- const headings = article.querySelectorAll('h1, h2, h3, h4');
127
  let ToC = '<nav role="navigation" class="l-text figcaption">';
128
  ToC += '<div class="toc-header"><span class="toc-title">Table of Contents</span></div>';
129
  ToC += '<div class="toc-content">';
 
123
  const article = document.querySelector('d-article');
124
  const toc = document.querySelector('d-contents');
125
  if (toc) {
126
+ const headings = [...article.querySelectorAll('h1, h2, h3, h4')].filter(h => !h.hasAttribute('data-no-toc'));
127
  let ToC = '<nav role="navigation" class="l-text figcaption">';
128
  ToC += '<div class="toc-header"><span class="toc-title">Table of Contents</span></div>';
129
  ToC += '<div class="toc-content">';