## Introduction  
The `transformers` library, built with `PyTorch`, supports all state-of-the-art LLMs, many VLMs, task-specific vision language models, video models, audio models, table models, and classical encoders, for a total of almost 400 models.  
The name of the library itself is mostly majority-driven: many models are not even transformer architectures, such as Mamba, Zamba, RWKV, and convolution-based models.  
Regardless, each of these was wrought by the research and engineering team that created it, then harmonized into a now-famous interface, and made callable with a simple `.from_pretrained` call.  
Inference works for all models, and training is functional for most. The library is a foundation for many machine learning courses and cookbooks, and several thousand other open-source libraries depend on it. All models are tested as part of a daily CI, ensuring their preservation and reproducibility. Most importantly, it is _open-source_ and has in large part been written by the community.  
This isn't really to brag but to set the stakes: what does it take to keep such a ship afloat, made of so many moving, unrelated parts?  
The ML wave has not stopped: more and more models are being added, at a steadily growing rate. `Transformers` is widely used, and we read the feedback that users post online, whether it's about a function that had 300+ keyword arguments, duplicated code and helpers, mentions of `Copied from ...` everywhere, or optimisation concerns. Text-only models are relatively tamed, but multimodal models remain to be harmonized.  
Here we will dissect the new design philosophy of transformers, as a continuation of the existing [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and an accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy).  
More recently, a blog post about [recent upgrades to transformers](https://huggingface.co/blog/faster-transformers) explained in particular what makes the library faster today; I recommend reading it if you haven't yet.  
Some time ago (I dare not say how long), we discussed the state of features in transformers with the maintainers. A lot of recent developments were satisfactory, but if we were only talking about these, self-congratulation would be the only goalpost.  
Reflecting on this philosophy now, as models pile up, is essential and will drive new developments.

### The core tenets of transformers

Every reader, whether an OSS maintainer, power user, or casual fine-tuner, will walk away knowing how to reason about the `transformers` code base, how to use it better, and how to meaningfully contribute to it.
This will also showcase new features you might have missed so you'll be up-to-date.

So, what are the principles of `transformers`? We will try to summarize the foundations on which we've built everything, and write the "tenets" of the library.  They behave like _software interfaces_, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time. 

<div class="tenet-list">
<ol>
<li class="tenet">
<a id="source-of-truth"></a>
<strong>Source of Truth</strong>
<p>We should be a source of truth for all model definitions. This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
<em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
</li> 

<li class="tenet">
<a id="one-model-one-file"></a>
<strong>One Model, One File</strong>
<p>All inference logic (and most of the training logic; the loss is separate, not part of the model) is visible, top-to-bottom.</p>
<em>Every model should be completely understandable by reading a single file from top to bottom.</em>
</li> 
<li class="tenet">
<a id="code-is-product"></a>
<strong>Code is Product</strong>
<p>Optimize for reading, diffing, and tweaking: our users are power users. Variable names can be explicit, full words, even several words; readability is paramount.</p>
<em>Code quality matters as much as functionality - optimize for human readers, not just computers.</em>
</li> 
<li class="tenet">
<a id="standardize-dont-abstract"></a>
<strong>Standardize, Don't Abstract</strong>
<p>If it's model behavior, keep it in the file; abstractions only for generic infra.</p>
<em>Model-specific logic belongs in the model file, not hidden behind abstractions.</em>
</li>
<li class="tenet">
<a id="do-repeat-yourself"></a>
<strong>DRY* (DO Repeat Yourself)</strong>
<p>Copy when it helps users; keep successors in sync without centralizing behavior.</p>
<p><strong>Amendment:</strong> With the introduction and global adoption of <a href="#modular">modular</a> transformers, we do not repeat any logic in the modular files, but end user files remain faithful to the original tenet.</p>
<em>Strategic duplication can improve readability and maintainability when done thoughtfully.</em>
</li>
<li class="tenet">
<a id="minimal-user-api"></a>
<strong>Minimal User API</strong>
<p>Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths. Reading should be obvious, configurations should be obvious.</p>
<em>Keep the public interface simple and predictable - users should know what to expect.</em>
</li>
<li class="tenet">
<a id="backwards-compatibility"></a>
<strong>Backwards Compatibility</strong>
<p>Evolve by additive standardization, <strong>never</strong> break public APIs.</p>
<p><strong>Note:</strong> Some models show almost no use, and we have also stopped adding new features for non-torch frameworks. Still, we adapt to models existing on the hub.</p>
<em>Once something is public, it stays public - evolution through addition, not breaking changes.</em>
</li> 
<li class="tenet">
<a id="consistent-public-surface"></a>
<strong>Consistent Public Surface</strong>
<p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests.</p>
<em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
</li>  
<li class="tenet">
<a id="modular-toolbox"></a>
<strong>Modular Toolbox (Not A Framework)</strong>
<p>We ARE a toolbox. What we are not is a framework: you should not be FORCED to rewrite every modeling file, but it is <em>better</em> for your model to be able to inherit from PreTrainedModel and have enabled TensorParallel, from_pretrained, sharding, push_to_hub, loss, as well as PEFT/TRL/SGLang/vLLM.</p>
<em>This is the largest change. Provide tools and utilities, but don't force users into a rigid framework.</em>
</li>
</ol>
</div>  


When a PR is merged, it is because the contribution is worthwhile, and because the `transformers` team finds the design of the contribution to be aligned with the tenets above.

Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors, and crannies everywhere, built by thousands of different workers. We _try_ to make sure all the code we add is in line with them, lest we break [backwards compatibility](#backwards-compatibility).


For instance, one function essential to the implementation of [Rotary Positional Embeddings](https://huggingface.co/papers/2104.09864) is identical in 70 `modeling_<file>.py` files across `src/transformers/models/`. Why keep it? Because removing it would leave those files unable to run on their own, rather than being self-contained blueprints. We [do repeat ourselves](#do-repeat-yourself).

```python
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```

You can use a simple regex to collect all functions of a given name across your codebase and compare their differences and similarities; that's what I did (plus a hash of each body, to avoid quadratic pairwise comparisons).
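The duplicate hunt described above can be sketched in a few lines. This is only a minimal illustration, not the script actually used for the post: the regex handles single-line `def` signatures only, the file names are made up, and `group_identical_functions` is a hypothetical helper. Hashing each body puts identical copies in the same bucket in one pass, instead of comparing every pair of files.

```python
import hashlib
import re
from collections import defaultdict

# Matches a top-level, single-line `def name(...):` and captures the indented body after it.
FUNC_RE = re.compile(r"^def (\w+)\([^\n]*:\n((?:[ \t]+.*\n|\n)+)", re.MULTILINE)


def group_identical_functions(sources, name):
    """Map a short body hash to the list of files whose function `name` has that exact body."""
    groups = defaultdict(list)
    for path, text in sources.items():
        for match in FUNC_RE.finditer(text):
            if match.group(1) != name:
                continue
            body = match.group(2).strip()
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
            groups[digest].append(path)
    return dict(groups)


ROTATE_HALF = '''def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
'''

# Toy "files": two identical copies and one that diverged.
toy_sources = {
    "modeling_a.py": ROTATE_HALF,
    "modeling_b.py": "import torch\n\n" + ROTATE_HALF,
    "modeling_c.py": "def rotate_half(x):\n    return x\n",
}

groups = group_identical_functions(toy_sources, "rotate_half")
for digest, files in groups.items():
    print(digest, files)
```

Each bucket with more than one file is a set of exact copies; buckets of size one are the variants worth diffing by hand.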

So... why keep it in all modeling files? Because if we were to remove it, the model would not work anymore. Think of the modeling files as a car (I know, what a novel metaphor! But it works out). All manual transmission cars have a clutch, but we want each _view_ of one of our cars to be able to function. Remove the clutch, and you can't drive. Remove the doors, and it might be uncomfortable, but you'll get there. So the doors can go, but you _have_ to keep the clutch, even though you know perfectly well how it works.


## <a id="modular"></a> Going modular


`transformers` is opinionated, and it can be frustrating when you encounter an opinionated library. Our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the [blog post](https://huggingface.co/blog/transformers-design-philosophy) were already pointing at some drawbacks, which have been iteratively addressed. [Transformers has gone modular](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file). If you're familiar with this, you can [skip this section](#attention-classes) and go to the next one.

We amended the principle of [DRY*](#do-repeat-yourself) by progressively removing all pieces of code that were "copied from" another file.

It is explained in detail in the documentation above, but overall it works like this: you define a `modular_` file that can inherit from _any function across all other modeling, configuration and processor files_: 

<summary>Auto-generated modeling code</summary>

{{{fragment-glm-compare}}}

As you can see, we can now define any model as a _modular_ of another. This isn't strictly groundbreaking if you've done any programming; you might even think "well, that's just how inheritance works". The crucial difference is that we do _visibly_ what is essentially the _compiler_'s job: by unrolling the inheritances, we make all of the modeling code visible, keeping it [all in one piece](#one-model-one-file).

## <a id="attention-classes"></a> External Attention classes

The next iteration after [modular](#modular), and a big improvement in terms of readability, was to remove the various backend-specific attention classes across the repository. Before, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention), but that wasn't a [minimal user api](#minimal-user-api). 

What will forever stay in the modeling code is `eager_attention_forward`, because it is a core part of the modeling; any other backend is looked up by name:

```python
attention_interface: Callable = eager_attention_forward
if self.config._attn_implementation != "eager":
    attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
```

We often read, and understand, that `kwargs` are criticized. We are typing them however we can, but we cannot enforce them all the time because other libraries such as vLLM don't use the same kwargs. 

This is a strength of the new attention interface: it can be plugged into various backends because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system is a [minimal user api](#minimal-user-api).
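To make the "inform, do not enforce" idea concrete, here is a toy, pure-Python version of the same dispatch pattern. It is a sketch under assumptions, not the library's code: `TOY_ATTENTION_FUNCTIONS`, `toy_eager_attention`, `toy_rounded_attention` and `attend` are hypothetical names, and attention is computed for a single query vector with plain lists so the example runs without `torch`.

```python
import math
from typing import Callable


def toy_eager_attention(query, keys, values, **kwargs):
    """Reference softmax attention for one query vector; unknown kwargs are accepted and ignored."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * value[i] for w, value in zip(weights, values)) for i in range(dim)]


def toy_rounded_attention(query, keys, values, digits=3, **kwargs):
    """Stand-in for an alternative backend: same semantics, different execution details."""
    return [round(x, digits) for x in toy_eager_attention(query, keys, values)]


# Hypothetical registry, deliberately not named like the library's.
TOY_ATTENTION_FUNCTIONS: dict[str, Callable] = {"rounded": toy_rounded_attention}


def attend(attn_implementation, query, keys, values, **kwargs):
    # Eager stays the default; any other backend is resolved by name.
    attention_interface: Callable = toy_eager_attention
    if attn_implementation != "eager":
        attention_interface = TOY_ATTENTION_FUNCTIONS[attn_implementation]
    return attention_interface(query, keys, values, **kwargs)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

# `sliding_window` is not part of either signature, yet both backends accept it.
eager_out = attend("eager", query, keys, values, sliding_window=4)
rounded_out = attend("rounded", query, keys, values, sliding_window=4)
print(eager_out, rounded_out)
```

Because every backend swallows extra keyword arguments, a caller can pass backend-specific options without each implementation having to declare them.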

For better _information_, we plan to use `python` features such as `Annotated` to tell users what we typically expect in an argument. That way, higher-level information could be included directly in the type annotations, like so (tentative design):

```python
from typing import Annotated

MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]
```
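As a rough illustration of how such metadata could be read back by tooling, the standard library already exposes it through `typing.get_type_hints(..., include_extras=True)`. The class `ToyOutput` and the function `toy_forward` below are hypothetical stand-ins; only the `typing` calls are real.

```python
from typing import Annotated, get_args, get_type_hints


class ToyOutput:
    """Hypothetical stand-in for a model output class."""


ToyOutputAnnotated = Annotated[ToyOutput, "shape: (B, C, H, W)"]


def toy_forward(pixel_values: list) -> ToyOutputAnnotated:
    return ToyOutput()


# With include_extras=True the Annotated wrapper is preserved...
rich_hints = get_type_hints(toy_forward, include_extras=True)
base_type, *metadata = get_args(rich_hints["return"])

# ...while the default call strips it, so existing type checks are unaffected.
plain_hints = get_type_hints(toy_forward)

print(base_type.__name__, metadata)
```

The annotation informs without enforcing anything: nothing checks the shape at runtime unless a tool chooses to read the metadata.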


## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism

We want to touch the modeling code as little as possible, and only modify it when _architectural changes_ are involved. For tensor parallelism, for instance, we now specify a simple `tp_plan` instead. 

It is written once in the config and passed to `.from_pretrained()`. 

The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires to sharding implementations `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.

{{{fragment-tp-plan}}}


This allows a user to run with multiple processes per node, e.g. on 4 GPUs:

`torchrun --nproc-per-node 4 demo.py`

Semantics stay in the model (a Linear stays a Linear); distribution is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks; "rowwise" splits rows; packed variants shard fused weights. The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
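The pattern-to-strategy lookup can be pictured with a small sketch. This is not how the library resolves plans internally; it assumes simple shell-style matching via the standard `fnmatch` module, and `toy_tp_plan` and `resolve_strategy` are hypothetical names. Note that with `fnmatch`, `*` also matches dots, which is looser than a per-segment match.

```python
from fnmatch import fnmatchcase

# Hypothetical plan: module-name patterns mapped to partitioning strategies.
toy_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.down_proj": "rowwise",
}


def resolve_strategy(module_name, plan):
    """Return the strategy of the first pattern matching `module_name`, or None if unsharded."""
    for pattern, strategy in plan.items():
        if fnmatchcase(module_name, pattern):
            return strategy
    return None


for name in ("layers.0.mlp.down_proj", "layers.11.self_attn.q_proj", "embed_tokens"):
    print(name, "->", resolve_strategy(name, toy_tp_plan))
```

One pattern covers every repeated layer, so the plan stays a handful of lines regardless of model depth.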


## <a id="layers-attentions-caches"></a> Layers, attentions and caches

Following the same logic, the _nature_ of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we defined a set of allowed layer types that a configuration can then refer to:


```python
ALLOWED_LAYER_TYPES = (
    "full_attention",
    "sliding_attention",
    "chunked_attention",
    "linear_attention",
    ...
)
```

and the configuration can be _explicit_ about which attention type is in which layer, see e.g. gpt-oss, which alternates sliding and full attention:

```python
  "layer_types": [
    "sliding_attention",
    "full_attention",
    ...,
    "sliding_attention",
    "full_attention"
  ],
```

This is [minimal](#minimal-user-api) to implement on the user side, and allows us to keep the modeling code untouched. It is also [easy to tweak](#modular-toolbox).
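A minimal sketch of how a per-layer list like this can drive behavior without touching the modeling code is shown below. The helper names, the subset of layer types, and the window size are assumptions for illustration, not the library's implementation.

```python
# Subset of layer types used for this sketch (the real tuple has more entries).
TOY_ALLOWED_LAYER_TYPES = (
    "full_attention",
    "sliding_attention",
    "chunked_attention",
    "linear_attention",
)


def alternating_layer_types(num_layers):
    """Build an explicit per-layer list alternating sliding and full attention."""
    return ["sliding_attention" if i % 2 == 0 else "full_attention" for i in range(num_layers)]


def validate_layer_types(layer_types, num_layers):
    """Fail early if the configuration is inconsistent."""
    if len(layer_types) != num_layers:
        raise ValueError(f"expected {num_layers} layer types, got {len(layer_types)}")
    unknown = sorted(set(layer_types) - set(TOY_ALLOWED_LAYER_TYPES))
    if unknown:
        raise ValueError(f"unknown layer types: {unknown}")


def attention_window(layer_type, sliding_window):
    """Sliding layers get a window; every other type attends without one in this sketch."""
    return sliding_window if layer_type == "sliding_attention" else None


layer_types = alternating_layer_types(4)
validate_layer_types(layer_types, num_layers=4)
windows = [attention_window(layer_type, sliding_window=128) for layer_type in layer_types]
print(list(zip(layer_types, windows)))
```

Changing the pattern, for example to all full attention, is then a configuration edit rather than a code change.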

## <a id="community-kernels"></a>Community Kernels

The same principle extends to normalization, activation, and other code paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community-provided forward, keeping a [consistent public surface](#consistent-public-surface):

```python
@use_kernel_forward_from_hub("RMSNorm")
class GlmRMSNorm(nn.Module):
    ...
```
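The mechanism behind such a decorator can be imagined roughly as follows. This is a toy re-creation, not the `kernels` integration itself: the registry, `register_toy_kernel`, `use_toy_kernel_forward` and `ToyRMSNorm` are all hypothetical, and the "optimized" forward is ordinary Python so the example stays runnable anywhere.

```python
import math

# Hypothetical local registry standing in for a hub of community kernels.
_TOY_KERNELS = {}


def register_toy_kernel(name):
    """Register a replacement forward under a layer name."""
    def decorator(fn):
        _TOY_KERNELS[name] = fn
        return fn
    return decorator


def use_toy_kernel_forward(name):
    """Class decorator: swap in a registered forward if one exists, else keep the reference."""
    def decorator(cls):
        kernel = _TOY_KERNELS.get(name)
        if kernel is not None:
            cls.reference_forward = cls.forward  # keep the readable version around
            cls.forward = kernel
        return cls
    return decorator


@register_toy_kernel("RMSNorm")
def fused_rms_norm(self, values):
    # Same semantics as the reference, computed with a single precomputed factor.
    inverse_rms = 1.0 / math.sqrt(sum(v * v for v in values) / len(values) + self.eps)
    return [v * inverse_rms * self.weight for v in values]


@use_toy_kernel_forward("RMSNorm")
class ToyRMSNorm:
    def __init__(self, weight=1.0, eps=1e-6):
        self.weight = weight
        self.eps = eps

    def forward(self, values):
        mean_square = sum(v * v for v in values) / len(values)
        return [self.weight * v / math.sqrt(mean_square + self.eps) for v in values]


norm = ToyRMSNorm()
fast = norm.forward([3.0, 4.0])
reference = norm.reference_forward([3.0, 4.0])
print(fast, reference)
```

The class body still reads as the definition of the layer; the swap happens at decoration time and leaves the public surface unchanged.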

Plus, this opened another angle of contribution for the community. People who are GPU whisperers can now contribute optimized kernels. You can check out the [kernel community blog post](https://huggingface.co/blog/hello-hf-kernels) to learn more about it! 

Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).

## The good modularity

Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering if this effort is made, and we're striving for it.
It's hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions. 
So I wanted to take a look at the current **state of modularity** across the repository. How many models are defined using components of others? 

To get this graph, I used the heuristic of modular inheritance. 
1. Does this model have a `modular` file?
2. In this `modular` file, what models, configurations and processors are imported?
3. Recurse through the model list that way. 
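
The three steps above can be sketched as follows. This is a simplified illustration rather than the script behind the graph: it assumes modular files use relative imports of the form `from ..<model>.modeling_<model> import ...`, and the model names and file contents are invented.

```python
import re

# Step 2: capture the parent model name in relative imports of modeling,
# configuration or processing modules.
IMPORT_RE = re.compile(
    r"^from \.\.(\w+)\.(?:modeling|configuration|processing)_\w+ import",
    re.MULTILINE,
)


def build_graph(modular_sources):
    """Map each model that has a modular file (step 1) to the models it imports from."""
    return {
        model: set(IMPORT_RE.findall(source)) - {model}
        for model, source in modular_sources.items()
    }


def ancestors(model, graph):
    """Step 3: follow parents transitively, guarding against revisits."""
    seen = set()
    stack = [model]
    while stack:
        for parent in graph.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Invented modular files keyed by model name.
toy_modular_sources = {
    "child_a": "from ..base_model.modeling_base_model import BaseAttention\n",
    "child_b": (
        "from ..child_a.modeling_child_a import ChildAAttention\n"
        "from ..child_a.configuration_child_a import ChildAConfig\n"
    ),
}

graph = build_graph(toy_modular_sources)
print(graph)
print(ancestors("child_b", graph))
```

Models that appear in many ancestor sets are the central nodes of the resulting dependency graph.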

So what do we see? Llama is a basis for many models, and it shows.
Radically different architectures such as mamba have spawned their own dependency subgraph.


{{{fragment-dependency-graph}}}

However, even if llava defines a few VLMs, there are far too many vision-based architectures that are not yet defined as modulars of other existing archs. In other words, there is no strong reference point in terms of software for vision models. 
As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed for llama. 

Another problem is, this is only for `modular` models. Several models do NOT have a modular file.

## Many models, but not enough yet, are alike

So I looked into Jaccard similarity, which measures how much two sets overlap. I know that code is more than a set of characters strung together. I also used code embedding models to check code similarities, and they yielded better results, but for the needs of this blog post I will stick to the Jaccard index. 

It is interesting, in that light, to look at _when_ we deployed this modular logic and what its ripple effect on the library was. You can check the [larger space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor) to play around, but the gist is: adding modular allowed us to connect more and more models to solid reference points. We still have a lot of gaps to fill.
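For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below applies it to sets of identifier tokens; that choice of unit is an assumption made for the example (lines or character n-grams would work the same way), and the two snippets are invented.

```python
import re


def token_set(source):
    """Set of identifier-like tokens appearing in a piece of code."""
    return set(re.findall(r"[A-Za-z_]\w*", source))


def jaccard(a, b):
    """|a & b| / |a | b|, with two empty sets counted as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


snippet_a = "def forward(x): return x + bias"
snippet_b = "def forward(x): return x * scale"

similarity = jaccard(token_set(snippet_a), token_set(snippet_b))
print(round(similarity, 3))
```

A value close to 1 between two modeling files suggests one could be expressed as a modular of the other.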

{{{fragment-model-timeline}}} 

If you've checked out llava, you've seen that llava_video is a red node, connected by a red edge to llava: it's a candidate, something that we can _likely_ remodularize, [not touching the actual model](#backwards-compatibility) but being much more readable with [DRY*](#do-repeat-yourself).


## VLM improvements, avoiding abstraction 

We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main improvement points where we can work.

For instance, I thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like this:

```python
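class InputsEmbeddingMixerMixin(nn.Module):
    # Hypothetical shared mixin (never added): it would have scattered image/video
    # features into `inputs_embeds` on behalf of every VLM.
    ...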
```

But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). Embedding mixing is part of the model; removing it would break it. A user opening `modeling_qwen2.5_vl` should not have to go to another file. 

This is the current state of abstractions across a modeling file:

![Bloatedness visualizer showing abstraction levels](static/Bloatedness_visualizer.png)

The following [Pull request to standardize placeholder masking](https://github.com/huggingface/transformers/pull/39777) is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:

```python
    def get_placeholder_mask(
        self,
        input_ids: torch.LongTensor,
        inputs_embeds: torch.FloatTensor,
        image_features: torch.FloatTensor = None,
        video_features: torch.FloatTensor = None,
    ):
        """
        Obtains multimodal placeholder mask from `input_ids` or `inputs_embeds`, and checks that the placeholder token count is
        equal to the length of multimodal features. If the lengths are different, an error is raised.
        """
        if input_ids is None:
            special_image_mask = inputs_embeds == self.get_input_embeddings()(
                torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
            )
            special_image_mask = special_image_mask.all(-1)
            special_video_mask = inputs_embeds == self.get_input_embeddings()(
                torch.tensor(self.config.video_token_id, dtype=torch.long, device=inputs_embeds.device)
            )
            special_video_mask = special_video_mask.all(-1)
        else:
            special_image_mask = input_ids == self.config.image_token_id
            special_video_mask = input_ids == self.config.video_token_id

        n_image_tokens = special_image_mask.sum()
        special_image_mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
        if image_features is not None and inputs_embeds[special_image_mask].numel() != image_features.numel():
            raise ValueError(
                f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {image_features.shape[0]}"
            )

        n_video_tokens = special_video_mask.sum()
        special_video_mask = special_video_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
        if video_features is not None and inputs_embeds[special_video_mask].numel() != video_features.numel():
            raise ValueError(
                f"Videos features and video tokens do not match: tokens: {n_video_tokens}, features {video_features.shape[0]}"
            )

        return special_image_mask, special_video_mask
```

But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the self-contained logic of the model. 

## The weight of maintenance


The effect of modular can be measured straight from git history: at every commit I counted LOC (lines of code) under `src/transformers/models`, but if a model has a `modular_*.py` file, I count only that file instead of the generated modeling file. That gives an "effective LOC" curve: the **maintenance surface**.

**Just look at the result: the growth rate of lines of code collapsed!** Counting raw `modeling_*.py` (with "Copied from…" everywhere) we were around 362 LOC/day; with `modular` in place the effective rate is ~25 LOC/day. About **15× lower**! Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up. 

Less code to hand-maintain means fewer places to break. 

Cyclomatic complexity isn't LOC, but they strongly correlate. As Les Hatton notes, defects scale like _d ~ x ln x_. Lower _x_ (lower LOC) helps.
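To get a feel for these numbers, here is a back-of-the-envelope calculation. It only reuses the two growth rates quoted above and the _x ln x_ scaling as stated; treating one year of added lines as _x_ is an assumption made purely for illustration, not a defect prediction.

```python
import math

raw_rate = 362        # LOC/day when counting raw modeling files (figure quoted above)
effective_rate = 25   # LOC/day when counting modular files instead (figure quoted above)

# Ratio of the two growth rates: this is where "about 15x" comes from.
rate_ratio = raw_rate / effective_rate


def hatton_scale(x):
    """Defect scaling d ~ x ln x, up to an unknown constant factor."""
    return x * math.log(x)


# Illustrative assumption: compare one year of accumulated growth under each regime.
days = 365
raw_loc = raw_rate * days
effective_loc = effective_rate * days
defect_ratio = hatton_scale(raw_loc) / hatton_scale(effective_loc)

print(f"growth-rate ratio: {rate_ratio:.2f}")
print(f"x ln x ratio after {days} days: {defect_ratio:.2f}")
```

Because _x ln x_ grows faster than _x_, the ratio of the scaled values is larger than the plain LOC ratio, which is the point of the argument: reducing the maintained surface pays off more than linearly.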

{{{fragment-loc-growth}}}

There's a sharp drop near the end; it's due to us [removing support for Jax and TensorFlow](https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc) library-wide. 

Of course, this effort is not the only one that reduced the maintenance load. Externalising the [attention classes](#attention-classes) has moved out a lot of repeated code that was [standard](#standardize-dont-abstract). 

## <a id="encoders-ftw"></a> Embedding models, now and forever.

Model popularity speaks for itself! Encoders remain heavily used because they produce embeddings. So we have to keep the encoder side of the library viable, usable, and fine-tunable. 

{{{fragment-model-visualisation}}}

As the codebase grows alongside our friend codebase [Sentence Transformers](https://huggingface.co/sentence-transformers), we need to keep maintaining this part as well. Retrieval use-cases and smart databases, such as FAISS-based indexing, rely on it, and thus indirectly on transformers.

## On image processing and processors

Choosing to be a `torch`-first library meant shedding a tremendous amount of support work for `jax` and `TensorFlow`, and it also meant that we could be more lenient about the number of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.

The gains in performance are immense: up to a 20x speedup for most models when using compiled torchvision ops.

![Fast Image Processors Performance](fast_image_processors.png) 
 


## Reduce barrier to entry/contribution

This is an overall objective: there's no `transformers` without its community. 

We didn't want to make a framework, because _having a framework means forcing users into it_. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.

Among the most valuable contributions to `transformers` is of course the addition of new models. A second one is the ability to fine-tune these models and pipeline them into many other pieces of software.

In that regard, we DO want to be a [modular toolbox](#modular-toolbox), being [minimal](#minimal-user-api) enough (and hopefully well documented enough) so any ML/AI developer can use transformers without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.


## A surgical toolbox for model development

### Attention visualisation

If all models have the same API internally for attention computation, it allows us to build cool tools to visualize the inner workings of the attention mechanism. One particular piece of
machinery is the `attention mask`, a frequent cause of confusion. 

Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual "causal-only" models.

{{{fragment-attention-visualizer}}}


### Logging entire model activations

Further, because it is all PyTorch (and even more so now that we support only PyTorch), we can easily debug any model when we want to add it to transformers. We now have a power-user tool for porting or adding models that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON. 

It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our [core guideline](#source-of-truth).

![Model debugger interface](static/model_debugger.png)

### Cooking faster CUDA warmups

Having a clean _external_ API allows us to work on the true inner workings of transformers. One recent addition was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved model loading by pre-allocating GPU memory to avoid malloc bottlenecks. 

{{{fragment-warmup_demo}}}

It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible to keep your iteration speed up.

## Transformers-serve and continuous batching

Having all these models readily available allows us to use all of them with transformers-serve, and to interface with them through an OpenAI-like API pattern. 

```bash
transformers serve

curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'
```

This provides an OpenAI-compatible API with features like [continuous batching](https://github.com/huggingface/transformers/pull/38085) (also check [here](https://github.com/huggingface/transformers/pull/40426)) for better GPU utilization.
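The same request can be prepared from Python with the standard library alone. This is a sketch under assumptions: it presumes a server started with `transformers serve` is listening on `localhost:8000` as in the curl call above, `build_chat_request` is a hypothetical helper, streaming is turned off for simplicity, and the response layout in the commented-out part follows the usual OpenAI chat-completions shape, which I have not verified against the server.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen2.5-0.5B-Instruct", base_url="http://localhost:8000"):
    """Prepare (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.9,
        "max_tokens": 1000,
        "stream": False,  # the curl example streams; a single JSON reply is simpler to parse
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("hello")
print(request.get_method(), request.full_url)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     reply = json.loads(response.read())
#     print(reply["choices"][0]["message"]["content"])  # assumed OpenAI-style layout
```

Because the endpoint follows the OpenAI convention, existing OpenAI-compatible clients should also be able to point at the same base URL.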

Continuous batching is in itself very much linked to the great work of vLLM with the `paged attention kernel`, further justifying the facilitation of [external kernels](#community-kernels). 


## Community reusability
 
Transformers-serve is transformers-first, for sure, but it's not limited to that. Adding a model to transformers means:
- having it immediately available to the community
- having it immediately usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)

This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there is software more optimized than ours to handle serving. At the time of writing, more effort is going in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132) for instance.



## What is coming next

It sounds dumb, but it's true: the future is very soon. One tenet will be heavily broken when the next major version, v5, is released: [backwards compatibility](#backwards-compatibility). Instead, what we aim to be is way more of a [modular toolbox](#modular-toolbox), while maintaining a [consistent public surface](#consistent-public-surface).