Update
Browse files
- app/dist/_astro/{index.1qV_xJTe.css → index.CnFuS3U1.css} +0 -0
- app/dist/_astro/{index.1qV_xJTe.css.gz → index.CnFuS3U1.css.gz} +1 -1
- app/dist/index.html +0 -0
- app/dist/index.html.gz +2 -2
- app/src/components/HtmlEmbed.astro +1 -1
- app/src/content/article.mdx +34 -24
- app/src/content/embeds/banner.html +3 -3
- app/src/content/embeds/transformers/modeling_gemma3n_graph.html +0 -0
 
    	
app/dist/_astro/{index.1qV_xJTe.css → index.CnFuS3U1.css}
RENAMED

The diff for this file is too large to render. See raw diff.
app/dist/_astro/{index.1qV_xJTe.css.gz → index.CnFuS3U1.css.gz}
RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:61065d2e70dbdf047f73dea3d7a80b9389d2f80e2b798fc18544d445548d2191
 size 18332
app/dist/index.html
CHANGED

The diff for this file is too large to render. See raw diff.
app/dist/index.html.gz
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size 
+oid sha256:0607da7008565a4fa9721bb30f2e8d303d1070284a7779972d67f73714d5c8ac
+size 1654680
app/src/components/HtmlEmbed.astro
CHANGED

@@ -92,7 +92,7 @@ const htmlWithId = id && html ? html.replace(/<div class="([^"]*)"[^>]*>/, `<div
     background: var(--code-bg);
     border: 1px solid var(--border-color);
     border-radius: 10px;
-    padding: 
+    padding: 12px;
     z-index: calc(var(--z-elevated) + 1);
     position: relative;
   }
app/src/content/article.mdx
CHANGED
@@ -24,9 +24,10 @@ Built on `PyTorch`, it's a foundational tool for modern LLM usage, research, edu
 
 This scale presents a monumental engineering challenge.
 
 How do you keep such a ship afloat, made of so many moving, unrelated parts, contributed to by a buzzing hivemind? Especially as the pace of ML research accelerates? We receive constant feedback on everything from function signatures with hundreds of arguments to duplicated code and optimization concerns, and we listen to all of it, or try to. The library's usage keeps on growing, and we are a small team of maintainers and contributors, backed by hundreds of open-source community members.
+We continue to support all new models and expect to do so for the foreseeable future.
 
-This post dissects the design philosophy that makes this possible. It's a continuation of our older principles, detailed on our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, as well as its accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy). More recently, and I recommend the read if it's not done yet, a blog post about [recent upgrades to transformers](https://huggingface.co/blog/faster-transformers) was written, explaining in particular what makes the library faster today. Again, all of that development was only made possible thanks to these principles.
+This post dissects the design philosophy that makes this possible today. It's a continuation of our older principles, detailed on our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, as well as its accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy). More recently, and I recommend the read if it's not done yet, a blog post about [recent upgrades to transformers](https://huggingface.co/blog/faster-transformers) was written, explaining in particular what makes the library faster today. Again, all of that development was only made possible thanks to these principles.
 
 We codify the "tenets" that guide our development, demonstrate how they are implemented in code, and show the measurable impact they have on the library's sustainability and growth.
@@ -54,7 +55,7 @@ Note that the library _evolved_ towards these principles, and that they _emerged
 <li class="tenet">
 <a id="source-of-truth"></a>
 <strong>Source of Truth</strong>
-<p>We aim be a [source of truth for all model definitions](#https://huggingface.co/blog/transformers-model-definition). This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
+<p>We aim to be a [source of truth for all model definitions](https://huggingface.co/blog/transformers-model-definition). This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
 <em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
 </li>
@@ -128,7 +129,7 @@ Every core functionality _must_ be in the modeling code, every non-core function
 
 This comes at a great cost. Enter the `#Copied from...` mechanism: for a long time, these comments indicated that some code was copied from another model, saving time both for the reviewers and for the CI. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet we could not remove them.
 
-We needed to separate both principles that were so far intertwined, [repetition](#do-repeat-yourself) and [
+We needed to separate both principles that were so far intertwined, [repetition](#do-repeat-yourself) and [hackability](#one-model-one-file).
 
 What was the solution to this?
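To make the mechanism concrete, here is a minimal sketch of what a `# Copied from` annotation looks like above a duplicated block. The model pair is illustrative and the MLP body is a simplified Llama-style gated MLP, not a verbatim excerpt from the library:

```python
# A sketch of the `# Copied from` convention: the comment marks code duplicated
# from another model so CI tooling can verify the copy stays in sync. The model
# names below are illustrative, not a specific file from the library.
import torch.nn as nn


# Copied from transformers.models.llama.modeling_llama.LlamaMLP with Llama->NewModel
class NewModelMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        # Gated MLP: down( act(gate(x)) * up(x) )
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
```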
@@ -139,11 +140,12 @@ Read the code in one place (<a href="#one-model-one-file">One Model, One File</a
 
 ## <a id="modular"></a> Modular transformers
 
-Transformers is an 
+Transformers is an opinionated library. The previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, and the [blog post](https://huggingface.co/blog/transformers-design-philosophy) were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. [`modular` transformers were introduced](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file).
 
 We amended the principle of [DRY*](#do-repeat-yourself) by removing progressively all pieces of code that were "copied from" another file.
 
 It works as follows. In order to contribute a model, say for instance  define a `modular_` file that can inherit from _any function across all other modeling, configuration and processor files_.
+This modular file can use inheritance across models: and then, it will be unravelled into a fully functional modeling file.
 
 <summary id="generated-modeling">Auto-generated modeling code</summary>
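For readers who have not opened one yet, here is a hedged sketch of the shape such a `modular_*.py` file takes. The model name is hypothetical; `LlamaConfig`, `LlamaMLP`, and `LlamaForCausalLM` are existing classes it could inherit from, and the library's conversion utility would expand it into a standalone modeling file:

```python
# Hypothetical models/newmodel/modular_newmodel.py. Only the differences from the
# parent model need to be written; the expanded modeling_newmodel.py is generated
# so the one-model-one-file reading experience is preserved.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaMLP


class NewModelConfig(LlamaConfig):
    model_type = "newmodel"


class NewModelMLP(LlamaMLP):
    # Inherited as-is; override here only if the MLP differs from Llama's.
    pass


class NewModelForCausalLM(LlamaForCausalLM):
    config_class = NewModelConfig
```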
@@ -157,7 +159,7 @@ What is the consequence? When adding a model, we do not need to go over the enti
 
 When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling (right side) that is run, and all the tests are run on the modeling code.
 
-What does that 
+What does that give us?
 
 <div class="crumbs">
 A small <code>modular_*.py</code> declares reuse; the expanded modeling file stays visible (<a href="#one-model-one-file">tenet kept</a>). Reviewers and contributors maintain the shard, not the repetition. <strong>Next:</strong> the measurable effect on effective LOC and maintenance cost.
@@ -166,15 +168,14 @@ A small <code>modular_*.py</code> declares reuse; the expanded modeling file sta
 
 ### A maintainable control surface
 
-The effect of modular can be measured 
-If it only has a modeling file, we add its LOC count.
+The effect of modular can be measured in lines of code (LOC). If a model only has a modeling file, we add its LOC count.
 However, if a model has a modular_*.py and a corresponding automatically generated modeling_*.py, we only count the LOC under the modular file. The modeling code has no maintenance cost as it is strictly dependent on the modular file.
 
 That gives an "effective LOC" curve: the **maintenance surface**.
 
-
-Less code to hand-maintain means fewer places to break
+Measured on git history, raw `modeling_*.py` grew at ~362 LOC/day before modular; counting only modular shards yields ~25 LOC/day after, about **15× lower**. The curve represents the **maintenance surface** today: what maintainers actually read and review.
+
+Less code to hand-maintain means fewer places to break. LOC is not complexity, but the two correlate in review effort and change risk.
 
 <HtmlEmbed src="transformers/loc-growth.html" />
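A small sketch of how that counting rule can be applied to a checkout of the repository. The path and the per-folder rule follow the description above; treat it as an approximation, not the exact script behind the plot:

```python
# Minimal sketch of the "effective LOC" metric, assuming a local checkout of
# transformers. For each model folder: if a modular_*.py shard exists, count only
# its lines; otherwise count the modeling_*.py file(s).
from pathlib import Path


def count_loc(path: Path) -> int:
    return sum(1 for line in path.read_text(encoding="utf-8").splitlines() if line.strip())


def effective_loc(models_root: str = "src/transformers/models") -> int:
    total = 0
    for model_dir in Path(models_root).iterdir():
        if not model_dir.is_dir():
            continue
        modular = list(model_dir.glob("modular_*.py"))
        files = modular if modular else list(model_dir.glob("modeling_*.py"))
        total += sum(count_loc(f) for f in files)
    return total


if __name__ == "__main__":
    print(effective_loc())
```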
@@ -194,9 +195,9 @@ Evidence: effective LOC drops ~15× when counting shards instead of expanded mod
 
 ### <a id="attention-classes"></a> External Attention classes
 
-
-We keep a `Callable` for the naive implementation of the attention, called "eager" computation.  
+The solution we chose for the "attention abstraction problem" was to move to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allows the following:
+
+We keep a `Callable` for the naive implementation of the attention, called "eager" computation. We thus name this Callable `eager_attention_forward`, and it can be run as long as the user has `torch` installed, which is a requirement in any case.
 
 In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked, and can use other Callables, including kernel bindings that are much faster, if they are available.
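A simplified, self-contained sketch of that function interface. The registry name and signatures here are illustrative; in the library the Callable is selected from the config's `_attn_implementation` field, and faster backends (SDPA, flash attention, hub kernels) plug in the same way:

```python
# Sketch of the eager-vs-other attention dispatch described above. Not the
# library's exact code: a standalone illustration of moving from a class
# interface to a function interface keyed by a config string.
from typing import Callable, Dict

import torch
import torch.nn.functional as F


def eager_attention_forward(q, k, v, attention_mask=None, scaling=None):
    scaling = scaling if scaling is not None else q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        scores = scores + attention_mask
    return torch.matmul(F.softmax(scores, dim=-1), v)


def sdpa_attention_forward(q, k, v, attention_mask=None, scaling=None):
    # Uses PyTorch's fused kernel when available (torch >= 2.1 for the scale kwarg).
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attention_mask, scale=scaling)


ATTENTION_FUNCTIONS: Dict[str, Callable] = {
    "eager": eager_attention_forward,
    "sdpa": sdpa_attention_forward,
}


def attend(q, k, v, attn_implementation: str = "eager"):
    attention_interface: Callable = eager_attention_forward
    if attn_implementation != "eager":
        attention_interface = ATTENTION_FUNCTIONS[attn_implementation]
    return attention_interface(q, k, v)
```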
@@ -210,7 +211,7 @@ if self.config._attn_implementation != "eager":
 
 A strength of the new attention interface is the possibility to enforce specific kwargs, which are needed by kernel providers and other dependencies. We know that kwargs are often a necessary evil that plagues tools with widespread compatibility; and it is something we have aimed to reduce, and will continue to reduce in order to improve readability - with them, the current system is a [minimal user api](#minimal-user-api).
 
-
+Hence, backend integrations sometimes require specific kwargs. We reduce that surface and document expectations; where flexibility is necessary, we plan to use `typing.Annotated` to convey shapes and invariants without constraining integrations. Such an implementation could look like this in the future:
 
 ```python
 from typing import Annotated
@@ -255,7 +256,7 @@ Sharding is configuration (<code>tp_plan</code>), not edits to <code>Linear</cod
 
 ### <a id="layers-attentions-caches"></a> Layers, attentions and caches
 
-Following the same logic, the _nature_ of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we 
+Following the same logic, the _nature_ of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we define a mapping that can be then
 
 ```python
@@ -297,7 +298,7 @@ class GlmRMSNorm(nn.Module):
     ...
 ```
 
-
+This also opens another contribution path: GPU specialists can contribute optimized kernels to the kernel hub, and have them usable in `transformers`. You can check on the [kernel community blog post](https://huggingface.co/blog/hello-hf-kernels) to learn more about it!
 
 Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
@@ -337,10 +338,13 @@ Graph reading guide: nodes are models; edges are modular imports. Llama-lineage 
 
 ### Many models, but not enough yet, are alike
 
-So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters stringed together.  
+So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters stringed together. We also tried code-embedding models that ranked candidates better in practice, but for this post we stick to the deterministic Jaccard index.
 
 It is interesting, for that, to look at _when_ we deployed this modular logic and what was its rippling effect on the library. You can check the [larger space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor) to play around, but the gist is: adding modular allowed us to connect more and more models to solid reference points. We have a lot of gaps to fill in still.
 
+Zoom out below - it's full of models. You can click on a node to see its connections better, or use the text box to search for a model.
+
 <HtmlEmbed src="transformers/model-timeline.html" />
 
 If you've checked out llava, you've seen that llava_video is a red node, connected by a red edge to llava: it's a candidate, something that we can _likely_ remodularize, [not touching the actual model](#backwards-compatibility) but being much more readable with [DRY*](#do-repeat-yourself).
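A minimal sketch of that similarity heuristic, assuming two modeling files on disk. The identifier-level tokenization and the threshold are illustrative choices, not the exact pipeline behind the graphs:

```python
# Jaccard similarity between the identifier sets of two modeling files, used as a
# cheap signal that one model is a likely "parent" of another.
import re
from pathlib import Path


def identifier_set(path: str) -> set[str]:
    text = Path(path).read_text(encoding="utf-8")
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative usage: flag a candidate for re-modularization when overlap is high.
# sim = jaccard(identifier_set("modeling_llava.py"), identifier_set("modeling_llava_video.py"))
# if sim > 0.8:
#     print("likely parent/child pair, candidate for a modular_*.py shard")
```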
@@ -353,7 +357,7 @@ Similarity (Jaccard; embeddings tried separately) surfaces likely parents; the t
 
 We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main improvement points where we can work.
 
-For instance, we thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an 
+For instance, we thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like:
 
 ```python
 class InputsEmbeddingMixerMixin(nn.Module):
@@ -362,9 +366,15 @@ class InputsEmbeddingMixerMixin(nn.Module):
 
 But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). Embedding mixin is part of the model; removing it would break it. A user opening [`modeling_qwen2.5_vl`](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) should not have to go to another file to understand how it works.
 
+What is the current state of these "abstractions" across the codebase?
+You will see all the imports around a modeling file, here [Gemma3n](https://huggingface.co/google/gemma-3n-E4B-it). Zoom and drag to explore.
+
+<HtmlEmbed src="transformers/modeling_gemma3n_graph.html" />
+
+As you can see, the `GenerationMixin` node is already very heavy. It encompasses all of the utilities around `.generate`, and is second only to `nn.Module`.
+That means every decision we make to abstract something else has to be extremely careful.
 
 The following [Pull request to standardize placeholder masking](https://github.com/huggingface/transformers/pull/39777) is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:
@@ -419,7 +429,7 @@ Keep VLM embedding mix in the modeling file (semantics), standardize safe helper
 
 ### On image processing and processors
 
 Choosing to be a `torch`-first software meant shedding a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal in the number of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where inputs were before assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch` and `torchvision` native inputs allowed us to massively speed up the processing time for each model.
 
 The gains in performance are immense: up to 20x speedups for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on GPU.
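A hedged sketch of the idea, using `torchvision.transforms.v2` on batched tensors. The class and parameter names are illustrative, not the library's actual fast image processors; the point is that tensor-native inputs can stay on the accelerator end to end:

```python
# Tensor-native image preprocessing sketch: accept (B, C, H, W) tensors directly,
# keep them on the device they arrive on, and batch the resize/normalize ops.
import torch
from torchvision.transforms import v2


class FastImageProcessorSketch:
    def __init__(self, size=(224, 224), mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
        self.pipeline = v2.Compose([
            v2.Resize(size, antialias=True),
            v2.ToDtype(torch.float32, scale=True),   # uint8 [0, 255] -> float [0, 1]
            v2.Normalize(mean=mean, std=std),
        ])

    def __call__(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, channels, height, width), possibly already on GPU
        return self.pipeline(images)


# processor = FastImageProcessorSketch()
# pixel_values = processor(torch.randint(0, 256, (8, 3, 512, 512), dtype=torch.uint8))
```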
@@ -439,7 +449,7 @@ Having a framework means forcing users into it. It restrains flexibility and cre
 
 Among the most valuable contributions to `transformers` is of course the addition of new models. Very recently, [OpenAI added GPT-OSS](https://huggingface.co/blog/welcome-openai-gpt-oss), which prompted the addition of many new features to the library in order to support [their model](https://huggingface.co/openai/gpt-oss-120b).
 
-A second one is the ability to fine-tune and pipeline these models into many other 
+A second one is the ability to fine-tune and pipeline these models into many other software stacks. Check here on the hub how many finetunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!
 
 <div class="crumbs">
@@ -455,7 +465,7 @@ Talking about dependencies, we can take a look at the number of downloads for tr
 <HtmlEmbed src="transformers/model-visualisation.html" />
 </div>
 
-As the codebase grows, with our friend codebase [Sentence Transformers](https://huggingface.co/sentence-transformers), we need to maintain this one as well. Retrieval use-cases, smart 
+As the codebase grows, with our friend codebase [Sentence Transformers](https://huggingface.co/sentence-transformers), we need to maintain this one as well. Retrieval use-cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.
 
 In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
@@ -549,6 +559,6 @@ Being a good backend consumer requires a consistent public surface; modular shar
 
 ## What is coming next
 
-The next major version of `transformers` is just around the corner. When v5 is 
+The next major version of `transformers` is just around the corner (and will have another blog post to its name when it comes out). When v5 is released, we aim to keep [backwards compatibility](#backwards-compatibility) as solid as possible. The changes we make now are in service of that goal.
 
-
+We will lean further into a modular toolbox, not a framework. You should not be forced to rewrite modeling code. It's better when a model can inherit from `PreTrainedModel` and opt into Tensor Parallel, `from_pretrained`, sharding, `push_to_hub`, loss plumbing, and external stacks like PEFT/TRL/SGLang/vLLM.
    	
app/src/content/embeds/banner.html
CHANGED

@@ -33,7 +33,7 @@
   font-weight: 600;
   paint-order: stroke fill;
   stroke: var(--page-bg, #ffffff);
-  stroke-width: 
+  stroke-width: 1px;
   font-size: 30px;
   font-family: 'Inter', system-ui, Arial, sans-serif;
 }

@@ -76,8 +76,8 @@ const mask = defs.append('linearGradient')
   .attr('id','fadeX')
   .attr('x1','0%').attr('x2','100%').attr('y1','0%').attr('y2','0%');
 mask.append('stop').attr('offset','0%').attr('stop-color','white').attr('stop-opacity',0);
-mask.append('stop').attr('offset','
-mask.append('stop').attr('offset','
+mask.append('stop').attr('offset','8%').attr('stop-color','white').attr('stop-opacity',1);
+mask.append('stop').attr('offset','92%').attr('stop-color','white').attr('stop-opacity',1);
 mask.append('stop').attr('offset','100%').attr('stop-color','white').attr('stop-opacity',0);
 
 // Background rect with mask applied (transparent fill; mask only affects children that reference it)
    	
app/src/content/embeds/transformers/modeling_gemma3n_graph.html
ADDED

The diff for this file is too large to render. See raw diff.