build duh
app/dist/_astro/{index.7hgRH84_.css → index.BxklM4ay.css}
RENAMED
|
The diff for this file is too large to render.
See raw diff
|
|
|
app/dist/_astro/{index.7hgRH84_.css.gz → index.BxklM4ay.css.gz}
RENAMED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 18473
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:3bed21d41560dce79a059125f524f651cf2c3c9026807d6fe1037a37a6fe544a
|
| 3 |
size 18473
|
app/dist/images/transformers/classic_encoders.png
ADDED
|
Git LFS Details
|
app/dist/index.html
CHANGED
|
@@ -12,8 +12,8 @@
|
|
| 12 |
document.documentElement.setAttribute("data-theme", theme);
|
| 13 |
} catch {}
|
| 14 |
})();
|
| 15 |
-
</script><script type="module" src="/scripts/color-palettes.js"></script><!-- TO MANAGE PROPERLY --><script src="https://cdn.plot.ly/plotly-3.0.0.min.js" charset="utf-8"></script><link rel="stylesheet" href="/_astro/index.
|
| 16 |
-
<script type="module" src="/_astro/page.CH0W_C1Z.js"></script></head> <body> <button id="theme-toggle" aria-label="Toggle color theme" data-astro-cid-x3pjskd3> <svg class="icon light" width="20" height="20" viewBox="0 0 24 24" aria-hidden="true" focusable="false" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" data-astro-cid-x3pjskd3> <circle cx="12" cy="12" r="5" data-astro-cid-x3pjskd3></circle> <line x1="12" y1="1" x2="12" y2="4" data-astro-cid-x3pjskd3></line> <line x1="12" y1="20" x2="12" y2="23" data-astro-cid-x3pjskd3></line> <line x1="1" y1="12" x2="4" y2="12" data-astro-cid-x3pjskd3></line> <line x1="20" y1="12" x2="23" y2="12" data-astro-cid-x3pjskd3></line> <line x1="4.22" y1="4.22" x2="6.34" y2="6.34" data-astro-cid-x3pjskd3></line> <line x1="17.66" y1="17.66" x2="19.78" y2="19.78" data-astro-cid-x3pjskd3></line> <line x1="4.22" y1="19.78" x2="6.34" y2="17.66" data-astro-cid-x3pjskd3></line> <line x1="17.66" y1="6.34" x2="19.78" y2="4.22" data-astro-cid-x3pjskd3></line> </svg> <svg class="icon dark" width="20" height="20" viewBox="0 0 24 24" aria-hidden="true" focusable="false" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" data-astro-cid-x3pjskd3> <path d="M21 12.79A9 9 0 1 1 11.21 3 7 7 0 0 0 21 12.79z" data-astro-cid-x3pjskd3></path> </svg> </button> <section class="hero" data-astro-cid-bbe6dxrz> <h1 class="hero-title" data-astro-cid-bbe6dxrz>Maintain the unmaintainable:<br/>1M python loc, 400+ models</h1> <div class="hero-banner" data-astro-cid-bbe6dxrz> <figure class="html-embed"><div class="html-embed__card is-frameless"><div id="frag-
|
| 17 |
@import url('https://fonts.googleapis.com/css2?family=Inter:wght@500;600&display=swap');
|
| 18 |
|
| 19 |
.banner-container {
|
|
@@ -425,7 +425,7 @@ We continue to support all new models and expect to do so for the foreseeable fu
|
|
| 425 |
<h2 id="the-core-tenets-of-transformers"><a href="#the-core-tenets-of-transformers">The core tenets of transformers</a></h2>
|
| 426 |
<p>We summarize the foundations on which we’ve built everything, and write the “tenets” of the library. They behave like <em>software interfaces</em>, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.</p>
|
| 427 |
<p>These principles were not decided in a vacuum. The library <em>evolved</em> towards them, and once they <em>emerged</em>, they were recognized as critical.</p>
|
| 428 |
-
<div class="tenet-list"><ol><li class="tenet"><a id="source-of-truth"></a><strong>Source of Truth</strong><p>We aim to be a <a href="https://huggingface.co/blog/transformers-model-definition">source of truth for all model definitions</a>. This is more of a goal than a tenet, but it strongly guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original implementations. If we are successful, they should become reference baselines for the ecosystem, so they’ll be easily adopted by downstream libraries and projects. It’s much easier for a project to
|
| 429 |
<p>When a PR is merged, it is because the contribution is worthwhile, and because the <code>transformers</code> team finds the design of the contribution to be aligned with the tenets.</p>
|
| 430 |
<p>Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors, crannies everywhere, built by thousands of different workers. We <em>try</em> to make it so all the code added is compliant, because if we fail and merge it, we cannot change it lest we break <a href="#backwards-compatibility">backwards compatibility</a>.</p>
|
| 431 |
<p>To see what constitutes adherence to the tenets, let’s take the example of code repetition.</p>
|
|
@@ -445,10 +445,10 @@ We continue to support all new models and expect to do so for the foreseeable fu
|
|
| 445 |
<h2 id="-modular-transformers"><a href="#-modular-transformers"><a id="modular"></a> Modular transformers</a></h2>
|
| 446 |
<p>Transformers is an opinionated library. The previous <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page, and the <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post</a> were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. <a href="https://huggingface.co/docs/transformers/en/modular_transformers"><code>modular</code> transformers was introduced</a> to allow a form of inheritance without breaking <a href="#one-model-one-file">One model, One file</a>.</p>
|
| 447 |
<p>We amended the principle of <a href="#do-repeat-yourself">DRY*</a> by progressively removing all pieces of code that were “copied from” another file.</p>
|
| 448 |
-
<p>It works as follows. In order to contribute a model,
|
| 449 |
The modular file can use inheritance across models, and it is then unravelled into a fully functional modeling file.</p>
|
| 450 |
<summary id="generated-modeling">Auto-generated modeling code</summary>
|
| 451 |
-
<figure class="html-embed"><div class="html-embed__card"><div id="frag-
|
| 452 |
<div class="code-column" style="border: 1px solid #e2e8f0; border-radius: 8px; overflow: hidden;">
|
| 453 |
<div class="code-header" style="background: #f8f9fa; padding: 0.75rem 1rem; font-weight: 600; color: #495057; border-bottom: 1px solid #e2e8f0;">
|
| 454 |
modular_glm.py
|
|
@@ -612,7 +612,7 @@ However, if a model has a modular_<em>.py and a corresponding automatically gene
|
|
| 612 |
<p>That gives an “effective LOC” curve: the 𝗺𝗮𝗶𝗻𝘁𝗲𝗻𝗮𝗻𝗰𝗲 𝘀𝘂𝗿𝗳𝗮𝗰𝗲.</p>
|
| 613 |
<p>Measured on git history, raw <code>modeling_*.py</code> grew at ~362 LOC/day before modular; counting only modular shards yields ~25 LOC/day after — about <strong>15× lower</strong>. The effective curve (blue line below) represents the <strong>maintenance surface</strong> today: what maintainers actually read and review.</p>
|
| 614 |
<p>Less code to hand-maintain means fewer places to break. Naturally, LOC is not a direct measure of complexity, but it correlates with review effort and change risk.</p>
|
| 615 |
-
<figure class="html-embed"><div class="html-embed__card"><div id="frag-
|
| 616 |
src="https://molbap-loc-1.hf.space"
|
| 617 |
style="width:100%; height:900px; border:0"
|
| 618 |
allow="clipboard-read; clipboard-write; fullscreen"
|
|
@@ -624,7 +624,7 @@ If you zoom in, you’ll notice there’s a sharp drop near the end, it’s esse
|
|
| 624 |
<p>We recently undertook a deep refactor of the attention implementation. You’ve likely heard about <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a> and its several variants.</p>
|
| 625 |
<p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
|
| 626 |
<p>However, we were adding specific torch operations for each backend (sdpa, the several flash-attention iterations, flex attention), but it wasn’t a <a href="#minimal-user-api">minimal user api</a>. The next section explains what we did.</p>
|
| 627 |
-
<div class="crumbs"><p>Evidence: effective (i.e.,
|
| 628 |
<h3 id="-external-attention-classes"><a href="#-external-attention-classes"><a id="attention-classes"></a> External Attention classes</a></h3>
|
| 629 |
<p>The solution for the “attention abstraction problem” was to move to a standard <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allows the following:</p>
|
| 630 |
<p>The naive implementation of attention, called “eager”, is available by default. We use a <code>Callable</code> called <code>eager_attention_forward</code>, which can run as long as the user has PyTorch installed – which is a requirement anyway.</p>
|
|
@@ -634,7 +634,8 @@ If you zoom in, you’ll notice there’s a sharp drop near the end, it’s esse
|
|
| 634 |
<span class="line"><span style="--shiki-light:#D73A49;--shiki-dark:#F97583">if</span><span style="--shiki-light:#005CC5;--shiki-dark:#79B8FF"> self</span><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">.config._attn_implementation </span><span style="--shiki-light:#D73A49;--shiki-dark:#F97583">!=</span><span style="--shiki-light:#032F62;--shiki-dark:#9ECBFF"> "eager"</span><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">:</span></span>
|
| 635 |
<span class="line"><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8"> attention_interface </span><span style="--shiki-light:#D73A49;--shiki-dark:#F97583">=</span><span style="--shiki-light:#005CC5;--shiki-dark:#79B8FF"> ALL_ATTENTION_FUNCTIONS</span><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">[</span><span style="--shiki-light:#005CC5;--shiki-dark:#79B8FF">self</span><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">.config._attn_implementation]</span></span>
|
| 636 |
<span class="line"></span></code></pre></div>
|
| 637 |
-
<p>
|
|
|
|
| 638 |
<p>Backend integrations sometimes require specific kwargs.</p>
|
| 639 |
<p>We know that kwargs are often a necessary evil that plagues tools with widespread compatibility; it is something we have aimed to reduce, and will continue to reduce, in order to improve readability - even with them, the current system is a <a href="#minimal-user-api">minimal user api</a>.</p>
|
| 640 |
<p>We reduce that surface and document expectations; where flexibility is necessary, we plan to use <code>typing.Annotated</code> to convey shapes and invariants without constraining integrations. Such an implementation could look like this in the future:</p>
|
|
@@ -650,7 +651,7 @@ If you zoom in, you’ll notice there’s a sharp drop near the end, it’s esse
|
|
| 650 |
<p>Because we want to avoid code modifications that are unrelated to the model.</p>
|
| 651 |
<p>We choose to place the level of abstraction higher than the device placement: a matrix multiplication - an <code>nn.Linear</code> layer - should always be expressed in the same way, regardless of how it is placed.</p>
|
| 652 |
<p>Hence, we want to touch the modeling code <a href="#minimal-user-api">minimally</a>, and only modify it when <em>architectural changes</em> are involved – not depending on the way you run it. For tensor parallelism, we simply specify a <code>tp_plan</code>:</p>
|
| 653 |
-
<figure class="html-embed"><div class="html-embed__card"><div id="frag-
|
| 654 |
base_model_tp_plan = {
|
| 655 |
"layers.*.self_attn.q_proj": "colwise",
|
| 656 |
"layers.*.self_attn.k_proj": "colwise",
|
|
@@ -722,21 +723,20 @@ So I wanted to take a look at the current <strong>state of modularity</strong> a
|
|
| 722 |
<p>So what do we see?</p>
|
| 723 |
<p>(Graph reading guide: nodes are models; edges are modular imports).</p>
|
| 724 |
<p>Check out the <a href="https://huggingface.co/spaces/Molbap/transformers-modular-refactor">full viewer here</a> (tab “dependency graph”, hit “build graph”) for better manipulation and exploration.</p>
|
| 725 |
-
<figure class="html-embed"><div class="html-embed__card"><div id="frag-
|
| 726 |
src="https://molbap-dependencies-1.hf.space"
|
| 727 |
style="width:100%; height:680px; border:0"
|
| 728 |
allow="clipboard-read; clipboard-write; fullscreen"
|
| 729 |
referrerpolicy="no-referrer-when-downgrade"
|
| 730 |
></iframe></div></div></figure>
|
| 731 |
-
<p>
|
| 732 |
-
|
| 733 |
<p><img src="/images/transformers/llama_center.png" alt="Llama in the center"/></p>
|
| 734 |
-
<p>Radically different architectures such as mamba have spawned their own dependency subgraph.</p>
|
| 735 |
-
<p>Audio models form sparser archipelagos, see for instance wav2vec2 which is a significant basis.</p>
|
| 736 |
<p><img src="/images/transformers/cluster_wave2vec2.png" alt="Wav2vec2 influence"/></p>
|
| 737 |
-
<p>In the case of VLMs, there’s far too many vision-based architectures that are not yet defined as modulars of other existing archs. In other words, there is no strong reference point in terms of software for vision models
|
| 738 |
-
|
| 739 |
-
<p>As you can see, there is a small DETR island:
|
| 740 |
<img src="/images/transformers/detr_island.png" alt="DETR archipelago"/></p>
|
| 741 |
<p>There is also a little llava pocket, and so on, but it’s not comparable to the centrality observed for llama.</p>
|
| 742 |
<p>Another problem is that this visualization only shows <code>modular</code> models. Several models still do NOT have a modular file. If we zoom out significantly, we can see them: the red nodes are models that do not have a modular file yet.</p>
|
|
@@ -746,19 +746,22 @@ referrerpolicy="no-referrer-when-downgrade"
|
|
| 746 |
<strong>Next:</strong> timeline + similarity signals to spot modularisable candidates.</p></div>
|
| 747 |
<h3 id="many-models-but-not-enough-yet-are-alike"><a href="#many-models-but-not-enough-yet-are-alike">Many models, but not enough yet, are alike</a></h3>
|
| 748 |
<p>I looked into Jaccard similarity, which we use to measure set differences, to find similarities across models. I know that code is more than a set of characters strung together. We also tried code-embedding models that ranked candidates better in practice, but for this post we stick to the deterministic Jaccard index.</p>
|
| 749 |
-
<p>It is interesting, for our comparison, to look at <em>when</em> we deployed the modular logic and what was its rippling effect on the library.
|
| 750 |
<p>Yet, we still have a lot of gaps to fill.</p>
|
| 751 |
<p>Zoom out below - it’s full of models. You can click on a node to see its connections better, or use the text box to search for a model. You can use the <a href="https://huggingface.co/spaces/Molbap/transformers-modular-refactor">full viewer</a> (tab “timeline”, hit “build timeline”) for better exploration.</p>
|
| 752 |
-
<figure class="html-embed"><div class="html-embed__card"><div id="frag-
|
| 753 |
src="https://molbap-timeline-1.hf.space"
|
| 754 |
style="width:100%; height:680px; border:0"
|
| 755 |
allow="clipboard-read; clipboard-write; fullscreen"
|
| 756 |
referrerpolicy="no-referrer-when-downgrade"
|
| 757 |
></iframe></div></div></figure>
|
| 758 |
<p>Let’s look at a few highly connected models. Let’s start with the foundational work of <a href="https://arxiv.org/abs/2304.08485">Llava</a>.</p>
|
| 759 |
-
<p><img src="/images/transformers/timeline_llava.png" alt="
|
| 760 |
<p>You see that <code>llava_video</code> is a red node, connected by a red edge to <code>llava</code>: it’s a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
|
| 761 |
-
<
|
|
|
|
|
|
|
|
|
|
| 762 |
<h3 id="vlm-improvements-avoiding-abstraction"><a href="#vlm-improvements-avoiding-abstraction">VLM improvements, avoiding abstraction</a></h3>
|
| 763 |
<p>We don’t yet have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attention bridges). This is one of the main areas where we can improve.</p>
|
| 764 |
<p>For instance, we thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like this:</p>
|
|
@@ -813,11 +816,12 @@ That means every decision we make to abstract something else has to be extremely
|
|
| 813 |
<span class="line"><span style="--shiki-light:#D73A49;--shiki-dark:#F97583"> return</span><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8"> special_image_mask, special_video_mask</span></span>
|
| 814 |
<span class="line"></span></code></pre></div>
|
| 815 |
<p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because it’d break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
|
|
|
|
| 816 |
<div class="crumbs"><p>Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don’t migrate behavior to <code>PreTrainedModel</code>.
|
| 817 |
<strong>Next:</strong> pipeline-level wins that came from PyTorch-first choices (fast processors).</p></div>
|
| 818 |
<h3 id="on-image-processing-and-processors"><a href="#on-image-processing-and-processors">On image processing and processors</a></h3>
|
| 819 |
-
<p>Deciding to become a <code>torch</code>-first library meant relieving a tremendous amount of support for <code>jax </code> and <code>TensorFlow</code>, and it also meant that we could be more lenient
|
| 820 |
-
<p>The gains in performance are immense, up to 20x speedup for most models when using compiled torchvision ops. Furthermore,
|
| 821 |
<p><img src="/images/transformers/fast_image_processors.png" alt="Fast Image Processors Performance"/>
|
| 822 |
<p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p></p>
|
| 823 |
<div class="crumbs"><p>PyTorch-first lets processors assume torch/torchvision and run the whole pipeline on GPU; big per-model speedups.</p><p><strong>Next:</strong> how this lowers friction for contributors and downstream users.</p></div>
|
|
@@ -825,12 +829,12 @@ That means every decision we make to abstract something else has to be extremely
|
|
| 825 |
<p>This is an overall objective: there’s no <code>transformers</code> without its community.</p>
|
| 826 |
<p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
|
| 827 |
<p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. Very recently, <a href="https://huggingface.co/blog/welcome-openai-gpt-oss">OpenAI added GPT-OSS</a>, which prompted the addition of many new features to the library in order to support <a href="https://huggingface.co/openai/gpt-oss-120b">their model</a>.</p>
|
| 828 |
-
<p>
|
| 829 |
-
<
|
| 830 |
-
<strong>Next:</strong> power tools enabled by a consistent API.</p></div>
|
| 831 |
<h3 id="-models-popularity"><a href="#-models-popularity"><a id="encoders-ftw"></a> Models popularity</a></h3>
|
| 832 |
-
<p>Talking about dependencies, we can take a look at the number of downloads as a measure of popularity. One thing we see is the prominence of encoders, despite the apparent prevalence of decoder LLMs. The reason is that encoders are used to generate embeddings, which have multiple downstream uses. Just check out <a href="https://huggingface.co/blog/embeddinggemma">EmbeddingGemma</a> for a modern recap. Hence, it is vital to keep the encoders portion of the library viable, usable, fine-
|
| 833 |
-
<div><figure class="html-embed"><div class="html-embed__card"><div id="frag-
|
| 834 |
<head><meta charset="utf-8" /></head>
|
| 835 |
<body>
|
| 836 |
<div> <script type="text/javascript">window.PlotlyConfig = {MathJaxConfig: 'local'};</script>
|
|
@@ -4723,11 +4727,12 @@ return Plotly;
|
|
| 4723 |
<p>So, how do these design choices, these “tenets” influence development of models and overall usage of transformers?</p>
|
| 4724 |
<div class="crumbs"><p>Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS).</p><p><strong>Next:</strong> dev tools that leverage unified attention APIs and PyTorch-only internals.</p></div>
|
| 4725 |
<h2 id="a-surgical-toolbox-for-model-development"><a href="#a-surgical-toolbox-for-model-development">A surgical toolbox for model development</a></h2>
|
|
|
|
| 4726 |
<h3 id="attention-visualisation"><a href="#attention-visualisation">Attention visualisation</a></h3>
|
| 4727 |
<p>All models have the same API for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>.</p>
|
| 4728 |
<p>This uniformity allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
|
| 4729 |
<p>One particular piece of machinery is the <code>attention mask</code>. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual “causal-only” models.</p>
|
| 4730 |
-
<figure class="html-embed"><div class="html-embed__card"><div id="frag-
|
| 4731 |
<div style="max-width: 940px; margin: 16px 0; border:1px solid #2a2f3a; border-radius:8px; background:#0b0f19; font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, 'Liberation Mono', 'Courier New', monospace; color:#e5e7eb;">
|
| 4732 |
<div style="display:flex; align-items:center; gap:8px; padding:8px 10px; border-bottom:1px solid #1f2430; background:#111827; border-top-left-radius:8px; border-top-right-radius:8px;">
|
| 4733 |
<span style="width:10px; height:10px; background:#ef4444; border-radius:50%; display:inline-block;"></span>
|
|
@@ -4777,10 +4782,10 @@ return Plotly;
|
|
| 4777 |
<p>Because everything is PyTorch, we can easily <a href="https://huggingface.co/docs/transformers/internal/model_debugging_utils">debug any model</a> when we want to add it to transformers. We now have a power-user tool for porting or adding models, that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.</p>
|
| 4778 |
<p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, to match our <a href="#source-of-truth">Source of Truth guideline</a>.</p>
|
| 4779 |
<p><img src="/images/transformers/model_debugger.png" alt="Model debugger interface"/></p>
|
| 4780 |
-
<div class="crumbs"><p>Forward interception and nested JSON logging align ports to reference implementations, reinforcing “Source of Truth
|
| 4781 |
<h3 id="cooking-faster-cuda-warmups"><a href="#cooking-faster-cuda-warmups">Cooking faster CUDA warmups</a></h3>
|
| 4782 |
<p>Having a clean <em>external</em> API allows us to work on the <a href="#code-is-product">true inner workings of transformers</a>. One of a few recent additions was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code>, which dramatically improved loading times by pre-allocating GPU memory to avoid malloc bottlenecks during model loading. It can achieve a 7x speedup factor for an 8B model, or 6x for a 32B one, as you can check in <a href="https://github.com/huggingface/transformers/pull/36380">the PR</a>!</p>
|
| 4783 |
-
<figure class="html-embed"><div class="html-embed__card"><div id="frag-
|
| 4784 |
/* 1) Scope tokens to the widget */
|
| 4785 |
.warmup-demo{
|
| 4786 |
--page-bg:#ffffff;
|
|
@@ -5091,7 +5096,7 @@ return Plotly;
|
|
| 5091 |
<span class="line"><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">-H </span><span style="--shiki-light:#032F62;--shiki-dark:#9ECBFF">"Content-Type: application/json"</span><span style="--shiki-light:#005CC5;--shiki-dark:#79B8FF"> \</span></span>
|
| 5092 |
<span class="line"><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">-d </span><span style="--shiki-light:#032F62;--shiki-dark:#9ECBFF">'{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'</span></span>
|
| 5093 |
<span class="line"></span></code></pre></div>
|
| 5094 |
-
<p><code>transformers-serve</code> uses continuous batching (see <a href="https://github.com/huggingface/transformers/pull/38085">this PR</a> and also <a href="https://github.com/huggingface/transformers/pull/40426">this one</a>) for better GPU utilization, and is very much linked to the great work of vLLM with the <code>paged attention kernel</code> – a
|
| 5095 |
<p><code>transformers-serve</code> is not meant for user-facing production services (tools like vLLM or SGLang are super optimized for that), but it’s useful for several use cases:</p>
|
| 5096 |
<ul>
|
| 5097 |
<li>Quickly verify that your model is compatible with continuous batching and paged attention.</li>
|
|
@@ -5099,18 +5104,17 @@ return Plotly;
|
|
| 5099 |
<li>Run evaluations efficiently, again without having to spend a lot of time engineering your infrastructure.</li>
|
| 5100 |
</ul>
|
| 5101 |
<p>For model deployment, check <a href="https://huggingface.co/docs/inference-providers/en/index">Inference Providers</a> or roll your solution using any of the excellent serving libraries.</p>
|
| 5102 |
-
<div class="crumbs"><p>OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable
|
| 5103 |
-
<strong>Next:</strong> reuse across vLLM/SGLang relies on the same consistency.</p></div>
|
| 5104 |
<h2 id="community-reusability"><a href="#community-reusability">Community reusability</a></h2>
|
| 5105 |
<p>The transformers-serve CLI is built on transformers, for sure, but the library is made first and foremost to be <em>reused</em> at large by the open-source ecosystem.</p>
|
| 5106 |
<p>Adding a model to transformers means:</p>
|
| 5107 |
<ul>
|
| 5108 |
<li>having it immediately available to the community</li>
|
| 5109 |
-
<li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In
|
|
|
|
| 5110 |
</ul>
|
| 5111 |
-
<p>This cements the need
|
| 5112 |
-
<div class="crumbs"><p>Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical
|
| 5113 |
-
<strong>Next:</strong> what changes in v5 without breaking the promise of visible semantics.</p></div>
|
| 5114 |
<h2 id="what-is-coming-next"><a href="#what-is-coming-next">What is coming next</a></h2>
|
| 5115 |
<p>The next major version of <code>transformers</code> is just around the corner (and will have another blog post to its name when it comes out). When v5 is released, we aim to keep <a href="#backwards-compatibility">backwards compatibility</a> as solid as possible. The changes we make now are in service of that goal.</p>
|
| 5116 |
<p>We will lean further into a modular toolbox, not a framework. You should not be forced to rewrite modeling code. It’s better when a model can inherit from <code>PreTrainedModel</code> and opt into Tensor Parallel, <code>from_pretrained</code>, sharding, <code>push_to_hub</code>, loss plumbing, and external stacks like PEFT/TRL/SGLang/vLLM.</p> </main> </section> <footer class="footer"> <div class="footer-inner"> <section class="citation-block"> <h3>Citation</h3> <p>For attribution, cite this work as</p> <pre class="citation short">Pablo Montalvo, Lysandre Debut, Pedro Cuenca, Yoni Gozlan (2025). "Maintain the unmaintainable: 1M python loc, 400+ models".</pre> <p>BibTeX citation</p> <pre class="citation long">@misc{montalvo2025_maintain_the_unmaintaina,
|
|
|
|
| 12 |
document.documentElement.setAttribute("data-theme", theme);
|
| 13 |
} catch {}
|
| 14 |
})();
|
| 15 |
+
</script><script type="module" src="/scripts/color-palettes.js"></script><!-- TO MANAGE PROPERLY --><script src="https://cdn.plot.ly/plotly-3.0.0.min.js" charset="utf-8"></script><link rel="stylesheet" href="/_astro/index.BxklM4ay.css"><script type="module" src="/_astro/hoisted.DK-CdsVg.js"></script>
|
| 16 |
+
<script type="module" src="/_astro/page.CH0W_C1Z.js"></script></head> <body> <button id="theme-toggle" aria-label="Toggle color theme" data-astro-cid-x3pjskd3> <svg class="icon light" width="20" height="20" viewBox="0 0 24 24" aria-hidden="true" focusable="false" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" data-astro-cid-x3pjskd3> <circle cx="12" cy="12" r="5" data-astro-cid-x3pjskd3></circle> <line x1="12" y1="1" x2="12" y2="4" data-astro-cid-x3pjskd3></line> <line x1="12" y1="20" x2="12" y2="23" data-astro-cid-x3pjskd3></line> <line x1="1" y1="12" x2="4" y2="12" data-astro-cid-x3pjskd3></line> <line x1="20" y1="12" x2="23" y2="12" data-astro-cid-x3pjskd3></line> <line x1="4.22" y1="4.22" x2="6.34" y2="6.34" data-astro-cid-x3pjskd3></line> <line x1="17.66" y1="17.66" x2="19.78" y2="19.78" data-astro-cid-x3pjskd3></line> <line x1="4.22" y1="19.78" x2="6.34" y2="17.66" data-astro-cid-x3pjskd3></line> <line x1="17.66" y1="6.34" x2="19.78" y2="4.22" data-astro-cid-x3pjskd3></line> </svg> <svg class="icon dark" width="20" height="20" viewBox="0 0 24 24" aria-hidden="true" focusable="false" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" data-astro-cid-x3pjskd3> <path d="M21 12.79A9 9 0 1 1 11.21 3 7 7 0 0 0 21 12.79z" data-astro-cid-x3pjskd3></path> </svg> </button> <section class="hero" data-astro-cid-bbe6dxrz> <h1 class="hero-title" data-astro-cid-bbe6dxrz>Maintain the unmaintainable:<br/>1M python loc, 400+ models</h1> <div class="hero-banner" data-astro-cid-bbe6dxrz> <figure class="html-embed"><div class="html-embed__card is-frameless"><div id="frag-v8r960imx1"><style>
|
| 17 |
@import url('https://fonts.googleapis.com/css2?family=Inter:wght@500;600&display=swap');
|
| 18 |
|
| 19 |
.banner-container {
|
|
|
|
| 425 |
<h2 id="the-core-tenets-of-transformers"><a href="#the-core-tenets-of-transformers">The core tenets of transformers</a></h2>
|
| 426 |
<p>We summarize the foundations on which we’ve built everything, and write the “tenets” of the library. They behave like <em>software interfaces</em>, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.</p>
|
| 427 |
<p>These principles were not decided in a vacuum. The library <em>evolved</em> towards them, and once they <em>emerged</em>, they were recognized as critical.</p>
|
| 428 |
+
<div class="tenet-list"><ol><li class="tenet"><a id="source-of-truth"></a><strong>Source of Truth</strong><p>We aim to be a <a href="https://huggingface.co/blog/transformers-model-definition">source of truth for all model definitions</a>. This is more of a goal than a tenet, but it strongly guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original implementations. If we are successful, they should become reference baselines for the ecosystem, so they’ll be easily adopted by downstream libraries and projects. It’s much easier for a project to always refer to the transformers implementation, than to learn a different research codebase every time a new architecture is released.</p><em>This overarching guideline ensures quality and reproducibility across all models in the library, and aspires to make the community work easier.</em></li><li class="tenet"><a id="one-model-one-file"></a><strong>One Model, One File</strong><p>All inference and training core logic has to be visible, top‑to‑bottom, to maximize each model’s hackability.</p><em>Every model should be completely understandable and hackable by reading a single file from top to bottom.</em></li><li class="tenet"><a id="code-is-product"></a><strong>Code is Product</strong><p>Optimize for reading, diffing, and tweaking, our users are power users. Variables should be explicit, full words, even several words, readability is primordial.</p><em>Code quality matters as much as functionality - optimize for human readers, not just computers.</em></li><li class="tenet"><a id="standardize-dont-abstract"></a><strong>Standardize, Don’t Abstract</strong><p>If it’s model behavior, keep it in the file; use abstractions only for generic infra.</p><em>Model-specific logic belongs in the model file, not hidden behind abstractions.</em></li><li class="tenet"><a id="do-repeat-yourself"></a><strong>DRY* (DO Repeat Yourself)</strong><p>Copy when it helps users; keep successors in sync without centralizing behavior.</p><p><strong>Evolution:</strong> With the introduction and global adoption of <a href="#modular">modular</a> transformers, we do not repeat any logic in the modular files, but end user files remain faithful to the original tenet.</p><em>Strategic duplication can improve readability and maintainability when done thoughtfully.</em></li><li class="tenet"><a id="minimal-user-api"></a><strong>Minimal User API</strong><p>Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths. Reading should be obvious, configurations should be obvious.</p><em>Keep the public interface simple and predictable, users should know what to expect.</em></li><li class="tenet"><a id="backwards-compatibility"></a><strong>Backwards Compatibility</strong><p>Evolve by additive standardization, never break public APIs.</p><p>Any artifact that was once on the hub and loadable with transformers should be usable indefinitely with the same interface. Further, public methods should not change to avoid breaking dependencies. If we do deprecate something, it’s with very long cycles beforehand.</p><em>Once something is public, it stays public, evolution through addition, not breaking changes.</em></li><li class="tenet"><a id="consistent-public-surface"></a><strong>Consistent Public Surface</strong><p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. 
This is a goal as well as a tenet.</p><em>All models should feel familiar - consistent interfaces reduce cognitive load.</em></li></ol></div>
|
| 429 |
<p>When a PR is merged, it is because the contribution is worthwhile, and because the <code>transformers</code> team finds the design of the contribution to be aligned with the tenets.</p>
|
| 430 |
<p>Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors, crannies everywhere, built by thousands of different workers. We <em>try</em> to make it so all the code added is compliant, because if we fail and merge it, we cannot change it lest we break <a href="#backwards-compatibility">backwards compatibility</a>.</p>
|
| 431 |
<p>To see what constitutes adherence to the tenets, let’s take the example of code repetition.</p>
|
|
|
|
| 445 |
<h2 id="-modular-transformers"><a href="#-modular-transformers"><a id="modular"></a> Modular transformers</a></h2>
|
| 446 |
<p>Transformers is an opinionated library. The previous <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page, and the <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post</a> were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. <a href="https://huggingface.co/docs/transformers/en/modular_transformers"><code>modular</code> transformers was introduced</a> to allow a form of inheritance without breaking <a href="#one-model-one-file">One model, One file</a>.</p>
|
| 447 |
<p>We amended the principle of <a href="#do-repeat-yourself">DRY*</a> by progressively removing all pieces of code that were “copied from” another file.</p>
|
| 448 |
+
<p>It works as follows. In order to contribute a model, <code>GLM</code> for instance, we define a <code>modular_</code> file that can inherit from <em>any function across all other modeling, configuration and processor files</em> already existing in the library.
|
| 449 |
The modular file can use inheritance across models, and it is then unravelled into a fully functional modeling file.</p>
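To make the mechanism concrete, here is a minimal sketch of what such a modular shard can look like (a rough illustration with invented class and file names, not the actual GLM shard); the auto-generation step then expands it into a fully self-contained modeling file:

```python
# modular_mymodel.py -- schematic example of a modular shard (illustrative names).
# Only the deltas vs. existing models live here; the expansion step produces a
# complete, standalone modeling_mymodel.py.
from transformers.models.llama.modeling_llama import LlamaAttention, LlamaMLP


class MyModelMLP(LlamaMLP):
    # reused wholesale from Llama: nothing to override
    pass


class MyModelAttention(LlamaAttention):
    # a real shard would only spell out what differs from the parent class
    pass
```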
|
| 450 |
<summary id="generated-modeling">Auto-generated modeling code</summary>
|
| 451 |
+
<figure class="html-embed"><div class="html-embed__card"><div id="frag-k4pqlm0xlw"><div class="code-compare" style="display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.5rem 0;">
|
| 452 |
<div class="code-column" style="border: 1px solid #e2e8f0; border-radius: 8px; overflow: hidden;">
|
| 453 |
<div class="code-header" style="background: #f8f9fa; padding: 0.75rem 1rem; font-weight: 600; color: #495057; border-bottom: 1px solid #e2e8f0;">
|
| 454 |
modular_glm.py
|
|
|
|
| 612 |
<p>That gives an “effective LOC” curve: the 𝗺𝗮𝗶𝗻𝘁𝗲𝗻𝗮𝗻𝗰𝗲 𝘀𝘂𝗿𝗳𝗮𝗰𝗲.</p>
|
| 613 |
<p>Measured on git history, raw <code>modeling_*.py</code> grew at ~362 LOC/day before modular; counting only modular shards yields ~25 LOC/day after — about <strong>15× lower</strong>. The effective curve (blue line below) represents the <strong>maintenance surface</strong> today: what maintainers actually read and review.</p>
|
| 614 |
<p>Less code to hand-maintain means fewer places to break. Naturally, LOC is not a direct measure of complexity, but it correlates with review effort and change risk.</p>
|
| 615 |
+
<figure class="html-embed"><div class="html-embed__card"><div id="frag-8j1eajribg5"><iframe
|
| 616 |
src="https://molbap-loc-1.hf.space"
|
| 617 |
style="width:100%; height:900px; border:0"
|
| 618 |
allow="clipboard-read; clipboard-write; fullscreen"
|
|
|
|
| 624 |
<p>We recently undertook a deep refactor of the attention implementation. You’ve likely heard about <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a> and its several variants.</p>
|
| 625 |
<p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
|
| 626 |
<p>However, we were adding specific torch operations for each backend (sdpa, the several flash-attention iterations, flex attention), but it wasn’t a <a href="#minimal-user-api">minimal user api</a>. The next section explains what we did.</p>
|
| 627 |
+
<div class="crumbs"><p>Evidence: effective (i.e., maintainable) LOC growth drops ~15× when counting shards instead of expanded modeling files. Less code to read, fewer places to break.</p><p><strong>Next:</strong> how the attention interface stays standard without hiding semantics.</p></div>
|
| 628 |
<h3 id="-external-attention-classes"><a href="#-external-attention-classes"><a id="attention-classes"></a> External Attention classes</a></h3>
|
| 629 |
<p>The solution for the “attention abstraction problem” was to move to a standard <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allows the following:</p>
|
| 630 |
<p>The naive implementation of attention, called “eager”, is available by default. We use a <code>Callable</code> called <code>eager_attention_forward</code>, which can run as long as the user has PyTorch installed – which is a requirement anyway.</p>
|
|
|
|
| 634 |
<span class="line"><span style="--shiki-light:#D73A49;--shiki-dark:#F97583">if</span><span style="--shiki-light:#005CC5;--shiki-dark:#79B8FF"> self</span><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">.config._attn_implementation </span><span style="--shiki-light:#D73A49;--shiki-dark:#F97583">!=</span><span style="--shiki-light:#032F62;--shiki-dark:#9ECBFF"> "eager"</span><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">:</span></span>
|
| 635 |
<span class="line"><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8"> attention_interface </span><span style="--shiki-light:#D73A49;--shiki-dark:#F97583">=</span><span style="--shiki-light:#005CC5;--shiki-dark:#79B8FF"> ALL_ATTENTION_FUNCTIONS</span><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">[</span><span style="--shiki-light:#005CC5;--shiki-dark:#79B8FF">self</span><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">.config._attn_implementation]</span></span>
|
| 636 |
<span class="line"></span></code></pre></div>
|
| 637 |
+
<p>Having the attention interfaces functionalized also allows dynamic switching of attention implementations, increasing their <a href="#code-is-product">hackability</a>.
|
| 638 |
+
Another strength of the new attention interface is that it can enforce specific kwargs, which are needed by kernel providers and other dependencies.</p>
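As a rough, self-contained sketch of the dispatch pattern (not the actual transformers registry): each backend is a plain callable with the same signature, so switching implementations is a dictionary lookup.

```python
from typing import Callable, Dict

import torch
import torch.nn.functional as F


def eager_attention(q, k, v, mask=None, **kwargs):
    # naive attention: softmax(QK^T / sqrt(d)) V
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    if mask is not None:
        scores = scores + mask
    return F.softmax(scores, dim=-1) @ v


def sdpa_attention(q, k, v, mask=None, **kwargs):
    # defer to PyTorch's fused scaled_dot_product_attention
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)


ATTENTION_FUNCTIONS: Dict[str, Callable] = {
    "eager": eager_attention,
    "sdpa": sdpa_attention,
}

q = k = v = torch.randn(1, 8, 16, 64)       # (batch, heads, seq, head_dim)
out = ATTENTION_FUNCTIONS["sdpa"](q, k, v)  # swap "sdpa" for "eager" at will
```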
|
| 639 |
<p>Backend integrations sometimes require specific kwargs.</p>
|
| 640 |
<p>We know that kwargs are often a necessary evil that plagues tools with widespread compatibility; it is something we have aimed to reduce, and will continue to reduce, in order to improve readability - even with them, the current system is a <a href="#minimal-user-api">minimal user api</a>.</p>
|
| 641 |
<p>We reduce that surface and document expectations; where flexibility is necessary, we plan to use <code>typing.Annotated</code> to convey shapes and invariants without constraining integrations. Such an implementation could look like this in the future:</p>
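As a purely hypothetical sketch of that direction (type aliases, names, and shapes invented for illustration), an <code>Annotated</code>-based signature might read:

```python
from typing import Annotated, Optional

import torch

# hypothetical aliases: the string metadata documents the expected shape/invariant
AttentionMask = Annotated[torch.Tensor, "(batch, 1, q_len, kv_len), additive mask"]
CuSeqLens = Annotated[torch.Tensor, "(batch + 1,), int32 cumulative lengths for varlen kernels"]


def flash_attention_forward(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[AttentionMask] = None,
    cu_seqlens: Optional[CuSeqLens] = None,
    **kwargs,
) -> torch.Tensor:
    ...  # backend-specific implementation would go here
```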
|
|
|
|
| 651 |
<p>Because we want to avoid code modifications that are unrelated to the model.</p>
|
| 652 |
<p>We choose to place the level of abstraction higher than the device placement: a matrix multiplication - an <code>nn.Linear</code> layer - should always be expressed in the same way, regardless of how it is placed.</p>
|
| 653 |
<p>Hence, we want to touch the modeling code <a href="#minimal-user-api">minimally</a>, and only modify it when <em>architectural changes</em> are involved – not depending on the way you run it. For tensor parallelism, we simply specify a <code>tp_plan</code>:</p>
|
| 654 |
+
<figure class="html-embed"><div class="html-embed__card"><div id="frag-6k929b8i4kg"><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
|
| 655 |
base_model_tp_plan = {
|
| 656 |
"layers.*.self_attn.q_proj": "colwise",
|
| 657 |
"layers.*.self_attn.k_proj": "colwise",
|
|
|
|
| 723 |
<p>So what do we see?</p>
|
| 724 |
<p>(Graph reading guide: nodes are models; edges are modular imports).</p>
|
| 725 |
<p>Check out the <a href="https://huggingface.co/spaces/Molbap/transformers-modular-refactor">full viewer here</a> (tab “dependency graph”, hit “build graph”) for better manipulation and exploration.</p>
|
| 726 |
+
<figure class="html-embed"><div class="html-embed__card"><div id="frag-mrjpwa984"><iframe
|
| 727 |
src="https://molbap-dependencies-1.hf.space"
|
| 728 |
style="width:100%; height:680px; border:0"
|
| 729 |
allow="clipboard-read; clipboard-write; fullscreen"
|
| 730 |
referrerpolicy="no-referrer-when-downgrade"
|
| 731 |
></iframe></div></div></figure>
|
| 732 |
+
<p>Let’s walk through some sections of this graph together.
|
| 733 |
+
First, Llama is a basis and an influence for many models, and it is very visible.</p>
|
| 734 |
<p><img src="/images/transformers/llama_center.png" alt="Llama in the center"/></p>
|
| 735 |
+
<p>The linked models sometimes pull components from models other than <code>llama</code>, of course. Radically different architectures such as mamba have spawned their own dependency subgraphs.</p>
|
| 736 |
+
<p>Audio models form sparser archipelagos; see for instance wav2vec2, which is a significant basis for a dozen of them.</p>
|
| 737 |
<p><img src="/images/transformers/cluster_wave2vec2.png" alt="Wav2vec2 influence"/></p>
|
| 738 |
+
<p>In the case of VLMs, which have massively grown in popularity since 2024, there are far too many vision-based architectures that are not yet defined as modulars of other existing archs. In other words, there is no strong software reference point for vision models.</p>
|
| 739 |
+
<p>As you can see, there is a small <code>DETR</code> island:
|
|
|
|
| 740 |
<img src="/images/transformers/detr_island.png" alt="DETR archipelago"/></p>
|
| 741 |
<p>There is also a little llava pocket, and so on, but it’s not comparable to the centrality observed for llama.</p>
|
| 742 |
<p>Another problem is that this visualization only shows <code>modular</code> models. Several models still do NOT have a modular file. If we zoom out significantly, we can see them: the red nodes are models that do not have a modular file yet.</p>
|
|
|
|
| 746 |
<strong>Next:</strong> timeline + similarity signals to spot modularisable candidates.</p></div>
|
| 747 |
<h3 id="many-models-but-not-enough-yet-are-alike"><a href="#many-models-but-not-enough-yet-are-alike">Many models, but not enough yet, are alike</a></h3>
|
| 748 |
<p>I looked into Jaccard similarity, which we use to measure set differences, to find similarities across models. I know that code is more than a set of characters strung together. We also tried code-embedding models that ranked candidates better in practice, but for this post we stick to the deterministic Jaccard index.</p>
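For reference, a minimal sketch of such a comparison (file paths illustrative): treat each modeling file as a set of identifiers and score the overlap.

```python
import re
from pathlib import Path


def token_set(path: str) -> set:
    # crude tokenization: every identifier-like word in the file
    return set(re.findall(r"[A-Za-z_]\w+", Path(path).read_text()))


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0


# illustrative paths: compare a candidate against a potential modular parent
score = jaccard(token_set("modeling_llava_next_video.py"), token_set("modeling_llava.py"))
print(f"Jaccard similarity: {score:.2f}")
```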
|
| 749 |
+
<p>It is interesting, for our comparison, to look at <em>when</em> we deployed the modular logic and what its ripple effect on the library was. Looking at the timeline makes it obvious: adding modular allowed us to connect more and more models to solid reference points.</p>
|
| 750 |
<p>Yet, we still have a lot of gaps to fill.</p>
|
| 751 |
<p>Zoom out below - it’s full of models. You can click on a node to see its connections better, or use the text box to search for a model. You can use the <a href="https://huggingface.co/spaces/Molbap/transformers-modular-refactor">full viewer</a> (tab “timeline”, hit “build timeline”) for better exploration.</p>
|
| 752 |
+
<figure class="html-embed"><div class="html-embed__card"><div id="frag-5npq3tnzu9b"> <iframe
|
| 753 |
src="https://molbap-timeline-1.hf.space"
|
| 754 |
style="width:100%; height:680px; border:0"
|
| 755 |
allow="clipboard-read; clipboard-write; fullscreen"
|
| 756 |
referrerpolicy="no-referrer-when-downgrade"
|
| 757 |
></iframe></div></div></figure>
|
| 758 |
<p>Let’s look at a few highly connected models. Let’s start with the foundational work of <a href="https://arxiv.org/abs/2304.08485">Llava</a>.</p>
|
| 759 |
+
<p><img src="/images/transformers/timeline_llava.png" alt="Llava in its timeline"/></p>
|
| 760 |
<p>You see that <code>llava_video</code> is a red node, connected by a red edge to <code>llava</code>: it’s a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
|
| 761 |
+
<p>The same can be identified with the classical encoders family, centered on <code>BERT</code>:</p>
|
| 762 |
+
<p>Here <code>roberta</code>, <code>xlm_roberta</code>, <code>ernie</code> are <code>modular</code>s of BERT, while models like <code>mobilebert</code> are likely candidates.
|
| 763 |
+
<img src="/images/transformers/classic_encoders.png" alt="Classical encoders"/></p>
|
| 764 |
+
<div class="crumbs"><p>Similarity metrics (Jaccard index or embeddings) surfaces likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., <code>llava_video</code> → <code>llava</code>) for refactors that preserve behavior.</p><p><strong>Next:</strong> concrete VLM choices that avoid leaky abstractions.</p></div>
|
| 765 |
<h3 id="vlm-improvements-avoiding-abstraction"><a href="#vlm-improvements-avoiding-abstraction">VLM improvements, avoiding abstraction</a></h3>
|
| 766 |
<p>We don’t yet have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attention bridges). This is one of the main areas where we can improve.</p>
|
| 767 |
<p>For instance, we thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like this:</p>
|
|
|
|
| 816 |
<span class="line"><span style="--shiki-light:#D73A49;--shiki-dark:#F97583"> return</span><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8"> special_image_mask, special_video_mask</span></span>
|
| 817 |
<span class="line"></span></code></pre></div>
|
| 818 |
<p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because it’d break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
|
| 819 |
+
<p>What do we conclude? Going forward, we should aim for VLMs to have a form of centrality similar to that of <code>Llama</code> for text-only models. This centrality should not be achieved at the cost of abstracting and hiding away crucial inner workings of said models.</p>
|
| 820 |
<div class="crumbs"><p>Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don’t migrate behavior to <code>PreTrainedModel</code>.
|
| 821 |
<strong>Next:</strong> pipeline-level wins that came from PyTorch-first choices (fast processors).</p></div>
|
| 822 |
<h3 id="on-image-processing-and-processors"><a href="#on-image-processing-and-processors">On image processing and processors</a></h3>
|
| 823 |
+
<p>Deciding to become a <code>torch</code>-first library meant shedding a tremendous amount of support for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more lenient about the torch-dependent utilities we were able to accept. One of these is the <em>fast processing</em> of images. Where inputs were once minimally assumed to be ndarrays, enforcing native <code>torch</code> and <code>torchvision</code> inputs allowed us to massively improve processing speed for each model.</p>
|
| 824 |
+
<p>The performance gains are immense: up to a 20x speedup for most models when using compiled torchvision ops. Furthermore, it lets us run the whole pipeline solely on the GPU.</p>
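A rough sketch of the idea (illustrative values, not a specific transformers processor): once torch/torchvision can be assumed, preprocessing becomes batched tensor ops that can stay on the GPU end to end.

```python
import torch
from torchvision.transforms import v2

preprocess = v2.Compose([
    v2.Resize((224, 224), antialias=True),
    v2.ToDtype(torch.float32, scale=True),      # uint8 [0, 255] -> float [0, 1]
    v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

images = torch.randint(0, 256, (8, 3, 512, 512), dtype=torch.uint8)  # fake batch
if torch.cuda.is_available():
    images = images.cuda()
pixel_values = preprocess(images)               # stays on the GPU if one is available
```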
|
| 825 |
<p><img src="/images/transformers/fast_image_processors.png" alt="Fast Image Processors Performance"/>
|
| 826 |
<p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p></p>
|
| 827 |
<div class="crumbs"><p>PyTorch-first lets processors assume torch/torchvision and run the whole pipeline on GPU; big per-model speedups.</p><p><strong>Next:</strong> how this lowers friction for contributors and downstream users.</p></div>
|
|
|
|
| 829 |
<p>This is an overall objective: there’s no <code>transformers</code> without its community.</p>
|
| 830 |
<p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
|
| 831 |
<p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. Very recently, <a href="https://huggingface.co/blog/welcome-openai-gpt-oss">OpenAI added GPT-OSS</a>, which prompted the addition of many new features to the library in order to support <a href="https://huggingface.co/openai/gpt-oss-120b">their model</a>.</p>
|
| 832 |
+
<p>These additions are immediately available for other models to use.</p>
|
| 833 |
+
<p>Another important advantage is the ability to fine-tune and pipeline these models into many other libraries and tools. Check here on the hub how many finetunes are registered for <a href="https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b">gpt-oss 120b</a>, despite its size!</p>
|
| 834 |
+
<div class="crumbs"><p>The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest.</p><p><strong>Next:</strong> power tools enabled by a consistent API.</p></div>
|
| 835 |
<h3 id="-models-popularity"><a href="#-models-popularity"><a id="encoders-ftw"></a> Models popularity</a></h3>
|
| 836 |
+
<p>Talking about dependencies, we can take a look at the number of downloads as a measure of popularity. One thing we see is the prominence of encoders, despite the apparent prevalence of decoder LLMs. The reason is that encoders are used to generate embeddings, which have multiple downstream uses. Just check out <a href="https://huggingface.co/blog/embeddinggemma">EmbeddingGemma</a> for a modern recap. Hence, it is vital to keep the encoders portion of the library viable, usable, fine-tunable.</p>
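For illustration (model name arbitrary), the typical encoder-for-embeddings recipe is just a forward pass plus mean pooling:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # any encoder checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tok(["encoders still power most embedding pipelines"], return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state              # (batch, seq, dim)
mask = batch["attention_mask"].unsqueeze(-1)               # (batch, seq, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pooled sentence vectors
```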
|
| 837 |
+
<div><figure class="html-embed"><div class="html-embed__card"><div id="frag-a6lvmrfnb0h"><html>
|
| 838 |
<head><meta charset="utf-8" /></head>
|
| 839 |
<body>
|
| 840 |
<div> <script type="text/javascript">window.PlotlyConfig = {MathJaxConfig: 'local'};</script>
|
|
|
|
| 4727 |
<p>So, how do these design choices, these “tenets” influence development of models and overall usage of transformers?</p>
|
| 4728 |
<div class="crumbs"><p>Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS).</p><p><strong>Next:</strong> dev tools that leverage unified attention APIs and PyTorch-only internals.</p></div>
|
| 4729 |
<h2 id="a-surgical-toolbox-for-model-development"><a href="#a-surgical-toolbox-for-model-development">A surgical toolbox for model development</a></h2>
|
| 4730 |
+
<p>Transformers provides many tools that can help you add a new architecture and understand the inner workings of a model, as well as of the library itself.</p>
|
| 4731 |
<h3 id="attention-visualisation"><a href="#attention-visualisation">Attention visualisation</a></h3>
|
| 4732 |
<p>All models have the same API for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>.</p>
|
| 4733 |
<p>This uniformity allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
|
| 4734 |
<p>One particular piece of machinery is the <code>attention mask</code>. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual “causal-only” models.</p>
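A tiny sketch of that pattern (purely illustrative, not the actual PaliGemma mask-building code): prefix tokens attend bidirectionally, while generated tokens stay causal.

```python
import torch


def prefix_lm_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))  # causal base
    mask[:prefix_len, :prefix_len] = True   # the prefix (image + text) is fully visible
    return mask


print(prefix_lm_mask(prefix_len=3, total_len=6).int())
```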
|
| 4735 |
+
<figure class="html-embed"><div class="html-embed__card"><div id="frag-5b30rmkobeu"><!-- Minimal HTML fragment: terminal-style ASCII attention masks -->
|
| 4736 |
<div style="max-width: 940px; margin: 16px 0; border:1px solid #2a2f3a; border-radius:8px; background:#0b0f19; font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, 'Liberation Mono', 'Courier New', monospace; color:#e5e7eb;">
|
| 4737 |
<div style="display:flex; align-items:center; gap:8px; padding:8px 10px; border-bottom:1px solid #1f2430; background:#111827; border-top-left-radius:8px; border-top-right-radius:8px;">
|
| 4738 |
<span style="width:10px; height:10px; background:#ef4444; border-radius:50%; display:inline-block;"></span>
|
|
|
|
| 4782 |
<p>Because everything is PyTorch, we can easily <a href="https://huggingface.co/docs/transformers/internal/model_debugging_utils">debug any model</a> when we want to add it to transformers. We now have a power-user tool for porting or adding models, that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.</p>
|
| 4783 |
<p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, to match our <a href="#source-of-truth">Source of Truth guideline</a>.</p>
|
| 4784 |
<p><img src="/images/transformers/model_debugger.png" alt="Model debugger interface"/></p>
|
| 4785 |
+
<div class="crumbs"><p>Forward interception and nested JSON logging align ports to reference implementations, reinforcing “Source of Truth.”</p><p><strong>Next:</strong> CUDA warmup reduces load-time without touching modeling semantics.</p></div>
|
| 4786 |
<h3 id="cooking-faster-cuda-warmups"><a href="#cooking-faster-cuda-warmups">Cooking faster CUDA warmups</a></h3>
|
| 4787 |
<p>Having a clean <em>external</em> API allows us to work on the <a href="#code-is-product">true inner workings of transformers</a>. One of a few recent additions was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code>, which dramatically improved loading times by pre-allocating GPU memory to avoid malloc bottlenecks during model loading. It can achieve a 7x speedup factor for an 8B model, or 6x for a 32B one, as you can check in <a href="https://github.com/huggingface/transformers/pull/36380">the PR</a>!</p>
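The underlying trick can be sketched in a few lines (a simplification under assumptions, not the actual <code>caching_allocator_warmup</code> code): reserve one large block up front so PyTorch's caching allocator serves later allocations from its pool instead of hitting cudaMalloc repeatedly during weight loading.

```python
import torch


def warmup_cuda_allocator(num_bytes: int, device: str = "cuda:0") -> None:
    if not torch.cuda.is_available():
        return
    block = torch.empty(num_bytes, dtype=torch.uint8, device=device)
    del block  # the memory stays cached in the allocator's pool for reuse


# e.g. reserve roughly the model's footprint before loading the checkpoint
warmup_cuda_allocator(8 * 1024**3)
```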
|
| 4788 |
+
<figure class="html-embed"><div class="html-embed__card"><div id="frag-2p0fwkmk3xw"><style>
|
| 4789 |
/* 1) Scope tokens to the widget */
|
| 4790 |
.warmup-demo{
|
| 4791 |
--page-bg:#ffffff;
|
|
|
|
| 5096 |
<span class="line"><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">-H </span><span style="--shiki-light:#032F62;--shiki-dark:#9ECBFF">"Content-Type: application/json"</span><span style="--shiki-light:#005CC5;--shiki-dark:#79B8FF"> \</span></span>
|
| 5097 |
<span class="line"><span style="--shiki-light:#24292E;--shiki-dark:#E1E4E8">-d </span><span style="--shiki-light:#032F62;--shiki-dark:#9ECBFF">'{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'</span></span>
|
| 5098 |
<span class="line"></span></code></pre></div>
|
| 5099 |
+
<p><code>transformers-serve</code> uses continuous batching (see <a href="https://github.com/huggingface/transformers/pull/38085">this PR</a> and also <a href="https://github.com/huggingface/transformers/pull/40426">this one</a>) for better GPU utilization, and is very much linked to the great work of vLLM with the <code>paged attention kernel</code> – a further justification of <a href="#community-kernels">external kernels</a>.</p>
|
| 5100 |
<p><code>transformers-serve</code> is not meant for user-facing production services (tools like vLLM or SGLang are super optimized for that), but it’s useful for several use cases:</p>
|
| 5101 |
<ul>
|
| 5102 |
<li>Quickly verify that your model is compatible with continuous batching and paged attention.</li>
|
|
|
|
| 5104 |
<li>Run evaluations efficiently, again without having to spend a lot of time engineering your infrastructure.</li>
|
| 5105 |
</ul>
|
| 5106 |
<p>For model deployment, check <a href="https://huggingface.co/docs/inference-providers/en/index">Inference Providers</a> or roll your solution using any of the excellent serving libraries.</p>
|
| 5107 |
+
<div class="crumbs"><p>OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable.</p><p><strong>Next:</strong> reuse across vLLM/SGLang relies on the same consistency.</p></div>
|
|
|
|
| 5108 |
<h2 id="community-reusability"><a href="#community-reusability">Community reusability</a></h2>
|
| 5109 |
<p>The transformers-serve CLI is built on transformers, for sure, but the library is made first and foremost to be <em>reused</em> at large by the open-source ecosystem.</p>
|
| 5110 |
<p>Adding a model to transformers means:</p>
|
| 5111 |
<ul>
|
| 5112 |
<li>having it immediately available to the community</li>
|
| 5113 |
+
<li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In the case of vLLM, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of <em>existing</em> transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great vLLM x HF blog post.</a></li>
|
| 5114 |
+
<li>being the reference code for implementations in MLX, llama.cpp and other libraries.</li>
|
| 5115 |
</ul>
|
| 5116 |
+
<p>This further cements the need for a <a href="#consistent-public-surface">consistent public surface</a>: we are a backend and a reference, and there is plenty of other software to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast); check <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a>, for instance.</p>
|
| 5117 |
+
<div class="crumbs"><p>Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical.</p><p><strong>Next:</strong> what changes in v5 without breaking the promise of visible semantics.</p></div>
|
|
|
|
| 5118 |
<h2 id="what-is-coming-next"><a href="#what-is-coming-next">What is coming next</a></h2>
|
| 5119 |
<p>The next major version of <code>transformers</code> is just around the corner (and will have another blog post to its name when it comes out). When v5 is released, we aim to keep <a href="#backwards-compatibility">backwards compatibility</a> as solid as possible. The changes we make now are in service of that goal.</p>
|
| 5120 |
<p>We will lean further into a modular toolbox, not a framework. You should not be forced to rewrite modeling code. It’s better when a model can inherit from <code>PreTrainedModel</code> and opt into Tensor Parallel, <code>from_pretrained</code>, sharding, <code>push_to_hub</code>, loss plumbing, and external stacks like PEFT/TRL/SGLang/vLLM.</p> </main> </section> <footer class="footer"> <div class="footer-inner"> <section class="citation-block"> <h3>Citation</h3> <p>For attribution, cite this work as</p> <pre class="citation short">Pablo Montalvo, Lysandre Debut, Pedro Cuenca, Yoni Gozlan (2025). "Maintain the unmaintainable: 1M python loc, 400+ models".</pre> <p>BibTeX citation</p> <pre class="citation long">@misc{montalvo2025_maintain_the_unmaintaina,
|
app/dist/index.html.gz
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:973b4220915eab6e03c01dfa087941db426463ec450fe0ad1a39a8f9b84380ae
|
| 3 |
+
size 1490378
|