Molbap HF Staff commited on
Commit
3a3c4d7
·
1 Parent(s): dfda82f
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: Scaling insanity
3
  emoji: 📚
4
  colorFrom: pink
5
  colorTo: indigo
 
1
  ---
2
+ title: Maintain the unmaintainable
3
  emoji: 📚
4
  colorFrom: pink
5
  colorTo: indigo
config/app.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
- "title": "Scaling Insanity",
3
- "subtitle": "maintaining hundreds of model definitions",
4
  "description": "A peek into software engineering for the transformers library",
5
- "fullTitle": "Scaling Insanity: maintaining hundreds of model definitions"
6
  }
 
1
  {
2
+ "title": "Maintain the unmaintainable",
3
+ "subtitle": "1M python loc, 400+ models",
4
  "description": "A peek into software engineering for the transformers library",
5
+ "fullTitle": "Maintain the unmaintainable: 1M python loc, 400+ models"
6
  }
content/article.md CHANGED
@@ -1,5 +1,40 @@
1
 
2
- # Introduction
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3
 
4
  One million lines of `python` code. Through them, the `transformers` library supports more than 400 model architectures, from state-of-the-art LLMs and VLMs to specialized models for audio, video, and tables.
5
 
@@ -16,7 +51,7 @@ We codify the "tenets" that guide our development, demonstrate how they are impl
16
  For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon `transformers`. And not only that: any project of comparable size will require you to make deep choices, not only about design and choice of abstraction, but about the very mindset of the software you are building.
17
 
18
 
19
- ### The core tenets of transformers
20
 
21
 
22
  We summarize the foundations on which we've built everything, and write the "tenets" of the library. They behave like _software interfaces_, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.
@@ -28,7 +63,7 @@ Note that the library _evolved_ towards these principles, and that they _emerged
28
  <li class="tenet">
29
  <a id="source-of-truth"></a>
30
  <strong>Source of Truth</strong>
31
- <p>We aim be a source of truth for all model definitions. This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
32
  <em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
33
  </li>
34
 
@@ -73,7 +108,7 @@ Note that the library _evolved_ towards these principles, and that they _emerged
73
  <li class="tenet">
74
  <a id="consistent-public-surface"></a>
75
  <strong>Consistent Public Surface</strong>
76
- <p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. This is a goalpost</p>
77
  <em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
78
  </li>
79
  </ol>
@@ -96,9 +131,9 @@ def rotate_half(x):
96
 
97
  You can use a simple regex to find all methods of a given name across your codebase and compare their differences and similarities; that's what I did (plus a hash to avoid quadratic comparisons).
98
 
99
- All manual transmission cars have a clutch, but we want each _view_ of one of our cars to be able to function. Remove the clutch, you can't drive. Remove the doors, might be uncomfortable but you'll get there. So doors can go, but you _have_ to keep the clutch, even though you know perfectly how it works. It is a core functionality.
100
 
101
- In the same way, we want all models to have a self-contained modeling code.
102
 
103
  This comes at a great cost. Enter the `# Copied from...` mechanism: for a long time, these comments indicated that some code was copied from another model, saving time both for the reviewers and for the CI. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet we could not remove them.
104
 
@@ -108,7 +143,6 @@ What was the solution to this?
108
 
109
  ## <a id="modular"></a> Modular transformers
110
 
111
-
112
  Transformers is an opinionated library. The previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the [blog post](https://huggingface.co/blog/transformers-design-philosophy) were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. [`modular` transformers were introduced](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file).
113
 
114
  We amended the principle of [DRY*](#do-repeat-yourself) by removing progressively all pieces of code that were "copied from" another file.
@@ -127,7 +161,9 @@ What is the consequence? When adding a model, we do not need to go over the enti
127
 
128
  When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling (right side) that is run, and all the tests are run on the modeling code.
129
 
130
- ## A maintainable control surface
 
 
131
 
132
  The effect of modular can be measured straight from git history: at every commit, we look under the model directory.
133
  If it only has a modeling file, we add its LOC count.
@@ -145,14 +181,23 @@ Cyclomatic complexity isn’t LOC, but they strongly correlate. As Les Hatton no
145
 
146
  There's a sharp drop near the end; it's due to us [removing support for Jax and TensorFlow](https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc) library-wide.
147
 
148
- Of course, it is not only this effort that allowed to reduce the maintenance load. Externalising the [attention classes](#external-attention-classes) has moved out a lot of repeated code that was [standard](#standardize-dont-abstract).
 
 
 
 
149
 
 
150
 
151
- ## <a id="attention-classes"></a> External Attention classes
152
 
153
- A chronological iteration over [modular](#modular), and a big improvement in terms of readabilty, was to remove the various attention-backend-specific attention classes across the repository. Before, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasn't a [minimal user api](#minimal-user-api).
154
 
155
- What will forever stay in the modeling code is the `eager_attention_forward` because it is a core part of the modeling,
 
 
 
 
156
 
157
  ```python
158
  attention_interface: Callable = eager_attention_forward
@@ -160,9 +205,7 @@ if self.config._attn_implementation != "eager":
160
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
161
  ```
162
 
163
- We often read and understand that `kwargs` are criticized, and we are typing them however we can, but we cannot enforce them all the time because other libraries such as vLLM don''t use the same kwargs.
164
-
165
- It is a strength of the new attention interface, where it can be plugged in various backends, because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system is a [minimal user api](#minimal-user-api).
166
 
167
  For better _information_, we plan to use `python` features such as `Annotated` to inform users of what we typically expect in an argument. That way, higher-level information could be included directly in the type annotations, like so (tentative design):
168
 
@@ -173,14 +216,23 @@ MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]
173
  ```
174
 
175
 
176
- ## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism
177
 
178
- # TODO ADD LINK TO EXTERNAL BLOG POST
179
- We want to touch minimally to the modeling code, and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we instead now specify a simple `tp_plan`.
180
 
181
- It is written once in the config and passed to `.from_pretrained()`.
182
 
183
- The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires to sharding implementations `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
 
 
 
 
 
 
 
 
 
 
 
184
 
185
  {{{fragment-tp-plan}}}
186
 
@@ -192,7 +244,7 @@ Which allows a user to run with multiple processes per node, e.g. 4 GPUs:
192
  Semantics stay in the model (a Linear stays a Linear); distribution is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks; "rowwise" splits rows; packed variants shard fused weights. The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
193
 
194
 
195
- ## <a id="layers-attentions-caches"></a> Layers, attentions and caches
196
 
197
  Following the same logic, the _nature_ of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we defined a mapping that can then be specified per layer in the configuration:
198
 
@@ -221,7 +273,7 @@ and the configuration can be _explicit_ about which attention type is in which l
221
 
222
  This is [minimal](#minimal-user-api) to implement on the user side, and allows us to keep the modeling untouched. It is also easy to tweak.
223
 
224
- ## <a id="community-kernels"></a>Community Kernels
225
 
226
  The same principle extends to normalization, activation, and other code paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a [consistent public surface](#consistent-public-surface):
227
 
@@ -235,7 +287,7 @@ Plus, this opened another angle of contribution for the community. People who ar
235
 
236
  Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
237
 
238
- ## The good modularity
239
 
240
  Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering, if this effort is made, and we're striving for it.
241
  It's hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
@@ -257,7 +309,7 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
257
 
258
  Another problem is that this analysis only covers `modular` models: several models do NOT have a modular file.
259
 
260
- ## Many models, but not enough yet, are alike
261
 
262
  So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters strung together, and I also used code embedding models to check code similarities, which yielded better results; still, for the needs of this blog post I will stick to the Jaccard index.
263
 
@@ -329,7 +381,7 @@ The following [Pull request to standardize placeholder masking](https://github.c
329
  return special_image_mask, special_video_mask
330
  ```
331
 
332
- But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the self-contained logic of the model.
333
 
334
 
335
  ### <a id="encoders-ftw"></a> Embedding models, now and forever.
@@ -344,38 +396,38 @@ As the codebase grows, with our friend codebase [Sentence Transformers](https://
344
 
345
  Choosing to be a `torch`-first software meant shedding a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal in the number of torch-dependent utilities we added. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
346
 
347
- The gains in performance are immense, up to 20x speed for most models when compiled torchvision ops.
348
 
349
- ![Fast Image Processors Performance](fast_image_processors.png)
350
-
351
 
352
 
353
  ## Reduce barrier to entry/contribution
354
 
355
- This is an overall objective: there's no `transformer` without its community.
356
 
357
- We didn't want to make a toolbox, because _having a framework means forcing users into it_. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
358
 
359
- Among the most valuable contributions to `transformers`is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other softwares.
360
 
361
- In that regard, we DO want to be a [modular toolbox](#modular-toolbox), being [minimal](#minimal-user-api) enough (and hopefully well documented enough) so any ML/AI developer can use transformers without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
362
 
 
363
 
364
- ## A surgical toolbox for model development
365
 
366
  ### Attention visualisation
367
 
368
- If all models have the same API internally for attention computation, it allows us to build cool tools to visualize the inner workings of the attention mechanism. One particular piece of
369
- machinery is the `attention mask`, cause of confusion.
370
 
371
- Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual "causal-only" models.
372
 
373
  {{{fragment-attention-visualizer}}}
374
 
375
 
376
  ### Logging entire model activations
377
 
378
- Further, because it is all PyTorch (and it is even more now that we support only PyTorch), we can easily debug any model when we want to add it to transformers. We now have a power-user tool for porting or adding models, that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.
379
 
380
  It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our [core guideline](#source-of-truth).
381
 
@@ -387,11 +439,11 @@ Having a clean _external_ API allows us to work on the true inner workings of tr
387
 
388
  {{{fragment-warmup_demo}}}
389
 
390
- It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, your iteration speed.
391
 
392
  ### Transformers-serve and continuous batching
393
 
394
- Having all these models readily available allows to use all of them with transformers-serve, and enable interfacing with them with an Open API-like pattern.
395
 
396
  ```bash
397
  transformers serve
@@ -410,7 +462,7 @@ Continuous batching is in itself very much linked to the great work of vLLM with
410
 
411
  Transformers-serve is transformers-first, for sure, but it's not limited to that. Adding a model to transformers means:
412
  - having it immediately available to the community
413
- - having it immediately usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
414
 
415
  This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there is more optimized software than ours to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files) and here for [MoE support](https://github.com/huggingface/transformers/pull/40132), for instance.
416
 
 
1
 
2
+
3
+
4
+
5
+
6
+
7
+
8
+
9
+
10
+
11
+
12
+
13
+
14
+
15
+
16
+
17
+
18
+
19
+
20
+
21
+
22
+
23
+
24
+
25
+
26
+
27
+
28
+
29
+
30
+
31
+
32
+
33
+
34
+
35
+
36
+
37
+ ## Introduction
38
 
39
  One million lines of `python` code. Through them, the `transformers` library supports more than 400 model architectures, from state-of-the-art LLMs and VLMs to specialized models for audio, video, and tables.
40
 
 
51
  For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon `transformers`. And not only that: any project of comparable size will require you to make deep choices, not only about design and choice of abstraction, but about the very mindset of the software you are building.
52
 
53
 
54
+ ## The core tenets of transformers
55
 
56
 
57
  We summarize the foundations on which we've built everything, and write the "tenets" of the library. They behave like _software interfaces_, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.
 
63
  <li class="tenet">
64
  <a id="source-of-truth"></a>
65
  <strong>Source of Truth</strong>
66
+ <p>We aim to be a [source of truth for all model definitions](https://huggingface.co/blog/transformers-model-definition). This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performance.</p>
67
  <em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
68
  </li>
69
 
 
108
  <li class="tenet">
109
  <a id="consistent-public-surface"></a>
110
  <strong>Consistent Public Surface</strong>
111
+ <p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. This is a goal we have as well as a tenet.</p>
112
  <em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
113
  </li>
114
  </ol>
 
131
 
132
  You can use a simple regex to find all methods of a given name across your codebase and compare their differences and similarities; that's what I did (plus a hash to avoid quadratic comparisons).
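+ As an illustration, here is a minimal sketch of that kind of search (a hypothetical helper, not part of `transformers`): a regex pulls out every top-level definition of a given function from the modeling files, and a hash of the normalized body groups exact duplicates.
+
+ ```python
+ import hashlib, re
+ from pathlib import Path
+
+ def find_copies(root: str, func_name: str) -> dict[str, list[str]]:
+     # match a top-level `def <func_name>(...)` up to the next top-level def (or end of file)
+     pattern = re.compile(rf"^def {func_name}\(.*?(?=^def |\Z)", re.S | re.M)
+     groups: dict[str, list[str]] = {}
+     for path in Path(root).rglob("modeling_*.py"):
+         for body in pattern.findall(path.read_text()):
+             # hash the whitespace-normalized body so identical copies land in the same bucket
+             digest = hashlib.sha1(" ".join(body.split()).encode()).hexdigest()
+             groups.setdefault(digest, []).append(str(path))
+     return groups
+
+ # e.g. find_copies("src/transformers/models", "rotate_half")
+ ```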
133
 
134
+ We want all models to have self-contained modeling code.
135
 
136
+ Every core functionality _must_ be in the modeling code, every non-core functionality _can_ be outside of it.
137
 
138
  This comes at a great cost. Enter the `# Copied from...` mechanism: for a long time, these comments indicated that some code was copied from another model, saving time both for the reviewers and for the CI. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet we could not remove them.
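+ For illustration, such a marker looks like this (the model pair here is just an example):
+
+ ```python
+ from torch import nn
+
+ # Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Mistral
+ class MistralRMSNorm(nn.Module):
+     ...
+ ```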
139
 
 
143
 
144
  ## <a id="modular"></a> Modular transformers
145
 
 
146
  Transformers is an opinionated library. The previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the [blog post](https://huggingface.co/blog/transformers-design-philosophy) were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. [`modular` transformers were introduced](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file).
147
 
148
  We amended the principle of [DRY*](#do-repeat-yourself) by removing progressively all pieces of code that were "copied from" another file.
 
161
 
162
  When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling (right side) that is run, and all the tests are run on the modeling code.
163
 
164
+ What does that give us?
165
+
166
+ ### A maintainable control surface
167
 
168
  The effect of modular can be measured straight from git history: at every commit, we look under the model directory.
169
  If it only has a modeling file, we add its LOC count.
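+ A rough sketch of that counting (not the actual measurement script) could look like the following: for each model directory, prefer the `modular_*.py` line count when one exists, otherwise fall back to the full `modeling_*.py` count, and run it at every commit.
+
+ ```python
+ from pathlib import Path
+
+ def effective_loc(models_dir: str) -> int:
+     total = 0
+     for model_dir in Path(models_dir).iterdir():
+         if not model_dir.is_dir():
+             continue
+         # the modular file is what we actually maintain; the expanded modeling file is generated
+         modular = list(model_dir.glob("modular_*.py"))
+         files = modular or list(model_dir.glob("modeling_*.py"))
+         total += sum(len(f.read_text().splitlines()) for f in files)
+     return total
+
+ # checked out at each commit (e.g. `git checkout <sha>`), this traces the maintenance surface over time
+ ```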
 
181
 
182
  There's a sharp drop near the end; it's due to us [removing support for Jax and TensorFlow](https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc) library-wide.
183
 
184
+ Of course, it is not only this effort that allowed us to reduce the maintenance load.
185
+
186
+ A related optimization concerns the attention computation. You've likely heard about [flash attention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention) and its several variants.
187
+
188
+ The _attention computation_ happens at a _lower_ level of abstraction than the model itself.
189
 
190
+ However, we had been adding backend-specific torch operations for each of them (SDPA, flash-attention variants, flex attention) directly in the modeling files, and it wasn't a [minimal user api](#minimal-user-api).
191
 
192
+ ### <a id="attention-classes"></a> External Attention classes
193
 
194
+ Externalising the [attention classes](#attention-classes) has moved out a lot of repeated code that was [standard](#standardize-dont-abstract).
195
 
196
+ We moved to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allowed the following:
197
+
198
+ We keep a `Callable` for the naive implementation of attention, the "eager" computation. This Callable is named `eager_attention_forward`, and can be run as long as the user has `torch` installed, which is a requirement in any case.
199
+
200
+ In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and another Callable is dispatched, including kernel bindings.
201
 
202
  ```python
203
  attention_interface: Callable = eager_attention_forward
 
205
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
206
  ```
207
 
208
+ A strength of the new attention interface is the possibility to enforce specific kwargs, which are needed by kernel providers and other dependencies. We know that kwargs are often a necessary evil that plagues tools aiming for widespread compatibility; it is something we have aimed to reduce, and will continue to reduce, in order to improve readability - with them, the current system remains a [minimal user api](#minimal-user-api).
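+ Because the mapping is just a name pointing to a `Callable`, plugging in a custom implementation is one registration away. A minimal sketch, based on the documented `AttentionInterface` registration mechanism (the function body is deliberately simplified, and the checkpoint name is only an example):
+
+ ```python
+ import torch
+ from transformers import AttentionInterface, AutoModelForCausalLM
+
+ # same calling convention as eager_attention_forward: (module, query, key, value, attention_mask, **kwargs)
+ def my_custom_attention(module, query, key, value, attention_mask=None, scaling=None, dropout=0.0, **kwargs):
+     attn_output = torch.nn.functional.scaled_dot_product_attention(
+         query, key, value, attn_mask=attention_mask, dropout_p=dropout, scale=scaling
+     )
+     # models expect (batch, seq_len, num_heads, head_dim) plus optional attention weights
+     return attn_output.transpose(1, 2).contiguous(), None
+
+ AttentionInterface.register("my_custom_attention", my_custom_attention)
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", attn_implementation="my_custom_attention")
+ ```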
 
 
209
 
210
  For better _information_, we plan to use `python` features such as `Annotated` to inform users of what we typically expect in an argument. That way, higher-level information could be included directly in the type annotations, like so (tentative design):
211
 
 
216
  ```
217
 
218
 
 
219
 
220
+ ### <a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism
 
221
 
222
+ If you're not familiar with the different flavours of parallelism, I recommend checking out [this blog post](https://huggingface.co/blog/accelerate-nd-parallel) first; a full [dive into the ultra-scale playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook) is, of course, always worthwhile.
223
 
224
+ The essential part is that, as [the documentation states](https://huggingface.co/docs/transformers/v4.56.2/perf_train_gpu_many#tensor-parallelism), when tensors get too large to fit on a single GPU, they are sliced along a particular dimension and every slice is sent to a different GPU.
225
+
226
+ Why does it matter?
227
+
228
+ Because we want to avoid code modifications that are unrelated to the model.
229
+ We choose to place the level of abstraction higher than the device placement: a matrix multiplication - an `nn.Linear` layer - should always be expressed in the same way, regardless of how it is placed.
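+ To see why the abstraction can stay declarative, here is a tiny, framework-free illustration (plain `torch`, not the actual `ParallelInterface` machinery): splitting a `nn.Linear` column-wise across two ranks and gathering the partial outputs reproduces the original result.
+
+ ```python
+ import torch
+ from torch import nn
+
+ torch.manual_seed(0)
+ layer = nn.Linear(8, 6, bias=False)
+ x = torch.randn(2, 8)
+
+ # column-wise sharding: each rank holds a slice of the output features
+ w0, w1 = layer.weight.chunk(2, dim=0)
+ out_rank0 = x @ w0.T  # would live on GPU 0
+ out_rank1 = x @ w1.T  # would live on GPU 1
+
+ # gathering the shards along the feature dimension matches the unsharded layer
+ assert torch.allclose(torch.cat([out_rank0, out_rank1], dim=-1), layer(x), atol=1e-6)
+ ```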
230
+
231
+ Hence, we want to touch the modeling code [minimally](#minimal-user-api), and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we instead now specify a simple `tp_plan`.
232
+
233
+ The alternative would be to modify parent classes for each parallelization scheme, leaking distribution details into the modeling code.
234
+
235
+ It is written once in the config and passed to `.from_pretrained()`. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires them to sharding implementations such as `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
236
 
237
  {{{fragment-tp-plan}}}
238
 
 
244
  Semantics stay in the model (a Linear stays a Linear); distribution is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks; "rowwise" splits rows; packed variants shard fused weights. The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
245
 
246
 
247
+ ### <a id="layers-attentions-caches"></a> Layers, attentions and caches
248
 
249
  Following the same logic, the _nature_ of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we defined a mapping that can then be specified per layer in the configuration:
250
 
 
273
 
274
  This is [minimal](#minimal-user-api) to implement on the user side, and allows us to keep the modeling untouched. It is also easy to tweak.
275
 
276
+ ### <a id="community-kernels"></a>Community Kernels
277
 
278
  The same principle extends to normalization, activation, and other code paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a [consistent public surface](#consistent-public-surface):
279
 
 
287
 
288
  Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
289
 
290
+ ## Modular developments
291
 
292
  Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering, if this effort is made, and we're striving for it.
293
  It's hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
 
309
 
310
  Another problem is that this analysis only covers `modular` models: several models do NOT have a modular file.
311
 
312
+ ### Many models, but not enough yet, are alike
313
 
314
  So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters strung together, and I also used code embedding models to check code similarities, which yielded better results; still, for the needs of this blog post I will stick to the Jaccard index.
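+ For reference, the comparison itself is simple; a minimal sketch (crude identifier-level tokenization, file paths are just examples):
+
+ ```python
+ import re
+ from pathlib import Path
+
+ def jaccard(path_a: str, path_b: str) -> float:
+     # treat each identifier in a file as one element of a set
+     tokens = lambda p: set(re.findall(r"[A-Za-z_]\w+", Path(p).read_text()))
+     a, b = tokens(path_a), tokens(path_b)
+     return len(a & b) / len(a | b)
+
+ # e.g. jaccard("models/llava/modeling_llava.py", "models/llava_next/modeling_llava_next.py")
+ ```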
315
 
 
381
  return special_image_mask, special_video_mask
382
  ```
383
 
384
+ But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
385
 
386
 
387
  ### <a id="encoders-ftw"></a> Embedding models, now and forever.
 
396
 
397
  Choosing to be a `torch`-first software meant shedding a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal in the number of torch-dependent utilities we added. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
398
 
399
+ The gains in performance are immense, up to a 20x speedup for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on GPU.
400
 
401
+ ![Fast Image Processors Performance](static/fast_image_processors.png)
402
+ <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
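+ As a usage sketch (the checkpoint is only an example), requesting the fast, `torchvision`-backed processor and keeping tensors on GPU looks like this:
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoImageProcessor
+
+ image = Image.open("cat.png")
+
+ # use_fast=True selects the torchvision-backed processor class when one exists
+ processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)
+
+ # processing returns torch tensors directly and can run on GPU end to end
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ inputs = processor(images=image, return_tensors="pt", device=device)
+ print(inputs["pixel_values"].shape, inputs["pixel_values"].device)
+ ```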
403
 
404
 
405
  ## Reduce barrier to entry/contribution
406
 
407
+ This is an overall objective: there's no `transformers` without its community.
408
 
409
+ Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
410
 
411
+ Among the most valuable contributions to `transformers` is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other pieces of software.
412
 
413
+ In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
414
 
415
+ So, how do these design choices, these "tenets", influence the development of models and the overall usage of `transformers`?
416
 
417
+ ### A surgical toolbox for model development
418
 
419
  ### Attention visualisation
420
 
421
+ All models have the same internal API for attention computation, thanks to [the externalisation of attention classes](#attention-classes). This allows us to build cool tools to visualize the inner workings of the attention mechanism.
 
422
 
423
+ One particular piece of machinery is the `attention mask`. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual "causal-only" models.
424
 
425
  {{{fragment-attention-visualizer}}}
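+ The underlying pattern is easy to reproduce in a toy sketch (plain `torch`, not the library's own mask utilities): start from a causal mask and open up the prefix block so that prefix tokens attend to each other in both directions.
+
+ ```python
+ import torch
+
+ seq_len, prefix_len = 8, 5  # e.g. image + text prompt tokens form the prefix
+
+ # standard causal mask: True means "may attend"
+ mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
+
+ # prefix-LM pattern (as in PaliGemma): the whole prefix is bidirectional, only the suffix stays causal
+ mask[:prefix_len, :prefix_len] = True
+ print(mask.int())
+ ```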
426
 
427
 
428
  ### Logging entire model activations
429
 
430
+ Further, because it is all PyTorch (even more so now that we support only PyTorch), we can easily [debug any model](https://huggingface.co/docs/transformers/internal/model_debugging_utils) when we want to add it to transformers. We now have a power-user tool for porting or adding models that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.
431
 
432
  It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our [core guideline](#source-of-truth).
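+ The interception idea itself is plain PyTorch; a stripped-down sketch of the mechanism (not the actual debugging utility shipped in `transformers`) using forward hooks:
+
+ ```python
+ import torch
+ from torch import nn
+
+ def log_activations(model: nn.Module, inputs: dict) -> dict:
+     """Run one forward pass and record per-module output shape, dtype and basic stats."""
+     records, handles = {}, []
+
+     def make_hook(name):
+         def hook(module, args, output):
+             out = output[0] if isinstance(output, tuple) else output
+             if isinstance(out, torch.Tensor):
+                 records[name] = {
+                     "shape": list(out.shape),
+                     "dtype": str(out.dtype),
+                     "mean": out.float().mean().item(),
+                     "std": out.float().std().item(),
+                 }
+         return hook
+
+     for name, module in model.named_modules():
+         handles.append(module.register_forward_hook(make_hook(name)))
+     try:
+         with torch.no_grad():
+             model(**inputs)
+     finally:
+         for handle in handles:
+             handle.remove()
+     return records  # e.g. json.dump(records, open("activations.json", "w"), indent=2)
+ ```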
433
 
 
439
 
440
  {{{fragment-warmup_demo}}}
441
 
442
+ It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, as it's the narrowest bottleneck for your iteration speed.
443
 
444
  ### Transformers-serve and continuous batching
445
 
446
+ Having all these models readily available allows using all of them with transformers-serve, and enables interfacing with them through an OpenAI-like API. As a reminder, the hub also opens access to various [inference providers](https://huggingface.co/docs/inference-providers/en/index) if you're interested in model deployment in general.
447
 
448
  ```bash
449
  transformers serve
 
462
 
463
  Transformers-serve is transformers-first, for sure, but it's not limited to that. Adding a model to transformers means:
464
  - having it immediately available to the community
465
+ - having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
466
 
467
  This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there is more optimized software than ours to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files) and here for [MoE support](https://github.com/huggingface/transformers/pull/40132), for instance.
468
 
dist/distill.bundle.js CHANGED
@@ -2146,7 +2146,7 @@ function _arrayWithHoles(r) { if (Array.isArray(r)) return r; }
2146
  function bylineTemplate(frontMatter) {
2147
  return "\n <div class=\"byline grid\">\n <div>\n <h3>Authors</h3>\n <div>\n ".concat(frontMatter.authors.map(function (author, i) {
2148
  return "\n <span class=\"author\">\n ".concat(author.personalURL ? "\n <a class=\"name\" href=\"".concat(author.personalURL, "\">").concat(author.name) + (i + 1 < frontMatter.authors.length ? "," : "") + "</a>" : "\n <span class=\"name\">".concat(author.name) + (i + 1 < frontMatter.authors.length ? "," : "") + "</span>", "\n </span>\n ");
2149
- }).join(''), "\n </div>\n </div>\n <div >\n <h3>Affiliation</h3>\n <div><a href=\"https://huggingface.co/\">Hugging Face</a>\n </div>\n </div>\n <div >\n <h3>Published</h3>\n <div>August, 2025</div>\n </div>\n </div>\n\n");
2150
  }
2151
  var Byline = /*#__PURE__*/function (_HTMLElement4) {
2152
  function Byline() {
 
2146
  function bylineTemplate(frontMatter) {
2147
  return "\n <div class=\"byline grid\">\n <div>\n <h3>Authors</h3>\n <div>\n ".concat(frontMatter.authors.map(function (author, i) {
2148
  return "\n <span class=\"author\">\n ".concat(author.personalURL ? "\n <a class=\"name\" href=\"".concat(author.personalURL, "\">").concat(author.name) + (i + 1 < frontMatter.authors.length ? "," : "") + "</a>" : "\n <span class=\"name\">".concat(author.name) + (i + 1 < frontMatter.authors.length ? "," : "") + "</span>", "\n </span>\n ");
2149
+ }).join(''), "\n </div>\n </div>\n <div >\n <h3>Affiliation</h3>\n <div><a href=\"https://huggingface.co/\">Hugging Face</a>\n </div>\n </div>\n <div >\n <h3>Published</h3>\n <div>October, 2025</div>\n </div>\n </div>\n\n");
2150
  }
2151
  var Byline = /*#__PURE__*/function (_HTMLElement4) {
2152
  function Byline() {
dist/distill.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -8,21 +8,22 @@
8
  <script src="https://d3js.org/d3.v7.min.js"></script>
9
  <meta name="viewport" content="width=device-width, initial-scale=1">
10
  <meta charset="utf8">
11
- <title>Scaling Insanity: maintaining hundreds of model definitions</title>
12
  <link rel="stylesheet" href="style.css">
 
13
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css">
14
  </head>
15
  <body>
16
  <d-front-matter>
17
  <script id='distill-front-matter' type="text/json">{
18
- "title": "Scaling Insanity: maintaining hundreds of model definitions",
19
  "description": "A peek into software engineering for the transformers library",
20
  "published": "Aug 21, 2025",
21
  "authors": [{"author": "Pablo Montalvo", "authorURL": "https://huggingface.co/Molbap"}]
22
  }</script>
23
  </d-front-matter>
24
  <d-title>
25
- <h1>Scaling Insanity: maintaining hundreds of model definitions</h1>
26
  <p>A peek into software engineering for the transformers library</p>
27
  </d-title>
28
  <d-byline></d-byline>
@@ -48,33 +49,29 @@
48
  </nav>
49
  </d-contents>
50
  <h2>Introduction</h2>
51
- <p>The <code>transformers</code> library, built with <code>PyTorch</code>, supports all state-of-the-art LLMs, many VLMs, task-specific vision language models, video models, audio models, table models, classical encoders, to a global count of almost 400 models.<br>
52
- The name of the library itself is mostly majority driven as many models are not even transformers architectures, like Mamba, Zamba, RWKV, and convolution-based models.<br>
53
- Regardless, each of these is wrought by the research and engineering team that created them, then harmonized into a now famous interface, and callable with a simple <code>.from_pretrained</code> command.<br>
54
- Inference works for all models, training is functional for most. The library is a foundation for many machine learning courses, cookbooks, and overall, several thousands other open-source libraries depend on it. All models are tested as part of a daily CI ensuring their preservation and reproducibility. Most importantly, it is <em>open-source</em> and has been written by the community for a large part.<br>
55
- This isnt really to brag but to set the stakes: what does it take to keep such a ship afloat, made of so many moving, unrelated parts?<br>
56
- The ML wave has not stopped, there’s more and more models being added, at a steadily growing rate. <code>Transformers</code> is widely used, and we read the feedback that users post online. Whether it’s about a function that had 300+ keyword arguments, duplicated code and helpers, and mentions of <code>Copied from ... </code> everywhere, along with optimisation concerns. Text-only models are relatively tamed, but multimodal models remain to be harmonized.<br>
57
- Here we will dissect what is the new design philosophy of transformers, as a continuation from the existing older <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page, and an accompanying <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post from 2022</a>.<br>
58
- More recently, and I recommend the read if it’s not done yet, a blog post about <a href="https://huggingface.co/blog/faster-transformers">recent upgrades to transformers</a> was written, explaining in particular what makes the library faster today.<br>
59
- Some time ago I dare not say how long, we discussed with transformers maintainers about the state of features in transformers. A lot of recent developments were satisfactory, but if we were only talking about these, self-congratulation would be the only goalpost.<br>
60
- Reflecting on this philosophy now, as models pile up, is essential and will drive new developments.</p>
61
- <h3>The core tenets of transformers</h3>
62
- <p>Every reader, whether an OSS maintainer, power user, or casual fine-tuner, will walk away knowing how to reason about the <code>transformers</code> code base, how to use it better, how to meaningfully contribute to it.
63
- This will also showcase new features you might have missed so you’ll be up-to-date.</p>
64
- <p>So, what are the principles of <code>transformers</code>? We will try to summarize the foundations on which we’ve built everything, and write the “tenets” of the library. They behave like <em>software interfaces</em>, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.</p>
65
  <div class="tenet-list">
66
  <ol>
67
  <li class="tenet">
68
  <a id="source-of-truth"></a>
69
  <strong>Source of Truth</strong>
70
- <p>We should be a source of truth for all model definitions. This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
71
  <em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
72
  </li>
73
  <li class="tenet">
74
  <a id="one-model-one-file"></a>
75
  <strong>One Model, One File</strong>
76
- <p>All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom.</p>
77
- <em>Every model should be completely understandable by reading a single file from top to bottom.</em>
78
  </li>
79
  <li class="tenet">
80
  <a id="code-is-product"></a>
@@ -99,32 +96,26 @@ This will also showcase new features you might have missed so you’ll be up-to-
99
  <a id="minimal-user-api"></a>
100
  <strong>Minimal User API</strong>
101
  <p>Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths. Reading should be obvious, configurations should be obvious.</p>
102
- <em>Keep the public interface simple and predictable - users should know what to expect.</em>
103
  </li>
104
  <li class="tenet">
105
  <a id="backwards-compatibility"></a>
106
  <strong>Backwards Compatibility</strong>
107
- <p>Evolve by additive standardization, <strong>never</strong> break public APIs.</p>
108
- <p><strong>Note:</strong> Some models are showing almost no use, we also stopped adding new features for non-torch frameworks. Still, we adapt to models existing on the hub.</p>
109
- <em>Once something is public, it stays public - evolution through addition, not breaking changes.</em>
110
  </li>
111
  <li class="tenet">
112
  <a id="consistent-public-surface"></a>
113
  <strong>Consistent Public Surface</strong>
114
- <p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests.</p>
115
  <em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
116
  </li>
117
- <li class="tenet">
118
- <a id="modular-toolbox"></a>
119
- <strong>Modular Toolbox (Not A Framework)</strong>
120
- <p>We ARE a toolbox. What we are not is a framework: you should not be FORCED to rewrite every modeling, but it is <em>better</em> for your model to be able to inherit from PreTrainedModel and have enabled TensorParallel, from_pretrained, sharding, push_to_hub, loss, as well as PEFT/TRL/SGLang/vLLM.</p>
121
- <em>This is the largest change. Provide tools and utilities, but don't force users into a rigid framework.</em>
122
- </li>
123
  </ol>
124
  </div>
125
  <p>When a PR is merged, it is because the contribution is worthwhile, and that the <code>transformers</code> team finds the design of the contribution to be aligned with what is above.</p>
126
- <p>Does all the code in the library follow strictly these tenets? No. The library is a gigantic house with connected nooks, corridors, crannies everywhere built by thousands of different workers. We <em>try</em> to make it so all the code added is inline, lest we break <a href="#backwards-compatibility">backwards compatibility</a>.</p>
127
- <p>For instance, one function essential to the implementation of <a href="https://huggingface.co/papers/2104.09864">Rotary Positional Embeddings</a> is identical in 70 <code>modeling_&lt;file&gt;.py</code> across <code>src/transformers/models/.</code> Why keep it? Because removing it would make those files unloadable checkpoints rather than self-contained blueprints. We <a href="#do-repeat-yourself">do repeat ourselves</a>.</p>
128
  <pre><code class="language-python">def rotate_half(x):
129
  &quot;&quot;&quot;Rotates half the hidden dims of the input.&quot;&quot;&quot;
130
  x1 = x[..., : x.shape[-1] // 2]
@@ -132,11 +123,15 @@ This will also showcase new features you might have missed so you’ll be up-to-
132
  return torch.cat((-x2, x1), dim=-1)
133
  </code></pre>
134
  <p>You can use a simple regex to look at all methods of a given name across your codebase and look at their differences and similarities, that’s what I did (+ a hash to avoid quadraticity).</p>
135
- <p>So… why keep it in all modeling files? Because if we were to remove it, the model would not work anymore. Think of the modeling files as a car (I know, what a novel metaphor! But, it works out.). All manual transmission cars have a clutch, but we want each <em>view</em> of one of our cars to be able to function. Remove the clutch, you can’t drive. Remove the doors, might be uncomfortable but you’ll get there. So doors can go, but you <em>have</em> to keep the clutch, even though you know perfectly how it works.</p>
136
- <h2><a id="modular"></a> Going modular</h2>
137
- <p>It is opinionated, and it can be frustrating when you encounter an opinionated library. Our previous <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page, and the <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post</a> were already pointing at some drawbacks, which have been iteratively addressed. <a href="https://huggingface.co/docs/transformers/en/modular_transformers">Transformers has gone modular</a>, allowing a form of inheritance without breaking <a href="#one-model-one-file">One model, One file</a>. If you’re familiar with this, you can <a href="#%5Eattention-classes">skip this section</a> and go to the next one.</p>
 
 
 
 
138
  <p>We amended the principle of <a href="#do-repeat-yourself">DRY*</a> by removing progressively all pieces of code that were “copied from” another file.</p>
139
- <p>It is explained in details in the documentation above, but overall it works like this, you define a <code>modular_</code> file that can inherit from <em>any function across all other modeling, configuration and processor files</em>:</p>
140
  <summary>Auto-generated modeling code</summary>
141
  <p><div class=code-compare style="display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.5rem 0;">
142
  <div class=code-column style="border: 1px solid #e2e8f0; border-radius: 8px; overflow: hidden;">
@@ -287,25 +282,49 @@ class GlmRMSNorm(nn.Module):
287
  <strong>Left:</strong> Clean modular definition with inheritance.
288
  <strong>Right:</strong> Auto-expanded version with all inherited functionality visible.
289
  </p></p>
290
- <p>As you can see, we can now define any model as a <em>modular</em> of another. This isn’t strictly groundbreaking if you’ve done any programming, you might even think “well that’s just how inheritance works”. The crucial difference is that we do <em>visibly</em> what is essentially the <em>compiler</em>’s job: by unrolling the inheritances, we make visible all of the modeling code, keeping it <a href="#one-model-one-file">all in one piece</a>.</p>
291
- <h2><a id="attention-classes"></a> External Attention classes</h2>
292
- <p>A chronological iteration over <a href="#modular">modular</a>, and a big improvement in terms of readabilty, was to remove the various attention-backend-specific attention classes across the repository. Before, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasn’t a <a href="#minimal-user-api">minimal user api</a>.</p>
293
- <p>What will forever stay in the modeling code is the <code>eager_attention_forward</code> because it is a core part of the modeling,</p>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
294
  <pre><code class="language-python">attention_interface: Callable = eager_attention_forward
295
  if self.config._attn_implementation != &quot;eager&quot;:
296
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
297
  </code></pre>
298
- <p>We often read and understand that <code>kwargs</code> are criticized, and we are typing them however we can, but we cannot enforce them all the time because other libraries such as vLLM don’'t use the same kwargs.</p>
299
- <p>It is a strength of the new attention interface, where it can be plugged in various backends, because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system is a <a href="#minimal-user-api">minimal user api</a>.</p>
300
  <p>For better <em>information</em>, we plan to use <code>python</code> features such as <code>Annotated</code> for example, to inform users of what we expect typically in an argument. That way, higher-level information could be included directly in the type annotations, like so (tentative design):</p>
301
  <pre><code class="language-python">from typing import Annotated
302
 
303
  MyModelOutputAnnotated = Annotated[MyModelOutput, &quot;shape: (B, C, H, W)&quot;]
304
  </code></pre>
305
- <h2><a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism</h2>
306
- <p>We want to touch minimally to the modeling code, and only modify it when <em>architectural changes</em> are involved. For instance, for tensor parallelism, we instead now specify a simple <code>tp_plan</code>.</p>
307
- <p>It is written once in the config and passed to <code>.from_pretrained()</code>.</p>
308
- <p>The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal <code>ParallelInterface</code>, which wires to sharding implementations <code>ColwiseParallel</code>, <code>RowwiseParallel</code>, packed variants, and so on.</p>
 
 
 
 
 
309
  <p><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
310
  base_model_tp_plan = {
311
  "layers.*.self_attn.q_proj": "colwise",
@@ -333,7 +352,7 @@ out = model(**inputs)</code></pre></p>
333
  <p>Which allows a user to run with multiple processes per node, e.g. 4 GPUs:</p>
334
  <p><code>torchrun --nproc-per-node 4 demo.py</code></p>
335
  <p>Semantics stay in the model (a Linear stays a Linear), distribution is orthogonal and declared via strings: “colwise” splits columns of weights/bias across ranks; “rowwise” splits rows; packed variants shard fused weights; The mapping keys accept glob patterns like <code>layers.*.mlp.down_proj</code> to target repeated submodules.</p>
336
- <h2><a id="layers-attentions-caches"></a> Layers, attentions and caches</h2>
337
  <p>Following the same logic, the <em>nature</em> of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we defined a mapping that can be then</p>
338
  <pre><code class="language-python">ALLOWED_LAYER_TYPES = (
339
  &quot;full_attention&quot;,
@@ -352,8 +371,8 @@ out = model(**inputs)</code></pre></p>
352
  &quot;full_attention&quot;
353
  ],
354
  </code></pre>
355
- <p>This is <a href="#minimal-user-api">minimal</a> to implement on the user side, and allows to keep the modeling untouched. It is also <a href="#modular-toolbox">easy to tweak</a>.</p>
356
- <h2><a id="community-kernels"></a>Community Kernels</h2>
357
  <p>The same principle extends to normalization, activation, and other code paths. The model defines <strong>semantics</strong>; a kernel defines <strong>how</strong> to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a <a href="#consistent-public-surface">consistent public surface</a></p>
358
  <pre><code class="language-python">@use_kernel_forward_from_hub(&quot;RMSNorm&quot;)
359
  class GlmRMSNorm(nn.Module):
@@ -361,7 +380,7 @@ class GlmRMSNorm(nn.Module):
361
  </code></pre>
362
  <p>Plus, this opened another angle of contribution for the community. People who are GPU whisperers can now contribute optimized kernels. You can check on the <a href="https://huggingface.co/blog/hello-hf-kernels">kernel community blog post</a> to learn more about it!</p>
363
  <p>Even more resources have been added, like the formidable <a href="https://github.com/huggingface/kernel-builder">kernel builder</a> with its connected resources to <a href="https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md">help you build kernels with it</a> and <a href="https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md">with nix</a>.</p>
364
- <h2>The good modularity</h2>
365
  <p>Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to <em>define standards</em>. Pushing the boundaries of scientific knowledge can translate into the boundaries of engineering if this effort is made, and we’re striving for it.
366
  It’s hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
367
  So I wanted to take a look at the current <strong>state of modularity</strong> across the repository. How many models are defined using components of others?</p>
@@ -377,12 +396,12 @@ Radically different architectures such as mamba have spawned their own dependenc
377
  <p>However, even if llava defines a few VLMs, there’s far too many vision-based architectures that are not yet defined as modulars of other existing archs. In other words, there is no strong reference point in terms of software for vision models.
378
  As you can see, there is a small DETR island, a little llava pocket, and so on, but it’s not comparable to the centrality observed for llama.</p>
379
  <p>Another problem is, this is only for <code>modular</code> models. Several models do NOT have a modular file.</p>
380
- <h2>Many models, but not enough yet, are alike</h2>
381
  <p>So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters stringed together. I also used code embedding models to check out code similarities, and it yielded better results, for the needs of this blog post I will stick to Jaccard index.</p>
382
  <p>It is interesting, for that, to look at <em>when</em> we deployed this modular logic and what was its rippling effect on the library. You can check the <a href="https://huggingface.co/spaces/Molbap/transformers-modular-refactor">larger space</a> to play around, but the gist is: adding modular allowed to connect more and more models to solid reference points. We have a lot of gaps to fill in still.</p>
383
  <p> <iframe src=https://molbap-timeline-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
384
  <p>If you’ve checked out llava, you’ve seen that llava_video is a red node, connected by a red edge to llava: it’s a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
385
- <h2>VLM improvements, avoiding abstraction</h2>
386
  <p>We don’t have cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attn bridges). This is one of the main improvement points where we can work.</p>
387
  <p>For instance, I thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into an llm decoder in 95% of the existing VLMs. It would have looked like something like</p>
388
  <pre><code class="language-python">class InputsEmbeddingMixerMixin(nn.Module):
@@ -432,16 +451,8 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
432
 
433
  return special_image_mask, special_video_mask
434
  </code></pre>
435
- <p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because it’d break the self-contained logic of the model.</p>
436
- <h2>The weight of maintenance</h2>
437
- <p>The effect of modular can be measured straight from git history: at every commit I counted LOC (lines of code) under src/transformers/models, but if a model has a modular_*.py I count it. That gives an “effective LOC” curve: the 𝗺𝗮𝗶𝗻𝘁𝗲𝗻𝗮𝗻𝗰𝗲 𝘀𝘂𝗿𝗳𝗮𝗰𝗲.</p>
438
- <p>𝗝𝘂𝘀𝘁 𝗹𝗼𝗼𝗸 𝗮𝘁 𝘁𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁: 𝘁𝗵𝗲 𝗴𝗿𝗼𝘄𝘁𝗵 𝗿𝗮𝘁𝗲 𝗼𝗳 𝗹𝗶𝗻𝗲𝘀 𝗼𝗳 𝗰𝗼𝗱𝗲 𝗰𝗼𝗹𝗹𝗮𝗽𝘀𝗲𝗱! Counting raw 𝚖𝚘𝚍𝚎𝚕𝚒𝚗𝚐_*.𝚙𝚢 (with “Copied from…” everywhere) we were around 362 LOC/day; with 𝚖𝚘𝚍𝚞𝚕𝚊𝚛 in place the effective rate is ~25 LOC/day. About 𝟭𝟱× 𝗹𝗼𝘄𝗲𝗿! Had we continued with a strict “one model, one file” policy who knows where we’d have ended up.</p>
439
- <p>Less code to hand-maintain means fewer places to break.</p>
440
- <p>Cyclomatic complexity isn’t LOC, but they strongly correlate. As Les Hatton notes, defects scale like 𝙙 ~ 𝙭 𝙡𝙣 𝙭. Lower 𝘅 (lower loc) helps.</p>
441
- <p><iframe src=https://molbap-loc-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
442
- <p>There’s a sharp drop near the end, it’s due to us <a href="https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc">removing support for Jax and TensorFlow</a> library-wide.</p>
443
- <p>Of course, it is not only this effort that allowed to reduce the maintenance load. Externalising the <a href="#external-attention-classes">attention classes</a> has moved out a lot of repeated code that was <a href="#standardize-dont-abstract">standard</a>.</p>
444
- <h2><a id="encoders-ftw"></a> Embedding models, now and forever.</h2>
445
  <p>Models popularity speaks for itself! This is because the usage of encoders lies in embeddings. So we have to keep the encoders part viable, usable, fine-tune-able.</p>
446
  <p><html>
447
  <head><meta charset="utf-8" /></head>
@@ -4329,20 +4340,21 @@ return Plotly;
4329
  </body>
4330
  </html></p>
4331
  <p>As the codebase grows, along with our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>, we need to keep maintaining this part as well. Retrieval use-cases and smart databases, such as FAISS-based indexing, rely on it, and thus indirectly on transformers.</p>
4332
- <h2>On image processing and processors</h2>
4333
  <p>Choosing to be a <code>torch</code>-first library meant dropping a tremendous amount of support for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more liberal with the number of torch-dependent utilities we add. One of these is the <em>fast processing</em> of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code>- and <code>torchvision</code>-native inputs allowed us to massively speed up the processing time for each model.</p>
4334
- <p>The gains in performance are immense, up to 20x speed for most models when compiled torchvision ops.</p>
4335
- <p><img src="fast_image_processors.png" alt="Fast Image Processors Performance"></p>
 
4336
  <h2>Reduce barrier to entry/contribution</h2>
4337
- <p>This is an overall objective: there’s no <code>transformer</code> without its community.</p>
4338
- <p>We didn’t want to make a toolbox, because <em>having a framework means forcing users into it</em>. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
4339
- <p>Among the most valuable contributions to <code>transformers</code>is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other softwares.</p>
4340
- <p>In that regard, we DO want to be a <a href="#modular-toolbox">modular toolbox</a>, being <a href="#minimal-user-api">minimal</a> enough (and hopefully well documented enough) so any ML/AI developer can use transformers without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
4341
- <h2>A surgical toolbox for model development</h2>
 
4342
  <h3>Attention visualisation</h3>
4343
- <p>If all models have the same API internally for attention computation, it allows us to build cool tools to visualize the inner workings of the attention mechanism. One particular piece of
4344
- machinery is the <code>attention mask</code>, cause of confusion.</p>
4345
- <p>Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual “causal-only” models.</p>
4346
  <p>
4347
  <div style="max-width: 940px; margin: 16px 0; border:1px solid #2a2f3a; border-radius:8px; background:#0b0f19; font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, 'Liberation Mono', 'Courier New', monospace; color:#e5e7eb;">
4348
  <div style="display:flex; align-items:center; gap:8px; padding:8px 10px; border-bottom:1px solid #1f2430; background:#111827; border-top-left-radius:8px; border-top-right-radius:8px;">
@@ -4389,7 +4401,7 @@ machinery is the <code>attention mask</code>, cause of confusion.</p>
4389
  </div>
4390
  </p>
4391
  <h3>Logging entire model activations</h3>
4392
- <p>Further, because it is all PyTorch (and it is even more now that we support only PyTorch), we can easily debug any model when we want to add it to transformers. We now have a power-user tool for porting or adding models, that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.</p>
4393
  <p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our <a href="#source-of-truth">core guideline</a>.</p>
4394
  <p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
4395
  <h3>Cooking faster CUDA warmups</h3>
@@ -4445,9 +4457,9 @@ machinery is the <code>attention mask</code>, cause of confusion.</p>
4445
  </div>
4446
 
4447
  <script>let animationSpeed=1/2.4,isRunning=!1,totalLayers=10;function startDemo(){isRunning||(isRunning=!0,document.getElementById("startBtn").disabled=!0,document.getElementById("resetBtn").disabled=!0,Promise.all([animateNoWarmup(),animateWithWarmup()]).then(()=>{isRunning=!1,document.getElementById("startBtn").disabled=!1,document.getElementById("resetBtn").disabled=!1}))}function resetDemo(){isRunning||(document.getElementById("noWarmupArea").innerHTML="",document.getElementById("warmupLayers").innerHTML="",document.getElementById("warmupFill").style.width="0%",document.getElementById("warmupContainer").classList.remove("allocated"),document.getElementById("noWarmupTime").textContent="0.00s",document.getElementById("warmupTime").textContent="0.00s",document.getElementById("noWarmupCounter").textContent="Layers loaded: 0/10",document.getElementById("warmupCounter").textContent="Layers loaded: 0/10",document.getElementById("noWarmupPhase").textContent="",document.getElementById("warmupPhase").textContent="")}async function animateNoWarmup(){let e=document.getElementById("noWarmupArea"),t=document.getElementById("noWarmupTime"),n=document.getElementById("noWarmupCounter"),a=document.getElementById("noWarmupPhase"),m=0,o=200/animationSpeed;a.textContent="Loading model layers...";for(let a=0;a<10;a++){let d=document.createElement("div");d.className="layer-box",e.appendChild(d),await sleep(.3*o),d.classList.add("allocating"),t.textContent=(m+=.08).toFixed(2)+"s",await sleep(.7*o),d.classList.remove("allocating"),d.classList.add("loaded"),t.textContent=(m+=.12).toFixed(2)+"s",n.textContent=`Layers loaded: ${a+1}/10`}a.textContent="Complete!"}async function animateWithWarmup(){let e=document.getElementById("warmupLayers"),t=document.getElementById("warmupTime"),n=document.getElementById("warmupCounter"),a=document.getElementById("warmupPhase"),m=document.getElementById("warmupContainer"),o=document.getElementById("warmupFill"),d=0,l=200/animationSpeed;a.textContent="Warming up allocator...",await sleep(2*l),m.classList.add("allocated"),t.textContent=(d+=.3).toFixed(2)+"s",a.textContent="Loading model layers...";for(let a=0;a<10;a++){let m=document.createElement("div");m.className="layer-box loaded",m.style.width="40px",m.style.height="20px",e.appendChild(m);let i=(a+1)/10*100;o.style.width=i+"%",await sleep(.5*l),t.textContent=(d+=.08).toFixed(2)+"s",n.textContent=`Layers loaded: ${a+1}/10`}a.textContent="Complete!"}function sleep(e){return new Promise(t=>setTimeout(t,e))}</script></p>
4448
- <p>It’s hard to overstate how much of a lifesaver that is when you’re trying to load a model as fast as possible, your iteration speed.</p>
4449
- <h2>Transformers-serve and continuous batching</h2>
4450
- <p>Having all these models readily available allows to use all of them with transformers-serve, and enable interfacing with them with an Open API-like pattern.</p>
4451
  <pre><code class="language-bash">transformers serve
4452
 
4453
  curl -X POST http://localhost:8000/v1/chat/completions \
@@ -4460,11 +4472,12 @@ curl -X POST http://localhost:8000/v1/chat/completions \
4460
  <p>Transformers-serve is transformers-first, for sure, but it’s not limited to that. Adding a model to transformers means:</p>
4461
  <ul>
4462
  <li>having it immediately available to the community</li>
4463
- <li>having it immediately usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great blog post.</a></li>
4464
  </ul>
4465
  <p>This cements the need even more for a <a href="#consistent-public-surface">consistent public surface</a>: we are now a backend, and there is software more optimized than ours to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a> for instance.</p>
4466
  <h2>What is coming next</h2>
4467
- <p>It sounds dumb, but it’s true: the future is very soon. One tenet that will be broken when the next major version is released, v5, <a href="#backwards-compatibility">backwards compatibility</a> will be heavily broken. Instead, what we aim to be is way more of a <a href="#modular-toolbox">modular toolbox</a>, while maintaining a <a href="#consistent-public-surface">consistent public surface</a>.</p>
 
4468
 
4469
  </d-article>
4470
 
@@ -4492,28 +4505,27 @@ curl -X POST http://localhost:8000/v1/chat/completions \
4492
 
4493
  // Extract tenet text for tooltips
4494
  const tenetTooltips = {
4495
- 'source-of-truth': 'We aim to be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances.',
4496
- 'one-model-one-file': 'All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom.',
4497
  'code-is-product': 'Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.',
4498
  'standardize-dont-abstract': 'If it\'s model behavior, keep it in the file; abstractions only for generic infra.',
4499
  'do-repeat-yourself': 'Copy when it helps users; keep successors in sync without centralizing behavior.',
4500
  'minimal-user-api': 'Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths.',
4501
  'backwards-compatibility': 'Evolve by additive standardization, never break public APIs.',
4502
- 'consistent-public-surface': 'Same argument names, same outputs, hidden states and attentions exposed.',
4503
  };
4504
 
4505
- // Add smooth scrolling and active state
4506
  const tocLinks = document.querySelectorAll('d-contents a');
4507
  tocLinks.forEach(link => {
4508
  const href = link.getAttribute('href');
4509
  const anchor = href ? href.substring(1) : '';
4510
-
4511
- // Add tooltip if this is a tenet link
4512
  if (tenetTooltips[anchor]) {
4513
- link.setAttribute('title', tenetTooltips[anchor]);
4514
  link.style.position = 'relative';
4515
  }
4516
-
4517
  link.addEventListener('click', function(e) {
4518
  e.preventDefault();
4519
  const target = document.querySelector(this.getAttribute('href'));
@@ -4522,6 +4534,16 @@ curl -X POST http://localhost:8000/v1/chat/completions \
4522
  }
4523
  });
4524
  });
 
 
 
 
 
 
 
 
 
 
4525
 
4526
  // Update active state on scroll
4527
  window.addEventListener('scroll', function() {
 
8
  <script src="https://d3js.org/d3.v7.min.js"></script>
9
  <meta name="viewport" content="width=device-width, initial-scale=1">
10
  <meta charset="utf8">
11
+ <title>Maintain the unmaintainable: 1M python loc, 400+ models</title>
12
  <link rel="stylesheet" href="style.css">
13
+ <link rel="stylesheet" href="transformers-custom.css">
14
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css">
15
  </head>
16
  <body>
17
  <d-front-matter>
18
  <script id='distill-front-matter' type="text/json">{
19
+ "title": "Maintain the unmaintainable: 1M python loc, 400+ models",
20
  "description": "A peek into software engineering for the transformers library",
21
  "published": "Aug 21, 2025",
22
  "authors": [{"author": "Pablo Montalvo", "authorURL": "https://huggingface.co/Molbap"}]
23
  }</script>
24
  </d-front-matter>
25
  <d-title>
26
+ <h1>Maintain the unmaintainable: 1M python loc, 400+ models</h1>
27
  <p>A peek into software engineering for the transformers library</p>
28
  </d-title>
29
  <d-byline></d-byline>
 
49
  </nav>
50
  </d-contents>
51
  <h2>Introduction</h2>
52
+ <p>One million lines of <code>python</code> code. Through them, the <code>transformers</code> library supports more than 400 model architectures, from state-of-the-art LLMs and VLMs to specialized models for audio, video, and tables.</p>
53
+ <p>Built on <code>PyTorch</code>, it’s a foundational tool for modern LLM usage, research, education, and tens of thousands of other open-source projects. Each AI model is added by the community, harmonized into a consistent interface, and tested daily on a CI to ensure reproducibility.</p>
54
+ <p>This scale presents a monumental engineering challenge.</p>
55
+ <p>How do you keep such a ship afloat, made of so many moving, unrelated parts, contributed to by a buzzing hivemind? Especially as the pace of ML research accelerates? We receive constant feedback on everything from function signatures with hundreds of arguments to duplicated code and optimization concerns, and we listen to all of it, or try to. The library’s usage keeps on growing, and we are a small team of maintainers and contributors, backed by hundreds of open-source community members. We continue supporting all models that come out and will continue to do so in the foreseeable future.</p>
56
+ <p>This post dissects the design philosophy that makes this possible. It’s a continuation of our older principles, detailed on our previous <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page, as well as its accompanying <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post from 2022</a>. More recently, and I recommend reading it if you haven’t yet, a blog post about <a href="https://huggingface.co/blog/faster-transformers">recent upgrades to transformers</a> was written, explaining in particular what makes the library faster today. Again, all of that development was only made possible thanks to these principles.</p>
57
+ <p>We codify the “tenets” that guide our development, demonstrate how they are implemented in code, and show the measurable impact they have on the library’s sustainability and growth.</p>
58
+ <p>For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon <code>transformers</code>, but not only: any project of comparable size will require you to make deep choices, not only on design and choice of abstraction, but on the very mindset of the software you are building.</p>
59
+ <h2>The core tenets of transformers</h2>
60
+ <p>We summarize the foundations on which we’ve built everything, and write the “tenets” of the library. They behave like <em>software interfaces</em>, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.</p>
61
+ <p>Note that the library <em>evolved</em> towards these principles, and that they <em>emerged</em> from decisions taken, and once emerged they were recognized as critical.</p>
 
 
 
 
62
  <div class="tenet-list">
63
  <ol>
64
  <li class="tenet">
65
  <a id="source-of-truth"></a>
66
  <strong>Source of Truth</strong>
67
+ <p>We aim to be a <a href="https://huggingface.co/blog/transformers-model-definition">source of truth for all model definitions</a>. This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
68
  <em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
69
  </li>
70
  <li class="tenet">
71
  <a id="one-model-one-file"></a>
72
  <strong>One Model, One File</strong>
73
+ <p>All inference and training core logic has to be visible, top‑to‑bottom, to maximize each model's hackability.</p>
74
+ <em>Every model should be completely understandable and hackable by reading a single file from top to bottom.</em>
75
  </li>
76
  <li class="tenet">
77
  <a id="code-is-product"></a>
 
96
  <a id="minimal-user-api"></a>
97
  <strong>Minimal User API</strong>
98
  <p>Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths. Reading should be obvious, configurations should be obvious.</p>
99
+ <em>Keep the public interface simple and predictable, users should know what to expect.</em>
100
  </li>
101
  <li class="tenet">
102
  <a id="backwards-compatibility"></a>
103
  <strong>Backwards Compatibility</strong>
104
+ <p>Evolve by additive standardization, never break public APIs.</p>
105
+ <p>Any artifact that was once on the hub and loadable with transformers should be usable indefinitely with the same interface. Further, public methods should not change to avoid breaking dependencies.</p>
106
+ <em>Once something is public, it stays public, evolution through addition, not breaking changes.</em>
107
  </li>
108
  <li class="tenet">
109
  <a id="consistent-public-surface"></a>
110
  <strong>Consistent Public Surface</strong>
111
+ <p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. This is a goal we have as well as a tenet.</p>
112
  <em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
113
  </li>
 
 
 
 
 
 
114
  </ol>
115
  </div>
116
  <p>When a PR is merged, it is because the contribution is worthwhile, and because the <code>transformers</code> team finds the design of the contribution to be aligned with the tenets above.</p>
117
+ <p>Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors, and crannies everywhere, built by thousands of different workers. We <em>try</em> to make sure all the code added is compliant, because if we fail and merge it anyway, we cannot change it later lest we break <a href="#backwards-compatibility">backwards compatibility</a>.</p>
118
+ <p>For instance, one function essential to the implementation of <a href="https://huggingface.co/papers/2104.09864">Rotary Positional Embeddings</a> is identical in 70 <code>modeling_&lt;file&gt;.py</code> files across <code>src/transformers/models/</code>. Why keep it? Because we want all the model logic to be <a href="#one-model-one-file">contained in the modeling file</a>. In order to do that, we <a href="#do-repeat-yourself">do repeat ourselves</a>.</p>
119
  <pre><code class="language-python">def rotate_half(x):
120
  &quot;&quot;&quot;Rotates half the hidden dims of the input.&quot;&quot;&quot;
121
  x1 = x[..., : x.shape[-1] // 2]
 
123
  return torch.cat((-x2, x1), dim=-1)
124
  </code></pre>
125
  <p>You can use a simple regex to look at all methods of a given name across your codebase and look at their differences and similarities, that’s what I did (+ a hash to avoid quadraticity).</p>
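<p>As a rough illustration of that kind of scan (here with <code>ast</code> instead of a raw regex, and a hash of each function body to avoid quadratic comparisons; not the exact script used):</p>
<pre><code class="language-python">import ast
import hashlib
from collections import defaultdict
from pathlib import Path

def find_copies(function_name: str, root: str = "src/transformers/models"):
    """Group modeling files by the hash of a same-named function body."""
    buckets = defaultdict(list)
    for path in Path(root).rglob("modeling_*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name == function_name:
                digest = hashlib.sha1(ast.dump(node).encode()).hexdigest()
                buckets[digest].append(path.name)
    return buckets

for digest, files in find_copies("rotate_half").items():
    print(digest[:8], len(files), "identical copies")
</code></pre>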
126
+ <p>We want all models to have self-contained modeling code.</p>
127
+ <p>Every core functionality <em>must</em> be in the modeling code, every non-core functionality <em>can</em> be outside of it.</p>
128
+ <p>This comes at a great cost. Enter the <code>#Copied from...</code> mechanism: for a long time, these comments indicated that some code was copied from another model, saving time both for the reviewers and for the CI. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet we could not remove them.</p>
129
+ <p>We needed to separate the two principles that had so far been intertwined: <a href="#do-repeat-yourself">repetition</a> and <a href="#one-model-one-file">hackability</a>.</p>
130
+ <p>What was the solution to this?</p>
131
+ <h2><a id="modular"></a> Modular transformers</h2>
132
+ <p>Transformers is an opinionated library. The previous <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page and the <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post</a> were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. <a href="https://huggingface.co/docs/transformers/en/modular_transformers"><code>modular</code> transformers were introduced</a>, allowing a form of inheritance without breaking <a href="#one-model-one-file">One model, One file</a>.</p>
133
  <p>We amended the principle of <a href="#do-repeat-yourself">DRY*</a> by progressively removing all pieces of code that were “copied from” another file.</p>
134
+ <p>It works as follows. In order to contribute a model, you define a <code>modular_</code> file that can inherit from <em>any function across all other modeling, configuration and processor files</em>, as in the minimal sketch below.</p>
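<p>As a minimal, hypothetical sketch (the model name and parent choices are illustrative), a modular file can look like this; the full expanded modeling file, shown on the right-hand side below, is then generated automatically from it:</p>
<pre><code class="language-python"># modular_mymodel.py (hypothetical): inherit whole blocks from an existing model,
# only overriding what actually differs.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaAttention, LlamaDecoderLayer


class MyModelConfig(LlamaConfig):
    model_type = "mymodel"


class MyModelAttention(LlamaAttention):
    pass  # identical to Llama; the generated modeling file contains the expanded code


class MyModelDecoderLayer(LlamaDecoderLayer):
    pass
</code></pre>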
135
  <summary>Auto-generated modeling code</summary>
136
  <p><div class=code-compare style="display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.5rem 0;">
137
  <div class=code-column style="border: 1px solid #e2e8f0; border-radius: 8px; overflow: hidden;">
 
282
  <strong>Left:</strong> Clean modular definition with inheritance.
283
  <strong>Right:</strong> Auto-expanded version with all inherited functionality visible.
284
  </p></p>
285
+ <p>As you can see, we can now define any model as a <em>modular</em> of another.</p>
286
+ <p>You might think “well that’s just how inheritance works”. The crucial difference is that we do <em>visibly</em> what is essentially the <em>compiler</em>’s job: by unrolling the inheritances, we make visible all of the modeling code, keeping it <a href="#one-model-one-file">all in one piece</a>.</p>
287
+ <p>What is the consequence? When adding a model, we do not need to go over the entire modeling file. The modular (left side above) is enough.</p>
288
+ <p>When <code>AutoModel.from_pretrained(...)</code> is called, it is indeed the modeling (right side) that is run, and all the tests run against the modeling code.</p>
289
+ <p>What does that give us?</p>
290
+ <h3>A maintainable control surface</h3>
291
+ <p>The effect of modular can be measured straight from git history: at every commit, we look under the model directory.
292
+ If it only has a modeling file, we add its LOC count.
293
+ However, if a model has a <code>modular_*.py</code> and a corresponding automatically generated <code>modeling_*.py</code>, we only count the LOC under the modular file. The modeling code has no maintenance cost, as it is strictly dependent on the modular file.</p>
294
+ <p>That gives an “effective LOC” curve: the 𝗺𝗮𝗶𝗻𝘁𝗲𝗻𝗮𝗻𝗰𝗲 𝘀𝘂𝗿𝗳𝗮𝗰𝗲.</p>
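<p>Conceptually, a single snapshot of that count looks like the sketch below (the historical curve simply repeats it at every commit); the file-name conventions are real, the script itself is illustrative:</p>
<pre><code class="language-python">from pathlib import Path

def effective_loc(models_root: str = "src/transformers/models") -> int:
    """Count maintained lines: prefer modular_*.py, fall back to modeling_*.py."""
    total = 0
    for model_dir in sorted(Path(models_root).iterdir()):
        if not model_dir.is_dir():
            continue
        files = list(model_dir.glob("modular_*.py")) or list(model_dir.glob("modeling_*.py"))
        total += sum(len(f.read_text().splitlines()) for f in files)
    return total

print(effective_loc())
</code></pre>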
295
+ <p>𝗝𝘂𝘀𝘁 𝗹𝗼𝗼𝗸 𝗮𝘁 𝘁𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁: 𝘁𝗵𝗲 𝗴𝗿𝗼𝘄𝘁𝗵 𝗿𝗮𝘁𝗲 𝗼𝗳 𝗹𝗶𝗻𝗲𝘀 𝗼𝗳 𝗰𝗼𝗱𝗲 𝗰𝗼𝗹𝗹𝗮𝗽𝘀𝗲𝗱! Counting raw 𝚖𝚘𝚍𝚎𝚕𝚒𝚗𝚐_*.𝚙𝚢 (with “Copied from…” everywhere) we were around 362 new LOC/day; with 𝚖𝚘𝚍𝚞𝚕𝚊𝚛 in place the effective rate is ~25 LOC/day. About 𝟭𝟱× 𝗹𝗼𝘄𝗲𝗿! Had we continued with a strict “one model, one file” policy who knows where we’d have ended up.</p>
296
+ <p>Less code to hand-maintain means fewer places to break.</p>
297
+ <p>Cyclomatic complexity isn’t LOC, but they strongly correlate. As Les Hatton notes, defects scale like 𝙙 ~ 𝙭 𝙡𝙣 𝙭. Lower 𝘅 (lower loc) helps.</p>
298
+ <p><iframe src=https://molbap-loc-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
299
+ <p>There’s a sharp drop near the end; it’s due to us <a href="https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc">removing support for Jax and TensorFlow</a> library-wide.</p>
300
+ <p>Of course, it is not only this effort that allowed us to reduce the maintenance load.</p>
301
+ <p>A related optimization is the following. You’ve likely heard about <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a> and its many variants.</p>
302
+ <p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
303
+ <p>However, we were adding backend-specific torch operations for each of them (SDPA, flash-attention iterations, flex attention), and that was not a <a href="#minimal-user-api">minimal user API</a>.</p>
304
+ <h3><a id="attention-classes"></a> External Attention classes</h3>
305
+ <p>Externalising the <a href="#external-attention-classes">attention classes</a> has moved out a lot of repeated code that was <a href="#standardize-dont-abstract">standard</a>.</p>
306
+ <p>We moved to an <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allowed the following:</p>
307
+ <p>We keep a <code>Callable</code> for the naive implementation of the attention, called “eager” computation. This Callable is named <code>eager_attention_forward</code>, and can be run as long as the user has <code>torch</code> installed, which is a requirement in any case.</p>
308
+ <p>In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and other Callables are used, including kernel bindings.</p>
309
  <pre><code class="language-python">attention_interface: Callable = eager_attention_forward
310
  if self.config._attn_implementation != &quot;eager&quot;:
311
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
312
  </code></pre>
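<p>Per the <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> documentation, a custom implementation can also be registered under a name and selected through the config; the sketch below is simplified (the checkpoint is just an example, and the signature is abridged):</p>
<pre><code class="language-python">import torch
from transformers import AttentionInterface, AutoModelForCausalLM

def my_eager_like_attention(module, query, key, value, attention_mask, scaling=None, dropout=0.0, **kwargs):
    # Repeat K/V heads for grouped-query attention, then do a plain softmax attention.
    groups = query.shape[1] // key.shape[1]
    key, value = key.repeat_interleave(groups, dim=1), value.repeat_interleave(groups, dim=1)
    scaling = scaling if scaling is not None else query.shape[-1] ** -0.5
    scores = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        scores = scores + attention_mask[:, :, :, : key.shape[-2]]
    weights = torch.nn.functional.softmax(scores, dim=-1)
    output = torch.matmul(weights, value).transpose(1, 2).contiguous()
    return output, weights

AttentionInterface.register("my_eager_like_attention", my_eager_like_attention)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", attn_implementation="my_eager_like_attention")
</code></pre>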
313
+ <p>A strength of the new attention interface is that it can enforce specific kwargs, which are needed by kernel providers and other dependencies. We know that kwargs are often a necessary evil that plagues tools aiming for widespread compatibility; they are something we have aimed to reduce, and will continue to reduce, in order to improve readability. Even with them, the current system is a <a href="#minimal-user-api">minimal user api</a>.</p>
 
314
  <p>For better <em>information</em>, we plan to use <code>python</code> features such as <code>Annotated</code> to inform users of what we typically expect in an argument. That way, higher-level information could be included directly in the type annotations, like so (tentative design):</p>
315
  <pre><code class="language-python">from typing import Annotated
316
 
317
  MyModelOutputAnnotated = Annotated[MyModelOutput, &quot;shape: (B, C, H, W)&quot;]
318
  </code></pre>
319
+ <h3><a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism</h3>
320
+ <p>If you’re not familiar with the different flavours of parallelism, I recommend checking out <a href="https://huggingface.co/blog/accelerate-nd-parallel">this blog post</a> first; and of course a full <a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook">dive into the ultra-scale playbook</a> is always worthwhile.</p>
321
+ <p>The essential part is that, as <a href="https://huggingface.co/docs/transformers/v4.56.2/perf_train_gpu_many#tensor-parallelism">the documentation states</a>, when tensors get too large to fit on a single GPU, they are sliced along a particular dimension and every slice is sent to a different GPU.</p>
322
+ <p>Why does it matter?</p>
323
+ <p>Because we want to avoid code modifications that are unrelated to the model.
324
+ We choose to place the level of abstraction higher than the device placement: a matrix multiplication - an <code>nn.Linear</code> layer - should always be expressed in the same way, regardless of how it is placed.</p>
325
+ <p>Hence, we want to touch the modeling code <a href="#minimal-user-api">minimally</a>, and only modify it when <em>architectural changes</em> are involved. For instance, for tensor parallelism, we now specify a simple <code>tp_plan</code> instead.</p>
326
+ <p>The alternative would be to modify parent classes specific to their parallel layout, which would leak distribution logic into the modeling code.</p>
327
+ <p>It is written once in the config and passed to <code>.from_pretrained()</code>. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal <code>ParallelInterface</code>, which wires to sharding implementations <code>ColwiseParallel</code>, <code>RowwiseParallel</code>, packed variants, and so on.</p>
328
  <p><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
329
  base_model_tp_plan = {
330
  "layers.*.self_attn.q_proj": "colwise",
 
352
  <p>This allows a user to run with multiple processes per node, e.g. 4 GPUs:</p>
353
  <p><code>torchrun --nproc-per-node 4 demo.py</code></p>
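<p>A hypothetical <code>demo.py</code> for that command can be as small as the sketch below (the checkpoint name is just an example); note that nothing about parallelism appears in the modeling code itself:</p>
<pre><code class="language-python"># demo.py - run with: torchrun --nproc-per-node 4 demo.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, tp_plan="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Tensor parallelism lets us ", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</code></pre>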
354
  <p>Semantics stay in the model (a Linear stays a Linear); distribution is orthogonal and declared via strings: “colwise” splits columns of weights/bias across ranks; “rowwise” splits rows; packed variants shard fused weights. The mapping keys accept glob patterns like <code>layers.*.mlp.down_proj</code> to target repeated submodules.</p>
355
+ <h3><a id="layers-attentions-caches"></a> Layers, attentions and caches</h3>
356
  <p>Following the same logic, the <em>nature</em> of attention and caching per layer of a model should not be hardcoded. We should be able to specify, in a configuration-based fashion, how each layer is implemented. Thus we defined a set of allowed layer types that the configuration can then reference:</p>
357
  <pre><code class="language-python">ALLOWED_LAYER_TYPES = (
358
  &quot;full_attention&quot;,
 
371
  &quot;full_attention&quot;
372
  ],
373
  </code></pre>
374
+ <p>This is <a href="#minimal-user-api">minimal</a> to implement on the user side, and allows us to keep the modeling untouched. It is also easy to tweak, as the sketch below illustrates.</p>
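<p>A minimal sketch, with placeholder classes, of how such a per-layer list can drive construction without touching the layer implementations themselves:</p>
<pre><code class="language-python">import torch.nn as nn

class FullAttention(nn.Module):
    """Placeholder for a full-attention layer."""

class SlidingAttention(nn.Module):
    """Placeholder for a sliding-window attention layer."""

LAYER_CLASSES = {"full_attention": FullAttention, "sliding_attention": SlidingAttention}

class Decoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        # config.layer_types is the list from the snippet above,
        # e.g. ["sliding_attention", "full_attention", ...]
        self.layers = nn.ModuleList([LAYER_CLASSES[t]() for t in config.layer_types])
</code></pre>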
375
+ <h3><a id="community-kernels"></a>Community Kernels</h3>
376
  <p>The same principle extends to normalization, activation, and other code paths. The model defines <strong>semantics</strong>; a kernel defines <strong>how</strong> to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a <a href="#consistent-public-surface">consistent public surface</a></p>
377
  <pre><code class="language-python">@use_kernel_forward_from_hub(&quot;RMSNorm&quot;)
378
  class GlmRMSNorm(nn.Module):
 
380
  </code></pre>
381
  <p>Plus, this opened another angle of contribution for the community. People who are GPU whisperers can now contribute optimized kernels. You can check on the <a href="https://huggingface.co/blog/hello-hf-kernels">kernel community blog post</a> to learn more about it!</p>
382
  <p>Even more resources have been added, like the formidable <a href="https://github.com/huggingface/kernel-builder">kernel builder</a> with its connected resources to <a href="https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md">help you build kernels with it</a> and <a href="https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md">with nix</a>.</p>
383
+ <h2>Modular developments</h2>
384
  <p>Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to <em>define standards</em>. Pushing the boundaries of scientific knowledge can translate into the boundaries of engineering if this effort is made, and we’re striving for it.
385
  It’s hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
386
  So I wanted to take a look at the current <strong>state of modularity</strong> across the repository. How many models are defined using components of others?</p>
 
396
  <p>However, even if llava defines a few VLMs, there are far too many vision-based architectures that are not yet defined as modulars of other existing archs. In other words, there is no strong software reference point for vision models.
397
  As you can see, there is a small DETR island, a little llava pocket, and so on, but it’s not comparable to the centrality observed for llama.</p>
398
  <p>Another problem: this only covers <code>modular</code> models. Several models do NOT have a modular file.</p>
399
+ <h3>Many models, but not enough yet, are alike</h3>
400
  <p>So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters strung together. I also used code embedding models to check code similarities, and it yielded better results, but for the needs of this blog post I will stick to the Jaccard index.</p>
401
  <p>It is interesting, for that, to look at <em>when</em> we deployed this modular logic and what its rippling effect on the library was. You can check the <a href="https://huggingface.co/spaces/Molbap/transformers-modular-refactor">larger space</a> to play around, but the gist is: adding modular allowed us to connect more and more models to solid reference points. We still have a lot of gaps to fill in.</p>
402
  <p> <iframe src=https://molbap-timeline-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
403
  <p>If you’ve checked out llava, you’ve seen that llava_video is a red node, connected by a red edge to llava: it’s a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
404
+ <h3>VLM improvements, avoiding abstraction</h3>
405
  <p>We don’t have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attn bridges). This is one of the main areas where we can still improve.</p>
406
  <p>For instance, I thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into the LLM decoder in 95% of the existing VLMs. It would have looked something like this:</p>
407
  <pre><code class="language-python">class InputsEmbeddingMixerMixin(nn.Module):
 
451
 
452
  return special_image_mask, special_video_mask
453
  </code></pre>
454
+ <p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because it’d break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
455
+ <h3><a id="encoders-ftw"></a> Embedding models, now and forever.</h3>
 
 
 
 
 
 
 
 
456
  <p>Model popularity speaks for itself! This is because much of the usage of encoders lies in embeddings. So we have to keep the encoder side of the library viable, usable, and fine-tunable.</p>
457
  <p><html>
458
  <head><meta charset="utf-8" /></head>
 
4340
  </body>
4341
  </html></p>
4342
  <p>As the codebase grows, along with our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>, we need to keep maintaining this part as well. Retrieval use-cases and smart databases, such as FAISS-based indexing, rely on it, and thus indirectly on transformers.</p>
4343
+ <h3>On image processing and processors</h3>
4344
  <p>Choosing to be a <code>torch</code>-first library meant dropping a tremendous amount of support for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more liberal with the number of torch-dependent utilities we add. One of these is the <em>fast processing</em> of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code>- and <code>torchvision</code>-native inputs allowed us to massively speed up the processing time for each model.</p>
4345
+ <p>The gains in performance are immense: up to 20x speedups for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on GPU.</p>
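<p>Opting in is a one-liner on the user side (the checkpoint below is just an example), and the processor then works with <code>torch</code> tensors end-to-end:</p>
<pre><code class="language-python">import requests
from PIL import Image
from transformers import AutoImageProcessor

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)
</code></pre>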
4346
+ <p><img src="static/fast_image_processors.png" alt="Fast Image Processors Performance"></p>
4347
+ <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
4348
  <h2>Reduce barrier to entry/contribution</h2>
4349
+ <p>This is an overall objective: there’s no <code>transformers</code> without its community.</p>
4350
+ <p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
4351
+ <p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other pieces of software.</p>
4352
+ <p>In that regard, we DO want to be a modular toolbox, being <a href="#minimal-user-api">minimal</a> enough and well documented enough so any ML/AI developer can use <code>transformers</code> without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
4353
+ <p>So, how do these design choices, these “tenets”, influence the development of models and the overall usage of transformers?</p>
4354
+ <h3>A surgical toolbox for model development</h3>
4355
  <h3>Attention visualisation</h3>
4356
+ <p>All models have the same API internally for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>. This allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
4357
+ <p>One particular piece of machinery is the <code>attention mask</code>. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual “causal-only” models.</p>
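<p>The underlying pattern is simple to express in plain <code>torch</code> (a sketch, not the library’s actual mask-building code): everything inside the prefix attends bidirectionally, everything after it stays causal.</p>
<pre><code class="language-python">import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean mask where True means 'may attend'."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # bidirectional block over the (text + image) prefix
    return mask

print(prefix_lm_mask(seq_len=6, prefix_len=3).int())
</code></pre>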
 
4358
  <p>
4359
  <div style="max-width: 940px; margin: 16px 0; border:1px solid #2a2f3a; border-radius:8px; background:#0b0f19; font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, 'Liberation Mono', 'Courier New', monospace; color:#e5e7eb;">
4360
  <div style="display:flex; align-items:center; gap:8px; padding:8px 10px; border-bottom:1px solid #1f2430; background:#111827; border-top-left-radius:8px; border-top-right-radius:8px;">
 
4401
  </div>
4402
  </p>
4403
  <h3>Logging entire model activations</h3>
4404
+ <p>Further, because it is all PyTorch (even more so now that we only support PyTorch), we can easily <a href="https://huggingface.co/docs/transformers/internal/model_debugging_utils">debug any model</a> when we want to add it to transformers. We now have a power-user tool for porting or adding models that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.</p>
4405
  <p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our <a href="#source-of-truth">core guideline</a>.</p>
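<p>The same idea can be sketched with nothing but PyTorch forward hooks (this is not the transformers utility itself, just the gist of it):</p>
<pre><code class="language-python">import torch

def trace_forward(model, **inputs):
    """Run one forward pass and record per-module input/output shapes and dtypes."""
    log, handles = {}, []

    def make_hook(name):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            log[name] = {
                "inputs": [(tuple(a.shape), str(a.dtype)) for a in args if isinstance(a, torch.Tensor)],
                "output": (tuple(out.shape), str(out.dtype)) if isinstance(out, torch.Tensor) else str(type(out)),
            }
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        for handle in handles:
            handle.remove()
    return log
</code></pre>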
4406
  <p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
4407
  <h3>Cooking faster CUDA warmups</h3>
 
4457
  </div>
4458
 
4459
  <script>let animationSpeed=1/2.4,isRunning=!1,totalLayers=10;function startDemo(){isRunning||(isRunning=!0,document.getElementById("startBtn").disabled=!0,document.getElementById("resetBtn").disabled=!0,Promise.all([animateNoWarmup(),animateWithWarmup()]).then(()=>{isRunning=!1,document.getElementById("startBtn").disabled=!1,document.getElementById("resetBtn").disabled=!1}))}function resetDemo(){isRunning||(document.getElementById("noWarmupArea").innerHTML="",document.getElementById("warmupLayers").innerHTML="",document.getElementById("warmupFill").style.width="0%",document.getElementById("warmupContainer").classList.remove("allocated"),document.getElementById("noWarmupTime").textContent="0.00s",document.getElementById("warmupTime").textContent="0.00s",document.getElementById("noWarmupCounter").textContent="Layers loaded: 0/10",document.getElementById("warmupCounter").textContent="Layers loaded: 0/10",document.getElementById("noWarmupPhase").textContent="",document.getElementById("warmupPhase").textContent="")}async function animateNoWarmup(){let e=document.getElementById("noWarmupArea"),t=document.getElementById("noWarmupTime"),n=document.getElementById("noWarmupCounter"),a=document.getElementById("noWarmupPhase"),m=0,o=200/animationSpeed;a.textContent="Loading model layers...";for(let a=0;a<10;a++){let d=document.createElement("div");d.className="layer-box",e.appendChild(d),await sleep(.3*o),d.classList.add("allocating"),t.textContent=(m+=.08).toFixed(2)+"s",await sleep(.7*o),d.classList.remove("allocating"),d.classList.add("loaded"),t.textContent=(m+=.12).toFixed(2)+"s",n.textContent=`Layers loaded: ${a+1}/10`}a.textContent="Complete!"}async function animateWithWarmup(){let e=document.getElementById("warmupLayers"),t=document.getElementById("warmupTime"),n=document.getElementById("warmupCounter"),a=document.getElementById("warmupPhase"),m=document.getElementById("warmupContainer"),o=document.getElementById("warmupFill"),d=0,l=200/animationSpeed;a.textContent="Warming up allocator...",await sleep(2*l),m.classList.add("allocated"),t.textContent=(d+=.3).toFixed(2)+"s",a.textContent="Loading model layers...";for(let a=0;a<10;a++){let m=document.createElement("div");m.className="layer-box loaded",m.style.width="40px",m.style.height="20px",e.appendChild(m);let i=(a+1)/10*100;o.style.width=i+"%",await sleep(.5*l),t.textContent=(d+=.08).toFixed(2)+"s",n.textContent=`Layers loaded: ${a+1}/10`}a.textContent="Complete!"}function sleep(e){return new Promise(t=>setTimeout(t,e))}</script></p>
4460
+ <p>It’s hard to overstate how much of a lifesaver that is when you’re trying to load a model as fast as possible, as it’s the narrowest bottleneck for your iteration speed.</p>
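<p>The trick itself boils down to something like the sketch below (an illustration of the idea, not the internal implementation): reserve one large block up front so the caching allocator does not have to grow its pool layer by layer while the checkpoint is being copied in.</p>
<pre><code class="language-python">import torch

def warm_up_allocator(num_bytes: int, device: str = "cuda:0") -> None:
    # Allocate and free one big block; the CUDA caching allocator keeps the reservation,
    # so subsequent per-layer weight loads reuse it instead of hitting cudaMalloc repeatedly.
    block = torch.empty(num_bytes, dtype=torch.uint8, device=device)
    del block
    torch.cuda.synchronize(device)

warm_up_allocator(2 * 1024**3)  # e.g. ~2 GB reserved ahead of loading the weights
</code></pre>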
4461
+ <h3>Transformers-serve and continuous batching</h3>
4462
+ <p>Having all these models readily available allows using all of them with transformers-serve, and enables interfacing with them through an OpenAI-compatible API. As a reminder, the hub also opens access to various <a href="https://huggingface.co/docs/inference-providers/en/index">inference providers</a> if you’re interested in model deployment in general.</p>
4463
  <pre><code class="language-bash">transformers serve
4464
 
4465
  curl -X POST http://localhost:8000/v1/chat/completions \
 
4472
  <p>Transformers-serve is transformers-first, for sure, but it’s not limited to that. Adding a model to transformers means:</p>
4473
  <ul>
4474
  <li>having it immediately available to the community</li>
4475
+ <li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great blog post.</a></li>
4476
  </ul>
4477
  <p>This cements the need even more for a <a href="#consistent-public-surface">consistent public surface</a>: we are now a backend, and there is software more optimized than ours to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a> for instance.</p>
4478
  <h2>What is coming next</h2>
4479
+ <p>The next major version of <code>transformers</code> is just around the corner. When v5 is released, we will try to keep <a href="#backwards-compatibility">backwards compatibility</a> as solid as possible. The changes we make now are there to ensure this.</p>
4480
+ <p>What we aim to be is way more of a modular toolbox. What we are not is a framework: you should not be FORCED to rewrite every modeling file, but it is <em>better</em> for your model to inherit from PreTrainedModel and get tensor parallelism, from_pretrained, sharding, push_to_hub, loss, as well as PEFT/TRL/SGLang/vLLM compatibility and other fine-tuning and fast inference options.</p>
4481
 
4482
  </d-article>
4483
 
 
4505
 
4506
  // Extract tenet text for tooltips
4507
  const tenetTooltips = {
4508
+ 'source-of-truth': 'We aim to be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances.',
4509
+ 'one-model-one-file': 'All inference and training core logic has to be visible, top‑to‑bottom, to maximize each model\'s hackability.',
4510
  'code-is-product': 'Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.',
4511
  'standardize-dont-abstract': 'If it\'s model behavior, keep it in the file; abstractions only for generic infra.',
4512
  'do-repeat-yourself': 'Copy when it helps users; keep successors in sync without centralizing behavior.',
4513
  'minimal-user-api': 'Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths.',
4514
  'backwards-compatibility': 'Evolve by additive standardization, never break public APIs.',
4515
+ 'consistent-public-surface': 'Same argument names, same outputs, hidden states and attentions exposed, enforced by tests.',
4516
  };
4517
 
4518
+ // Add smooth scrolling and custom tooltips to all tenet links (TOC and article)
4519
  const tocLinks = document.querySelectorAll('d-contents a');
4520
  tocLinks.forEach(link => {
4521
  const href = link.getAttribute('href');
4522
  const anchor = href ? href.substring(1) : '';
4523
+
 
4524
  if (tenetTooltips[anchor]) {
4525
+ link.setAttribute('data-tooltip', tenetTooltips[anchor]);
4526
  link.style.position = 'relative';
4527
  }
4528
+
4529
  link.addEventListener('click', function(e) {
4530
  e.preventDefault();
4531
  const target = document.querySelector(this.getAttribute('href'));
 
4534
  }
4535
  });
4536
  });
4537
+
4538
+ // Add custom tooltips to tenet links in article content
4539
+ const articleLinks = document.querySelectorAll('d-article a[href^="#"]');
4540
+ articleLinks.forEach(link => {
4541
+ const href = link.getAttribute('href');
4542
+ const anchor = href ? href.substring(1) : '';
4543
+ if (tenetTooltips[anchor]) {
4544
+ link.setAttribute('data-tooltip', tenetTooltips[anchor]);
4545
+ }
4546
+ });
4547
 
4548
  // Update active state on scroll
4549
  window.addEventListener('scroll', function() {
dist/main.bundle.js CHANGED
@@ -1631,29 +1631,32 @@ p code, li code {
1631
  /* Distill article improvements */
1632
  d-article {
1633
  max-width: none;
1634
- font-size: 18px; /* Increased from default ~16px */
1635
- line-height: 1.7;
 
 
 
1636
  }
1637
 
1638
  d-article > * {
1639
- max-width: 1100px; /* Increased from 900px for more space */
1640
- margin-left: auto;
1641
- margin-right: auto;
1642
  }
1643
 
1644
- /* Make content even wider on large screens when TOC is present */
1645
- @media (min-width: 1400px) {
1646
  d-article > * {
1647
- max-width: 1300px;
 
1648
  }
1649
  }
1650
 
1651
  /* Improve paragraph readability */
1652
  d-article p {
1653
- font-size: 18px;
1654
- line-height: 1.8;
1655
- margin-bottom: 1.5rem;
1656
- color: #2d3748;
1657
  }
1658
 
1659
  /* Improve heading sizes */
@@ -1668,7 +1671,8 @@ d-article h1 {
1668
  d-article h2 {
1669
  font-size: 2.5rem;
1670
  line-height: 1.3;
1671
- margin: 2.5rem 0 1.5rem 0;
 
1672
  color: #1a202c;
1673
  font-weight: 650;
1674
  }
@@ -1697,7 +1701,7 @@ d-article ol li {
1697
  margin-bottom: 0.5rem;
1698
  }
1699
 
1700
- /* Enhanced tenet reference styling with tooltips */
1701
  a[href^="#source-of-truth"],
1702
  a[href^="#one-model-one-file"],
1703
  a[href^="#code-is-product"],
@@ -1713,7 +1717,6 @@ a[href^="#modular-toolbox"] {
1713
  text-decoration: underline;
1714
  text-decoration-color: rgba(102, 126, 234, 0.3);
1715
  transition: all 0.3s ease;
1716
- cursor: help;
1717
  }
1718
 
1719
  a[href^="#source-of-truth"]:hover,
@@ -1732,27 +1735,9 @@ a[href^="#modular-toolbox"]:hover {
1732
  border-radius: 4px;
1733
  }
1734
 
1735
- /* Tooltip content for each tenet */
1736
- a[href^="#source-of-truth"]:after { content: "We should be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances."; }
1737
- a[href^="#one-model-one-file"]:after { content: "All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom."; }
1738
- a[href^="#code-is-product"]:after { content: "Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial."; }
1739
- a[href^="#standardize-dont-abstract"]:after { content: "If it's model behavior, keep it in the file; abstractions only for generic infra."; }
1740
- a[href^="#do-repeat-yourself"]:after { content: "Copy when it helps users; keep successors in sync without centralizing behavior."; }
1741
- a[href^="#minimal-user-api"]:after { content: "Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths."; }
1742
- a[href^="#backwards-compatibility"]:after { content: "Evolve by additive standardization, never break public APIs."; }
1743
- a[href^="#consistent-public-surface"]:after { content: "Same argument names, same outputs, hidden states and attentions exposed."; }
1744
- a[href^="#modular-toolbox"]:after { content: "Provide tools and utilities, but don't force users into a rigid framework."; }
1745
-
1746
- /* Universal tooltip styling for tenet references */
1747
- a[href^="#source-of-truth"]:after,
1748
- a[href^="#one-model-one-file"]:after,
1749
- a[href^="#code-is-product"]:after,
1750
- a[href^="#standardize-dont-abstract"]:after,
1751
- a[href^="#do-repeat-yourself"]:after,
1752
- a[href^="#minimal-user-api"]:after,
1753
- a[href^="#backwards-compatibility"]:after,
1754
- a[href^="#consistent-public-surface"]:after,
1755
- a[href^="#modular-toolbox"]:after {
1756
  position: absolute;
1757
  bottom: 100%;
1758
  left: 50%;
@@ -1775,16 +1760,7 @@ a[href^="#modular-toolbox"]:after {
1775
  margin-bottom: 8px;
1776
  }
1777
 
1778
- /* Tooltip arrows */
1779
- a[href^="#source-of-truth"]:before,
1780
- a[href^="#one-model-one-file"]:before,
1781
- a[href^="#code-is-product"]:before,
1782
- a[href^="#standardize-dont-abstract"]:before,
1783
- a[href^="#do-repeat-yourself"]:before,
1784
- a[href^="#minimal-user-api"]:before,
1785
- a[href^="#backwards-compatibility"]:before,
1786
- a[href^="#consistent-public-surface"]:before,
1787
- a[href^="#modular-toolbox"]:before {
1788
  content: '';
1789
  position: absolute;
1790
  bottom: 100%;
@@ -1798,25 +1774,8 @@ a[href^="#modular-toolbox"]:before {
1798
  transition: opacity 0.3s ease, visibility 0.3s ease;
1799
  }
1800
 
1801
- /* Show tooltips on hover */
1802
- a[href^="#source-of-truth"]:hover:after,
1803
- a[href^="#one-model-one-file"]:hover:after,
1804
- a[href^="#code-is-product"]:hover:after,
1805
- a[href^="#standardize-dont-abstract"]:hover:after,
1806
- a[href^="#do-repeat-yourself"]:hover:after,
1807
- a[href^="#minimal-user-api"]:hover:after,
1808
- a[href^="#backwards-compatibility"]:hover:after,
1809
- a[href^="#consistent-public-surface"]:hover:after,
1810
- a[href^="#modular-toolbox"]:hover:after,
1811
- a[href^="#source-of-truth"]:hover:before,
1812
- a[href^="#one-model-one-file"]:hover:before,
1813
- a[href^="#code-is-product"]:hover:before,
1814
- a[href^="#standardize-dont-abstract"]:hover:before,
1815
- a[href^="#do-repeat-yourself"]:hover:before,
1816
- a[href^="#minimal-user-api"]:hover:before,
1817
- a[href^="#backwards-compatibility"]:hover:before,
1818
- a[href^="#consistent-public-surface"]:hover:before,
1819
- a[href^="#modular-toolbox"]:hover:before {
1820
  opacity: 1;
1821
  visibility: visible;
1822
  }
@@ -1834,6 +1793,36 @@ d-article blockquote {
1834
  color: #4a5568;
1835
  }
1836
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1837
  /* Full width elements */
1838
  d-article .code-compare,
1839
  d-article .interactive-demo,
@@ -1858,11 +1847,13 @@ d-article .memory-chart-container {
1858
  .tenet-list li.tenet {
1859
  padding: 1rem;
1860
  }
1861
-
1862
  .interactive-demo .demo-content {
1863
  padding: 1rem;
1864
  }
1865
- }`, "",{"version":3,"sources":["webpack://./src/transformers-custom.css"],"names":[],"mappings":"AAAA,4CAA4C;;AAE5C,2BAA2B;AAC3B;IACI,aAAa;IACb,8BAA8B;IAC9B,WAAW;IACX,cAAc;IACd,kBAAkB;AACtB;;AAEA;IACI,mBAAmB;IACnB,yBAAyB;IACzB,kBAAkB;IAClB,gBAAgB;IAChB,wCAAwC;AAC5C;;AAEA;IACI,mBAAmB;IACnB,qBAAqB;IACrB,gBAAgB;IAChB,cAAc;IACd,gCAAgC;IAChC,gBAAgB;AACpB;;AAEA;IACI,SAAS;IACT,aAAa;IACb,mBAAmB;IACnB,gBAAgB;IAChB,iBAAiB;IACjB,gBAAgB;AACpB;;AAEA;IACI,cAAc;AAClB;;AAEA,8CAA8C;AAC9C;IACI;QACI,0BAA0B;QAC1B,SAAS;IACb;AACJ;;AAEA,+DAA+D;AAC/D;IACI,cAAc;AAClB;;AAEA;IACI,+BAA+B,EAAE,iBAAiB;IAClD,gBAAgB;IAChB,eAAe;IACf,aAAa;IACb,0BAA0B;IAC1B,WAAW;IACX,gBAAgB;IAChB,cAAc;AAClB;;AAEA;IACI,gCAAgC;IAChC,6DAA6D;IAC7D,yBAAyB;IACzB,mBAAmB;IACnB,4BAA4B;IAC5B,SAAS;IACT,kBAAkB;IAClB,2CAA2C;IAC3C,yBAAyB;IACzB,eAAe;AACnB;;AAEA;IACI,uCAAuC;IACvC,2CAA2C;IAC3C,oCAAoC;IACpC,6DAA6D;AACjE;;AAEA,8BAA8B;AAC9B,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;;AAE1G;IACI,+BAA+B;IAC/B,kBAAkB;IAClB,UAAU;IACV,WAAW;IACX,YAAY;IACZ,WAAW;IACX,YAAY;IACZ,kBAAkB;IAClB,aAAa;IACb,mBAAmB;IACnB,uBAAuB;IACvB,gBAAgB;IAChB,iBAAiB;IACjB,0CAA0C;IAC1C,uBAAuB;AAC3B;;AAEA;IACI,cAAc;IACd,gBAAgB;IAChB,cAAc;IACd,qBAAqB;AACzB;;AAEA;IACI,cAAc;IACd,iBAAiB;IACjB,kBAAkB;IAClB,cAAc;IACd,mBAAmB;IACnB,aAAa;IACb,+BAA+B;IAC/B,kBAAkB;IAClB,8BAA8B;AAClC;;AAEA;IACI,cAAc;IACd,gBAAgB;IAChB,gBAAgB;AACpB;;AAEA,iDAAiD;AACjD;IACI,KAAK,0CAA0C,EAAE;IACjD,MAAM,0CAA0C,EAAE;IAClD,OAAO,0CAA0C,EAAE;AACvD;;AAEA;IACI,6CAA6C;AACjD;;AAEA,kCAAkC;AAClC;IACI,yBAAyB;IACzB,mBAAmB;IACnB,mBAAmB;IACnB,cAAc;IACd,gBAAgB;IAChB,yCAAyC;AAC7C;;AAEA,yCAAyC;AACzC;IACI,6BAA6B;IAC7B,mCAAmC;AACvC;;AAEA;IACI,6DAA6D;IAC7D,YAAY;IACZ,oBAAoB;IACpB,gBAAgB;AACpB;;AAEA;IACI,eAAe;AACnB;;AAEA;IACI,mBAAmB;IACnB,oBAAoB;IACpB,6BAA6B;IAC7B,cAAc;IACd,gBAAgB;AACpB;;AAEA,4CAA4C;AAC5C;IACI,6DAA6D;IAC7D,YAAY;IACZ,YAAY;IACZ,uBAAuB;IACvB,kBAAkB;IAClB,gBAAgB;IAChB,eAAe;IACf,2CAA2C;AAC/C;;AAEA;IACI,2BAA2B;IAC3B,+CAA+C;AACnD;;AAEA;IACI,YAAY;IACZ,mBAAmB;IACnB,eAAe;IACf,gBAAgB;AACpB;;AAEA,qBAAqB;AACrB;IACI,mBAAmB;IACnB,kBAAkB;IAClB,aAAa;IACb,cAAc;IACd,wDAAwD;IACxD,gBAAgB;AACpB;;AAEA;IACI,mBAAmB;IACnB,yBAAyB;IACzB,cAAc;IACd,eAAe;IACf,kBAAkB;IAClB,WAAW;IACX,oBAAoB;AACxB;;AAEA;IACI,mBAAmB;IACnB,aAAa;IACb,kBAAkB;IAClB,qBAAqB;IACrB,qBAAqB;IACrB,iBAAiB;IACjB,iBAAiB;IACjB,gBAAgB;AACpB;;AAEA,oCAAoC;AACpC;IACI,sBAAsB;IACtB,gBAAgB;IAChB,yBAAyB;IACzB,cAAc;AAClB;;AAEA;IACI,sBAAsB;IACtB,gBAAgB;IAChB,kBAAkB;IAClB,eAAe;AACnB;;AAEA,yBAAyB;AACzB;IACI,mBAAmB;IACnB,yBAAyB;IACzB,kBAAkB;IAClB,aAAa;IACb,cAAc;AAClB;;AAEA,+BAA+B;AAC/B;IACI,eAAe;IACf,YAAY;IACZ,kBAAkB;IAClB,yCAAyC;IACzC,gBAAgB;AACpB;;AAEA,kEAAkE;AAClE;IACI;QACI,4BAA4B;IAChC;;IAEA;QACI,4BAA4B;QAC5B,4BAA4B;QAC5B,+BAA+B;QAC/B,6BAA6B;QAC7B,kCAAkC;QAClC,4BAA4B;QAC5B,0BAA0B;QAC1B,6BAA6B;QAC7B,4BAA4B;QAC5B,mCAAmC,EAAE,eAAe;QACpD,2BAA2B;QAC3B,oBAAoB;QACpB,2BAA2B;QAC3B,qCAAqC;QACrC,gCAAgC;QAChC,+CAA+C;QAC/C,wBAAwB;QACxB,yBAAyB;QACzB,8BAA8B;IAClC;AACJ;;AAEA;IACI;QACI,wBAAwB;QACxB,4BAA4B;QAC5B,8BAA8B;QAC9B,4BAA4B;QAC5B,gCAAgC;QAChC,6BAA6B;QAC7B,+BAA+B;QAC/B,sDAAsD;QACtD,6BAA6B;QAC7B,qCAAqC;QACrC,gCAAgC;QAChC,wBAAwB;IAC5B;AACJ;;AAEA,0DAA0D;AAC1D;IACI,yBAAyB;IACzB,8BAA8B;IAC9B,qBAAqB;AACzB;;AAEA,2BAA2B;AAC3B;IACI,qBAAqB;IACrB,gCAAgC;IAChC,sBAAsB;AAC1B;;AAEA;IACI,iBAAiB;IACjB,gBAAgB;IAChB,WAAW;AACf;;AAEA;IACI,yBAAyB;IACzB,qBAAqB;IACrB,mBAAmB;IACnB,cAAc;IACd,iBAAiB;IACjB,gBAAgB;IAChB,gBAAgB;IAChB,2BAA2B;AAC/B;;AAEA;IACI,cAAc;IACd,qBAA
qB;AACzB;;AAEA;IACI,cAAc;IACd,gBAAgB;AACpB;;AAEA;IACI,qBAAqB;AACzB;;AAEA,qBAAqB;AACrB;IACI,qBAAqB;IACrB,mDAAmD;AACvD;;AAEA;IACI,UAAU;AACd;;AAEA;IACI,uBAAuB;AAC3B;;AAEA;IACI,kCAAkC;IAClC,kBAAkB;AACtB;;AAEA;IACI,kCAAkC;AACtC;;AAEA,2CAA2C;AAC3C;IACI,kBAAkB;IAClB,YAAY;AAChB;;AAEA;IACI,cAAc;AAClB;;AAEA,8DAA8D;AAC9D;IACI,oBAAoB;IACpB,kBAAkB;IAClB,UAAU;IACV,QAAQ;IACR,2BAA2B;IAC3B,mBAAmB;IACnB,YAAY;IACZ,qBAAqB;IACrB,kBAAkB;IAClB,iBAAiB;IACjB,mBAAmB;IACnB,YAAY;IACZ,gBAAgB;IAChB,aAAa;IACb,UAAU;IACV,kBAAkB;IAClB,mDAAmD;IACnD,oBAAoB;IACpB,yCAAyC;AAC7C;;AAEA;IACI,WAAW;IACX,kBAAkB;IAClB,UAAU;IACV,QAAQ;IACR,gCAAgC;IAChC,6BAA6B;IAC7B,2BAA2B;IAC3B,aAAa;IACb,UAAU;IACV,kBAAkB;IAClB,mDAAmD;AACvD;;AAEA;;IAEI,UAAU;IACV,mBAAmB;AACvB;;AAEA,+BAA+B;AAC/B;IACI;QACI,UAAU;QACV,WAAW;QACX,kBAAkB;QAClB,YAAY;IAChB;;IAEA;QACI,UAAU;QACV,WAAW;QACX,+BAA+B;QAC/B,+BAA+B;QAC/B,0BAA0B;IAC9B;AACJ;;AAEA,gDAAgD;AAChD;IACI,8BAA8B;IAC9B,oCAAoC;IACpC,6BAA6B;IAC7B,0BAA0B;IAC1B,2BAA2B;IAC3B,2BAA2B;IAC3B,2BAA2B;IAC3B,2BAA2B;AAC/B;;AAEA;IACI,2BAA2B;IAC3B,kFAAkF;IAClF,yBAAyB;AAC7B;;AAEA,gBAAgB;AAChB;IACI,8BAA8B;IAC9B,+BAA+B;IAC/B,6BAA6B;IAC7B,2BAA2B;IAC3B,yBAAyB;AAC7B;;AAEA,iCAAiC;AACjC;IACI,eAAe;IACf,eAAe,EAAE,iCAAiC;IAClD,gBAAgB;AACpB;;AAEA;IACI,iBAAiB,EAAE,wCAAwC;IAC3D,iBAAiB;IACjB,kBAAkB;AACtB;;AAEA,iEAAiE;AACjE;IACI;QACI,iBAAiB;IACrB;AACJ;;AAEA,kCAAkC;AAClC;IACI,eAAe;IACf,gBAAgB;IAChB,qBAAqB;IACrB,cAAc;AAClB;;AAEA,0BAA0B;AAC1B;IACI,eAAe;IACf,gBAAgB;IAChB,qBAAqB;IACrB,cAAc;IACd,gBAAgB;AACpB;;AAEA;IACI,iBAAiB;IACjB,gBAAgB;IAChB,yBAAyB;IACzB,cAAc;IACd,gBAAgB;AACpB;;AAEA;IACI,eAAe;IACf,gBAAgB;IAChB,qBAAqB;IACrB,cAAc;IACd,gBAAgB;AACpB;;AAEA;IACI,iBAAiB;IACjB,gBAAgB;IAChB,uBAAuB;IACvB,cAAc;IACd,gBAAgB;AACpB;;AAEA,6BAA6B;AAC7B;;IAEI,eAAe;IACf,gBAAgB;IAChB,qBAAqB;AACzB;;AAEA,mDAAmD;AACnD;;;;;;;;;IASI,kBAAkB;IAClB,cAAc;IACd,gBAAgB;IAChB,0BAA0B;IAC1B,+CAA+C;IAC/C,yBAAyB;IACzB,YAAY;AAChB;;AAEA;;;;;;;;;IASI,cAAc;IACd,8BAA8B;IAC9B,oCAAoC;IACpC,gBAAgB;IAChB,kBAAkB;AACtB;;AAEA,mCAAmC;AACnC,oCAAoC,uKAAuK,EAAE;AAC7M,uCAAuC,oHAAoH,EAAE;AAC7J,oCAAoC,wKAAwK,EAAE;AAC9M,8CAA8C,4FAA4F,EAAE;AAC5I,uCAAuC,2FAA2F,EAAE;AACpI,qCAAqC,8HAA8H,EAAE;AACrK,4CAA4C,uEAAuE,EAAE;AACrH,8CAA8C,mFAAmF,EAAE;AACnI,oCAAoC,qFAAqF,EAAE;;AAE3H,mDAAmD;AACnD;;;;;;;;;IASI,kBAAkB;IAClB,YAAY;IACZ,SAAS;IACT,2BAA2B;IAC3B,mBAAmB;IACnB,YAAY;IACZ,qBAAqB;IACrB,kBAAkB;IAClB,iBAAiB;IACjB,gBAAgB;IAChB,mBAAmB;IACnB,YAAY;IACZ,gBAAgB;IAChB,aAAa;IACb,UAAU;IACV,kBAAkB;IAClB,mDAAmD;IACnD,oBAAoB;IACpB,yCAAyC;IACzC,kBAAkB;AACtB;;AAEA,mBAAmB;AACnB;;;;;;;;;IASI,WAAW;IACX,kBAAkB;IAClB,YAAY;IACZ,SAAS;IACT,2BAA2B;IAC3B,6BAA6B;IAC7B,yBAAyB;IACzB,aAAa;IACb,UAAU;IACV,kBAAkB;IAClB,mDAAmD;AACvD;;AAEA,2BAA2B;AAC3B;;;;;;;;;;;;;;;;;;IAkBI,UAAU;IACV,mBAAmB;AACvB;;AAEA,+BAA+B;AAC/B;IACI,eAAe;IACf,gBAAgB;IAChB,oBAAoB;IACpB,cAAc;IACd,8BAA8B;IAC9B,4DAA4D;IAC5D,0BAA0B;IAC1B,kBAAkB;IAClB,cAAc;AAClB;;AAEA,wBAAwB;AACxB;;;IAGI,eAAe;IACf,WAAW;IACX,cAAc;IACd,eAAe;AACnB;;AAEA,mCAAmC;AACnC;IACI;;QAEI,cAAc;QACd,iBAAiB;QACjB,kBAAkB;IACtB;AACJ;;AAEA;IACI;QACI,aAAa;IACjB;;IAEA;QACI,aAAa;IACjB;AACJ","sourcesContent":["/* Transformers-specific styling additions */\n\n/* Code comparison layout */\n.code-compare {\n display: grid;\n grid-template-columns: 1fr 1fr;\n gap: 1.5rem;\n margin: 2rem 0;\n align-items: start;\n}\n\n.code-compare .code-column {\n background: #ffffff;\n border: 1px solid #e2e8f0;\n border-radius: 8px;\n overflow: hidden;\n box-shadow: 0 1px 3px rgba(0, 0, 0, 0.1);\n}\n\n.code-compare .code-header {\n background: #f8f9fa;\n padding: 0.75rem 1rem;\n font-weight: 600;\n color: #495057;\n border-bottom: 1px solid 
#e2e8f0;\n font-size: 0.9em;\n}\n\n.code-compare pre {\n margin: 0;\n padding: 1rem;\n background: #ffffff;\n overflow-x: auto;\n font-size: 0.85em;\n line-height: 1.4;\n}\n\n.code-compare pre code {\n color: #374151;\n}\n\n/* Mobile responsiveness for code comparison */\n@media (max-width: 768px) {\n .code-compare {\n grid-template-columns: 1fr;\n gap: 1rem;\n }\n}\n\n/* Tenet styling - special highlighting for design principles */\n.tenet-list {\n margin: 3rem 0;\n}\n\n.tenet-list ol {\n counter-reset: tenet-counter -1; /* Start from 0 */\n list-style: none;\n padding-left: 0;\n display: grid;\n grid-template-columns: 1fr;\n gap: 2.5rem;\n max-width: 900px;\n margin: 0 auto;\n}\n\n.tenet-list li.tenet {\n counter-increment: tenet-counter;\n background: linear-gradient(135deg, #ffffff 0%, #f8f9fa 100%);\n border: 2px solid #e2e8f0;\n border-radius: 16px;\n padding: 2rem 2rem 2rem 4rem;\n margin: 0;\n position: relative;\n box-shadow: 0 12px 35px rgba(0, 0, 0, 0.12);\n transition: all 0.3s ease;\n cursor: pointer;\n}\n\n.tenet-list li.tenet:hover {\n transform: translateY(-8px) scale(1.02);\n box-shadow: 0 20px 50px rgba(0, 0, 0, 0.25);\n border-color: rgba(0, 123, 255, 0.5);\n background: linear-gradient(135deg, #ffffff 0%, #f0f8ff 100%);\n}\n\n/* Colorful numbering system */\n.tenet-list li.tenet:nth-child(1):before { background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); }\n.tenet-list li.tenet:nth-child(2):before { background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); }\n.tenet-list li.tenet:nth-child(3):before { background: linear-gradient(135deg, #4facfe 0%, #00f2fe 100%); }\n.tenet-list li.tenet:nth-child(4):before { background: linear-gradient(135deg, #43e97b 0%, #38f9d7 100%); }\n.tenet-list li.tenet:nth-child(5):before { background: linear-gradient(135deg, #fa709a 0%, #fee140 100%); }\n.tenet-list li.tenet:nth-child(6):before { background: linear-gradient(135deg, #a8edea 0%, #fed6e3 100%); }\n.tenet-list li.tenet:nth-child(7):before { background: linear-gradient(135deg, #ff9a9e 0%, #fecfef 100%); }\n.tenet-list li.tenet:nth-child(8):before { background: linear-gradient(135deg, #a18cd1 0%, #fbc2eb 100%); }\n.tenet-list li.tenet:nth-child(9):before { background: linear-gradient(135deg, #ffecd2 0%, #fcb69f 100%); }\n\n.tenet-list li.tenet:before {\n content: counter(tenet-counter);\n position: absolute;\n top: -12px;\n left: -12px;\n color: white;\n width: 48px;\n height: 48px;\n border-radius: 50%;\n display: flex;\n align-items: center;\n justify-content: center;\n font-size: 1.2em;\n font-weight: bold;\n box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);\n border: 3px solid white;\n}\n\n.tenet-list li.tenet strong {\n color: #1a202c;\n font-size: 1.1em;\n display: block;\n margin-bottom: 0.5rem;\n}\n\n.tenet-list li.tenet em {\n color: #4a5568;\n font-size: 0.95em;\n font-style: italic;\n display: block;\n margin-top: 0.75rem;\n padding: 1rem;\n background: rgba(0, 0, 0, 0.03);\n border-radius: 8px;\n border-left: 3px solid #e2e8f0;\n}\n\n.tenet-list li.tenet p {\n color: #2d3748;\n line-height: 1.6;\n margin: 0.5rem 0;\n}\n\n/* Add a subtle pulse animation for the numbers */\n@keyframes pulse-glow {\n 0% { box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15); }\n 50% { box-shadow: 0 4px 20px rgba(0, 0, 0, 0.25); }\n 100% { box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15); }\n}\n\n.tenet-list li.tenet:hover:before {\n animation: pulse-glow 2s ease-in-out infinite;\n}\n\n/* Interactive component styling */\n.interactive-demo {\n border: 1px solid #e2e8f0;\n border-radius: 12px;\n 
background: #ffffff;\n margin: 2rem 0;\n overflow: hidden;\n box-shadow: 0 4px 6px rgba(0, 0, 0, 0.07);\n}\n\n/* Model visualization fragment styling */\n[id*=\"plot-model-visualisation\"] {\n margin: 1rem -2rem !important;\n width: calc(100% + 4rem) !important;\n}\n\n.interactive-demo .demo-header {\n background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);\n color: white;\n padding: 1rem 1.5rem;\n font-weight: 600;\n}\n\n.interactive-demo .demo-content {\n padding: 1.5rem;\n}\n\n.interactive-demo .demo-footer {\n background: #f8f9fa;\n padding: 1rem 1.5rem;\n border-top: 1px solid #e2e8f0;\n color: #6c757d;\n font-size: 0.9em;\n}\n\n/* Button styling for interactive elements */\n.btn-primary {\n background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);\n border: none;\n color: white;\n padding: 0.75rem 1.5rem;\n border-radius: 6px;\n font-weight: 500;\n cursor: pointer;\n transition: transform 0.2s, box-shadow 0.2s;\n}\n\n.btn-primary:hover {\n transform: translateY(-1px);\n box-shadow: 0 4px 12px rgba(102, 126, 234, 0.3);\n}\n\n.btn-primary:disabled {\n opacity: 0.6;\n cursor: not-allowed;\n transform: none;\n box-shadow: none;\n}\n\n/* Terminal styling */\n.terminal-container {\n background: #1a202c;\n border-radius: 8px;\n padding: 1rem;\n color: #e2e8f0;\n font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;\n font-size: 0.9em;\n}\n\n.terminal-input {\n background: #2d3748;\n border: 1px solid #4a5568;\n color: #e2e8f0;\n padding: 0.5rem;\n border-radius: 4px;\n width: 100%;\n font-family: inherit;\n}\n\n.terminal-output {\n background: #0a0e1a;\n padding: 1rem;\n border-radius: 4px;\n white-space: pre-wrap;\n word-break: break-all;\n min-height: 100px;\n max-height: 300px;\n overflow-y: auto;\n}\n\n/* Attention visualization styling */\n.attention-matrix {\n font-family: monospace;\n font-size: 0.8em;\n border-collapse: collapse;\n margin: 1rem 0;\n}\n\n.attention-matrix td {\n border: 1px solid #ddd;\n padding: 4px 8px;\n text-align: center;\n min-width: 50px;\n}\n\n/* Memory chart styling */\n.memory-chart-container {\n background: #f8f9fa;\n border: 2px solid #e9ecef;\n border-radius: 8px;\n padding: 1rem;\n margin: 1rem 0;\n}\n\n/* Image styling improvements */\nimg {\n max-width: 100%;\n height: auto;\n border-radius: 8px;\n box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);\n margin: 1.5rem 0;\n}\n\n/* Table of contents styling - Fixed positioning like ultrascale */\n@media (min-width: 1200px) {\n d-article {\n overflow: visible !important;\n }\n \n d-contents {\n align-self: start !important;\n background: white !important;\n grid-column-start: 1 !important;\n grid-column-end: 4 !important;\n grid-row: auto / span 6 !important;\n justify-self: end !important;\n margin-top: 0em !important;\n padding-right: 3em !important;\n padding-left: 2em !important;\n position: -webkit-sticky !important; /* For Safari */\n position: sticky !important;\n top: 10px !important;\n overflow-y: auto !important;\n height: calc(100vh - 40px) !important;\n scrollbar-width: none !important;\n transition: max-height 0.3s ease-out !important;\n z-index: -100 !important;\n display: block !important;\n visibility: visible !important;\n }\n}\n\n@media (max-width: 1199px) {\n d-contents {\n display: none !important;\n background: white !important;\n justify-self: start !important;\n align-self: start !important;\n padding-bottom: 0.5em !important;\n margin-bottom: 1em !important;\n padding-left: 0.25em !important;\n border-bottom: 1px solid rgba(0, 0, 0, 0.1) !important;\n overflow-y: scroll 
!important;\n height: calc(100vh - 40px) !important;\n scrollbar-width: none !important;\n z-index: -100 !important;\n }\n}\n\n/* Force TOC to be visible and override distill defaults */\nd-contents {\n display: block !important;\n visibility: visible !important;\n opacity: 1 !important;\n}\n\n/* TOC Navigation styling */\nd-contents .toc-header {\n margin-bottom: 1.5rem;\n border-bottom: 2px solid #007bff;\n padding-bottom: 0.5rem;\n}\n\nd-contents .toc-title {\n font-weight: bold;\n font-size: 1.2em;\n color: #333;\n}\n\nd-contents nav a {\n color: rgba(0, 0, 0, 0.7);\n text-decoration: none;\n border-bottom: none;\n display: block;\n padding: 0.3rem 0;\n font-size: 0.9em;\n line-height: 1.4;\n transition: color 0.2s ease;\n}\n\nd-contents nav a:hover {\n color: #007bff;\n text-decoration: none;\n}\n\nd-contents nav a.active {\n color: #007bff;\n font-weight: 600;\n}\n\nd-contents nav div {\n margin-bottom: 0.2rem;\n}\n\n/* Smooth scrollbar */\nd-contents {\n scrollbar-width: thin;\n scrollbar-color: rgba(0, 123, 255, 0.3) transparent;\n}\n\nd-contents::-webkit-scrollbar {\n width: 6px;\n}\n\nd-contents::-webkit-scrollbar-track {\n background: transparent;\n}\n\nd-contents::-webkit-scrollbar-thumb {\n background: rgba(0, 123, 255, 0.3);\n border-radius: 3px;\n}\n\nd-contents::-webkit-scrollbar-thumb:hover {\n background: rgba(0, 123, 255, 0.5);\n}\n\n/* Custom tooltip styling for tenet links */\nd-contents nav a[title] {\n position: relative;\n cursor: help;\n}\n\nd-contents nav a[title]:hover {\n color: #667eea;\n}\n\n/* Enhanced tooltip using CSS (fallback for title attribute) */\nd-contents nav a[title]:after {\n content: attr(title);\n position: absolute;\n left: 100%;\n top: 50%;\n transform: translateY(-50%);\n background: #1a202c;\n color: white;\n padding: 0.75rem 1rem;\n border-radius: 8px;\n font-size: 0.85em;\n white-space: normal;\n width: 300px;\n line-height: 1.4;\n z-index: 1001;\n opacity: 0;\n visibility: hidden;\n transition: opacity 0.3s ease, visibility 0.3s ease;\n pointer-events: none;\n box-shadow: 0 4px 12px rgba(0, 0, 0, 0.2);\n}\n\nd-contents nav a[title]:before {\n content: '';\n position: absolute;\n left: 100%;\n top: 50%;\n transform: translate(-8px, -50%);\n border: 8px solid transparent;\n border-right-color: #1a202c;\n z-index: 1002;\n opacity: 0;\n visibility: hidden;\n transition: opacity 0.3s ease, visibility 0.3s ease;\n}\n\nd-contents nav a[title]:hover:after,\nd-contents nav a[title]:hover:before {\n opacity: 1;\n visibility: visible;\n}\n\n/* Adjust for smaller screens */\n@media (max-width: 1400px) {\n d-contents nav a[title]:after {\n left: auto;\n right: 100%;\n margin-right: 1rem;\n width: 250px;\n }\n \n d-contents nav a[title]:before {\n left: auto;\n right: 100%;\n transform: translate(8px, -50%);\n border-right-color: transparent;\n border-left-color: #1a202c;\n }\n}\n\n/* Improve code syntax highlighting with Prism */\npre[class*=\"language-\"] {\n background: #f8f9fa !important;\n border: 1px solid #e9ecef !important;\n border-radius: 8px !important;\n padding: 1.5rem !important;\n margin: 1.5rem 0 !important;\n overflow-x: auto !important;\n font-size: 0.9em !important;\n line-height: 1.5 !important;\n}\n\ncode[class*=\"language-\"] {\n background: none !important;\n font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', 'Courier New', monospace !important;\n color: #383a42 !important;\n}\n\n/* Inline code */\np code, li code {\n background: #f1f3f4 !important;\n padding: 0.2em 0.4em !important;\n border-radius: 3px !important;\n font-size: 
0.9em !important;\n color: #d73a49 !important;\n}\n\n/* Distill article improvements */\nd-article {\n max-width: none;\n font-size: 18px; /* Increased from default ~16px */\n line-height: 1.7;\n}\n\nd-article > * {\n max-width: 1100px; /* Increased from 900px for more space */\n margin-left: auto;\n margin-right: auto;\n}\n\n/* Make content even wider on large screens when TOC is present */\n@media (min-width: 1400px) {\n d-article > * {\n max-width: 1300px;\n }\n}\n\n/* Improve paragraph readability */\nd-article p {\n font-size: 18px;\n line-height: 1.8;\n margin-bottom: 1.5rem;\n color: #2d3748;\n}\n\n/* Improve heading sizes */\nd-article h1 {\n font-size: 3rem;\n line-height: 1.2;\n margin: 3rem 0 2rem 0;\n color: #1a202c;\n font-weight: 700;\n}\n\nd-article h2 {\n font-size: 2.5rem;\n line-height: 1.3;\n margin: 2.5rem 0 1.5rem 0;\n color: #1a202c;\n font-weight: 650;\n}\n\nd-article h3 {\n font-size: 2rem;\n line-height: 1.4;\n margin: 2rem 0 1rem 0;\n color: #1a202c;\n font-weight: 600;\n}\n\nd-article h4 {\n font-size: 1.5rem;\n line-height: 1.4;\n margin: 1.5rem 0 1rem 0;\n color: #2d3748;\n font-weight: 600;\n}\n\n/* Improve list readability */\nd-article ul li,\nd-article ol li {\n font-size: 18px;\n line-height: 1.7;\n margin-bottom: 0.5rem;\n}\n\n/* Enhanced tenet reference styling with tooltips */\na[href^=\"#source-of-truth\"],\na[href^=\"#one-model-one-file\"],\na[href^=\"#code-is-product\"],\na[href^=\"#standardize-dont-abstract\"],\na[href^=\"#do-repeat-yourself\"],\na[href^=\"#minimal-user-api\"],\na[href^=\"#backwards-compatibility\"],\na[href^=\"#consistent-public-surface\"],\na[href^=\"#modular-toolbox\"] {\n position: relative;\n color: #667eea;\n font-weight: 600;\n text-decoration: underline;\n text-decoration-color: rgba(102, 126, 234, 0.3);\n transition: all 0.3s ease;\n cursor: help;\n}\n\na[href^=\"#source-of-truth\"]:hover,\na[href^=\"#one-model-one-file\"]:hover,\na[href^=\"#code-is-product\"]:hover,\na[href^=\"#standardize-dont-abstract\"]:hover,\na[href^=\"#do-repeat-yourself\"]:hover,\na[href^=\"#minimal-user-api\"]:hover,\na[href^=\"#backwards-compatibility\"]:hover,\na[href^=\"#consistent-public-surface\"]:hover,\na[href^=\"#modular-toolbox\"]:hover {\n color: #4c51bf;\n text-decoration-color: #4c51bf;\n background: rgba(102, 126, 234, 0.1);\n padding: 2px 4px;\n border-radius: 4px;\n}\n\n/* Tooltip content for each tenet */\na[href^=\"#source-of-truth\"]:after { content: \"We should be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances.\"; }\na[href^=\"#one-model-one-file\"]:after { content: \"All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom.\"; }\na[href^=\"#code-is-product\"]:after { content: \"Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.\"; }\na[href^=\"#standardize-dont-abstract\"]:after { content: \"If it's model behavior, keep it in the file; abstractions only for generic infra.\"; }\na[href^=\"#do-repeat-yourself\"]:after { content: \"Copy when it helps users; keep successors in sync without centralizing behavior.\"; }\na[href^=\"#minimal-user-api\"]:after { content: \"Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. 
We want the least amount of codepaths.\"; }\na[href^=\"#backwards-compatibility\"]:after { content: \"Evolve by additive standardization, never break public APIs.\"; }\na[href^=\"#consistent-public-surface\"]:after { content: \"Same argument names, same outputs, hidden states and attentions exposed.\"; }\na[href^=\"#modular-toolbox\"]:after { content: \"Provide tools and utilities, but don't force users into a rigid framework.\"; }\n\n/* Universal tooltip styling for tenet references */\na[href^=\"#source-of-truth\"]:after,\na[href^=\"#one-model-one-file\"]:after,\na[href^=\"#code-is-product\"]:after,\na[href^=\"#standardize-dont-abstract\"]:after,\na[href^=\"#do-repeat-yourself\"]:after,\na[href^=\"#minimal-user-api\"]:after,\na[href^=\"#backwards-compatibility\"]:after,\na[href^=\"#consistent-public-surface\"]:after,\na[href^=\"#modular-toolbox\"]:after {\n position: absolute;\n bottom: 100%;\n left: 50%;\n transform: translateX(-50%);\n background: #1a202c;\n color: white;\n padding: 0.75rem 1rem;\n border-radius: 8px;\n font-size: 0.85em;\n font-weight: 400;\n white-space: normal;\n width: 320px;\n line-height: 1.4;\n z-index: 1001;\n opacity: 0;\n visibility: hidden;\n transition: opacity 0.3s ease, visibility 0.3s ease;\n pointer-events: none;\n box-shadow: 0 4px 12px rgba(0, 0, 0, 0.2);\n margin-bottom: 8px;\n}\n\n/* Tooltip arrows */\na[href^=\"#source-of-truth\"]:before,\na[href^=\"#one-model-one-file\"]:before,\na[href^=\"#code-is-product\"]:before,\na[href^=\"#standardize-dont-abstract\"]:before,\na[href^=\"#do-repeat-yourself\"]:before,\na[href^=\"#minimal-user-api\"]:before,\na[href^=\"#backwards-compatibility\"]:before,\na[href^=\"#consistent-public-surface\"]:before,\na[href^=\"#modular-toolbox\"]:before {\n content: '';\n position: absolute;\n bottom: 100%;\n left: 50%;\n transform: translateX(-50%);\n border: 8px solid transparent;\n border-top-color: #1a202c;\n z-index: 1002;\n opacity: 0;\n visibility: hidden;\n transition: opacity 0.3s ease, visibility 0.3s ease;\n}\n\n/* Show tooltips on hover */\na[href^=\"#source-of-truth\"]:hover:after,\na[href^=\"#one-model-one-file\"]:hover:after,\na[href^=\"#code-is-product\"]:hover:after,\na[href^=\"#standardize-dont-abstract\"]:hover:after,\na[href^=\"#do-repeat-yourself\"]:hover:after,\na[href^=\"#minimal-user-api\"]:hover:after,\na[href^=\"#backwards-compatibility\"]:hover:after,\na[href^=\"#consistent-public-surface\"]:hover:after,\na[href^=\"#modular-toolbox\"]:hover:after,\na[href^=\"#source-of-truth\"]:hover:before,\na[href^=\"#one-model-one-file\"]:hover:before,\na[href^=\"#code-is-product\"]:hover:before,\na[href^=\"#standardize-dont-abstract\"]:hover:before,\na[href^=\"#do-repeat-yourself\"]:hover:before,\na[href^=\"#minimal-user-api\"]:hover:before,\na[href^=\"#backwards-compatibility\"]:hover:before,\na[href^=\"#consistent-public-surface\"]:hover:before,\na[href^=\"#modular-toolbox\"]:hover:before {\n opacity: 1;\n visibility: visible;\n}\n\n/* Improve blockquote styling */\nd-article blockquote {\n font-size: 19px;\n line-height: 1.8;\n padding: 1.5rem 2rem;\n margin: 2rem 0;\n border-left: 4px solid #667eea;\n background: linear-gradient(135deg, #f8f9fa 0%, #e9ecef 50%);\n border-radius: 0 8px 8px 0;\n font-style: italic;\n color: #4a5568;\n}\n\n/* Full width elements */\nd-article .code-compare,\nd-article .interactive-demo,\nd-article .memory-chart-container {\n max-width: none;\n width: 100%;\n margin-left: 0;\n margin-right: 0;\n}\n\n/* Responsive design improvements */\n@media (max-width: 1200px) {\n d-article 
.code-compare,\n d-article .interactive-demo {\n max-width: 95%;\n margin-left: auto;\n margin-right: auto;\n }\n}\n\n@media (max-width: 768px) {\n .tenet-list li.tenet {\n padding: 1rem;\n }\n \n .interactive-demo .demo-content {\n padding: 1rem;\n }\n}"],"sourceRoot":""}]);
 
 
1866
  // Exports
1867
  /* harmony default export */ const __WEBPACK_DEFAULT_EXPORT__ = (___CSS_LOADER_EXPORT___);
1868
 
@@ -1985,7 +1976,7 @@ var update = injectStylesIntoStyleTag_default()(style/* default */.A, options);

  // Import any additional functionality
- console.log('Scaling Insanity loaded');
+ console.log('blog loaded');

  // Add any custom JavaScript functionality here
  document.addEventListener('DOMContentLoaded', function () {
 
dist/main.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
src/distill.js CHANGED
@@ -2102,7 +2102,7 @@ d-appendix > distill-appendix {
  </div>
  <div >
  <h3>Published</h3>
- <div>August, 2025</div>
+ <div>October, 2025</div>
  </div>
  </div>

src/index.js CHANGED
@@ -2,7 +2,7 @@
  import './style.css';

  // Import any additional functionality
- console.log('Scaling Insanity loaded');
+ console.log('blog loaded');

  // Add any custom JavaScript functionality here
  document.addEventListener('DOMContentLoaded', function() {
src/transformers-custom.css CHANGED
@@ -486,29 +486,32 @@ p code, li code {
  /* Distill article improvements */
  d-article {
  max-width: none;
- font-size: 18px; /* Increased from default ~16px */
- line-height: 1.7;
+ font-size: 19px;
+ line-height: 1.7 !important;
+ color: #1a1a1a;
+ padding-top: 1rem !important;
+ grid-row-gap: 0 !important;
  }

  d-article > * {
- max-width: 1100px; /* Increased from 900px for more space */
- margin-left: auto;
- margin-right: auto;
+ grid-column: middle !important;
+ max-width: none;
  }

- /* Make content even wider on large screens when TOC is present */
- @media (min-width: 1400px) {
+ /* Adjust for TOC on larger screens */
+ @media (min-width: 1200px) {
  d-article > * {
- max-width: 1300px;
+ grid-column: text / page-end !important;
+ max-width: none;
  }
  }

  /* Improve paragraph readability */
  d-article p {
- font-size: 18px;
- line-height: 1.8;
- margin-bottom: 1.5rem;
- color: #2d3748;
+ font-size: 19px;
+ line-height: 1.5;
+ margin-top: 0 !important;
+ color: #1a1a1a;
  }

  /* Improve heading sizes */

@@ -523,7 +526,8 @@ d-article h1 {
  d-article h2 {
  font-size: 2.5rem;
  line-height: 1.3;
- margin: 2.5rem 0 1.5rem 0;
+ margin: 1.5rem 0 0.75rem 0 !important;
+ padding-bottom: 0.5rem !important;
  color: #1a202c;
  font-weight: 650;
  }

@@ -552,7 +556,7 @@ d-article ol li {
  margin-bottom: 0.5rem;
  }

- /* Enhanced tenet reference styling with tooltips */
+ /* Enhanced tenet reference styling with custom tooltips */
  a[href^="#source-of-truth"],
  a[href^="#one-model-one-file"],
  a[href^="#code-is-product"],

@@ -568,7 +572,6 @@ a[href^="#modular-toolbox"] {
  text-decoration: underline;
  text-decoration-color: rgba(102, 126, 234, 0.3);
  transition: all 0.3s ease;
- cursor: help;
  }

  a[href^="#source-of-truth"]:hover,

@@ -587,27 +590,9 @@ a[href^="#modular-toolbox"]:hover {
  border-radius: 4px;
  }

- /* Tooltip content for each tenet */
- a[href^="#source-of-truth"]:after { content: "We should be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances."; }
- a[href^="#one-model-one-file"]:after { content: "All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom."; }
- a[href^="#code-is-product"]:after { content: "Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial."; }
- a[href^="#standardize-dont-abstract"]:after { content: "If it's model behavior, keep it in the file; abstractions only for generic infra."; }
- a[href^="#do-repeat-yourself"]:after { content: "Copy when it helps users; keep successors in sync without centralizing behavior."; }
- a[href^="#minimal-user-api"]:after { content: "Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths."; }
- a[href^="#backwards-compatibility"]:after { content: "Evolve by additive standardization, never break public APIs."; }
- a[href^="#consistent-public-surface"]:after { content: "Same argument names, same outputs, hidden states and attentions exposed."; }
- a[href^="#modular-toolbox"]:after { content: "Provide tools and utilities, but don't force users into a rigid framework."; }
-
- /* Universal tooltip styling for tenet references */
- a[href^="#source-of-truth"]:after,
- a[href^="#one-model-one-file"]:after,
- a[href^="#code-is-product"]:after,
- a[href^="#standardize-dont-abstract"]:after,
- a[href^="#do-repeat-yourself"]:after,
- a[href^="#minimal-user-api"]:after,
- a[href^="#backwards-compatibility"]:after,
- a[href^="#consistent-public-surface"]:after,
- a[href^="#modular-toolbox"]:after {
+ /* Custom tooltip using data-tooltip attribute */
+ a[data-tooltip]:after {
+ content: attr(data-tooltip);
  position: absolute;
  bottom: 100%;
  left: 50%;

@@ -630,16 +615,7 @@ a[href^="#modular-toolbox"]:after {
  margin-bottom: 8px;
  }

- /* Tooltip arrows */
- a[href^="#source-of-truth"]:before,
- a[href^="#one-model-one-file"]:before,
- a[href^="#code-is-product"]:before,
- a[href^="#standardize-dont-abstract"]:before,
- a[href^="#do-repeat-yourself"]:before,
- a[href^="#minimal-user-api"]:before,
- a[href^="#backwards-compatibility"]:before,
- a[href^="#consistent-public-surface"]:before,
- a[href^="#modular-toolbox"]:before {
+ a[data-tooltip]:before {
  content: '';
  position: absolute;
  bottom: 100%;

@@ -653,25 +629,8 @@ a[href^="#modular-toolbox"]:before {
  transition: opacity 0.3s ease, visibility 0.3s ease;
  }

- /* Show tooltips on hover */
- a[href^="#source-of-truth"]:hover:after,
- a[href^="#one-model-one-file"]:hover:after,
- a[href^="#code-is-product"]:hover:after,
- a[href^="#standardize-dont-abstract"]:hover:after,
- a[href^="#do-repeat-yourself"]:hover:after,
- a[href^="#minimal-user-api"]:hover:after,
- a[href^="#backwards-compatibility"]:hover:after,
- a[href^="#consistent-public-surface"]:hover:after,
- a[href^="#modular-toolbox"]:hover:after,
- a[href^="#source-of-truth"]:hover:before,
- a[href^="#one-model-one-file"]:hover:before,
- a[href^="#code-is-product"]:hover:before,
- a[href^="#standardize-dont-abstract"]:hover:before,
- a[href^="#do-repeat-yourself"]:hover:before,
- a[href^="#minimal-user-api"]:hover:before,
- a[href^="#backwards-compatibility"]:hover:before,
- a[href^="#consistent-public-surface"]:hover:before,
- a[href^="#modular-toolbox"]:hover:before {
+ a[data-tooltip]:hover:after,
+ a[data-tooltip]:hover:before {
  opacity: 1;
  visibility: visible;
  }

@@ -689,6 +648,36 @@ d-article blockquote {
  color: #4a5568;
  }

+ /* Link capsule styling - only for external HTTP(S) links */
+ d-article a[href^="http://"],
+ d-article a[href^="https://"] {
+ background: linear-gradient(135deg, #e3f2fd 0%, #bbdefb 100%);
+ color: #1565c0;
+ text-decoration: none;
+ padding: 0.15em 0.5em;
+ border-radius: 12px;
+ border: 1px solid #90caf9;
+ display: inline-block;
+ transition: all 0.3s ease;
+ font-weight: 500;
+ box-shadow: 0 1px 3px rgba(21, 101, 192, 0.15);
+ }
+
+ d-article a[href^="http://"]:hover,
+ d-article a[href^="https://"]:hover {
+ background: linear-gradient(135deg, #2196f3 0%, #1976d2 100%);
+ color: white;
+ border-color: #1565c0;
+ transform: translateY(-1px);
+ box-shadow: 0 4px 12px rgba(21, 101, 192, 0.3);
+ }
+
+ d-article a[href^="http://"]:active,
+ d-article a[href^="https://"]:active {
+ transform: translateY(0);
+ box-shadow: 0 1px 3px rgba(21, 101, 192, 0.2);
+ }
+
  /* Full width elements */
  d-article .code-compare,
  d-article .interactive-demo,

@@ -713,8 +702,9 @@ d-article .memory-chart-container {
  .tenet-list li.tenet {
  padding: 1rem;
  }
-
+
  .interactive-demo .demo-content {
  padding: 1rem;
  }
- }
+ }
+
webpack.config.js CHANGED
@@ -26,23 +26,24 @@ const loadFragmentsMap = (() => {
  if (fs.statSync(filePath).isDirectory()) {
  await walkDir(filePath, relativePath);
  } else {
- // Remove the .html extension before creating the dotted path
  const nameWithoutExt = relativePath.replace(/\.html$/, '');
  const dottedPath = 'fragment-' + nameWithoutExt.replace(/\\/g, '-').replace(/\//g, '-').replace(/\./g, '-');
  const content = fs.readFileSync(filePath, "utf8");
- // Minify the HTML content using swcMinifyFragment
  let minifiedContent;
- try {
- const minifiedRes = await HtmlMinimizerPlugin.swcMinifyFragment({"tmp.html": content})
- if (minifiedRes.errors) {
- console.warn("HTML minification warnings:", minifiedRes.errors);
- minifiedContent = content; // Use original content if errors
- } else {
- minifiedContent = minifiedRes.code;
+
+ if (content.trim().startsWith('<!DOCTYPE') || content.trim().startsWith('<html')) {
+ minifiedContent = content;
+ } else {
+ try {
+ const minifiedRes = await HtmlMinimizerPlugin.swcMinifyFragment({"tmp.html": content})
+ if (minifiedRes.errors) {
+ minifiedContent = content;
+ } else {
+ minifiedContent = minifiedRes.code;
+ }
+ } catch (error) {
+ minifiedContent = content;
  }
- } catch (error) {
- console.warn(`Failed to minify fragment ${filePath}, using original content:`, error.message);
- minifiedContent = content; // Fallback to original content
  }
  cachedFragments[dottedPath] = minifiedContent;
  }

@@ -94,8 +95,7 @@ module.exports = {
  presets: ["@babel/preset-env"],
  },
  },
- },
- {}
+ }
  ],
  },
  plugins: [

@@ -104,6 +104,7 @@ module.exports = {
  patterns: [
  { from: "src/fragments/*", to: "fragments/[name].html" },
  { from: "src/style.css", to: "style.css" },
+ { from: "src/transformers-custom.css", to: "transformers-custom.css" },
  { from: "content/*.png", to: "static/[name][ext]" },
  { from: "content/*.svg", to: "static/[name][ext]" },
  { from: "content/*.html", to: "static/[name][ext]" },

@@ -150,28 +151,27 @@ module.exports = {

  // Extract tenet text for tooltips
  const tenetTooltips = {
- 'source-of-truth': 'We aim to be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances.',
- 'one-model-one-file': 'All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom.',
  'code-is-product': 'Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.',
  'standardize-dont-abstract': 'If it\\'s model behavior, keep it in the file; abstractions only for generic infra.',
  'do-repeat-yourself': 'Copy when it helps users; keep successors in sync without centralizing behavior.',
  'minimal-user-api': 'Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths.',
  'backwards-compatibility': 'Evolve by additive standardization, never break public APIs.',
- 'consistent-public-surface': 'Same argument names, same outputs, hidden states and attentions exposed.',
  };

- // Add smooth scrolling and active state
  const tocLinks = document.querySelectorAll('d-contents a');
  tocLinks.forEach(link => {
  const href = link.getAttribute('href');
  const anchor = href ? href.substring(1) : '';
-
- // Add tooltip if this is a tenet link
  if (tenetTooltips[anchor]) {
- link.setAttribute('title', tenetTooltips[anchor]);
  link.style.position = 'relative';
  }
-
  link.addEventListener('click', function(e) {
  e.preventDefault();
  const target = document.querySelector(this.getAttribute('href'));

@@ -180,6 +180,16 @@ module.exports = {
  }
  });
  });

  // Update active state on scroll
  window.addEventListener('scroll', function() {

@@ -224,7 +234,7 @@ module.exports = {
  initializeSyntaxHighlighting();
  }, 1000);
  </script>`;
-
  // Create full HTML document with distill template
  const template = `<!DOCTYPE html>
  <html>

@@ -238,6 +248,7 @@ module.exports = {
  <meta charset="utf8">
  <title>${appConfig.fullTitle}</title>
  <link rel="stylesheet" href="style.css">
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css">
  </head>
  <body>
 
151
 
152
  // Extract tenet text for tooltips
153
  const tenetTooltips = {
154
+ 'source-of-truth': 'We aim be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances.',
155
+ 'one-model-one-file': 'All inference and training core logic has to be visible, top‑to‑bottom, to maximize each model\\'s hackability.',
156
  'code-is-product': 'Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.',
157
  'standardize-dont-abstract': 'If it\\'s model behavior, keep it in the file; abstractions only for generic infra.',
158
  'do-repeat-yourself': 'Copy when it helps users; keep successors in sync without centralizing behavior.',
159
  'minimal-user-api': 'Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths.',
160
  'backwards-compatibility': 'Evolve by additive standardization, never break public APIs.',
161
+ 'consistent-public-surface': 'Same argument names, same outputs, hidden states and attentions exposed, enforced by tests.',
162
  };
163
 
164
+ // Add smooth scrolling and custom tooltips to all tenet links (TOC and article)
165
  const tocLinks = document.querySelectorAll('d-contents a');
166
  tocLinks.forEach(link => {
167
  const href = link.getAttribute('href');
168
  const anchor = href ? href.substring(1) : '';
169
+
 
170
  if (tenetTooltips[anchor]) {
171
+ link.setAttribute('data-tooltip', tenetTooltips[anchor]);
172
  link.style.position = 'relative';
173
  }
174
+
175
  link.addEventListener('click', function(e) {
176
  e.preventDefault();
177
  const target = document.querySelector(this.getAttribute('href'));
 
180
  }
181
  });
182
  });
183
+
184
+ // Add custom tooltips to tenet links in article content
185
+ const articleLinks = document.querySelectorAll('d-article a[href^="#"]');
186
+ articleLinks.forEach(link => {
187
+ const href = link.getAttribute('href');
188
+ const anchor = href ? href.substring(1) : '';
189
+ if (tenetTooltips[anchor]) {
190
+ link.setAttribute('data-tooltip', tenetTooltips[anchor]);
191
+ }
192
+ });
193
 
194
  // Update active state on scroll
195
  window.addEventListener('scroll', function() {
 
234
  initializeSyntaxHighlighting();
235
  }, 1000);
236
  </script>`;
237
+
238
  // Create full HTML document with distill template
239
  const template = `<!DOCTYPE html>
240
  <html>
 
248
  <meta charset="utf8">
249
  <title>${appConfig.fullTitle}</title>
250
  <link rel="stylesheet" href="style.css">
251
+ <link rel="stylesheet" href="transformers-custom.css">
252
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css">
253
  </head>
254
  <body>
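The substantive change in `loadFragmentsMap` above is the guard that passes full HTML documents through untouched and only runs `swcMinifyFragment` on genuine fragments. A minimal standalone sketch of that check (the helper name `isFullHtmlDocument` is hypothetical, introduced here only for illustration):

```js
// Sketch only: the pass-through condition used above, extracted as a helper.
// Fragments that are complete HTML documents are returned unminified;
// anything else remains a candidate for fragment minification.
function isFullHtmlDocument(content) {
  const trimmed = content.trim();
  return trimmed.startsWith('<!DOCTYPE') || trimmed.startsWith('<html');
}

console.log(isFullHtmlDocument('<!DOCTYPE html><html><body></body></html>')); // true
console.log(isFullHtmlDocument('<section>just a fragment</section>'));        // false
```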