Molbap HF Staff commited on
Commit
3a3c4d7
·
1 Parent(s): dfda82f
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: Scaling insanity
3
  emoji: 📚
4
  colorFrom: pink
5
  colorTo: indigo
 
1
  ---
2
+ title: Maintain the unmaintainable
3
  emoji: 📚
4
  colorFrom: pink
5
  colorTo: indigo
config/app.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
- "title": "Scaling Insanity",
3
- "subtitle": "maintaining hundreds of model definitions",
4
  "description": "A peek into software engineering for the transformers library",
5
- "fullTitle": "Scaling Insanity: maintaining hundreds of model definitions"
6
  }
 
1
  {
2
+ "title": "Maintain the unmaintainable",
3
+ "subtitle": "1M python loc, 400+ models",
4
  "description": "A peek into software engineering for the transformers library",
5
+ "fullTitle": "Maintain the unmaintainable: 1M python loc, 400+ models"
6
  }
content/article.md CHANGED
@@ -1,5 +1,40 @@
1
 
2
- # Introduction
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3
 
4
  One million lines of `python` code. Through them, the `transformers` library supports more than 400 model architectures, from state-of-the-art LLMs and VLMs to specialized models for audio, video, and tables.
5
 
@@ -16,7 +51,7 @@ We codify the "tenets" that guide our development, demonstrate how they are impl
16
  For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon `transformers`. And not only that: any project of comparable size will require you to make deep choices, not only about design and choice of abstraction, but about the very mindset of the software you are building.
17
 
18
 
19
- ### The core tenets of transformers
20
 
21
 
22
  We summarize the foundations on which we've built everything, and write the "tenets" of the library. They behave like _software interfaces_, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.
@@ -28,7 +63,7 @@ Note that the library _evolved_ towards these principles, and that they _emerged
28
  <li class="tenet">
29
  <a id="source-of-truth"></a>
30
  <strong>Source of Truth</strong>
31
- <p>We aim be a source of truth for all model definitions. This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
32
  <em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
33
  </li>
34
 
@@ -73,7 +108,7 @@ Note that the library _evolved_ towards these principles, and that they _emerged
73
  <li class="tenet">
74
  <a id="consistent-public-surface"></a>
75
  <strong>Consistent Public Surface</strong>
76
- <p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. This is a goalpost</p>
77
  <em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
78
  </li>
79
  </ol>
@@ -96,9 +131,9 @@ def rotate_half(x):
96
 
97
  You can use a simple regex to find all methods of a given name across your codebase and compare their differences and similarities; that's what I did (plus a hash to avoid quadratic comparisons).
98
 
99
- All manual transmission cars have a clutch, but we want each _view_ of one of our cars to be able to function. Remove the clutch, you can't drive. Remove the doors, might be uncomfortable but you'll get there. So doors can go, but you _have_ to keep the clutch, even though you know perfectly how it works. It is a core functionality.
100
 
101
- In the same way, we want all models to have a self-contained modeling code.
102
 
103
  This comes at a great cost. Enter the `# Copied from...` mechanism: for a long time, these comments indicated that some code was copied from another model, saving time both for the reviewers and for the CI. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet we could not remove them.
104
 
@@ -108,7 +143,6 @@ What was the solution to this?
108
 
109
  ## <a id="modular"></a> Modular transformers
110
 
111
-
112
  Transformers is an opinionated library. The previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the [blog post](https://huggingface.co/blog/transformers-design-philosophy) were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. [`modular` transformers were introduced](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file).
113
 
114
  We amended the principle of [DRY*](#do-repeat-yourself) by removing progressively all pieces of code that were "copied from" another file.
@@ -127,7 +161,9 @@ What is the consequence? When adding a model, we do not need to go over the enti
127
 
128
  When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling (right side) that is run, and all the tests are run on the modeling code.
129
 
130
- ## A maintainable control surface
 
 
131
 
132
  The effect of modular can be measured straight from git history: at every commit, we look under the model directory.
133
  If it only has a modeling file, we add its LOC count.
@@ -145,14 +181,23 @@ Cyclomatic complexity isn’t LOC, but they strongly correlate. As Les Hatton no
145
 
146
  There's a sharp drop near the end; it's due to us [removing support for Jax and TensorFlow](https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc) library-wide.
147
 
148
- Of course, it is not only this effort that allowed to reduce the maintenance load. Externalising the [attention classes](#external-attention-classes) has moved out a lot of repeated code that was [standard](#standardize-dont-abstract).
 
 
 
 
149
 
 
150
 
151
- ## <a id="attention-classes"></a> External Attention classes
152
 
153
- A chronological iteration over [modular](#modular), and a big improvement in terms of readabilty, was to remove the various attention-backend-specific attention classes across the repository. Before, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasn't a [minimal user api](#minimal-user-api).
154
 
155
- What will forever stay in the modeling code is the `eager_attention_forward` because it is a core part of the modeling,
 
 
 
 
156
 
157
  ```python
158
  attention_interface: Callable = eager_attention_forward
@@ -160,9 +205,7 @@ if self.config._attn_implementation != "eager":
160
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
161
  ```
162
 
163
- We often read and understand that `kwargs` are criticized, and we are typing them however we can, but we cannot enforce them all the time because other libraries such as vLLM don''t use the same kwargs.
164
-
165
- It is a strength of the new attention interface, where it can be plugged in various backends, because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system is a [minimal user api](#minimal-user-api).
166
 
167
  For better _information_, we plan to use `python` features such as `Annotated` to inform users of what we typically expect in an argument. That way, higher-level information could be included directly in the type annotations, like so (tentative design):
168
 
@@ -173,14 +216,23 @@ MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]
173
  ```
174
 
175
 
176
- ## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism
177
 
178
- # TODO ADD LINK TO EXTERNAL BLOG POST
179
- We want to touch minimally to the modeling code, and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we instead now specify a simple `tp_plan`.
180
 
181
- It is written once in the config and passed to `.from_pretrained()`.
182
 
183
- The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires to sharding implementations `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
 
 
 
 
 
 
 
 
 
 
 
184
 
185
  {{{fragment-tp-plan}}}
186
 
@@ -192,7 +244,7 @@ Which allows a user to run with multiple processes per node, e.g. 4 GPUs:
192
  Semantics stay in the model (a Linear stays a Linear); distribution is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks; "rowwise" splits rows; packed variants shard fused weights. The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
193
 
194
 
195
- ## <a id="layers-attentions-caches"></a> Layers, attentions and caches
196
 
197
  Following the same logic, the _nature_ of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we defined a mapping that can then be specified per layer in the configuration:
198
 
@@ -221,7 +273,7 @@ and the configuration can be _explicit_ about which attention type is in which l
221
 
222
  This is [minimal](#minimal-user-api) to implement on the user side, and allows us to keep the modeling untouched. It is also easy to tweak.
223
 
224
- ## <a id="community-kernels"></a>Community Kernels
225
 
226
  The same principle extends to normalization, activation, and other code paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a [consistent public surface](#consistent-public-surface):
227
 
@@ -235,7 +287,7 @@ Plus, this opened another angle of contribution for the community. People who ar
235
 
236
  Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
237
 
238
- ## The good modularity
239
 
240
  Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering, if this effort is made, and we're striving for it.
241
  It's hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
@@ -257,7 +309,7 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
257
 
258
  Another problem is that this analysis only covers `modular` models: several models do NOT have a modular file.
259
 
260
- ## Many models, but not enough yet, are alike
261
 
262
  So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters strung together, and I also used code embedding models to check code similarities, which yielded better results; still, for the needs of this blog post I will stick to the Jaccard index.
263
 
@@ -329,7 +381,7 @@ The following [Pull request to standardize placeholder masking](https://github.c
329
  return special_image_mask, special_video_mask
330
  ```
331
 
332
- But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the self-contained logic of the model.
333
 
334
 
335
  ### <a id="encoders-ftw"></a> Embedding models, now and forever.
@@ -344,38 +396,38 @@ As the codebase grows, with our friend codebase [Sentence Transformers](https://
344
 
345
  Choosing to be a `torch`-first software meant shedding a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal in the number of torch-dependent utilities we added. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
346
 
347
- The gains in performance are immense, up to 20x speed for most models when compiled torchvision ops.
348
 
349
- ![Fast Image Processors Performance](fast_image_processors.png)
350
-
351
 
352
 
353
  ## Reduce barrier to entry/contribution
354
 
355
- This is an overall objective: there's no `transformer` without its community.
356
 
357
- We didn't want to make a toolbox, because _having a framework means forcing users into it_. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
358
 
359
- Among the most valuable contributions to `transformers`is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other softwares.
360
 
361
- In that regard, we DO want to be a [modular toolbox](#modular-toolbox), being [minimal](#minimal-user-api) enough (and hopefully well documented enough) so any ML/AI developer can use transformers without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
362
 
 
363
 
364
- ## A surgical toolbox for model development
365
 
366
  ### Attention visualisation
367
 
368
- If all models have the same API internally for attention computation, it allows us to build cool tools to visualize the inner workings of the attention mechanism. One particular piece of
369
- machinery is the `attention mask`, cause of confusion.
370
 
371
- Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual "causal-only" models.
372
 
373
  {{{fragment-attention-visualizer}}}
374
 
375
 
376
  ### Logging entire model activations
377
 
378
- Further, because it is all PyTorch (and it is even more now that we support only PyTorch), we can easily debug any model when we want to add it to transformers. We now have a power-user tool for porting or adding models, that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.
379
 
380
  It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our [core guideline](#source-of-truth).
381
 
@@ -387,11 +439,11 @@ Having a clean _external_ API allows us to work on the true inner workings of tr
387
 
388
  {{{fragment-warmup_demo}}}
389
 
390
- It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, your iteration speed.
391
 
392
  ### Transformers-serve and continuous batching
393
 
394
- Having all these models readily available allows to use all of them with transformers-serve, and enable interfacing with them with an Open API-like pattern.
395
 
396
  ```bash
397
  transformers serve
@@ -410,7 +462,7 @@ Continuous batching is in itself very much linked to the great work of vLLM with
410
 
411
  Transformers-serve is transformers-first, for sure, but it's not limited to that. Adding a model to transformers means:
412
  - having it immediately available to the community
413
- - having it immediately usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
414
 
415
  This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there is more optimized software than ours to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files) and here for [MoE support](https://github.com/huggingface/transformers/pull/40132), for instance.
416
 
 
1
 
2
+
3
+
4
+
5
+
6
+
7
+
8
+
9
+
10
+
11
+
12
+
13
+
14
+
15
+
16
+
17
+
18
+
19
+
20
+
21
+
22
+
23
+
24
+
25
+
26
+
27
+
28
+
29
+
30
+
31
+
32
+
33
+
34
+
35
+
36
+
37
+ ## Introduction
38
 
39
  One million lines of `python` code. Through them, the `transformers` library supports more than 400 model architectures, from state-of-the-art LLMs and VLMs to specialized models for audio, video, and tables.
40
 
 
51
  For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon `transformers`. And not only that: any project of comparable size will require you to make deep choices, not only about design and choice of abstraction, but about the very mindset of the software you are building.
52
 
53
 
54
+ ## The core tenets of transformers
55
 
56
 
57
  We summarize the foundations on which we've built everything, and write the "tenets" of the library. They behave like _software interfaces_, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.
 
63
  <li class="tenet">
64
  <a id="source-of-truth"></a>
65
  <strong>Source of Truth</strong>
66
+ <p>We aim to be a [source of truth for all model definitions](https://huggingface.co/blog/transformers-model-definition). This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performance.</p>
67
  <em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
68
  </li>
69
 
 
108
  <li class="tenet">
109
  <a id="consistent-public-surface"></a>
110
  <strong>Consistent Public Surface</strong>
111
+ <p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. This is a goal we have as well as a tenet.</p>
112
  <em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
113
  </li>
114
  </ol>
 
131
 
132
  You can use a simple regex to find all methods of a given name across your codebase and compare their differences and similarities; that's what I did (plus a hash to avoid quadratic comparisons).
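+ As an illustration, here is a minimal sketch of that kind of search (a hypothetical helper, not part of `transformers`): a regex pulls out every top-level definition of a given function from the modeling files, and a hash of the normalized body groups exact duplicates.
+
+ ```python
+ import hashlib, re
+ from pathlib import Path
+
+ def find_copies(root: str, func_name: str) -> dict[str, list[str]]:
+     # match a top-level `def <func_name>(...)` up to the next top-level def (or end of file)
+     pattern = re.compile(rf"^def {func_name}\(.*?(?=^def |\Z)", re.S | re.M)
+     groups: dict[str, list[str]] = {}
+     for path in Path(root).rglob("modeling_*.py"):
+         for body in pattern.findall(path.read_text()):
+             # hash the whitespace-normalized body so identical copies land in the same bucket
+             digest = hashlib.sha1(" ".join(body.split()).encode()).hexdigest()
+             groups.setdefault(digest, []).append(str(path))
+     return groups
+
+ # e.g. find_copies("src/transformers/models", "rotate_half")
+ ```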
133
 
134
+ We want all models to have self-contained modeling code.
135
 
136
+ Every core functionality _must_ be in the modeling code, every non-core functionality _can_ be outside of it.
137
 
138
  This comes at a great cost. Enter the `# Copied from...` mechanism: for a long time, these comments indicated that some code was copied from another model, saving time both for the reviewers and for the CI. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet we could not remove them.
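+ For illustration, such a marker looks like this (the model pair here is just an example):
+
+ ```python
+ from torch import nn
+
+ # Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Mistral
+ class MistralRMSNorm(nn.Module):
+     ...
+ ```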
139
 
 
143
 
144
  ## <a id="modular"></a> Modular transformers
145
 
 
146
  Transformers is an opinionated library. The previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the [blog post](https://huggingface.co/blog/transformers-design-philosophy) were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. [`modular` transformers were introduced](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file).
147
 
148
  We amended the principle of [DRY*](#do-repeat-yourself) by removing progressively all pieces of code that were "copied from" another file.
 
161
 
162
  When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling (right side) that is run, and all the tests are run on the modeling code.
163
 
164
+ What does that give us?
165
+
166
+ ### A maintainable control surface
167
 
168
  The effect of modular can be measured straight from git history: at every commit, we look under the model directory.
169
  If it only has a modeling file, we add its LOC count.
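+ A rough sketch of that counting (not the actual measurement script) could look like the following: for each model directory, prefer the `modular_*.py` line count when one exists, otherwise fall back to the full `modeling_*.py` count, and run it at every commit.
+
+ ```python
+ from pathlib import Path
+
+ def effective_loc(models_dir: str) -> int:
+     total = 0
+     for model_dir in Path(models_dir).iterdir():
+         if not model_dir.is_dir():
+             continue
+         # the modular file is what we actually maintain; the expanded modeling file is generated
+         modular = list(model_dir.glob("modular_*.py"))
+         files = modular or list(model_dir.glob("modeling_*.py"))
+         total += sum(len(f.read_text().splitlines()) for f in files)
+     return total
+
+ # checked out at each commit (e.g. `git checkout <sha>`), this traces the maintenance surface over time
+ ```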
 
181
 
182
  There's a sharp drop near the end; it's due to us [removing support for Jax and TensorFlow](https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc) library-wide.
183
 
184
+ Of course, it is not only this effort that allowed us to reduce the maintenance load.
185
+
186
+ A related optimization concerns the attention computation. You've likely heard about [flash attention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention) and its several variants.
187
+
188
+ The _attention computation_ happens at a _lower_ level of abstraction than the model itself.
189
 
190
+ However, we had been adding backend-specific torch operations for each of them (SDPA, flash-attention variants, flex attention) directly in the modeling files, and it wasn't a [minimal user api](#minimal-user-api).
191
 
192
+ ### <a id="attention-classes"></a> External Attention classes
193
 
194
+ Externalising the [attention classes](#attention-classes) has moved out a lot of repeated code that was [standard](#standardize-dont-abstract).
195
 
196
+ We moved to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allowed the following:
197
+
198
+ We keep a `Callable` for the naive implementation of attention, the "eager" computation. This Callable is named `eager_attention_forward`, and can be run as long as the user has `torch` installed, which is a requirement in any case.
199
+
200
+ In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and another Callable is dispatched, including kernel bindings.
201
 
202
  ```python
203
  attention_interface: Callable = eager_attention_forward
 
205
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
206
  ```
207
 
208
+ A strength of the new attention interface is the possibility to enforce specific kwargs, which are needed by kernel providers and other dependencies. We know that kwargs are often a necessary evil that plagues tools aiming for widespread compatibility; it is something we have aimed to reduce, and will continue to reduce, in order to improve readability - with them, the current system remains a [minimal user api](#minimal-user-api).
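+ Because the mapping is just a name pointing to a `Callable`, plugging in a custom implementation is one registration away. A minimal sketch, based on the documented `AttentionInterface` registration mechanism (the function body is deliberately simplified, and the checkpoint name is only an example):
+
+ ```python
+ import torch
+ from transformers import AttentionInterface, AutoModelForCausalLM
+
+ # same calling convention as eager_attention_forward: (module, query, key, value, attention_mask, **kwargs)
+ def my_custom_attention(module, query, key, value, attention_mask=None, scaling=None, dropout=0.0, **kwargs):
+     attn_output = torch.nn.functional.scaled_dot_product_attention(
+         query, key, value, attn_mask=attention_mask, dropout_p=dropout, scale=scaling
+     )
+     # models expect (batch, seq_len, num_heads, head_dim) plus optional attention weights
+     return attn_output.transpose(1, 2).contiguous(), None
+
+ AttentionInterface.register("my_custom_attention", my_custom_attention)
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", attn_implementation="my_custom_attention")
+ ```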
 
 
209
 
210
  For better _information_, we plan to use `python` features such as `Annotated` to inform users of what we typically expect in an argument. That way, higher-level information could be included directly in the type annotations, like so (tentative design):
211
 
 
216
  ```
217
 
218
 
 
219
 
220
+ ### <a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism
 
221
 
222
+ If you're not familiar with the different flavours of parallelism, I recommend checking out [this blog post](https://huggingface.co/blog/accelerate-nd-parallel) first; a full [dive into the ultra-scale playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook) is, of course, always worthwhile.
223
 
224
+ The essential part is that, as [the documentation states](https://huggingface.co/docs/transformers/v4.56.2/perf_train_gpu_many#tensor-parallelism), when tensors get too large to fit on a single GPU, they are sliced along a particular dimension and every slice is sent to a different GPU.
225
+
226
+ Why does it matter?
227
+
228
+ Because we want to avoid code modifications that are unrelated to the model.
229
+ We choose to place the level of abstraction higher than the device placement: a matrix multiplication - an `nn.Linear` layer - should always be expressed in the same way, regardless of how it is placed.
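+ To see why the abstraction can stay declarative, here is a tiny, framework-free illustration (plain `torch`, not the actual `ParallelInterface` machinery): splitting a `nn.Linear` column-wise across two ranks and gathering the partial outputs reproduces the original result.
+
+ ```python
+ import torch
+ from torch import nn
+
+ torch.manual_seed(0)
+ layer = nn.Linear(8, 6, bias=False)
+ x = torch.randn(2, 8)
+
+ # column-wise sharding: each rank holds a slice of the output features
+ w0, w1 = layer.weight.chunk(2, dim=0)
+ out_rank0 = x @ w0.T  # would live on GPU 0
+ out_rank1 = x @ w1.T  # would live on GPU 1
+
+ # gathering the shards along the feature dimension matches the unsharded layer
+ assert torch.allclose(torch.cat([out_rank0, out_rank1], dim=-1), layer(x), atol=1e-6)
+ ```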
230
+
231
+ Hence, we want to touch the modeling code [minimally](#minimal-user-api), and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we instead now specify a simple `tp_plan`.
232
+
233
+ The alternative would be to modify parent classes for each parallelization scheme, leaking distribution details into the modeling code.
234
+
235
+ It is written once in the config and passed to `.from_pretrained()`. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires them to sharding implementations such as `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
236
 
237
  {{{fragment-tp-plan}}}
238
 
 
244
  Semantics stay in the model (a Linear stays a Linear); distribution is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks; "rowwise" splits rows; packed variants shard fused weights. The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
245
 
246
 
247
+ ### <a id="layers-attentions-caches"></a> Layers, attentions and caches
248
 
249
  Following the same logic, the _nature_ of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we defined a mapping that can then be specified per layer in the configuration:
250
 
 
273
 
274
  This is [minimal](#minimal-user-api) to implement on the user side, and allows us to keep the modeling untouched. It is also easy to tweak.
275
 
276
+ ### <a id="community-kernels"></a>Community Kernels
277
 
278
  The same principle extends to normalization, activation, and other code paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a [consistent public surface](#consistent-public-surface):
279
 
 
287
 
288
  Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
289
 
290
+ ## Modular developments
291
 
292
  Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering, if this effort is made, and we're striving for it.
293
  It's hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
 
309
 
310
  Another problem is that this analysis only covers `modular` models: several models do NOT have a modular file.
311
 
312
+ ### Many models, but not enough yet, are alike
313
 
314
  So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters strung together, and I also used code embedding models to check code similarities, which yielded better results; still, for the needs of this blog post I will stick to the Jaccard index.
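+ For reference, the comparison itself is simple; a minimal sketch (crude identifier-level tokenization, file paths are just examples):
+
+ ```python
+ import re
+ from pathlib import Path
+
+ def jaccard(path_a: str, path_b: str) -> float:
+     # treat each identifier in a file as one element of a set
+     tokens = lambda p: set(re.findall(r"[A-Za-z_]\w+", Path(p).read_text()))
+     a, b = tokens(path_a), tokens(path_b)
+     return len(a & b) / len(a | b)
+
+ # e.g. jaccard("models/llava/modeling_llava.py", "models/llava_next/modeling_llava_next.py")
+ ```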
315
 
 
381
  return special_image_mask, special_video_mask
382
  ```
383
 
384
+ But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
385
 
386
 
387
  ### <a id="encoders-ftw"></a> Embedding models, now and forever.
 
396
 
397
  Choosing to be a `torch`-first software meant shedding a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal in the number of torch-dependent utilities we added. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
398
 
399
+ The gains in performance are immense, up to a 20x speedup for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on GPU.
400
 
401
+ ![Fast Image Processors Performance](static/fast_image_processors.png)
402
+ <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
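+ As a usage sketch (the checkpoint is only an example), requesting the fast, `torchvision`-backed processor and keeping tensors on GPU looks like this:
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoImageProcessor
+
+ image = Image.open("cat.png")
+
+ # use_fast=True selects the torchvision-backed processor class when one exists
+ processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)
+
+ # processing returns torch tensors directly and can run on GPU end to end
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ inputs = processor(images=image, return_tensors="pt", device=device)
+ print(inputs["pixel_values"].shape, inputs["pixel_values"].device)
+ ```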
403
 
404
 
405
  ## Reduce barrier to entry/contribution
406
 
407
+ This is an overall objective: there's no `transformers` without its community.
408
 
409
+ Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
410
 
411
+ Among the most valuable contributions to `transformers` is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other pieces of software.
412
 
413
+ In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
414
 
415
+ So, how do these design choices, these "tenets", influence the development of models and the overall usage of `transformers`?
416
 
417
+ ### A surgical toolbox for model development
418
 
419
  ### Attention visualisation
420
 
421
+ All models have the same internal API for attention computation, thanks to [the externalisation of attention classes](#attention-classes). This allows us to build cool tools to visualize the inner workings of the attention mechanism.
 
422
 
423
+ One particular piece of machinery is the `attention mask`. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual "causal-only" models.
424
 
425
  {{{fragment-attention-visualizer}}}
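+ The underlying pattern is easy to reproduce in a toy sketch (plain `torch`, not the library's own mask utilities): start from a causal mask and open up the prefix block so that prefix tokens attend to each other in both directions.
+
+ ```python
+ import torch
+
+ seq_len, prefix_len = 8, 5  # e.g. image + text prompt tokens form the prefix
+
+ # standard causal mask: True means "may attend"
+ mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
+
+ # prefix-LM pattern (as in PaliGemma): the whole prefix is bidirectional, only the suffix stays causal
+ mask[:prefix_len, :prefix_len] = True
+ print(mask.int())
+ ```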
426
 
427
 
428
  ### Logging entire model activations
429
 
430
+ Further, because it is all PyTorch (even more so now that we support only PyTorch), we can easily [debug any model](https://huggingface.co/docs/transformers/internal/model_debugging_utils) when we want to add it to transformers. We now have a power-user tool for porting or adding models that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.
431
 
432
  It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our [core guideline](#source-of-truth).
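+ The interception idea itself is plain PyTorch; a stripped-down sketch of the mechanism (not the actual debugging utility shipped in `transformers`) using forward hooks:
+
+ ```python
+ import torch
+ from torch import nn
+
+ def log_activations(model: nn.Module, inputs: dict) -> dict:
+     """Run one forward pass and record per-module output shape, dtype and basic stats."""
+     records, handles = {}, []
+
+     def make_hook(name):
+         def hook(module, args, output):
+             out = output[0] if isinstance(output, tuple) else output
+             if isinstance(out, torch.Tensor):
+                 records[name] = {
+                     "shape": list(out.shape),
+                     "dtype": str(out.dtype),
+                     "mean": out.float().mean().item(),
+                     "std": out.float().std().item(),
+                 }
+         return hook
+
+     for name, module in model.named_modules():
+         handles.append(module.register_forward_hook(make_hook(name)))
+     try:
+         with torch.no_grad():
+             model(**inputs)
+     finally:
+         for handle in handles:
+             handle.remove()
+     return records  # e.g. json.dump(records, open("activations.json", "w"), indent=2)
+ ```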
433
 
 
439
 
440
  {{{fragment-warmup_demo}}}
441
 
442
+ It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, as it's the narrowest bottleneck for your iteration speed.
443
 
444
  ### Transformers-serve and continuous batching
445
 
446
+ Having all these models readily available allows using all of them with transformers-serve, and enables interfacing with them through an OpenAI-like API. As a reminder, the hub also opens access to various [inference providers](https://huggingface.co/docs/inference-providers/en/index) if you're interested in model deployment in general.
447
 
448
  ```bash
449
  transformers serve
 
462
 
463
  Transformers-serve is transformers-first, for sure, but it's not limited to that. Adding a model to transformers means:
464
  - having it immediately available to the community
465
+ - having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
466
 
467
  This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there is more optimized software than ours to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files) and here for [MoE support](https://github.com/huggingface/transformers/pull/40132), for instance.
468
 
dist/distill.bundle.js CHANGED
@@ -2146,7 +2146,7 @@ function _arrayWithHoles(r) { if (Array.isArray(r)) return r; }
2146
  function bylineTemplate(frontMatter) {
2147
  return "\n <div class=\"byline grid\">\n <div>\n <h3>Authors</h3>\n <div>\n ".concat(frontMatter.authors.map(function (author, i) {
2148
  return "\n <span class=\"author\">\n ".concat(author.personalURL ? "\n <a class=\"name\" href=\"".concat(author.personalURL, "\">").concat(author.name) + (i + 1 < frontMatter.authors.length ? "," : "") + "</a>" : "\n <span class=\"name\">".concat(author.name) + (i + 1 < frontMatter.authors.length ? "," : "") + "</span>", "\n </span>\n ");
2149
- }).join(''), "\n </div>\n </div>\n <div >\n <h3>Affiliation</h3>\n <div><a href=\"https://huggingface.co/\">Hugging Face</a>\n </div>\n </div>\n <div >\n <h3>Published</h3>\n <div>August, 2025</div>\n </div>\n </div>\n\n");
2150
  }
2151
  var Byline = /*#__PURE__*/function (_HTMLElement4) {
2152
  function Byline() {
 
2146
  function bylineTemplate(frontMatter) {
2147
  return "\n <div class=\"byline grid\">\n <div>\n <h3>Authors</h3>\n <div>\n ".concat(frontMatter.authors.map(function (author, i) {
2148
  return "\n <span class=\"author\">\n ".concat(author.personalURL ? "\n <a class=\"name\" href=\"".concat(author.personalURL, "\">").concat(author.name) + (i + 1 < frontMatter.authors.length ? "," : "") + "</a>" : "\n <span class=\"name\">".concat(author.name) + (i + 1 < frontMatter.authors.length ? "," : "") + "</span>", "\n </span>\n ");
2149
+ }).join(''), "\n </div>\n </div>\n <div >\n <h3>Affiliation</h3>\n <div><a href=\"https://huggingface.co/\">Hugging Face</a>\n </div>\n </div>\n <div >\n <h3>Published</h3>\n <div>October, 2025</div>\n </div>\n </div>\n\n");
2150
  }
2151
  var Byline = /*#__PURE__*/function (_HTMLElement4) {
2152
  function Byline() {
dist/distill.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -8,21 +8,22 @@
8
  <script src="https://d3js.org/d3.v7.min.js"></script>
9
  <meta name="viewport" content="width=device-width, initial-scale=1">
10
  <meta charset="utf8">
11
- <title>Scaling Insanity: maintaining hundreds of model definitions</title>
12
  <link rel="stylesheet" href="style.css">
 
13
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css">
14
  </head>
15
  <body>
16
  <d-front-matter>
17
  <script id='distill-front-matter' type="text/json">{
18
- "title": "Scaling Insanity: maintaining hundreds of model definitions",
19
  "description": "A peek into software engineering for the transformers library",
20
  "published": "Aug 21, 2025",
21
  "authors": [{"author": "Pablo Montalvo", "authorURL": "https://huggingface.co/Molbap"}]
22
  }</script>
23
  </d-front-matter>
24
  <d-title>
25
- <h1>Scaling Insanity: maintaining hundreds of model definitions</h1>
26
  <p>A peek into software engineering for the transformers library</p>
27
  </d-title>
28
  <d-byline></d-byline>
@@ -48,33 +49,29 @@
48
  </nav>
49
  </d-contents>
50
  <h2>Introduction</h2>
51
- <p>The <code>transformers</code> library, built with <code>PyTorch</code>, supports all state-of-the-art LLMs, many VLMs, task-specific vision language models, video models, audio models, table models, classical encoders, to a global count of almost 400 models.<br>
52
- The name of the library itself is mostly majority driven as many models are not even transformers architectures, like Mamba, Zamba, RWKV, and convolution-based models.<br>
53
- Regardless, each of these is wrought by the research and engineering team that created them, then harmonized into a now famous interface, and callable with a simple <code>.from_pretrained</code> command.<br>
54
- Inference works for all models, training is functional for most. The library is a foundation for many machine learning courses, cookbooks, and overall, several thousands other open-source libraries depend on it. All models are tested as part of a daily CI ensuring their preservation and reproducibility. Most importantly, it is <em>open-source</em> and has been written by the community for a large part.<br>
55
- This isnt really to brag but to set the stakes: what does it take to keep such a ship afloat, made of so many moving, unrelated parts?<br>
56
- The ML wave has not stopped, there’s more and more models being added, at a steadily growing rate. <code>Transformers</code> is widely used, and we read the feedback that users post online. Whether it’s about a function that had 300+ keyword arguments, duplicated code and helpers, and mentions of <code>Copied from ... </code> everywhere, along with optimisation concerns. Text-only models are relatively tamed, but multimodal models remain to be harmonized.<br>
57
- Here we will dissect what is the new design philosophy of transformers, as a continuation from the existing older <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page, and an accompanying <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post from 2022</a>.<br>
58
- More recently, and I recommend the read if it’s not done yet, a blog post about <a href="https://huggingface.co/blog/faster-transformers">recent upgrades to transformers</a> was written, explaining in particular what makes the library faster today.<br>
59
- Some time ago I dare not say how long, we discussed with transformers maintainers about the state of features in transformers. A lot of recent developments were satisfactory, but if we were only talking about these, self-congratulation would be the only goalpost.<br>
60
- Reflecting on this philosophy now, as models pile up, is essential and will drive new developments.</p>
61
- <h3>The core tenets of transformers</h3>
62
- <p>Every reader, whether an OSS maintainer, power user, or casual fine-tuner, will walk away knowing how to reason about the <code>transformers</code> code base, how to use it better, how to meaningfully contribute to it.
63
- This will also showcase new features you might have missed so you’ll be up-to-date.</p>
64
- <p>So, what are the principles of <code>transformers</code>? We will try to summarize the foundations on which we’ve built everything, and write the “tenets” of the library. They behave like <em>software interfaces</em>, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.</p>
65
  <div class="tenet-list">
66
  <ol>
67
  <li class="tenet">
68
  <a id="source-of-truth"></a>
69
  <strong>Source of Truth</strong>
70
- <p>We should be a source of truth for all model definitions. This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
71
  <em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
72
  </li>
73
  <li class="tenet">
74
  <a id="one-model-one-file"></a>
75
  <strong>One Model, One File</strong>
76
- <p>All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom.</p>
77
- <em>Every model should be completely understandable by reading a single file from top to bottom.</em>
78
  </li>
79
  <li class="tenet">
80
  <a id="code-is-product"></a>
@@ -99,32 +96,26 @@ This will also showcase new features you might have missed so you’ll be up-to-
99
  <a id="minimal-user-api"></a>
100
  <strong>Minimal User API</strong>
101
  <p>Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths. Reading should be obvious, configurations should be obvious.</p>
102
- <em>Keep the public interface simple and predictable - users should know what to expect.</em>
103
  </li>
104
  <li class="tenet">
105
  <a id="backwards-compatibility"></a>
106
  <strong>Backwards Compatibility</strong>
107
- <p>Evolve by additive standardization, <strong>never</strong> break public APIs.</p>
108
- <p><strong>Note:</strong> Some models are showing almost no use, we also stopped adding new features for non-torch frameworks. Still, we adapt to models existing on the hub.</p>
109
- <em>Once something is public, it stays public - evolution through addition, not breaking changes.</em>
110
  </li>
111
  <li class="tenet">
112
  <a id="consistent-public-surface"></a>
113
  <strong>Consistent Public Surface</strong>
114
- <p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests.</p>
115
  <em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
116
  </li>
117
- <li class="tenet">
118
- <a id="modular-toolbox"></a>
119
- <strong>Modular Toolbox (Not A Framework)</strong>
120
- <p>We ARE a toolbox. What we are not is a framework: you should not be FORCED to rewrite every modeling, but it is <em>better</em> for your model to be able to inherit from PreTrainedModel and have enabled TensorParallel, from_pretrained, sharding, push_to_hub, loss, as well as PEFT/TRL/SGLang/vLLM.</p>
121
- <em>This is the largest change. Provide tools and utilities, but don't force users into a rigid framework.</em>
122
- </li>
123
  </ol>
124
  </div>
125
  <p>When a PR is merged, it is because the contribution is worthwhile, and that the <code>transformers</code> team finds the design of the contribution to be aligned with what is above.</p>
126
- <p>Does all the code in the library follow strictly these tenets? No. The library is a gigantic house with connected nooks, corridors, crannies everywhere built by thousands of different workers. We <em>try</em> to make it so all the code added is inline, lest we break <a href="#backwards-compatibility">backwards compatibility</a>.</p>
127
- <p>For instance, one function essential to the implementation of <a href="https://huggingface.co/papers/2104.09864">Rotary Positional Embeddings</a> is identical in 70 <code>modeling_&lt;file&gt;.py</code> across <code>src/transformers/models/.</code> Why keep it? Because removing it would make those files unloadable checkpoints rather than self-contained blueprints. We <a href="#do-repeat-yourself">do repeat ourselves</a>.</p>
128
  <pre><code class="language-python">def rotate_half(x):
129
  &quot;&quot;&quot;Rotates half the hidden dims of the input.&quot;&quot;&quot;
130
  x1 = x[..., : x.shape[-1] // 2]
@@ -132,11 +123,15 @@ This will also showcase new features you might have missed so you’ll be up-to-
132
  return torch.cat((-x2, x1), dim=-1)
133
  </code></pre>
134
  <p>You can use a simple regex to look at all methods of a given name across your codebase and look at their differences and similarities, that’s what I did (+ a hash to avoid quadraticity).</p>
135
- <p>So… why keep it in all modeling files? Because if we were to remove it, the model would not work anymore. Think of the modeling files as a car (I know, what a novel metaphor! But, it works out.). All manual transmission cars have a clutch, but we want each <em>view</em> of one of our cars to be able to function. Remove the clutch, you can’t drive. Remove the doors, might be uncomfortable but you’ll get there. So doors can go, but you <em>have</em> to keep the clutch, even though you know perfectly how it works.</p>
136
- <h2><a id="modular"></a> Going modular</h2>
137
- <p>It is opinionated, and it can be frustrating when you encounter an opinionated library. Our previous <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page, and the <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post</a> were already pointing at some drawbacks, which have been iteratively addressed. <a href="https://huggingface.co/docs/transformers/en/modular_transformers">Transformers has gone modular</a>, allowing a form of inheritance without breaking <a href="#one-model-one-file">One model, One file</a>. If you’re familiar with this, you can <a href="#%5Eattention-classes">skip this section</a> and go to the next one.</p>
 
 
 
 
138
  <p>We amended the principle of <a href="#do-repeat-yourself">DRY*</a> by removing progressively all pieces of code that were “copied from” another file.</p>
139
- <p>It is explained in details in the documentation above, but overall it works like this, you define a <code>modular_</code> file that can inherit from <em>any function across all other modeling, configuration and processor files</em>:</p>
140
  <summary>Auto-generated modeling code</summary>
141
  <p><div class=code-compare style="display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.5rem 0;">
142
  <div class=code-column style="border: 1px solid #e2e8f0; border-radius: 8px; overflow: hidden;">
@@ -287,25 +282,49 @@ class GlmRMSNorm(nn.Module):
287
  <strong>Left:</strong> Clean modular definition with inheritance.
288
  <strong>Right:</strong> Auto-expanded version with all inherited functionality visible.
289
  </p></p>
290
- <p>As you can see, we can now define any model as a <em>modular</em> of another. This isn’t strictly groundbreaking if you’ve done any programming, you might even think “well that’s just how inheritance works”. The crucial difference is that we do <em>visibly</em> what is essentially the <em>compiler</em>’s job: by unrolling the inheritances, we make visible all of the modeling code, keeping it <a href="#one-model-one-file">all in one piece</a>.</p>
291
- <h2><a id="attention-classes"></a> External Attention classes</h2>
292
- <p>A chronological iteration over <a href="#modular">modular</a>, and a big improvement in terms of readabilty, was to remove the various attention-backend-specific attention classes across the repository. Before, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasn’t a <a href="#minimal-user-api">minimal user api</a>.</p>
293
- <p>What will forever stay in the modeling code is the <code>eager_attention_forward</code> because it is a core part of the modeling,</p>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
294
  <pre><code class="language-python">attention_interface: Callable = eager_attention_forward
295
  if self.config._attn_implementation != &quot;eager&quot;:
296
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
297
  </code></pre>
298
- <p>We often read and understand that <code>kwargs</code> are criticized, and we are typing them however we can, but we cannot enforce them all the time because other libraries such as vLLM don’'t use the same kwargs.</p>
299
- <p>It is a strength of the new attention interface, where it can be plugged in various backends, because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system is a <a href="#minimal-user-api">minimal user api</a>.</p>
300
  <p>For better <em>information</em>, we plan to use <code>python</code> features such as <code>Annotated</code> for example, to inform users of what we expect typically in an argument. That way, higher-level information could be included directly in the type annotations, like so (tentative design):</p>
301
  <pre><code class="language-python">from typing import Annotated
302
 
303
  MyModelOutputAnnotated = Annotated[MyModelOutput, &quot;shape: (B, C, H, W)&quot;]
304
  </code></pre>
305
- <h2><a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism</h2>
306
- <p>We want to touch minimally to the modeling code, and only modify it when <em>architectural changes</em> are involved. For instance, for tensor parallelism, we instead now specify a simple <code>tp_plan</code>.</p>
307
- <p>It is written once in the config and passed to <code>.from_pretrained()</code>.</p>
308
- <p>The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal <code>ParallelInterface</code>, which wires to sharding implementations <code>ColwiseParallel</code>, <code>RowwiseParallel</code>, packed variants, and so on.</p>
 
 
 
 
 
309
  <p><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
310
  base_model_tp_plan = {
311
  "layers.*.self_attn.q_proj": "colwise",
@@ -333,7 +352,7 @@ out = model(**inputs)</code></pre></p>
333
  <p>Which allows a user to run with multiple processes per node, e.g. 4 GPUs:</p>
334
  <p><code>torchrun --nproc-per-node 4 demo.py</code></p>
335
  <p>Semantics stay in the model (a Linear stays a Linear), distribution is orthogonal and declared via strings: “colwise” splits columns of weights/bias across ranks; “rowwise” splits rows; packed variants shard fused weights; The mapping keys accept glob patterns like <code>layers.*.mlp.down_proj</code> to target repeated submodules.</p>
336
- <h2><a id="layers-attentions-caches"></a> Layers, attentions and caches</h2>
337
  <p>Following the same logic, the <em>nature</em> of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we defined a mapping that can be then</p>
338
  <pre><code class="language-python">ALLOWED_LAYER_TYPES = (
339
  &quot;full_attention&quot;,
@@ -352,8 +371,8 @@ out = model(**inputs)</code></pre></p>
352
  &quot;full_attention&quot;
353
  ],
354
  </code></pre>
355
- <p>This is <a href="#minimal-user-api">minimal</a> to implement on the user side, and allows to keep the modeling untouched. It is also <a href="#modular-toolbox">easy to tweak</a>.</p>
356
- <h2><a id="community-kernels"></a>Community Kernels</h2>
357
  <p>The same principle extends to normalization, activation, and other code paths. The model defines <strong>semantics</strong>; a kernel defines <strong>how</strong> to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a <a href="#consistent-public-surface">consistent public surface</a></p>
358
  <pre><code class="language-python">@use_kernel_forward_from_hub(&quot;RMSNorm&quot;)
359
  class GlmRMSNorm(nn.Module):
@@ -361,7 +380,7 @@ class GlmRMSNorm(nn.Module):
361
  </code></pre>
362
  <p>Plus, this opened another angle of contribution for the community. People who are GPU whisperers can now contribute optimized kernels. You can check on the <a href="https://huggingface.co/blog/hello-hf-kernels">kernel community blog post</a> to learn more about it!</p>
363
  <p>Even more resources have been added, like the formidable <a href="https://github.com/huggingface/kernel-builder">kernel builder</a> with its connected resources to <a href="https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md">help you build kernels with it</a> and <a href="https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md">with nix</a>.</p>
364
- <h2>The good modularity</h2>
365
  <p>Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to <em>define standards</em>. Pushing the boundaries of scientific knowledge can translate into the boundaries of engineering if this effort is made, and we’re striving for it.
366
  It’s hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
367
  So I wanted to take a look at the current <strong>state of modularity</strong> across the repository. How many models are defined using components of others?</p>
@@ -377,12 +396,12 @@ Radically different architectures such as mamba have spawned their own dependenc
377
  <p>However, even if llava defines a few VLMs, there’s far too many vision-based architectures that are not yet defined as modulars of other existing archs. In other words, there is no strong reference point in terms of software for vision models.
378
  As you can see, there is a small DETR island, a little llava pocket, and so on, but it’s not comparable to the centrality observed for llama.</p>
379
  <p>Another problem is, this is only for <code>modular</code> models. Several models do NOT have a modular file.</p>
380
- <h2>Many models, but not enough yet, are alike</h2>
381
  <p>So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters stringed together. I also used code embedding models to check out code similarities, and it yielded better results, for the needs of this blog post I will stick to Jaccard index.</p>
382
  <p>It is interesting, for that, to look at <em>when</em> we deployed this modular logic and what was its rippling effect on the library. You can check the <a href="https://huggingface.co/spaces/Molbap/transformers-modular-refactor">larger space</a> to play around, but the gist is: adding modular allowed to connect more and more models to solid reference points. We have a lot of gaps to fill in still.</p>
383
  <p> <iframe src=https://molbap-timeline-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
384
  <p>If you’ve checked out llava, you’ve seen that llava_video is a red node, connected by a red edge to llava: it’s a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
385
- <h2>VLM improvements, avoiding abstraction</h2>
386
  <p>We don’t have cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attn bridges). This is one of the main improvement points where we can work.</p>
387
  <p>For instance, I thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into an llm decoder in 95% of the existing VLMs. It would have looked like something like</p>
388
  <pre><code class="language-python">class InputsEmbeddingMixerMixin(nn.Module):
@@ -432,16 +451,8 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
432
 
433
  return special_image_mask, special_video_mask
434
  </code></pre>
435
- <p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because it’d break the self-contained logic of the model.</p>
436
- <h2>The weight of maintenance</h2>
437
- <p>The effect of modular can be measured straight from git history: at every commit I counted LOC (lines of code) under src/transformers/models, but if a model has a modular_*.py I count it. That gives an “effective LOC” curve: the 𝗺𝗮𝗶𝗻𝘁𝗲𝗻𝗮𝗻𝗰𝗲 𝘀𝘂𝗿𝗳𝗮𝗰𝗲.</p>
438
- <p>𝗝𝘂𝘀𝘁 𝗹𝗼𝗼𝗸 𝗮𝘁 𝘁𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁: 𝘁𝗵𝗲 𝗴𝗿𝗼𝘄𝘁𝗵 𝗿𝗮𝘁𝗲 𝗼𝗳 𝗹𝗶𝗻𝗲𝘀 𝗼𝗳 𝗰𝗼𝗱𝗲 𝗰𝗼𝗹𝗹𝗮𝗽𝘀𝗲𝗱! Counting raw 𝚖𝚘𝚍𝚎𝚕𝚒𝚗𝚐_*.𝚙𝚢 (with “Copied from…” everywhere) we were around 362 LOC/day; with 𝚖𝚘𝚍𝚞𝚕𝚊𝚛 in place the effective rate is ~25 LOC/day. About 𝟭𝟱× 𝗹𝗼𝘄𝗲𝗿! Had we continued with a strict “one model, one file” policy who knows where we’d have ended up.</p>
439
- <p>Less code to hand-maintain means fewer places to break.</p>
440
- <p>Cyclomatic complexity isn’t LOC, but they strongly correlate. As Les Hatton notes, defects scale like 𝙙 ~ 𝙭 𝙡𝙣 𝙭. Lower 𝘅 (lower loc) helps.</p>
441
- <p><iframe src=https://molbap-loc-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
442
- <p>There’s a sharp drop near the end, it’s due to us <a href="https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc">removing support for Jax and TensorFlow</a> library-wide.</p>
443
- <p>Of course, it is not only this effort that allowed to reduce the maintenance load. Externalising the <a href="#external-attention-classes">attention classes</a> has moved out a lot of repeated code that was <a href="#standardize-dont-abstract">standard</a>.</p>
444
- <h2><a id="encoders-ftw"></a> Embedding models, now and forever.</h2>
445
  <p>Models popularity speaks for itself! This is because the usage of encoders lies in embeddings. So we have to keep the encoders part viable, usable, fine-tune-able.</p>
446
  <p><html>
447
  <head><meta charset="utf-8" /></head>
@@ -4329,20 +4340,21 @@ return Plotly;
4329
  </body>
4330
  </html></p>
4331
  <p>As the codebase grows, along with our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>, we need to keep maintaining this part as well. Retrieval use-cases and smart databases, such as FAISS-based indexing, rely on it, and thus indirectly on transformers.</p>
4332
- <h2>On image processing and processors</h2>
4333
  <p>Choosing to be a <code>torch</code>-first library meant dropping a tremendous amount of support for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more liberal with the number of torch-dependent utilities we add. One of these is the <em>fast processing</em> of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code>- and <code>torchvision</code>-native inputs allowed us to massively speed up the processing time for each model.</p>
4334
- <p>The gains in performance are immense, up to 20x speed for most models when compiled torchvision ops.</p>
4335
- <p><img src="fast_image_processors.png" alt="Fast Image Processors Performance"></p>
 
4336
  <h2>Reduce barrier to entry/contribution</h2>
4337
- <p>This is an overall objective: there’s no <code>transformer</code> without its community.</p>
4338
- <p>We didn’t want to make a toolbox, because <em>having a framework means forcing users into it</em>. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
4339
- <p>Among the most valuable contributions to <code>transformers</code>is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other softwares.</p>
4340
- <p>In that regard, we DO want to be a <a href="#modular-toolbox">modular toolbox</a>, being <a href="#minimal-user-api">minimal</a> enough (and hopefully well documented enough) so any ML/AI developer can use transformers without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
4341
- <h2>A surgical toolbox for model development</h2>
 
4342
  <h3>Attention visualisation</h3>
4343
- <p>If all models have the same API internally for attention computation, it allows us to build cool tools to visualize the inner workings of the attention mechanism. One particular piece of
4344
- machinery is the <code>attention mask</code>, cause of confusion.</p>
4345
- <p>Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual “causal-only” models.</p>
4346
  <p>
4347
  <div style="max-width: 940px; margin: 16px 0; border:1px solid #2a2f3a; border-radius:8px; background:#0b0f19; font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, 'Liberation Mono', 'Courier New', monospace; color:#e5e7eb;">
4348
  <div style="display:flex; align-items:center; gap:8px; padding:8px 10px; border-bottom:1px solid #1f2430; background:#111827; border-top-left-radius:8px; border-top-right-radius:8px;">
@@ -4389,7 +4401,7 @@ machinery is the <code>attention mask</code>, cause of confusion.</p>
4389
  </div>
4390
  </p>
4391
  <h3>Logging entire model activations</h3>
4392
- <p>Further, because it is all PyTorch (and it is even more now that we support only PyTorch), we can easily debug any model when we want to add it to transformers. We now have a power-user tool for porting or adding models, that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.</p>
4393
  <p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our <a href="#source-of-truth">core guideline</a>.</p>
4394
  <p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
4395
  <h3>Cooking faster CUDA warmups</h3>
@@ -4445,9 +4457,9 @@ machinery is the <code>attention mask</code>, cause of confusion.</p>
4445
  </div>
4446
 
4447
  <script>let animationSpeed=1/2.4,isRunning=!1,totalLayers=10;function startDemo(){isRunning||(isRunning=!0,document.getElementById("startBtn").disabled=!0,document.getElementById("resetBtn").disabled=!0,Promise.all([animateNoWarmup(),animateWithWarmup()]).then(()=>{isRunning=!1,document.getElementById("startBtn").disabled=!1,document.getElementById("resetBtn").disabled=!1}))}function resetDemo(){isRunning||(document.getElementById("noWarmupArea").innerHTML="",document.getElementById("warmupLayers").innerHTML="",document.getElementById("warmupFill").style.width="0%",document.getElementById("warmupContainer").classList.remove("allocated"),document.getElementById("noWarmupTime").textContent="0.00s",document.getElementById("warmupTime").textContent="0.00s",document.getElementById("noWarmupCounter").textContent="Layers loaded: 0/10",document.getElementById("warmupCounter").textContent="Layers loaded: 0/10",document.getElementById("noWarmupPhase").textContent="",document.getElementById("warmupPhase").textContent="")}async function animateNoWarmup(){let e=document.getElementById("noWarmupArea"),t=document.getElementById("noWarmupTime"),n=document.getElementById("noWarmupCounter"),a=document.getElementById("noWarmupPhase"),m=0,o=200/animationSpeed;a.textContent="Loading model layers...";for(let a=0;a<10;a++){let d=document.createElement("div");d.className="layer-box",e.appendChild(d),await sleep(.3*o),d.classList.add("allocating"),t.textContent=(m+=.08).toFixed(2)+"s",await sleep(.7*o),d.classList.remove("allocating"),d.classList.add("loaded"),t.textContent=(m+=.12).toFixed(2)+"s",n.textContent=`Layers loaded: ${a+1}/10`}a.textContent="Complete!"}async function animateWithWarmup(){let e=document.getElementById("warmupLayers"),t=document.getElementById("warmupTime"),n=document.getElementById("warmupCounter"),a=document.getElementById("warmupPhase"),m=document.getElementById("warmupContainer"),o=document.getElementById("warmupFill"),d=0,l=200/animationSpeed;a.textContent="Warming up allocator...",await sleep(2*l),m.classList.add("allocated"),t.textContent=(d+=.3).toFixed(2)+"s",a.textContent="Loading model layers...";for(let a=0;a<10;a++){let m=document.createElement("div");m.className="layer-box loaded",m.style.width="40px",m.style.height="20px",e.appendChild(m);let i=(a+1)/10*100;o.style.width=i+"%",await sleep(.5*l),t.textContent=(d+=.08).toFixed(2)+"s",n.textContent=`Layers loaded: ${a+1}/10`}a.textContent="Complete!"}function sleep(e){return new Promise(t=>setTimeout(t,e))}</script></p>
4448
- <p>It’s hard to overstate how much of a lifesaver that is when you’re trying to load a model as fast as possible, your iteration speed.</p>
4449
- <h2>Transformers-serve and continuous batching</h2>
4450
- <p>Having all these models readily available allows to use all of them with transformers-serve, and enable interfacing with them with an Open API-like pattern.</p>
4451
  <pre><code class="language-bash">transformers serve
4452
 
4453
  curl -X POST http://localhost:8000/v1/chat/completions \
@@ -4460,11 +4472,12 @@ curl -X POST http://localhost:8000/v1/chat/completions \
4460
  <p>Transformers-serve is transformers-first, for sure, but it’s not limited to that. Adding a model to transformers means:</p>
4461
  <ul>
4462
  <li>having it immediately available to the community</li>
4463
- <li>having it immediately usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great blog post.</a></li>
4464
  </ul>
4465
  <p>This cements the need even more for a <a href="#consistent-public-surface">consistent public surface</a>: we are now a backend, and there is software more optimized than ours to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a> for instance.</p>
4466
  <h2>What is coming next</h2>
4467
- <p>It sounds dumb, but it’s true: the future is very soon. One tenet that will be broken when the next major version is released, v5, <a href="#backwards-compatibility">backwards compatibility</a> will be heavily broken. Instead, what we aim to be is way more of a <a href="#modular-toolbox">modular toolbox</a>, while maintaining a <a href="#consistent-public-surface">consistent public surface</a>.</p>
 
4468
 
4469
  </d-article>
4470
 
@@ -4492,28 +4505,27 @@ curl -X POST http://localhost:8000/v1/chat/completions \
4492
 
4493
  // Extract tenet text for tooltips
4494
  const tenetTooltips = {
4495
- 'source-of-truth': 'We aim to be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances.',
4496
- 'one-model-one-file': 'All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom.',
4497
  'code-is-product': 'Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.',
4498
  'standardize-dont-abstract': 'If it\'s model behavior, keep it in the file; abstractions only for generic infra.',
4499
  'do-repeat-yourself': 'Copy when it helps users; keep successors in sync without centralizing behavior.',
4500
  'minimal-user-api': 'Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths.',
4501
  'backwards-compatibility': 'Evolve by additive standardization, never break public APIs.',
4502
- 'consistent-public-surface': 'Same argument names, same outputs, hidden states and attentions exposed.',
4503
  };
4504
 
4505
- // Add smooth scrolling and active state
4506
  const tocLinks = document.querySelectorAll('d-contents a');
4507
  tocLinks.forEach(link => {
4508
  const href = link.getAttribute('href');
4509
  const anchor = href ? href.substring(1) : '';
4510
-
4511
- // Add tooltip if this is a tenet link
4512
  if (tenetTooltips[anchor]) {
4513
- link.setAttribute('title', tenetTooltips[anchor]);
4514
  link.style.position = 'relative';
4515
  }
4516
-
4517
  link.addEventListener('click', function(e) {
4518
  e.preventDefault();
4519
  const target = document.querySelector(this.getAttribute('href'));
@@ -4522,6 +4534,16 @@ curl -X POST http://localhost:8000/v1/chat/completions \
4522
  }
4523
  });
4524
  });
 
 
 
 
 
 
 
 
 
 
4525
 
4526
  // Update active state on scroll
4527
  window.addEventListener('scroll', function() {
 
8
  <script src="https://d3js.org/d3.v7.min.js"></script>
9
  <meta name="viewport" content="width=device-width, initial-scale=1">
10
  <meta charset="utf8">
11
+ <title>Maintain the unmaintainable: 1M python loc, 400+ models</title>
12
  <link rel="stylesheet" href="style.css">
13
+ <link rel="stylesheet" href="transformers-custom.css">
14
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css">
15
  </head>
16
  <body>
17
  <d-front-matter>
18
  <script id='distill-front-matter' type="text/json">{
19
+ "title": "Maintain the unmaintainable: 1M python loc, 400+ models",
20
  "description": "A peek into software engineering for the transformers library",
21
  "published": "Aug 21, 2025",
22
  "authors": [{"author": "Pablo Montalvo", "authorURL": "https://huggingface.co/Molbap"}]
23
  }</script>
24
  </d-front-matter>
25
  <d-title>
26
+ <h1>Maintain the unmaintainable: 1M python loc, 400+ models</h1>
27
  <p>A peek into software engineering for the transformers library</p>
28
  </d-title>
29
  <d-byline></d-byline>
 
49
  </nav>
50
  </d-contents>
51
  <h2>Introduction</h2>
52
+ <p>One million lines of <code>python</code> code. Through them, the <code>transformers</code> library supports more than 400 model architectures, from state-of-the-art LLMs and VLMs to specialized models for audio, video, and tables.</p>
53
+ <p>Built on <code>PyTorch</code>, it’s a foundational tool for modern LLM usage, research, education, and tens of thousands of other open-source projects. Each AI model is added by the community, harmonized into a consistent interface, and tested daily on a CI to ensure reproducibility.</p>
54
+ <p>This scale presents a monumental engineering challenge.</p>
55
+ <p>How do you keep such a ship afloat, made of so many moving, unrelated parts, contributed to by a buzzing hivemind? Especially as the pace of ML research accelerates? We receive constant feedback on everything from function signatures with hundreds of arguments to duplicated code and optimization concerns, and we listen to all of it, or try to. The library’s usage keeps on growing, and we are a small team of maintainers and contributors, backed by hundreds of open-source community members. We continue supporting all models that come out and will continue to do so in the foreseeable future.</p>
56
+ <p>This post dissects the design philosophy that makes this possible. It’s a continuation of our older principles, detailed on our previous <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page, as well as its accompanying <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post from 2022</a>. More recently, and I recommend reading it if you haven’t yet, a blog post about <a href="https://huggingface.co/blog/faster-transformers">recent upgrades to transformers</a> was written, explaining in particular what makes the library faster today. Again, all of that development was only made possible thanks to these principles.</p>
57
+ <p>We codify the “tenets” that guide our development, demonstrate how they are implemented in code, and show the measurable impact they have on the library’s sustainability and growth.</p>
58
+ <p>For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon <code>transformers</code>, but not only: any project of comparable size will require you to make deep choices, not only on design and choice of abstraction, but on the very mindset of the software you are building.</p>
59
+ <h2>The core tenets of transformers</h2>
60
+ <p>We summarize the foundations on which we’ve built everything, and write the “tenets” of the library. They behave like <em>software interfaces</em>, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.</p>
61
+ <p>Note that the library <em>evolved</em> towards these principles, and that they <em>emerged</em> from decisions taken, and once emerged they were recognized as critical.</p>
 
 
 
 
62
  <div class="tenet-list">
63
  <ol>
64
  <li class="tenet">
65
  <a id="source-of-truth"></a>
66
  <strong>Source of Truth</strong>
67
+ <p>We aim to be a <a href="https://huggingface.co/blog/transformers-model-definition">source of truth for all model definitions</a>. This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
68
  <em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
69
  </li>
70
  <li class="tenet">
71
  <a id="one-model-one-file"></a>
72
  <strong>One Model, One File</strong>
73
+ <p>All inference and training core logic has to be visible, top‑to‑bottom, to maximize each model's hackability.</p>
74
+ <em>Every model should be completely understandable and hackable by reading a single file from top to bottom.</em>
75
  </li>
76
  <li class="tenet">
77
  <a id="code-is-product"></a>
 
96
  <a id="minimal-user-api"></a>
97
  <strong>Minimal User API</strong>
98
  <p>Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths. Reading should be obvious, configurations should be obvious.</p>
99
+ <em>Keep the public interface simple and predictable, users should know what to expect.</em>
100
  </li>
101
  <li class="tenet">
102
  <a id="backwards-compatibility"></a>
103
  <strong>Backwards Compatibility</strong>
104
+ <p>Evolve by additive standardization, never break public APIs.</p>
105
+ <p>Any artifact that was once on the hub and loadable with transformers should be usable indefinitely with the same interface. Further, public methods should not change to avoid breaking dependencies.</p>
106
+ <em>Once something is public, it stays public, evolution through addition, not breaking changes.</em>
107
  </li>
108
  <li class="tenet">
109
  <a id="consistent-public-surface"></a>
110
  <strong>Consistent Public Surface</strong>
111
+ <p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. This is a goal we have as well as a tenet.</p>
112
  <em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
113
  </li>
 
 
 
 
 
 
114
  </ol>
115
  </div>
116
  <p>When a PR is merged, it is because the contribution is worthwhile, and because the <code>transformers</code> team finds the design of the contribution to be aligned with the tenets above.</p>
117
+ <p>Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors, and crannies everywhere, built by thousands of different workers. We <em>try</em> to make sure all the code added is compliant, because if we fail and merge it anyway, we cannot change it later lest we break <a href="#backwards-compatibility">backwards compatibility</a>.</p>
118
+ <p>For instance, one function essential to the implementation of <a href="https://huggingface.co/papers/2104.09864">Rotary Positional Embeddings</a> is identical in 70 <code>modeling_&lt;file&gt;.py</code> files across <code>src/transformers/models/</code>. Why keep it? Because we want all the model logic to be <a href="#one-model-one-file">contained in the modeling file</a>. In order to do that, we <a href="#do-repeat-yourself">do repeat ourselves</a>.</p>
119
  <pre><code class="language-python">def rotate_half(x):
120
  &quot;&quot;&quot;Rotates half the hidden dims of the input.&quot;&quot;&quot;
121
  x1 = x[..., : x.shape[-1] // 2]
 
123
  return torch.cat((-x2, x1), dim=-1)
124
  </code></pre>
125
  <p>You can use a simple regex to look at all methods of a given name across your codebase and look at their differences and similarities, that’s what I did (+ a hash to avoid quadraticity).</p>
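<p>As a rough illustration of that kind of scan (here with <code>ast</code> instead of a raw regex, and a hash of each function body to avoid quadratic comparisons; not the exact script used):</p>
<pre><code class="language-python">import ast
import hashlib
from collections import defaultdict
from pathlib import Path

def find_copies(function_name: str, root: str = "src/transformers/models"):
    """Group modeling files by the hash of a same-named function body."""
    buckets = defaultdict(list)
    for path in Path(root).rglob("modeling_*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name == function_name:
                digest = hashlib.sha1(ast.dump(node).encode()).hexdigest()
                buckets[digest].append(path.name)
    return buckets

for digest, files in find_copies("rotate_half").items():
    print(digest[:8], len(files), "identical copies")
</code></pre>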
126
+ <p>We want all models to have self-contained modeling code.</p>
127
+ <p>Every core functionality <em>must</em> be in the modeling code, every non-core functionality <em>can</em> be outside of it.</p>
128
+ <p>This comes at a great cost. Enter the <code>#Copied from...</code> mechanism: for a long time, these comments indicated that some code was copied from another model, saving time both for the reviewers and for the CI. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet we could not remove them.</p>
129
+ <p>We needed to separate the two principles that had so far been intertwined: <a href="#do-repeat-yourself">repetition</a> and <a href="#one-model-one-file">hackability</a>.</p>
130
+ <p>What was the solution to this?</p>
131
+ <h2><a id="modular"></a> Modular transformers</h2>
132
+ <p>Transformers is an opinionated library. The previous <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page and the <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post</a> were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. <a href="https://huggingface.co/docs/transformers/en/modular_transformers"><code>modular</code> transformers were introduced</a>, allowing a form of inheritance without breaking <a href="#one-model-one-file">One model, One file</a>.</p>
133
  <p>We amended the principle of <a href="#do-repeat-yourself">DRY*</a> by progressively removing all pieces of code that were “copied from” another file.</p>
134
+ <p>It works as follows. In order to contribute a model, you define a <code>modular_</code> file that can inherit from <em>any function across all other modeling, configuration and processor files</em>, as in the minimal sketch below.</p>
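<p>As a minimal, hypothetical sketch (the model name and parent choices are illustrative), a modular file can look like this; the full expanded modeling file, shown on the right-hand side below, is then generated automatically from it:</p>
<pre><code class="language-python"># modular_mymodel.py (hypothetical): inherit whole blocks from an existing model,
# only overriding what actually differs.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaAttention, LlamaDecoderLayer


class MyModelConfig(LlamaConfig):
    model_type = "mymodel"


class MyModelAttention(LlamaAttention):
    pass  # identical to Llama; the generated modeling file contains the expanded code


class MyModelDecoderLayer(LlamaDecoderLayer):
    pass
</code></pre>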
135
  <summary>Auto-generated modeling code</summary>
136
  <p><div class=code-compare style="display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.5rem 0;">
137
  <div class=code-column style="border: 1px solid #e2e8f0; border-radius: 8px; overflow: hidden;">
 
282
  <strong>Left:</strong> Clean modular definition with inheritance.
283
  <strong>Right:</strong> Auto-expanded version with all inherited functionality visible.
284
  </p></p>
285
+ <p>As you can see, we can now define any model as a <em>modular</em> of another.</p>
286
+ <p>You might think “well that’s just how inheritance works”. The crucial difference is that we do <em>visibly</em> what is essentially the <em>compiler</em>’s job: by unrolling the inheritances, we make visible all of the modeling code, keeping it <a href="#one-model-one-file">all in one piece</a>.</p>
287
+ <p>What is the consequence? When adding a model, we do not need to go over the entire modeling file. The modular (left side above) is enough.</p>
288
+ <p>When <code>AutoModel.from_pretrained(...)</code> is called, it is indeed the modeling (right side) that is run, and all the tests run against the modeling code.</p>
289
+ <p>What does that give us?</p>
290
+ <h3>A maintainable control surface</h3>
291
+ <p>The effect of modular can be measured straight from git history: at every commit, we look under the model directory.
292
+ If it only has a modeling file, we add its LOC count.
293
+ However, if a model has a <code>modular_*.py</code> and a corresponding automatically generated <code>modeling_*.py</code>, we only count the LOC under the modular file. The modeling code has no maintenance cost, as it is strictly dependent on the modular file.</p>
294
+ <p>That gives an “effective LOC” curve: the 𝗺𝗮𝗶𝗻𝘁𝗲𝗻𝗮𝗻𝗰𝗲 𝘀𝘂𝗿𝗳𝗮𝗰𝗲.</p>
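<p>Conceptually, a single snapshot of that count looks like the sketch below (the historical curve simply repeats it at every commit); the file-name conventions are real, the script itself is illustrative:</p>
<pre><code class="language-python">from pathlib import Path

def effective_loc(models_root: str = "src/transformers/models") -> int:
    """Count maintained lines: prefer modular_*.py, fall back to modeling_*.py."""
    total = 0
    for model_dir in sorted(Path(models_root).iterdir()):
        if not model_dir.is_dir():
            continue
        files = list(model_dir.glob("modular_*.py")) or list(model_dir.glob("modeling_*.py"))
        total += sum(len(f.read_text().splitlines()) for f in files)
    return total

print(effective_loc())
</code></pre>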
295
+ <p>𝗝𝘂𝘀𝘁 𝗹𝗼𝗼𝗸 𝗮𝘁 𝘁𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁: 𝘁𝗵𝗲 𝗴𝗿𝗼𝘄𝘁𝗵 𝗿𝗮𝘁𝗲 𝗼𝗳 𝗹𝗶𝗻𝗲𝘀 𝗼𝗳 𝗰𝗼𝗱𝗲 𝗰𝗼𝗹𝗹𝗮𝗽𝘀𝗲𝗱! Counting raw 𝚖𝚘𝚍𝚎𝚕𝚒𝚗𝚐_*.𝚙𝚢 (with “Copied from…” everywhere) we were around 362 new LOC/day; with 𝚖𝚘𝚍𝚞𝚕𝚊𝚛 in place the effective rate is ~25 LOC/day. About 𝟭𝟱× 𝗹𝗼𝘄𝗲𝗿! Had we continued with a strict “one model, one file” policy who knows where we’d have ended up.</p>
296
+ <p>Less code to hand-maintain means fewer places to break.</p>
297
+ <p>Cyclomatic complexity isn’t LOC, but they strongly correlate. As Les Hatton notes, defects scale like 𝙙 ~ 𝙭 𝙡𝙣 𝙭. Lower 𝘅 (lower loc) helps.</p>
298
+ <p><iframe src=https://molbap-loc-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
299
+ <p>There’s a sharp drop near the end; it’s due to us <a href="https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc">removing support for Jax and TensorFlow</a> library-wide.</p>
300
+ <p>Of course, it is not only this effort that allowed us to reduce the maintenance load.</p>
301
+ <p>A related optimization is the following. You’ve likely heard about <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a> and its many variants.</p>
302
+ <p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
303
+ <p>However, we were adding backend-specific torch operations for each of them (SDPA, flash-attention iterations, flex attention), and that was not a <a href="#minimal-user-api">minimal user API</a>.</p>
304
+ <h3><a id="attention-classes"></a> External Attention classes</h3>
305
+ <p>Externalising the <a href="#external-attention-classes">attention classes</a> has moved out a lot of repeated code that was <a href="#standardize-dont-abstract">standard</a>.</p>
306
+ <p>We moved to an <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allowed the following:</p>
307
+ <p>We keep a <code>Callable</code> for the naive implementation of the attention, called “eager” computation. This Callable is named <code>eager_attention_forward</code>, and can be run as long as the user has <code>torch</code> installed, which is a requirement in any case.</p>
308
+ <p>In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and other Callables are used, including kernel bindings.</p>
309
  <pre><code class="language-python">attention_interface: Callable = eager_attention_forward
310
  if self.config._attn_implementation != &quot;eager&quot;:
311
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
312
  </code></pre>
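<p>Per the <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> documentation, a custom implementation can also be registered under a name and selected through the config; the sketch below is simplified (the checkpoint is just an example, and the signature is abridged):</p>
<pre><code class="language-python">import torch
from transformers import AttentionInterface, AutoModelForCausalLM

def my_eager_like_attention(module, query, key, value, attention_mask, scaling=None, dropout=0.0, **kwargs):
    # Repeat K/V heads for grouped-query attention, then do a plain softmax attention.
    groups = query.shape[1] // key.shape[1]
    key, value = key.repeat_interleave(groups, dim=1), value.repeat_interleave(groups, dim=1)
    scaling = scaling if scaling is not None else query.shape[-1] ** -0.5
    scores = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        scores = scores + attention_mask[:, :, :, : key.shape[-2]]
    weights = torch.nn.functional.softmax(scores, dim=-1)
    output = torch.matmul(weights, value).transpose(1, 2).contiguous()
    return output, weights

AttentionInterface.register("my_eager_like_attention", my_eager_like_attention)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", attn_implementation="my_eager_like_attention")
</code></pre>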
313
+ <p>A strength of the new attention interface is that it can enforce specific kwargs, which are needed by kernel providers and other dependencies. We know that kwargs are often a necessary evil that plagues tools aiming for widespread compatibility; they are something we have aimed to reduce, and will continue to reduce, in order to improve readability. Even with them, the current system is a <a href="#minimal-user-api">minimal user api</a>.</p>
 
314
  <p>For better <em>information</em>, we plan to use <code>python</code> features such as <code>Annotated</code> to inform users of what we typically expect in an argument. That way, higher-level information could be included directly in the type annotations, like so (tentative design):</p>
315
  <pre><code class="language-python">from typing import Annotated
316
 
317
  MyModelOutputAnnotated = Annotated[MyModelOutput, &quot;shape: (B, C, H, W)&quot;]
318
  </code></pre>
319
+ <h3><a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism</h3>
320
+ <p>If you’re not familiar with the different flavours of parallelism, I recommend checking out <a href="https://huggingface.co/blog/accelerate-nd-parallel">this blog post</a> first; and of course a full <a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook">dive into the ultra-scale playbook</a> is always worthwhile.</p>
321
+ <p>The essential part is that, as <a href="https://huggingface.co/docs/transformers/v4.56.2/perf_train_gpu_many#tensor-parallelism">the documentation states</a>, when tensors get too large to fit on a single GPU, they are sliced along a particular dimension and every slice is sent to a different GPU.</p>
322
+ <p>Why does it matter?</p>
323
+ <p>Because we want to avoid code modifications that are unrelated to the model.
324
+ We choose to place the level of abstraction higher than the device placement: a matrix multiplication - an <code>nn.Linear</code> layer - should always be expressed in the same way, regardless of how it is placed.</p>
325
+ <p>Hence, we want to touch the modeling code <a href="#minimal-user-api">minimally</a>, and only modify it when <em>architectural changes</em> are involved. For instance, for tensor parallelism, we now specify a simple <code>tp_plan</code> instead.</p>
326
+ <p>The alternative would be to modify parent classes specific to their parallel layout, which would leak distribution logic into the modeling code.</p>
327
+ <p>It is written once in the config and passed to <code>.from_pretrained()</code>. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal <code>ParallelInterface</code>, which wires to sharding implementations <code>ColwiseParallel</code>, <code>RowwiseParallel</code>, packed variants, and so on.</p>
328
  <p><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
329
  base_model_tp_plan = {
330
  "layers.*.self_attn.q_proj": "colwise",
 
352
  <p>This allows a user to run with multiple processes per node, e.g. 4 GPUs:</p>
353
  <p><code>torchrun --nproc-per-node 4 demo.py</code></p>
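<p>A hypothetical <code>demo.py</code> for that command can be as small as the sketch below (the checkpoint name is just an example); note that nothing about parallelism appears in the modeling code itself:</p>
<pre><code class="language-python"># demo.py - run with: torchrun --nproc-per-node 4 demo.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, tp_plan="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Tensor parallelism lets us ", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</code></pre>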
354
  <p>Semantics stay in the model (a Linear stays a Linear); distribution is orthogonal and declared via strings: “colwise” splits columns of weights/bias across ranks; “rowwise” splits rows; packed variants shard fused weights. The mapping keys accept glob patterns like <code>layers.*.mlp.down_proj</code> to target repeated submodules.</p>
355
+ <h3><a id="layers-attentions-caches"></a> Layers, attentions and caches</h3>
356
  <p>Following the same logic, the <em>nature</em> of attention and caching per layer of a model should not be hardcoded. We should be able to specify, in a configuration-based fashion, how each layer is implemented. Thus we defined a set of allowed layer types that the configuration can then reference:</p>
357
  <pre><code class="language-python">ALLOWED_LAYER_TYPES = (
358
  &quot;full_attention&quot;,
 
371
  &quot;full_attention&quot;
372
  ],
373
  </code></pre>
374
+ <p>This is <a href="#minimal-user-api">minimal</a> to implement on the user side, and allows us to keep the modeling untouched. It is also easy to tweak, as the sketch below illustrates.</p>
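<p>A minimal sketch, with placeholder classes, of how such a per-layer list can drive construction without touching the layer implementations themselves:</p>
<pre><code class="language-python">import torch.nn as nn

class FullAttention(nn.Module):
    """Placeholder for a full-attention layer."""

class SlidingAttention(nn.Module):
    """Placeholder for a sliding-window attention layer."""

LAYER_CLASSES = {"full_attention": FullAttention, "sliding_attention": SlidingAttention}

class Decoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        # config.layer_types is the list from the snippet above,
        # e.g. ["sliding_attention", "full_attention", ...]
        self.layers = nn.ModuleList([LAYER_CLASSES[t]() for t in config.layer_types])
</code></pre>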
375
+ <h3><a id="community-kernels"></a>Community Kernels</h3>
376
  <p>The same principle extends to normalization, activation, and other code paths. The model defines <strong>semantics</strong>; a kernel defines <strong>how</strong> to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a <a href="#consistent-public-surface">consistent public surface</a></p>
377
  <pre><code class="language-python">@use_kernel_forward_from_hub(&quot;RMSNorm&quot;)
378
  class GlmRMSNorm(nn.Module):
 
380
  </code></pre>
381
  <p>Plus, this opened another angle of contribution for the community. People who are GPU whisperers can now contribute optimized kernels. You can check on the <a href="https://huggingface.co/blog/hello-hf-kernels">kernel community blog post</a> to learn more about it!</p>
382
  <p>Even more resources have been added, like the formidable <a href="https://github.com/huggingface/kernel-builder">kernel builder</a> with its connected resources to <a href="https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md">help you build kernels with it</a> and <a href="https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md">with nix</a>.</p>
383
+ <h2>Modular developments</h2>
384
  <p>Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to <em>define standards</em>. Pushing the boundaries of scientific knowledge can translate into the boundaries of engineering if this effort is made, and we’re striving for it.
385
  It’s hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
386
  So I wanted to take a look at the current <strong>state of modularity</strong> across the repository. How many models are defined using components of others?</p>
 
396
  <p>However, even if llava defines a few VLMs, there are far too many vision-based architectures that are not yet defined as modulars of other existing archs. In other words, there is no strong software reference point for vision models.
397
  As you can see, there is a small DETR island, a little llava pocket, and so on, but it’s not comparable to the centrality observed for llama.</p>
398
  <p>Another problem: this only covers <code>modular</code> models. Several models do NOT have a modular file.</p>
399
+ <h3>Many models, but not enough yet, are alike</h3>
400
  <p>So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters strung together. I also used code embedding models to check code similarities, and it yielded better results, but for the needs of this blog post I will stick to the Jaccard index.</p>
401
  <p>It is interesting, for that, to look at <em>when</em> we deployed this modular logic and what its rippling effect on the library was. You can check the <a href="https://huggingface.co/spaces/Molbap/transformers-modular-refactor">larger space</a> to play around, but the gist is: adding modular allowed us to connect more and more models to solid reference points. We still have a lot of gaps to fill in.</p>
402
  <p> <iframe src=https://molbap-timeline-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
403
  <p>If you’ve checked out llava, you’ve seen that llava_video is a red node, connected by a red edge to llava: it’s a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
404
+ <h3>VLM improvements, avoiding abstraction</h3>
405
  <p>We don’t have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attn bridges). This is one of the main areas where we can still improve.</p>
406
  <p>For instance, I thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into the LLM decoder in 95% of the existing VLMs. It would have looked something like this:</p>
407
  <pre><code class="language-python">class InputsEmbeddingMixerMixin(nn.Module):
 
451
 
452
  return special_image_mask, special_video_mask
453
  </code></pre>
454
+ <p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because it’d break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
455
+ <h3><a id="encoders-ftw"></a> Embedding models, now and forever.</h3>
 
 
 
 
 
 
 
 
456
  <p>Model popularity speaks for itself! This is because much of the usage of encoders lies in embeddings. So we have to keep the encoder side of the library viable, usable, and fine-tunable.</p>
457
  <p><html>
458
  <head><meta charset="utf-8" /></head>
 
4340
  </body>
4341
  </html></p>
4342
  <p>As the codebase grows, along with our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>, we need to keep maintaining this part as well. Retrieval use-cases and smart databases, such as FAISS-based indexing, rely on it, and thus indirectly on transformers.</p>
4343
+ <h3>On image processing and processors</h3>
4344
  <p>Choosing to be a <code>torch</code>-first library meant dropping a tremendous amount of support for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more liberal with the number of torch-dependent utilities we add. One of these is the <em>fast processing</em> of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code>- and <code>torchvision</code>-native inputs allowed us to massively speed up the processing time for each model.</p>
4345
+ <p>The gains in performance are immense: up to 20x speedups for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on GPU.</p>
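<p>Opting in is a one-liner on the user side (the checkpoint below is just an example), and the processor then works with <code>torch</code> tensors end-to-end:</p>
<pre><code class="language-python">import requests
from PIL import Image
from transformers import AutoImageProcessor

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)
</code></pre>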
4346
+ <p><img src="static/fast_image_processors.png" alt="Fast Image Processors Performance"></p>
4347
+ <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
4348
  <h2>Reduce barrier to entry/contribution</h2>
4349
+ <p>This is an overall objective: there’s no <code>transformers</code> without its community.</p>
4350
+ <p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
4351
+ <p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other pieces of software.</p>
4352
+ <p>In that regard, we DO want to be a modular toolbox, being <a href="#minimal-user-api">minimal</a> enough and well documented enough so any ML/AI developer can use <code>transformers</code> without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
4353
+ <p>So, how do these design choices, these “tenets”, influence the development of models and the overall usage of transformers?</p>
4354
+ <h3>A surgical toolbox for model development</h3>
4355
  <h3>Attention visualisation</h3>
4356
+ <p>All models have the same API internally for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>. This allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
4357
+ <p>One particular piece of machinery is the <code>attention mask</code>. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual “causal-only” models.</p>
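<p>The underlying pattern is simple to express in plain <code>torch</code> (a sketch, not the library’s actual mask-building code): everything inside the prefix attends bidirectionally, everything after it stays causal.</p>
<pre><code class="language-python">import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean mask where True means 'may attend'."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # bidirectional block over the (text + image) prefix
    return mask

print(prefix_lm_mask(seq_len=6, prefix_len=3).int())
</code></pre>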
 
4358
  <p>
4359
  <div style="max-width: 940px; margin: 16px 0; border:1px solid #2a2f3a; border-radius:8px; background:#0b0f19; font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, 'Liberation Mono', 'Courier New', monospace; color:#e5e7eb;">
4360
  <div style="display:flex; align-items:center; gap:8px; padding:8px 10px; border-bottom:1px solid #1f2430; background:#111827; border-top-left-radius:8px; border-top-right-radius:8px;">
 
4401
  </div>
4402
  </p>
4403
  <h3>Logging entire model activations</h3>
4404
+ <p>Further, because it is all PyTorch (even more so now that we only support PyTorch), we can easily <a href="https://huggingface.co/docs/transformers/internal/model_debugging_utils">debug any model</a> when we want to add it to transformers. We now have a power-user tool for porting or adding models that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.</p>
4405
  <p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our <a href="#source-of-truth">core guideline</a>.</p>
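<p>The same idea can be sketched with nothing but PyTorch forward hooks (this is not the transformers utility itself, just the gist of it):</p>
<pre><code class="language-python">import torch

def trace_forward(model, **inputs):
    """Run one forward pass and record per-module input/output shapes and dtypes."""
    log, handles = {}, []

    def make_hook(name):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            log[name] = {
                "inputs": [(tuple(a.shape), str(a.dtype)) for a in args if isinstance(a, torch.Tensor)],
                "output": (tuple(out.shape), str(out.dtype)) if isinstance(out, torch.Tensor) else str(type(out)),
            }
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        for handle in handles:
            handle.remove()
    return log
</code></pre>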
4406
  <p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
4407
  <h3>Cooking faster CUDA warmups</h3>
 
4457
  </div>
4458
 
4459
  <script>let animationSpeed=1/2.4,isRunning=!1,totalLayers=10;function startDemo(){isRunning||(isRunning=!0,document.getElementById("startBtn").disabled=!0,document.getElementById("resetBtn").disabled=!0,Promise.all([animateNoWarmup(),animateWithWarmup()]).then(()=>{isRunning=!1,document.getElementById("startBtn").disabled=!1,document.getElementById("resetBtn").disabled=!1}))}function resetDemo(){isRunning||(document.getElementById("noWarmupArea").innerHTML="",document.getElementById("warmupLayers").innerHTML="",document.getElementById("warmupFill").style.width="0%",document.getElementById("warmupContainer").classList.remove("allocated"),document.getElementById("noWarmupTime").textContent="0.00s",document.getElementById("warmupTime").textContent="0.00s",document.getElementById("noWarmupCounter").textContent="Layers loaded: 0/10",document.getElementById("warmupCounter").textContent="Layers loaded: 0/10",document.getElementById("noWarmupPhase").textContent="",document.getElementById("warmupPhase").textContent="")}async function animateNoWarmup(){let e=document.getElementById("noWarmupArea"),t=document.getElementById("noWarmupTime"),n=document.getElementById("noWarmupCounter"),a=document.getElementById("noWarmupPhase"),m=0,o=200/animationSpeed;a.textContent="Loading model layers...";for(let a=0;a<10;a++){let d=document.createElement("div");d.className="layer-box",e.appendChild(d),await sleep(.3*o),d.classList.add("allocating"),t.textContent=(m+=.08).toFixed(2)+"s",await sleep(.7*o),d.classList.remove("allocating"),d.classList.add("loaded"),t.textContent=(m+=.12).toFixed(2)+"s",n.textContent=`Layers loaded: ${a+1}/10`}a.textContent="Complete!"}async function animateWithWarmup(){let e=document.getElementById("warmupLayers"),t=document.getElementById("warmupTime"),n=document.getElementById("warmupCounter"),a=document.getElementById("warmupPhase"),m=document.getElementById("warmupContainer"),o=document.getElementById("warmupFill"),d=0,l=200/animationSpeed;a.textContent="Warming up allocator...",await sleep(2*l),m.classList.add("allocated"),t.textContent=(d+=.3).toFixed(2)+"s",a.textContent="Loading model layers...";for(let a=0;a<10;a++){let m=document.createElement("div");m.className="layer-box loaded",m.style.width="40px",m.style.height="20px",e.appendChild(m);let i=(a+1)/10*100;o.style.width=i+"%",await sleep(.5*l),t.textContent=(d+=.08).toFixed(2)+"s",n.textContent=`Layers loaded: ${a+1}/10`}a.textContent="Complete!"}function sleep(e){return new Promise(t=>setTimeout(t,e))}</script></p>
4460
+ <p>It’s hard to overstate how much of a lifesaver that is when you’re trying to load a model as fast as possible, as it’s the narrowest bottleneck for your iteration speed.</p>
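<p>The trick itself boils down to something like the sketch below (an illustration of the idea, not the internal implementation): reserve one large block up front so the caching allocator does not have to grow its pool layer by layer while the checkpoint is being copied in.</p>
<pre><code class="language-python">import torch

def warm_up_allocator(num_bytes: int, device: str = "cuda:0") -> None:
    # Allocate and free one big block; the CUDA caching allocator keeps the reservation,
    # so subsequent per-layer weight loads reuse it instead of hitting cudaMalloc repeatedly.
    block = torch.empty(num_bytes, dtype=torch.uint8, device=device)
    del block
    torch.cuda.synchronize(device)

warm_up_allocator(2 * 1024**3)  # e.g. ~2 GB reserved ahead of loading the weights
</code></pre>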
4461
+ <h3>Transformers-serve and continuous batching</h3>
4462
+ <p>Having all these models readily available allows using all of them with transformers-serve, and enables interfacing with them through an OpenAI-compatible API. As a reminder, the hub also opens access to various <a href="https://huggingface.co/docs/inference-providers/en/index">inference providers</a> if you’re interested in model deployment in general.</p>
4463
  <pre><code class="language-bash">transformers serve
4464
 
4465
  curl -X POST http://localhost:8000/v1/chat/completions \
 
4472
  <p>Transformers-serve is transformers-first, for sure, but it’s not limited to that. Adding a model to transformers means:</p>
4473
  <ul>
4474
  <li>having it immediately available to the community</li>
4475
+ <li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great blog post.</a></li>
4476
  </ul>
4477
  <p>This cements the need even more for a <a href="#consistent-public-surface">consistent public surface</a>: we are now a backend, and there is software more optimized than ours to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a> for instance.</p>
4478
  <h2>What is coming next</h2>
4479
+ <p>The next major version of <code>transformers</code> is just around the corner. When v5 is released, we will try to keep <a href="#backwards-compatibility">backwards compatibility</a> as solid as possible. The changes we make now are there to ensure this.</p>
4480
+ <p>What we aim to be is way more of a modular toolbox. What we are not is a framework: you should not be FORCED to rewrite every modeling file, but it is <em>better</em> for your model to inherit from PreTrainedModel and get tensor parallelism, from_pretrained, sharding, push_to_hub, loss, as well as PEFT/TRL/SGLang/vLLM compatibility and other fine-tuning and fast inference options.</p>
4481
 
4482
  </d-article>
4483
 
 
4505
 
4506
  // Extract tenet text for tooltips
4507
  const tenetTooltips = {
4508
+ 'source-of-truth': 'We aim to be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances.',
4509
+ 'one-model-one-file': 'All inference and training core logic has to be visible, top‑to‑bottom, to maximize each model\'s hackability.',
4510
  'code-is-product': 'Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.',
4511
  'standardize-dont-abstract': 'If it\'s model behavior, keep it in the file; abstractions only for generic infra.',
4512
  'do-repeat-yourself': 'Copy when it helps users; keep successors in sync without centralizing behavior.',
4513
  'minimal-user-api': 'Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths.',
4514
  'backwards-compatibility': 'Evolve by additive standardization, never break public APIs.',
4515
+ 'consistent-public-surface': 'Same argument names, same outputs, hidden states and attentions exposed, enforced by tests.',
4516
  };
4517
 
4518
+ // Add smooth scrolling and custom tooltips to all tenet links (TOC and article)
4519
  const tocLinks = document.querySelectorAll('d-contents a');
4520
  tocLinks.forEach(link => {
4521
  const href = link.getAttribute('href');
4522
  const anchor = href ? href.substring(1) : '';
4523
+
 
4524
  if (tenetTooltips[anchor]) {
4525
+ link.setAttribute('data-tooltip', tenetTooltips[anchor]);
4526
  link.style.position = 'relative';
4527
  }
4528
+
4529
  link.addEventListener('click', function(e) {
4530
  e.preventDefault();
4531
  const target = document.querySelector(this.getAttribute('href'));
 
4534
  }
4535
  });
4536
  });
4537
+
4538
+ // Add custom tooltips to tenet links in article content
4539
+ const articleLinks = document.querySelectorAll('d-article a[href^="#"]');
4540
+ articleLinks.forEach(link => {
4541
+ const href = link.getAttribute('href');
4542
+ const anchor = href ? href.substring(1) : '';
4543
+ if (tenetTooltips[anchor]) {
4544
+ link.setAttribute('data-tooltip', tenetTooltips[anchor]);
4545
+ }
4546
+ });
4547
 
4548
  // Update active state on scroll
4549
  window.addEventListener('scroll', function() {
dist/main.bundle.js CHANGED
@@ -1631,29 +1631,32 @@ p code, li code {
1631
  /* Distill article improvements */
1632
  d-article {
1633
  max-width: none;
1634
- font-size: 18px; /* Increased from default ~16px */
1635
- line-height: 1.7;
 
 
 
1636
  }
1637
 
1638
  d-article > * {
1639
- max-width: 1100px; /* Increased from 900px for more space */
1640
- margin-left: auto;
1641
- margin-right: auto;
1642
  }
1643
 
1644
- /* Make content even wider on large screens when TOC is present */
1645
- @media (min-width: 1400px) {
1646
  d-article > * {
1647
- max-width: 1300px;
 
1648
  }
1649
  }
1650
 
1651
  /* Improve paragraph readability */
1652
  d-article p {
1653
- font-size: 18px;
1654
- line-height: 1.8;
1655
- margin-bottom: 1.5rem;
1656
- color: #2d3748;
1657
  }
1658
 
1659
  /* Improve heading sizes */
@@ -1668,7 +1671,8 @@ d-article h1 {
1668
  d-article h2 {
1669
  font-size: 2.5rem;
1670
  line-height: 1.3;
1671
- margin: 2.5rem 0 1.5rem 0;
 
1672
  color: #1a202c;
1673
  font-weight: 650;
1674
  }
@@ -1697,7 +1701,7 @@ d-article ol li {
1697
  margin-bottom: 0.5rem;
1698
  }
1699
 
1700
- /* Enhanced tenet reference styling with tooltips */
1701
  a[href^="#source-of-truth"],
1702
  a[href^="#one-model-one-file"],
1703
  a[href^="#code-is-product"],
@@ -1713,7 +1717,6 @@ a[href^="#modular-toolbox"] {
1713
  text-decoration: underline;
1714
  text-decoration-color: rgba(102, 126, 234, 0.3);
1715
  transition: all 0.3s ease;
1716
- cursor: help;
1717
  }
1718
 
1719
  a[href^="#source-of-truth"]:hover,
@@ -1732,27 +1735,9 @@ a[href^="#modular-toolbox"]:hover {
1732
  border-radius: 4px;
1733
  }
1734
 
1735
- /* Tooltip content for each tenet */
1736
- a[href^="#source-of-truth"]:after { content: "We should be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances."; }
1737
- a[href^="#one-model-one-file"]:after { content: "All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom."; }
1738
- a[href^="#code-is-product"]:after { content: "Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial."; }
1739
- a[href^="#standardize-dont-abstract"]:after { content: "If it's model behavior, keep it in the file; abstractions only for generic infra."; }
1740
- a[href^="#do-repeat-yourself"]:after { content: "Copy when it helps users; keep successors in sync without centralizing behavior."; }
1741
- a[href^="#minimal-user-api"]:after { content: "Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths."; }
1742
- a[href^="#backwards-compatibility"]:after { content: "Evolve by additive standardization, never break public APIs."; }
1743
- a[href^="#consistent-public-surface"]:after { content: "Same argument names, same outputs, hidden states and attentions exposed."; }
1744
- a[href^="#modular-toolbox"]:after { content: "Provide tools and utilities, but don't force users into a rigid framework."; }
1745
-
1746
- /* Universal tooltip styling for tenet references */
1747
- a[href^="#source-of-truth"]:after,
1748
- a[href^="#one-model-one-file"]:after,
1749
- a[href^="#code-is-product"]:after,
1750
- a[href^="#standardize-dont-abstract"]:after,
1751
- a[href^="#do-repeat-yourself"]:after,
1752
- a[href^="#minimal-user-api"]:after,
1753
- a[href^="#backwards-compatibility"]:after,
1754
- a[href^="#consistent-public-surface"]:after,
1755
- a[href^="#modular-toolbox"]:after {
1756
  position: absolute;
1757
  bottom: 100%;
1758
  left: 50%;
@@ -1775,16 +1760,7 @@ a[href^="#modular-toolbox"]:after {
1775
  margin-bottom: 8px;
1776
  }
1777
 
1778
- /* Tooltip arrows */
1779
- a[href^="#source-of-truth"]:before,
1780
- a[href^="#one-model-one-file"]:before,
1781
- a[href^="#code-is-product"]:before,
1782
- a[href^="#standardize-dont-abstract"]:before,
1783
- a[href^="#do-repeat-yourself"]:before,
1784
- a[href^="#minimal-user-api"]:before,
1785
- a[href^="#backwards-compatibility"]:before,
1786
- a[href^="#consistent-public-surface"]:before,
1787
- a[href^="#modular-toolbox"]:before {
1788
  content: '';
1789
  position: absolute;
1790
  bottom: 100%;
@@ -1798,25 +1774,8 @@ a[href^="#modular-toolbox"]:before {
1798
  transition: opacity 0.3s ease, visibility 0.3s ease;
1799
  }
1800
 
1801
- /* Show tooltips on hover */
1802
- a[href^="#source-of-truth"]:hover:after,
1803
- a[href^="#one-model-one-file"]:hover:after,
1804
- a[href^="#code-is-product"]:hover:after,
1805
- a[href^="#standardize-dont-abstract"]:hover:after,
1806
- a[href^="#do-repeat-yourself"]:hover:after,
1807
- a[href^="#minimal-user-api"]:hover:after,
1808
- a[href^="#backwards-compatibility"]:hover:after,
1809
- a[href^="#consistent-public-surface"]:hover:after,
1810
- a[href^="#modular-toolbox"]:hover:after,
1811
- a[href^="#source-of-truth"]:hover:before,
1812
- a[href^="#one-model-one-file"]:hover:before,
1813
- a[href^="#code-is-product"]:hover:before,
1814
- a[href^="#standardize-dont-abstract"]:hover:before,
1815
- a[href^="#do-repeat-yourself"]:hover:before,
1816
- a[href^="#minimal-user-api"]:hover:before,
1817
- a[href^="#backwards-compatibility"]:hover:before,
1818
- a[href^="#consistent-public-surface"]:hover:before,
1819
- a[href^="#modular-toolbox"]:hover:before {
1820
  opacity: 1;
1821
  visibility: visible;
1822
  }
@@ -1834,6 +1793,36 @@ d-article blockquote {
1834
  color: #4a5568;
1835
  }
1836
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1837
  /* Full width elements */
1838
  d-article .code-compare,
1839
  d-article .interactive-demo,
@@ -1858,11 +1847,13 @@ d-article .memory-chart-container {
1858
  .tenet-list li.tenet {
1859
  padding: 1rem;
1860
  }
1861
-
1862
  .interactive-demo .demo-content {
1863
  padding: 1rem;
1864
  }
1865
- }`, "",{"version":3,"sources":["webpack://./src/transformers-custom.css"],"names":[],"mappings":"AAAA,4CAA4C;;AAE5C,2BAA2B;AAC3B;IACI,aAAa;IACb,8BAA8B;IAC9B,WAAW;IACX,cAAc;IACd,kBAAkB;AACtB;;AAEA;IACI,mBAAmB;IACnB,yBAAyB;IACzB,kBAAkB;IAClB,gBAAgB;IAChB,wCAAwC;AAC5C;;AAEA;IACI,mBAAmB;IACnB,qBAAqB;IACrB,gBAAgB;IAChB,cAAc;IACd,gCAAgC;IAChC,gBAAgB;AACpB;;AAEA;IACI,SAAS;IACT,aAAa;IACb,mBAAmB;IACnB,gBAAgB;IAChB,iBAAiB;IACjB,gBAAgB;AACpB;;AAEA;IACI,cAAc;AAClB;;AAEA,8CAA8C;AAC9C;IACI;QACI,0BAA0B;QAC1B,SAAS;IACb;AACJ;;AAEA,+DAA+D;AAC/D;IACI,cAAc;AAClB;;AAEA;IACI,+BAA+B,EAAE,iBAAiB;IAClD,gBAAgB;IAChB,eAAe;IACf,aAAa;IACb,0BAA0B;IAC1B,WAAW;IACX,gBAAgB;IAChB,cAAc;AAClB;;AAEA;IACI,gCAAgC;IAChC,6DAA6D;IAC7D,yBAAyB;IACzB,mBAAmB;IACnB,4BAA4B;IAC5B,SAAS;IACT,kBAAkB;IAClB,2CAA2C;IAC3C,yBAAyB;IACzB,eAAe;AACnB;;AAEA;IACI,uCAAuC;IACvC,2CAA2C;IAC3C,oCAAoC;IACpC,6DAA6D;AACjE;;AAEA,8BAA8B;AAC9B,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;AAC1G,2CAA2C,6DAA6D,EAAE;;AAE1G;IACI,+BAA+B;IAC/B,kBAAkB;IAClB,UAAU;IACV,WAAW;IACX,YAAY;IACZ,WAAW;IACX,YAAY;IACZ,kBAAkB;IAClB,aAAa;IACb,mBAAmB;IACnB,uBAAuB;IACvB,gBAAgB;IAChB,iBAAiB;IACjB,0CAA0C;IAC1C,uBAAuB;AAC3B;;AAEA;IACI,cAAc;IACd,gBAAgB;IAChB,cAAc;IACd,qBAAqB;AACzB;;AAEA;IACI,cAAc;IACd,iBAAiB;IACjB,kBAAkB;IAClB,cAAc;IACd,mBAAmB;IACnB,aAAa;IACb,+BAA+B;IAC/B,kBAAkB;IAClB,8BAA8B;AAClC;;AAEA;IACI,cAAc;IACd,gBAAgB;IAChB,gBAAgB;AACpB;;AAEA,iDAAiD;AACjD;IACI,KAAK,0CAA0C,EAAE;IACjD,MAAM,0CAA0C,EAAE;IAClD,OAAO,0CAA0C,EAAE;AACvD;;AAEA;IACI,6CAA6C;AACjD;;AAEA,kCAAkC;AAClC;IACI,yBAAyB;IACzB,mBAAmB;IACnB,mBAAmB;IACnB,cAAc;IACd,gBAAgB;IAChB,yCAAyC;AAC7C;;AAEA,yCAAyC;AACzC;IACI,6BAA6B;IAC7B,mCAAmC;AACvC;;AAEA;IACI,6DAA6D;IAC7D,YAAY;IACZ,oBAAoB;IACpB,gBAAgB;AACpB;;AAEA;IACI,eAAe;AACnB;;AAEA;IACI,mBAAmB;IACnB,oBAAoB;IACpB,6BAA6B;IAC7B,cAAc;IACd,gBAAgB;AACpB;;AAEA,4CAA4C;AAC5C;IACI,6DAA6D;IAC7D,YAAY;IACZ,YAAY;IACZ,uBAAuB;IACvB,kBAAkB;IAClB,gBAAgB;IAChB,eAAe;IACf,2CAA2C;AAC/C;;AAEA;IACI,2BAA2B;IAC3B,+CAA+C;AACnD;;AAEA;IACI,YAAY;IACZ,mBAAmB;IACnB,eAAe;IACf,gBAAgB;AACpB;;AAEA,qBAAqB;AACrB;IACI,mBAAmB;IACnB,kBAAkB;IAClB,aAAa;IACb,cAAc;IACd,wDAAwD;IACxD,gBAAgB;AACpB;;AAEA;IACI,mBAAmB;IACnB,yBAAyB;IACzB,cAAc;IACd,eAAe;IACf,kBAAkB;IAClB,WAAW;IACX,oBAAoB;AACxB;;AAEA;IACI,mBAAmB;IACnB,aAAa;IACb,kBAAkB;IAClB,qBAAqB;IACrB,qBAAqB;IACrB,iBAAiB;IACjB,iBAAiB;IACjB,gBAAgB;AACpB;;AAEA,oCAAoC;AACpC;IACI,sBAAsB;IACtB,gBAAgB;IAChB,yBAAyB;IACzB,cAAc;AAClB;;AAEA;IACI,sBAAsB;IACtB,gBAAgB;IAChB,kBAAkB;IAClB,eAAe;AACnB;;AAEA,yBAAyB;AACzB;IACI,mBAAmB;IACnB,yBAAyB;IACzB,kBAAkB;IAClB,aAAa;IACb,cAAc;AAClB;;AAEA,+BAA+B;AAC/B;IACI,eAAe;IACf,YAAY;IACZ,kBAAkB;IAClB,yCAAyC;IACzC,gBAAgB;AACpB;;AAEA,kEAAkE;AAClE;IACI;QACI,4BAA4B;IAChC;;IAEA;QACI,4BAA4B;QAC5B,4BAA4B;QAC5B,+BAA+B;QAC/B,6BAA6B;QAC7B,kCAAkC;QAClC,4BAA4B;QAC5B,0BAA0B;QAC1B,6BAA6B;QAC7B,4BAA4B;QAC5B,mCAAmC,EAAE,eAAe;QACpD,2BAA2B;QAC3B,oBAAoB;QACpB,2BAA2B;QAC3B,qCAAqC;QACrC,gCAAgC;QAChC,+CAA+C;QAC/C,wBAAwB;QACxB,yBAAyB;QACzB,8BAA8B;IAClC;AACJ;;AAEA;IACI;QACI,wBAAwB;QACxB,4BAA4B;QAC5B,8BAA8B;QAC9B,4BAA4B;QAC5B,gCAAgC;QAChC,6BAA6B;QAC7B,+BAA+B;QAC/B,sDAAsD;QACtD,6BAA6B;QAC7B,qCAAqC;QACrC,gCAAgC;QAChC,wBAAwB;IAC5B;AACJ;;AAEA,0DAA0D;AAC1D;IACI,yBAAyB;IACzB,8BAA8B;IAC9B,qBAAqB;AACzB;;AAEA,2BAA2B;AAC3B;IACI,qBAAqB;IACrB,gCAAgC;IAChC,sBAAsB;AAC1B;;AAEA;IACI,iBAAiB;IACjB,gBAAgB;IAChB,WAAW;AACf;;AAEA;IACI,yBAAyB;IACzB,qBAAqB;IACrB,mBAAmB;IACnB,cAAc;IACd,iBAAiB;IACjB,gBAAgB;IAChB,gBAAgB;IAChB,2BAA2B;AAC/B;;AAEA;IACI,cAAc;IACd,qBAA
qB;AACzB;;AAEA;IACI,cAAc;IACd,gBAAgB;AACpB;;AAEA;IACI,qBAAqB;AACzB;;AAEA,qBAAqB;AACrB;IACI,qBAAqB;IACrB,mDAAmD;AACvD;;AAEA;IACI,UAAU;AACd;;AAEA;IACI,uBAAuB;AAC3B;;AAEA;IACI,kCAAkC;IAClC,kBAAkB;AACtB;;AAEA;IACI,kCAAkC;AACtC;;AAEA,2CAA2C;AAC3C;IACI,kBAAkB;IAClB,YAAY;AAChB;;AAEA;IACI,cAAc;AAClB;;AAEA,8DAA8D;AAC9D;IACI,oBAAoB;IACpB,kBAAkB;IAClB,UAAU;IACV,QAAQ;IACR,2BAA2B;IAC3B,mBAAmB;IACnB,YAAY;IACZ,qBAAqB;IACrB,kBAAkB;IAClB,iBAAiB;IACjB,mBAAmB;IACnB,YAAY;IACZ,gBAAgB;IAChB,aAAa;IACb,UAAU;IACV,kBAAkB;IAClB,mDAAmD;IACnD,oBAAoB;IACpB,yCAAyC;AAC7C;;AAEA;IACI,WAAW;IACX,kBAAkB;IAClB,UAAU;IACV,QAAQ;IACR,gCAAgC;IAChC,6BAA6B;IAC7B,2BAA2B;IAC3B,aAAa;IACb,UAAU;IACV,kBAAkB;IAClB,mDAAmD;AACvD;;AAEA;;IAEI,UAAU;IACV,mBAAmB;AACvB;;AAEA,+BAA+B;AAC/B;IACI;QACI,UAAU;QACV,WAAW;QACX,kBAAkB;QAClB,YAAY;IAChB;;IAEA;QACI,UAAU;QACV,WAAW;QACX,+BAA+B;QAC/B,+BAA+B;QAC/B,0BAA0B;IAC9B;AACJ;;AAEA,gDAAgD;AAChD;IACI,8BAA8B;IAC9B,oCAAoC;IACpC,6BAA6B;IAC7B,0BAA0B;IAC1B,2BAA2B;IAC3B,2BAA2B;IAC3B,2BAA2B;IAC3B,2BAA2B;AAC/B;;AAEA;IACI,2BAA2B;IAC3B,kFAAkF;IAClF,yBAAyB;AAC7B;;AAEA,gBAAgB;AAChB;IACI,8BAA8B;IAC9B,+BAA+B;IAC/B,6BAA6B;IAC7B,2BAA2B;IAC3B,yBAAyB;AAC7B;;AAEA,iCAAiC;AACjC;IACI,eAAe;IACf,eAAe,EAAE,iCAAiC;IAClD,gBAAgB;AACpB;;AAEA;IACI,iBAAiB,EAAE,wCAAwC;IAC3D,iBAAiB;IACjB,kBAAkB;AACtB;;AAEA,iEAAiE;AACjE;IACI;QACI,iBAAiB;IACrB;AACJ;;AAEA,kCAAkC;AAClC;IACI,eAAe;IACf,gBAAgB;IAChB,qBAAqB;IACrB,cAAc;AAClB;;AAEA,0BAA0B;AAC1B;IACI,eAAe;IACf,gBAAgB;IAChB,qBAAqB;IACrB,cAAc;IACd,gBAAgB;AACpB;;AAEA;IACI,iBAAiB;IACjB,gBAAgB;IAChB,yBAAyB;IACzB,cAAc;IACd,gBAAgB;AACpB;;AAEA;IACI,eAAe;IACf,gBAAgB;IAChB,qBAAqB;IACrB,cAAc;IACd,gBAAgB;AACpB;;AAEA;IACI,iBAAiB;IACjB,gBAAgB;IAChB,uBAAuB;IACvB,cAAc;IACd,gBAAgB;AACpB;;AAEA,6BAA6B;AAC7B;;IAEI,eAAe;IACf,gBAAgB;IAChB,qBAAqB;AACzB;;AAEA,mDAAmD;AACnD;;;;;;;;;IASI,kBAAkB;IAClB,cAAc;IACd,gBAAgB;IAChB,0BAA0B;IAC1B,+CAA+C;IAC/C,yBAAyB;IACzB,YAAY;AAChB;;AAEA;;;;;;;;;IASI,cAAc;IACd,8BAA8B;IAC9B,oCAAoC;IACpC,gBAAgB;IAChB,kBAAkB;AACtB;;AAEA,mCAAmC;AACnC,oCAAoC,uKAAuK,EAAE;AAC7M,uCAAuC,oHAAoH,EAAE;AAC7J,oCAAoC,wKAAwK,EAAE;AAC9M,8CAA8C,4FAA4F,EAAE;AAC5I,uCAAuC,2FAA2F,EAAE;AACpI,qCAAqC,8HAA8H,EAAE;AACrK,4CAA4C,uEAAuE,EAAE;AACrH,8CAA8C,mFAAmF,EAAE;AACnI,oCAAoC,qFAAqF,EAAE;;AAE3H,mDAAmD;AACnD;;;;;;;;;IASI,kBAAkB;IAClB,YAAY;IACZ,SAAS;IACT,2BAA2B;IAC3B,mBAAmB;IACnB,YAAY;IACZ,qBAAqB;IACrB,kBAAkB;IAClB,iBAAiB;IACjB,gBAAgB;IAChB,mBAAmB;IACnB,YAAY;IACZ,gBAAgB;IAChB,aAAa;IACb,UAAU;IACV,kBAAkB;IAClB,mDAAmD;IACnD,oBAAoB;IACpB,yCAAyC;IACzC,kBAAkB;AACtB;;AAEA,mBAAmB;AACnB;;;;;;;;;IASI,WAAW;IACX,kBAAkB;IAClB,YAAY;IACZ,SAAS;IACT,2BAA2B;IAC3B,6BAA6B;IAC7B,yBAAyB;IACzB,aAAa;IACb,UAAU;IACV,kBAAkB;IAClB,mDAAmD;AACvD;;AAEA,2BAA2B;AAC3B;;;;;;;;;;;;;;;;;;IAkBI,UAAU;IACV,mBAAmB;AACvB;;AAEA,+BAA+B;AAC/B;IACI,eAAe;IACf,gBAAgB;IAChB,oBAAoB;IACpB,cAAc;IACd,8BAA8B;IAC9B,4DAA4D;IAC5D,0BAA0B;IAC1B,kBAAkB;IAClB,cAAc;AAClB;;AAEA,wBAAwB;AACxB;;;IAGI,eAAe;IACf,WAAW;IACX,cAAc;IACd,eAAe;AACnB;;AAEA,mCAAmC;AACnC;IACI;;QAEI,cAAc;QACd,iBAAiB;QACjB,kBAAkB;IACtB;AACJ;;AAEA;IACI;QACI,aAAa;IACjB;;IAEA;QACI,aAAa;IACjB;AACJ","sourcesContent":["/* Transformers-specific styling additions */\n\n/* Code comparison layout */\n.code-compare {\n display: grid;\n grid-template-columns: 1fr 1fr;\n gap: 1.5rem;\n margin: 2rem 0;\n align-items: start;\n}\n\n.code-compare .code-column {\n background: #ffffff;\n border: 1px solid #e2e8f0;\n border-radius: 8px;\n overflow: hidden;\n box-shadow: 0 1px 3px rgba(0, 0, 0, 0.1);\n}\n\n.code-compare .code-header {\n background: #f8f9fa;\n padding: 0.75rem 1rem;\n font-weight: 600;\n color: #495057;\n border-bottom: 1px solid 
#e2e8f0;\n font-size: 0.9em;\n}\n\n.code-compare pre {\n margin: 0;\n padding: 1rem;\n background: #ffffff;\n overflow-x: auto;\n font-size: 0.85em;\n line-height: 1.4;\n}\n\n.code-compare pre code {\n color: #374151;\n}\n\n/* Mobile responsiveness for code comparison */\n@media (max-width: 768px) {\n .code-compare {\n grid-template-columns: 1fr;\n gap: 1rem;\n }\n}\n\n/* Tenet styling - special highlighting for design principles */\n.tenet-list {\n margin: 3rem 0;\n}\n\n.tenet-list ol {\n counter-reset: tenet-counter -1; /* Start from 0 */\n list-style: none;\n padding-left: 0;\n display: grid;\n grid-template-columns: 1fr;\n gap: 2.5rem;\n max-width: 900px;\n margin: 0 auto;\n}\n\n.tenet-list li.tenet {\n counter-increment: tenet-counter;\n background: linear-gradient(135deg, #ffffff 0%, #f8f9fa 100%);\n border: 2px solid #e2e8f0;\n border-radius: 16px;\n padding: 2rem 2rem 2rem 4rem;\n margin: 0;\n position: relative;\n box-shadow: 0 12px 35px rgba(0, 0, 0, 0.12);\n transition: all 0.3s ease;\n cursor: pointer;\n}\n\n.tenet-list li.tenet:hover {\n transform: translateY(-8px) scale(1.02);\n box-shadow: 0 20px 50px rgba(0, 0, 0, 0.25);\n border-color: rgba(0, 123, 255, 0.5);\n background: linear-gradient(135deg, #ffffff 0%, #f0f8ff 100%);\n}\n\n/* Colorful numbering system */\n.tenet-list li.tenet:nth-child(1):before { background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); }\n.tenet-list li.tenet:nth-child(2):before { background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); }\n.tenet-list li.tenet:nth-child(3):before { background: linear-gradient(135deg, #4facfe 0%, #00f2fe 100%); }\n.tenet-list li.tenet:nth-child(4):before { background: linear-gradient(135deg, #43e97b 0%, #38f9d7 100%); }\n.tenet-list li.tenet:nth-child(5):before { background: linear-gradient(135deg, #fa709a 0%, #fee140 100%); }\n.tenet-list li.tenet:nth-child(6):before { background: linear-gradient(135deg, #a8edea 0%, #fed6e3 100%); }\n.tenet-list li.tenet:nth-child(7):before { background: linear-gradient(135deg, #ff9a9e 0%, #fecfef 100%); }\n.tenet-list li.tenet:nth-child(8):before { background: linear-gradient(135deg, #a18cd1 0%, #fbc2eb 100%); }\n.tenet-list li.tenet:nth-child(9):before { background: linear-gradient(135deg, #ffecd2 0%, #fcb69f 100%); }\n\n.tenet-list li.tenet:before {\n content: counter(tenet-counter);\n position: absolute;\n top: -12px;\n left: -12px;\n color: white;\n width: 48px;\n height: 48px;\n border-radius: 50%;\n display: flex;\n align-items: center;\n justify-content: center;\n font-size: 1.2em;\n font-weight: bold;\n box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);\n border: 3px solid white;\n}\n\n.tenet-list li.tenet strong {\n color: #1a202c;\n font-size: 1.1em;\n display: block;\n margin-bottom: 0.5rem;\n}\n\n.tenet-list li.tenet em {\n color: #4a5568;\n font-size: 0.95em;\n font-style: italic;\n display: block;\n margin-top: 0.75rem;\n padding: 1rem;\n background: rgba(0, 0, 0, 0.03);\n border-radius: 8px;\n border-left: 3px solid #e2e8f0;\n}\n\n.tenet-list li.tenet p {\n color: #2d3748;\n line-height: 1.6;\n margin: 0.5rem 0;\n}\n\n/* Add a subtle pulse animation for the numbers */\n@keyframes pulse-glow {\n 0% { box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15); }\n 50% { box-shadow: 0 4px 20px rgba(0, 0, 0, 0.25); }\n 100% { box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15); }\n}\n\n.tenet-list li.tenet:hover:before {\n animation: pulse-glow 2s ease-in-out infinite;\n}\n\n/* Interactive component styling */\n.interactive-demo {\n border: 1px solid #e2e8f0;\n border-radius: 12px;\n 
background: #ffffff;\n margin: 2rem 0;\n overflow: hidden;\n box-shadow: 0 4px 6px rgba(0, 0, 0, 0.07);\n}\n\n/* Model visualization fragment styling */\n[id*=\"plot-model-visualisation\"] {\n margin: 1rem -2rem !important;\n width: calc(100% + 4rem) !important;\n}\n\n.interactive-demo .demo-header {\n background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);\n color: white;\n padding: 1rem 1.5rem;\n font-weight: 600;\n}\n\n.interactive-demo .demo-content {\n padding: 1.5rem;\n}\n\n.interactive-demo .demo-footer {\n background: #f8f9fa;\n padding: 1rem 1.5rem;\n border-top: 1px solid #e2e8f0;\n color: #6c757d;\n font-size: 0.9em;\n}\n\n/* Button styling for interactive elements */\n.btn-primary {\n background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);\n border: none;\n color: white;\n padding: 0.75rem 1.5rem;\n border-radius: 6px;\n font-weight: 500;\n cursor: pointer;\n transition: transform 0.2s, box-shadow 0.2s;\n}\n\n.btn-primary:hover {\n transform: translateY(-1px);\n box-shadow: 0 4px 12px rgba(102, 126, 234, 0.3);\n}\n\n.btn-primary:disabled {\n opacity: 0.6;\n cursor: not-allowed;\n transform: none;\n box-shadow: none;\n}\n\n/* Terminal styling */\n.terminal-container {\n background: #1a202c;\n border-radius: 8px;\n padding: 1rem;\n color: #e2e8f0;\n font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;\n font-size: 0.9em;\n}\n\n.terminal-input {\n background: #2d3748;\n border: 1px solid #4a5568;\n color: #e2e8f0;\n padding: 0.5rem;\n border-radius: 4px;\n width: 100%;\n font-family: inherit;\n}\n\n.terminal-output {\n background: #0a0e1a;\n padding: 1rem;\n border-radius: 4px;\n white-space: pre-wrap;\n word-break: break-all;\n min-height: 100px;\n max-height: 300px;\n overflow-y: auto;\n}\n\n/* Attention visualization styling */\n.attention-matrix {\n font-family: monospace;\n font-size: 0.8em;\n border-collapse: collapse;\n margin: 1rem 0;\n}\n\n.attention-matrix td {\n border: 1px solid #ddd;\n padding: 4px 8px;\n text-align: center;\n min-width: 50px;\n}\n\n/* Memory chart styling */\n.memory-chart-container {\n background: #f8f9fa;\n border: 2px solid #e9ecef;\n border-radius: 8px;\n padding: 1rem;\n margin: 1rem 0;\n}\n\n/* Image styling improvements */\nimg {\n max-width: 100%;\n height: auto;\n border-radius: 8px;\n box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);\n margin: 1.5rem 0;\n}\n\n/* Table of contents styling - Fixed positioning like ultrascale */\n@media (min-width: 1200px) {\n d-article {\n overflow: visible !important;\n }\n \n d-contents {\n align-self: start !important;\n background: white !important;\n grid-column-start: 1 !important;\n grid-column-end: 4 !important;\n grid-row: auto / span 6 !important;\n justify-self: end !important;\n margin-top: 0em !important;\n padding-right: 3em !important;\n padding-left: 2em !important;\n position: -webkit-sticky !important; /* For Safari */\n position: sticky !important;\n top: 10px !important;\n overflow-y: auto !important;\n height: calc(100vh - 40px) !important;\n scrollbar-width: none !important;\n transition: max-height 0.3s ease-out !important;\n z-index: -100 !important;\n display: block !important;\n visibility: visible !important;\n }\n}\n\n@media (max-width: 1199px) {\n d-contents {\n display: none !important;\n background: white !important;\n justify-self: start !important;\n align-self: start !important;\n padding-bottom: 0.5em !important;\n margin-bottom: 1em !important;\n padding-left: 0.25em !important;\n border-bottom: 1px solid rgba(0, 0, 0, 0.1) !important;\n overflow-y: scroll 
!important;\n height: calc(100vh - 40px) !important;\n scrollbar-width: none !important;\n z-index: -100 !important;\n }\n}\n\n/* Force TOC to be visible and override distill defaults */\nd-contents {\n display: block !important;\n visibility: visible !important;\n opacity: 1 !important;\n}\n\n/* TOC Navigation styling */\nd-contents .toc-header {\n margin-bottom: 1.5rem;\n border-bottom: 2px solid #007bff;\n padding-bottom: 0.5rem;\n}\n\nd-contents .toc-title {\n font-weight: bold;\n font-size: 1.2em;\n color: #333;\n}\n\nd-contents nav a {\n color: rgba(0, 0, 0, 0.7);\n text-decoration: none;\n border-bottom: none;\n display: block;\n padding: 0.3rem 0;\n font-size: 0.9em;\n line-height: 1.4;\n transition: color 0.2s ease;\n}\n\nd-contents nav a:hover {\n color: #007bff;\n text-decoration: none;\n}\n\nd-contents nav a.active {\n color: #007bff;\n font-weight: 600;\n}\n\nd-contents nav div {\n margin-bottom: 0.2rem;\n}\n\n/* Smooth scrollbar */\nd-contents {\n scrollbar-width: thin;\n scrollbar-color: rgba(0, 123, 255, 0.3) transparent;\n}\n\nd-contents::-webkit-scrollbar {\n width: 6px;\n}\n\nd-contents::-webkit-scrollbar-track {\n background: transparent;\n}\n\nd-contents::-webkit-scrollbar-thumb {\n background: rgba(0, 123, 255, 0.3);\n border-radius: 3px;\n}\n\nd-contents::-webkit-scrollbar-thumb:hover {\n background: rgba(0, 123, 255, 0.5);\n}\n\n/* Custom tooltip styling for tenet links */\nd-contents nav a[title] {\n position: relative;\n cursor: help;\n}\n\nd-contents nav a[title]:hover {\n color: #667eea;\n}\n\n/* Enhanced tooltip using CSS (fallback for title attribute) */\nd-contents nav a[title]:after {\n content: attr(title);\n position: absolute;\n left: 100%;\n top: 50%;\n transform: translateY(-50%);\n background: #1a202c;\n color: white;\n padding: 0.75rem 1rem;\n border-radius: 8px;\n font-size: 0.85em;\n white-space: normal;\n width: 300px;\n line-height: 1.4;\n z-index: 1001;\n opacity: 0;\n visibility: hidden;\n transition: opacity 0.3s ease, visibility 0.3s ease;\n pointer-events: none;\n box-shadow: 0 4px 12px rgba(0, 0, 0, 0.2);\n}\n\nd-contents nav a[title]:before {\n content: '';\n position: absolute;\n left: 100%;\n top: 50%;\n transform: translate(-8px, -50%);\n border: 8px solid transparent;\n border-right-color: #1a202c;\n z-index: 1002;\n opacity: 0;\n visibility: hidden;\n transition: opacity 0.3s ease, visibility 0.3s ease;\n}\n\nd-contents nav a[title]:hover:after,\nd-contents nav a[title]:hover:before {\n opacity: 1;\n visibility: visible;\n}\n\n/* Adjust for smaller screens */\n@media (max-width: 1400px) {\n d-contents nav a[title]:after {\n left: auto;\n right: 100%;\n margin-right: 1rem;\n width: 250px;\n }\n \n d-contents nav a[title]:before {\n left: auto;\n right: 100%;\n transform: translate(8px, -50%);\n border-right-color: transparent;\n border-left-color: #1a202c;\n }\n}\n\n/* Improve code syntax highlighting with Prism */\npre[class*=\"language-\"] {\n background: #f8f9fa !important;\n border: 1px solid #e9ecef !important;\n border-radius: 8px !important;\n padding: 1.5rem !important;\n margin: 1.5rem 0 !important;\n overflow-x: auto !important;\n font-size: 0.9em !important;\n line-height: 1.5 !important;\n}\n\ncode[class*=\"language-\"] {\n background: none !important;\n font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', 'Courier New', monospace !important;\n color: #383a42 !important;\n}\n\n/* Inline code */\np code, li code {\n background: #f1f3f4 !important;\n padding: 0.2em 0.4em !important;\n border-radius: 3px !important;\n font-size: 
0.9em !important;\n color: #d73a49 !important;\n}\n\n/* Distill article improvements */\nd-article {\n max-width: none;\n font-size: 18px; /* Increased from default ~16px */\n line-height: 1.7;\n}\n\nd-article > * {\n max-width: 1100px; /* Increased from 900px for more space */\n margin-left: auto;\n margin-right: auto;\n}\n\n/* Make content even wider on large screens when TOC is present */\n@media (min-width: 1400px) {\n d-article > * {\n max-width: 1300px;\n }\n}\n\n/* Improve paragraph readability */\nd-article p {\n font-size: 18px;\n line-height: 1.8;\n margin-bottom: 1.5rem;\n color: #2d3748;\n}\n\n/* Improve heading sizes */\nd-article h1 {\n font-size: 3rem;\n line-height: 1.2;\n margin: 3rem 0 2rem 0;\n color: #1a202c;\n font-weight: 700;\n}\n\nd-article h2 {\n font-size: 2.5rem;\n line-height: 1.3;\n margin: 2.5rem 0 1.5rem 0;\n color: #1a202c;\n font-weight: 650;\n}\n\nd-article h3 {\n font-size: 2rem;\n line-height: 1.4;\n margin: 2rem 0 1rem 0;\n color: #1a202c;\n font-weight: 600;\n}\n\nd-article h4 {\n font-size: 1.5rem;\n line-height: 1.4;\n margin: 1.5rem 0 1rem 0;\n color: #2d3748;\n font-weight: 600;\n}\n\n/* Improve list readability */\nd-article ul li,\nd-article ol li {\n font-size: 18px;\n line-height: 1.7;\n margin-bottom: 0.5rem;\n}\n\n/* Enhanced tenet reference styling with tooltips */\na[href^=\"#source-of-truth\"],\na[href^=\"#one-model-one-file\"],\na[href^=\"#code-is-product\"],\na[href^=\"#standardize-dont-abstract\"],\na[href^=\"#do-repeat-yourself\"],\na[href^=\"#minimal-user-api\"],\na[href^=\"#backwards-compatibility\"],\na[href^=\"#consistent-public-surface\"],\na[href^=\"#modular-toolbox\"] {\n position: relative;\n color: #667eea;\n font-weight: 600;\n text-decoration: underline;\n text-decoration-color: rgba(102, 126, 234, 0.3);\n transition: all 0.3s ease;\n cursor: help;\n}\n\na[href^=\"#source-of-truth\"]:hover,\na[href^=\"#one-model-one-file\"]:hover,\na[href^=\"#code-is-product\"]:hover,\na[href^=\"#standardize-dont-abstract\"]:hover,\na[href^=\"#do-repeat-yourself\"]:hover,\na[href^=\"#minimal-user-api\"]:hover,\na[href^=\"#backwards-compatibility\"]:hover,\na[href^=\"#consistent-public-surface\"]:hover,\na[href^=\"#modular-toolbox\"]:hover {\n color: #4c51bf;\n text-decoration-color: #4c51bf;\n background: rgba(102, 126, 234, 0.1);\n padding: 2px 4px;\n border-radius: 4px;\n}\n\n/* Tooltip content for each tenet */\na[href^=\"#source-of-truth\"]:after { content: \"We should be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances.\"; }\na[href^=\"#one-model-one-file\"]:after { content: \"All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom.\"; }\na[href^=\"#code-is-product\"]:after { content: \"Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.\"; }\na[href^=\"#standardize-dont-abstract\"]:after { content: \"If it's model behavior, keep it in the file; abstractions only for generic infra.\"; }\na[href^=\"#do-repeat-yourself\"]:after { content: \"Copy when it helps users; keep successors in sync without centralizing behavior.\"; }\na[href^=\"#minimal-user-api\"]:after { content: \"Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. 
We want the least amount of codepaths.\"; }\na[href^=\"#backwards-compatibility\"]:after { content: \"Evolve by additive standardization, never break public APIs.\"; }\na[href^=\"#consistent-public-surface\"]:after { content: \"Same argument names, same outputs, hidden states and attentions exposed.\"; }\na[href^=\"#modular-toolbox\"]:after { content: \"Provide tools and utilities, but don't force users into a rigid framework.\"; }\n\n/* Universal tooltip styling for tenet references */\na[href^=\"#source-of-truth\"]:after,\na[href^=\"#one-model-one-file\"]:after,\na[href^=\"#code-is-product\"]:after,\na[href^=\"#standardize-dont-abstract\"]:after,\na[href^=\"#do-repeat-yourself\"]:after,\na[href^=\"#minimal-user-api\"]:after,\na[href^=\"#backwards-compatibility\"]:after,\na[href^=\"#consistent-public-surface\"]:after,\na[href^=\"#modular-toolbox\"]:after {\n position: absolute;\n bottom: 100%;\n left: 50%;\n transform: translateX(-50%);\n background: #1a202c;\n color: white;\n padding: 0.75rem 1rem;\n border-radius: 8px;\n font-size: 0.85em;\n font-weight: 400;\n white-space: normal;\n width: 320px;\n line-height: 1.4;\n z-index: 1001;\n opacity: 0;\n visibility: hidden;\n transition: opacity 0.3s ease, visibility 0.3s ease;\n pointer-events: none;\n box-shadow: 0 4px 12px rgba(0, 0, 0, 0.2);\n margin-bottom: 8px;\n}\n\n/* Tooltip arrows */\na[href^=\"#source-of-truth\"]:before,\na[href^=\"#one-model-one-file\"]:before,\na[href^=\"#code-is-product\"]:before,\na[href^=\"#standardize-dont-abstract\"]:before,\na[href^=\"#do-repeat-yourself\"]:before,\na[href^=\"#minimal-user-api\"]:before,\na[href^=\"#backwards-compatibility\"]:before,\na[href^=\"#consistent-public-surface\"]:before,\na[href^=\"#modular-toolbox\"]:before {\n content: '';\n position: absolute;\n bottom: 100%;\n left: 50%;\n transform: translateX(-50%);\n border: 8px solid transparent;\n border-top-color: #1a202c;\n z-index: 1002;\n opacity: 0;\n visibility: hidden;\n transition: opacity 0.3s ease, visibility 0.3s ease;\n}\n\n/* Show tooltips on hover */\na[href^=\"#source-of-truth\"]:hover:after,\na[href^=\"#one-model-one-file\"]:hover:after,\na[href^=\"#code-is-product\"]:hover:after,\na[href^=\"#standardize-dont-abstract\"]:hover:after,\na[href^=\"#do-repeat-yourself\"]:hover:after,\na[href^=\"#minimal-user-api\"]:hover:after,\na[href^=\"#backwards-compatibility\"]:hover:after,\na[href^=\"#consistent-public-surface\"]:hover:after,\na[href^=\"#modular-toolbox\"]:hover:after,\na[href^=\"#source-of-truth\"]:hover:before,\na[href^=\"#one-model-one-file\"]:hover:before,\na[href^=\"#code-is-product\"]:hover:before,\na[href^=\"#standardize-dont-abstract\"]:hover:before,\na[href^=\"#do-repeat-yourself\"]:hover:before,\na[href^=\"#minimal-user-api\"]:hover:before,\na[href^=\"#backwards-compatibility\"]:hover:before,\na[href^=\"#consistent-public-surface\"]:hover:before,\na[href^=\"#modular-toolbox\"]:hover:before {\n opacity: 1;\n visibility: visible;\n}\n\n/* Improve blockquote styling */\nd-article blockquote {\n font-size: 19px;\n line-height: 1.8;\n padding: 1.5rem 2rem;\n margin: 2rem 0;\n border-left: 4px solid #667eea;\n background: linear-gradient(135deg, #f8f9fa 0%, #e9ecef 50%);\n border-radius: 0 8px 8px 0;\n font-style: italic;\n color: #4a5568;\n}\n\n/* Full width elements */\nd-article .code-compare,\nd-article .interactive-demo,\nd-article .memory-chart-container {\n max-width: none;\n width: 100%;\n margin-left: 0;\n margin-right: 0;\n}\n\n/* Responsive design improvements */\n@media (max-width: 1200px) {\n d-article 
.code-compare,\n d-article .interactive-demo {\n max-width: 95%;\n margin-left: auto;\n margin-right: auto;\n }\n}\n\n@media (max-width: 768px) {\n .tenet-list li.tenet {\n padding: 1rem;\n }\n \n .interactive-demo .demo-content {\n padding: 1rem;\n }\n}"],"sourceRoot":""}]);
 
 
1866
  // Exports
1867
  /* harmony default export */ const __WEBPACK_DEFAULT_EXPORT__ = (___CSS_LOADER_EXPORT___);
1868
 
@@ -1985,7 +1976,7 @@ var update = injectStylesIntoStyleTag_default()(style/* default */.A, options);

  // Import any additional functionality
- console.log('Scaling Insanity loaded');
+ console.log('blog loaded');

  // Add any custom JavaScript functionality here
  document.addEventListener('DOMContentLoaded', function () {
 
dist/main.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
src/distill.js CHANGED
@@ -2102,7 +2102,7 @@ d-appendix > distill-appendix {
  </div>
  <div >
  <h3>Published</h3>
- <div>August, 2025</div>
+ <div>October, 2025</div>
  </div>
  </div>

src/index.js CHANGED
@@ -2,7 +2,7 @@
  import './style.css';

  // Import any additional functionality
- console.log('Scaling Insanity loaded');
+ console.log('blog loaded');

  // Add any custom JavaScript functionality here
  document.addEventListener('DOMContentLoaded', function() {
src/transformers-custom.css CHANGED
@@ -486,29 +486,32 @@ p code, li code {
  /* Distill article improvements */
  d-article {
  max-width: none;
- font-size: 18px; /* Increased from default ~16px */
- line-height: 1.7;
+ font-size: 19px;
+ line-height: 1.7 !important;
+ color: #1a1a1a;
+ padding-top: 1rem !important;
+ grid-row-gap: 0 !important;
  }

  d-article > * {
- max-width: 1100px; /* Increased from 900px for more space */
- margin-left: auto;
- margin-right: auto;
+ grid-column: middle !important;
+ max-width: none;
  }

- /* Make content even wider on large screens when TOC is present */
- @media (min-width: 1400px) {
+ /* Adjust for TOC on larger screens */
+ @media (min-width: 1200px) {
  d-article > * {
- max-width: 1300px;
+ grid-column: text / page-end !important;
+ max-width: none;
  }
  }

  /* Improve paragraph readability */
  d-article p {
- font-size: 18px;
- line-height: 1.8;
- margin-bottom: 1.5rem;
- color: #2d3748;
+ font-size: 19px;
+ line-height: 1.5;
+ margin-top: 0 !important;
+ color: #1a1a1a;
  }

  /* Improve heading sizes */

@@ -523,7 +526,8 @@ d-article h1 {
  d-article h2 {
  font-size: 2.5rem;
  line-height: 1.3;
- margin: 2.5rem 0 1.5rem 0;
+ margin: 1.5rem 0 0.75rem 0 !important;
+ padding-bottom: 0.5rem !important;
  color: #1a202c;
  font-weight: 650;
  }

@@ -552,7 +556,7 @@ d-article ol li {
  margin-bottom: 0.5rem;
  }

- /* Enhanced tenet reference styling with tooltips */
+ /* Enhanced tenet reference styling with custom tooltips */
  a[href^="#source-of-truth"],
  a[href^="#one-model-one-file"],
  a[href^="#code-is-product"],

@@ -568,7 +572,6 @@ a[href^="#modular-toolbox"] {
  text-decoration: underline;
  text-decoration-color: rgba(102, 126, 234, 0.3);
  transition: all 0.3s ease;
- cursor: help;
  }

  a[href^="#source-of-truth"]:hover,

@@ -587,27 +590,9 @@ a[href^="#modular-toolbox"]:hover {
  border-radius: 4px;
  }

- /* Tooltip content for each tenet */
- a[href^="#source-of-truth"]:after { content: "We should be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances."; }
- a[href^="#one-model-one-file"]:after { content: "All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom."; }
- a[href^="#code-is-product"]:after { content: "Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial."; }
- a[href^="#standardize-dont-abstract"]:after { content: "If it's model behavior, keep it in the file; abstractions only for generic infra."; }
- a[href^="#do-repeat-yourself"]:after { content: "Copy when it helps users; keep successors in sync without centralizing behavior."; }
- a[href^="#minimal-user-api"]:after { content: "Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths."; }
- a[href^="#backwards-compatibility"]:after { content: "Evolve by additive standardization, never break public APIs."; }
- a[href^="#consistent-public-surface"]:after { content: "Same argument names, same outputs, hidden states and attentions exposed."; }
- a[href^="#modular-toolbox"]:after { content: "Provide tools and utilities, but don't force users into a rigid framework."; }
-
- /* Universal tooltip styling for tenet references */
- a[href^="#source-of-truth"]:after,
- a[href^="#one-model-one-file"]:after,
- a[href^="#code-is-product"]:after,
- a[href^="#standardize-dont-abstract"]:after,
- a[href^="#do-repeat-yourself"]:after,
- a[href^="#minimal-user-api"]:after,
- a[href^="#backwards-compatibility"]:after,
- a[href^="#consistent-public-surface"]:after,
- a[href^="#modular-toolbox"]:after {
+ /* Custom tooltip using data-tooltip attribute */
+ a[data-tooltip]:after {
+ content: attr(data-tooltip);
  position: absolute;
  bottom: 100%;
  left: 50%;

@@ -630,16 +615,7 @@ a[href^="#modular-toolbox"]:after {
  margin-bottom: 8px;
  }

- /* Tooltip arrows */
- a[href^="#source-of-truth"]:before,
- a[href^="#one-model-one-file"]:before,
- a[href^="#code-is-product"]:before,
- a[href^="#standardize-dont-abstract"]:before,
- a[href^="#do-repeat-yourself"]:before,
- a[href^="#minimal-user-api"]:before,
- a[href^="#backwards-compatibility"]:before,
- a[href^="#consistent-public-surface"]:before,
- a[href^="#modular-toolbox"]:before {
+ a[data-tooltip]:before {
  content: '';
  position: absolute;
  bottom: 100%;

@@ -653,25 +629,8 @@ a[href^="#modular-toolbox"]:before {
  transition: opacity 0.3s ease, visibility 0.3s ease;
  }

- /* Show tooltips on hover */
- a[href^="#source-of-truth"]:hover:after,
- a[href^="#one-model-one-file"]:hover:after,
- a[href^="#code-is-product"]:hover:after,
- a[href^="#standardize-dont-abstract"]:hover:after,
- a[href^="#do-repeat-yourself"]:hover:after,
- a[href^="#minimal-user-api"]:hover:after,
- a[href^="#backwards-compatibility"]:hover:after,
- a[href^="#consistent-public-surface"]:hover:after,
- a[href^="#modular-toolbox"]:hover:after,
- a[href^="#source-of-truth"]:hover:before,
- a[href^="#one-model-one-file"]:hover:before,
- a[href^="#code-is-product"]:hover:before,
- a[href^="#standardize-dont-abstract"]:hover:before,
- a[href^="#do-repeat-yourself"]:hover:before,
- a[href^="#minimal-user-api"]:hover:before,
- a[href^="#backwards-compatibility"]:hover:before,
- a[href^="#consistent-public-surface"]:hover:before,
- a[href^="#modular-toolbox"]:hover:before {
+ a[data-tooltip]:hover:after,
+ a[data-tooltip]:hover:before {
  opacity: 1;
  visibility: visible;
  }

@@ -689,6 +648,36 @@ d-article blockquote {
  color: #4a5568;
  }

+ /* Link capsule styling - only for external HTTP(S) links */
+ d-article a[href^="http://"],
+ d-article a[href^="https://"] {
+ background: linear-gradient(135deg, #e3f2fd 0%, #bbdefb 100%);
+ color: #1565c0;
+ text-decoration: none;
+ padding: 0.15em 0.5em;
+ border-radius: 12px;
+ border: 1px solid #90caf9;
+ display: inline-block;
+ transition: all 0.3s ease;
+ font-weight: 500;
+ box-shadow: 0 1px 3px rgba(21, 101, 192, 0.15);
+ }
+
+ d-article a[href^="http://"]:hover,
+ d-article a[href^="https://"]:hover {
+ background: linear-gradient(135deg, #2196f3 0%, #1976d2 100%);
+ color: white;
+ border-color: #1565c0;
+ transform: translateY(-1px);
+ box-shadow: 0 4px 12px rgba(21, 101, 192, 0.3);
+ }
+
+ d-article a[href^="http://"]:active,
+ d-article a[href^="https://"]:active {
+ transform: translateY(0);
+ box-shadow: 0 1px 3px rgba(21, 101, 192, 0.2);
+ }
+
  /* Full width elements */
  d-article .code-compare,
  d-article .interactive-demo,

@@ -713,8 +702,9 @@ d-article .memory-chart-container {
  .tenet-list li.tenet {
  padding: 1rem;
  }
-
+
  .interactive-demo .demo-content {
  padding: 1rem;
  }
- }
+ }
+
webpack.config.js CHANGED
@@ -26,23 +26,24 @@ const loadFragmentsMap = (() => {
  if (fs.statSync(filePath).isDirectory()) {
  await walkDir(filePath, relativePath);
  } else {
- // Remove the .html extension before creating the dotted path
  const nameWithoutExt = relativePath.replace(/\.html$/, '');
  const dottedPath = 'fragment-' + nameWithoutExt.replace(/\\/g, '-').replace(/\//g, '-').replace(/\./g, '-');
  const content = fs.readFileSync(filePath, "utf8");
- // Minify the HTML content using swcMinifyFragment
  let minifiedContent;
- try {
- const minifiedRes = await HtmlMinimizerPlugin.swcMinifyFragment({"tmp.html": content})
- if (minifiedRes.errors) {
- console.warn("HTML minification warnings:", minifiedRes.errors);
- minifiedContent = content; // Use original content if errors
- } else {
- minifiedContent = minifiedRes.code;
+
+ if (content.trim().startsWith('<!DOCTYPE') || content.trim().startsWith('<html')) {
+ minifiedContent = content;
+ } else {
+ try {
+ const minifiedRes = await HtmlMinimizerPlugin.swcMinifyFragment({"tmp.html": content})
+ if (minifiedRes.errors) {
+ minifiedContent = content;
+ } else {
+ minifiedContent = minifiedRes.code;
+ }
+ } catch (error) {
+ minifiedContent = content;
  }
- } catch (error) {
- console.warn(`Failed to minify fragment ${filePath}, using original content:`, error.message);
- minifiedContent = content; // Fallback to original content
  }
  cachedFragments[dottedPath] = minifiedContent;
  }

@@ -94,8 +95,7 @@ module.exports = {
  presets: ["@babel/preset-env"],
  },
  },
- },
- {}
+ }
  ],
  },
  plugins: [

@@ -104,6 +104,7 @@ module.exports = {
  patterns: [
  { from: "src/fragments/*", to: "fragments/[name].html" },
  { from: "src/style.css", to: "style.css" },
+ { from: "src/transformers-custom.css", to: "transformers-custom.css" },
  { from: "content/*.png", to: "static/[name][ext]" },
  { from: "content/*.svg", to: "static/[name][ext]" },
  { from: "content/*.html", to: "static/[name][ext]" },

@@ -150,28 +151,27 @@ module.exports = {

  // Extract tenet text for tooltips
  const tenetTooltips = {
- 'source-of-truth': 'We aim to be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances.',
- 'one-model-one-file': 'All inference (and most of training, loss is separate, not a part of model) logic visible, top‑to‑bottom.',
  'code-is-product': 'Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.',
  'standardize-dont-abstract': 'If it\\'s model behavior, keep it in the file; abstractions only for generic infra.',
  'do-repeat-yourself': 'Copy when it helps users; keep successors in sync without centralizing behavior.',
  'minimal-user-api': 'Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths.',
  'backwards-compatibility': 'Evolve by additive standardization, never break public APIs.',
- 'consistent-public-surface': 'Same argument names, same outputs, hidden states and attentions exposed.',
  };

- // Add smooth scrolling and active state
  const tocLinks = document.querySelectorAll('d-contents a');
  tocLinks.forEach(link => {
  const href = link.getAttribute('href');
  const anchor = href ? href.substring(1) : '';
-
- // Add tooltip if this is a tenet link
  if (tenetTooltips[anchor]) {
- link.setAttribute('title', tenetTooltips[anchor]);
  link.style.position = 'relative';
  }
-
  link.addEventListener('click', function(e) {
  e.preventDefault();
  const target = document.querySelector(this.getAttribute('href'));

@@ -180,6 +180,16 @@ module.exports = {
  }
  });
  });

  // Update active state on scroll
  window.addEventListener('scroll', function() {

@@ -224,7 +234,7 @@ module.exports = {
  initializeSyntaxHighlighting();
  }, 1000);
  </script>`;
-
  // Create full HTML document with distill template
  const template = `<!DOCTYPE html>
  <html>

@@ -238,6 +248,7 @@ module.exports = {
  <meta charset="utf8">
  <title>${appConfig.fullTitle}</title>
  <link rel="stylesheet" href="style.css">
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css">
  </head>
  <body>
 
151
 
152
  // Extract tenet text for tooltips
153
  const tenetTooltips = {
154
+ 'source-of-truth': 'We aim be a source of truth for all model definitions. Model implementations should be reliable, reproducible, and faithful to the original performances.',
155
+ 'one-model-one-file': 'All inference and training core logic has to be visible, top‑to‑bottom, to maximize each model\\'s hackability.',
156
  'code-is-product': 'Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.',
157
  'standardize-dont-abstract': 'If it\\'s model behavior, keep it in the file; abstractions only for generic infra.',
158
  'do-repeat-yourself': 'Copy when it helps users; keep successors in sync without centralizing behavior.',
159
  'minimal-user-api': 'Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths.',
160
  'backwards-compatibility': 'Evolve by additive standardization, never break public APIs.',
161
+ 'consistent-public-surface': 'Same argument names, same outputs, hidden states and attentions exposed, enforced by tests.',
162
  };
163
 
164
+ // Add smooth scrolling and custom tooltips to all tenet links (TOC and article)
165
  const tocLinks = document.querySelectorAll('d-contents a');
166
  tocLinks.forEach(link => {
167
  const href = link.getAttribute('href');
168
  const anchor = href ? href.substring(1) : '';
169
+
 
170
  if (tenetTooltips[anchor]) {
171
+ link.setAttribute('data-tooltip', tenetTooltips[anchor]);
172
  link.style.position = 'relative';
173
  }
174
+
175
  link.addEventListener('click', function(e) {
176
  e.preventDefault();
177
  const target = document.querySelector(this.getAttribute('href'));
 
180
  }
181
  });
182
  });
183
+
184
+ // Add custom tooltips to tenet links in article content
185
+ const articleLinks = document.querySelectorAll('d-article a[href^="#"]');
186
+ articleLinks.forEach(link => {
187
+ const href = link.getAttribute('href');
188
+ const anchor = href ? href.substring(1) : '';
189
+ if (tenetTooltips[anchor]) {
190
+ link.setAttribute('data-tooltip', tenetTooltips[anchor]);
191
+ }
192
+ });
193
 
194
  // Update active state on scroll
195
  window.addEventListener('scroll', function() {
 
234
  initializeSyntaxHighlighting();
235
  }, 1000);
236
  </script>`;
237
+
238
  // Create full HTML document with distill template
239
  const template = `<!DOCTYPE html>
240
  <html>
 
248
  <meta charset="utf8">
249
  <title>${appConfig.fullTitle}</title>
250
  <link rel="stylesheet" href="style.css">
251
+ <link rel="stylesheet" href="transformers-custom.css">
252
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css">
253
  </head>
254
  <body>
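The substantive change in `loadFragmentsMap` above is the guard that passes full HTML documents through untouched and only runs `swcMinifyFragment` on genuine fragments. A minimal standalone sketch of that check (the helper name `isFullHtmlDocument` is hypothetical, introduced here only for illustration):

```js
// Sketch only: the pass-through condition used above, extracted as a helper.
// Fragments that are complete HTML documents are returned unminified;
// anything else remains a candidate for fragment minification.
function isFullHtmlDocument(content) {
  const trimmed = content.trim();
  return trimmed.startsWith('<!DOCTYPE') || trimmed.startsWith('<html');
}

console.log(isFullHtmlDocument('<!DOCTYPE html><html><body></body></html>')); // true
console.log(isFullHtmlDocument('<section>just a fragment</section>'));        // false
```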