Commit 909948e by tfrere (HF Staff)
Parent: 6c9bb4f

update breadcrumbs

Files changed (1)
  1. app/src/content/article.mdx +36 -36
app/src/content/article.mdx CHANGED
@@ -170,11 +170,11 @@ We needed to separate two principles that were so far intertwined, <Tenet term="
 
 What was the solution to this? Let's talk about modular transformers.
 
-<div class="crumbs">
+<Note variant="info">
 <strong>TL;DR:</strong> Read the code in one place, <Tenet term="one-model-one-file" display="one model, one file." position="top" />. Keep semantics local (<a href="#standardize-dont-abstract">Standardize, Don't Abstract</a>). Allow strategic duplication for end users (<a href="#do-repeat-yourself">DRY*</a>). Keep the public surface minimal and stable (<a href="#minimal-user-api">Minimal API</a>, <a href="#backwards-compatibility">Backwards Compatibility</a>, <a href="#consistent-public-surface">Consistent Surface</a>).
 
 <strong>Next:</strong> how modular transformers honor these while removing boilerplate.
-</div>
+</Note>
 
 
 ## <a id="modular"></a> Modular transformers
@@ -355,11 +355,11 @@ More importantly, the auto-generated modeling file is what users _read_ to under
 
 What does that give us?
 
-<div class="crumbs">
+<Note variant="info">
 <strong>TL;DR:</strong> A small <code>modular_*.py</code> declares reuse; the expanded modeling file stays visible and <Tenet term="one-model-one-file" display="unique" position="top"/>. Reviewers and contributors maintain the shard, not the repetition.
 
 <strong>Next:</strong> the measurable effect on effective LOC and maintenance cost.
-</div>
+</Note>
 
 
 ### A maintainable control surface
@@ -386,11 +386,11 @@ The _attention computation_ itself happens at a _lower_ level of abstraction tha
 
 However, we were adding specific torch operations for each backend (sdpa, the several flash-attention iterations, flex attention) but it wasn't a <Tenet term="minimal-user-api" display="minimal user api" position="top" />. Next section explains what we did.
 
-<div class="crumbs">
+<Note variant="info">
 Evidence: effective (i.e., maintainable) LOC growth drops ~15× when counting shards instead of expanded modeling files. Less code to read, fewer places to break.
 
 <strong>Next:</strong> how the attention interface stays standard without hiding semantics.
-</div>
+</Note>
 
 ### <a id="attention-classes"></a> External Attention classes
 
@@ -423,11 +423,11 @@ MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]
 ```
 
 
-<div class="crumbs">
+<Note variant="info">
 Attention semantics remain in <code>eager_attention_forward</code>; faster backends are opt-in via config. We inform via types/annotations rather than enforce rigid kwargs, preserving integrations.
 
 <strong>Next:</strong> parallel partitioning is declared as a plan, not through model surgery.
-</div>
+</Note>
 
 ### <a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism
 
@@ -479,11 +479,11 @@ The `tp_plan` solution allows users to run the same model on a single GPU, or di
 
 Semantics stay in the model (a Linear stays a Linear), parallelization is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks; "rowwise" splits rows; packed variants shard fused weights; The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
 
-<div class="crumbs">
+<Note variant="info">
 Parallelization is specified in the configuration (<code>tp_plan</code>), not through edits to <code>Linear</code>s. Glob patterns target repeated blocks; modeling semantics stay intact.
 
 <strong>Next:</strong> per-layer attention/caching schedules declared in config, not hardcoded.
-</div>
+</Note>
 
 ### <a id="layers-attentions-caches"></a> Layers, attentions and caches
 
@@ -514,11 +514,11 @@ and the configuration can be _explicit_ about which attention type is in which l
 
 This is <Tenet term="minimal-user-api" display="minimal" position="top" /> to implement on the user side, and allows to keep the modeling code untouched. It is also easy to tweak.
 
-<div class="crumbs">
+<Note variant="info">
 Allowed layer types are explicit; schedules (e.g., sliding/full alternation) live in config. This keeps the file readable and easy to tweak.
 
 <strong>Next:</strong> speedups come from kernels that don't change semantics.
-</div>
+</Note>
 
 
 ### <a id="community-kernels"></a>Community Kernels
@@ -535,11 +535,11 @@ This also opens another contribution path: GPU specialists can contribute optimi
 
 Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
 
-<div class="crumbs">
+<Note variant="info">
 Models define semantics; kernels define how to run them faster. Use decorations to borrow community forwards while keeping a consistent public surface.
 
 <strong>Next:</strong> what modularity looks like across the repo.
-</div>
+</Note>
 
 
 ## A Modular State
@@ -587,10 +587,10 @@ Another problem is, this visualization only shows `modular` models. Several mode
 
 Hence the next question, and how do we identify modularisable models?
 
-<div class="crumbs">
+<Note variant="info">
 Llama-lineage is a hub; several VLMs remain islands — engineering opportunity for shared parents.
 <strong>Next:</strong> timeline + similarity signals to spot modularisable candidates.
-</div>
+</Note>
 
 
 ### Many models, but not enough yet, are alike
@@ -628,11 +628,11 @@ Here `roberta`, `xlm_roberta`, `ernie` are `modular`s of BERT, while models like
 <Image src={classicEncoders} alt="Classical encoders" zoomable caption="<strong>Figure 7:</strong> Family of classical encoders centered on BERT, with several models already modularized." />
 
 
-<div class="crumbs">
+<Note variant="info">
 Similarity metrics (Jaccard index or embeddings) surfaces likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., <code>llava_video</code> → <code>llava</code>) for refactors that preserve behavior.
 
 <strong>Next:</strong> concrete VLM choices that avoid leaky abstractions.
-</div>
+</Note>
 
 ### VLM improvements, avoiding abstraction
 
@@ -704,10 +704,10 @@ But this is _within_ the modeling file, not in the `PreTrainedModel` base class.
 
 What do we conclude? Going forward, we should aim for VLMs to have a form of centrality similar to that of `Llama` for text-only models. This centrality should not be achieved at the cost of abstracting and hiding away crucial inner workings of said models.
 
-<div class="crumbs">
+<Note variant="info">
 Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don't migrate behavior to <code>PreTrainedModel</code>.
 <strong>Next:</strong> pipeline-level wins that came from PyTorch-first choices (fast processors).
-</div>
+</Note>
 
 
 ### On image processing and processors
@@ -718,11 +718,11 @@ The gains in performance are immense, up to 20x speedup for most models when usi
 
 <Image src={fastImageProcessors} alt="Fast Image Processors Performance" zoomable caption="<strong>Figure 9:</strong> Performance gains of fast image processors, up to 20x acceleration with compiled torchvision." />
 
-<div class="crumbs">
+<Note variant="info">
 PyTorch-first lets processors assume torch/torchvision and run the whole pipeline on GPU; big per-model speedups.
 
 <strong>Next:</strong> how this lowers friction for contributors and downstream users.
-</div>
+</Note>
 
 
 ## Reduce barrier to entry/contribution
@@ -738,11 +738,11 @@ These additions are immediately available for other models to use.
 Another important advantage is the ability to fine-tune and pipeline these models into many other libraries and tools. Check here on the hub how many finetunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!
 
 
-<div class="crumbs">
+<Note variant="info">
 The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest.
 
 <strong>Next:</strong> power tools enabled by a consistent API.
-</div>
+</Note>
 
 
 ### <a id="encoders-ftw"></a> Models popularity
@@ -759,11 +759,11 @@ In that regard, we DO want to be a modular toolbox, being <Tenet term="minimal-u
 
 So, how do these design choices, these "tenets" influence development of models and overall usage of transformers?
 
-<div class="crumbs">
+<Note variant="info">
 Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS).
 
 <strong>Next:</strong> dev tools that leverage unified attention APIs and PyTorch-only internals.
-</div>
+</Note>
 
 
 ## A surgical toolbox for model development
@@ -780,11 +780,11 @@ One particular piece of machinery is the `attention mask`. Here you see the famo
 
 <HtmlEmbed src="transformers/attention-visualizer.html" frameless />
 
-<div class="crumbs">
+<Note variant="info">
 Uniform attention APIs enable cross-model diagnostics (e.g., PaliGemma prefix bidirectionality vs causal).
 
 <strong>Next:</strong> whole-model tracing for ports and regressions.
-</div>
+</Note>
 
 
 ### Logging entire model activations
@@ -797,11 +797,11 @@ It just works with PyTorch models and is especially useful when aligning outputs
 <Image src={modelDebugger} alt="Model debugger interface" zoomable caption="<strong>Figure 10:</strong> Model debugger interface intercepting calls and logging statistics in nested JSON." />
 </Wide>
 
-<div class="crumbs">
+<Note variant="info">
 Forward interception and nested JSON logging align ports to reference implementations, reinforcing "Source of Truth."
 
 <strong>Next:</strong> CUDA warmup reduces load-time without touching modeling semantics.
-</div>
+</Note>
 
 
 
@@ -815,11 +815,11 @@ Having a clean _external_ API allows us to work on the <Tenet term="code-is-prod
 
 It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, as it's the narrowest bottleneck for your iteration speed.
 
-<div class="crumbs">
+<Note variant="info">
 Pre-allocating GPU memory removes malloc spikes (e.g., 7× for 8B, 6× for 32B in the referenced PR).
 
 <strong>Next:</strong> consistent interfaces allow transformers-serve.
-</div>
+</Note>
 
 
 
@@ -845,11 +845,11 @@ curl -X POST http://localhost:8000/v1/chat/completions \
 
 For model deployment, check [Inference Providers](https://huggingface.co/docs/inference-providers/en/index) or roll your solution using any of the excellent serving libraries.
 
-<div class="crumbs">
+<Note variant="info">
 OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable.
 
 <strong>Next:</strong> reuse across vLLM/SGLang relies on the same consistency.
-</div>
+</Note>
 
 
 ## Community reusability
@@ -866,11 +866,11 @@ Adding a model to transformers means:
 This further cements the need for a <Tenet term="consistent-public-surface" display="consistent public surface" position="top" />: we are a backend and a reference, and there's more software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), check [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132), for instance.
 
 
-<div class="crumbs">
+<Note variant="info">
 Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical.
 
 <strong>Next:</strong> what changes in v5 without breaking the promise of visible semantics.
-</div>
+</Note>
 
 ## What is coming next
 