update breadcrumbs

app/src/content/article.mdx (+36 -36)
@@ -170,11 +170,11 @@ We needed to separate two principles that were so far intertwined, <Tenet term="

What was the solution to this? Let's talk about modular transformers.

-<
+<Note variant="info">
<strong>TL;DR:</strong> Read the code in one place, <Tenet term="one-model-one-file" display="one model, one file." position="top" />. Keep semantics local (<a href="#standardize-dont-abstract">Standardize, Don't Abstract</a>). Allow strategic duplication for end users (<a href="#do-repeat-yourself">DRY*</a>). Keep the public surface minimal and stable (<a href="#minimal-user-api">Minimal API</a>, <a href="#backwards-compatibility">Backwards Compatibility</a>, <a href="#consistent-public-surface">Consistent Surface</a>).

<strong>Next:</strong> how modular transformers honor these while removing boilerplate.
-</
+</Note>


## <a id="modular"></a> Modular transformers
@@ -355,11 +355,11 @@ More importantly, the auto-generated modeling file is what users _read_ to under

What does that give us?

-<
+<Note variant="info">
<strong>TL;DR:</strong> A small <code>modular_*.py</code> declares reuse; the expanded modeling file stays visible and <Tenet term="one-model-one-file" display="unique" position="top"/>. Reviewers and contributors maintain the shard, not the repetition.

<strong>Next:</strong> the measurable effect on effective LOC and maintenance cost.
-</
+</Note>


### A maintainable control surface
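For context on the <code>modular_*.py</code> shard this hunk mentions: it declares reuse by subclassing another model's components, and a converter expands it into the full modeling file. A minimal sketch, assuming a hypothetical `mymodel` that reuses Llama parts (the file and class names are illustrative, not from this diff):

```python
# modular_mymodel.py: an illustrative shard, not a real file in the repo.
# Reuse is declared by subclassing another model's classes; the modular
# converter script in transformers then expands this into a full, readable
# modeling_mymodel.py where every class is visible in one place.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaMLP


class MyModelConfig(LlamaConfig):
    model_type = "mymodel"


class MyModelMLP(LlamaMLP):
    pass  # inherited unchanged; the generated file still shows the full definition


class MyModelForCausalLM(LlamaForCausalLM):
    config_class = MyModelConfig
```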
@@ -386,11 +386,11 @@ The _attention computation_ itself happens at a _lower_ level of abstraction tha

However, we were adding specific torch operations for each backend (sdpa, the several flash-attention iterations, flex attention), but it wasn't a <Tenet term="minimal-user-api" display="minimal user api" position="top" />. The next section explains what we did.

-<
+<Note variant="info">
Evidence: effective (i.e., maintainable) LOC growth drops ~15× when counting shards instead of expanded modeling files. Less code to read, fewer places to break.

<strong>Next:</strong> how the attention interface stays standard without hiding semantics.
-</
+</Note>

### <a id="attention-classes"></a> External Attention classes
@@ -423,11 +423,11 @@ MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]
```


-<
+<Note variant="info">
Attention semantics remain in <code>eager_attention_forward</code>; faster backends are opt-in via config. We inform via types/annotations rather than enforce rigid kwargs, preserving integrations.

<strong>Next:</strong> parallel partitioning is declared as a plan, not through model surgery.
-</
+</Note>

### <a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism
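The "opt-in via config" switch mentioned in this hunk is the standard `attn_implementation` argument; a minimal sketch (the checkpoint name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

# The checkpoint is illustrative; any causal LM works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    attn_implementation="sdpa",  # or "eager", "flash_attention_2", ...
    torch_dtype=torch.bfloat16,
)
# Swapping the backend never changes the modeling file: the reference math
# stays in eager_attention_forward, the faster path is a configuration choice.
```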
@@ -479,11 +479,11 @@ The `tp_plan` solution allows users to run the same model on a single GPU, or di

Semantics stay in the model (a Linear stays a Linear); parallelization is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks, "rowwise" splits rows, and packed variants shard fused weights. The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.

-<
+<Note variant="info">
Parallelization is specified in the configuration (<code>tp_plan</code>), not through edits to <code>Linear</code>s. Glob patterns target repeated blocks; modeling semantics stay intact.

<strong>Next:</strong> per-layer attention/caching schedules declared in config, not hardcoded.
-</
+</Note>

### <a id="layers-attentions-caches"></a> Layers, attentions and caches
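A hedged sketch of what such a declarative plan looks like: the strategy strings and glob keys mirror the description above, but this particular mapping is illustrative rather than copied from a real config.

```python
# Illustrative plan: each string names a sharding strategy for the matching
# parameter; glob keys like "layers.*.mlp.down_proj" hit every repeated block.
tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}

# In practice the plan ships with the model; under torchrun, users just ask for it:
# model = AutoModelForCausalLM.from_pretrained(model_id, tp_plan="auto")
```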
@@ -514,11 +514,11 @@ and the configuration can be _explicit_ about which attention type is in which l

This is <Tenet term="minimal-user-api" display="minimal" position="top" /> to implement on the user side, and allows us to keep the modeling code untouched. It is also easy to tweak.

-<
+<Note variant="info">
Allowed layer types are explicit; schedules (e.g., sliding/full alternation) live in config. This keeps the file readable and easy to tweak.

<strong>Next:</strong> speedups come from kernels that don't change semantics.
-</
+</Note>


### <a id="community-kernels"></a>Community Kernels
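As a sketch of the per-layer schedule meant here, assuming the sliding/full alternation described above (the names and layer count are illustrative, not quoted from a specific config):

```python
# Illustrative schedule: five sliding-window layers, then one full-attention
# layer, repeated across 24 layers. The schedule is plain data in the config;
# the modeling code only reads it to pick the right mask and cache per layer.
layer_types = [
    "full_attention" if (i + 1) % 6 == 0 else "sliding_attention"
    for i in range(24)
]
```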
@@ -535,11 +535,11 @@ This also opens another contribution path: GPU specialists can contribute optimi

Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with companion guides to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).

-<
+<Note variant="info">
Models define semantics; kernels define how to run them faster. Use decorators to borrow community forwards while keeping a consistent public surface.

<strong>Next:</strong> what modularity looks like across the repo.
-</
+</Note>


## A Modular State
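The decorator pattern referred to above comes from the `kernels` library; a sketch with a toy normalization layer (the layer itself is illustrative, only the decorator usage follows the library's pattern):

```python
import torch
from kernels import use_kernel_forward_from_hub


@use_kernel_forward_from_hub("RMSNorm")  # borrow a community forward when one is available
class MyRMSNorm(torch.nn.Module):
    """Reference semantics live here; the kernel only swaps in a faster forward."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
```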
@@ -587,10 +587,10 @@ Another problem is, this visualization only shows `modular` models. Several mode

Hence the next question: how do we identify modularisable models?

-<
+<Note variant="info">
Llama-lineage is a hub; several VLMs remain islands, an engineering opportunity for shared parents.
<strong>Next:</strong> timeline + similarity signals to spot modularisable candidates.
-</
+</Note>


### Many models, but not enough yet, are alike
@@ -628,11 +628,11 @@ Here `roberta`, `xlm_roberta`, `ernie` are `modular`s of BERT, while models like
<Image src={classicEncoders} alt="Classical encoders" zoomable caption="<strong>Figure 7:</strong> Family of classical encoders centered on BERT, with several models already modularized." />


-<
+<Note variant="info">
Similarity metrics (Jaccard index or embeddings) surface likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., <code>llava_video</code> → <code>llava</code>) for refactors that preserve behavior.

<strong>Next:</strong> concrete VLM choices that avoid leaky abstractions.
-</
+</Note>

### VLM improvements, avoiding abstraction
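The Jaccard signal mentioned in this hunk is nothing exotic; a toy version over the sets of symbols defined in two modeling files could look like this (a sketch, not the actual tooling used for the figures):

```python
import ast


def defined_symbols(path: str) -> set[str]:
    """Top-level class and function names defined in a modeling file."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.ClassDef, ast.FunctionDef))
    }


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


# A high score between two modeling files flags a candidate parent/child pair
# for a behavior-preserving modular refactor, e.g. llava_video -> llava.
```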
@@ -704,10 +704,10 @@ But this is _within_ the modeling file, not in the `PreTrainedModel` base class.

What do we conclude? Going forward, we should aim for VLMs to have a form of centrality similar to that of `Llama` for text-only models. This centrality should not be achieved at the cost of abstracting and hiding away crucial inner workings of said models.

-<
+<Note variant="info">
Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don't migrate behavior to <code>PreTrainedModel</code>.
<strong>Next:</strong> pipeline-level wins that came from PyTorch-first choices (fast processors).
-</
+</Note>


### On image processing and processors
@@ -718,11 +718,11 @@ The gains in performance are immense, up to 20x speedup for most models when usi

<Image src={fastImageProcessors} alt="Fast Image Processors Performance" zoomable caption="<strong>Figure 9:</strong> Performance gains of fast image processors, up to 20x acceleration with compiled torchvision." />

-<
+<Note variant="info">
PyTorch-first lets processors assume torch/torchvision and run the whole pipeline on GPU; big per-model speedups.

<strong>Next:</strong> how this lowers friction for contributors and downstream users.
-</
+</Note>


## Reduce barrier to entry/contribution
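Concretely, opting into the fast (torchvision-backed) processors is a small change on the user side; a sketch, where the checkpoint is illustrative and GPU preprocessing via a `device` argument depends on the processor and library version:

```python
from PIL import Image
from transformers import AutoImageProcessor

image = Image.open("example.jpg")  # illustrative input

# use_fast=True selects the torchvision-backed processor class when one exists.
processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)

# Fast processors work on torch tensors end to end; where supported, the
# device argument moves the whole preprocessing pipeline onto the GPU.
inputs = processor(images=image, return_tensors="pt", device="cuda")
```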
@@ -738,11 +738,11 @@ These additions are immediately available for other models to use.
Another important advantage is the ability to fine-tune and pipeline these models into many other libraries and tools. Check here on the hub how many finetunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!


-<
+<Note variant="info">
The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest.

<strong>Next:</strong> power tools enabled by a consistent API.
-</
+</Note>


### <a id="encoders-ftw"></a> Models popularity
@@ -759,11 +759,11 @@ In that regard, we DO want to be a modular toolbox, being <Tenet term="minimal-u

So, how do these design choices, these "tenets", influence development of models and overall usage of transformers?

-<
+<Note variant="info">
Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS).

<strong>Next:</strong> dev tools that leverage unified attention APIs and PyTorch-only internals.
-</
+</Note>


## A surgical toolbox for model development
@@ -780,11 +780,11 @@ One particular piece of machinery is the `attention mask`. Here you see the famo

<HtmlEmbed src="transformers/attention-visualizer.html" frameless />

-<
+<Note variant="info">
Uniform attention APIs enable cross-model diagnostics (e.g., PaliGemma prefix bidirectionality vs causal).

<strong>Next:</strong> whole-model tracing for ports and regressions.
-</
+</Note>


### Logging entire model activations
@@ -797,11 +797,11 @@ It just works with PyTorch models and is especially useful when aligning outputs
<Image src={modelDebugger} alt="Model debugger interface" zoomable caption="<strong>Figure 10:</strong> Model debugger interface intercepting calls and logging statistics in nested JSON." />
</Wide>

-<
+<Note variant="info">
Forward interception and nested JSON logging align ports to reference implementations, reinforcing "Source of Truth."

<strong>Next:</strong> CUDA warmup reduces load-time without touching modeling semantics.
-</
+</Note>

@@ -815,11 +815,11 @@ Having a clean _external_ API allows us to work on the <Tenet term="code-is-prod

It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, as it's the narrowest bottleneck for your iteration speed.

-<
+<Note variant="info">
Pre-allocating GPU memory removes malloc spikes (e.g., 7× for 8B, 6× for 32B in the referenced PR).

<strong>Next:</strong> consistent interfaces enable transformers serve.
-</
+</Note>

@@ -845,11 +845,11 @@ curl -X POST http://localhost:8000/v1/chat/completions \

For model deployment, check [Inference Providers](https://huggingface.co/docs/inference-providers/en/index) or roll your own solution using any of the excellent serving libraries.

-<
+<Note variant="info">
OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable.

<strong>Next:</strong> reuse across vLLM/SGLang relies on the same consistency.
-</
+</Note>


## Community reusability
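Because the surface is OpenAI-compatible, any standard client can talk to a locally served model; a sketch where the port mirrors the curl example above and the checkpoint name is a placeholder:

```python
from openai import OpenAI

# `transformers serve` exposes an OpenAI-compatible endpoint on localhost;
# the API key is unused but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder checkpoint
    messages=[{"role": "user", "content": "Say hello from a locally served model."}],
)
print(response.choices[0].message.content)
```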
@@ -866,11 +866,11 @@ Adding a model to transformers means:
This further cements the need for a <Tenet term="consistent-public-surface" display="consistent public surface" position="top" />: we are a backend and a reference, and there's more software than us to handle serving. At the time of writing, more effort is going into that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), check [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132), for instance.


-<
+<Note variant="info">
Being a backend that other stacks consume requires a consistent public surface; modular shards and configs make that stability practical.

<strong>Next:</strong> what changes in v5 without breaking the promise of visible semantics.
-</
+</Note>

## What is coming next