# Introduction

One million lines of `python` code. Through them, the `transformers` library supports more than 400 model architectures, from state-of-the-art LLMs and VLMs to specialized models for audio, video, and tables.

Built on `PyTorch`, it's a foundational tool for modern LLM usage, research, education, and tens of thousands of other open-source projects. Each AI model is added by the community, harmonized into a consistent interface, and tested daily on a CI to ensure reproducibility.

This scale presents a monumental engineering challenge.

How do you keep such a ship afloat, made of so many moving, unrelated parts, contributed to by a buzzing hivemind, especially as the pace of ML research accelerates? We receive constant feedback on everything from function signatures with hundreds of arguments to duplicated code and optimization concerns, and we listen to all of it, or try to. The library's usage keeps growing, and we are a small team of maintainers and contributors, backed by hundreds of open-source community members. We continue to support all models that come out and will keep doing so for the foreseeable future.

This post dissects the design philosophy that makes this possible. It is a continuation of our older principles, detailed on our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and its accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy). More recently, a blog post about [recent upgrades to transformers](https://huggingface.co/blog/faster-transformers), which I recommend reading if you haven't yet, explained what makes the library faster today. Again, all of that development was only made possible by these principles.

We codify the "tenets" that guide our development, demonstrate how they are implemented in code, and show the measurable impact they have on the library's sustainability and growth.

For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon `transformers`. And not only that: any project of comparable size will require deep choices, not only about design and abstraction, but about the very mindset of the software you are building.
### The core tenets of transformers

We summarize below the foundations on which we've built everything, the "tenets" of the library. They behave like _software interfaces_, so it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.

Note that the library _evolved_ towards these principles: they _emerged_ from decisions taken along the way, and once they emerged, they were recognized as critical.
<div class="tenet-list">
<ol>
<li class="tenet">
<a id="source-of-truth"></a>
<strong>Source of Truth</strong>
<p>We aim to be a source of truth for all model definitions. This is not a tenet per se, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
<em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
</li>

<li class="tenet">
<a id="one-model-one-file"></a>
<strong>One Model, One File</strong>
<p>All inference and training core logic has to be visible, top to bottom, to maximize each model's hackability.</p>
<em>Every model should be completely understandable and hackable by reading a single file from top to bottom.</em>
</li>
<li class="tenet">
<a id="code-is-product"></a>
<a id="minimal-user-api"></a>
<strong>Minimal User API</strong>
<p>Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths. Reading should be obvious, configurations should be obvious.</p>
<em>Keep the public interface simple and predictable; users should know what to expect.</em>
</li>
<li class="tenet">
<a id="backwards-compatibility"></a>
<strong>Backwards Compatibility</strong>
<p>Evolve by additive standardization, never break public APIs.</p>
<p>Any artifact that was once on the Hub and loadable with transformers should be usable indefinitely with the same interface. Further, public methods should not change, to avoid breaking dependencies.</p>
<em>Once something is public, it stays public: evolution through addition, not breaking changes.</em>
</li>
<li class="tenet">
<a id="consistent-public-surface"></a>
<strong>Consistent Public Surface</strong>
<p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. This is a goalpost we keep aiming for.</p>
<em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
</li>
</ol>
</div>

When a PR is merged, it is because the contribution is worthwhile, and because the `transformers` team finds the design of the contribution to be aligned with the tenets above.

Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors and crannies everywhere, built by thousands of different workers. We _try_ to make sure all the code that gets added is compliant, because if we fail and merge it, we cannot change it later without breaking [backwards compatibility](#backwards-compatibility).

For instance, one function essential to the implementation of [Rotary Positional Embeddings](https://huggingface.co/papers/2104.09864) is identical in 70 `modeling_<file>.py` files across `src/transformers/models/`. Why keep it? Because we want all the model logic to be [contained in the modeling file](#one-model-one-file). In order to do that, we [do repeat ourselves](#do-repeat-yourself).

```python
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```

You can use a simple regex to find all methods of a given name across your codebase and compare their differences and similarities; that's what I did (plus a hash to avoid quadratic comparisons).
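
As a minimal sketch of that idea (the paths, the function name, and the use of `ast` rather than a raw regex are illustrative assumptions, not the exact script):

```python
import ast
import hashlib
from collections import defaultdict
from pathlib import Path

def find_copies(root: str, name: str = "rotate_half") -> dict[str, list[str]]:
    """Group every function named `name` under `root` by a hash of its
    normalized source, so identical copies land in the same bucket
    (hashing each definition once avoids quadratic pairwise diffing)."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for path in Path(root).rglob("modeling_*.py"):
        source = path.read_text(encoding="utf-8")
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef) and node.name == name:
                segment = ast.get_source_segment(source, node) or ""
                digest = hashlib.sha256(" ".join(segment.split()).encode()).hexdigest()
                buckets[digest].append(str(path))
    return buckets

# A single large bucket is what "identical in 70 files" looks like in practice:
# print({h[:8]: len(files) for h, files in find_copies("src/transformers/models").items()})
```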

All manual transmission cars have a clutch, but we want each _view_ of one of our cars to be able to function. Remove the clutch and you can't drive. Remove the doors and it might be uncomfortable, but you'll get there. So the doors can go, but you _have_ to keep the clutch, even though you know perfectly well how it works: it is a core functionality.

In the same way, we want all models to have self-contained modeling code.

This comes at a great cost. Enter the `#Copied from...` mechanism: for a long time, these comments indicated that some code was copied from another model, saving time both for the reviewers and for the CI. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet we could not remove them.

We needed to separate two principles that had so far been intertwined: [repetition](#do-repeat-yourself) and [hackability](#one-model-one-file).

What was the solution to this?

## <a id="modular"></a> Modular transformers

Transformers is an opinionated library. The previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the [blog post](https://huggingface.co/blog/transformers-design-philosophy) were already pointing at the drawbacks mentioned just above, and these have been iteratively addressed. [`modular` transformers were introduced](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file).

We amended the principle of [DRY*](#do-repeat-yourself) by progressively removing all pieces of code that were "copied from" another file.

It works as follows: to contribute a model, you define a `modular_` file that can inherit from _any function across all other modeling, configuration and processor files_.
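
As a hedged illustration (the class and file names below are hypothetical, although the imported Llama classes are real), a modular file mostly declares what it reuses and overrides only what differs:

```python
# modular_mymodel.py (hypothetical): define the model by inheriting from an
# existing one; a flat, self-contained modeling_mymodel.py is then auto-generated.
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaDecoderLayer,
    LlamaForCausalLM,
)

class MyModelAttention(LlamaAttention):
    pass  # unchanged: the generator re-emits Llama's attention under this name

class MyModelDecoderLayer(LlamaDecoderLayer):
    pass

class MyModelForCausalLM(LlamaForCausalLM):
    pass  # only the pieces that actually differ would be overridden here
```

The code generator then unrolls these inheritances into the full modeling file that users read and run.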

<summary>Auto-generated modeling code</summary>

{{{fragment-glm-compare}}}

As you can see, we can now define any model as a _modular_ of another.

You might think "well, that's just how inheritance works". The crucial difference is that we do _visibly_ what is essentially the _compiler_'s job: by unrolling the inheritances, we make all of the modeling code visible, keeping it [all in one piece](#one-model-one-file).

What is the consequence? When adding a model, we do not need to go over the entire modeling file. The modular (left side above) is enough.

When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling code (right side) that is run, and all the tests run on the modeling code.

## A maintainable control surface

The effect of modular can be measured straight from git history: at every commit, we look under each model directory.
If it only has a modeling file, we add its LOC count.
However, if a model has a `modular_*.py` and a corresponding automatically generated `modeling_*.py`, we only count the LOC of the modular file. The generated modeling code has no maintenance cost of its own, as it is strictly derived from the modular file.
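
A rough sketch of that counting rule (the paths and glob patterns are assumptions, and the real measurement walks git history commit by commit):

```python
from pathlib import Path

def effective_loc(models_root: str = "src/transformers/models") -> int:
    """Count maintained lines: modular_*.py when present, else modeling_*.py."""
    total = 0
    for model_dir in sorted(Path(models_root).iterdir()):
        if not model_dir.is_dir():
            continue
        maintained = list(model_dir.glob("modular_*.py")) or list(model_dir.glob("modeling_*.py"))
        total += sum(len(f.read_text(encoding="utf-8").splitlines()) for f in maintained)
    return total
```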

That gives an "effective LOC" curve: the **maintenance surface**.

**Just look at the result: the growth rate of lines of code collapsed!** Counting raw `modeling_*.py` (with "Copied from…" everywhere) we were adding around 362 new LOC/day; with `modular` in place, the effective rate is ~25 LOC/day. About **15× lower**! Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.

Less code to hand-maintain means fewer places to break.

Cyclomatic complexity isn't LOC, but they strongly correlate. As Les Hatton notes, defects scale roughly like *k* ln *k*. Lower *k* (lower LOC) helps.

{{{fragment-loc-growth}}}

There's a sharp drop near the end; it's due to us [removing support for Jax and TensorFlow](https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc) library-wide.

Of course, this effort is not the only thing that reduced the maintenance load. Externalising the [attention classes](#external-attention-classes) has moved out a lot of repeated code that was [standard](#standardize-dont-abstract).

## <a id="attention-classes"></a> External Attention classes

## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism

<!-- TODO: add link to external blog post -->

We want to touch the modeling code minimally, and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we instead now specify a simple `tp_plan`.

It is written once in the config and passed to `.from_pretrained()`.

This is [minimal](#minimal-user-api) to implement on the user side, and allows us to keep the modeling untouched. It is also easy to tweak.
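
As a sketch of the user-side API (the checkpoint is only an example, and this would be launched with `torchrun` across several GPUs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # pick up the tensor-parallel plan shipped with the model's config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Tensor parallelism in one argument:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```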

## <a id="community-kernels"></a>Community Kernels

If you've checked out llava, you've seen that llava_video is a red node, connected by a red edge to llava: it's a candidate, something that we can _likely_ remodularize, [not touching the actual model](#backwards-compatibility) but making it much more readable with [DRY*](#do-repeat-yourself).

### VLM improvements, avoiding abstraction

We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attention bridges). This is one of the main areas where we can improve.

But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the self-contained logic of the model.

### <a id="encoders-ftw"></a> Embedding models, now and forever

Model popularity speaks for itself! This is because much of the usage of encoders lies in embeddings. So we have to keep the encoder side of the library viable, usable, and fine-tunable.

As the codebase grows, together with our friend codebase [Sentence Transformers](https://huggingface.co/sentence-transformers), we need to keep maintaining this as well. Retrieval use cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.
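
A minimal sketch of that embedding workflow with a plain transformers encoder (the checkpoint and the mean pooling are illustrative; Sentence Transformers wraps the same idea):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"  # example encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

batch = tokenizer(["transformers keeps encoders alive"], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state          # (batch, seq_len, hidden_dim)

mask = batch["attention_mask"].unsqueeze(-1)             # ignore padding when pooling
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```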

### On image processing and processors

Choosing to be a `torch`-first library meant relieving a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal with the amount of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
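
A hedged sketch of the opt-in (the checkpoint is just an example; `use_fast=True` selects the torchvision-backed processor when one exists for the model):

```python
import torch
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)

# Fast processors accept torch tensors directly and stay on tensors end to end.
image = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)
batch = processor(images=image, return_tensors="pt")
print(batch["pixel_values"].shape)
```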

It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible: it directly affects your iteration speed.

### Transformers-serve and continuous batching

Having all these models readily available allows us to use all of them with transformers-serve, and to interface with them through an OpenAI-like API pattern.
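
A hedged sketch of what that looks like from the client side, assuming a locally running `transformers serve` instance exposing an OpenAI-style chat completions route (the port, route, and model id are placeholders):

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed local serve endpoint
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",  # any chat model from the Hub
        "messages": [{"role": "user", "content": "Say hello from transformers serve."}],
        "max_tokens": 32,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```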

## What is coming next

The next major version of `transformers` is just around the corner. When v5 is released, [backwards compatibility](#backwards-compatibility) will try to stay as solid as possible. The changes we are making now are there to ensure this.

Instead, what we aim to be is much more of a modular toolbox. We are not a framework: you should not be <em>forced</em> to rewrite every modeling file, but it is <em>better</em> for your model to inherit from `PreTrainedModel` and get Tensor Parallelism, `from_pretrained`, sharding, `push_to_hub`, and loss computation for free, as well as compatibility with PEFT/TRL/SGLang/vLLM and other fine-tuning and fast inference options.