Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors and crannies everywhere, built by thousands of different workers. We _try_ to make sure that all the code we add stays in line with them, lest we break [backwards compatibility](#backwards-compatibility).

For instance, one function essential to the implementation of [Rotary Positional Embeddings](https://huggingface.co/papers/2104.09864) is identical in 70 `modeling_<file>.py` files across `src/transformers/models/`. Why keep it? Because removing it would turn those files into unloadable checkpoints rather than self-contained blueprints. We [do repeat ourselves](#do-repeat-yourself).
```python
import torch

def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```

You can use a simple regex to look at all the methods of a given name across the codebase and compare their differences and similarities; that is what I did, plus a hash to avoid quadratic pairwise comparisons.
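
As a rough illustration of that survey (not the exact script), the sketch below pulls every definition of a given function out of the modeling files, hashes a whitespace-normalized version of its body, and buckets identical copies together. The path and the hard-coded `rotate_half` name are assumptions for the example:

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

# Capture a whole top-level function: the "def" line plus its indented body
# (crude, but good enough for a quick survey).
PATTERN = re.compile(r"^def rotate_half\(.*\):\n(?:[ \t]+.*\n|\n)*", re.MULTILINE)

def normalize(source: str) -> str:
    # Ignore whitespace-only differences so formatting does not split buckets.
    return "\n".join(line.strip() for line in source.splitlines() if line.strip())

buckets = defaultdict(list)  # body hash -> files containing that exact body
for path in Path("src/transformers/models").rglob("modeling_*.py"):
    for match in PATTERN.finditer(path.read_text(encoding="utf-8")):
        digest = hashlib.sha1(normalize(match.group(0)).encode()).hexdigest()
        buckets[digest].append(path)

for digest, files in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
    print(f"{digest[:8]}: {len(files)} identical copies")
```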

So... why keep it in every modeling file? Because if we removed it, the model would stop working. Think of a modeling file as a car (I know, what a novel metaphor! But it works out). All manual-transmission cars have a clutch, and we want each _view_ of one of our cars to function on its own. Remove the clutch and you can't drive; remove the doors and it might be uncomfortable, but you'll get there. So the doors can go, but you _have_ to keep the clutch, even though you know perfectly well how it works.

## <a id="modular"></a> Going modular

The library is opinionated, and it can be frustrating to run into an opinion you don't share. Our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the [design philosophy blog post](https://huggingface.co/blog/transformers-design-philosophy) already pointed at some drawbacks, which have been iteratively addressed. [Transformers has gone modular](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file). If you're already familiar with this, you can [skip this section](#attention-classes) and go to the next one.

We amended the principle of [DRY*](#do-repeat-yourself) by progressively removing all the pieces of code that were "copied from" another file.
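
For context, a duplicated block used to look roughly like the following: the code lives in full in the new model's file, and a marker comment lets automated checks keep the copy in sync with its source. The class names here are invented and the implementation is simplified:

```python
import torch
from torch import nn

# Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->MyModel
class MyModelRMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
```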

{{{fragment-glm-compare}}}

As you can see, we can now define any model as a _modular_ of another. This isn't strictly groundbreaking if you've done any programming; you might even think, "well, that's just how inheritance works". The crucial difference is that we do _visibly_ what is essentially the _compiler_'s job: by unrolling the inheritance, we keep all of the modeling code visible and [all in one piece](#one-model-one-file).
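
Here is a minimal, hypothetical sketch of what a `modular_*.py` file can look like (the model name is invented, and real modular files usually override much more). The contributor writes only the differences against an existing model:

```python
# modular_mymodel.py -- hypothetical example; only the differences are spelled out.
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaForCausalLM,
    LlamaMLP,
)

class MyModelMLP(LlamaMLP):
    pass  # identical to Llama's MLP, nothing to rewrite

class MyModelAttention(LlamaAttention):
    pass  # inherited wholesale; the converter unrolls the full code

class MyModelForCausalLM(LlamaForCausalLM):
    pass
```

A converter script in the repository then expands these stubs into a complete, self-contained `modeling_mymodel.py`, so the generated file that users read and debug still has [everything in one piece](#one-model-one-file).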

## <a id="attention-classes"></a> External Attention classes

A chronological iteration over [modular](#modular), and a big improvement in terms of readability, was to remove the various backend-specific attention classes across the repository. Before, we added specific torch operations for each backend (SDPA, the flash-attention iterations, flex attention), but it wasn't a [minimal user api](#minimal-user-api).

It is a strength of the new attention interface that it can be plugged into various backends, because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system stays a [minimal user api](#minimal-user-api).
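
To make the "inform, don't enforce" idea concrete, here is a rough sketch of the shape such an attention callable takes. The argument names and the eager math below are simplified assumptions, not the canonical interface:

```python
from typing import Optional
import torch

def my_attention_forward(
    module: torch.nn.Module,              # the attention layer calling into the backend
    query: torch.Tensor,                  # (batch, heads, seq, head_dim)
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    scaling: Optional[float] = None,
    **kwargs,                              # backend-specific options travel here: informed, not enforced
):
    scale = scaling if scaling is not None else query.shape[-1] ** -0.5
    attn_weights = (query @ key.transpose(-2, -1)) * scale
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = attn_weights.softmax(dim=-1)
    return attn_weights @ value, attn_weights
```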

For better _information_, we plan to use `python` features such as `Annotated` to tell users what we typically expect in an argument. That way, higher-level information can be included directly in the type annotations, like so (tentative design):

```python
from typing import Annotated

MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]
```

## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism

## The good modularity

Now we have a form of inheritance in our codebase. Some models become standards, and model contributors get the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering, if this effort is made, and we're striving for it.

It's hard to conceptualize very large libraries and how their components interact, whatever your capacity for abstraction.

So I wanted to take a look at the current **state of modularity** across the repository. How many models are defined using components of others?

To get the graph below, I used the heuristic of modular inheritance.
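
Roughly: if a model's `modular_*.py` imports from another model's folder, draw an edge between the two. The sketch below is a simplified stand-in for the real analysis; the path and import handling are assumptions:

```python
import ast
from pathlib import Path

# Build "child model -> parent model" edges from what each modular_*.py imports,
# handling both absolute and relative import styles.
edges = set()
models_root = Path("src/transformers/models")
for modular in models_root.rglob("modular_*.py"):
    child = modular.parent.name
    tree = ast.parse(modular.read_text(encoding="utf-8"))
    for node in ast.walk(tree):
        if not isinstance(node, ast.ImportFrom) or not node.module:
            continue
        if "models." in node.module:        # e.g. transformers.models.llama.modeling_llama
            parent = node.module.split("models.")[1].split(".")[0]
        elif node.level >= 1:               # e.g. from ..llama.modeling_llama import ...
            parent = node.module.split(".")[0]
        else:
            continue
        if parent != child:
            edges.add((child, parent))

print(f"{len(edges)} modular inheritance edges")
```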

So what do we see? Llama is a basis for many models, and it shows. Radically different architectures, such as Mamba, have spawned their own dependency subgraphs.

{{{fragment-dependency-graph}}}

However, even if llava defines a few VLMs, there are far too many vision-based architectures that are not yet defined as modulars of other existing architectures. In other words, there is no strong software reference point for vision models.

As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed for llama.

Another problem: this only covers `modular` models, and several models do NOT have a modular file.

## Too many models, yet not enough, are alike

So I looked into Jaccard similarity, which measures how much two sets overlap. I know that code is more than a set of characters strung together; I also used code-embedding models to check code similarities, and they yielded better results, but for the purposes of this post I will stick to the Jaccard index.
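
For the curious, the flavour of comparison is something like the sketch below, which treats two modeling files as sets of identifiers; the real analysis normalizes the code more carefully, and the file paths are only illustrative:

```python
import re
from pathlib import Path

def jaccard(path_a: str, path_b: str) -> float:
    """Jaccard index between two source files, viewed as sets of identifiers."""
    tokens_a = set(re.findall(r"[A-Za-z_]\w+", Path(path_a).read_text(encoding="utf-8")))
    tokens_b = set(re.findall(r"[A-Za-z_]\w+", Path(path_b).read_text(encoding="utf-8")))
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# e.g. jaccard("models/llava/modeling_llava.py", "models/llava_next/modeling_llava_next.py")
```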

It is interesting to look at _when_ we deployed this modular logic and what its rippling effect on the library was:

{{{fragment-model-timeline}}}

If you've looked at llava, you've seen that llava_video is a red node connected to llava by a red edge: it's a candidate, something we can _likely_ remodularize, [without touching the actual model](#backwards-compatibility) while making it much more readable, in the spirit of [DRY*](#do-repeat-yourself).

## VLM improvements, avoiding abstraction

We don't have a cookbook for common VLM patterns (image-token scatter, multi-tower encoders, cross-attention bridges). This is one of the main areas where we can improve.

But this lives _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move out of it, because that would break the self-contained logic of the model.

## The weight of maintenance

The effect of modular can be measured straight from the git history: at every commit, I counted the LOC (lines of code) under `src/transformers/models`, and whenever a model has a `modular_*.py`, I counted that file instead of the expanded modeling code. This gives an "effective LOC" curve: the **maintenance surface**.
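
A rough sketch of that measurement, assuming a local checkout and leaving the per-commit git plumbing aside:

```python
from pathlib import Path

def effective_loc(repo: str = ".") -> int:
    """Count 'effective' LOC under models/: prefer modular_*.py over modeling_*.py."""
    root = Path(repo) / "src/transformers/models"
    total = 0
    for model_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        modular_files = list(model_dir.glob("modular_*.py"))
        # A modular file, when present, is the surface that is actually maintained;
        # otherwise fall back to the plain modeling file(s).
        files = modular_files or list(model_dir.glob("modeling_*.py"))
        total += sum(len(f.read_text(encoding="utf-8").splitlines()) for f in files)
    return total

# Sweeping the git history is then one `git checkout <sha>` per commit,
# followed by a call to effective_loc(), giving one point per commit.
```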

**Just look at the result: the growth rate of lines of code collapsed!** Counting raw `modeling_*.py` files (with "Copied from..." everywhere) we were adding around 362 LOC/day; with `modular` in place the effective rate is ~25 LOC/day, about **15× lower**. Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.

Less code to hand-maintain means fewer places to break.

Cyclomatic complexity isn't LOC, but the two correlate strongly. As Les Hatton notes, defects scale roughly like N ln N with size, so a lower N (fewer lines of code) helps.

{{{fragment-loc-growth}}}

Of course, this is not the only effort that reduced the maintenance load. Externalising the [attention classes](#attention-classes) moved a lot of repeated code out of the modeling files, code that was already [standard](#standardize-dont-abstract).

## <a id="encoders-ftw"></a> The neverending stories of encoder models

Model popularity speaks for itself! This is because the main usage of encoders lies in embeddings, so we have to keep the encoder side of the library viable, usable, and fine-tunable.

![model popularity](blog_images/popular_models_barplot.png)

As the codebase grows, we also need to maintain our friend codebase, [Sentence Transformers](https://huggingface.co/sentence-transformers). Retrieval use cases and smart databases, such as FAISS-based indexing, rely on it, and thus indirectly on transformers.

## On image processing and processors

Choosing to be a `torch`-first library relieved us of a tremendous amount of support for `jax` and `TensorFlow`, and it also meant we could be more liberal with the torch-dependent utilities we add. One of these is the _fast processing_ of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and working with `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
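
In practice, opting into the fast processors looks roughly like this (the checkpoint is only an example, and availability of a fast processor varies by model):

```python
import torch
from transformers import AutoImageProcessor

# use_fast=True selects the torchvision-backed "fast" image processor when the
# checkpoint provides one; it accepts torch tensors directly and can run on GPU.
processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)

image = torch.randint(0, 256, (3, 512, 512), dtype=torch.uint8)  # stand-in for a real photo
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)
```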

## Reduce barrier to entry/contribution

This is an overall objective: there is no `transformers` without its community.

We didn't want to make a toolbox, because _having a framework means forcing users into it_. It restrains flexibility and creativity, which are the fertile soil where new ideas grow.

Among the most valuable contributions to `transformers` is, of course, the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other pieces of software.

In that regard, we DO want to be a [modular toolbox](#modular-toolbox): [minimal](#minimal-user-api) enough, and hopefully well documented enough, that any ML/AI developer can use transformers without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.

## A surgical toolbox for model development