Molbap HF Staff committed on
Commit
3a0c25b
·
1 Parent(s): 5643222

biiig push

Files changed (1)
  1. content/article.md +41 -25
content/article.md CHANGED
@@ -96,8 +96,6 @@ When a PR is merged, it is because the contribution is worthwhile, and that the
96
Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors, and crannies everywhere, built by thousands of different workers. We _try_ to make it so all the code added is in line with them, lest we break [backwards compatibility](#backwards-compatibility).
97
 
98
 
99
-
100
-
101
For instance, one function essential to the implementation of [Rotary Positional Embeddings](https://huggingface.co/papers/2104.09864) is identical in 70 `modeling_<file>.py` across `src/transformers/models/`. Why keep it? Because removing it would make those files unloadable checkpoints rather than self-contained blueprints. We [do repeat ourselves](#do-repeat-yourself).
102
 
103
  ```python
@@ -108,16 +106,15 @@ def rotate_half(x):
108
  return torch.cat((-x2, x1), dim=-1)
109
  ```
110
 
111
- You can use a script such as [[top_methods.py]] to look at all methods of a given name across your codebase and look at their differences and similarities, that's what I did (+ a hash to avoid quadraticity).
112
 
113
So... why keep it in all modeling files? Because if we were to remove it, the model would not work anymore. Think of the modeling files as a car (I know, what a novel metaphor! But it works out). All manual transmission cars have a clutch, but we want each _view_ of one of our cars to be able to function. Remove the clutch, and you can't drive. Remove the doors, and it might be uncomfortable, but you'll get there. So the doors can go, but you _have_ to keep the clutch, even though you know perfectly well how it works.
114
 
115
- As I was looking for things to improve and make better, it's one of the iterations I attempted: a function is almost everywhere the same, let's import it from some common file? But no! Goes against
116
 
117
  ## <a id="modular"></a> Going modular
118
 
119
 
120
- However, both of these works were already pointing at some drawbacks, which have been iteratively addressed. [Transformers has gone modular](https://huggingface.co/docs/transformers/en/modular_transformers) , allowing a form of inheritance without breaking [One model, One file](#one-model-one-file). If you're familiar with this, you can [skip this section](#^attention-classes) and go to the next one.
121
 
122
  We amended the principle of [DRY*](#do-repeat-yourself) by removing progressively all pieces of code that were "copied from" another file.
123
 
@@ -127,6 +124,8 @@ It is explained in details in the documentation above, but overall it works like
127
 
128
  {{{fragment-glm-compare}}}
129
 
 
 
130
  ## <a id="attention-classes"></a> External Attention classes
131
 
132
A chronological iteration over [modular](#modular), and a big improvement in terms of readability, was to remove the various backend-specific attention classes across the repository. Before, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention), but it wasn't a [minimal user api](#minimal-user-api).
@@ -143,7 +142,14 @@ We often read and understand that `kwargs` are criticized, and we are typing the
143
 
144
It is a strength of the new attention interface that it can be plugged into various backends, because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system is a [minimal user api](#minimal-user-api).
145
 
146
- For better _information_, we plan to use `python` features such as `Annotated` for example, to inform users of what we expect typically in an argument. That way, higher-level information could be included directly in the type annotations.
 
 
 
 
 
 
 
147
 
148
  ## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism
149
 
@@ -207,9 +213,7 @@ Plus, this opened another angle of contribution for the community. People who ar
207
  ## The good modularity
208
 
209
Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering, if this effort is made, and we're striving for it.
210
-
211
- My capacity for abstraction is not that great, compared to other computer scientists and engineers: I need to look at little doodles and drawings, especially when components pile up.
212
-
213
  So I wanted to take a look at the current **state of modularity** across the repository. How many models are defined using components of others?
214
 
215
  To get this graph, I used the heuristic of modular inheritance.
@@ -220,26 +224,24 @@ To get this graph, I used the heuristic of modular inheritance.
220
  So what do we see? Llama is a basis for many models, and it shows.
221
  Radically different architectures such as mamba have spawned their own dependency subgraph.
222
 
223
- {{{fragment-dependency-graph}}}
224
-
225
 
226
- But there is no similar miracle for VLMs across the board.
227
- As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed.
228
 
 
 
229
 
230
- One problem is, this is only for `modular` models. Several models do NOT have a modular file. In other words, we have a big "hidden space here."
231
 
232
  ## Too many models, yet not enough, are alike
233
 
234
- So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters stringed together, but it is a correct proxy for now. You can check out [[find_dependencies.py]] .
235
 
 
236
  {{{fragment-model-timeline}}}
237
 
238
- {{{fragment-terminal}}}
239
 
240
- ![Jaccard similarity plot showing model relationships](static/Jaccard_similarity_plot.png)
241
 
242
- The yellow areas are places where models are very different to each other. We can see islands here and there corresponding to model families. Llava goes with Llava-onevision, LlavaNext, LlavaNext-video, etc.
243
  ## VLM improvements, avoiding abstraction
244
 
245
We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main improvement points where we can work.
@@ -303,18 +305,29 @@ The following [Pull request to standardize placeholder masking](https://github.c
303
 
304
  But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the self-contained logic of the model.
305
 
306
- ## Modularity candidates
 
 
 
 
 
 
 
307
 
308
- So the question abounds naturally: How can we modularize more?
309
- I took again a similarity measure and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository, and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. It is understandable that [encoder models still have a lion's share of the game.](#encoders-ftw) See also [Tom Aarsen and Arhur Bresnu's great blog post on the topic of sparse embeddings.](https://huggingface.co/blog/train-sparse-encoder).
310
 
311
  {{{fragment-loc-growth}}}
312
 
 
 
313
  ## <a id="encoders-ftw"></a> The neverending stories of encoder models.
314
 
315
- Models popularity speaks for itself! This is because the usage of encoders lies in embeddings obviously. So we have to keep the encoders part viable, usable, fine-tune-able.
316
 
317
  ![Popular models bar plot](static/popular_models_barplot.png)
 
 
 
318
  ## On image processing and processors
319
 
320
Choosing to be a `torch`-first software meant relieving a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more lenient about the amount of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
@@ -325,10 +338,13 @@ The gains in performance are immense, up to 20x speed for most models when compi
325
 
326
  ## Reduce barrier to entry/contribution
327
 
328
- This is an overall objective, no transformers without community.
 
 
 
 
329
 
330
- We didn't want to make a toolbox, old tenet, because _having a framework means forcing users into it_. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
331
- Among the most valuable contributions to `transformers`is of course the addition of new models.
332
 
333
 
334
  ## A surgical toolbox for model development
 
96
Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors, and crannies everywhere, built by thousands of different workers. We _try_ to make it so all the code added is in line with them, lest we break [backwards compatibility](#backwards-compatibility).
97
 
98
 
 
 
99
For instance, one function essential to the implementation of [Rotary Positional Embeddings](https://huggingface.co/papers/2104.09864) is identical in 70 `modeling_<file>.py` across `src/transformers/models/`. Why keep it? Because removing it would make those files unloadable checkpoints rather than self-contained blueprints. We [do repeat ourselves](#do-repeat-yourself).
100
 
101
  ```python
 
106
  return torch.cat((-x2, x1), dim=-1)
107
  ```
108
 
109
+ You can use a simple regex to find all methods of a given name across your codebase and compare their differences and similarities; that's what I did (plus a hash of each body to avoid quadratic comparisons).
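A rough sketch of that scan (the method name and path are just this example's; adapt them to your own repo):

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

# Rough sketch: find every definition of a given method across modeling files
# and bucket identical bodies by hash, so only distinct variants need diffing.
PATTERN = re.compile(r"def rotate_half\(.*?\n(?:\n|    .*\n)+")

buckets = defaultdict(list)
for path in Path("src/transformers/models").rglob("modeling_*.py"):
    for match in PATTERN.finditer(path.read_text()):
        digest = hashlib.sha256(match.group().encode()).hexdigest()
        buckets[digest].append(path)

for digest, paths in buckets.items():
    print(f"{digest[:8]}: {len(paths)} file(s)")
```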
110
 
111
So... why keep it in all modeling files? Because if we were to remove it, the model would not work anymore. Think of the modeling files as a car (I know, what a novel metaphor! But it works out). All manual transmission cars have a clutch, but we want each _view_ of one of our cars to be able to function. Remove the clutch, and you can't drive. Remove the doors, and it might be uncomfortable, but you'll get there. So the doors can go, but you _have_ to keep the clutch, even though you know perfectly well how it works.
112
 
 
113
 
114
  ## <a id="modular"></a> Going modular
115
 
116
 
117
+ It is opinionated, and it can be frustrating when you encounter an opinionated library. Our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the [blog post](https://huggingface.co/blog/transformers-design-philosophy) already pointed at some drawbacks, which have been iteratively addressed. [Transformers has gone modular](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file). If you're familiar with this, you can [skip this section](#^attention-classes) and go to the next one.
118
 
119
  We amended the principle of [DRY*](#do-repeat-yourself) by removing progressively all pieces of code that were "copied from" another file.
120
 
 
124
 
125
  {{{fragment-glm-compare}}}
126
 
127
+ As you can see, we can now define any model as a _modular_ of another. This isn't strictly groundbreaking if you've done any programming; you might even think "well, that's just how inheritance works". The crucial difference is that we do _visibly_ what is essentially the _compiler_'s job: by unrolling the inheritances, we make all of the modeling code visible, keeping it [all in one piece](#one-model-one-file).
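As a small illustration (the model name is hypothetical, but the Llama classes are real ones that a modular file would typically inherit from), a `modular_*.py` can be as thin as:

```python
# Hypothetical modular_mymodel.py: inherit the pieces that are identical to Llama
# and override nothing. The converter "unrolls" this into a full, self-contained
# modeling_mymodel.py, so the one-model-one-file view is preserved.
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaDecoderLayer,
    LlamaModel,
)


class MyModelAttention(LlamaAttention):
    pass


class MyModelDecoderLayer(LlamaDecoderLayer):
    pass


class MyModelModel(LlamaModel):
    pass
```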
128
+
129
  ## <a id="attention-classes"></a> External Attention classes
130
 
131
A chronological iteration over [modular](#modular), and a big improvement in terms of readability, was to remove the various backend-specific attention classes across the repository. Before, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention), but it wasn't a [minimal user api](#minimal-user-api).
 
142
 
143
It is a strength of the new attention interface that it can be plugged into various backends, because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system is a [minimal user api](#minimal-user-api).
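To make this concrete, here is a simplified sketch of what a pluggable backend function can look like (not the library's exact signature): the common tensors are named and typed, while `**kwargs` carries backend-specific extras without enforcing them on every backend.

```python
from typing import Optional

import torch

# Simplified sketch of a pluggable attention backend: the signature INFORMS
# callers of the common arguments, while **kwargs lets backend-specific extras
# (sliding windows, sink tokens, ...) pass through without being ENFORCED.
def my_attention_backend(
    module: torch.nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    scaling: Optional[float] = None,
    **kwargs,
):
    attn_output = torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, scale=scaling
    )
    return attn_output, None
```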
144
 
145
+ For better _information_, we plan to use `python` features such as `Annotated` to inform users of what we typically expect in an argument. That way, higher-level information could be included directly in the type annotations, like so (tentative design):
146
+
147
+ ```python
148
+ from typing import Annotated
149
+
150
+ MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]  # MyModelOutput: placeholder for the model's usual output class
151
+ ```
152
+
153
 
154
  ## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism
155
 
 
213
  ## The good modularity
214
 
215
Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering, if this effort is made, and we're striving for it.
216
+ It's hard to conceptualize very large libraries and how their components interact with each other, regardless of your capacity for abstraction.
 
 
217
  So I wanted to take a look at the current **state of modularity** across the repository. How many models are defined using components of others?
218
 
219
  To get this graph, I used the heuristic of modular inheritance.
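The heuristic itself is simple enough to sketch (the path and import pattern below are approximations of what I ran):

```python
import re
from pathlib import Path

# Sketch of the heuristic: if a model's modular_*.py imports from another
# model's package, draw an edge from that parent model to this one.
edges = []
for modular in Path("src/transformers/models").rglob("modular_*.py"):
    child = modular.parent.name
    for parent in re.findall(r"from\s+(?:transformers\.models\.|\.\.)(\w+)\.", modular.read_text()):
        if parent != child:
            edges.append((parent, child))

print(f"{len(edges)} modular inheritance edges")
```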
 
224
  So what do we see? Llama is a basis for many models, and it shows.
225
  Radically different architectures such as mamba have spawned their own dependency subgraph.
226
 
 
 
227
 
228
+ {{{fragment-dependency-graph}}}
 
229
 
230
+ However, even if llava defines a few VLMs, there are far too many vision-based architectures that are not yet defined as modulars of other existing architectures. In other words, there is no strong software reference point for vision models.
231
+ As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed for llama.
232
 
233
+ Another problem: this only covers `modular` models. Several models do NOT have a modular file.
234
 
235
  ## Too many models, yet not enough, are alike
236
 
237
+ So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters strung together. I also used code embedding models to check code similarity, and they yielded better results, but for the needs of this blog post I will stick to the Jaccard index.
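As a reminder, the Jaccard index of two files is the size of the intersection of their token sets over the size of their union; a minimal sketch:

```python
import re
from pathlib import Path

# Minimal sketch: Jaccard similarity between two modeling files,
# treating each file as a set of identifiers/tokens.
def jaccard(path_a: str, path_b: str) -> float:
    tokens_a = set(re.findall(r"\w+", Path(path_a).read_text()))
    tokens_b = set(re.findall(r"\w+", Path(path_b).read_text()))
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```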
238
 
239
+ It is interesting, for that, to look at _when_ we deployed this modular logic and what its ripple effect on the library was:
240
  {{{fragment-model-timeline}}}
241
 
242
+ If you've checked out llava, you've seen that llava_video is a red node, connected by a red edge to llava: it's a candidate, something that we can _likely_ remodularize, [not touching the actual model](#backwards-compatibility) but being much more readable with [DRY*](#do-repeat-yourself).
243
 
 
244
 
 
245
  ## VLM improvements, avoiding abstraction
246
 
247
We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main improvement points where we can work.
 
305
 
306
  But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the self-contained logic of the model.
307
 
308
+ ## The weight of maintenance
309
+
310
+
311
+ The effect of modular can be measured straight from git history: at every commit I counted LOC (lines of code) under `src/transformers/models`, but if a model has a `modular_*.py`, I count only that file. That gives an "effective LOC" curve: the **maintenance surface**.
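The counting rule for a single snapshot looks roughly like this (a sketch; the per-commit iteration over git history is omitted):

```python
from pathlib import Path

# Sketch of the "effective LOC" rule for one snapshot of the repo:
# if a model ships a modular_*.py, that file is its maintenance surface;
# otherwise, fall back to counting its modeling_*.py.
def effective_loc(models_root: str = "src/transformers/models") -> int:
    total = 0
    for model_dir in Path(models_root).iterdir():
        if not model_dir.is_dir():
            continue
        modular_files = list(model_dir.glob("modular_*.py"))
        counted = modular_files or list(model_dir.glob("modeling_*.py"))
        total += sum(len(f.read_text().splitlines()) for f in counted)
    return total
```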
312
+
313
+ ๐—๐˜‚๐˜€๐˜ ๐—น๐—ผ๐—ผ๐—ธ ๐—ฎ๐˜ ๐˜๐—ต๐—ฒ ๐—ฟ๐—ฒ๐˜€๐˜‚๐—น๐˜: ๐˜๐—ต๐—ฒ ๐—ด๐—ฟ๐—ผ๐˜„๐˜๐—ต ๐—ฟ๐—ฎ๐˜๐—ฒ ๐—ผ๐—ณ ๐—น๐—ถ๐—ป๐—ฒ๐˜€ ๐—ผ๐—ณ ๐—ฐ๐—ผ๐—ฑ๐—ฒ ๐—ฐ๐—ผ๐—น๐—น๐—ฎ๐—ฝ๐˜€๐—ฒ๐—ฑ! Counting raw ๐š–๐š˜๐š๐šŽ๐š•๐š’๐š—๐š_*.๐š™๐šข (with โ€œCopied fromโ€ฆโ€ everywhere) we were around 362 LOC/day; with ๐š–๐š˜๐š๐šž๐š•๐šŠ๐š› in place the effective rate is ~25 LOC/day. About ๐Ÿญ๐Ÿฑร— ๐—น๐—ผ๐˜„๐—ฒ๐—ฟ! Had we continued with a strict "one model, one file" policy who knows where we'd have ended up.
314
+
315
+ Less code to hand-maintain means fewer places to break.
316
 
317
+ Cyclomatic complexity isn't LOC, but they strongly correlate. As Les Hatton notes, defects scale roughly like _d ~ x ln x_. Lower _x_ (lower LOC) helps.
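Back-of-the-envelope, with purely illustrative yearly totals derived from the daily rates above:

```python
import math

# Illustrative only: relative defect count under d ~ x ln x
def defects(loc: float) -> float:
    return loc * math.log(loc)

raw = 362 * 365       # LOC/year at the old modeling_*.py growth rate
modular = 25 * 365    # effective LOC/year with modular in place
print(defects(raw) / defects(modular))  # ~19x fewer predicted defects at these scales
```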
 
318
 
319
  {{{fragment-loc-growth}}}
320
 
321
+ Of course, it is not only this effort that allowed us to reduce the maintenance load. Externalising the [attention classes](#external-attention-classes) moved out a lot of repeated code that was [standard](#standardize-dont-abstract).
322
+
323
  ## <a id="encoders-ftw"></a> The neverending stories of encoder models.
324
 
325
+ Model popularity speaks for itself! This is because the main usage of encoders lies in embeddings. So we have to keep the encoder part viable, usable, fine-tune-able.
326
 
327
  ![Popular models bar plot](static/popular_models_barplot.png)
328
+
329
+ As the codebase grows, so does our friend library [Sentence Transformers](https://huggingface.co/sentence-transformers), which we need to maintain as well. Retrieval use cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.
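A typical minimal retrieval loop looks like this (the model id and corpus are placeholders):

```python
import faiss
from sentence_transformers import SentenceTransformer

# Placeholder model and corpus: encode with sentence-transformers, index with FAISS.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = ["transformers is torch-first", "encoders power embeddings"]

embeddings = model.encode(corpus, normalize_embeddings=True)
index = faiss.IndexFlatIP(int(embeddings.shape[1]))  # inner product == cosine on normalized vectors
index.add(embeddings)

query = model.encode(["what powers embeddings?"], normalize_embeddings=True)
scores, ids = index.search(query, 1)
print(corpus[ids[0][0]], scores[0][0])
```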
330
+
331
  ## On image processing and processors
332
 
333
Choosing to be a `torch`-first software meant relieving a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more lenient about the amount of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
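Opting in is a one-liner (the checkpoint is just an example):

```python
from transformers import AutoImageProcessor

# use_fast=True selects the torchvision-backed "fast" processor when one exists
# for the checkpoint; any vision checkpoint with a fast processor works the same way.
processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)
```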
 
338
 
339
  ## Reduce barrier to entry/contribution
340
 
341
+ This is an overall objective: there's no `transformers` without its community.
342
+
343
+ We didn't want to make a toolbox, because _having a framework means forcing users into it_. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
344
+
345
+ Among the most valuable contributions to `transformers` is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other pieces of software.
346
 
347
+ In that regard, we DO want to be a [modular toolbox](#modular-toolbox), being [minimal](#minimal-user-api) enough (and hopefully well-documented enough) so any ML/AI developer can use transformers without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
 
348
 
349
 
350
  ## A surgical toolbox for model development