Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors and crannies everywhere, built by thousands of different workers. We _try_ to make sure that all the code we add stays in line with them, lest we break [backwards compatibility](#backwards-compatibility).

For instance, one function essential to the implementation of [Rotary Positional Embeddings](https://huggingface.co/papers/2104.09864) is identical in 70 `modeling_<file>.py` files across `src/transformers/models/`. Why keep it? Because removing it would turn those files into unloadable checkpoints rather than self-contained blueprints. We [do repeat ourselves](#do-repeat-yourself).
```python
import torch

def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```

You can use a simple regex to look at all the methods of a given name across the codebase and compare their differences and similarities; that is what I did, plus a hash to avoid quadratic pairwise comparisons.
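
As a rough illustration of that survey (not the exact script), the sketch below pulls every definition of a given function out of the modeling files, hashes a whitespace-normalized version of its body, and buckets identical copies together. The path and the hard-coded `rotate_half` name are assumptions for the example:

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

# Capture a whole top-level function: the "def" line plus its indented body
# (crude, but good enough for a quick survey).
PATTERN = re.compile(r"^def rotate_half\(.*\):\n(?:[ \t]+.*\n|\n)*", re.MULTILINE)

def normalize(source: str) -> str:
    # Ignore whitespace-only differences so formatting does not split buckets.
    return "\n".join(line.strip() for line in source.splitlines() if line.strip())

buckets = defaultdict(list)  # body hash -> files containing that exact body
for path in Path("src/transformers/models").rglob("modeling_*.py"):
    for match in PATTERN.finditer(path.read_text(encoding="utf-8")):
        digest = hashlib.sha1(normalize(match.group(0)).encode()).hexdigest()
        buckets[digest].append(path)

for digest, files in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
    print(f"{digest[:8]}: {len(files)} identical copies")
```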

So... why keep it in every modeling file? Because if we removed it, the model would stop working. Think of a modeling file as a car (I know, what a novel metaphor! But it works out). All manual-transmission cars have a clutch, and we want each _view_ of one of our cars to function on its own. Remove the clutch and you can't drive; remove the doors and it might be uncomfortable, but you'll get there. So the doors can go, but you _have_ to keep the clutch, even though you know perfectly well how it works.

## <a id="modular"></a> Going modular

The library is opinionated, and it can be frustrating to run into an opinion you don't share. Our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the [design philosophy blog post](https://huggingface.co/blog/transformers-design-philosophy) already pointed at some drawbacks, which have been iteratively addressed. [Transformers has gone modular](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file). If you're already familiar with this, you can [skip this section](#attention-classes) and go to the next one.

We amended the principle of [DRY*](#do-repeat-yourself) by progressively removing all the pieces of code that were "copied from" another file.
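
For context, a duplicated block used to look roughly like the following: the code lives in full in the new model's file, and a marker comment lets automated checks keep the copy in sync with its source. The class names here are invented and the implementation is simplified:

```python
import torch
from torch import nn

# Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->MyModel
class MyModelRMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
```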

{{{fragment-glm-compare}}}

As you can see, we can now define any model as a _modular_ of another. This isn't strictly groundbreaking if you've done any programming; you might even think, "well, that's just how inheritance works". The crucial difference is that we do _visibly_ what is essentially the _compiler_'s job: by unrolling the inheritance, we keep all of the modeling code visible and [all in one piece](#one-model-one-file).
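
Here is a minimal, hypothetical sketch of what a `modular_*.py` file can look like (the model name is invented, and real modular files usually override much more). The contributor writes only the differences against an existing model:

```python
# modular_mymodel.py -- hypothetical example; only the differences are spelled out.
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaForCausalLM,
    LlamaMLP,
)

class MyModelMLP(LlamaMLP):
    pass  # identical to Llama's MLP, nothing to rewrite

class MyModelAttention(LlamaAttention):
    pass  # inherited wholesale; the converter unrolls the full code

class MyModelForCausalLM(LlamaForCausalLM):
    pass
```

A converter script in the repository then expands these stubs into a complete, self-contained `modeling_mymodel.py`, so the generated file that users read and debug still has [everything in one piece](#one-model-one-file).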

## <a id="attention-classes"></a> External Attention classes

A chronological iteration over [modular](#modular), and a big improvement in terms of readability, was to remove the various backend-specific attention classes across the repository. Before, we added specific torch operations for each backend (SDPA, the flash-attention iterations, flex attention), but it wasn't a [minimal user api](#minimal-user-api).

It is a strength of the new attention interface that it can be plugged into various backends, because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system stays a [minimal user api](#minimal-user-api).
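
To make the "inform, don't enforce" idea concrete, here is a rough sketch of the shape such an attention callable takes. The argument names and the eager math below are simplified assumptions, not the canonical interface:

```python
from typing import Optional
import torch

def my_attention_forward(
    module: torch.nn.Module,              # the attention layer calling into the backend
    query: torch.Tensor,                  # (batch, heads, seq, head_dim)
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    scaling: Optional[float] = None,
    **kwargs,                              # backend-specific options travel here: informed, not enforced
):
    scale = scaling if scaling is not None else query.shape[-1] ** -0.5
    attn_weights = (query @ key.transpose(-2, -1)) * scale
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = attn_weights.softmax(dim=-1)
    return attn_weights @ value, attn_weights
```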

For better _information_, we plan to use `python` features such as `Annotated` to tell users what we typically expect in an argument. That way, higher-level information can be included directly in the type annotations, like so (tentative design):

```python
from typing import Annotated

MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]
```

## <a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism

## The good modularity

Now we have a form of inheritance in our codebase. Some models become standards, and model contributors get the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering, if this effort is made, and we're striving for it.

It's hard to conceptualize very large libraries and how their components interact, whatever your capacity for abstraction.

So I wanted to take a look at the current **state of modularity** across the repository. How many models are defined using components of others?

To get the graph below, I used the heuristic of modular inheritance.
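
Roughly: if a model's `modular_*.py` imports from another model's folder, draw an edge between the two. The sketch below is a simplified stand-in for the real analysis; the path and import handling are assumptions:

```python
import ast
from pathlib import Path

# Build "child model -> parent model" edges from what each modular_*.py imports,
# handling both absolute and relative import styles.
edges = set()
models_root = Path("src/transformers/models")
for modular in models_root.rglob("modular_*.py"):
    child = modular.parent.name
    tree = ast.parse(modular.read_text(encoding="utf-8"))
    for node in ast.walk(tree):
        if not isinstance(node, ast.ImportFrom) or not node.module:
            continue
        if "models." in node.module:        # e.g. transformers.models.llama.modeling_llama
            parent = node.module.split("models.")[1].split(".")[0]
        elif node.level >= 1:               # e.g. from ..llama.modeling_llama import ...
            parent = node.module.split(".")[0]
        else:
            continue
        if parent != child:
            edges.add((child, parent))

print(f"{len(edges)} modular inheritance edges")
```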

So what do we see? Llama is a basis for many models, and it shows. Radically different architectures, such as Mamba, have spawned their own dependency subgraphs.

{{{fragment-dependency-graph}}}

However, even if llava defines a few VLMs, there are far too many vision-based architectures that are not yet defined as modulars of other existing architectures. In other words, there is no strong software reference point for vision models.

As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed for llama.

Another problem: this only covers `modular` models, and several models do NOT have a modular file.

## Too many models, yet not enough, are alike

So I looked into Jaccard similarity, which measures how much two sets overlap. I know that code is more than a set of characters strung together; I also used code-embedding models to check code similarities, and they yielded better results, but for the purposes of this post I will stick to the Jaccard index.
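
For the curious, the flavour of comparison is something like the sketch below, which treats two modeling files as sets of identifiers; the real analysis normalizes the code more carefully, and the file paths are only illustrative:

```python
import re
from pathlib import Path

def jaccard(path_a: str, path_b: str) -> float:
    """Jaccard index between two source files, viewed as sets of identifiers."""
    tokens_a = set(re.findall(r"[A-Za-z_]\w+", Path(path_a).read_text(encoding="utf-8")))
    tokens_b = set(re.findall(r"[A-Za-z_]\w+", Path(path_b).read_text(encoding="utf-8")))
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# e.g. jaccard("models/llava/modeling_llava.py", "models/llava_next/modeling_llava_next.py")
```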

It is interesting to look at _when_ we deployed this modular logic and what its rippling effect on the library was:

{{{fragment-model-timeline}}}

If you've looked at llava, you've seen that llava_video is a red node connected to llava by a red edge: it's a candidate, something we can _likely_ remodularize, [without touching the actual model](#backwards-compatibility) while making it much more readable, in the spirit of [DRY*](#do-repeat-yourself).

## VLM improvements, avoiding abstraction

We don't have a cookbook for common VLM patterns (image-token scatter, multi-tower encoders, cross-attention bridges). This is one of the main areas where we can improve.

But this lives _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move out of it, because that would break the self-contained logic of the model.

## The weight of maintenance

The effect of modular can be measured straight from the git history: at every commit, I counted the LOC (lines of code) under `src/transformers/models`, and whenever a model has a `modular_*.py`, I counted that file instead of the expanded modeling code. This gives an "effective LOC" curve: the **maintenance surface**.
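
A rough sketch of that measurement, assuming a local checkout and leaving the per-commit git plumbing aside:

```python
from pathlib import Path

def effective_loc(repo: str = ".") -> int:
    """Count 'effective' LOC under models/: prefer modular_*.py over modeling_*.py."""
    root = Path(repo) / "src/transformers/models"
    total = 0
    for model_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        modular_files = list(model_dir.glob("modular_*.py"))
        # A modular file, when present, is the surface that is actually maintained;
        # otherwise fall back to the plain modeling file(s).
        files = modular_files or list(model_dir.glob("modeling_*.py"))
        total += sum(len(f.read_text(encoding="utf-8").splitlines()) for f in files)
    return total

# Sweeping the git history is then one `git checkout <sha>` per commit,
# followed by a call to effective_loc(), giving one point per commit.
```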

**Just look at the result: the growth rate of lines of code collapsed!** Counting raw `modeling_*.py` files (with "Copied from..." everywhere) we were adding around 362 LOC/day; with `modular` in place the effective rate is ~25 LOC/day, about **15× lower**. Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.

Less code to hand-maintain means fewer places to break.

Cyclomatic complexity isn't LOC, but the two correlate strongly. As Les Hatton notes, defects scale roughly like N ln N with size, so a lower N (fewer lines of code) helps.

{{{fragment-loc-growth}}}

Of course, this is not the only effort that reduced the maintenance load. Externalising the [attention classes](#attention-classes) moved a lot of repeated code out of the modeling files, code that was already [standard](#standardize-dont-abstract).

## <a id="encoders-ftw"></a> The neverending stories of encoder models

Model popularity speaks for itself! This is because the main usage of encoders lies in embeddings, so we have to keep the encoder side of the library viable, usable, and fine-tunable.

![model popularity](blog_images/popular_models_barplot.png)

As the codebase grows, we also need to maintain our friend codebase, [Sentence Transformers](https://huggingface.co/sentence-transformers). Retrieval use cases and smart databases, such as FAISS-based indexing, rely on it, and thus indirectly on transformers.

## On image processing and processors

Choosing to be a `torch`-first library relieved us of a tremendous amount of support for `jax` and `TensorFlow`, and it also meant we could be more liberal with the torch-dependent utilities we add. One of these is the _fast processing_ of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and working with `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
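
In practice, opting into the fast processors looks roughly like this (the checkpoint is only an example, and availability of a fast processor varies by model):

```python
import torch
from transformers import AutoImageProcessor

# use_fast=True selects the torchvision-backed "fast" image processor when the
# checkpoint provides one; it accepts torch tensors directly and can run on GPU.
processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)

image = torch.randint(0, 256, (3, 512, 512), dtype=torch.uint8)  # stand-in for a real photo
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)
```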

## Reduce barrier to entry/contribution

This is an overall objective: there is no `transformers` without its community.

We didn't want to make a toolbox, because _having a framework means forcing users into it_. It restrains flexibility and creativity, which are the fertile soil where new ideas grow.

Among the most valuable contributions to `transformers` is, of course, the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other pieces of software.

In that regard, we DO want to be a [modular toolbox](#modular-toolbox): [minimal](#minimal-user-api) enough, and hopefully well documented enough, that any ML/AI developer can use transformers without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.

## A surgical toolbox for model development