#transformers #huggingface

## Digging through tenets and time


### Introduction

###context The `transformers` library, built with `PyTorch`, supports all state-of-the-art LLMs, many VLMs, task-specific vision models, video models, audio models, table models, and classical encoders, for a global count of almost 400 models. The name of the library itself is mostly majority-driven, as many of these are not even transformer architectures, like Mamba or RWKV. Regardless, each of them was wrought by the research and engineering team that created it, then harmonized into a now-famous interface and made callable with a simple `.from_pretrained`. Inference and training are supported. The library underpins ML courses and cookbooks, and several thousand other open-source libraries depend on it. All models are tested as part of a daily CI ensuring their preservation and reproducibility. Most importantly, it is open-source and has been written in large part by the community.

###tension The ML wave has not stopped: more and more models keep being added. `Transformers` is widely used, and we read the feedback users post, whether it's about a function that had 300+ keyword arguments, duplicated code and helpers, `Copied from ...` mentions everywhere, or optimisation concerns. Text-only models are relatively tame by now, but multimodal models remain to be harmonized.

###scope Here we will dissect the design philosophy of `transformers`, as a continuation of the older [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy). Some time ago (I dare not say how long), we discussed the state of things with the `transformers` maintainers. A lot of recent developments were satisfactory, but if we only talked about those, self-congratulation would be the only goalpost. Reflecting on this philosophy now, as models pile up, is essential and will drive new developments.

###promise Every reader, whether an OSS maintainer, power user, or casual fine-tuner, will walk away knowing how to reason about the `transformers` code base, how to use it better, and how to contribute to it meaningfully.
This will also showcase new features you might have missed, so you'll be up to date.

So, what are the principles of `transformers`? We will try to summarize the foundations on which we've built everything, and write down the "tenets" of the library. They behave like _software interfaces_, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.

- 0. <a id="source-of-truth"></a> Overarching "guideline": we should be a source of truth for all model definitions. This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.

- 1. <a id="one-model-one-file"></a> One model, one file: all inference (and most of training; the loss is separate, not part of the model) logic visible, top‑to‑bottom.
- 2. <a id="code-is-product"></a> Code is the product: optimize for reading, diffing, and tweaking; our users are power users. Variables can be explicit, full words, even several words; readability is paramount.
- 3. <a id="standardize-dont-abstract"></a> Standardize, don't abstract: if it's model behavior, keep it in the file; abstractions only for generic infra.
- 4. ###TOCHANGE <a id="do-repeat-yourself"></a> DRY* (DO Repeat Yourself) via the copy mechanism: copy when it helps users; keep successors in sync without centralizing behavior.
- 4 prime: We amend this tenet. With the introduction and global adoption of [`modular`](#modular) transformers, we do not repeat any logic in the `modular` files, but end-user files remain faithful to the original tenet.
- 5. <a id="minimal-user-api"></a> Minimal user API: config, model, preprocessing; `from_pretrained`, `save_pretrained`, `push_to_hub`. We want the fewest possible code paths. Reading should be obvious, configurations should be obvious.
- 6. <a id="backwards-compatibility"></a> Backwards compatibility first: evolve by additive standardization, **never** break public APIs.
  - Some models show almost no usage, and we have stopped adding new features for non-`torch` frameworks. Still, we keep supporting the models that exist on the Hub.
- 7. ###TOCHANGE <a id="consistent-public-surface"></a> Consistent public surface, enforced by tests: same argument names, same outputs, hidden states and attentions exposed.
- 8. ###TOCHANGE We are not a modular toolbox. Components should be separable and users encouraged to use PyTorch directly for further usage.
  - This is the largest change. We ARE a toolbox. What we are not is a framework: you should not be FORCED to rewrite every modeling file, but it is _better_ for your model to inherit from `PreTrainedModel` and get tensor parallelism, `from_pretrained`, sharding, `push_to_hub`, loss computation, and PEFT/TRL/SGLang/vLLM compatibility in return, as the small sketch below illustrates.
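
To make that last point concrete, here is a minimal, hypothetical sketch (`TinyConfig` and `TinyModel` are made-up names for illustration) of what inheriting from `PreTrainedModel` buys you:

```python
# Minimal, hypothetical sketch: subclassing PreTrainedModel gives you
# save_pretrained / from_pretrained / push_to_hub for free.
import torch
from torch import nn
from transformers import PretrainedConfig, PreTrainedModel


class TinyConfig(PretrainedConfig):
    model_type = "tiny"

    def __init__(self, vocab_size=100, hidden_size=64, **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        super().__init__(**kwargs)


class TinyModel(PreTrainedModel):
    config_class = TinyConfig

    def __init__(self, config: TinyConfig):
        super().__init__(config)
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids: torch.LongTensor):
        return self.lm_head(self.embed(input_ids))


model = TinyModel(TinyConfig())
model.save_pretrained("tiny-model")                # config.json + safetensors weights
reloaded = TinyModel.from_pretrained("tiny-model")
```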

When a PR is merged, it is because the contribution is worthwhile, and because the `transformers` team finds the design of the contribution aligned with the tenets above.

Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors, and crannies everywhere, built by thousands of different workers. We _try_ to make sure all added code is in line with them, lest we break [backwards compatibility](#backwards-compatibility).

For instance, one function essential to the implementation of [Rotary Positional Embeddings](https://huggingface.co/papers/2104.09864) is identical in 70 `modeling_<file>.py` files across `src/transformers/models/`. Why keep it? Because removing it would turn those files into unloadable checkpoints rather than self-contained blueprints. We [do repeat ourselves](#do-repeat-yourself).

```python
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```
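
As a quick illustration of what this helper does (the tensor below is just an example I made up, not something from the library):

```python
import torch

# using rotate_half as defined above
x = torch.tensor([[0., 1., 2., 3.]])   # one position, hidden size 4
print(rotate_half(x))                   # tensor([[-2., -3.,  0.,  1.]])
# the second half of the hidden dims is negated and moved in front of the
# first half, which is the rotation RoPE applies to queries and keys
```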

You can use a script such as [[top_methods.py]] to look at all methods of a given name across the codebase and compare their differences and similarities; that's what I did (plus a hash to avoid quadratic comparisons). A rough sketch of the idea follows.
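
Something along these lines (a simplified stand-in, not the actual [[top_methods.py]]):

```python
# Group every function named `name` under `root` by a hash of its AST,
# so identical copies land in the same bucket without pairwise comparison.
import ast
import hashlib
import pathlib
from collections import defaultdict


def group_methods(root: str, name: str) -> dict[str, list[str]]:
    groups = defaultdict(list)
    for path in pathlib.Path(root).rglob("modeling_*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name == name:
                digest = hashlib.sha1(ast.dump(node).encode()).hexdigest()
                groups[digest].append(str(path))
    return groups


# group_methods("src/transformers/models", "rotate_half") puts ~70 files in a
# single bucket: that many identical copies of the same function.
```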

So... why keep it in every modeling file? Because if we removed it, the model would no longer work. Think of a modeling file as a car (I know, what a novel metaphor! But it works out). All manual-transmission cars have a clutch, and we want each _view_ of one of our cars to be able to function. Remove the clutch and you can't drive; remove the doors and it might be uncomfortable, but you'll get there. So the doors can go, but you _have_ to keep the clutch, even though you know perfectly well how it works.

As I was looking for things to improve, this was one of the iterations I attempted: the function is almost the same everywhere, so let's import it from some common file? But no! That goes against [one model, one file](#one-model-one-file).

## <a id="modular"></a> Going modular

However, both of these works were already pointing at some drawbacks, which have been iteratively addressed. [Transformers has gone modular](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [one model, one file](#one-model-one-file). If you're familiar with this, you can [skip this section](#attention-classes) and go to the next one.

We amended the principle of [DRY*](#do-repeat-yourself) by progressively removing repetition from the files we write, while keeping it in the files users read.
It is explained in detail in the documentation linked above, but overall it works like this: you define a `modular_` file that can inherit from _any function across all other modeling, configuration and processor files_:
```python
class GlmMLP(Phi3MLP):
    pass


class GlmAttention(LlamaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx)
        self.o_proj = nn.Linear(
            config.num_attention_heads * self.head_dim, config.hidden_size, bias=False
        )


class GlmForCausalLM(LlamaForCausalLM):
    pass
```

That will get auto-expanded into the modeling file, which is what actually runs.

In other words, we now WRITE the modular file but READ the modeling file.

<details>
<summary>Auto-generated modeling code</summary>

```python
class GlmMLP(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.config = config
        self.gate_up_proj = nn.Linear(config.hidden_size, 2 * config.intermediate_size, bias=False)
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
        self.activation_fn = ACT2FN[config.hidden_act]

    def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
        up_states = self.gate_up_proj(hidden_states)

        gate, up_states = up_states.chunk(2, dim=-1)
        up_states = up_states * self.activation_fn(gate)

        return self.down_proj(up_states)


class GlmAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config: GlmConfig, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx
        self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
        self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
        self.scaling = self.head_dim**-0.5
        self.attention_dropout = config.attention_dropout
        self.is_causal = True

        self.q_proj = nn.Linear(
            config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
        )
        self.k_proj = nn.Linear(
            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
        )
        self.v_proj = nn.Linear(
            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
        )
        self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim, config.hidden_size, bias=False)

    def forward(
        self,
        hidden_states: torch.Tensor,
        position_embeddings: Tuple[torch.Tensor, torch.Tensor],
        attention_mask: Optional[torch.Tensor],
        past_key_value: Optional[Cache] = None,
        cache_position: Optional[torch.LongTensor] = None,
        **kwargs: Unpack[FlashAttentionKwargs],
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        input_shape = hidden_states.shape[:-1]
        hidden_shape = (*input_shape, -1, self.head_dim)

        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)

        cos, sin = position_embeddings
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

        if past_key_value is not None:
            # sin and cos are specific to RoPE models; cache_position needed for the static cache
            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

        attention_interface: Callable = eager_attention_forward

        if self.config._attn_implementation != "eager":
            if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
                logger.warning_once(
                    "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
                    'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
                )
            else:
                attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            attention_mask,
            dropout=0.0 if not self.training else self.attention_dropout,
            scaling=self.scaling,
            **kwargs,
        )

        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
        attn_output = self.o_proj(attn_output)
        return attn_output, attn_weights


@use_kernel_forward_from_hub("RMSNorm")
class GlmRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        GlmRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)

    def extra_repr(self):
        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"


class GlmRotaryEmbedding(nn.Module):
    def __init__(self, config: GlmConfig, device=None):
        super().__init__()
        # BC: "rope_type" was originally "type"
        if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
            self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
        else:
            self.rope_type = "default"
        self.max_seq_len_cached = config.max_position_embeddings
        self.original_max_seq_len = config.max_position_embeddings

        self.config = config
        self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]

        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.original_inv_freq = self.inv_freq
```
</details>


## <a id="attention-classes"></a> External Attention classes

An iteration that came after [modular](#modular), and a big improvement in terms of readability, was the removal of the various backend-specific attention classes across the repository. Before, we were adding specific torch operations for each backend (sdpa, the flash-attention iterations, flex attention), but it wasn't a [minimal user API](#minimal-user-api).

What will forever stay in the modeling code is `eager_attention_forward`, because it is a core part of the modeling:

```python
attention_interface: Callable = eager_attention_forward
if self.config._attn_implementation != "eager":
    attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
```

We often read that `kwargs` are criticized, and we understand why. We type them whenever we can, but we cannot enforce them everywhere, because other libraries such as vLLM don't use the same kwargs.

This is a strength of the new attention interface: it can be plugged into various backends precisely because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system stays a [minimal user API](#minimal-user-api).
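
This pluggability is easy to exercise: a small, hypothetical backend can be registered under a new name and selected at load time (the `debug_attention` name, its body, and the checkpoint are mine; the registration call assumes the `AttentionInterface` API described in the transformers docs):

```python
# Sketch: register a custom attention backend and select it when loading a model.
import torch
from transformers import AttentionInterface, AutoModelForCausalLM


def debug_attention(module, query, key, value, attention_mask, scaling=None, dropout=0.0, **kwargs):
    # plain eager attention, with room to print/log whatever you need
    num_groups = query.shape[1] // key.shape[1]          # handle GQA by repeating KV heads
    key = key.repeat_interleave(num_groups, dim=1)
    value = value.repeat_interleave(num_groups, dim=1)
    scores = torch.matmul(query, key.transpose(2, 3)) * (scaling if scaling is not None else query.shape[-1] ** -0.5)
    if attention_mask is not None:
        scores = scores + attention_mask[:, :, :, : key.shape[-2]]
    probs = torch.nn.functional.dropout(torch.softmax(scores, dim=-1), p=dropout, training=module.training)
    attn_output = torch.matmul(probs, value).transpose(1, 2).contiguous()
    return attn_output, probs


AttentionInterface.register("debug_attention", debug_attention)
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M", attn_implementation="debug_attention")
```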

For better _information_, we plan to use `python` features such as `Annotated` to inform users of what we typically expect in an argument. That way, higher-level information could be included directly in the signature.
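
Something in this spirit, for instance (a hypothetical annotation, not an existing `transformers` type):

```python
# Hypothetical sketch: Annotated carries human-readable expectations alongside
# the type, without enforcing anything at runtime.
from typing import Annotated, Optional

import torch

AttentionMask = Annotated[
    Optional[torch.Tensor],
    "4D additive mask of shape (batch, 1, query_len, kv_len); None means full attention",
]


def forward(hidden_states: torch.Tensor, attention_mask: AttentionMask = None) -> torch.Tensor:
    ...
```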

## Community Kernels

The same principle extends to normalization, activation, and other hot paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a [consistent public surface](#consistent-public-surface):

```python
@use_kernel_forward_from_hub("RMSNorm")
class GlmRMSNorm(nn.Module):
    ...
```

Plus, this opened another angle of contribution for the community: people who are GPU whisperers can now share optimized kernels directly. You can check the [kernel community blog post](https://huggingface.co/blog/hello-hf-kernels) to learn more about it!

## The good modularity

Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering, if this effort is made, and we're striving for it.

My capacity for abstraction is not that great, compared to other computer scientists and engineers: I need to look at little doodles and drawings, especially when components pile up.

So I wanted to take a look at the current **state of modularity** across the repository. How many models are defined using components of others?

To get the graph below, I used the heuristic of modular inheritance (a minimal sketch follows the list):
1. Does this model have a `modular` file?
2. In this `modular` file, what models, configurations and processors are imported?
3. Recurse through the model list that way.
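
In code, the heuristic can look roughly like this (a simplified stand-in for the real analysis script):

```python
# Sketch: build a graph model -> models whose modeling/configuration/processing
# code its modular_*.py file imports from.
import ast
import pathlib
from collections import defaultdict


def modular_dependencies(models_root: str) -> dict[str, set[str]]:
    graph = defaultdict(set)
    for modular in pathlib.Path(models_root).glob("*/modular_*.py"):
        model = modular.parent.name
        for node in ast.walk(ast.parse(modular.read_text())):
            if isinstance(node, ast.ImportFrom) and node.module:
                parts = node.module.split(".")
                # works for absolute (transformers.models.llama.modeling_llama)
                # and relative (..llama.modeling_llama) imports alike
                for i, part in enumerate(parts):
                    if i > 0 and part.split("_")[0] in {"modeling", "configuration", "processing"}:
                        graph[model].add(parts[i - 1])
    return graph


# modular_dependencies("src/transformers/models")["glm"] might contain {"llama", "phi3"}
```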

So what do we see? Llama is a basis for many models, and it shows.
Radically different architectures such as Mamba have spawned their own dependency subgraphs.
[code relatedness](d3_dependency_graph.html)

![[Pasted image 20250729153809.png]]

But there is no similar miracle for VLMs across the board.
As you can see, there is a small DETR island, a little Llava pocket, and so on, but it's not comparable to the centrality observed around Llama.

One problem: this only covers `modular` models. Several models do NOT have a modular file. In other words, we have a big "hidden space" here.

## Too many models, yet not enough, are alike

So I looked into Jaccard similarity, which we use to measure set overlap. I know that code is more than a set of characters strung together, but it is a decent proxy for now. You can check out [[find_dependencies.py]].
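
As a reminder of the measure, here is a simplified version of what such a script can do (not the actual [[find_dependencies.py]]):

```python
# Jaccard similarity between two modeling files, treating each file as a set of
# code tokens: |intersection| / |union|. Crude, but it surfaces near-duplicates.
import pathlib
import re


def jaccard(path_a: str, path_b: str) -> float:
    tokens_a = set(re.findall(r"\w+", pathlib.Path(path_a).read_text()))
    tokens_b = set(re.findall(r"\w+", pathlib.Path(path_b).read_text()))
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# values close to 1.0 indicate near-identical files, values close to 0.0 unrelated ones
```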

{{TERMINAL}}

![[Pasted image 20250728175655.png]]

The yellow areas are places where models are very different from each other. We can see islands here and there corresponding to model families: Llava goes with Llava-onevision, LlavaNext, LlavaNext-video, and so on.

## VLM improvements, avoiding abstraction

We don't have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attention bridges). This is one of the main areas where we can improve.

So initially I thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into the LLM decoder in 95% of the existing VLMs. It would have looked something like this:

```python
class InputsEmbeddingMixerMixin(nn.Module):
    # hypothetical central helper that merges text, image and video embeddings
    # before they reach the decoder; deliberately never implemented
    ...
```

But this would break [Standardize, don't abstract](#standardize-dont-abstract). The embedding mixing is part of the model; pulling it out would break the model's self-contained logic. A user opening `modeling_qwen2.5_vl` should not have to go to another file to understand how it works.

This is the current state of abstractions across a modeling file:

![[Pasted image 20250728181550.png]]

The following [pull request to standardize placeholder masking](https://github.com/huggingface/transformers/pull/39777) is a good example of the kind of change that is acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it looks like this:

```python
def get_placeholder_mask(
    self,
    input_ids: torch.LongTensor,
    inputs_embeds: torch.FloatTensor,
    image_features: torch.FloatTensor = None,
    video_features: torch.FloatTensor = None,
):
    """
    Obtains multimodal placeholder mask from `input_ids` or `inputs_embeds`, and checks that the placeholder token count is
    equal to the length of multimodal features. If the lengths are different, an error is raised.
    """
    if input_ids is None:
        special_image_mask = inputs_embeds == self.get_input_embeddings()(
            torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
        )
        special_image_mask = special_image_mask.all(-1)
        special_video_mask = inputs_embeds == self.get_input_embeddings()(
            torch.tensor(self.config.video_token_id, dtype=torch.long, device=inputs_embeds.device)
        )
        special_video_mask = special_video_mask.all(-1)
    else:
        special_image_mask = input_ids == self.config.image_token_id
        special_video_mask = input_ids == self.config.video_token_id

    n_image_tokens = special_image_mask.sum()
    special_image_mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
    if image_features is not None and inputs_embeds[special_image_mask].numel() != image_features.numel():
        raise ValueError(
            f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {image_features.shape[0]}"
        )

    n_video_tokens = special_video_mask.sum()
    special_video_mask = special_video_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
    if video_features is not None and inputs_embeds[special_video_mask].numel() != video_features.numel():
        raise ValueError(
            f"Videos features and video tokens do not match: tokens: {n_video_tokens}, features {video_features.shape[0]}"
        )

    return special_image_mask, special_video_mask
```
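
In the forward pass of the model, the returned masks are then typically used to scatter the encoder outputs into the text embeddings, along these lines (a paraphrase of the usual pattern, not a verbatim excerpt):

```python
# Typical usage inside a VLM forward: replace placeholder positions in the text
# embeddings with the projected image/video features.
special_image_mask, special_video_mask = self.get_placeholder_mask(
    input_ids, inputs_embeds, image_features=image_embeds, video_features=video_embeds
)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_embeds)
inputs_embeds = inputs_embeds.masked_scatter(special_video_mask, video_embeds)
```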

But this lives _within_ the modeling file, not in the `PreTrainedModel` base class. It will not be moved out of it, because that would break the self-contained logic of the model.

## Modularity candidates

So the question arises naturally: how can we modularize further?
I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. It is understandable that [encoder models still hold the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on the topic of sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).

![[Pasted image 20250729174627.png]]

## <a id="encoders-ftw"></a> Encoders win!

Model popularity speaks for itself! This is because encoders are, obviously, what you use to get embeddings. So we have to keep the encoder part of the library viable, usable, and fine-tunable.

![[Pasted image 20250728175753.png]]

## On image processing and processors

Choosing to be a `torch`-first library meant relieving a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal in the number of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and working with `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.

The gains in performance are immense: up to a 20x speedup for most models when using compiled torchvision ops.
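
Opting in is a one-liner; the checkpoint name below is only an example:

```python
# Fast (torchvision-backed) image processors are selected with `use_fast=True`.
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)
# inputs can now be torch tensors (even on GPU), not just numpy arrays or PIL images
```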

## Reduce barrier to entry/contribution

This is an overall objective: there is no transformers without its community.

We didn't want to make a toolbox (that was the old tenet) because _having a framework means forcing users into it_. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
Among the most valuable contributions to `transformers` is, of course, the addition of new models.

## A surgical toolbox for model development

### Attention visualisation

If all models share the same internal API for attention computation, we can build cool tools to visualize the inner workings of the attention mechanism. One particular piece of machinery is the attention mask, a frequent cause of confusion. Thankfully, we can fix that.

{{ATTN_VIS}}

Because it is all PyTorch (and even more so now that we support only PyTorch), we can easily debug any model we want to add to transformers. We now have a power-user tool for porting or adding models that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.

It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our core guideline of being a [source of truth for model definitions](#source-of-truth).
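
The underlying idea is plain PyTorch; a stripped-down sketch of the interception trick (not the actual tool) could look like this:

```python
# Minimal sketch: register forward hooks on every submodule and collect
# shapes/dtypes/means of their outputs, keyed by module name.
import json

import torch
from torch import nn


def trace_forward(model: nn.Module, *inputs) -> str:
    log, handles = {}, []

    def make_hook(name):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            if isinstance(out, torch.Tensor):
                log[name] = {
                    "shape": list(out.shape),
                    "dtype": str(out.dtype),
                    "mean": out.float().mean().item(),
                }
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for handle in handles:
            handle.remove()
    return json.dumps(log, indent=2)
```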

![[Pasted image 20250813175317.png]]
### Transformers-serve

Having all these models readily available allows us to serve any of them with `transformers serve`, and to interface with them through an OpenAI-compatible API.

#### add example
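Something along these lines, for instance (a sketch assuming a locally running `transformers serve` instance exposing its OpenAI-compatible chat endpoint; host, port and model name are placeholders to adapt):

```python
# Sketch: query a local `transformers serve` instance through its
# OpenAI-compatible chat completions endpoint (URL, port and model are assumptions).
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```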

## Community reusability

Adding a model to transformers means:
- having it immediately available to the community,
- having it usable in vLLM, SGLang, and so on without additional code.

## Inner cooking: Cache allocator

Having a clean _external_ API allows us to work on the true inner workings of transformers. One of the recent additions is the _cache pre-allocator_, which massively improved the loading footprint.

{{ALLOC_PLOT}}

### Linkedin post (to remove)
Linkedin post for videos:

In transformers, how do we deal with cross-model dependencies while supporting ~400 models? Maybe you've seen the same 200-line function in too many _modeling_file.py_? Duplication isn’t inevitable.

The “one‑model/one‑file” rule keeps every model readable and runnable. It also means identical code is copied hundreds of times. Maintenance hurts, contributor PRs snowball, and vision–language models especially end up in siloed forks.

modular_*.py fixes the trade‑off by auto-generating the modeling file from a modular file, which can use inheritance.

With a small analyser I’ve mapped which models already share modular pieces and which 100‑plus still repeat themselves. Red nodes in the graph = lowest‑hanging fruit for refactor; blue = already modular.

The result: contributors can focus on novel layers instead of boilerplate, reviews shrink from “new file diff” to “does this override make sense?”, and the codebase stays something you can actually open and read.

If you maintain or ship models on top of Transformers, take a look at modular; in 2025 it’s how we keep shipping breadth without the bloat. 🛠️