attempt
- app.py +1 -1
- content/article.md +7 -7
app.py
CHANGED

@@ -281,7 +281,7 @@ def build_image(filename):
         for directory in ['content', 'static']:
             filepath = Path(directory) / filename
             if filepath.exists():
-                gr.
+                gr.Image(value=str(filepath), show_label=False, interactive=False, show_download_button=False)
                 return
         gr.Markdown(f"*Image not found: {filename}*")
     return _build
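For context, a minimal sketch of how the patched helper plausibly fits together: only the lines shown in the hunk above come from app.py, while the `_build` closure wrapper and the `gr.Blocks` usage are assumptions for illustration.

```python
# Sketch only: the hunk above contributes lines 281-287; the closure structure
# and the gr.Blocks usage around it are assumptions, not the actual app.py.
from pathlib import Path

import gradio as gr

def build_image(filename):
    def _build():
        for directory in ['content', 'static']:
            filepath = Path(directory) / filename
            if filepath.exists():
                # Render the image read-only: no label, no interaction, no download button.
                gr.Image(value=str(filepath), show_label=False, interactive=False, show_download_button=False)
                return
        gr.Markdown(f"*Image not found: {filename}*")
    return _build

# Hypothetical usage: instantiating components inside a Blocks context adds them to the page.
with gr.Blocks() as demo:
    build_image("some_figure.png")()
```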
content/article.md
CHANGED

@@ -262,9 +262,9 @@ To get this graph, I used the heuristic of modular inheritance.
 
 So what do we see? Llama is a basis for many models, and it shows.
 Radically different architectures such as Mamba have spawned their own dependency subgraph.
-
+{{D3_GRAPH}}
 
-
+{{graph_modular_related_models}}
 
 But there is no similar miracle for VLMs across the board.
 As you can see, there is a small DETR island, a little Llava pocket, and so on, but it's not comparable to the centrality observed around Llama.
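The "heuristic of modular inheritance" mentioned in the hunk context can be made concrete with a rough sketch: treat a model as depending on another model when its `modular_*.py` file imports from that model's folder. The paths and the use of `networkx` below are assumptions for illustration, not the article's actual tooling.

```python
# Rough sketch of the modular-inheritance heuristic, not the article's actual tooling:
# a model depends on another model if its modular_*.py imports from that model's folder.
import ast
from pathlib import Path

import networkx as nx

def modular_dependency_graph(models_dir: str) -> nx.DiGraph:
    models_path = Path(models_dir)
    graph = nx.DiGraph()
    for modular_file in models_path.glob("*/modular_*.py"):
        model = modular_file.parent.name
        graph.add_node(model)
        for node in ast.walk(ast.parse(modular_file.read_text())):
            if isinstance(node, ast.ImportFrom) and node.module:
                # e.g. "..llama.modeling_llama" or "transformers.models.llama.modeling_llama"
                for part in node.module.split("."):
                    if part != model and (models_path / part).is_dir():
                        graph.add_edge(model, part)
    return graph

# Hypothetical usage on a local checkout of transformers:
# graph = modular_dependency_graph("src/transformers/models")
```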
@@ -278,7 +278,7 @@ So I looked into Jaccard similarity, which we use to measure set differences. I
 
 {{TERMINAL}}
 
-
+{{Jaccard_similarity_plot}}
 
 The yellow areas are places where models are very different from each other. We can see islands here and there corresponding to model families. Llava goes with Llava-OneVision, LlavaNext, LlavaNext-Video, etc.
 ## VLM improvements, avoiding abstraction
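To make the similarity measure above concrete: compare two modeling files by the set of classes and functions they define, where the Jaccard index is |A ∩ B| / |A ∪ B|. The snippet below is purely illustrative, not the Space's actual implementation.

```python
# Illustrative only, not the Space's implementation: Jaccard similarity between
# the sets of top-level symbols defined in two modeling files.
import ast
from pathlib import Path

def defined_symbols(path: str) -> set:
    tree = ast.parse(Path(path).read_text())
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }

def jaccard(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|: 1.0 for identical sets, 0.0 for disjoint ones.
    return len(a & b) / len(a | b) if (a | b) else 1.0

sim = jaccard(
    defined_symbols("src/transformers/models/llava/modeling_llava.py"),
    defined_symbols("src/transformers/models/llava_next/modeling_llava_next.py"),
)
print(f"Jaccard similarity: {sim:.2f}")
```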
@@ -296,7 +296,7 @@ But this is breaking [Standardize, don't abstract](#standardize-dont-abstract).
 
 This is the current state of abstractions across a modeling file:
 
-
+{{Bloatedness_visualizer}}
 
 The following [Pull request to standardize placeholder masking](https://github.com/huggingface/transformers/pull/39777) is a good example of the kind of changes that are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:
 
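The hunk cuts off before the Qwen2 VL snippet itself. As a generic, hypothetical illustration of the pattern the linked PR standardizes (scattering encoder outputs into the text embedding sequence wherever a placeholder token appears), with a made-up function name and signature:

```python
# Hypothetical sketch of placeholder masking, not the code from the linked PR:
# scatter image-encoder embeddings into the text embeddings at every position
# occupied by the placeholder image token.
import torch

def place_image_embeddings(
    input_ids: torch.LongTensor,        # (batch, seq_len)
    inputs_embeds: torch.FloatTensor,   # (batch, seq_len, hidden)
    image_features: torch.FloatTensor,  # (num_image_tokens, hidden)
    image_token_id: int,
) -> torch.FloatTensor:
    mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    # masked_scatter consumes image_features in order, one row per masked position.
    return inputs_embeds.masked_scatter(mask, image_features.to(inputs_embeds.dtype))
```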
@@ -350,13 +350,13 @@ So the question arises naturally: How can we modularize more?
 I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of modularization candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. Understandably, [encoder models still have the lion's share of usage](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
 
 
-
+{{modular_candidates}}
 
 ## <a id="encoders-ftw"></a> Encoders win!
 
 Model popularity speaks for itself! This is because encoder usage is, obviously, all about embeddings. So we have to keep the encoder side viable, usable, and fine-tunable.
 
-
+{{popular_models_barplot}}
 ## On image processing and processors
 
 Choosing to be a `torch`-first library meant shedding a tremendous support burden for `jax` and `TensorFlow`, and it also meant that we could be more liberal about the number of torch-dependent utilities we added. One of these is the _fast processing_ of images. Whereas images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
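As a rough illustration of why torch-native inputs pay off (this is not transformers' actual fast image processor): resizing and normalization become batched tensor operations, optionally on GPU, instead of per-image ndarray work.

```python
# Rough illustration only, not transformers' actual fast image processor:
# with torch-native inputs, preprocessing runs as batched tensor ops.
import torch
import torch.nn.functional as F

def fast_preprocess(images: torch.Tensor,
                    size=(224, 224),
                    mean=(0.485, 0.456, 0.406),
                    std=(0.229, 0.224, 0.225)) -> torch.Tensor:
    # images: (batch, 3, H, W), uint8 or float, possibly already on GPU
    images = images.float() / 255.0 if images.dtype == torch.uint8 else images.float()
    images = F.interpolate(images, size=size, mode="bilinear", align_corners=False)
    mean_t = torch.tensor(mean, device=images.device).view(1, 3, 1, 1)
    std_t = torch.tensor(std, device=images.device).view(1, 3, 1, 1)
    return (images - mean_t) / std_t

batch = torch.randint(0, 256, (8, 3, 640, 480), dtype=torch.uint8)
pixel_values = fast_preprocess(batch)  # (8, 3, 224, 224), normalized float32
```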
@@ -386,7 +386,7 @@ Because it is all PyTorch (and it is even more now that we support only PyTorch)
 
 It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in keeping with our core guideline, [source of truth for model definitions](#source-of-truth).
 
-
+{{model_debugger}}
 
 ### Transformers-serve
 
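The debugging tool itself is not shown in this hunk; the following is a minimal, hypothetical sketch of the kind of check it supports: running identical inputs through a ported model and a reference implementation and asserting the outputs match.

```python
# Minimal, hypothetical sketch of output alignment, not the tool referenced above:
# run the same inputs through a reference and a candidate PyTorch model and
# check that their logits agree within tolerance.
import torch

@torch.no_grad()
def assert_outputs_aligned(reference, candidate, inputs, atol=1e-5, rtol=1e-5):
    reference.eval()
    candidate.eval()
    ref_logits = reference(**inputs).logits      # assumes transformers-style outputs
    cand_logits = candidate(**inputs).logits
    torch.testing.assert_close(cand_logits, ref_logits, atol=atol, rtol=rtol)
```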