Molbap (HF Staff) committed on
Commit 997e875 · 1 Parent(s): 34046df
Files changed (2)
  1. app.py +1 -1
  2. content/article.md +7 -7
app.py CHANGED
@@ -281,7 +281,7 @@ def build_image(filename):
         for directory in ['content', 'static']:
             filepath = Path(directory) / filename
             if filepath.exists():
-                gr.File(value=str(filepath), show_label=False, interactive=False)
+                gr.Image(value=str(filepath), show_label=False, interactive=False, show_download_button=False)
                 return
         gr.Markdown(f"*Image not found: {filename}*")
     return _build
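The change swaps the `gr.File` download widget for an inline, non-interactive `gr.Image` with the download button hidden. A minimal sketch of how the patched factory could be mounted in a Blocks layout; the surrounding layout code and the filename below are illustrative, not part of this commit:

```python
import gradio as gr
from pathlib import Path

# Same factory as the patched app.py above: returns a zero-argument builder
# that renders the image inline (or a fallback note) when called inside a layout.
def build_image(filename):
    def _build():
        for directory in ['content', 'static']:
            filepath = Path(directory) / filename
            if filepath.exists():
                gr.Image(value=str(filepath), show_label=False,
                         interactive=False, show_download_button=False)
                return
        gr.Markdown(f"*Image not found: {filename}*")
    return _build

# Illustrative usage only; the real layout in app.py is not shown in this commit.
with gr.Blocks() as demo:
    gr.Markdown("### Example figure")
    build_image("Jaccard_similarity_plot.png")()  # inline image instead of a file download

demo.launch()
```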
content/article.md CHANGED
@@ -262,9 +262,9 @@ To get this graph, I used the heuristic of modular inheritance.
 
 So what do we see? Llama is a basis for many models, and it shows.
 Radically different architectures such as mamba have spawned their own dependency subgraph.
-[code relatedness](d3_dependency_graph.html)
+{{D3_GRAPH}}
 
-![[graph_modular_related_models.png]]
+{{graph_modular_related_models}}
 
 But there is no similar miracle for VLMs across the board.
 As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed.
@@ -278,7 +278,7 @@ So I looked into Jaccard similarity, which we use to measure set differences. I
 
 {{TERMINAL}}
 
-![[Jaccard_similarity_plot.png]]
+{{Jaccard_similarity_plot}}
 
 The yellow areas are places where models are very different to each other. We can see islands here and there corresponding to model families. Llava goes with Llava-onevision, LlavaNext, LlavaNext-video, etc.
 ## VLM improvements, avoiding abstraction
@@ -296,7 +296,7 @@ But this is breaking [Standardize, don't abstract](#standardize-dont-abstract).
 
 This is the current state of abstractions across a modeling file:
 
-![[Bloatedness_visualizer.png]]
+{{Bloatedness_visualizer}}
 
 The following [Pull request to standardize placeholder masking](https://github.com/huggingface/transformers/pull/39777) is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:
 
@@ -350,13 +350,13 @@ So the question abounds naturally: How can we modularize more?
 I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. It is understandable that [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on the topic of sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
 
 
-![[modular_candidates.png]]
+{{modular_candidates}}
 
 ## <a id="encoders-ftw"></a> Encoders win!
 
 Model popularity speaks for itself! This is because the usage of encoders obviously lies in embeddings. So we have to keep the encoders part viable, usable, fine-tune-able.
 
-![[popular_models_barplot.png]]
+{{popular_models_barplot}}
 ## On image processing and processors
 
 Choosing to be a `torch`-first software meant relieving a tremendous amount of support from `jax` and `TensorFlow`, and it also meant that we could be more lenient about the amount of torch-dependent utilities we were able to add. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
@@ -386,7 +386,7 @@ Because it is all PyTorch (and it is even more now that we support only PyTorch)
 
 It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our core guideline, [source of truth for model definitions](#source-of-truth).
 
-![[model_debugger.png]]
+{{model_debugger}}
 
 ### Transformers-serve
 
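On the markdown side, the raw Obsidian-style `![[...]]` embeds and the link to `d3_dependency_graph.html` are replaced with `{{...}}` placeholders, matching tokens the article already uses such as `{{TERMINAL}}`. The resolver is not part of this commit; a minimal sketch of the idea, assuming a mapping from placeholder names to zero-argument builders such as the `build_image(...)` factories above:

```python
import re
from typing import Callable

import gradio as gr

# Matches tokens like {{Jaccard_similarity_plot}} or {{TERMINAL}}.
PLACEHOLDER_RE = re.compile(r"\{\{(\w+)\}\}")

def render_article(md_text: str, builders: dict[str, Callable[[], None]]) -> None:
    """Interleave prose and components: plain markdown goes to gr.Markdown, and
    each {{name}} placeholder is resolved through `builders` (e.g. the
    build_image(...) factories from app.py). Call inside a gr.Blocks() context."""
    cursor = 0
    for match in PLACEHOLDER_RE.finditer(md_text):
        prose = md_text[cursor:match.start()]
        if prose.strip():
            gr.Markdown(prose)
        builder = builders.get(match.group(1))
        if builder is not None:
            builder()  # builds e.g. a gr.Image at this point in the layout
        else:
            gr.Markdown(f"*Unknown placeholder: {match.group(1)}*")
        cursor = match.end()
    tail = md_text[cursor:]
    if tail.strip():
        gr.Markdown(tail)
```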