Molbap HF Staff committed on
Commit e7f22ff · Parent(s): 347ff85
Files changed (1): content/article.md (+6 -10)
content/article.md CHANGED
@@ -279,7 +279,7 @@ But this is _within_ the modeling file, not in the `PreTrainedModel` base class.
So the question arises naturally: How can we modularize more?
I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of modularization candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. Understandably, [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).

- {{fragment-space-embed}}
+ {{fragment-modular-growth}}
 
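For intuition, here is a minimal sketch of the Jaccard flavor of that comparison, treating each modeling file as a set of identifier tokens. The file paths and tokenization are illustrative assumptions, not the Space's exact pipeline.

```python
# Illustrative sketch: Jaccard similarity between two modeling files,
# treating each file as a set of identifier-like tokens.
# Paths and tokenization are assumptions, not the Space's actual code.
import re
from pathlib import Path

def identifier_set(path: str) -> set[str]:
    """Read a source file and return its set of identifier tokens."""
    source = Path(path).read_text(encoding="utf-8")
    return set(re.findall(r"[A-Za-z_]\w*", source))

def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection over union of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

bert = identifier_set("src/transformers/models/bert/modeling_bert.py")
roberta = identifier_set("src/transformers/models/roberta/modeling_roberta.py")
print(f"candidate similarity: {jaccard(bert, roberta):.2f}")
```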
## <a id="encoders-ftw"></a> The neverending stories of encoder models.
 
@@ -323,16 +323,10 @@ Having all these models readily available allows to use all of them with transfo

```bash
# Start serving a model with transformers serve
- transformers serve microsoft/DialoGPT-medium --port 8000
+ transformers serve

  # Query the model using OpenAI-compatible API
- curl -X POST http://localhost:8000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -d "{
- \"model\": \"microsoft/DialoGPT-medium\",
- \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],
- \"max_tokens\": 50
- }"
+ curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'
  ```
  This provides an OpenAI-compatible API with features like continuous batching for better GPU utilization.
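Because the endpoint is OpenAI-compatible, any OpenAI client can talk to it. Here is a minimal sketch using the official `openai` Python client, assuming the server from the snippet above is listening on port 8000 and requires no real API key.

```python
# Minimal sketch: querying a local `transformers serve` endpoint through
# the openai client. The port and api_key handling are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
```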
@@ -342,7 +336,9 @@ This provides an OpenAI-compatible API with features like continuous batching fo
  Adding a model to transformers means:
- having it immediately available to the community
- - usable in vLLM, SGLang, and so on without additional code.
+ - usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures, [as seen in this great blog post](https://blog.vllm.ai/2025/04/11/transformers-backend.html); see the sketch after this list.
+
+ This cements the role of transformers as a source of truth for model definitions.

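As a rough sketch of what that backend unlocks (assuming a recent vLLM release and an illustrative model name), vLLM can be pointed at the transformers implementation explicitly:

```python
# Minimal sketch: running a transformers-defined model through vLLM.
# model_impl="transformers" asks vLLM to load the transformers modeling
# code instead of a native vLLM reimplementation; the model is an example.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")
outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(max_tokens=50, temperature=0.9),
)
print(outputs[0].outputs[0].text)
```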
  ## Cooking faster CUDA warmups
 