Molbap HF Staff committed on
Commit e7f22ff · Parent(s): 347ff85
Files changed (1): content/article.md (+6 -10)
content/article.md CHANGED
@@ -279,7 +279,7 @@ But this is _within_ the modeling file, not in the `PreTrainedModel` base class.
So the question arises naturally: How can we modularize more?
I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of modularization candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. Understandably, [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).

- {{fragment-space-embed}}
+ {{fragment-modular-growth}}
 
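For intuition, here is a minimal sketch of the Jaccard flavor of that comparison, treating each modeling file as a set of identifier tokens. The file paths and tokenization are illustrative assumptions, not the Space's exact pipeline.

```python
# Illustrative sketch: Jaccard similarity between two modeling files,
# treating each file as a set of identifier-like tokens.
# Paths and tokenization are assumptions, not the Space's actual code.
import re
from pathlib import Path

def identifier_set(path: str) -> set[str]:
    """Read a source file and return its set of identifier tokens."""
    source = Path(path).read_text(encoding="utf-8")
    return set(re.findall(r"[A-Za-z_]\w*", source))

def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection over union of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

bert = identifier_set("src/transformers/models/bert/modeling_bert.py")
roberta = identifier_set("src/transformers/models/roberta/modeling_roberta.py")
print(f"candidate similarity: {jaccard(bert, roberta):.2f}")
```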
## <a id="encoders-ftw"></a> The neverending stories of encoder models.
 
@@ -323,16 +323,10 @@ Having all these models readily available allows to use all of them with transfo

```bash
# Start serving a model with transformers serve
- transformers serve microsoft/DialoGPT-medium --port 8000
+ transformers serve

  # Query the model using OpenAI-compatible API
- curl -X POST http://localhost:8000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -d "{
- \"model\": \"microsoft/DialoGPT-medium\",
- \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],
- \"max_tokens\": 50
- }"
+ curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'
  ```
  This provides an OpenAI-compatible API with features like continuous batching for better GPU utilization.
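Because the endpoint is OpenAI-compatible, any OpenAI client can talk to it. Here is a minimal sketch using the official `openai` Python client, assuming the server from the snippet above is listening on port 8000 and requires no real API key.

```python
# Minimal sketch: querying a local `transformers serve` endpoint through
# the openai client. The port and api_key handling are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
```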
@@ -342,7 +336,9 @@ This provides an OpenAI-compatible API with features like continuous batching fo
  Adding a model to transformers means:
- having it immediately available to the community
- - usable in vLLM, SGLang, and so on without additional code.
+ - usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures, [as seen in this great blog post](https://blog.vllm.ai/2025/04/11/transformers-backend.html); see the sketch after this list.
+
+ This cements the role of transformers as a source of truth for model definitions.

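As a rough sketch of what that backend unlocks (assuming a recent vLLM release and an illustrative model name), vLLM can be pointed at the transformers implementation explicitly:

```python
# Minimal sketch: running a transformers-defined model through vLLM.
# model_impl="transformers" asks vLLM to load the transformers modeling
# code instead of a native vLLM reimplementation; the model is an example.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")
outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(max_tokens=50, temperature=0.9),
)
print(outputs[0].outputs[0].text)
```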
  ## Cooking faster CUDA warmups
 