start fix

content/article.md (+6 -10)

@@ -279,7 +279,7 @@ But this is _within_ the modeling file, not in the `PreTrainedModel` base class.
 So the question naturally arises: how can we modularize more?
 I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. Understandably, [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on the topic of sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
 
-{{fragment-
+{{fragment-modular-growth}}
 
 ## <a id="encoders-ftw"></a> The neverending stories of encoder models.
 
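A note on the hunk above: the "simple" Jaccard path is easy to picture. Below is a minimal sketch of the idea, comparing two modeling files by the overlap of their identifier sets; the file paths and the crude tokenization are illustrative assumptions, not the Space's actual implementation.

```python
import re
from pathlib import Path

def token_set(path: str) -> set[str]:
    # Crude lexer: collect identifier-like tokens from the source file.
    return set(re.findall(r"[A-Za-z_]\w*", Path(path).read_text()))

def jaccard(a: set[str], b: set[str]) -> float:
    # |A & B| / |A | B|: 1.0 means the two files share all their tokens.
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical candidate pair; any two modeling files in the repo work.
sim = jaccard(
    token_set("src/transformers/models/bert/modeling_bert.py"),
    token_set("src/transformers/models/roberta/modeling_roberta.py"),
)
print(f"Jaccard similarity: {sim:.2f}")
```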
@@ -323,16 +323,10 @@ Having all these models readily available allows to use all of them with transfo
 
 ```bash
 # Start serving a model with transformers serve
-transformers serve
+transformers serve
 
 # Query the model using OpenAI-compatible API
-curl -X POST http://localhost:8000/v1/chat/completions
--H "Content-Type: application/json" \
--d "{
-  \"model\": \"microsoft/DialoGPT-medium\",
-  \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],
-  \"max_tokens\": 50
-}"
+curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'
 ```
 
 This provides an OpenAI-compatible API with features like continuous batching for better GPU utilization.
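Because the endpoint is OpenAI-compatible, the curl call in the hunk above can equally be replaced by any OpenAI client. A minimal sketch, assuming `transformers serve` is running on localhost:8000 as in the hunk; the model name is illustrative, and the API key is a placeholder the client library requires.

```python
from openai import OpenAI

# Point a standard OpenAI client at the local `transformers serve` endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
```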
@@ -342,7 +336,9 @@ This provides an OpenAI-compatible API with features like continuous batching fo
 
 Adding a model to transformers means:
 - having it immediately available to the community
-- usable in vLLM, SGLang, and so on without additional code.
+- usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend for running models in vLLM, which optimizes throughput and latency on top of existing transformers architectures, [as seen in this great blog post](https://blog.vllm.ai/2025/04/11/transformers-backend.html).
+
+This cements
 
 ## Cooking faster CUDA warmups
 
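On the vLLM backend added in the hunk above: a minimal sketch of what "without additional code" looks like in practice, based on the linked blog post. The `model_impl="transformers"` argument asks vLLM to run the architecture through its transformers backend; the model name is illustrative.

```python
from vllm import LLM, SamplingParams

# Ask vLLM to serve the model via its transformers backend.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")

outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(max_tokens=50),
)
print(outputs[0].outputs[0].text)
```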