try integration

dist/index.html  CHANGED  (+69 −58)

@@ -457,69 +457,80 @@ machinery is the <code>attention mask</code>, cause of confusion. Thankfully, we
 457      <li>having it immediately available to the community</li>
 458      <li>usable in vLLM, SGLang, and so on without additional code.</li>
 459      </ul>
 460 -    <p>## Inner cooking:
 461 -    <p>Having a clean <em>external</em> API allows us to work on the true inner workings of transformers. One of the few recent additions was the <em>
 462 -    <
 463 -    <div
 464 -    <
 465 -    <p style="margin: 0; font-size: 0.9em; color: #6c757d;">
 466 -      Compare model loading with and without transformers' caching allocator warmup. This demonstrates the memory efficiency improvements.
 467 -    </p>
 468      </div>
 469 -
 470 -
 471 -    <div style="display: grid; grid-template-columns: 1fr auto; gap: 1rem; align-items: end; margin-bottom: 1.5rem;">
 472 -      <div>
 473 -        <label style="display: block; font-weight: 600; margin-bottom: 0.5rem; color: #374151;">Model to Profile:</label>
 474 -        <select id="memory-model-select" style="width: 100%; padding: 0.5rem; border: 1px solid #d1d5db; border-radius: 6px; background: white;">
 475 -          <option value="openai-community/gpt2">openai-community/gpt2</option>
 476 -          <option value="google/gemma-2-2b">google/gemma-2-2b</option>
 477 -          <option value="microsoft/DialoGPT-small">microsoft/DialoGPT-small</option>
 478 -          <option value="facebook/opt-125m">facebook/opt-125m</option>
 479 -        </select>
 480 -        <div style="font-size: 0.8em; color: #6c757d; margin-top: 0.25rem;">
 481 -          Select a model or enter a custom HuggingFace model ID
 482 -        </div>
 483 -      </div>
 484 -
 485 -      <div>
 486 -        <button id="memory-profile-btn" style="padding: 0.75rem 1.5rem; background: #dc2626; color: white; border: none; border-radius: 6px; cursor: pointer; font-weight: 500;">
 487 -          🔥 Profile Memory
 488 -        </button>
 489 -      </div>
 490 -    </div>
 491 -
 492 -    <div id="memory-chart-container" style="width: 100%; height: 400px; border: 1px solid #e2e8f0; border-radius: 6px; background: #f8f9fa; position: relative;">
 493 -      <div id="memory-placeholder" style="position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); text-align: center; color: #6c757d; font-style: italic;">
 494 -        Click "Profile Memory" to generate memory allocation timeline
 495 -      </div>
 496 -      <canvas id="memory-chart" width="100%" height="400" style="display: none;"></canvas>
 497 -    </div>
 498 -
 499 -    <div id="memory-stats" style="margin-top: 1rem; padding: 1rem; background: #f1f5f9; border-radius: 6px; display: none;">
 500 -      <h5 style="margin: 0 0 0.5rem 0; color: #374151;">Memory Statistics</h5>
 501 -      <div id="memory-results"></div>
 502 -    </div>
 503      </div>
 504 -
 505 -
 506 -    <strong>Note:</strong> This demo requires GPU access. The warmup feature reduces peak memory usage during model loading.
 507 -    In the original app, this uses ZeroGPU to measure actual memory allocation timelines.
 508      </div>
 509      </div>
 510
 511 -
 512 -
 513 -
 514 -
 515 -
 516 -
 517 -
 518 -
 519 -
 520 -
 521 -
 522 -
 523      <h3>Linkedin post (to remove)</h3>
 524      <p>Linkedin post for videos:</p>
 525      <p>In transformers, how do we deal with cross-model dependencies, while supporting ~400 models? Maybe you’ve seen the same 200-line functions in too many <em>modeling_file.py</em>? Duplication isn’t inevitable.</p>
 457      <li>having it immediately available to the community</li>
 458      <li>usable in vLLM, SGLang, and so on without additional code.</li>
 459      </ul>
 460 +    <p>## Inner cooking: CUDA Warmup</p>
 461 +    <p>Having a clean <em>external</em> API allows us to work on the true inner workings of transformers. One of the recent additions was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code>, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.</p>
 462 +    <div class="interactive-demo">
 463 +      <div class="demo-header">
 464 +        <h3>🚀 CUDA Warmup Efficiency Benchmark</h3>
 465      </div>
 466 +    <div class="demo-content">
 467 +      <iframe src="https://molbap-cuda-warmup-transformers.hf.space" width="100%" height="600" frameborder="0" style="border-radius: 8px; background: white;"></iframe>
 468      </div>
 469 +    <div class="demo-footer">
 470 +      Real CUDA warmup benchmarking with actual Transformers models. Measure the performance impact of the <code>caching_allocator_warmup</code> function at <code>transformers/src/transformers/modeling_utils.py:6186</code>. This interactive tool loads each model twice, once with warmup disabled and once with warmup enabled, to demonstrate the significant loading-time improvement.
 471      </div>
 472      </div>
|
 534      <h3>Linkedin post (to remove)</h3>
 535      <p>Linkedin post for videos:</p>
 536      <p>In transformers, how do we deal with cross-model dependencies, while supporting ~400 models? Maybe you’ve seen the same 200-line functions in too many <em>modeling_file.py</em>? Duplication isn’t inevitable.</p>
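The idea behind the caching-allocator warmup added in this diff can be sketched in a few lines. The sketch below is minimal and hypothetical, not transformers' actual implementation: it only shows the size accounting a warmup relies on. Summing the byte size of every tensor in a checkpoint up front lets a loader reserve that memory in one block, so the per-tensor weight copies that follow reuse cached memory instead of triggering one GPU malloc per tensor. The `warmup_bytes` helper, `checkpoint_meta` shape, and `DTYPE_BYTES` table are all illustrative names.

```python
from math import prod

# Illustrative dtype-size table (bytes per element); not a transformers API.
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def warmup_bytes(checkpoint_meta):
    """Total bytes needed for all tensors in a checkpoint.

    checkpoint_meta: mapping of tensor name -> (shape, dtype name).
    A warmup would pre-allocate this many bytes on the target device in a
    single block before the real per-tensor loads begin.
    """
    return sum(prod(shape) * DTYPE_BYTES[dtype]
               for shape, dtype in checkpoint_meta.values())

# Tiny illustrative checkpoint: a 2-layer MLP stored in float16.
meta = {
    "fc1.weight": ((1024, 768), "float16"),
    "fc1.bias":   ((1024,), "float16"),
    "fc2.weight": ((768, 1024), "float16"),
    "fc2.bias":   ((768,), "float16"),
}
print(warmup_bytes(meta))  # 2 * (1024*768 + 1024 + 1024*768 + 768) = 3149312
```

The real warmup then performs the one-shot allocation on the GPU (and releases it, leaving the block in the allocator's cache); the size computation above is the device-independent part of that scheme.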