Color
- index.html +5 -5
- llm_conf.qmd +5 -5
index.html
@@ -426,7 +426,7 @@
 <li>Backward ~= 2x the model size</li>
 <li>The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer):</li>
 </ul>
-<div style="font-size: 50%;background-color: rgba(0,0,0,.1);">
+<div style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
 <table>
 <thead>
 <tr class="header">
@@ -465,7 +465,7 @@
 <p>This works fine for small models, we have cards with anywhere from 12-24GB of GPU memory (on the GPU-poor side).</p>
 <p>But what happens as we scale?</p>
 <p>Here’s <code>llama-3-8B</code> (8.03B parameters)</p>
-<div style="font-size: 50%;background-color: rgba(0,0,0,.1);">
+<div style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
 <table>
 <thead>
 <tr class="header">
@@ -698,7 +698,7 @@
 <li>Rely on <code>config.yaml</code> files</li>
 <li>Choose to either running <code>accelerate config</code> or write your own:</li>
 </ul>
-<div class="columns" style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);">
+<div class="columns" style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
 <div class="column" style="width:40%;">
 <div class="code-with-filename">
 <div class="code-with-filename-file">
@@ -804,7 +804,7 @@
 <ul>
 <li>Let’s tie that back up to the model estimator with neat tools like NVIDIA’s TransformerEngine</li>
 </ul>
-<div style="font-size: 60%;background-color: rgba(0,0,0,.1);">
+<div style="font-size: 60%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
 <table style="width:100%;">
 <colgroup>
 <col style="width: 14%">
@@ -894,7 +894,7 @@
 <ul>
 <li>Extremely similar, however mostly used different naming conventions for items and slight tweaks in the implementation</li>
 </ul>
-<div style="font-size: 50%;background-color: rgba(0,0,0,.1);">
+<div style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
 <table style="width:100%;">
 <colgroup>
 <col style="width: 16%">
llm_conf.qmd
@@ -28,7 +28,7 @@ General estimate (`bert-base-cased`, 108M params):
 - Backward ~= 2x the model size
 - The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer):
 
-::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);"}
+::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
 | dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
 |---------|:-----|:------:|:------:|:------:|:------:|
 | float32 | 413.18 MB | 413.18 MB | 826.36 MB | 1.61 GB | 1.61 GB |
@@ -45,7 +45,7 @@ But what happens as we scale?
 
 Here's `llama-3-8B` (8.03B parameters)
 
-::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);"}
+::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
 | dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
 |---------|:-----|:------:|:------:|:------:|:------:|
 | float32 | 28.21 GB | 28.21 GB | 56.43 GB | 112.84 GB | 112.84 GB |
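The numbers in both tables above are straight multiples of the parameter count; a minimal sketch of the arithmetic, assuming 4 bytes per float32 parameter and binary MB/GB (the helper name and the exact parameter count are illustrative, not the actual estimator):

```python
# Illustrative sketch of the 1x/2x/4x rule from the tables above
# (assumes 4 bytes per float32 parameter; MB/GB are binary units).
def training_memory(n_params, bytes_per_param=4):
    model = n_params * bytes_per_param   # weights: 1x
    backward = 2 * model                 # backward pass ~= 2x
    optimizer_step = 4 * model           # 1x model + 1x grads + 2x Adam states
    return model, backward, optimizer_step

model, backward, step = training_memory(108_310_000)  # bert-base-cased, ~108M params
print(f"{model / 2**20:.2f} MB")  # ~413 MB
print(f"{step / 2**30:.2f} GB")   # ~1.61 GB
```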
@@ -202,7 +202,7 @@ accelerate launch script.py
 * Rely on `config.yaml` files
 * Choose to either running `accelerate config` or write your own:
 
-:::: {.columns style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);"}
+:::: {.columns style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
 ::: {.column width="40%"}
 ```{.yaml filename=ddp_config.yaml}
 compute_environment: LOCAL_MACHINE
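A hand-written config like the `ddp_config.yaml` shown in this hunk can then be passed to the launcher explicitly, e.g. `accelerate launch --config_file ddp_config.yaml script.py`.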
@@ -302,7 +302,7 @@ for batch in dataloader:
 
 * Let's tie that back up to the model estimator with neat tools like NVIDIA's TransformerEngine
 
-::: {style="font-size: 60%;background-color: rgba(0,0,0,.1);"}
+::: {style="font-size: 60%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
 | Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States |
 | -- | -- | -- | -- | -- | -- | -- |
 | FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
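The "FP16 AMP" row in that table corresponds to stock PyTorch autocast rather than anything TransformerEngine-specific; a minimal sketch in plain PyTorch (not the TransformerEngine API):

```python
# Plain-PyTorch sketch of the "FP16 AMP" row: FP32 weights, gradients, and
# optimizer states; only the computation inside autocast runs in FP16.
import torch

model = torch.nn.Linear(1024, 1024).cuda()         # FP32 weights
optimizer = torch.optim.AdamW(model.parameters())  # FP32 optimizer states
scaler = torch.cuda.amp.GradScaler()               # loss scaling for FP16

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).square().mean()                # GEMM runs in FP16
scaler.scale(loss).backward()                      # grads accumulate in FP32 .grad
scaler.step(optimizer)                             # unscales, then FP32 update
scaler.update()
```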
@@ -326,7 +326,7 @@ What is actually happening:
 
 * Extremely similar, however mostly used different naming conventions for items and slight tweaks in the implementation
 
-::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);"}
+::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
 Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local)
 --|--|--|--|--|--
 FSDP | bf16 | default (none) | bf16 | bf16 | bf16