drbh HF Staff committed on
Commit
6ba657a
verified
1 Parent(s): aaf1485

Upload folder using huggingface_hub

Files changed (3)
  1. cells/nv.py +3 -0
  2. index.html +132 -83
  3. note_test_override.html +132 -83
cells/nv.py ADDED
@@ -0,0 +1,3 @@
+ import subprocess
+
+ print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
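The new cell shells out to `nvidia-smi` unconditionally, which raises `FileNotFoundError` on hosts without the NVIDIA driver installed. A more defensive variant (a sketch only, not part of this commit; `run_tool` is a hypothetical helper):

```python
import shutil
import subprocess

def run_tool(cmd):
    """Run a CLI tool and return its stdout, or None when the binary is absent."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# Usage with a command available on virtually every system:
print(run_tool(["echo", "gpu check placeholder"]))
```

`shutil.which` avoids the exception path entirely, so the notebook cell degrades gracefully on CPU-only machines.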
index.html CHANGED
@@ -3711,14 +3711,81 @@ span.linenos.special { color: #000000; background-color: #ffffc0; padding-left:
 </div>
 
 <div class="main-content">
- <div class="cell" id="cell-setup">
 <div class="cell-header">
 <span class="collapse-indicators">
 <span onclick="toggleCode('setup')" style="cursor: pointer;">▼ code</span>
 <span onclick="toggleOutput('setup')" style="cursor: pointer;">▼ output</span>
 <span id="uv-indicator-setup" onclick="toggleUvLogsFromHeader('setup')" style="cursor: pointer;">▶ uv-logs</span>
 </span> |
- Cell: setup | 132.82s
 | <button class="run-btn" onclick="runCell('setup')">▶ run</button>
 <button class="copy-btn" onclick="copyCell('setup')">Copy</button>
 <a href="cells/setup.py" target="_blank" class="raw-btn">Raw</a>
@@ -3977,26 +4044,8 @@ Reasoning: low
 
 What is Tensor Parallelism?
 
- &lt;|end|&gt;&lt;|start|&gt;assistant&lt;|channel|&gt;analysis&lt;|message|&gt;We need to explain what Tensor Parallelism is. It&#x27;s a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices, so each device holds a slice of the matrix. During forward/backward passes, each device computes partial results and then they are aggregated. It&#x27;s used to scale up models beyond single device memory. Also mention pipeline parallelism, data parallelism. Provide details: e.g., for a linear layer weight matrix W of shape (out_features, in_features), we can split along out_features dimension across devices. Each device computes its part of the output. Then gather. Similarly for attention QKV projections. Provide example: GPT-3 uses tensor parallelism. Also mention frameworks: Megatron-LM, DeepSpeed, etc. Provide pros/cons. Provide typical implementation: using torch.distributed.all_reduce, gather, etc. Provide code snippet. Also mention that it&#x27;s different from data parallelism. Provide explanation of how it works in practice. Provide mention of &quot;tensor model parallelism&quot; vs &quot;tensor parallelism&quot; synonyms. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed ZeRO Stage 3&quot; or &quot;Megatron-LM&quot;. Provide mention of &quot;tensor parallelism&quot; as part of &quot;model parallelism&quot; to reduce memory usage. Provide mention of &quot;tensor parallelism&quot; as &quot;splitting weight matrices across GPUs&quot; and &quot;communication overhead&quot;.
-
- Also mention that it&#x27;s used for large transformer models like GPT-3, LLaMA, etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;DeepSpeed&#x27;s ZeRO-Offload&quot; or &quot;ZeRO-3&quot;.
-
- Also mention that &quot;tensor parallelism&quot; can be combined with &quot;pipeline parallelism&quot; and &quot;data parallelism&quot; to achieve full scaling.
-
- Also mention that &quot;tensor parallelism&quot; can be implemented by splitting the weight matrix along the output dimension, performing local matrix multiplication, then all-reduce to sum partial outputs.
-
- Also mention that &quot;tensor parallelism&quot; can be used for linear layers, self-attention, feed-forward networks, etc.
-
- Also mention that &quot;tensor parallelism&quot; can be used for &quot;embedding tables&quot; by sharding them across devices.
-
- Also mention that &quot;tensor parallelism&quot; can be used for &quot;attention heads&quot; by splitting across heads.
-
- Also mention that &quot;tensor parallelism&quot; can be used for &quot;parameter sharding&quot;.
-
- Also mention that &quot;tensor parallelism&quot; can be used for &quot;model parallelism&quot; to reduce memory usage.
-
-
- Generation took 51.90 seconds
 </div>
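The generated analysis above describes the core mechanic of tensor parallelism: shard a linear layer's weight matrix, run local matrix multiplications on each device, then gather (column split) or all-reduce (row split) the partial results. A single-process numpy simulation of both schemes (illustrative only; real implementations such as Megatron-LM use torch.distributed collectives across actual GPUs):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))    # batch of activations
W = rng.standard_normal((8, 12))   # full weight, in_features x out_features

# Column parallelism: shard W along the output dimension across 4 "devices".
shards = np.split(W, 4, axis=1)          # each device holds an (8, 3) slice
partial = [x @ w for w in shards]        # each device computes its output slice
y_col = np.concatenate(partial, axis=1)  # all-gather of the slices

# Row parallelism: shard W along the input dimension; each device sees only
# the matching slice of x, and partial sums are combined with an all-reduce
# (here: a plain Python sum).
w_rows = np.split(W, 4, axis=0)
x_parts = np.split(x, 4, axis=1)
y_row = sum(xp @ wr for xp, wr in zip(x_parts, w_rows))

# Both shardings reproduce the unsharded result.
assert np.allclose(y_col, x @ W)
assert np.allclose(y_row, x @ W)
```

Megatron-style transformer blocks chain the two: a column-parallel projection feeds a row-parallel one, so only one all-reduce per block is needed.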
 <div class="uv-install-logs" id="uv-logs-setup">
 <div class="uv-logs-header" onclick="toggleUvLogs(this)">▶ UV Install Logs</div>
@@ -4006,31 +4055,31 @@ Downloading cpython-3.13.7-linux-x86_64-gnu (download) (32.0MiB)
 Updating https://github.com/huggingface/transformers.git (HEAD)
 Updated https://github.com/huggingface/transformers.git (99b0995138c17ef953959c70f35cb2bdc41111a2)
 Building transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
- Downloading nvidia-cusolver-cu12 (255.1MiB)
 Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
- Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
- Downloading sympy (6.0MiB)
 Downloading jedi (1.5MiB)
- Downloading fonttools (4.7MiB)
- Downloading pillow (6.3MiB)
 Downloading nvidia-cusparse-cu12 (274.9MiB)
 Downloading nvidia-curand-cu12 (60.7MiB)
- Downloading numpy (15.9MiB)
 Downloading nvidia-cufft-cu12 (184.2MiB)
- Downloading matplotlib (8.3MiB)
 Downloading pygments (1.2MiB)
 Downloading nvidia-cublas-cu12 (566.8MiB)
 Downloading kiwisolver (1.4MiB)
 Downloading nvidia-nccl-cu12 (307.4MiB)
- Downloading nvidia-cusparselt-cu12 (273.9MiB)
- Downloading nvidia-nvjitlink-cu12 (37.4MiB)
- Downloading networkx (1.9MiB)
- Downloading hf-xet (3.0MiB)
- Downloading nvidia-cudnn-cu12 (674.0MiB)
 Downloading torch (846.8MiB)
- Downloading tokenizers (3.1MiB)
- Downloading nvidia-cufile-cu12 (1.1MiB)
- Downloading triton (148.4MiB)
 Downloading nvidia-cufile-cu12
 Downloading kiwisolver
 Downloading pygments
@@ -4043,8 +4092,8 @@ Downloading triton (148.4MiB)
 Downloading nvidia-cuda-cupti-cu12
 Downloading numpy
 Downloading sympy
- Downloading nvidia-nvjitlink-cu12
 Built transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
 Downloading jedi
 Downloading nvidia-curand-cu12
 Downloading nvidia-cuda-nvrtc-cu12
@@ -4057,27 +4106,26 @@ Downloading triton (148.4MiB)
 Downloading nvidia-cublas-cu12
 Downloading nvidia-cudnn-cu12
 Downloading torch
- Installed 69 packages in 539ms
 </div>
 </div>
 <div class="cell-stderr">Fetching 3 files: 0%| | 0/3 [00:00&lt;?, ?it/s]
- Fetching 3 files: 33%|███▎ | 1/3 [00:07&lt;00:15, 7.55s/it]
- Fetching 3 files: 67%|██████▋ | 2/3 [00:08&lt;00:03, 3.72s/it]
- Fetching 3 files: 100%|██████████| 3/3 [00:08&lt;00:00, 2.87s/it]
 You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
 
 Loading checkpoint shards: 0%| | 0/3 [00:00&lt;?, ?it/s]
 Loading checkpoint shards: 33%|███▎ | 1/3 [00:02&lt;00:04, 2.34s/it]
- Loading checkpoint shards: 67%|██████▋ | 2/3 [00:04&lt;00:02, 2.28s/it]
- Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.82s/it]
- Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.95s/it]
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
 
 Fetching 6 files: 0%| | 0/6 [00:00&lt;?, ?it/s]
- Fetching 6 files: 17%|█▋ | 1/6 [00:00&lt;00:00, 5.35it/s]
- Fetching 6 files: 50%|█████ | 3/6 [00:00&lt;00:00, 6.55it/s]
- Fetching 6 files: 100%|██████████| 6/6 [00:00&lt;00:00, 12.81it/s]
- /tmp/uvnote-run-og9tszom/home/.cache/uv/environments-v2/setup-d9b6d9dd835772a9/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
 No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
 warnings.warn(
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
@@ -4104,7 +4152,7 @@ INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for laye
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
- /tmp/uvnote-run-og9tszom/home/.cache/uv/environments-v2/setup-d9b6d9dd835772a9/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
 No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
 warnings.warn(
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
@@ -4141,7 +4189,7 @@ INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for laye
 <span onclick="toggleOutput('setup2')" style="cursor: pointer;">▼ output</span>
 <span id="uv-indicator-setup2" onclick="toggleUvLogsFromHeader('setup2')" style="cursor: pointer;">▶ uv-logs</span>
 </span> |
- Cell: setup2 | 140.15s
 | <button class="run-btn" onclick="runCell('setup2')">▶ run</button>
 <button class="copy-btn" onclick="copyCell('setup2')">Copy</button>
 <a href="cells/setup2.py" target="_blank" class="raw-btn">Raw</a>
@@ -4399,7 +4447,7 @@ Reasoning: low
 What is Tensor Parallelism?
 
 &lt;|end|&gt;&lt;|start|&gt;assistant&lt;|channel|&gt;analysis&lt;|message|&gt;We need to explain what Tensor Parallelism is. It&#x27;s a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices. Provide details: how it works, benefits, challenges, typical frameworks, etc. Also mention difference from data parallelism, pipeline parallelism. Provide example: splitting a weight matrix across GPUs, each GPU holds a slice, compute partial results, then gather. Provide mention of communication overhead, scaling, etc. Also mention that it&#x27;s used in large models like GPT-3, Megatron-LM, DeepSpeed, etc. Provide explanation of how it reduces memory usage, increases throughput. Provide mention of &quot;tensor model parallelism&quot; vs &quot;tensor parallelism&quot; synonyms. Provide mention of &quot;tensor parallelism&quot; in context of huggingface accelerate, DeepSpeed, Megatron. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in the &quot;DeepSpeed ZeRO-Offload&quot; or &quot;ZeRO-3&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&quot; and &quot;Megatron-LM&quot; and &quot;DeepSpeed&#x27;s ZeRO&quot; and &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the
- Generation took 57.93 seconds
 </div>
 <div class="uv-install-logs" id="uv-logs-setup2">
 <div class="uv-logs-header" onclick="toggleUvLogs(this)">▶ UV Install Logs</div>
@@ -4408,32 +4456,32 @@ Downloading cpython-3.13.7-linux-x86_64-gnu (download) (32.0MiB)
 Downloading cpython-3.13.7-linux-x86_64-gnu (download)
 Updating https://github.com/huggingface/transformers.git (HEAD)
 Updated https://github.com/huggingface/transformers.git (99b0995138c17ef953959c70f35cb2bdc41111a2)
- Downloading numpy (15.9MiB)
 Downloading pygments (1.2MiB)
- Building transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
- Downloading pillow (6.3MiB)
- Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
- Downloading nvidia-cusolver-cu12 (255.1MiB)
- Downloading networkx (1.9MiB)
 Downloading nvidia-cufile-cu12 (1.1MiB)
- Downloading tokenizers (3.1MiB)
 Downloading hf-xet (3.0MiB)
- Downloading nvidia-cublas-cu12 (566.8MiB)
- Downloading nvidia-cudnn-cu12 (674.0MiB)
- Downloading nvidia-cufft-cu12 (184.2MiB)
 Downloading sympy (6.0MiB)
- Downloading nvidia-curand-cu12 (60.7MiB)
 Downloading nvidia-cusparse-cu12 (274.9MiB)
- Downloading jedi (1.5MiB)
 Downloading nvidia-cusparselt-cu12 (273.9MiB)
- Downloading nvidia-nvjitlink-cu12 (37.4MiB)
 Downloading nvidia-nccl-cu12 (307.4MiB)
- Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
- Downloading torch (846.8MiB)
 Downloading triton (148.4MiB)
- Downloading matplotlib (8.3MiB)
 Downloading kiwisolver (1.4MiB)
- Downloading fonttools (4.7MiB)
 Downloading nvidia-cufile-cu12
 Downloading kiwisolver
 Downloading pygments
4446
  Downloading nvidia-cuda-cupti-cu12
4447
  Downloading numpy
4448
  Downloading sympy
4449
- Downloading nvidia-nvjitlink-cu12
4450
  Built transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
 
4451
  Downloading jedi
4452
  Downloading nvidia-curand-cu12
4453
  Downloading nvidia-cuda-nvrtc-cu12
@@ -4460,30 +4508,31 @@ Downloading fonttools (4.7MiB)
 Downloading nvidia-cublas-cu12
 Downloading nvidia-cudnn-cu12
 Downloading torch
- Installed 69 packages in 460ms
 </div>
 </div>
 <div class="cell-stderr">Fetching 3 files: 0%| | 0/3 [00:00&lt;?, ?it/s]
- Fetching 3 files: 33%|███▎ | 1/3 [00:07&lt;00:14, 7.31s/it]
- Fetching 3 files: 67%|██████▋ | 2/3 [00:08&lt;00:03, 3.67s/it]
- Fetching 3 files: 100%|██████████| 3/3 [00:08&lt;00:00, 2.81s/it]
 You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
 
 Loading checkpoint shards: 0%| | 0/3 [00:00&lt;?, ?it/s]
- Loading checkpoint shards: 33%|███▎ | 1/3 [00:02&lt;00:04, 2.35s/it]
 Loading checkpoint shards: 67%|██████▋ | 2/3 [00:04&lt;00:02, 2.25s/it]
 Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.80s/it]
 Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.93s/it]
 INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
 
 Fetching 66 files: 0%| | 0/66 [00:00&lt;?, ?it/s]
- Fetching 66 files: 2%|▏ | 1/66 [00:00&lt;00:17, 3.80it/s]
- Fetching 66 files: 9%|▉ | 6/66 [00:00&lt;00:03, 19.77it/s]
- Fetching 66 files: 14%|█▎ | 9/66 [00:00&lt;00:02, 19.05it/s]
- Fetching 66 files: 26%|██▌ | 17/66 [00:01&lt;00:04, 10.13it/s]
- Fetching 66 files: 86%|████████▋ | 57/66 [00:01&lt;00:00, 47.70it/s]
- Fetching 66 files: 100%|██████████| 66/66 [00:01&lt;00:00, 36.16it/s]
- /tmp/uvnote-run-d2g9g4zl/home/.cache/uv/environments-v2/setup2-ea0d7cee95bc10c1/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
 No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
 warnings.warn(
 INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
@@ -4510,7 +4559,7 @@ INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks
 INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
 INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
 INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
- /tmp/uvnote-run-d2g9g4zl/home/.cache/uv/environments-v2/setup2-ea0d7cee95bc10c1/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
 No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
 warnings.warn(
 INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
 
 </div>
 
 <div class="main-content">
+ <div class="cell" id="cell-nv">
+ <div class="cell-header">
+ <span class="collapse-indicators">
+ <span onclick="toggleCode('nv')" style="cursor: pointer;">▼ code</span>
+ <span onclick="toggleOutput('nv')" style="cursor: pointer;">▼ output</span>
+ <span id="uv-indicator-nv" style="cursor: default; opacity: 0.3;">▶ uv-logs</span>
+ </span> |
+ Cell: nv | 0.53s
+ | <button class="run-btn" onclick="runCell('nv')">▶ run</button>
+ <button class="copy-btn" onclick="copyCell('nv')">Copy</button>
+ <a href="cells/nv.py" target="_blank" class="raw-btn">Raw</a>
+ </div>
+ <div id="code-nv" class="cell-code" data-lines="3">
+ <div class="highlight-with-lines">
+ <div class="line-numbers" id="lines-nv">
+ <a class="line-number" data-cell="nv" data-line="1" href="#cell-nv" onclick="event.preventDefault(); selectCellLine('nv', 1, true);">1</a>
+ <a class="line-number" data-cell="nv" data-line="2" href="#cell-nv" onclick="event.preventDefault(); selectCellLine('nv', 2, true);">2</a>
+ <a class="line-number" data-cell="nv" data-line="3" href="#cell-nv" onclick="event.preventDefault(); selectCellLine('nv', 3, true);">3</a>
+ </div>
+ <div class="code-wrap">
+ <div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">subprocess</span>
+
+ <span class="nb">print</span><span class="p">(</span><span class="n">subprocess</span><span class="o">.</span><span class="n">run</span><span class="p">([</span><span class="s2">&quot;nvidia-smi&quot;</span><span class="p">],</span> <span class="n">capture_output</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">stdout</span><span class="p">)</span>
+ </pre></div>
+
+ <div class="code-line-highlight" id="line-highlight-nv"></div>
+ </div>
+ </div>
+ </div>
+ <div id="output-nv" class="cell-output">
+ <div class="cell-stdout">Tue Sep 23 19:46:07 2025
+ +-----------------------------------------------------------------------------------------+
+ | NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |
+ |-----------------------------------------+------------------------+----------------------+
+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
+ | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
+ | | | MIG M. |
+ |=========================================+========================+======================|
+ | 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
+ | 0% 42C P0 71W / 300W | 0MiB / 23028MiB | 0% Default |
+ | | | N/A |
+ +-----------------------------------------+------------------------+----------------------+
+ | 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
+ | 0% 43C P0 44W / 300W | 0MiB / 23028MiB | 0% Default |
+ | | | N/A |
+ +-----------------------------------------+------------------------+----------------------+
+ | 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
+ | 0% 42C P0 46W / 300W | 0MiB / 23028MiB | 0% Default |
+ | | | N/A |
+ +-----------------------------------------+------------------------+----------------------+
+ | 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
+ | 0% 41C P0 43W / 300W | 0MiB / 23028MiB | 0% Default |
+ | | | N/A |
+ +-----------------------------------------+------------------------+----------------------+
+
+ +-----------------------------------------------------------------------------------------+
+ | Processes: |
+ | GPU GI CI PID Type Process name GPU Memory |
+ | ID ID Usage |
+ |=========================================================================================|
+ | No running processes found |
+ +-----------------------------------------------------------------------------------------+
+
+ </div>
+ </div>
+ </div>
+
+ <div class="cell" id="cell-setup">
 <div class="cell-header">
 <span class="collapse-indicators">
 <span onclick="toggleCode('setup')" style="cursor: pointer;">▼ code</span>
 <span onclick="toggleOutput('setup')" style="cursor: pointer;">▼ output</span>
 <span id="uv-indicator-setup" onclick="toggleUvLogsFromHeader('setup')" style="cursor: pointer;">▶ uv-logs</span>
 </span> |
+ Cell: setup | 133.12s
 | <button class="run-btn" onclick="runCell('setup')">▶ run</button>
 <button class="copy-btn" onclick="copyCell('setup')">Copy</button>
 <a href="cells/setup.py" target="_blank" class="raw-btn">Raw</a>
 
 
 What is Tensor Parallelism?
 
+ &lt;|end|&gt;&lt;|start|&gt;assistant&lt;|channel|&gt;analysis&lt;|message|&gt;We need to explain what Tensor Parallelism is. It&#x27;s a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices, so each device holds a slice of the matrix. During forward/backward passes, each device computes partial results and then they are aggregated. It&#x27;s used to scale up models beyond single device memory. Also mention pipeline parallelism, data parallelism. Provide details: e.g., for a linear layer weight matrix W of shape (out_features, in_features), we can split along out_features dimension across devices. Each device computes its part of the output. Then gather results. In backward, gradients are computed locally and then aggregated. Provide example: GPT-3 training uses tensor parallelism. Also mention frameworks: Megatron-LM, DeepSpeed, etc. Provide pros/cons. Provide code snippet maybe. Also mention that it&#x27;s different from data parallelism. Provide explanation of how it works in practice. Provide mention of communication overhead. Provide mention of &quot;tensor model parallelism&quot; vs &quot;tensor parallelism&quot; synonyms. Provide mention of &quot;tensor parallelism&quot; in context of huggingface accelerate. Provide mention of &quot;tensor parallelism&quot; in context of DeepSpeed ZeRO stage 3. Provide mention of &quot;tensor parallelism&quot; in context of Megatron-LM. Provide mention of &quot;tensor parallelism&quot; in context of GPT-NeoX. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-Offload&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-2&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor
+ Generation took 51.92 seconds
 </div>
 <div class="uv-install-logs" id="uv-logs-setup">
 <div class="uv-logs-header" onclick="toggleUvLogs(this)">▶ UV Install Logs</div>
 
 Updating https://github.com/huggingface/transformers.git (HEAD)
 Updated https://github.com/huggingface/transformers.git (99b0995138c17ef953959c70f35cb2bdc41111a2)
 Building transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
 Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
 Downloading jedi (1.5MiB)
 Downloading nvidia-cusparse-cu12 (274.9MiB)
+ Downloading hf-xet (3.0MiB)
+ Downloading nvidia-cusparselt-cu12 (273.9MiB)
+ Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
 Downloading nvidia-curand-cu12 (60.7MiB)
+ Downloading nvidia-nvjitlink-cu12 (37.4MiB)
 Downloading nvidia-cufft-cu12 (184.2MiB)
+ Downloading nvidia-cusolver-cu12 (255.1MiB)
+ Downloading pillow (6.3MiB)
+ Downloading sympy (6.0MiB)
+ Downloading fonttools (4.7MiB)
+ Downloading numpy (15.9MiB)
+ Downloading triton (148.4MiB)
+ Downloading networkx (1.9MiB)
+ Downloading tokenizers (3.1MiB)
 Downloading pygments (1.2MiB)
+ Downloading matplotlib (8.3MiB)
 Downloading nvidia-cublas-cu12 (566.8MiB)
+ Downloading nvidia-cufile-cu12 (1.1MiB)
+ Downloading nvidia-cudnn-cu12 (674.0MiB)
 Downloading kiwisolver (1.4MiB)
 Downloading nvidia-nccl-cu12 (307.4MiB)
 Downloading torch (846.8MiB)
 Downloading nvidia-cufile-cu12
 Downloading kiwisolver
 Downloading pygments
 
 Downloading nvidia-cuda-cupti-cu12
 Downloading numpy
 Downloading sympy
 Built transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
+ Downloading nvidia-nvjitlink-cu12
 Downloading jedi
 Downloading nvidia-curand-cu12
 Downloading nvidia-cuda-nvrtc-cu12
 
 Downloading nvidia-cublas-cu12
 Downloading nvidia-cudnn-cu12
 Downloading torch
+ Installed 69 packages in 467ms
 </div>
 </div>
 <div class="cell-stderr">Fetching 3 files: 0%| | 0/3 [00:00&lt;?, ?it/s]
+ Fetching 3 files: 33%|███▎ | 1/3 [00:06&lt;00:13, 6.78s/it]
+ Fetching 3 files: 67%|██████▋ | 2/3 [00:08&lt;00:03, 3.65s/it]
+ Fetching 3 files: 100%|██████████| 3/3 [00:08&lt;00:00, 2.75s/it]
 You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
 
 Loading checkpoint shards: 0%| | 0/3 [00:00&lt;?, ?it/s]
 Loading checkpoint shards: 33%|███▎ | 1/3 [00:02&lt;00:04, 2.34s/it]
+ Loading checkpoint shards: 67%|██████▋ | 2/3 [00:04&lt;00:02, 2.25s/it]
+ Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.80s/it]
+ Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.93s/it]
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
 
 Fetching 6 files: 0%| | 0/6 [00:00&lt;?, ?it/s]
+ Fetching 6 files: 17%|█▋ | 1/6 [00:00&lt;00:01, 3.89it/s]
+ Fetching 6 files: 100%|██████████| 6/6 [00:00&lt;00:00, 17.67it/s]
+ /tmp/uvnote-run-hvgovjfd/home/.cache/uv/environments-v2/setup-443c07e337d3be43/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
 No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
 warnings.warn(
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
 
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
+ /tmp/uvnote-run-hvgovjfd/home/.cache/uv/environments-v2/setup-443c07e337d3be43/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
 No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
 warnings.warn(
 INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
 
 <span onclick="toggleOutput('setup2')" style="cursor: pointer;">▼ output</span>
 <span id="uv-indicator-setup2" onclick="toggleUvLogsFromHeader('setup2')" style="cursor: pointer;">▶ uv-logs</span>
 </span> |
+ Cell: setup2 | 139.97s
 | <button class="run-btn" onclick="runCell('setup2')">▶ run</button>
 <button class="copy-btn" onclick="copyCell('setup2')">Copy</button>
 <a href="cells/setup2.py" target="_blank" class="raw-btn">Raw</a>
 
 What is Tensor Parallelism?
 
 &lt;|end|&gt;&lt;|start|&gt;assistant&lt;|channel|&gt;analysis&lt;|message|&gt;We need to explain what Tensor Parallelism is. It&#x27;s a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices. Provide details: how it works, benefits, challenges, typical frameworks, etc. Also mention difference from data parallelism, pipeline parallelism. Provide example: splitting a weight matrix across GPUs, each GPU holds a slice, compute partial results, then gather. Provide mention of communication overhead, scaling, etc. Also mention that it&#x27;s used in large models like GPT-3, Megatron-LM, DeepSpeed, etc. Provide explanation of how it reduces memory usage, increases throughput. Provide mention of &quot;tensor model parallelism&quot; vs &quot;tensor parallelism&quot; synonyms. Provide mention of &quot;tensor parallelism&quot; in context of huggingface accelerate, DeepSpeed, Megatron. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in the &quot;DeepSpeed ZeRO-Offload&quot; or &quot;ZeRO-3&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&quot; and &quot;Megatron-LM&quot; and &quot;DeepSpeed&#x27;s ZeRO&quot; and &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the
+ Generation took 57.98 seconds
 </div>
 <div class="uv-install-logs" id="uv-logs-setup2">
 <div class="uv-logs-header" onclick="toggleUvLogs(this)">▶ UV Install Logs</div>
 
4456
  Downloading cpython-3.13.7-linux-x86_64-gnu (download)
4457
  Updating https://github.com/huggingface/transformers.git (HEAD)
4458
  Updated https://github.com/huggingface/transformers.git (99b0995138c17ef953959c70f35cb2bdc41111a2)
4459
+ Downloading jedi (1.5MiB)
4460
  Downloading pygments (1.2MiB)
4461
  Downloading nvidia-cufile-cu12 (1.1MiB)
4462
+ Building transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
4463
  Downloading hf-xet (3.0MiB)
4464
+ Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
4465
+ Downloading numpy (15.9MiB)
 
4466
  Downloading sympy (6.0MiB)
4467
+ Downloading matplotlib (8.3MiB)
4468
+ Downloading nvidia-cudnn-cu12 (674.0MiB)
4469
+ Downloading networkx (1.9MiB)
4470
+ Downloading nvidia-nvjitlink-cu12 (37.4MiB)
4471
+ Downloading pillow (6.3MiB)
4472
+ Downloading nvidia-cublas-cu12 (566.8MiB)
4473
+ Downloading tokenizers (3.1MiB)
4474
+ Downloading nvidia-cusolver-cu12 (255.1MiB)
4475
  Downloading nvidia-cusparse-cu12 (274.9MiB)
4476
+ Downloading nvidia-curand-cu12 (60.7MiB)
4477
  Downloading nvidia-cusparselt-cu12 (273.9MiB)
 
4478
  Downloading nvidia-nccl-cu12 (307.4MiB)
4479
+ Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
4480
+ Downloading fonttools (4.7MiB)
4481
+ Downloading nvidia-cufft-cu12 (184.2MiB)
4482
  Downloading triton (148.4MiB)
 
4483
  Downloading kiwisolver (1.4MiB)
4484
+ Downloading torch (846.8MiB)
4485
  Downloading nvidia-cufile-cu12
4486
  Downloading kiwisolver
4487
  Downloading pygments
 
4494
  Downloading nvidia-cuda-cupti-cu12
4495
  Downloading numpy
4496
  Downloading sympy
 
4497
  Built transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
4498
+ Downloading nvidia-nvjitlink-cu12
4499
  Downloading jedi
4500
  Downloading nvidia-curand-cu12
4501
  Downloading nvidia-cuda-nvrtc-cu12
 
4508
  Downloading nvidia-cublas-cu12
4509
  Downloading nvidia-cudnn-cu12
4510
  Downloading torch
4511
+ Installed 69 packages in 468ms
4512
  </div>
4513
  </div>
4514
  <div class="cell-stderr">Fetching 3 files: 0%| | 0/3 [00:00&lt;?, ?it/s]
4515
+ Fetching 3 files: 33%|███▎ | 1/3 [00:06&lt;00:12, 6.38s/it]
4516
+ Fetching 3 files: 67%|██████▋ | 2/3 [00:08&lt;00:03, 3.61s/it]
4517
+ Fetching 3 files: 100%|██████████| 3/3 [00:08&lt;00:00, 2.69s/it]
4518
  You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
4519
 
4520
  Loading checkpoint shards: 0%| | 0/3 [00:00&lt;?, ?it/s]
4521
+ Loading checkpoint shards: 33%|███▎ | 1/3 [00:02&lt;00:04, 2.34s/it]
4522
  Loading checkpoint shards: 67%|██████▋ | 2/3 [00:04&lt;00:02, 2.25s/it]
4523
  Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.80s/it]
4524
  Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.93s/it]
4525
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4526
 
4527
  Fetching 66 files: 0%| | 0/66 [00:00&lt;?, ?it/s]
4528
+ Fetching 66 files: 2%|▏ | 1/66 [00:00&lt;00:10, 6.10it/s]
4529
+ Fetching 66 files: 14%|█▎ | 9/66 [00:00&lt;00:01, 30.47it/s]
4530
+ Fetching 66 files: 24%|██▍ | 16/66 [00:00&lt;00:01, 37.56it/s]
4531
+ Fetching 66 files: 30%|███ | 20/66 [00:01&lt;00:03, 14.24it/s]
4532
+ Fetching 66 files: 67%|██████▋ | 44/66 [00:01&lt;00:00, 37.14it/s]
4533
+ Fetching 66 files: 91%|█████████ | 60/66 [00:01&lt;00:00, 49.97it/s]
4534
+ Fetching 66 files: 100%|██████████| 66/66 [00:01&lt;00:00, 36.02it/s]
4535
+ /tmp/uvnote-run-nw4e52ut/home/.cache/uv/environments-v2/setup2-69adf76231e4ab4f/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
4536
  No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
4537
  warnings.warn(
4538
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
 
4559
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4560
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4561
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4562
+ /tmp/uvnote-run-nw4e52ut/home/.cache/uv/environments-v2/setup2-69adf76231e4ab4f/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
4563
  No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
4564
  warnings.warn(
4565
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
note_test_override.html CHANGED
@@ -3711,14 +3711,81 @@ span.linenos.special { color: #000000; background-color: #ffffc0; padding-left:
3711
  </div>
3712
 
3713
  <div class="main-content">
3714
- <div class="cell" id="cell-setup">
3715
  <div class="cell-header">
3716
  <span class="collapse-indicators">
3717
  <span onclick="toggleCode('setup')" style="cursor: pointer;">▼ code</span>
3718
  <span onclick="toggleOutput('setup')" style="cursor: pointer;">▼ output</span>
3719
  <span id="uv-indicator-setup" onclick="toggleUvLogsFromHeader('setup')" style="cursor: pointer;">▶ uv-logs</span>
3720
  </span> |
3721
- Cell: setup | 132.82s
3722
  | <button class="run-btn" onclick="runCell('setup')">▶ run</button>
3723
  <button class="copy-btn" onclick="copyCell('setup')">Copy</button>
3724
  <a href="cells/setup.py" target="_blank" class="raw-btn">Raw</a>
@@ -3977,26 +4044,8 @@ Reasoning: low
3977
 
3978
  What is Tensor Parallelism?
3979
 
3980
- &lt;|end|&gt;&lt;|start|&gt;assistant&lt;|channel|&gt;analysis&lt;|message|&gt;We need to explain what Tensor Parallelism is. It&#x27;s a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices, so each device holds a slice of the matrix. During forward/backward passes, each device computes partial results and then they are aggregated. It&#x27;s used to scale up models beyond single device memory. Also mention pipeline parallelism, data parallelism. Provide details: e.g., for a linear layer weight matrix W of shape (out_features, in_features), we can split along out_features dimension across devices. Each device computes its part of the output. Then gather. Similarly for attention QKV projections. Provide example: GPT-3 uses tensor parallelism. Also mention frameworks: Megatron-LM, DeepSpeed, etc. Provide pros/cons. Provide typical implementation: using torch.distributed.all_reduce, gather, etc. Provide code snippet. Also mention that it&#x27;s different from data parallelism. Provide explanation of how it works in practice. Provide mention of &quot;tensor model parallelism&quot; vs &quot;tensor parallelism&quot; synonyms. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed ZeRO Stage 3&quot; or &quot;Megatron-LM&quot;. Provide mention of &quot;tensor parallelism&quot; as part of &quot;model parallelism&quot; to reduce memory usage. Provide mention of &quot;tensor parallelism&quot; as &quot;splitting weight matrices across GPUs&quot; and &quot;communication overhead&quot;.
3981
-
3982
- Also mention that it&#x27;s used for large transformer models like GPT-3, LLaMA, etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;DeepSpeed&#x27;s ZeRO-Offload&quot; or &quot;ZeRO-3&quot;.
3983
-
3984
- Also mention that &quot;tensor parallelism&quot; can be combined with &quot;pipeline parallelism&quot; and &quot;data parallelism&quot; to achieve full scaling.
3985
-
3986
- Also mention that &quot;tensor parallelism&quot; can be implemented by splitting the weight matrix along the output dimension, performing local matrix multiplication, then all-reduce to sum partial outputs.
3987
-
3988
- Also mention that &quot;tensor parallelism&quot; can be used for linear layers, self-attention, feed-forward networks, etc.
3989
-
3990
- Also mention that &quot;tensor parallelism&quot; can be used for &quot;embedding tables&quot; by sharding them across devices.
3991
-
3992
- Also mention that &quot;tensor parallelism&quot; can be used for &quot;attention heads&quot; by splitting across heads.
3993
-
3994
- Also mention that &quot;tensor parallelism&quot; can be used for &quot;parameter sharding&quot;.
3995
-
3996
- Also mention that &quot;tensor parallelism&quot; can be used for &quot;model parallelism&quot; to reduce memory usage.
3997
-
3998
-
3999
- Generation took 51.90 seconds
4000
  </div>
4001
  <div class="uv-install-logs" id="uv-logs-setup">
4002
  <div class="uv-logs-header" onclick="toggleUvLogs(this)">▶ UV Install Logs</div>
@@ -4006,31 +4055,31 @@ Downloading cpython-3.13.7-linux-x86_64-gnu (download) (32.0MiB)
4006
  Updating https://github.com/huggingface/transformers.git (HEAD)
4007
  Updated https://github.com/huggingface/transformers.git (99b0995138c17ef953959c70f35cb2bdc41111a2)
4008
  Building transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
4009
- Downloading nvidia-cusolver-cu12 (255.1MiB)
4010
  Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
4011
- Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
4012
- Downloading sympy (6.0MiB)
4013
  Downloading jedi (1.5MiB)
4014
- Downloading fonttools (4.7MiB)
4015
- Downloading pillow (6.3MiB)
4016
  Downloading nvidia-cusparse-cu12 (274.9MiB)
4017
  Downloading nvidia-curand-cu12 (60.7MiB)
4018
- Downloading numpy (15.9MiB)
4019
  Downloading nvidia-cufft-cu12 (184.2MiB)
4020
- Downloading matplotlib (8.3MiB)
4021
  Downloading pygments (1.2MiB)
 
4022
  Downloading nvidia-cublas-cu12 (566.8MiB)
 
 
4023
  Downloading kiwisolver (1.4MiB)
4024
  Downloading nvidia-nccl-cu12 (307.4MiB)
4025
- Downloading nvidia-cusparselt-cu12 (273.9MiB)
4026
- Downloading nvidia-nvjitlink-cu12 (37.4MiB)
4027
- Downloading networkx (1.9MiB)
4028
- Downloading hf-xet (3.0MiB)
4029
- Downloading nvidia-cudnn-cu12 (674.0MiB)
4030
  Downloading torch (846.8MiB)
4031
- Downloading tokenizers (3.1MiB)
4032
- Downloading nvidia-cufile-cu12 (1.1MiB)
4033
- Downloading triton (148.4MiB)
4034
  Downloading nvidia-cufile-cu12
4035
  Downloading kiwisolver
4036
  Downloading pygments
@@ -4043,8 +4092,8 @@ Downloading triton (148.4MiB)
4043
  Downloading nvidia-cuda-cupti-cu12
4044
  Downloading numpy
4045
  Downloading sympy
4046
- Downloading nvidia-nvjitlink-cu12
4047
  Built transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
 
4048
  Downloading jedi
4049
  Downloading nvidia-curand-cu12
4050
  Downloading nvidia-cuda-nvrtc-cu12
@@ -4057,27 +4106,26 @@ Downloading triton (148.4MiB)
4057
  Downloading nvidia-cublas-cu12
4058
  Downloading nvidia-cudnn-cu12
4059
  Downloading torch
4060
- Installed 69 packages in 539ms
4061
  </div>
4062
  </div>
4063
  <div class="cell-stderr">Fetching 3 files: 0%| | 0/3 [00:00&lt;?, ?it/s]
4064
- Fetching 3 files: 33%|███▎ | 1/3 [00:07&lt;00:15, 7.55s/it]
4065
- Fetching 3 files: 67%|██████▋ | 2/3 [00:08&lt;00:03, 3.72s/it]
4066
- Fetching 3 files: 100%|██████████| 3/3 [00:08&lt;00:00, 2.87s/it]
4067
  You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
4068
 
4069
  Loading checkpoint shards: 0%| | 0/3 [00:00&lt;?, ?it/s]
4070
  Loading checkpoint shards: 33%|███▎ | 1/3 [00:02&lt;00:04, 2.34s/it]
4071
- Loading checkpoint shards: 67%|██████▋ | 2/3 [00:04&lt;00:02, 2.28s/it]
4072
- Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.82s/it]
4073
- Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.95s/it]
4074
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
4075
 
4076
  Fetching 6 files: 0%| | 0/6 [00:00&lt;?, ?it/s]
4077
- Fetching 6 files: 17%|█▋ | 1/6 [00:00&lt;00:00, 5.35it/s]
4078
- Fetching 6 files: 50%|█████ | 3/6 [00:00&lt;00:00, 6.55it/s]
4079
- Fetching 6 files: 100%|██████████| 6/6 [00:00&lt;00:00, 12.81it/s]
4080
- /tmp/uvnote-run-og9tszom/home/.cache/uv/environments-v2/setup-d9b6d9dd835772a9/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
4081
  No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
4082
  warnings.warn(
4083
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
@@ -4104,7 +4152,7 @@ INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for laye
4104
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
4105
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
4106
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
4107
- /tmp/uvnote-run-og9tszom/home/.cache/uv/environments-v2/setup-d9b6d9dd835772a9/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
4108
  No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
4109
  warnings.warn(
4110
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
@@ -4141,7 +4189,7 @@ INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for laye
4141
  <span onclick="toggleOutput('setup2')" style="cursor: pointer;">▼ output</span>
4142
  <span id="uv-indicator-setup2" onclick="toggleUvLogsFromHeader('setup2')" style="cursor: pointer;">▶ uv-logs</span>
4143
  </span> |
4144
- Cell: setup2 | 140.15s
4145
  | <button class="run-btn" onclick="runCell('setup2')">▶ run</button>
4146
  <button class="copy-btn" onclick="copyCell('setup2')">Copy</button>
4147
  <a href="cells/setup2.py" target="_blank" class="raw-btn">Raw</a>
@@ -4399,7 +4447,7 @@ Reasoning: low
4399
  What is Tensor Parallelism?
4400
 
4401
  &lt;|end|&gt;&lt;|start|&gt;assistant&lt;|channel|&gt;analysis&lt;|message|&gt;We need to explain what Tensor Parallelism is. It&#x27;s a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices. Provide details: how it works, benefits, challenges, typical frameworks, etc. Also mention difference from data parallelism, pipeline parallelism. Provide example: splitting a weight matrix across GPUs, each GPU holds a slice, compute partial results, then gather. Provide mention of communication overhead, scaling, etc. Also mention that it&#x27;s used in large models like GPT-3, Megatron-LM, DeepSpeed, etc. Provide explanation of how it reduces memory usage, increases throughput. Provide mention of &quot;tensor model parallelism&quot; vs &quot;tensor parallelism&quot; synonyms. Provide mention of &quot;tensor parallelism&quot; in context of huggingface accelerate, DeepSpeed, Megatron. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in the &quot;DeepSpeed ZeRO-Offload&quot; or &quot;ZeRO-3&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&quot; and &quot;Megatron-LM&quot; and &quot;DeepSpeed&#x27;s ZeRO&quot; and &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. 
Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the
4402
- Generation took 57.93 seconds
4403
  </div>
4404
  <div class="uv-install-logs" id="uv-logs-setup2">
4405
  <div class="uv-logs-header" onclick="toggleUvLogs(this)">▶ UV Install Logs</div>
@@ -4408,32 +4456,32 @@ Downloading cpython-3.13.7-linux-x86_64-gnu (download) (32.0MiB)
4408
  Downloading cpython-3.13.7-linux-x86_64-gnu (download)
4409
  Updating https://github.com/huggingface/transformers.git (HEAD)
4410
  Updated https://github.com/huggingface/transformers.git (99b0995138c17ef953959c70f35cb2bdc41111a2)
4411
- Downloading numpy (15.9MiB)
4412
  Downloading pygments (1.2MiB)
4413
- Building transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
4414
- Downloading pillow (6.3MiB)
4415
- Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
4416
- Downloading nvidia-cusolver-cu12 (255.1MiB)
4417
- Downloading networkx (1.9MiB)
4418
  Downloading nvidia-cufile-cu12 (1.1MiB)
4419
- Downloading tokenizers (3.1MiB)
4420
  Downloading hf-xet (3.0MiB)
4421
- Downloading nvidia-cublas-cu12 (566.8MiB)
4422
- Downloading nvidia-cudnn-cu12 (674.0MiB)
4423
- Downloading nvidia-cufft-cu12 (184.2MiB)
4424
  Downloading sympy (6.0MiB)
4425
- Downloading nvidia-curand-cu12 (60.7MiB)
4426
  Downloading nvidia-cusparse-cu12 (274.9MiB)
4427
- Downloading jedi (1.5MiB)
4428
  Downloading nvidia-cusparselt-cu12 (273.9MiB)
4429
- Downloading nvidia-nvjitlink-cu12 (37.4MiB)
4430
  Downloading nvidia-nccl-cu12 (307.4MiB)
4431
- Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
4432
- Downloading torch (846.8MiB)
 
4433
  Downloading triton (148.4MiB)
4434
- Downloading matplotlib (8.3MiB)
4435
  Downloading kiwisolver (1.4MiB)
4436
- Downloading fonttools (4.7MiB)
4437
  Downloading nvidia-cufile-cu12
4438
  Downloading kiwisolver
4439
  Downloading pygments
@@ -4446,8 +4494,8 @@ Downloading fonttools (4.7MiB)
4446
  Downloading nvidia-cuda-cupti-cu12
4447
  Downloading numpy
4448
  Downloading sympy
4449
- Downloading nvidia-nvjitlink-cu12
4450
  Built transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
 
4451
  Downloading jedi
4452
  Downloading nvidia-curand-cu12
4453
  Downloading nvidia-cuda-nvrtc-cu12
@@ -4460,30 +4508,31 @@ Downloading fonttools (4.7MiB)
4460
  Downloading nvidia-cublas-cu12
4461
  Downloading nvidia-cudnn-cu12
4462
  Downloading torch
4463
- Installed 69 packages in 460ms
4464
  </div>
4465
  </div>
4466
  <div class="cell-stderr">Fetching 3 files: 0%| | 0/3 [00:00&lt;?, ?it/s]
4467
- Fetching 3 files: 33%|███▎ | 1/3 [00:07&lt;00:14, 7.31s/it]
4468
- Fetching 3 files: 67%|██████▋ | 2/3 [00:08&lt;00:03, 3.67s/it]
4469
- Fetching 3 files: 100%|██████████| 3/3 [00:08&lt;00:00, 2.81s/it]
4470
  You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
4471
 
4472
  Loading checkpoint shards: 0%| | 0/3 [00:00&lt;?, ?it/s]
4473
- Loading checkpoint shards: 33%|███▎ | 1/3 [00:02&lt;00:04, 2.35s/it]
4474
  Loading checkpoint shards: 67%|██████▋ | 2/3 [00:04&lt;00:02, 2.25s/it]
4475
  Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.80s/it]
4476
  Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.93s/it]
4477
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4478
 
4479
  Fetching 66 files: 0%| | 0/66 [00:00&lt;?, ?it/s]
4480
- Fetching 66 files: 2%|▏ | 1/66 [00:00&lt;00:17, 3.80it/s]
4481
- Fetching 66 files: 9%|▉ | 6/66 [00:00&lt;00:03, 19.77it/s]
4482
- Fetching 66 files: 14%|█▎ | 9/66 [00:00&lt;00:02, 19.05it/s]
4483
- Fetching 66 files: 26%|██▌ | 17/66 [00:01&lt;00:04, 10.13it/s]
4484
- Fetching 66 files: 86%|████████▋ | 57/66 [00:01&lt;00:00, 47.70it/s]
4485
- Fetching 66 files: 100%|██████████| 66/66 [00:01&lt;00:00, 36.16it/s]
4486
- /tmp/uvnote-run-d2g9g4zl/home/.cache/uv/environments-v2/setup2-ea0d7cee95bc10c1/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
 
4487
  No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
4488
  warnings.warn(
4489
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
@@ -4510,7 +4559,7 @@ INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks
4510
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4511
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4512
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4513
- /tmp/uvnote-run-d2g9g4zl/home/.cache/uv/environments-v2/setup2-ea0d7cee95bc10c1/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
4514
  No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
4515
  warnings.warn(
4516
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
 
3711
  </div>
3712
 
3713
  <div class="main-content">
3714
+ <div class="cell" id="cell-nv">
3715
+ <div class="cell-header">
3716
+ <span class="collapse-indicators">
3717
+ <span onclick="toggleCode('nv')" style="cursor: pointer;">▼ code</span>
3718
+ <span onclick="toggleOutput('nv')" style="cursor: pointer;">▼ output</span>
3719
+ <span id="uv-indicator-nv" style="cursor: default; opacity: 0.3;">▶ uv-logs</span>
3720
+ </span> |
3721
+ Cell: nv | 0.53s
3722
+ | <button class="run-btn" onclick="runCell('nv')">▶ run</button>
3723
+ <button class="copy-btn" onclick="copyCell('nv')">Copy</button>
3724
+ <a href="cells/nv.py" target="_blank" class="raw-btn">Raw</a>
3725
+ </div>
3726
+ <div id="code-nv" class="cell-code" data-lines="3">
3727
+ <div class="highlight-with-lines">
3728
+ <div class="line-numbers" id="lines-nv">
3729
+ <a class="line-number" data-cell="nv" data-line="1" href="#cell-nv" onclick="event.preventDefault(); selectCellLine('nv', 1, true);">1</a>
3730
+ <a class="line-number" data-cell="nv" data-line="2" href="#cell-nv" onclick="event.preventDefault(); selectCellLine('nv', 2, true);">2</a>
3731
+ <a class="line-number" data-cell="nv" data-line="3" href="#cell-nv" onclick="event.preventDefault(); selectCellLine('nv', 3, true);">3</a>
3732
+ </div>
3733
+ <div class="code-wrap">
3734
+ <div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">subprocess</span>
3735
+
3736
+ <span class="nb">print</span><span class="p">(</span><span class="n">subprocess</span><span class="o">.</span><span class="n">run</span><span class="p">([</span><span class="s2">&quot;nvidia-smi&quot;</span><span class="p">],</span> <span class="n">capture_output</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">stdout</span><span class="p">)</span>
3737
+ </pre></div>
3738
+
3739
+ <div class="code-line-highlight" id="line-highlight-nv"></div>
3740
+ </div>
3741
+ </div>
3742
+ </div>
3743
+ <div id="output-nv" class="cell-output">
3744
+ <div class="cell-stdout">Tue Sep 23 19:46:07 2025
3745
+ +-----------------------------------------------------------------------------------------+
3746
+ | NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |
3747
+ |-----------------------------------------+------------------------+----------------------+
3748
+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
3749
+ | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
3750
+ | | | MIG M. |
3751
+ |=========================================+========================+======================|
3752
+ | 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
3753
+ | 0% 42C P0 71W / 300W | 0MiB / 23028MiB | 0% Default |
3754
+ | | | N/A |
3755
+ +-----------------------------------------+------------------------+----------------------+
3756
+ | 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
3757
+ | 0% 43C P0 44W / 300W | 0MiB / 23028MiB | 0% Default |
3758
+ | | | N/A |
3759
+ +-----------------------------------------+------------------------+----------------------+
3760
+ | 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
3761
+ | 0% 42C P0 46W / 300W | 0MiB / 23028MiB | 0% Default |
3762
+ | | | N/A |
3763
+ +-----------------------------------------+------------------------+----------------------+
3764
+ | 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
3765
+ | 0% 41C P0 43W / 300W | 0MiB / 23028MiB | 0% Default |
3766
+ | | | N/A |
3767
+ +-----------------------------------------+------------------------+----------------------+
3768
+
3769
+ +-----------------------------------------------------------------------------------------+
3770
+ | Processes: |
3771
+ | GPU GI CI PID Type Process name GPU Memory |
3772
+ | ID ID Usage |
3773
+ |=========================================================================================|
3774
+ | No running processes found |
3775
+ +-----------------------------------------------------------------------------------------+
3776
+
3777
+ </div>
3778
+ </div>
3779
+ </div>
3780
+
3781
+ <div class="cell" id="cell-setup">
3782
  <div class="cell-header">
3783
  <span class="collapse-indicators">
3784
  <span onclick="toggleCode('setup')" style="cursor: pointer;">▼ code</span>
3785
  <span onclick="toggleOutput('setup')" style="cursor: pointer;">▼ output</span>
3786
  <span id="uv-indicator-setup" onclick="toggleUvLogsFromHeader('setup')" style="cursor: pointer;">▶ uv-logs</span>
3787
  </span> |
3788
+ Cell: setup | 133.12s
3789
  | <button class="run-btn" onclick="runCell('setup')">▶ run</button>
3790
  <button class="copy-btn" onclick="copyCell('setup')">Copy</button>
3791
  <a href="cells/setup.py" target="_blank" class="raw-btn">Raw</a>
 
4044
 
4045
  What is Tensor Parallelism?
4046
 
4047
+ &lt;|end|&gt;&lt;|start|&gt;assistant&lt;|channel|&gt;analysis&lt;|message|&gt;We need to explain what Tensor Parallelism is. It&#x27;s a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices, so each device holds a slice of the matrix. During forward/backward passes, each device computes partial results and then they are aggregated. It&#x27;s used to scale up models beyond single device memory. Also mention pipeline parallelism, data parallelism. Provide details: e.g., for a linear layer weight matrix W of shape (out_features, in_features), we can split along out_features dimension across devices. Each device computes its part of the output. Then gather results. In backward, gradients are computed locally and then aggregated. Provide example: GPT-3 training uses tensor parallelism. Also mention frameworks: Megatron-LM, DeepSpeed, etc. Provide pros/cons. Provide code snippet maybe. Also mention that it&#x27;s different from data parallelism. Provide explanation of how it works in practice. Provide mention of communication overhead. Provide mention of &quot;tensor model parallelism&quot; vs &quot;tensor parallelism&quot; synonyms. Provide mention of &quot;tensor parallelism&quot; in context of huggingface accelerate. Provide mention of &quot;tensor parallelism&quot; in context of DeepSpeed ZeRO stage 3. Provide mention of &quot;tensor parallelism&quot; in context of Megatron-LM. Provide mention of &quot;tensor parallelism&quot; in context of GPT-NeoX. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-Offload&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-2&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. 
Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor parallelism&quot; in context of &quot;DeepSpeed&#x27;s ZeRO-3&quot; maybe. Provide mention of &quot;tensor
4048
+ Generation took 51.92 seconds
4049
  </div>
4050
  <div class="uv-install-logs" id="uv-logs-setup">
4051
  <div class="uv-logs-header" onclick="toggleUvLogs(this)">▶ UV Install Logs</div>
 
4055
  Updating https://github.com/huggingface/transformers.git (HEAD)
4056
  Updated https://github.com/huggingface/transformers.git (99b0995138c17ef953959c70f35cb2bdc41111a2)
4057
  Building transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
 
4058
  Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
 
 
4059
  Downloading jedi (1.5MiB)
 
 
4060
  Downloading nvidia-cusparse-cu12 (274.9MiB)
4061
+ Downloading hf-xet (3.0MiB)
4062
+ Downloading nvidia-cusparselt-cu12 (273.9MiB)
4063
+ Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
4064
  Downloading nvidia-curand-cu12 (60.7MiB)
4065
+ Downloading nvidia-nvjitlink-cu12 (37.4MiB)
4066
  Downloading nvidia-cufft-cu12 (184.2MiB)
4067
+ Downloading nvidia-cusolver-cu12 (255.1MiB)
4068
+ Downloading pillow (6.3MiB)
4069
+ Downloading sympy (6.0MiB)
4070
+ Downloading fonttools (4.7MiB)
4071
+ Downloading numpy (15.9MiB)
4072
+ Downloading triton (148.4MiB)
4073
+ Downloading networkx (1.9MiB)
4074
+ Downloading tokenizers (3.1MiB)
4075
  Downloading pygments (1.2MiB)
4076
+ Downloading matplotlib (8.3MiB)
4077
  Downloading nvidia-cublas-cu12 (566.8MiB)
4078
+ Downloading nvidia-cufile-cu12 (1.1MiB)
4079
+ Downloading nvidia-cudnn-cu12 (674.0MiB)
4080
  Downloading kiwisolver (1.4MiB)
4081
  Downloading nvidia-nccl-cu12 (307.4MiB)
 
 
 
 
 
4082
  Downloading torch (846.8MiB)
 
 
 
4083
  Downloading nvidia-cufile-cu12
4084
  Downloading kiwisolver
4085
  Downloading pygments
 
4092
  Downloading nvidia-cuda-cupti-cu12
4093
  Downloading numpy
4094
  Downloading sympy
 
4095
  Built transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
4096
+ Downloading nvidia-nvjitlink-cu12
4097
  Downloading jedi
4098
  Downloading nvidia-curand-cu12
4099
  Downloading nvidia-cuda-nvrtc-cu12
 
4106
  Downloading nvidia-cublas-cu12
4107
  Downloading nvidia-cudnn-cu12
4108
  Downloading torch
4109
+ Installed 69 packages in 467ms
4110
  </div>
4111
  </div>
4112
  <div class="cell-stderr">Fetching 3 files: 0%| | 0/3 [00:00&lt;?, ?it/s]
4113
+ Fetching 3 files: 33%|███▎ | 1/3 [00:06&lt;00:13, 6.78s/it]
4114
+ Fetching 3 files: 67%|██████▋ | 2/3 [00:08&lt;00:03, 3.65s/it]
4115
+ Fetching 3 files: 100%|██████████| 3/3 [00:08&lt;00:00, 2.75s/it]
4116
  You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
4117
 
4118
  Loading checkpoint shards: 0%| | 0/3 [00:00&lt;?, ?it/s]
4119
  Loading checkpoint shards: 33%|███▎ | 1/3 [00:02&lt;00:04, 2.34s/it]
4120
+ Loading checkpoint shards: 67%|██████▋ | 2/3 [00:04&lt;00:02, 2.25s/it]
4121
+ Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.80s/it]
4122
+ Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.93s/it]
4123
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
4124
 
4125
  Fetching 6 files: 0%| | 0/6 [00:00&lt;?, ?it/s]
4126
+ Fetching 6 files: 17%|█▋ | 1/6 [00:00&lt;00:01, 3.89it/s]
4127
+ Fetching 6 files: 100%|██████████| 6/6 [00:00&lt;00:00, 17.67it/s]
4128
+ /tmp/uvnote-run-hvgovjfd/home/.cache/uv/environments-v2/setup-443c07e337d3be43/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
 
4129
  No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
4130
  warnings.warn(
4131
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
 
4152
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
4153
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
4154
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
4155
+ /tmp/uvnote-run-hvgovjfd/home/.cache/uv/environments-v2/setup-443c07e337d3be43/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
4156
  No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
4157
  warnings.warn(
4158
  INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
 
4189
  <span onclick="toggleOutput('setup2')" style="cursor: pointer;">▼ output</span>
4190
  <span id="uv-indicator-setup2" onclick="toggleUvLogsFromHeader('setup2')" style="cursor: pointer;">▶ uv-logs</span>
4191
  </span> |
4192
+ Cell: setup2 | 139.97s
4193
  | <button class="run-btn" onclick="runCell('setup2')">▶ run</button>
4194
  <button class="copy-btn" onclick="copyCell('setup2')">Copy</button>
4195
  <a href="cells/setup2.py" target="_blank" class="raw-btn">Raw</a>
 
4447
  What is Tensor Parallelism?
4448
 
4449
  &lt;|end|&gt;&lt;|start|&gt;assistant&lt;|channel|&gt;analysis&lt;|message|&gt;We need to explain what Tensor Parallelism is. It&#x27;s a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices. Provide details: how it works, benefits, challenges, typical frameworks, etc. Also mention difference from data parallelism, pipeline parallelism. Provide example: splitting a weight matrix across GPUs, each GPU holds a slice, compute partial results, then gather. Provide mention of communication overhead, scaling, etc. Also mention that it&#x27;s used in large models like GPT-3, Megatron-LM, DeepSpeed, etc. Provide explanation of how it reduces memory usage, increases throughput. Provide mention of &quot;tensor model parallelism&quot; vs &quot;tensor parallelism&quot; synonyms. Provide mention of &quot;tensor parallelism&quot; in context of huggingface accelerate, DeepSpeed, Megatron. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in the &quot;DeepSpeed ZeRO-Offload&quot; or &quot;ZeRO-3&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&quot; and &quot;Megatron-LM&quot; and &quot;DeepSpeed&#x27;s ZeRO&quot; and &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. 
Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the context of &quot;tensor parallelism&quot; in &quot;DeepSpeed&#x27;s ZeRO-3&quot; and &quot;DeepSpeed&#x27;s ZeRO-2&quot; etc. Provide mention of &quot;tensor parallelism&quot; in the
4450
+ Generation took 57.98 seconds
4451
  </div>
4452
  <div class="uv-install-logs" id="uv-logs-setup2">
4453
  <div class="uv-logs-header" onclick="toggleUvLogs(this)">▶ UV Install Logs</div>
 
4456
  Downloading cpython-3.13.7-linux-x86_64-gnu (download)
4457
  Updating https://github.com/huggingface/transformers.git (HEAD)
4458
  Updated https://github.com/huggingface/transformers.git (99b0995138c17ef953959c70f35cb2bdc41111a2)
4459
+ Downloading jedi (1.5MiB)
4460
  Downloading pygments (1.2MiB)
 
 
 
 
 
4461
  Downloading nvidia-cufile-cu12 (1.1MiB)
4462
+ Building transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
4463
  Downloading hf-xet (3.0MiB)
4464
+ Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
4465
+ Downloading numpy (15.9MiB)
 
4466
  Downloading sympy (6.0MiB)
4467
+ Downloading matplotlib (8.3MiB)
4468
+ Downloading nvidia-cudnn-cu12 (674.0MiB)
4469
+ Downloading networkx (1.9MiB)
4470
+ Downloading nvidia-nvjitlink-cu12 (37.4MiB)
4471
+ Downloading pillow (6.3MiB)
4472
+ Downloading nvidia-cublas-cu12 (566.8MiB)
4473
+ Downloading tokenizers (3.1MiB)
4474
+ Downloading nvidia-cusolver-cu12 (255.1MiB)
4475
  Downloading nvidia-cusparse-cu12 (274.9MiB)
4476
+ Downloading nvidia-curand-cu12 (60.7MiB)
4477
  Downloading nvidia-cusparselt-cu12 (273.9MiB)
 
4478
  Downloading nvidia-nccl-cu12 (307.4MiB)
4479
+ Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
4480
+ Downloading fonttools (4.7MiB)
4481
+ Downloading nvidia-cufft-cu12 (184.2MiB)
4482
  Downloading triton (148.4MiB)
 
4483
  Downloading kiwisolver (1.4MiB)
4484
+ Downloading torch (846.8MiB)
4485
  Downloading nvidia-cufile-cu12
4486
  Downloading kiwisolver
4487
  Downloading pygments
 
4494
  Downloading nvidia-cuda-cupti-cu12
4495
  Downloading numpy
4496
  Downloading sympy
 
4497
  Built transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2
4498
+ Downloading nvidia-nvjitlink-cu12
4499
  Downloading jedi
4500
  Downloading nvidia-curand-cu12
4501
  Downloading nvidia-cuda-nvrtc-cu12
 
4508
  Downloading nvidia-cublas-cu12
4509
  Downloading nvidia-cudnn-cu12
4510
  Downloading torch
4511
+ Installed 69 packages in 468ms
4512
  </div>
4513
  </div>
4514
  <div class="cell-stderr">Fetching 3 files: 0%| | 0/3 [00:00&lt;?, ?it/s]
4515
+ Fetching 3 files: 33%|███▎ | 1/3 [00:06&lt;00:12, 6.38s/it]
4516
+ Fetching 3 files: 67%|██████▋ | 2/3 [00:08&lt;00:03, 3.61s/it]
4517
+ Fetching 3 files: 100%|██████████| 3/3 [00:08&lt;00:00, 2.69s/it]
4518
  You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
4519
 
4520
  Loading checkpoint shards: 0%| | 0/3 [00:00&lt;?, ?it/s]
4521
+ Loading checkpoint shards: 33%|███▎ | 1/3 [00:02&lt;00:04, 2.34s/it]
4522
  Loading checkpoint shards: 67%|██████▋ | 2/3 [00:04&lt;00:02, 2.25s/it]
4523
  Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.80s/it]
4524
  Loading checkpoint shards: 100%|██████████| 3/3 [00:05&lt;00:00, 1.93s/it]
4525
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4526
 
4527
  Fetching 66 files: 0%| | 0/66 [00:00&lt;?, ?it/s]
4528
+ Fetching 66 files: 2%|▏ | 1/66 [00:00&lt;00:10, 6.10it/s]
4529
+ Fetching 66 files: 14%|█▎ | 9/66 [00:00&lt;00:01, 30.47it/s]
4530
+ Fetching 66 files: 24%|██▍ | 16/66 [00:00&lt;00:01, 37.56it/s]
4531
+ Fetching 66 files: 30%|███ | 20/66 [00:01&lt;00:03, 14.24it/s]
4532
+ Fetching 66 files: 67%|██████▋ | 44/66 [00:01&lt;00:00, 37.14it/s]
4533
+ Fetching 66 files: 91%|█████████ | 60/66 [00:01&lt;00:00, 49.97it/s]
4534
+ Fetching 66 files: 100%|██████████| 66/66 [00:01&lt;00:00, 36.02it/s]
4535
+ /tmp/uvnote-run-nw4e52ut/home/.cache/uv/environments-v2/setup2-69adf76231e4ab4f/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
4536
  No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
4537
  warnings.warn(
4538
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
 
4559
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4560
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4561
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
4562
+ /tmp/uvnote-run-nw4e52ut/home/.cache/uv/environments-v2/setup2-69adf76231e4ab4f/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning:
4563
  No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
4564
  warnings.warn(
4565
  INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`