Torch LayerNorm Implementation

GPU Info

Cell: nv | 0.22s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Mon Nov 10 22:11:21 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   36C    P0            121W /  350W |       0MiB /  46068MiB |     27%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

LayerNorm Benchmark (PyTorch)
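
The cell below benchmarks torch.nn.functional.layer_norm with normalization over the last dimension only. As a rough sketch of what that call computes, the following writes the same arithmetic out step by step and compares it with the built-in op. The helper name reference_layer_norm and the small tensor shape are illustrative assumptions, not part of the benchmark.

```python
import torch
import torch.nn.functional as F


def reference_layer_norm(x, weight, bias, eps: float = 1e-5):
    """Layer norm over the last dimension, written out step by step."""
    # Statistics are computed per row, i.e. over the last (feature) dimension.
    mean = x.mean(dim=-1, keepdim=True)
    # Layer norm uses the biased (population) variance.
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    x_hat = (x - mean) * torch.rsqrt(var + eps)
    # Per-feature affine transform.
    return x_hat * weight + bias


torch.manual_seed(0)
# Small illustrative shape (assumption); the benchmark workloads are much larger.
x = torch.randn(2, 8, 64)
weight = torch.randn(64)
bias = torch.randn(64)

expected = F.layer_norm(x, (x.shape[-1],), weight, bias, 1e-5)
actual = reference_layer_norm(x, weight, bias)
max_abs_diff = (expected - actual).abs().max().item()
print(f"max abs difference vs F.layer_norm: {max_abs_diff:.2e}")
```

The two results should agree up to floating-point rounding. The fused kernel that PyTorch dispatches to may order its reductions differently, so bitwise equality is not expected.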

Cell: benchmark | 7.73s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark


def torch_layer_norm(x, weight, bias, eps: float = 1e-5):
    # Normalize over the last dimension only, then apply the affine weight and bias.
    return torch.nn.functional.layer_norm(x, (x.shape[-1],), weight, bias, eps)


# Register the implementation with the harness, which runs it on each workload.
run_benchmark(
    kernel_type=KernelTypeEnum.LAYER_NORM,
    impl_name="torch_layer_norm",
    impl_tags={"family": "torch", "op": "layer_norm"},
    impl_func=torch_layer_norm,
)
Running layer_norm benchmark on cuda with 4 workloads.

======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.69%     156.741us        50.70%       2.155ms       2.155ms       0.000us         0.00%       3.028ms       3.028ms             1  
                                       aten::layer_norm         0.35%      14.940us        47.01%       1.998ms     666.050us       0.000us         0.00%       3.028ms       1.009ms             3  
                                aten::native_layer_norm         1.75%      74.522us        46.66%       1.983ms     661.070us       2.321ms       100.00%       3.028ms       1.009ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       2.322ms       100.06%       2.322ms       2.322ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       2.321ms       100.00%       2.321ms     773.663us             3  
                                Activity Buffer Request        42.51%       1.807ms        42.51%       1.807ms       1.807ms     707.360us        30.48%     707.360us     707.360us             1  
                                            aten::empty         1.11%      47.041us         1.11%      47.041us       5.227us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         1.12%      47.761us         1.12%      47.761us      15.920us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.17%       7.200us         0.17%       7.200us       1.200us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        49.30%       2.095ms        49.30%       2.095ms       2.095ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.250ms
Self CUDA time total: 2.321ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.08%      72.370us        29.56%       1.986ms       1.986ms       0.000us         0.00%       6.439ms       6.439ms             1  
                                       aten::layer_norm         0.14%       9.121us        28.49%       1.914ms     637.916us       0.000us         0.00%       6.439ms       2.146ms             3  
                                aten::native_layer_norm         0.74%      49.777us        28.35%       1.905ms     634.876us       4.867ms       100.00%       6.439ms       2.146ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.868ms       100.03%       4.868ms       4.868ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.867ms       100.00%       4.867ms       1.622ms             3  
                                Activity Buffer Request        26.73%       1.796ms        26.73%       1.796ms       1.796ms       1.572ms        32.30%       1.572ms       1.572ms             1  
                                            aten::empty         0.42%      28.501us         0.42%      28.501us       3.167us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.40%      26.970us         0.40%      26.970us       8.990us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.06%       3.863us         0.06%       3.863us       0.644us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        70.44%       4.732ms        70.44%       4.732ms       4.732ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.718ms
Self CUDA time total: 4.867ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.07%      70.921us        30.56%       2.021ms       2.021ms       0.000us         0.00%       6.238ms       6.238ms             1  
                                       aten::layer_norm         0.13%       8.430us        29.49%       1.951ms     650.186us       0.000us         0.00%       6.238ms       2.079ms             3  
                                aten::native_layer_norm         0.76%      50.331us        29.36%       1.942ms     647.376us       4.725ms       100.00%       6.238ms       2.079ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.726ms       100.03%       4.726ms       4.726ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.725ms       100.00%       4.725ms       1.575ms             3  
                                Activity Buffer Request        27.69%       1.832ms        27.69%       1.832ms       1.832ms       1.513ms        32.02%       1.513ms       1.513ms             1  
                                            aten::empty         0.42%      27.940us         0.42%      27.940us       3.104us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.42%      27.891us         0.42%      27.891us       9.297us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.06%       4.260us         0.06%       4.260us       0.710us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        69.44%       4.592ms        69.44%       4.592ms       4.592ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.614ms
Self CUDA time total: 4.725ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.62%      70.560us        14.96%       1.705ms       1.705ms       0.000us         0.00%      13.056ms      13.056ms             1  
                                       aten::layer_norm         0.08%       8.830us        14.34%       1.634ms     544.695us       0.000us         0.00%      13.056ms       4.352ms             3  
                                aten::native_layer_norm         0.44%      49.828us        14.26%       1.625ms     541.752us       9.820ms       100.00%      13.056ms       4.352ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       9.821ms       100.01%       9.821ms       9.821ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       9.820ms       100.00%       9.820ms       3.273ms             3  
                                Activity Buffer Request        11.47%       1.307ms        11.47%       1.307ms       1.307ms       3.236ms        32.96%       3.236ms       3.236ms             1  
                                            aten::empty         0.24%      27.683us         0.24%      27.683us       3.076us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         2.07%     236.314us         2.07%     236.314us      78.771us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.03%       3.970us         0.03%       3.970us       0.662us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        85.04%       9.690ms        85.04%       9.690ms       9.690ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 11.395ms
Self CUDA time total: 9.820ms
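
The internals of kernels_benchmark_tools are not shown here, so how it collects the traces above is not known. As a hedged sketch, a table with the same columns can be produced with torch.profiler by wrapping three calls in a record_function label. The tensor shape, the iteration count of 3 (chosen to match the "# of Calls" column), and the CPU fallback are assumptions made for illustration.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function


def torch_layer_norm(x, weight, bias, eps: float = 1e-5):
    return torch.nn.functional.layer_norm(x, (x.shape[-1],), weight, bias, eps)


# Fall back to CPU so the sketch also runs on machines without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

# Deliberately small shape (assumption); the workloads above are far larger.
x = torch.randn(4, 128, 256, device=device)
weight = torch.ones(256, device=device)
bias = torch.zeros(256, device=device)

with profile(activities=activities) as prof:
    # The label groups all calls under a single "torch_layer_norm" row.
    with record_function("torch_layer_norm"):
        for _ in range(3):
            torch_layer_norm(x, weight, bias)
    if device == "cuda":
        # Wait for the queued kernels so their GPU time is captured.
        torch.cuda.synchronize()

events = prof.key_averages()
table = events.table(sort_by="self_cpu_time_total", row_limit=10)
print(table)
```

In the traces above, cudaDeviceSynchronize accounts for most of the CPU time because the host is waiting for the kernels to finish. The "Self CUDA" column of the vectorized layer norm kernel is the figure that reflects the work on the GPU.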


impl                     wl                  p50(ms)  ok
torch_layer_norm         LN_B16_S2048_D4096     0.83  True
torch_layer_norm         LN_B16_S2048_D8192     1.68  True
torch_layer_norm         LN_B16_S4096_D4096     1.61  True
torch_layer_norm         LN_B16_S4096_D8192     3.33  True
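
The p50 values roughly double whenever the element count doubles, which is what one would expect from a memory-bound kernel. A minimal sketch of that arithmetic follows. It assumes that B, S and D in the workload names stand for batch, sequence length and hidden dimension, that each element takes 2 bytes (the dtype is not stated in the output), and that the kernel reads the input once and writes the output once.

```python
# p50 latencies in milliseconds, copied from the summary table above.
p50_ms = {
    "LN_B16_S2048_D4096": 0.83,
    "LN_B16_S2048_D8192": 1.68,
    "LN_B16_S4096_D4096": 1.61,
    "LN_B16_S4096_D8192": 3.33,
}

# ASSUMPTION: 2 bytes per element (e.g. bfloat16 or float16); not stated in the output.
BYTES_PER_ELEMENT = 2


def parse_workload(name: str) -> tuple[int, int, int]:
    """Parse 'LN_B16_S2048_D4096' into (batch, seq, dim); the meaning of B/S/D is assumed."""
    _, b, s, d = name.split("_")
    return int(b[1:]), int(s[1:]), int(d[1:])


results = {}
for name, ms in p50_ms.items():
    batch, seq, dim = parse_workload(name)
    elements = batch * seq * dim
    # One read of the input plus one write of the output; weight and bias are negligible.
    bytes_moved = 2 * elements * BYTES_PER_ELEMENT
    gb_per_s = bytes_moved / (ms * 1e-3) / 1e9
    ns_per_element = ms * 1e6 / elements
    results[name] = (elements, gb_per_s, ns_per_element)
    print(f"{name}: {elements:>13,d} elements, {gb_per_s:6.1f} GB/s, {ns_per_element:.5f} ns/element")
```

Under these assumptions all four workloads land at about 640 to 670 GB/s of effective traffic. With 4-byte elements the estimate would double, so the numbers are only meaningful once the actual dtype is confirmed.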

Artifacts:

layer_norm.jsonl