on_github: huggingface/kernels-uvnotes

Torch LayerNorm Implementation

GPU Info

Cell: nv | 0.22s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Mon Oct 27 14:46:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   31C    P0             79W /  350W |       0MiB /  46068MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

LayerNorm Benchmark (PyTorch)
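
The cell in this section times `torch.nn.functional.layer_norm` applied over the last dimension. As a reminder of what that call computes for each row, here is a minimal pure-Python sketch. The helper name `layer_norm_row` and the sample values are hypothetical and not part of the benchmark; the sketch uses the biased variance (divide by n) and adds `eps` inside the square root.

```python
import math


def layer_norm_row(row, weight, bias, eps=1e-5):
    """Normalize one vector over its elements, then apply scale and shift."""
    n = len(row)
    mean = sum(row) / n
    # Biased (population) variance: divide by n, not n - 1.
    var = sum((v - mean) ** 2 for v in row) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w + b for v, w, b in zip(row, weight, bias)]


# Illustrative input only; the benchmark generates its own tensors.
row = [1.0, 2.0, 3.0, 4.0]
out = layer_norm_row(row, weight=[1.0] * 4, bias=[0.0] * 4)
print([round(v, 4) for v in out])
```

With unit weight and zero bias the output has zero mean and close to unit variance, which is a quick sanity check for any alternative implementation.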

Cell: benchmark | 7.77s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark


def torch_layer_norm(x, weight, bias, eps: float = 1e-5):
    # Normalize over the last dimension only; weight and bias must match its size.
    return torch.nn.functional.layer_norm(x, (x.shape[-1],), weight, bias, eps)


run_benchmark(
    kernel_type=KernelTypeEnum.LAYER_NORM,
    impl_name="torch_layer_norm",
    impl_tags={"family": "torch", "op": "layer_norm"},
    impl_func=torch_layer_norm,
)
Running layer_norm benchmark on cuda with 48 workloads.

======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     117.951us      1284.31%     117.951us     117.951us             1  
                                       torch_layer_norm         8.74%     158.633us        99.57%       1.807ms       1.807ms       0.000us         0.00%      12.352us      12.352us             1  
                                       aten::layer_norm         0.95%      17.160us        90.83%       1.649ms     549.530us       0.000us         0.00%      12.352us       4.117us             3  
                                aten::native_layer_norm         4.49%      81.559us        89.88%       1.631ms     543.810us       9.184us       100.00%      12.352us       4.117us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       9.184us       100.00%       9.184us       3.061us             3  
                                Activity Buffer Request        79.88%       1.450ms        79.88%       1.450ms       1.450ms       3.168us        34.49%       3.168us       3.168us             1  
                                            aten::empty         2.58%      46.801us         2.58%      46.801us       5.200us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         2.54%      46.162us         2.54%      46.162us      15.387us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.39%       7.072us         0.39%       7.072us       1.179us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.43%       7.860us         0.43%       7.860us       7.860us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.815ms
Self CUDA time total: 9.184us
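
The first `torch_layer_norm` row in the trace above shows a Self CUDA % far above 100. One reading, inferred from the numbers rather than from the profiler's documentation, is that every percentage is taken against "Self CUDA time total" (9.184us of kernel time), while that row reports a 117.951us span around the whole call. The sketch below recomputes the figures from values copied out of the table; the variable names are illustrative.

```python
# Values copied from the LN_B1_S128_D1024 trace (microseconds).
self_cuda_total_us = 9.184      # "Self CUDA time total"
annotation_span_us = 117.951    # first torch_layer_norm row, Self CUDA
buffer_request_us = 3.168       # "Activity Buffer Request", Self CUDA
kernel_calls = 3                # "# of Calls" for the vectorized kernel

# Percentages relative to the kernel-only total.
annotation_pct = 100.0 * annotation_span_us / self_cuda_total_us
buffer_pct = 100.0 * buffer_request_us / self_cuda_total_us

# Average kernel time per call, and the CUDA total reported for aten::native_layer_norm.
kernel_avg_us = self_cuda_total_us / kernel_calls
native_cuda_total_us = self_cuda_total_us + buffer_request_us

print(f"annotation row: {annotation_pct:.2f}%")
print(f"buffer request: {buffer_pct:.2f}%")
print(f"kernel average: {kernel_avg_us:.3f} us per call")
print(f"native_layer_norm CUDA total: {native_cuda_total_us:.3f} us")
```

The recomputed values line up with the table entries (1284.31%, 34.49%, 3.061us, 12.352us), so the kernel-only figure of about 3us per call is the one to compare across workloads.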



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      91.263us       777.10%      91.263us      91.263us             1  
                                       torch_layer_norm         4.45%      73.631us        99.68%       1.650ms       1.650ms       0.000us         0.00%      15.616us      15.616us             1  
                                       aten::layer_norm         0.53%       8.730us        95.23%       1.577ms     525.519us       0.000us         0.00%      15.616us       5.205us             3  
                                aten::native_layer_norm         3.21%      53.200us        94.70%       1.568ms     522.609us      11.744us       100.00%      15.616us       5.205us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      11.744us       100.00%      11.744us       3.915us             3  
                                Activity Buffer Request        87.81%       1.454ms        87.81%       1.454ms       1.454ms       3.872us        32.97%       3.872us       3.872us             1  
                                            aten::empty         1.80%      29.853us         1.80%      29.853us       3.317us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         1.64%      27.230us         1.64%      27.230us       9.077us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.23%       3.770us         0.23%       3.770us       0.628us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.32%       5.350us         0.32%       5.350us       5.350us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.656ms
Self CUDA time total: 11.744us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S128_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      93.407us       570.11%      93.407us      93.407us             1  
                                       torch_layer_norm         4.26%      70.071us        99.67%       1.640ms       1.640ms       0.000us         0.00%      21.856us      21.856us             1  
                                       aten::layer_norm         0.57%       9.440us        95.41%       1.570ms     523.176us       0.000us         0.00%      21.856us       7.285us             3  
                                aten::native_layer_norm         3.17%      52.082us        94.83%       1.560ms     520.029us      16.384us       100.00%      21.856us       7.285us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      16.384us       100.00%      16.384us       5.461us             3  
                                Activity Buffer Request        87.95%       1.447ms        87.95%       1.447ms       1.447ms       5.472us        33.40%       5.472us       5.472us             1  
                                            aten::empty         1.77%      29.121us         1.77%      29.121us       3.236us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         1.71%      28.080us         1.71%      28.080us       9.360us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.24%       4.030us         0.24%       4.030us       0.672us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.33%       5.460us         0.33%       5.460us       5.460us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.645ms
Self CUDA time total: 16.384us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S128_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     118.239us       440.39%     118.239us     118.239us             1  
                                       torch_layer_norm         5.44%      79.142us        99.61%       1.449ms       1.449ms       0.000us         0.00%      35.810us      35.810us             1  
                                       aten::layer_norm         0.75%      10.900us        94.17%       1.370ms     456.578us       0.000us         0.00%      35.810us      11.937us             3  
                                aten::native_layer_norm         4.07%      59.211us        93.42%       1.359ms     452.944us      26.849us       100.00%      35.810us      11.937us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      26.849us       100.00%      26.849us       8.950us             3  
                                Activity Buffer Request        72.70%       1.057ms        72.70%       1.057ms       1.057ms       8.961us        33.38%       8.961us       8.961us             1  
                                            aten::empty         2.44%      35.559us         2.44%      35.559us       3.951us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        13.86%     201.604us        13.86%     201.604us      67.201us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.34%       4.961us         0.34%       4.961us       0.827us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.39%       5.680us         0.39%       5.680us       5.680us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.455ms
Self CUDA time total: 26.849us
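
The four B1_S128 traces above differ only in hidden size, so they give a rough view of how kernel time scales with input size. The sketch below compares the growth in element count with the growth in average kernel time, using the "CUDA time avg" values of the vectorized kernel row. It assumes the workload name encodes batch, sequence length, and hidden size as `B`, `S`, and `D`; that reading of the names is an inference, not something the output states.

```python
# Hidden size D -> average kernel time in microseconds, copied from the traces.
kernel_avg_us = {1024: 3.061, 2048: 3.915, 4096: 5.461, 8192: 8.950}
seq_len = 128  # assumed from the "S128" part of the workload name
base_d = 1024

ratios = {}
for d, t in kernel_avg_us.items():
    elements = seq_len * d
    size_ratio = d / base_d
    time_ratio = t / kernel_avg_us[base_d]
    ratios[d] = (size_ratio, time_ratio)
    print(f"D={d:5d} elements={elements:8d} size x{size_ratio:.0f} time x{time_ratio:.2f}")
```

Going from D=1024 to D=8192 multiplies the element count by 8 but the average kernel time by only about 2.9, which suggests that at these sizes a sizeable share of each call is fixed cost rather than data movement.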



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      95.007us       954.65%      95.007us      95.007us             1  
                                       torch_layer_norm         4.08%      72.861us        99.69%       1.782ms       1.782ms       0.000us         0.00%      13.216us      13.216us             1  
                                       aten::layer_norm         0.50%       9.010us        95.61%       1.709ms     569.593us       0.000us         0.00%      13.216us       4.405us             3  
                                aten::native_layer_norm         3.10%      55.433us        95.11%       1.700ms     566.590us       9.952us       100.00%      13.216us       4.405us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       9.952us       100.00%       9.952us       3.317us             3  
                                Activity Buffer Request        81.03%       1.448ms        81.03%       1.448ms       1.448ms       3.264us        32.80%       3.264us       3.264us             1  
                                            aten::empty         1.69%      30.250us         1.69%      30.250us       3.361us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         9.05%     161.792us         9.05%     161.792us      53.931us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.23%       4.100us         0.23%       4.100us       0.683us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.31%       5.520us         0.31%       5.520us       5.520us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.787ms
Self CUDA time total: 9.952us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      88.574us       668.68%      88.574us      88.574us             1  
                                       torch_layer_norm        15.40%      66.901us        98.88%     429.607us     429.607us       0.000us         0.00%      17.629us      17.629us             1  
                                       aten::layer_norm         2.14%       9.290us        83.48%     362.706us     120.902us       0.000us         0.00%      17.629us       5.876us             3  
                                aten::native_layer_norm        12.03%      52.280us        81.34%     353.416us     117.805us      13.246us       100.00%      17.629us       5.876us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      13.246us       100.00%      13.246us       4.415us             3  
                                Activity Buffer Request        26.09%     113.362us        26.09%     113.362us     113.362us       4.383us        33.09%       4.383us       4.383us             1  
                                            aten::empty         6.80%      29.541us         6.80%      29.541us       3.282us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        35.53%     154.353us        35.53%     154.353us      51.451us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.89%       3.880us         0.89%       3.880us       0.647us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         1.12%       4.880us         1.12%       4.880us       4.880us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 434.487us
Self CUDA time total: 13.246us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S512_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      96.609us       488.49%      96.609us      96.609us             1  
                                       torch_layer_norm         4.03%      71.860us        99.72%       1.776ms       1.776ms       0.000us         0.00%      26.305us      26.305us             1  
                                       aten::layer_norm         0.54%       9.591us        95.68%       1.704ms     568.087us       0.000us         0.00%      26.305us       8.768us             3  
                                aten::native_layer_norm         2.97%      52.832us        95.14%       1.695ms     564.890us      19.777us       100.00%      26.305us       8.768us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      19.777us       100.00%      19.777us       6.592us             3  
                                Activity Buffer Request        81.50%       1.452ms        81.50%       1.452ms       1.452ms       6.528us        33.01%       6.528us       6.528us             1  
                                            aten::empty         1.62%      28.940us         1.62%      28.940us       3.216us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.82%     157.073us         8.82%     157.073us      52.358us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.23%       4.100us         0.23%       4.100us       0.683us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.28%       5.050us         0.28%       5.050us       5.050us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.781ms
Self CUDA time total: 19.777us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S512_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     101.087us       312.17%     101.087us     101.087us             1  
                                       torch_layer_norm         4.21%      75.141us        99.72%       1.779ms       1.779ms       0.000us         0.00%      43.134us      43.134us             1  
                                       aten::layer_norm         0.50%       9.000us        95.50%       1.703ms     567.803us       0.000us         0.00%      43.134us      14.378us             3  
                                aten::native_layer_norm         3.03%      54.032us        95.00%       1.694ms     564.803us      32.382us       100.00%      43.134us      14.378us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      32.382us       100.00%      32.382us      10.794us             3  
                                Activity Buffer Request        81.39%       1.452ms        81.39%       1.452ms       1.452ms      10.752us        33.20%      10.752us      10.752us             1  
                                            aten::empty         1.73%      30.799us         1.73%      30.799us       3.422us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.63%     153.894us         8.63%     153.894us      51.298us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.22%       3.990us         0.22%       3.990us       0.665us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.28%       5.050us         0.28%       5.050us       5.050us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.784ms
Self CUDA time total: 32.382us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S1024_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      84.605us       738.59%      84.605us      84.605us             1  
                                       torch_layer_norm        14.65%      66.062us        98.90%     446.008us     446.008us       0.000us         0.00%      15.231us      15.231us             1  
                                       aten::layer_norm         1.88%       8.459us        84.25%     379.946us     126.649us       0.000us         0.00%      15.231us       5.077us             3  
                                aten::native_layer_norm        11.07%      49.901us        82.38%     371.487us     123.829us      11.455us       100.00%      15.231us       5.077us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      11.455us       100.00%      11.455us       3.818us             3  
                                Activity Buffer Request        30.37%     136.933us        30.37%     136.933us     136.933us       3.776us        32.96%       3.776us       3.776us             1  
                                            aten::empty         6.35%      28.620us         6.35%      28.620us       3.180us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        33.76%     152.233us        33.76%     152.233us      50.744us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.84%       3.800us         0.84%       3.800us       0.633us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         1.10%       4.941us         1.10%       4.941us       4.941us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 450.949us
Self CUDA time total: 11.455us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S1024_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      95.615us       580.22%      95.615us      95.615us             1  
                                       torch_layer_norm         3.86%      68.250us        99.72%       1.762ms       1.762ms       0.000us         0.00%      21.951us      21.951us             1  
                                       aten::layer_norm         0.50%       8.771us        95.86%       1.694ms     564.703us       0.000us         0.00%      21.951us       7.317us             3  
                                aten::native_layer_norm         3.18%      56.263us        95.36%       1.685ms     561.780us      16.479us       100.00%      21.951us       7.317us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      16.479us       100.00%      16.479us       5.493us             3  
                                Activity Buffer Request        81.70%       1.444ms        81.70%       1.444ms       1.444ms       5.472us        33.21%       5.472us       5.472us             1  
                                            aten::empty         1.62%      28.639us         1.62%      28.639us       3.182us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.61%     152.252us         8.61%     152.252us      50.751us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.24%       4.230us         0.24%       4.230us       0.705us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.28%       4.980us         0.28%       4.980us       4.980us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.767ms
Self CUDA time total: 16.479us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S1024_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      88.894us       345.94%      88.894us      88.894us             1  
                                       torch_layer_norm        15.31%      64.511us        98.72%     416.027us     416.027us       0.000us         0.00%      34.240us      34.240us             1  
                                       aten::layer_norm         2.02%       8.530us        83.41%     351.516us     117.172us       0.000us         0.00%      34.240us      11.413us             3  
                                aten::native_layer_norm        12.31%      51.881us        81.39%     342.986us     114.329us      25.696us       100.00%      34.240us      11.413us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      25.696us       100.00%      25.696us       8.565us             3  
                                Activity Buffer Request        25.35%     106.822us        25.35%     106.822us     106.822us       8.544us        33.25%       8.544us       8.544us             1  
                                            aten::empty         6.69%      28.191us         6.69%      28.191us       3.132us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        36.17%     152.423us        36.17%     152.423us      50.808us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.87%       3.669us         0.87%       3.669us       0.612us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         1.28%       5.400us         1.28%       5.400us       5.400us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 421.427us
Self CUDA time total: 25.696us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S1024_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.99%      70.451us        99.68%       1.760ms       1.760ms       0.000us         0.00%     110.273us     110.273us             1  
                                       aten::layer_norm         0.54%       9.469us        95.69%       1.690ms     563.186us       0.000us         0.00%     110.273us      36.758us             3  
                                aten::native_layer_norm         2.91%      51.321us        95.15%       1.680ms     560.030us      70.464us       100.00%     110.273us      36.758us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     104.384us       148.14%     104.384us     104.384us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      70.464us       100.00%      70.464us      23.488us             3  
                                Activity Buffer Request        81.54%       1.440ms        81.54%       1.440ms       1.440ms      39.809us        56.50%      39.809us      39.809us             1  
                                            aten::empty         1.69%      29.812us         1.69%      29.812us       3.312us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.79%     155.141us         8.79%     155.141us      51.714us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.23%       4.141us         0.23%       4.141us       0.690us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.32%       5.631us         0.32%       5.631us       5.631us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.766ms
Self CUDA time total: 70.464us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S2048_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      94.879us       526.67%      94.879us      94.879us             1  
                                       torch_layer_norm         3.90%      69.211us        99.68%       1.768ms       1.768ms       0.000us         0.00%      23.935us      23.935us             1  
                                       aten::layer_norm         0.53%       9.340us        95.78%       1.699ms     566.293us       0.000us         0.00%      23.935us       7.978us             3  
                                aten::native_layer_norm         2.96%      52.430us        95.26%       1.690ms     563.180us      18.015us       100.00%      23.935us       7.978us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      18.015us       100.00%      18.015us       6.005us             3  
                                Activity Buffer Request        81.67%       1.449ms        81.67%       1.449ms       1.449ms       5.920us        32.86%       5.920us       5.920us             1  
                                            aten::empty         1.69%      29.991us         1.69%      29.991us       3.332us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.72%     154.594us         8.72%     154.594us      51.531us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.22%       3.890us         0.22%       3.890us       0.648us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.32%       5.590us         0.32%       5.590us       5.590us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.774ms
Self CUDA time total: 18.015us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S2048_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      92.671us       343.53%      92.671us      92.671us             1  
                                       torch_layer_norm        14.22%      66.652us        98.98%     463.858us     463.858us       0.000us         0.00%      35.872us      35.872us             1  
                                       aten::layer_norm         1.92%       9.009us        84.76%     397.206us     132.402us       0.000us         0.00%      35.872us      11.957us             3  
                                aten::native_layer_norm        11.29%      52.919us        82.83%     388.197us     129.399us      26.976us       100.00%      35.872us      11.957us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      26.976us       100.00%      26.976us       8.992us             3  
                                Activity Buffer Request        32.20%     150.883us        32.20%     150.883us     150.883us       8.896us        32.98%       8.896us       8.896us             1  
                                            aten::empty         6.01%      28.182us         6.01%      28.182us       3.131us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        32.49%     152.273us        32.49%     152.273us      50.758us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.84%       3.940us         0.84%       3.940us       0.657us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         1.02%       4.791us         1.02%       4.791us       4.791us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 468.649us
Self CUDA time total: 26.976us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S2048_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     133.341us       184.87%     133.341us     133.341us             1  
                                       torch_layer_norm         3.93%      69.900us        99.72%       1.772ms       1.772ms       0.000us         0.00%     112.892us     112.892us             1  
                                       aten::layer_norm         0.55%       9.790us        95.79%       1.702ms     567.350us       0.000us         0.00%     112.892us      37.631us             3  
                                aten::native_layer_norm         3.28%      58.200us        95.24%       1.692ms     564.087us      72.125us       100.00%     112.892us      37.631us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      72.125us       100.00%      72.125us      24.042us             3  
                                Activity Buffer Request        80.05%       1.422ms        80.05%       1.422ms       1.422ms      40.767us        56.52%      40.767us      40.767us             1  
                                            aten::empty         1.64%      29.113us         1.64%      29.113us       3.235us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        10.01%     177.823us        10.01%     177.823us      59.274us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.27%       4.770us         0.27%       4.770us       0.795us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.28%       4.900us         0.28%       4.900us       4.900us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.777ms
Self CUDA time total: 72.125us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B1_S2048_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm        14.68%      65.741us        95.47%     427.658us     427.658us       0.000us         0.00%     230.621us     230.621us             1  
                                       aten::layer_norm         2.04%       9.121us        80.79%     361.917us     120.639us       0.000us         0.00%     230.621us      76.874us             3  
                                aten::native_layer_norm        11.17%      50.059us        78.75%     352.796us     117.599us     144.510us       100.00%     230.621us      76.874us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     146.014us       101.04%     146.014us     146.014us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     144.510us       100.00%     144.510us      48.170us             3  
                                Activity Buffer Request        26.04%     116.642us        26.04%     116.642us     116.642us      86.111us        59.59%      86.111us      86.111us             1  
                                            aten::empty         6.43%      28.811us         6.43%      28.811us       3.201us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        34.20%     153.184us        34.20%     153.184us      51.061us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.92%       4.100us         0.92%       4.100us       0.683us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         4.53%      20.311us         4.53%      20.311us      20.311us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 447.969us
Self CUDA time total: 144.510us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      92.096us       943.61%      92.096us      92.096us             1  
                                       torch_layer_norm         3.85%      68.512us        99.73%       1.773ms       1.773ms       0.000us         0.00%      12.864us      12.864us             1  
                                       aten::layer_norm         0.55%       9.759us        95.87%       1.705ms     568.216us       0.000us         0.00%      12.864us       4.288us             3  
                                aten::native_layer_norm         3.00%      53.309us        95.32%       1.695ms     564.963us       9.760us       100.00%      12.864us       4.288us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       9.760us       100.00%       9.760us       3.253us             3  
                                Activity Buffer Request        81.26%       1.445ms        81.26%       1.445ms       1.445ms       3.104us        31.80%       3.104us       3.104us             1  
                                            aten::empty         1.70%      30.172us         1.70%      30.172us       3.352us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         9.14%     162.452us         9.14%     162.452us      54.151us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.24%       4.201us         0.24%       4.201us       0.700us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.27%       4.880us         0.27%       4.880us       4.880us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.778ms
Self CUDA time total: 9.760us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      91.521us       709.63%      91.521us      91.521us             1  
                                       torch_layer_norm         4.32%      76.641us        99.71%       1.771ms       1.771ms       0.000us         0.00%      17.186us      17.186us             1  
                                       aten::layer_norm         0.52%       9.251us        95.40%       1.694ms     564.620us       0.000us         0.00%      17.186us       5.729us             3  
                                aten::native_layer_norm         2.94%      52.208us        94.87%       1.685ms     561.536us      12.897us       100.00%      17.186us       5.729us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      12.897us       100.00%      12.897us       4.299us             3  
                                Activity Buffer Request        81.35%       1.444ms        81.35%       1.444ms       1.444ms       4.289us        33.26%       4.289us       4.289us             1  
                                            aten::empty         1.65%      29.223us         1.65%      29.223us       3.247us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.72%     154.793us         8.72%     154.793us      51.598us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.22%       3.890us         0.22%       3.890us       0.648us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.29%       5.110us         0.29%       5.110us       5.110us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.776ms
Self CUDA time total: 12.897us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S128_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      88.130us       448.50%      88.130us      88.130us             1  
                                       torch_layer_norm        11.06%      64.130us        99.16%     575.190us     575.190us       0.000us         0.00%      26.147us      26.147us             1  
                                       aten::layer_norm         1.59%       9.222us        88.10%     511.060us     170.353us       0.000us         0.00%      26.147us       8.716us             3  
                                aten::native_layer_norm         8.61%      49.940us        86.51%     501.838us     167.279us      19.650us       100.00%      26.147us       8.716us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      19.650us       100.00%      19.650us       6.550us             3  
                                Activity Buffer Request        45.46%     263.724us        45.46%     263.724us     263.724us       6.497us        33.06%       6.497us       6.497us             1  
                                            aten::empty         4.97%      28.852us         4.97%      28.852us       3.206us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        26.69%     154.833us        26.69%     154.833us      51.611us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.77%       4.489us         0.77%       4.489us       0.748us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.84%       4.880us         0.84%       4.880us       4.880us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 580.070us
Self CUDA time total: 19.650us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S128_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      92.576us       290.74%      92.576us      92.576us             1  
                                       torch_layer_norm        10.78%      63.911us        99.14%     587.520us     587.520us       0.000us         0.00%      42.562us      42.562us             1  
                                       aten::layer_norm         1.44%       8.510us        88.35%     523.609us     174.536us       0.000us         0.00%      42.562us      14.187us             3  
                                aten::native_layer_norm         8.62%      51.095us        86.92%     515.099us     171.700us      31.841us       100.00%      42.562us      14.187us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      31.841us       100.00%      31.841us      10.614us             3  
                                Activity Buffer Request        46.87%     277.744us        46.87%     277.744us     277.744us      10.721us        33.67%      10.721us      10.721us             1  
                                            aten::empty         4.75%      28.169us         4.75%      28.169us       3.130us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        25.92%     153.632us        25.92%     153.632us      51.211us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.75%       4.459us         0.75%       4.459us       0.743us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.86%       5.110us         0.86%       5.110us       5.110us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 592.630us
Self CUDA time total: 31.841us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      95.776us       539.28%      95.776us      95.776us             1  
                                       torch_layer_norm        13.84%     112.583us        99.26%     807.595us     807.595us       0.000us         0.00%      23.680us      23.680us             1  
                                       aten::layer_norm         1.40%      11.400us        85.42%     695.012us     231.671us       0.000us         0.00%      23.680us       7.893us             3  
                                aten::native_layer_norm         7.57%      61.601us        84.02%     683.612us     227.871us      17.760us       100.00%      23.680us       7.893us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      17.760us       100.00%      17.760us       5.920us             3  
                                Activity Buffer Request        33.76%     274.664us        33.76%     274.664us     274.664us       5.920us        33.33%       5.920us       5.920us             1  
                                            aten::empty         3.69%      30.062us         3.69%      30.062us       3.340us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        38.34%     311.955us        38.34%     311.955us     103.985us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.66%       5.330us         0.66%       5.330us       0.888us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.74%       6.030us         0.74%       6.030us       6.030us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 813.625us
Self CUDA time total: 17.760us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      96.383us       353.93%      96.383us      96.383us             1  
                                       torch_layer_norm         4.14%      80.990us        99.72%       1.949ms       1.949ms       0.000us         0.00%      36.288us      36.288us             1  
                                       aten::layer_norm         0.49%       9.631us        95.58%       1.868ms     622.648us       0.000us         0.00%      36.288us      12.096us             3  
                                aten::native_layer_norm         2.77%      54.113us        95.09%       1.858ms     619.438us      27.232us       100.00%      36.288us      12.096us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      27.232us       100.00%      27.232us       9.077us             3  
                                Activity Buffer Request        75.84%       1.482ms        75.84%       1.482ms       1.482ms       9.056us        33.25%       9.056us       9.056us             1  
                                            aten::empty         1.50%      29.320us         1.50%      29.320us       3.258us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        14.76%     288.535us        14.76%     288.535us      96.178us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.22%       4.249us         0.22%       4.249us       0.708us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.28%       5.411us         0.28%       5.411us       5.411us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.954ms
Self CUDA time total: 27.232us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S512_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.80%      69.480us        99.73%       1.822ms       1.822ms       0.000us         0.00%     112.641us     112.641us             1  
                                       aten::layer_norm         0.50%       9.151us        95.93%       1.752ms     584.111us       0.000us         0.00%     112.641us      37.547us             3  
                                aten::native_layer_norm         2.81%      51.420us        95.43%       1.743ms     581.060us      72.033us       100.00%     112.641us      37.547us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     101.696us       141.18%     101.696us     101.696us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      72.033us       100.00%      72.033us      24.011us             3  
                                Activity Buffer Request        80.53%       1.471ms        80.53%       1.471ms       1.471ms      40.608us        56.37%      40.608us      40.608us             1  
                                            aten::empty         1.60%      29.163us         1.60%      29.163us       3.240us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        10.27%     187.683us        10.27%     187.683us      62.561us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.22%       3.950us         0.22%       3.950us       0.658us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.27%       4.880us         0.27%       4.880us       4.880us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.827ms
Self CUDA time total: 72.033us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S512_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.85%      68.680us        99.71%       1.780ms       1.780ms       0.000us         0.00%     229.955us     229.955us             1  
                                       aten::layer_norm         0.61%      10.850us        95.86%       1.711ms     570.370us       0.000us         0.00%     229.955us      76.652us             3  
                                aten::native_layer_norm         3.11%      55.560us        95.26%       1.700ms     566.754us     144.066us       100.00%     229.955us      76.652us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     145.569us       101.04%     145.569us     145.569us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     144.066us       100.00%     144.066us      48.022us             3  
                                Activity Buffer Request        79.52%       1.419ms        79.52%       1.419ms       1.419ms      85.889us        59.62%      85.889us      85.889us             1  
                                            aten::empty         1.71%      30.551us         1.71%      30.551us       3.395us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        10.67%     190.375us        10.67%     190.375us      63.458us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.24%       4.330us         0.24%       4.330us       0.722us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.29%       5.130us         0.29%       5.130us       5.130us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.785ms
Self CUDA time total: 144.066us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S1024_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     115.904us       398.90%     115.904us     115.904us             1  
                                       torch_layer_norm         4.36%      77.971us        99.69%       1.781ms       1.781ms       0.000us         0.00%      38.656us      38.656us             1  
                                       aten::layer_norm         0.59%      10.570us        95.33%       1.703ms     567.730us       0.000us         0.00%      38.656us      12.885us             3  
                                aten::native_layer_norm         3.31%      59.081us        94.74%       1.693ms     564.207us      29.056us       100.00%      38.656us      12.885us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      29.056us       100.00%      29.056us       9.685us             3  
                                Activity Buffer Request        80.03%       1.430ms        80.03%       1.430ms       1.430ms       9.600us        33.04%       9.600us       9.600us             1  
                                            aten::empty         1.84%      32.962us         1.84%      32.962us       3.662us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         9.29%     165.972us         9.29%     165.972us      55.324us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.27%       4.790us         0.27%       4.790us       0.798us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.31%       5.470us         0.31%       5.470us       5.470us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.787ms
Self CUDA time total: 29.056us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S1024_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm        14.07%      64.760us        98.95%     455.588us     455.588us       0.000us         0.00%     101.120us     101.120us             1  
                                       aten::layer_norm         1.91%       8.791us        84.88%     390.828us     130.276us       0.000us         0.00%     101.120us      33.707us             3  
                                aten::native_layer_norm        11.79%      54.281us        82.97%     382.037us     127.346us      65.344us       100.00%     101.120us      33.707us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      96.510us       147.70%      96.510us      96.510us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      65.344us       100.00%      65.344us      21.781us             3  
                                Activity Buffer Request        29.77%     137.072us        29.77%     137.072us     137.072us      35.776us        54.75%      35.776us      35.776us             1  
                                            aten::empty         6.60%      30.402us         6.60%      30.402us       3.378us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        33.93%     156.232us        33.93%     156.232us      52.077us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.88%       4.050us         0.88%       4.050us       0.675us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         1.05%       4.840us         1.05%       4.840us       4.840us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 460.428us
Self CUDA time total: 65.344us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S1024_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.83%      67.811us        99.72%       1.767ms       1.767ms       0.000us         0.00%     207.840us     207.840us             1  
                                       aten::layer_norm         0.55%       9.819us        95.89%       1.699ms     566.320us       0.000us         0.00%     207.840us      69.280us             3  
                                aten::native_layer_norm         3.03%      53.603us        95.34%       1.689ms     563.047us     129.312us       100.00%     207.840us      69.280us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     130.911us       101.24%     130.911us     130.911us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     129.312us       100.00%     129.312us      43.104us             3  
                                Activity Buffer Request        81.49%       1.444ms        81.49%       1.444ms       1.444ms      78.528us        60.73%      78.528us      78.528us             1  
                                            aten::empty         1.74%      30.830us         1.74%      30.830us       3.426us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.86%     156.973us         8.86%     156.973us      52.324us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.23%       4.020us         0.23%       4.020us       0.670us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.28%       4.980us         0.28%       4.980us       4.980us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.772ms
Self CUDA time total: 129.312us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S1024_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.13%      68.611us        81.17%       1.779ms       1.779ms       0.000us         0.00%     737.526us     737.526us             1  
                                       aten::layer_norm         0.41%       9.061us        78.04%       1.711ms     570.260us       0.000us         0.00%     737.526us     245.842us             3  
                                aten::native_layer_norm         2.43%      53.328us        77.62%       1.702ms     567.240us     547.705us       100.00%     737.526us     245.842us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     549.241us       100.28%     549.241us     549.241us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     547.705us       100.00%     547.705us     182.568us             3  
                                Activity Buffer Request        66.39%       1.455ms        66.39%       1.455ms       1.455ms     189.821us        34.66%     189.821us     189.821us             1  
                                            aten::empty         1.36%      29.741us         1.36%      29.741us       3.305us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         7.27%     159.364us         7.27%     159.364us      53.121us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.18%       3.911us         0.18%       3.911us       0.652us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        18.83%     412.857us        18.83%     412.857us     412.857us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.192ms
Self CUDA time total: 547.705us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S2048_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm        13.81%      64.951us        98.91%     465.198us     465.198us       0.000us         0.00%     102.813us     102.813us             1  
                                       aten::layer_norm         2.00%       9.429us        85.10%     400.247us     133.416us       0.000us         0.00%     102.813us      34.271us             3  
                                aten::native_layer_norm        10.88%      51.150us        83.10%     390.818us     130.273us      68.606us       100.00%     102.813us      34.271us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     100.893us       147.06%     100.893us     100.893us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      68.606us       100.00%      68.606us      22.869us             3  
                                Activity Buffer Request        31.07%     146.142us        31.07%     146.142us     146.142us      34.207us        49.86%      34.207us      34.207us             1  
                                            aten::empty         6.17%      29.002us         6.17%      29.002us       3.222us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        34.16%     160.644us        34.16%     160.644us      53.548us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.82%       3.880us         0.82%       3.880us       0.647us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         1.09%       5.121us         1.09%       5.121us       5.121us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 470.319us
Self CUDA time total: 68.606us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S2048_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.85%      67.820us        99.72%       1.755ms       1.755ms       0.000us         0.00%     204.288us     204.288us             1  
                                       aten::layer_norm         0.52%       9.151us        95.86%       1.687ms     562.280us       0.000us         0.00%     204.288us      68.096us             3  
                                aten::native_layer_norm         2.95%      51.910us        95.34%       1.678ms     559.230us     129.120us       100.00%     204.288us      68.096us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     130.560us       101.12%     130.560us     130.560us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     129.120us       100.00%     129.120us      43.040us             3  
                                Activity Buffer Request        81.69%       1.437ms        81.69%       1.437ms       1.437ms      75.168us        58.22%      75.168us      75.168us             1  
                                            aten::empty         1.73%      30.362us         1.73%      30.362us       3.374us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.76%     154.112us         8.76%     154.112us      51.371us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.22%       3.910us         0.22%       3.910us       0.652us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.28%       4.960us         0.28%       4.960us       4.960us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.760ms
Self CUDA time total: 129.120us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S2048_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.24%      70.231us        80.97%       1.754ms       1.754ms       0.000us         0.00%     714.792us     714.792us             1  
                                       aten::layer_norm         0.42%       9.200us        77.73%       1.684ms     561.233us       0.000us         0.00%     714.792us     238.264us             3  
                                aten::native_layer_norm         2.38%      51.610us        77.31%       1.674ms     558.166us     542.598us       100.00%     714.792us     238.264us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     544.071us       100.27%     544.071us     544.071us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     542.598us       100.00%     542.598us     180.866us             3  
                                Activity Buffer Request        66.26%       1.435ms        66.26%       1.435ms       1.435ms     172.194us        31.74%     172.194us     172.194us             1  
                                            aten::empty         1.34%      28.942us         1.34%      28.942us       3.216us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         7.14%     154.623us         7.14%     154.623us      51.541us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.19%       4.030us         0.19%       4.030us       0.672us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        19.03%     412.116us        19.03%     412.116us     412.116us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.166ms
Self CUDA time total: 542.598us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B4_S2048_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         2.50%      69.210us        63.28%       1.753ms       1.753ms       0.000us         0.00%       1.482ms       1.482ms             1  
                                       aten::layer_norm         0.34%       9.550us        60.78%       1.684ms     561.333us       0.000us         0.00%       1.482ms     494.135us             3  
                                aten::native_layer_norm         1.89%      52.442us        60.43%       1.674ms     558.150us       1.150ms       100.00%       1.482ms     494.135us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       1.151ms       100.12%       1.151ms       1.151ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       1.150ms       100.00%       1.150ms     383.212us             3  
                                Activity Buffer Request        51.68%       1.432ms        51.68%       1.432ms       1.432ms     332.769us        28.95%     332.769us     332.769us             1  
                                            aten::empty         1.10%      30.460us         1.10%      30.460us       3.384us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         5.62%     155.772us         5.62%     155.772us      51.924us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.14%       3.891us         0.14%       3.891us       0.649us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        36.72%       1.018ms        36.72%       1.018ms       1.018ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.771ms
Self CUDA time total: 1.150ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      86.813us       481.04%      86.813us      86.813us             1  
                                       torch_layer_norm        13.94%      63.610us        98.78%     450.788us     450.788us       0.000us         0.00%      23.966us      23.966us             1  
                                       aten::layer_norm         1.92%       8.751us        84.84%     387.178us     129.059us       0.000us         0.00%      23.966us       7.989us             3  
                                aten::native_layer_norm        11.33%      51.701us        82.93%     378.427us     126.142us      18.047us       100.00%      23.966us       7.989us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      18.047us       100.00%      18.047us       6.016us             3  
                                Activity Buffer Request        30.87%     140.892us        30.87%     140.892us     140.892us       5.919us        32.80%       5.919us       5.919us             1  
                                            aten::empty         6.07%      27.691us         6.07%      27.691us       3.077us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        33.75%     154.013us        33.75%     154.013us      51.338us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.91%       4.130us         0.91%       4.130us       0.688us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         1.22%       5.560us         1.22%       5.560us       5.560us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 456.348us
Self CUDA time total: 18.047us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      94.272us       347.01%      94.272us      94.272us             1  
                                       torch_layer_norm         3.87%      67.581us        99.70%       1.743ms       1.743ms       0.000us         0.00%      36.063us      36.063us             1  
                                       aten::layer_norm         0.54%       9.410us        95.84%       1.675ms     558.423us       0.000us         0.00%      36.063us      12.021us             3  
                                aten::native_layer_norm         3.00%      52.431us        95.30%       1.666ms     555.286us      27.167us       100.00%      36.063us      12.021us             3  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      27.167us       100.00%      27.167us       9.056us             3  
                                Activity Buffer Request        81.64%       1.427ms        81.64%       1.427ms       1.427ms       8.896us        32.75%       8.896us       8.896us             1  
                                            aten::empty         1.64%      28.640us         1.64%      28.640us       3.182us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.79%     153.563us         8.79%     153.563us      51.188us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.23%       4.090us         0.23%       4.090us       0.682us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.30%       5.160us         0.30%       5.160us       5.160us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.748ms
Self CUDA time total: 27.167us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S128_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm        15.30%      64.290us        98.85%     415.327us     415.327us       0.000us         0.00%     113.182us     113.182us             1  
                                       aten::layer_norm         1.89%       7.931us        83.55%     351.037us     117.012us       0.000us         0.00%     113.182us      37.727us             3  
                                aten::native_layer_norm        12.15%      51.059us        81.66%     343.106us     114.369us      72.639us       100.00%     113.182us      37.727us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us      97.758us       134.58%      97.758us      97.758us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      72.639us       100.00%      72.639us      24.213us             3  
                                Activity Buffer Request        25.15%     105.652us        25.15%     105.652us     105.652us      40.543us        55.81%      40.543us      40.543us             1  
                                            aten::empty         7.08%      29.763us         7.08%      29.763us       3.307us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        36.37%     152.792us        36.37%     152.792us      50.931us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.91%       3.840us         0.91%       3.840us       0.640us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         1.15%       4.831us         1.15%       4.831us       4.831us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 420.158us
Self CUDA time total: 72.639us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S128_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.89%      68.361us        99.32%       1.748ms       1.748ms       0.000us         0.00%     226.432us     226.432us             1  
                                       aten::layer_norm         0.51%       8.970us        95.44%       1.679ms     559.750us       0.000us         0.00%     226.432us      75.477us             3  
                                aten::native_layer_norm         3.03%      53.343us        94.93%       1.670ms     556.760us     142.207us       100.00%     226.432us      75.477us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     143.552us       100.95%     143.552us     143.552us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     142.207us       100.00%     142.207us      47.402us             3  
                                Activity Buffer Request        81.27%       1.430ms        81.27%       1.430ms       1.430ms      84.225us        59.23%      84.225us      84.225us             1  
                                            aten::empty         1.69%      29.760us         1.69%      29.760us       3.307us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.71%     153.172us         8.71%     153.172us      51.057us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.23%       4.080us         0.23%       4.080us       0.680us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.68%      11.911us         0.68%      11.911us      11.911us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.760ms
Self CUDA time total: 142.207us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.86%      67.581us        99.71%       1.745ms       1.745ms       0.000us         0.00%     103.967us     103.967us             1  
                                       aten::layer_norm         0.51%       8.910us        95.84%       1.677ms     559.073us       0.000us         0.00%     103.967us      34.656us             3  
                                aten::native_layer_norm         3.07%      53.660us        95.33%       1.668ms     556.103us      69.343us       100.00%     103.967us      34.656us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     103.487us       149.24%     103.487us     103.487us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us      69.343us       100.00%      69.343us      23.114us             3  
                                Activity Buffer Request        81.52%       1.427ms        81.52%       1.427ms       1.427ms      34.624us        49.93%      34.624us      34.624us             1  
                                            aten::empty         1.61%      28.261us         1.61%      28.261us       3.140us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         8.90%     155.753us         8.90%     155.753us      51.918us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.24%       4.120us         0.24%       4.120us       0.687us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.29%       5.151us         0.29%       5.151us       5.151us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.750ms
Self CUDA time total: 69.343us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm        11.35%      67.490us        99.15%     589.690us     589.690us       0.000us         0.00%     202.330us     202.330us             1  
                                       aten::layer_norm         1.44%       8.590us        87.80%     522.200us     174.067us       0.000us         0.00%     202.330us      67.443us             3  
                                aten::native_layer_norm         8.41%      50.041us        86.35%     513.610us     171.203us     128.124us       100.00%     202.330us      67.443us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     129.692us       101.22%     129.692us     129.692us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     128.124us       100.00%     128.124us      42.708us             3  
                                Activity Buffer Request        46.63%     277.315us        46.63%     277.315us     277.315us      74.206us        57.92%      74.206us      74.206us             1  
                                            aten::empty         4.68%      27.831us         4.68%      27.831us       3.092us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        25.89%     153.973us        25.89%     153.973us      51.324us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.75%       4.450us         0.75%       4.450us       0.742us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.85%       5.080us         0.85%       5.080us       5.080us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 594.770us
Self CUDA time total: 128.124us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S512_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         6.87%      68.511us        58.17%     579.770us     579.770us       0.000us         0.00%     720.407us     720.407us             1  
                                       aten::layer_norm         0.88%       8.821us        51.29%     511.259us     170.420us       0.000us         0.00%     720.407us     240.136us             3  
                                aten::native_layer_norm         5.17%      51.521us        50.41%     502.438us     167.479us     546.073us       100.00%     720.407us     240.136us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     547.577us       100.28%     547.577us     547.577us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     546.073us       100.00%     546.073us     182.024us             3  
                                Activity Buffer Request        26.52%     264.294us        26.52%     264.294us     264.294us     174.334us        31.93%     174.334us     174.334us             1  
                                            aten::empty         2.91%      29.030us         2.91%      29.030us       3.226us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        15.39%     153.384us        15.39%     153.384us      51.128us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.42%       4.209us         0.42%       4.209us       0.702us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        41.83%     416.987us        41.83%     416.987us     416.987us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 996.757us
Self CUDA time total: 546.073us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S512_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         4.10%      64.241us        34.57%     541.829us     541.829us       0.000us         0.00%       1.480ms       1.480ms             1  
                                       aten::layer_norm         0.55%       8.560us        30.47%     477.588us     159.196us       0.000us         0.00%       1.480ms     493.436us             3  
                                aten::native_layer_norm         3.24%      50.830us        29.93%     469.028us     156.343us       1.149ms       100.00%       1.480ms     493.436us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       1.151ms       100.12%       1.151ms       1.151ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       1.149ms       100.00%       1.149ms     383.133us             3  
                                Activity Buffer Request        14.86%     232.814us        14.86%     232.814us     232.814us     330.909us        28.79%     330.909us     330.909us             1  
                                            aten::empty         1.86%      29.081us         1.86%      29.081us       3.231us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         9.70%     152.022us         9.70%     152.022us      50.674us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.27%       4.281us         0.27%       4.281us       0.713us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        65.43%       1.025ms        65.43%       1.025ms       1.025ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.567ms
Self CUDA time total: 1.149ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S1024_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm        10.87%      65.290us        97.50%     585.660us     585.660us       0.000us         0.00%     211.160us     211.160us             1  
                                       aten::layer_norm         1.49%       8.961us        86.63%     520.370us     173.457us       0.000us         0.00%     211.160us      70.387us             3  
                                aten::native_layer_norm         8.59%      51.600us        85.14%     511.409us     170.470us     139.579us       100.00%     211.160us      70.387us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     140.987us       101.01%     140.987us     140.987us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     139.579us       100.00%     139.579us      46.526us             3  
                                Activity Buffer Request        45.81%     275.144us        45.81%     275.144us     275.144us      71.581us        51.28%      71.581us      71.581us             1  
                                            aten::empty         4.65%      27.942us         4.65%      27.942us       3.105us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        25.42%     152.693us        25.42%     152.693us      50.898us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.67%       4.030us         0.67%       4.030us       0.672us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         2.50%      14.990us         2.50%      14.990us      14.990us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 600.650us
Self CUDA time total: 139.579us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S1024_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         6.53%      63.420us        56.04%     544.209us     544.209us       0.000us         0.00%     725.021us     725.021us             1  
                                       aten::layer_norm         0.90%       8.770us        49.51%     480.789us     160.263us       0.000us         0.00%     725.021us     241.674us             3  
                                aten::native_layer_norm         5.25%      50.982us        48.61%     472.019us     157.340us     551.902us       100.00%     725.021us     241.674us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     553.342us       100.26%     553.342us     553.342us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     551.902us       100.00%     551.902us     183.967us             3  
                                Activity Buffer Request        24.17%     234.744us        24.17%     234.744us     234.744us     173.119us        31.37%     173.119us     173.119us             1  
                                            aten::empty         3.03%      29.450us         3.03%      29.450us       3.272us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        15.70%     152.482us        15.70%     152.482us      50.827us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.45%       4.361us         0.45%       4.361us       0.727us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        43.96%     426.887us        43.96%     426.887us     426.887us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 971.096us
Self CUDA time total: 551.902us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S1024_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         4.07%      66.881us        38.72%     635.751us     635.751us       0.000us         0.00%       1.469ms       1.469ms             1  
                                       aten::layer_norm         0.55%       9.009us        34.64%     568.870us     189.623us       0.000us         0.00%       1.469ms     489.666us             3  
                                aten::native_layer_norm         3.27%      53.630us        34.10%     559.861us     186.620us       1.138ms       100.00%       1.469ms     489.666us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       1.139ms       100.13%       1.139ms       1.139ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       1.138ms       100.00%       1.138ms     379.279us             3  
                                Activity Buffer Request        19.12%     313.985us        19.12%     313.985us     313.985us     331.162us        29.10%     331.162us     331.162us             1  
                                            aten::empty         1.88%      30.903us         1.88%      30.903us       3.434us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         9.57%     157.133us         9.57%     157.133us      52.378us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.26%       4.210us         0.26%       4.210us       0.702us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        61.28%       1.006ms        61.28%       1.006ms       1.006ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.642ms
Self CUDA time total: 1.138ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S1024_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         2.42%      65.690us        15.85%     430.707us     430.707us       0.000us         0.00%       3.155ms       3.155ms             1  
                                       aten::layer_norm         0.35%       9.490us        13.44%     365.017us     121.672us       0.000us         0.00%       3.155ms       1.052ms             3  
                                aten::native_layer_norm         1.79%      48.727us        13.09%     355.527us     118.509us       2.409ms       100.00%       3.155ms       1.052ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       2.410ms       100.06%       2.410ms       2.410ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       2.409ms       100.00%       2.409ms     802.859us             3  
                                Activity Buffer Request         4.38%     118.922us         4.38%     118.922us     118.922us     746.656us        31.00%     746.656us     746.656us             1  
                                            aten::empty         1.13%      30.624us         1.13%      30.624us       3.403us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         5.65%     153.412us         5.65%     153.412us      51.137us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.14%       3.842us         0.14%       3.842us       0.640us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        84.15%       2.286ms        84.15%       2.286ms       2.286ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.717ms
Self CUDA time total: 2.409ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         6.72%      66.011us        55.62%     546.350us     546.350us       0.000us         0.00%     735.937us     735.937us             1  
                                       aten::layer_norm         0.92%       8.990us        48.90%     480.339us     160.113us       0.000us         0.00%     735.937us     245.312us             3  
                                aten::native_layer_norm         5.16%      50.724us        47.98%     471.349us     157.116us     560.097us       100.00%     735.937us     245.312us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us     561.633us       100.27%     561.633us     561.633us             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us     560.097us       100.00%     560.097us     186.699us             3  
                                Activity Buffer Request        23.82%     234.014us        23.82%     234.014us     234.014us     175.840us        31.39%     175.840us     175.840us             1  
                                            aten::empty         2.88%      28.270us         2.88%      28.270us       3.141us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        15.72%     154.402us        15.72%     154.402us      51.467us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.40%       3.939us         0.40%       3.939us       0.656us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        44.38%     435.997us        44.38%     435.997us     435.997us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 982.347us
Self CUDA time total: 560.097us



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         4.56%      64.832us        29.06%     412.897us     412.897us       0.000us         0.00%       1.469ms       1.469ms             1  
                                       aten::layer_norm         0.65%       9.228us        24.50%     348.065us     116.022us       0.000us         0.00%       1.469ms     489.663us             3  
                                aten::native_layer_norm         3.69%      52.410us        23.85%     338.837us     112.946us       1.133ms       100.00%       1.469ms     489.663us             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       1.135ms       100.12%       1.135ms       1.135ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       1.133ms       100.00%       1.133ms     377.716us             3  
                                Activity Buffer Request         7.07%     100.442us         7.07%     100.442us     100.442us     335.839us        29.64%     335.839us     335.839us             1  
                                            aten::empty         2.06%      29.311us         2.06%      29.311us       3.257us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel        10.76%     152.823us        10.76%     152.823us      50.941us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.27%       3.851us         0.27%       3.851us       0.642us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        70.94%       1.008ms        70.94%       1.008ms       1.008ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.421ms
Self CUDA time total: 1.133ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         2.43%      67.770us        21.38%     597.070us     597.070us       0.000us         0.00%       3.032ms       3.032ms             1  
                                       aten::layer_norm         0.34%       9.401us        18.95%     529.300us     176.433us       0.000us         0.00%       3.032ms       1.011ms             3  
                                aten::native_layer_norm         1.84%      51.400us        18.61%     519.899us     173.300us       2.325ms       100.00%       3.032ms       1.011ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       2.327ms       100.06%       2.327ms       2.327ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       2.325ms       100.00%       2.325ms     775.112us             3  
                                Activity Buffer Request         9.90%     276.585us         9.90%     276.585us     276.585us     706.558us        30.39%     706.558us     706.558us             1  
                                            aten::empty         1.09%      30.392us         1.09%      30.392us       3.377us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         5.64%     157.652us         5.64%     157.652us      52.551us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.14%       3.870us         0.14%       3.870us       0.645us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        78.62%       2.196ms        78.62%       2.196ms       2.196ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.793ms
Self CUDA time total: 2.325ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.28%      68.262us        10.71%     572.390us     572.390us       0.000us         0.00%       6.493ms       6.493ms             1  
                                       aten::layer_norm         0.16%       8.770us         9.43%     504.128us     168.043us       0.000us         0.00%       6.493ms       2.164ms             3  
                                aten::native_layer_norm         0.96%      51.508us         9.27%     495.358us     165.119us       4.900ms       100.00%       6.493ms       2.164ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.901ms       100.03%       4.901ms       4.901ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.900ms       100.00%       4.900ms       1.633ms             3  
                                Activity Buffer Request         4.74%     253.634us         4.74%     253.634us     253.634us       1.594ms        32.53%       1.594ms       1.594ms             1  
                                            aten::empty         0.56%      29.682us         0.56%      29.682us       3.298us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         2.93%     156.523us         2.93%     156.523us      52.174us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.08%       4.011us         0.08%       4.011us       0.669us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        89.29%       4.774ms        89.29%       4.774ms       4.774ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.346ms
Self CUDA time total: 4.900ms


impl                     wl                  p50(ms)  ok
torch_layer_norm         LN_B16_S1024_D1024     0.05  False
torch_layer_norm         LN_B16_S1024_D2048     0.21  False
torch_layer_norm         LN_B16_S1024_D4096     0.42  False
torch_layer_norm         LN_B16_S1024_D8192     0.85  False
torch_layer_norm         LN_B16_S128_D1024      0.03  False
torch_layer_norm         LN_B16_S128_D2048      0.03  False
torch_layer_norm         LN_B16_S128_D4096      0.04  False
torch_layer_norm         LN_B16_S128_D8192      0.05  False
torch_layer_norm         LN_B16_S2048_D1024     0.21  False
torch_layer_norm         LN_B16_S2048_D2048     0.42  False
torch_layer_norm         LN_B16_S2048_D4096     0.82  False
torch_layer_norm         LN_B16_S2048_D8192     1.68  False
torch_layer_norm         LN_B16_S512_D1024      0.04  False
torch_layer_norm         LN_B16_S512_D2048      0.05  False
torch_layer_norm         LN_B16_S512_D4096      0.21  False
torch_layer_norm         LN_B16_S512_D8192      0.43  False
torch_layer_norm         LN_B1_S1024_D1024      0.03  False
torch_layer_norm         LN_B1_S1024_D2048      0.03  False
torch_layer_norm         LN_B1_S1024_D4096      0.03  False
torch_layer_norm         LN_B1_S1024_D8192      0.04  False
torch_layer_norm         LN_B1_S128_D1024       0.02  False
torch_layer_norm         LN_B1_S128_D2048       0.03  False
torch_layer_norm         LN_B1_S128_D4096       0.03  False
torch_layer_norm         LN_B1_S128_D8192       0.03  False
torch_layer_norm         LN_B1_S2048_D1024      0.03  False
torch_layer_norm         LN_B1_S2048_D2048      0.03  False
torch_layer_norm         LN_B1_S2048_D4096      0.04  False
torch_layer_norm         LN_B1_S2048_D8192      0.05  False
torch_layer_norm         LN_B1_S512_D1024       0.03  False
torch_layer_norm         LN_B1_S512_D2048       0.03  False
torch_layer_norm         LN_B1_S512_D4096       0.03  False
torch_layer_norm         LN_B1_S512_D8192       0.03  False
torch_layer_norm         LN_B4_S1024_D1024      0.03  False
torch_layer_norm         LN_B4_S1024_D2048      0.04  False
torch_layer_norm         LN_B4_S1024_D4096      0.05  False
torch_layer_norm         LN_B4_S1024_D8192      0.20  False
torch_layer_norm         LN_B4_S128_D1024       0.03  False
torch_layer_norm         LN_B4_S128_D2048       0.03  False
torch_layer_norm         LN_B4_S128_D4096       0.03  False
torch_layer_norm         LN_B4_S128_D8192       0.03  False
torch_layer_norm         LN_B4_S2048_D1024      0.04  False
torch_layer_norm         LN_B4_S2048_D2048      0.05  False
torch_layer_norm         LN_B4_S2048_D4096      0.21  False
torch_layer_norm         LN_B4_S2048_D8192      0.44  False
torch_layer_norm         LN_B4_S512_D1024       0.03  False
torch_layer_norm         LN_B4_S512_D2048       0.03  False
torch_layer_norm         LN_B4_S512_D4096       0.04  False
torch_layer_norm         LN_B4_S512_D8192       0.05  False
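
To put the p50 column in context, the following is a minimal sketch that turns a workload name and its p50 latency into a rough effective memory throughput. It rests on assumptions not stated in this output: that `B`, `S`, and `D` in the workload name are batch size, sequence length, and hidden size; that the tensor uses a 16-bit dtype; and that the kernel reads the input once and writes the output once (weight and bias traffic is ignored). The helper names `parse_workload` and `effective_gb_per_s` are hypothetical and not part of the benchmark code.

```python
import re

# A few (workload, p50 in ms) pairs copied from the summary table above.
P50_MS = {
    "LN_B16_S1024_D2048": 0.21,
    "LN_B16_S1024_D4096": 0.42,
    "LN_B16_S1024_D8192": 0.85,
    "LN_B16_S2048_D8192": 1.68,
}

# Assumption: 2 bytes per element (e.g. a 16-bit dtype). The dtype is not
# shown in this section, so adjust this if the benchmark used float32.
BYTES_PER_ELEMENT = 2


def parse_workload(name):
    """Split a name like 'LN_B16_S1024_D2048' into (B, S, D) integers."""
    match = re.fullmatch(r"LN_B(\d+)_S(\d+)_D(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected workload name: {name!r}")
    b, s, d = (int(g) for g in match.groups())
    return b, s, d


def effective_gb_per_s(name, p50_ms, bytes_per_element=BYTES_PER_ELEMENT):
    """Rough throughput estimate: one read plus one write of the tensor."""
    b, s, d = parse_workload(name)
    bytes_moved = 2 * b * s * d * bytes_per_element
    seconds = p50_ms * 1e-3
    return bytes_moved / seconds / 1e9


if __name__ == "__main__":
    for workload, p50 in P50_MS.items():
        print(f"{workload:<22} {effective_gb_per_s(workload, p50):8.1f} GB/s")
```

Because the p50 values are rounded to two decimals, the estimate is coarse, especially for the small workloads at 0.02 to 0.05 ms where launch overhead likely dominates.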

Artifacts:

layer_norm.jsonl