PyTorch Native - Causal Conv1D

GPU Info

▼ code ▼ output ▶ uv-logs | Cell: nv | 0.21s | Raw GitHub
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Wed Oct 29 15:50:16 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   29C    P0             78W /  350W |       0MiB /  46068MiB |     18%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Causal Conv1D Benchmark (PyTorch Native)

▼ code ▼ output ▶ uv-logs | Cell: benchmark | 3.67s | Raw GitHub
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
import torch.nn.functional as F
import sys
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark


def torch_causal_conv1d(input_tensor, weight, bias):
    # Convert to weight dtype for computation
    x = input_tensor.to(weight.dtype)
    dim = weight.shape[0]
    width = weight.shape[1]
    seqlen = input_tensor.shape[-1]

    # Depthwise causal conv1d using PyTorch
    out = F.conv1d(x, weight.unsqueeze(1), bias, padding=width - 1, groups=dim)

    # Truncate to original sequence length
    out = out[..., :seqlen]

    # Convert back to original dtype
    return out.to(input_tensor.dtype)


run_benchmark(
    kernel_type=KernelTypeEnum.CAUSAL_CONV1D,
    impl_name="torch_eager",
    impl_tags={"family": "pytorch", "backend": "eager"},
    impl_func=torch_causal_conv1d,
)
Running causal_conv1d benchmark on cuda with 24 workloads.

======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     430.557us      2231.44%     430.557us     430.557us             1  
                                            torch_eager        10.26%     219.304us        99.68%       2.131ms       2.131ms       0.000us         0.00%      21.630us      21.630us             1  
                                               aten::to         0.52%      11.162us        80.31%       1.717ms     286.188us       0.000us         0.00%      14.269us       2.378us             6  
                                         aten::_to_copy         1.64%      35.099us        79.78%       1.706ms     284.328us       0.000us         0.00%      14.269us       2.378us             6  
                                            aten::copy_         2.90%      61.950us        75.45%       1.613ms     268.888us      11.934us        61.85%      14.269us       2.378us             6  
                                           aten::conv1d         0.34%       7.309us         7.15%     152.793us      50.931us       0.000us         0.00%       7.361us       2.454us             3  
                                      aten::convolution         0.74%      15.802us         6.80%     145.484us      48.495us       0.000us         0.00%       7.361us       2.454us             3  
                                     aten::_convolution         1.46%      31.300us         6.06%     129.682us      43.227us       0.000us         0.00%       7.361us       2.454us             3  
                                aten::_conv_depthwise2d         1.58%      33.771us         3.83%      81.842us      27.281us       7.361us        38.15%       7.361us       2.454us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us       7.361us        38.15%       7.361us       2.454us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       6.271us        32.50%       6.271us       2.090us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       5.663us        29.35%       5.663us       1.888us             3  
                                Activity Buffer Request        69.43%       1.484ms        69.43%       1.484ms       1.484ms       2.335us        12.10%       2.335us       2.335us             1  
                                    aten::empty_strided         2.69%      57.542us         2.69%      57.542us       9.590us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel         4.29%      91.753us         4.29%      91.753us      10.195us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         1.26%      26.879us         1.60%      34.300us       3.811us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.54%      11.581us         0.54%      11.581us       0.772us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.53%      11.370us         0.53%      11.370us       3.790us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.55%      11.861us         0.55%      11.861us       3.954us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.32%       6.790us         0.37%       8.000us       2.667us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.138ms
Self CUDA time total: 19.295us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     335.581us      1716.35%     335.581us     335.581us             1  
                                            torch_eager         7.79%     148.313us        99.72%       1.898ms       1.898ms       0.000us         0.00%      21.664us      21.664us             1  
                                               aten::to         0.37%       7.132us        84.25%       1.604ms     267.308us       0.000us         0.00%      13.760us       2.293us             6  
                                         aten::_to_copy         1.28%      24.280us        83.88%       1.597ms     266.119us       0.000us         0.00%      13.760us       2.293us             6  
                                            aten::copy_         2.64%      50.321us        81.08%       1.543ms     257.239us      11.648us        59.57%      13.760us       2.293us             6  
                                           aten::conv1d         0.32%       6.130us         6.26%     119.243us      39.748us       0.000us         0.00%       7.904us       2.635us             3  
                                      aten::convolution         0.58%      11.131us         5.94%     113.113us      37.704us       0.000us         0.00%       7.904us       2.635us             3  
                                     aten::_convolution         1.18%      22.459us         5.36%     101.982us      33.994us       0.000us         0.00%       7.904us       2.635us             3  
                                aten::_conv_depthwise2d         1.08%      20.592us         3.32%      63.132us      21.044us       7.904us        40.43%       7.904us       2.635us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us       7.904us        40.43%       7.904us       2.635us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       6.080us        31.10%       6.080us       2.027us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       5.568us        28.48%       5.568us       1.856us             3  
                                Activity Buffer Request        75.89%       1.445ms        75.89%       1.445ms       1.445ms       2.112us        10.80%       2.112us       2.112us             1  
                                    aten::empty_strided         1.52%      29.001us         1.52%      29.001us       4.834us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel         3.69%      70.220us         3.69%      70.220us       7.802us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.92%      17.542us         1.22%      23.251us       2.583us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.48%       9.139us         0.48%       9.139us       0.609us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.58%      11.060us         0.58%      11.060us       3.687us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.51%       9.700us         0.51%       9.700us       3.233us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.31%       5.860us         0.37%       7.020us       2.340us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.904ms
Self CUDA time total: 19.552us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     343.036us      1844.97%     343.036us     343.036us             1  
                                            torch_eager         7.68%     146.161us        99.72%       1.897ms       1.897ms       0.000us         0.00%      20.481us      20.481us             1  
                                               aten::to         0.37%       6.953us        83.90%       1.596ms     266.066us       0.000us         0.00%      13.536us       2.256us             6  
                                         aten::_to_copy         1.25%      23.842us        83.53%       1.589ms     264.907us       0.000us         0.00%      13.536us       2.256us             6  
                                            aten::copy_         2.67%      50.789us        80.65%       1.535ms     255.762us      11.648us        62.65%      13.536us       2.256us             6  
                                           aten::conv1d         0.33%       6.290us         6.66%     126.782us      42.261us       0.000us         0.00%       6.945us       2.315us             3  
                                      aten::convolution         0.51%       9.650us         6.33%     120.492us      40.164us       0.000us         0.00%       6.945us       2.315us             3  
                                     aten::_convolution         1.22%      23.120us         5.83%     110.842us      36.947us       0.000us         0.00%       6.945us       2.315us             3  
                                aten::_conv_depthwise2d         1.44%      27.490us         3.78%      71.832us      23.944us       6.945us        37.35%       6.945us       2.315us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us       6.945us        37.35%       6.945us       2.315us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       5.952us        32.01%       5.952us       1.984us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       5.696us        30.64%       5.696us       1.899us             3  
                                Activity Buffer Request        75.34%       1.433ms        75.34%       1.433ms       1.433ms       1.888us        10.15%       1.888us       1.888us             1  
                                    aten::empty_strided         1.63%      31.030us         1.63%      31.030us       5.172us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel         3.99%      75.911us         3.99%      75.911us       8.435us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.90%      17.191us         1.18%      22.431us       2.492us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.46%       8.831us         0.46%       8.831us       0.589us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.55%      10.511us         0.55%      10.511us       3.504us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.43%       8.261us         0.43%       8.261us       2.754us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.31%       5.930us         0.38%       7.310us       2.437us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.903ms
Self CUDA time total: 18.593us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     329.535us      1679.93%     329.535us     329.535us             1  
                                            torch_eager         6.84%     137.444us        99.71%       2.005ms       2.005ms       0.000us         0.00%      21.760us      21.760us             1  
                                               aten::to         0.33%       6.580us        85.73%       1.724ms     287.253us       0.000us         0.00%      14.048us       2.341us             6  
                                         aten::_to_copy         1.15%      23.069us        85.41%       1.717ms     286.156us       0.000us         0.00%      14.048us       2.341us             6  
                                            aten::copy_         2.51%      50.362us        82.79%       1.664ms     277.390us      11.904us        60.69%      14.048us       2.341us             6  
                                           aten::conv1d         0.28%       5.589us         5.76%     115.862us      38.621us       0.000us         0.00%       7.712us       2.571us             3  
                                      aten::convolution         0.52%      10.381us         5.49%     110.273us      36.758us       0.000us         0.00%       7.712us       2.571us             3  
                                     aten::_convolution         1.13%      22.651us         4.97%      99.892us      33.297us       0.000us         0.00%       7.712us       2.571us             3  
                                aten::_conv_depthwise2d         1.01%      20.392us         3.08%      61.981us      20.660us       7.712us        39.31%       7.712us       2.571us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us       7.712us        39.31%       7.712us       2.571us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       6.176us        31.48%       6.176us       2.059us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       5.728us        29.20%       5.728us       1.909us             3  
                                Activity Buffer Request        70.54%       1.418ms        70.54%       1.418ms       1.418ms       2.144us        10.93%       2.144us       2.144us             1  
                                    aten::empty_strided         1.47%      29.531us         1.47%      29.531us       4.922us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        10.85%     218.024us        10.85%     218.024us      24.225us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.87%      17.580us         1.13%      22.650us       2.517us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.42%       8.370us         0.42%       8.370us       0.558us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.48%       9.699us         0.48%       9.699us       3.233us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.49%       9.760us         0.49%       9.760us       3.253us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.28%       5.670us         0.34%       6.790us       2.263us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.010ms
Self CUDA time total: 19.616us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     344.351us      1401.22%     344.351us     344.351us             1  
                                            torch_eager         7.42%     151.244us        99.76%       2.034ms       2.034ms       0.000us         0.00%      26.847us      26.847us             1  
                                               aten::to         0.33%       6.730us        85.23%       1.738ms     289.583us       0.000us         0.00%      15.264us       2.544us             6  
                                         aten::_to_copy         1.15%      23.491us        84.90%       1.731ms     288.462us       0.000us         0.00%      15.264us       2.544us             6  
                                            aten::copy_         2.84%      57.871us        82.24%       1.677ms     279.428us      12.992us        52.87%      15.264us       2.544us             6  
                                           aten::conv1d         0.31%       6.410us         5.76%     117.443us      39.148us       0.000us         0.00%      11.583us       3.861us             3  
                                      aten::convolution         0.49%      10.031us         5.45%     111.033us      37.011us       0.000us         0.00%      11.583us       3.861us             3  
                                     aten::_convolution         1.08%      22.081us         4.95%     101.002us      33.667us       0.000us         0.00%      11.583us       3.861us             3  
                                aten::_conv_depthwise2d         1.04%      21.239us         3.10%      63.201us      21.067us      11.583us        47.13%      11.583us       3.861us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us      11.583us        47.13%      11.583us       3.861us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       6.656us        27.08%       6.656us       2.219us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       6.336us        25.78%       6.336us       2.112us             3  
                                Activity Buffer Request        70.08%       1.429ms        70.08%       1.429ms       1.429ms       2.272us         9.25%       2.272us       2.272us             1  
                                    aten::empty_strided         1.51%      30.710us         1.51%      30.710us       5.118us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        10.42%     212.467us        10.42%     212.467us      23.607us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.89%      18.130us         1.15%      23.350us       2.594us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.43%       8.790us         0.43%       8.790us       0.586us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.48%       9.800us         0.48%       9.800us       3.267us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.47%       9.640us         0.47%       9.640us       3.213us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.28%       5.760us         0.35%       7.050us       2.350us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.039ms
Self CUDA time total: 24.575us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     332.990us      1278.37%     332.990us     332.990us             1  
                                            torch_eager         7.09%     142.971us        99.76%       2.011ms       2.011ms       0.000us         0.00%      28.288us      28.288us             1  
                                               aten::to         0.35%       7.062us        85.44%       1.723ms     287.120us       0.000us         0.00%      15.232us       2.539us             6  
                                         aten::_to_copy         1.18%      23.771us        85.09%       1.716ms     285.943us       0.000us         0.00%      15.232us       2.539us             6  
                                            aten::copy_         2.51%      50.519us        82.47%       1.663ms     277.136us      12.992us        49.88%      15.232us       2.539us             6  
                                           aten::conv1d         0.32%       6.541us         5.84%     117.833us      39.278us       0.000us         0.00%      13.056us       4.352us             3  
                                      aten::convolution         0.52%      10.410us         5.52%     111.292us      37.097us       0.000us         0.00%      13.056us       4.352us             3  
                                     aten::_convolution         1.19%      24.049us         5.00%     100.882us      33.627us       0.000us         0.00%      13.056us       4.352us             3  
                                aten::_conv_depthwise2d         1.01%      20.460us         2.98%      60.052us      20.017us      13.056us        50.12%      13.056us       4.352us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us      13.056us        50.12%      13.056us       4.352us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       6.624us        25.43%       6.624us       2.208us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       6.368us        24.45%       6.368us       2.123us             3  
                                Activity Buffer Request        70.71%       1.426ms        70.71%       1.426ms       1.426ms       2.240us         8.60%       2.240us       2.240us             1  
                                    aten::empty_strided         1.44%      29.071us         1.44%      29.071us       4.845us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        10.31%     207.805us        10.31%     207.805us      23.089us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.90%      18.081us         1.15%      23.201us       2.578us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.43%       8.650us         0.43%       8.650us       0.577us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.49%       9.891us         0.49%       9.891us       3.297us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.42%       8.561us         0.42%       8.561us       2.854us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.31%       6.350us         0.38%       7.610us       2.537us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.016ms
Self CUDA time total: 26.048us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     332.220us       868.07%     332.220us     332.220us             1  
                                            torch_eager         7.10%     144.065us        99.76%       2.024ms       2.024ms       0.000us         0.00%      40.831us      40.831us             1  
                                           aten::conv1d         0.30%       6.030us         5.72%     116.102us      38.701us       0.000us         0.00%      22.464us       7.488us             3  
                                      aten::convolution         0.49%       9.861us         5.42%     110.072us      36.691us       0.000us         0.00%      22.464us       7.488us             3  
                                     aten::_convolution         1.11%      22.459us         4.94%     100.211us      33.404us       0.000us         0.00%      22.464us       7.488us             3  
                                aten::_conv_depthwise2d         1.00%      20.252us         3.07%      62.362us      20.787us      22.464us        58.70%      22.464us       7.488us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us      22.464us        58.70%      22.464us       7.488us             3  
                                               aten::to         0.31%       6.271us        85.57%       1.737ms     289.428us       0.000us         0.00%      18.367us       3.061us             6  
                                         aten::_to_copy         1.14%      23.180us        85.26%       1.730ms     288.383us       0.000us         0.00%      18.367us       3.061us             6  
                                            aten::copy_         2.42%      49.061us        82.56%       1.675ms     279.226us      15.807us        41.30%      18.367us       3.061us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       8.448us        22.07%       8.448us       2.816us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       7.359us        19.23%       7.359us       2.453us             3  
                                Activity Buffer Request        70.88%       1.438ms        70.88%       1.438ms       1.438ms       2.560us         6.69%       2.560us       2.560us             1  
                                    aten::empty_strided         1.57%      31.760us         1.57%      31.760us       5.293us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        10.30%     209.084us        10.30%     209.084us      23.232us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.88%      17.889us         1.13%      22.980us       2.553us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.43%       8.691us         0.43%       8.691us       0.579us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.58%      11.680us         0.58%      11.680us       3.893us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.45%       9.220us         0.45%       9.220us       3.073us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.29%       5.850us         0.36%       7.240us       2.413us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.029ms
Self CUDA time total: 38.271us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     347.901us       847.38%     347.901us     347.901us             1  
                                            torch_eager         7.21%     147.111us        99.76%       2.035ms       2.035ms       0.000us         0.00%      43.616us      43.616us             1  
                                           aten::conv1d         0.33%       6.680us         5.94%     121.133us      40.378us       0.000us         0.00%      25.376us       8.459us             3  
                                      aten::convolution         0.49%      10.011us         5.61%     114.453us      38.151us       0.000us         0.00%      25.376us       8.459us             3  
                                     aten::_convolution         1.21%      24.739us         5.12%     104.442us      34.814us       0.000us         0.00%      25.376us       8.459us             3  
                                aten::_conv_depthwise2d         1.05%      21.431us         3.09%      62.981us      20.994us      25.376us        61.81%      25.376us       8.459us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us      25.376us        61.81%      25.376us       8.459us             3  
                                               aten::to         0.35%       7.210us        85.28%       1.740ms     289.922us       0.000us         0.00%      18.240us       3.040us             6  
                                         aten::_to_copy         1.21%      24.639us        84.93%       1.732ms     288.720us       0.000us         0.00%      18.240us       3.040us             6  
                                            aten::copy_         2.56%      52.303us        82.23%       1.677ms     279.542us      15.680us        38.19%      18.240us       3.040us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       8.288us        20.19%       8.288us       2.763us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       7.392us        18.00%       7.392us       2.464us             3  
                                Activity Buffer Request        70.32%       1.434ms        70.32%       1.434ms       1.434ms       2.560us         6.24%       2.560us       2.560us             1  
                                    aten::empty_strided         1.49%      30.432us         1.49%      30.432us       5.072us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        10.49%     213.884us        10.49%     213.884us      23.765us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.88%      17.871us         1.14%      23.242us       2.582us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.44%       8.942us         0.44%       8.942us       0.596us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.48%       9.840us         0.48%       9.840us       3.280us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.42%       8.540us         0.42%       8.540us       2.847us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.31%       6.230us         0.37%       7.540us       2.513us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.040ms
Self CUDA time total: 41.056us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     334.331us       325.39%     334.331us     334.331us             1  
                                            torch_eager         7.01%     141.713us        99.76%       2.018ms       2.018ms       0.000us         0.00%     108.764us     108.764us             1  
                                           aten::conv1d         0.30%       6.090us         5.77%     116.623us      38.874us       0.000us         0.00%      70.528us      23.509us             3  
                                      aten::convolution         0.56%      11.281us         5.46%     110.533us      36.844us       0.000us         0.00%      70.528us      23.509us             3  
                                     aten::_convolution         1.11%      22.501us         4.91%      99.252us      33.084us       0.000us         0.00%      70.528us      23.509us             3  
                                aten::_conv_depthwise2d         1.02%      20.538us         3.03%      61.301us      20.434us      70.528us        68.64%      70.528us      23.509us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us      70.528us        68.64%      70.528us      23.509us             3  
                                               aten::to         0.31%       6.229us        85.56%       1.731ms     288.457us       0.000us         0.00%      38.236us       6.373us             6  
                                         aten::_to_copy         1.17%      23.650us        85.25%       1.725ms     287.419us       0.000us         0.00%      38.236us       6.373us             6  
                                            aten::copy_         2.48%      50.230us        82.63%       1.672ms     278.605us      32.221us        31.36%      38.236us       6.373us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      17.598us        17.13%      17.598us       5.866us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      14.623us        14.23%      14.623us       4.874us             3  
                                Activity Buffer Request        70.91%       1.435ms        70.91%       1.435ms       1.435ms       6.015us         5.85%       6.015us       6.015us             1  
                                    aten::empty_strided         1.45%      29.232us         1.45%      29.232us       4.872us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        10.36%     209.517us        10.36%     209.517us      23.280us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.88%      17.770us         1.13%      22.940us       2.549us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.42%       8.560us         0.42%       8.560us       0.571us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.48%       9.771us         0.48%       9.771us       3.257us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.41%       8.351us         0.41%       8.351us       2.784us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.28%       5.590us         0.33%       6.770us       2.257us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.023ms
Self CUDA time total: 102.749us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     330.969us       293.42%     330.969us     330.969us             1  
                                            torch_eager        14.99%     119.634us        99.39%     793.059us     793.059us       0.000us         0.00%     118.814us     118.814us             1  
                                           aten::conv1d         0.68%       5.459us        14.66%     116.982us      38.994us       0.000us         0.00%      80.510us      26.837us             3  
                                      aten::convolution         1.26%      10.041us        13.98%     111.523us      37.174us       0.000us         0.00%      80.510us      26.837us             3  
                                     aten::_convolution         2.84%      22.661us        12.72%     101.482us      33.827us       0.000us         0.00%      80.510us      26.837us             3  
                                aten::_conv_depthwise2d         2.60%      20.719us         7.95%      63.401us      21.134us      80.510us        71.38%      80.510us      26.837us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us      80.510us        71.38%      80.510us      26.837us             3  
                                               aten::to         0.74%       5.920us        66.51%     530.742us      88.457us       0.000us         0.00%      38.304us       6.384us             6  
                                         aten::_to_copy         2.94%      23.422us        65.77%     524.822us      87.470us       0.000us         0.00%      38.304us       6.384us             6  
                                            aten::copy_         6.43%      51.340us        58.99%     470.681us      78.447us      32.288us        28.62%      38.304us       6.384us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      17.696us        15.69%      17.696us       5.899us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      14.592us        12.94%      14.592us       4.864us             3  
                                Activity Buffer Request        29.02%     231.576us        29.02%     231.576us     231.576us       6.016us         5.33%       6.016us       6.016us             1  
                                    aten::empty_strided         3.85%      30.719us         3.85%      30.719us       5.120us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        26.56%     211.935us        26.56%     211.935us      23.548us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         2.12%      16.940us         2.72%      21.720us       2.413us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         1.02%       8.121us         1.02%       8.121us       0.541us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.20%       9.582us         1.20%       9.582us       3.194us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         1.12%       8.930us         1.12%       8.930us       2.977us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.72%       5.780us         0.87%       6.970us       2.323us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 797.960us
Self CUDA time total: 112.798us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager        13.79%     117.896us        93.85%     802.069us     802.069us       0.000us         0.00%     432.858us     432.858us             1  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     419.356us       106.55%     419.356us     419.356us             1  
                                           aten::conv1d         0.66%       5.648us        13.24%     113.161us      37.720us       0.000us         0.00%     251.262us      83.754us             3  
                                      aten::convolution         1.11%       9.481us        12.58%     107.513us      35.838us       0.000us         0.00%     251.262us      83.754us             3  
                                     aten::_convolution         2.53%      21.627us        11.47%      98.032us      32.677us       0.000us         0.00%     251.262us      83.754us             3  
                                aten::_conv_depthwise2d         2.35%      20.121us         7.14%      61.002us      20.334us     251.262us        63.84%     251.262us      83.754us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us     251.262us        63.84%     251.262us      83.754us             3  
                                               aten::to         0.66%       5.670us        63.66%     544.101us      90.683us       0.000us         0.00%     181.596us      30.266us             6  
                                         aten::_to_copy         2.68%      22.880us        63.00%     538.431us      89.739us       0.000us         0.00%     181.596us      30.266us             6  
                                            aten::copy_         6.04%      51.591us        56.88%     486.161us      81.027us     142.333us        36.16%     181.596us      30.266us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us     102.271us        25.98%     102.271us      34.090us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      40.062us        10.18%      40.062us      13.354us             3  
                                Activity Buffer Request        28.73%     245.556us        28.73%     245.556us     245.556us      39.263us         9.98%      39.263us      39.263us             1  
                                    aten::empty_strided         3.44%      29.390us         3.44%      29.390us       4.898us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        24.77%     211.714us        24.77%     211.714us      23.524us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         1.99%      17.042us         2.56%      21.904us       2.434us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         1.01%       8.621us         1.01%       8.621us       0.575us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.11%       9.471us         1.11%       9.471us       3.157us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         1.02%       8.710us         1.02%       8.710us       2.903us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.71%       6.031us         0.84%       7.190us       2.397us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 854.650us
Self CUDA time total: 393.595us



======================================================================
PROFILE TRACE: torch_eager | cuda_B2_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager        12.73%     119.312us        88.90%     833.220us     833.220us       0.000us         0.00%     487.606us     487.606us             1  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     476.503us       106.43%     476.503us     476.503us             1  
                                           aten::conv1d         0.59%       5.550us        12.38%     116.073us      38.691us       0.000us         0.00%     298.682us      99.561us             3  
                                      aten::convolution         1.01%       9.430us        11.79%     110.523us      36.841us       0.000us         0.00%     298.682us      99.561us             3  
                                     aten::_convolution         2.32%      21.781us        10.79%     101.093us      33.698us       0.000us         0.00%     298.682us      99.561us             3  
                                aten::_conv_depthwise2d         2.19%      20.491us         6.88%      64.493us      21.498us     298.682us        66.71%     298.682us      99.561us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us     298.682us        66.71%     298.682us      99.561us             3  
                                               aten::to         0.60%       5.580us        60.86%     570.404us      95.067us       0.000us         0.00%     188.924us      31.487us             6  
                                         aten::_to_copy         2.42%      22.662us        60.26%     564.824us      94.137us       0.000us         0.00%     188.924us      31.487us             6  
                                            aten::copy_         5.33%      49.982us        54.62%     511.981us      85.330us     149.053us        33.29%     188.924us      31.487us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us     108.926us        24.33%     108.926us      36.309us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      40.127us         8.96%      40.127us      13.376us             3  
                                Activity Buffer Request        29.44%     275.977us        29.44%     275.977us     275.977us      39.871us         8.91%      39.871us      39.871us             1  
                                    aten::empty_strided         3.22%      30.181us         3.22%      30.181us       5.030us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        22.44%     210.343us        22.44%     210.343us      23.371us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         1.80%      16.910us         2.35%      22.009us       2.445us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.91%       8.519us         0.91%       8.519us       0.568us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.16%      10.891us         1.16%      10.891us       3.630us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.94%       8.790us         0.94%       8.790us       2.930us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.59%       5.500us         0.71%       6.690us       2.230us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 937.282us
Self CUDA time total: 447.735us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     323.326us      1733.09%     323.326us     323.326us             1  
                                            torch_eager        14.06%     116.944us        99.42%     826.859us     826.859us       0.000us         0.00%      20.544us      20.544us             1  
                                               aten::to         0.72%       5.971us        68.57%     570.283us      95.047us       0.000us         0.00%      13.344us       2.224us             6  
                                         aten::_to_copy         2.68%      22.330us        67.85%     564.312us      94.052us       0.000us         0.00%      13.344us       2.224us             6  
                                            aten::copy_         6.25%      51.969us        61.73%     513.371us      85.562us      11.456us        61.41%      13.344us       2.224us             6  
                                           aten::conv1d         0.66%       5.530us        13.54%     112.622us      37.541us       0.000us         0.00%       7.200us       2.400us             3  
                                      aten::convolution         1.25%      10.420us        12.88%     107.092us      35.697us       0.000us         0.00%       7.200us       2.400us             3  
                                     aten::_convolution         2.52%      20.950us        11.62%      96.672us      32.224us       0.000us         0.00%       7.200us       2.400us             3  
                                aten::_conv_depthwise2d         2.43%      20.241us         7.30%      60.692us      20.231us       7.200us        38.59%       7.200us       2.400us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us       7.200us        38.59%       7.200us       2.400us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       5.888us        31.56%       5.888us       1.963us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       5.568us        29.85%       5.568us       1.856us             3  
                                Activity Buffer Request        31.20%     259.516us        31.20%     259.516us     259.516us       1.888us        10.12%       1.888us       1.888us             1  
                                    aten::empty_strided         3.44%      28.611us         3.44%      28.611us       4.768us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        26.78%     222.677us        26.78%     222.677us      24.742us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         2.11%      17.509us         2.72%      22.620us       2.513us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         1.03%       8.541us         1.03%       8.541us       0.569us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.40%      11.610us         1.40%      11.610us       3.870us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.97%       8.050us         0.97%       8.050us       2.683us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.68%       5.660us         0.82%       6.830us       2.277us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 831.660us
Self CUDA time total: 18.656us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     351.835us      1817.33%     351.835us     351.835us             1  
                                            torch_eager        14.08%     121.071us        99.43%     854.999us     854.999us       0.000us         0.00%      21.248us      21.248us             1  
                                               aten::to         0.71%       6.141us        68.62%     590.084us      98.347us       0.000us         0.00%      13.312us       2.219us             6  
                                         aten::_to_copy         2.73%      23.503us        67.91%     583.943us      97.324us       0.000us         0.00%      13.312us       2.219us             6  
                                            aten::copy_         6.25%      53.711us        59.45%     511.250us      85.208us      11.424us        59.01%      13.312us       2.219us             6  
                                           aten::conv1d         0.65%       5.630us        13.53%     116.322us      38.774us       0.000us         0.00%       7.936us       2.645us             3  
                                      aten::convolution         1.12%       9.630us        12.87%     110.692us      36.897us       0.000us         0.00%       7.936us       2.645us             3  
                                     aten::_convolution         2.70%      23.181us        11.75%     101.062us      33.687us       0.000us         0.00%       7.936us       2.645us             3  
                                aten::_conv_depthwise2d         2.42%      20.779us         7.31%      62.821us      20.940us       7.936us        40.99%       7.936us       2.645us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us       7.936us        40.99%       7.936us       2.645us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       5.856us        30.25%       5.856us       1.952us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       5.568us        28.76%       5.568us       1.856us             3  
                                Activity Buffer Request        31.74%     272.946us        31.74%     272.946us     272.946us       1.888us         9.75%       1.888us       1.888us             1  
                                    aten::empty_strided         5.72%      49.190us         5.72%      49.190us       8.198us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        24.15%     207.684us        24.15%     207.684us      23.076us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         2.01%      17.302us         2.65%      22.752us       2.528us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         1.04%       8.971us         1.04%       8.971us       0.598us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.16%       9.970us         1.16%       9.970us       3.323us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         1.04%       8.981us         1.04%       8.981us       2.994us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.67%       5.729us         0.81%       6.930us       2.310us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 859.928us
Self CUDA time total: 19.360us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     336.603us      1730.16%     336.603us     336.603us             1  
                                            torch_eager         7.08%     144.052us        99.72%       2.028ms       2.028ms       0.000us         0.00%      21.631us      21.631us             1  
                                               aten::to         0.35%       7.181us        85.48%       1.738ms     289.712us       0.000us         0.00%      14.367us       2.394us             6  
                                         aten::_to_copy         1.19%      24.130us        85.13%       1.731ms     288.515us       0.000us         0.00%      14.367us       2.394us             6  
                                            aten::copy_         2.47%      50.222us        82.49%       1.678ms     279.593us      12.191us        62.66%      14.367us       2.394us             6  
                                           aten::conv1d         0.31%       6.211us         5.80%     117.993us      39.331us       0.000us         0.00%       7.264us       2.421us             3  
                                      aten::convolution         0.49%       9.910us         5.50%     111.782us      37.261us       0.000us         0.00%       7.264us       2.421us             3  
                                     aten::_convolution         1.11%      22.590us         5.01%     101.872us      33.957us       0.000us         0.00%       7.264us       2.421us             3  
                                aten::_conv_depthwise2d         1.01%      20.531us         3.08%      62.662us      20.887us       7.264us        37.34%       7.264us       2.421us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us       7.264us        37.34%       7.264us       2.421us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       6.272us        32.24%       6.272us       2.091us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       5.919us        30.42%       5.919us       1.973us             3  
                                Activity Buffer Request        70.94%       1.443ms        70.94%       1.443ms       1.443ms       2.176us        11.18%       2.176us       2.176us             1  
                                    aten::empty_strided         1.45%      29.401us         1.45%      29.401us       4.900us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        10.12%     205.814us        10.12%     205.814us      22.868us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.93%      19.000us         1.20%      24.310us       2.701us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.43%       8.650us         0.43%       8.650us       0.577us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.52%      10.541us         0.52%      10.541us       3.514us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.51%      10.450us         0.51%      10.450us       3.483us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.30%       6.000us         0.36%       7.230us       2.410us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.034ms
Self CUDA time total: 19.455us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     330.813us      1648.79%     330.813us     330.813us             1  
                                            torch_eager        15.21%     119.271us        99.33%     778.918us     778.918us       0.000us         0.00%      22.208us      22.208us             1  
                                               aten::to         0.72%       5.611us        65.71%     515.312us      85.885us       0.000us         0.00%      14.305us       2.384us             6  
                                         aten::_to_copy         2.87%      22.510us        65.00%     509.701us      84.950us       0.000us         0.00%      14.305us       2.384us             6  
                                            aten::copy_         6.38%      50.021us        58.03%     455.090us      75.848us      12.161us        60.61%      14.305us       2.384us             6  
                                           aten::conv1d         0.69%       5.380us        15.11%     118.473us      39.491us       0.000us         0.00%       7.903us       2.634us             3  
                                      aten::convolution         1.31%      10.292us        14.42%     113.093us      37.698us       0.000us         0.00%       7.903us       2.634us             3  
                                     aten::_convolution         2.98%      23.360us        13.11%     102.801us      34.267us       0.000us         0.00%       7.903us       2.634us             3  
                                aten::_conv_depthwise2d         2.80%      21.952us         8.17%      64.041us      21.347us       7.903us        39.39%       7.903us       2.634us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us       7.903us        39.39%       7.903us       2.634us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       6.240us        31.10%       6.240us       2.080us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       5.921us        29.51%       5.921us       1.974us             3  
                                Activity Buffer Request        27.43%     215.065us        27.43%     215.065us     215.065us       2.144us        10.69%       2.144us       2.144us             1  
                                    aten::empty_strided         4.09%      32.101us         4.09%      32.101us       5.350us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        27.10%     212.544us        27.10%     212.544us      23.616us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         2.14%      16.752us         2.74%      21.481us       2.387us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         1.03%       8.081us         1.03%       8.081us       0.539us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.34%      10.539us         1.34%      10.539us       3.513us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         1.15%       9.010us         1.15%       9.010us       3.003us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.74%       5.829us         0.90%       7.070us       2.357us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 784.179us
Self CUDA time total: 20.064us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     332.020us       920.74%     332.020us     332.020us             1  
                                            torch_eager        14.90%     117.475us        99.36%     783.438us     783.438us       0.000us         0.00%      38.651us      38.651us             1  
                                           aten::conv1d         0.68%       5.380us        14.48%     114.172us      38.057us       0.000us         0.00%      20.190us       6.730us             3  
                                      aten::convolution         1.18%       9.340us        13.80%     108.792us      36.264us       0.000us         0.00%      20.190us       6.730us             3  
                                     aten::_convolution         2.79%      21.980us        12.61%      99.452us      33.151us       0.000us         0.00%      20.190us       6.730us             3  
                                aten::_conv_depthwise2d         2.62%      20.631us         7.92%      62.452us      20.817us      20.190us        55.99%      20.190us       6.730us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us      20.190us        55.99%      20.190us       6.730us             3  
                                               aten::to         0.79%       6.191us        66.58%     524.962us      87.494us       0.000us         0.00%      18.461us       3.077us             6  
                                         aten::_to_copy         2.94%      23.190us        65.79%     518.771us      86.462us       0.000us         0.00%      18.461us       3.077us             6  
                                            aten::copy_         6.46%      50.920us        58.97%     464.950us      77.492us      15.870us        44.01%      18.461us       3.077us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       8.479us        23.51%       8.479us       2.826us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       7.391us        20.50%       7.391us       2.464us             3  
                                Activity Buffer Request        28.77%     226.875us        28.77%     226.875us     226.875us       2.591us         7.19%       2.591us       2.591us             1  
                                    aten::empty_strided         3.88%      30.631us         3.88%      30.631us       5.105us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        26.51%     209.015us        26.51%     209.015us      23.224us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         2.16%      17.000us         2.79%      22.010us       2.446us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         1.07%       8.441us         1.07%       8.441us       0.563us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.46%      11.521us         1.46%      11.521us       3.840us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         1.07%       8.440us         1.07%       8.440us       2.813us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.73%       5.720us         0.87%       6.850us       2.283us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 788.468us
Self CUDA time total: 36.060us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     323.480us       850.14%     323.480us     323.480us             1  
                                            torch_eager        14.55%     115.293us        99.34%     787.399us     787.399us       0.000us         0.00%      40.643us      40.643us             1  
                                           aten::conv1d         0.68%       5.400us        14.40%     114.153us      38.051us       0.000us         0.00%      22.336us       7.445us             3  
                                      aten::convolution         1.16%       9.210us        13.72%     108.753us      36.251us       0.000us         0.00%      22.336us       7.445us             3  
                                     aten::_convolution         2.80%      22.227us        12.56%      99.543us      33.181us       0.000us         0.00%      22.336us       7.445us             3  
                                aten::_conv_depthwise2d         2.52%      20.003us         7.87%      62.343us      20.781us      22.336us        58.70%      22.336us       7.445us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us      22.336us        58.70%      22.336us       7.445us             3  
                                               aten::to         0.94%       7.450us        67.15%     532.253us      88.709us       0.000us         0.00%      18.307us       3.051us             6  
                                         aten::_to_copy         2.90%      22.999us        66.21%     524.803us      87.467us       0.000us         0.00%      18.307us       3.051us             6  
                                            aten::copy_         6.35%      50.294us        59.67%     472.953us      78.825us      15.714us        41.30%      18.307us       3.051us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       8.321us        21.87%       8.321us       2.774us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       7.393us        19.43%       7.393us       2.464us             3  
                                Activity Buffer Request        30.44%     241.286us        30.44%     241.286us     241.286us       2.593us         6.81%       2.593us       2.593us             1  
                                    aten::empty_strided         3.64%      28.851us         3.64%      28.851us       4.808us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        25.80%     204.463us        25.80%     204.463us      22.718us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         2.08%      16.472us         2.71%      21.512us       2.390us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         1.07%       8.500us         1.07%       8.500us       0.567us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.34%      10.600us         1.34%      10.600us       3.533us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         1.09%       8.650us         1.09%       8.650us       2.883us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.71%       5.641us         0.87%       6.891us       2.297us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 792.609us
Self CUDA time total: 38.050us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     336.347us       525.30%     336.347us     336.347us             1  
                                            torch_eager        15.07%     122.512us        99.37%     807.929us     807.929us       0.000us         0.00%      68.125us      68.125us             1  
                                           aten::conv1d         0.67%       5.471us        14.06%     114.283us      38.094us       0.000us         0.00%      41.663us      13.888us             3  
                                      aten::convolution         1.12%       9.100us        13.38%     108.812us      36.271us       0.000us         0.00%      41.663us      13.888us             3  
                                     aten::_convolution         2.67%      21.730us        12.26%      99.712us      33.237us       0.000us         0.00%      41.663us      13.888us             3  
                                aten::_conv_depthwise2d         2.50%      20.351us         7.64%      62.102us      20.701us      41.663us        65.07%      41.663us      13.888us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us      41.663us        65.07%      41.663us      13.888us             3  
                                               aten::to         0.85%       6.891us        66.79%     543.023us      90.504us       0.000us         0.00%      26.462us       4.410us             6  
                                         aten::_to_copy         2.85%      23.179us        65.94%     536.132us      89.355us       0.000us         0.00%      26.462us       4.410us             6  
                                            aten::copy_         6.23%      50.622us        59.33%     482.372us      80.395us      22.367us        34.93%      26.462us       4.410us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      11.935us        18.64%      11.935us       3.978us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      10.432us        16.29%      10.432us       3.477us             3  
                                Activity Buffer Request        30.75%     250.036us        30.75%     250.036us     250.036us       4.095us         6.40%       4.095us       4.095us             1  
                                    aten::empty_strided         3.76%      30.581us         3.76%      30.581us       5.097us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        25.31%     205.766us        25.31%     205.766us      22.863us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         2.22%      18.070us         2.95%      23.970us       2.663us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         1.13%       9.220us         1.13%       9.220us       0.615us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.19%       9.690us         1.19%       9.690us       3.230us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.99%       8.009us         0.99%       8.009us       2.670us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.69%       5.650us         0.84%       6.800us       2.267us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 813.049us
Self CUDA time total: 64.030us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     325.852us       467.34%     325.852us     325.852us             1  
                                            torch_eager        14.81%     118.946us        99.39%     798.399us     798.399us       0.000us         0.00%      73.789us      73.789us             1  
                                           aten::conv1d         0.70%       5.610us        14.26%     114.513us      38.171us       0.000us         0.00%      47.294us      15.765us             3  
                                      aten::convolution         1.17%       9.382us        13.56%     108.903us      36.301us       0.000us         0.00%      47.294us      15.765us             3  
                                     aten::_convolution         2.75%      22.119us        12.39%      99.521us      33.174us       0.000us         0.00%      47.294us      15.765us             3  
                                aten::_conv_depthwise2d         2.53%      20.361us         7.64%      61.351us      20.450us      47.294us        67.83%      47.294us      15.765us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us      47.294us        67.83%      47.294us      15.765us             3  
                                               aten::to         0.79%       6.379us        67.07%     538.781us      89.797us       0.000us         0.00%      26.495us       4.416us             6  
                                         aten::_to_copy         2.79%      22.401us        66.28%     532.402us      88.734us       0.000us         0.00%      26.495us       4.416us             6  
                                            aten::copy_         6.32%      50.749us        59.88%     480.970us      80.162us      22.431us        32.17%      26.495us       4.416us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      12.031us        17.25%      12.031us       4.010us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      10.400us        14.92%      10.400us       3.467us             3  
                                Activity Buffer Request        31.04%     249.326us        31.04%     249.326us     249.326us       4.064us         5.83%       4.064us       4.064us             1  
                                    aten::empty_strided         3.61%      29.031us         3.61%      29.031us       4.839us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        25.24%     202.725us        25.24%     202.725us      22.525us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         2.08%      16.713us         2.70%      21.680us       2.409us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         1.07%       8.568us         1.07%       8.568us       0.571us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.32%      10.569us         1.32%      10.569us       3.523us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         1.07%       8.591us         1.07%       8.591us       2.864us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.82%       6.609us         1.00%       8.010us       2.670us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 803.260us
Self CUDA time total: 69.725us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     346.241us       186.20%     346.241us     346.241us             1  
                                            torch_eager         7.14%     146.093us        99.75%       2.041ms       2.041ms       0.000us         0.00%     196.000us     196.000us             1  
                                           aten::conv1d         0.31%       6.251us         5.79%     118.533us      39.511us       0.000us         0.00%     133.248us      44.416us             3  
                                      aten::convolution         0.51%      10.359us         5.49%     112.282us      37.427us       0.000us         0.00%     133.248us      44.416us             3  
                                     aten::_convolution         1.19%      24.280us         4.98%     101.923us      33.974us       0.000us         0.00%     133.248us      44.416us             3  
                                aten::_conv_depthwise2d         1.04%      21.191us         3.02%      61.762us      20.587us     133.248us        71.66%     133.248us      44.416us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us     133.248us        71.66%     133.248us      44.416us             3  
                                               aten::to         0.32%       6.489us        85.48%       1.749ms     291.443us       0.000us         0.00%      62.752us      10.459us             6  
                                         aten::_to_copy         1.16%      23.832us        85.16%       1.742ms     290.362us       0.000us         0.00%      62.752us      10.459us             6  
                                            aten::copy_         2.53%      51.751us        82.59%       1.689ms     281.575us      52.704us        28.34%      62.752us      10.459us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      29.760us        16.00%      29.760us       9.920us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      22.944us        12.34%      22.944us       7.648us             3  
                                Activity Buffer Request        71.04%       1.453ms        71.04%       1.453ms       1.453ms      10.048us         5.40%      10.048us      10.048us             1  
                                    aten::empty_strided         1.41%      28.891us         1.41%      28.891us       4.815us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        10.04%     205.324us        10.04%     205.324us      22.814us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.86%      17.550us         1.11%      22.661us       2.518us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.42%       8.641us         0.42%       8.641us       0.576us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.53%      10.750us         0.53%      10.750us       3.583us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.44%       9.021us         0.44%       9.021us       3.007us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.30%       6.220us         0.37%       7.470us       2.490us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.046ms
Self CUDA time total: 185.952us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us     341.281us       162.73%     341.281us     341.281us             1  
                                            torch_eager        15.25%     117.693us        99.36%     766.878us     766.878us       0.000us         0.00%     223.168us     223.168us             1  
                                           aten::conv1d         0.68%       5.279us        14.60%     112.702us      37.567us       0.000us         0.00%     154.016us      51.339us             3  
                                      aten::convolution         1.24%       9.560us        13.92%     107.423us      35.808us       0.000us         0.00%     154.016us      51.339us             3  
                                     aten::_convolution         2.67%      20.611us        12.68%      97.863us      32.621us       0.000us         0.00%     154.016us      51.339us             3  
                                aten::_conv_depthwise2d         2.74%      21.170us         8.14%      62.852us      20.951us     154.016us        73.44%     154.016us      51.339us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us     154.016us        73.44%     154.016us      51.339us             3  
                                               aten::to         0.75%       5.750us        66.22%     511.121us      85.187us       0.000us         0.00%      69.152us      11.525us             6  
                                         aten::_to_copy         2.86%      22.060us        65.48%     505.371us      84.228us       0.000us         0.00%      69.152us      11.525us             6  
                                            aten::copy_         6.81%      52.581us        58.74%     453.391us      75.565us      55.712us        26.56%      69.152us      11.525us             6  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      32.705us        15.59%      32.705us      10.902us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      23.007us        10.97%      23.007us       7.669us             3  
                                Activity Buffer Request        28.38%     219.045us        28.38%     219.045us     219.045us      13.440us         6.41%      13.440us      13.440us             1  
                                    aten::empty_strided         3.88%      29.920us         3.88%      29.920us       4.987us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        26.50%     204.546us        26.50%     204.546us      22.727us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         2.10%      16.212us         2.69%      20.781us       2.309us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         1.01%       7.798us         1.01%       7.798us       0.520us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.33%      10.250us         1.33%      10.250us       3.417us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         1.12%       8.651us         1.12%       8.651us       2.884us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.71%       5.470us         0.87%       6.730us       2.243us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 771.798us
Self CUDA time total: 209.728us



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         6.80%     123.239us        52.01%     942.341us     942.341us       0.000us         0.00%       1.520ms       1.520ms             1  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us       1.422ms       100.42%       1.422ms       1.422ms             1  
                                               aten::to         0.34%       6.231us        36.98%     669.957us     111.660us       0.000us         0.00%     824.603us     137.434us             6  
                                         aten::_to_copy         1.55%      28.122us        36.63%     663.726us     110.621us       0.000us         0.00%     824.603us     137.434us             6  
                                            aten::copy_         2.95%      53.430us        24.71%     447.748us      74.625us     720.572us        50.89%     824.603us     137.434us             6  
                                           aten::conv1d         0.34%       6.111us         6.66%     120.744us      40.248us       0.000us         0.00%     695.357us     231.786us             3  
                                      aten::convolution         0.56%      10.201us         6.33%     114.633us      38.211us       0.000us         0.00%     695.357us     231.786us             3  
                                     aten::_convolution         1.36%      24.689us         5.76%     104.432us      34.811us       0.000us         0.00%     695.357us     231.786us             3  
                                aten::_conv_depthwise2d         1.22%      22.151us         3.50%      63.431us      21.144us     695.357us        49.11%     695.357us     231.786us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us     695.357us        49.11%     695.357us     231.786us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     407.263us        28.76%     407.263us     135.754us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us     313.309us        22.13%     313.309us     104.436us             3  
                                Activity Buffer Request        11.46%     207.684us        11.46%     207.684us     207.684us     104.031us         7.35%     104.031us     104.031us             1  
                                    aten::empty_strided         2.02%      36.603us        10.37%     187.856us      31.309us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        11.47%     207.874us        11.47%     207.874us      23.097us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.99%      18.011us         1.30%      23.581us       2.620us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.51%       9.270us         0.51%       9.270us       0.618us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.60%      10.830us         0.60%      10.830us       3.610us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.51%       9.210us         0.51%       9.210us       3.070us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.35%       6.401us         0.43%       7.711us       2.570us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.812ms
Self CUDA time total: 1.416ms



======================================================================
PROFILE TRACE: torch_eager | cuda_B4_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_eager         6.34%     116.852us        42.22%     778.698us     778.698us       0.000us         0.00%       1.501ms       1.501ms             1  
                                            torch_eager         0.00%       0.000us         0.00%       0.000us       0.000us       1.431ms       100.41%       1.431ms       1.431ms             1  
                                               aten::to         0.32%       5.860us        28.27%     521.352us      86.892us       0.000us         0.00%     762.491us     127.082us             6  
                                         aten::_to_copy         1.22%      22.580us        27.95%     515.492us      85.915us       0.000us         0.00%     762.491us     127.082us             6  
                                            aten::copy_         2.82%      52.081us        25.06%     462.219us      77.037us     686.779us        48.19%     762.491us     127.082us             6  
                                           aten::conv1d         0.30%       5.601us         6.22%     114.743us      38.248us       0.000us         0.00%     738.362us     246.121us             3  
                                      aten::convolution         0.51%       9.380us         5.92%     109.142us      36.381us       0.000us         0.00%     738.362us     246.121us             3  
                                     aten::_convolution         1.20%      22.202us         5.41%      99.762us      33.254us       0.000us         0.00%     738.362us     246.121us             3  
                                aten::_conv_depthwise2d         1.10%      20.251us         3.34%      61.660us      20.553us     738.362us        51.81%     738.362us     246.121us             3  
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us     738.362us        51.81%     738.362us     246.121us             3  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     398.333us        27.95%     398.333us     132.778us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us     288.446us        20.24%     288.446us      96.149us             3  
                                Activity Buffer Request        12.41%     228.855us        12.41%     228.855us     228.855us      75.712us         5.31%      75.712us      75.712us             1  
                                    aten::empty_strided         1.66%      30.693us         1.66%      30.693us       5.115us       0.000us         0.00%       0.000us       0.000us             6  
                                       cudaLaunchKernel        11.01%     202.993us        11.01%     202.993us      22.555us       0.000us         0.00%       0.000us       0.000us             9  
                                        aten::unsqueeze         0.90%      16.659us         1.18%      21.780us       2.420us       0.000us         0.00%       0.000us       0.000us             9  
                                       aten::as_strided         0.46%       8.512us         0.46%       8.512us       0.567us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         0.53%       9.739us         0.53%       9.739us       3.246us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::resize_         0.54%       9.960us         0.54%       9.960us       3.320us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::squeeze         0.33%       6.150us         0.40%       7.440us       2.480us       0.000us         0.00%       0.000us       0.000us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.844ms
Self CUDA time total: 1.425ms


impl                     wl                  p50(ms)  ok
torch_eager              cuda_B2_D2048_S128_W2     0.08  True
torch_eager              cuda_B2_D2048_S128_W4     0.08  True
torch_eager              cuda_B2_D2048_S2048_W2     0.15  True
torch_eager              cuda_B2_D2048_S2048_W4     0.16  True
torch_eager              cuda_B2_D2048_S512_W2     0.08  True
torch_eager              cuda_B2_D2048_S512_W4     0.08  True
torch_eager              cuda_B2_D64_S128_W2     0.07  True
torch_eager              cuda_B2_D64_S128_W4     0.09  True
torch_eager              cuda_B2_D64_S2048_W2     0.08  True
torch_eager              cuda_B2_D64_S2048_W4     0.08  True
torch_eager              cuda_B2_D64_S512_W2     0.09  True
torch_eager              cuda_B2_D64_S512_W4     0.08  True
torch_eager              cuda_B4_D2048_S128_W2     0.08  True
torch_eager              cuda_B4_D2048_S128_W4     0.08  True
torch_eager              cuda_B4_D2048_S2048_W2     0.49  True
torch_eager              cuda_B4_D2048_S2048_W4     0.50  True
torch_eager              cuda_B4_D2048_S512_W2     0.09  True
torch_eager              cuda_B4_D2048_S512_W4     0.10  True
torch_eager              cuda_B4_D64_S128_W2     0.08  True
torch_eager              cuda_B4_D64_S128_W4     0.08  True
torch_eager              cuda_B4_D64_S2048_W2     0.08  True
torch_eager              cuda_B4_D64_S2048_W4     0.08  True
torch_eager              cuda_B4_D64_S512_W2     0.08  True
torch_eager              cuda_B4_D64_S512_W4     0.08  True

Artifacts:

causal_conv1d.jsonl