HF Kernels - Causal Conv1D

GPU Info

Cell: nv | 0.21s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Wed Oct 29 15:50:16 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   29C    P0             78W /  350W |       0MiB /  46068MiB |     18%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Causal Conv1D Benchmark

Cell: benchmark | 9.51s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
import sys
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the causal conv1d kernel from the Hugging Face Hub
causal_conv1d = get_kernel("kernels-community/causal-conv1d")


def hf_kernels_causal_conv1d(input_tensor, weight, bias):
    """Adapter passed to run_benchmark: forwards its arguments to the Hub kernel."""
    return causal_conv1d.causal_conv1d_fn(input_tensor, weight, bias)


run_benchmark(
    kernel_type=KernelTypeEnum.CAUSAL_CONV1D,
    impl_name="hf_kernels_causal_conv1d",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_causal_conv1d,
)
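
For readers unfamiliar with the operation being timed, the following is a minimal pure-Python sketch of a depthwise causal 1D convolution. It assumes the Hub kernel follows the convention of the upstream causal-conv1d reference (input laid out as batch, dim, seqlen; one filter of length `width` per channel; left zero-padding of `width - 1`; no activation). The name `causal_conv1d_reference` is hypothetical and is not part of the `kernels` package or the benchmark tools.

```python
def causal_conv1d_reference(x, weight, bias=None):
    """Depthwise causal 1D convolution on nested lists (illustrative only).

    x:      [batch][dim][seqlen]
    weight: [dim][width]   (assumed: the last tap multiplies the current step)
    bias:   [dim] or None
    """
    out = []
    for sample in x:
        out_sample = []
        for d, channel in enumerate(sample):
            width = len(weight[d])
            b = bias[d] if bias is not None else 0.0
            row = []
            for t in range(len(channel)):
                acc = b
                for k in range(width):
                    # Index into the implicitly left-padded sequence;
                    # positions before the start contribute zero.
                    src = t - (width - 1) + k
                    if src >= 0:
                        acc += weight[d][k] * channel[src]
                row.append(acc)
            out_sample.append(row)
        out.append(out_sample)
    return out


# One sample, one channel, width-2 filter.
y = causal_conv1d_reference([[[1, 2, 3, 4]]], [[1, 10]], [0])
print(y)
```

Each output position depends only on the current and earlier inputs, which is what makes the convolution causal. A sketch like this is only meant for checking small cases by hand, not for timing comparisons.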
Running causal_conv1d benchmark on cuda with 24 workloads.

======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     146.174us      3568.70%     146.174us     146.174us             1  
                               hf_kernels_causal_conv1d         8.17%     151.282us        99.60%       1.845ms       1.845ms       0.000us         0.00%       5.536us       5.536us             1  
                                         CausalConv1dFn         5.96%     110.474us        91.44%       1.694ms     564.683us       0.000us         0.00%       5.536us       1.845us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.25%      23.111us        81.80%       1.516ms     505.182us       4.096us       100.00%       5.536us       1.845us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.096us       100.00%       4.096us       1.365us             3  
                                Activity Buffer Request        78.05%       1.446ms        78.05%       1.446ms       1.446ms       1.440us        35.16%       1.440us       1.440us             1  
                                       aten::empty_like         1.06%      19.700us         3.67%      68.031us      22.677us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.61%      48.331us         2.61%      48.331us      16.110us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.50%      46.381us         2.50%      46.381us      15.460us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.40%       7.370us         0.40%       7.370us       7.370us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.853ms
Self CUDA time total: 4.096us
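
The trace above can be read with a small helper. This sketch assumes that the workload label encodes the device followed by batch size (`B`), channel dimension (`D`), sequence length (`S`), and filter width (`W`); that reading is inferred from the label format and is not confirmed by the benchmark tools. The function names are hypothetical. The per-launch average is derived from the reported "Self CUDA time total" and the "# of Calls" column.

```python
import re


def parse_workload(name):
    """Split a label such as 'cuda_B2_D64_S128_W2' into its fields.

    The field meanings (batch, dim, seqlen, width) are an assumption.
    """
    m = re.fullmatch(r"(\w+?)_B(\d+)_D(\d+)_S(\d+)_W(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized workload name: {name!r}")
    device, batch, dim, seqlen, width = m.groups()
    return {
        "device": device,
        "batch": int(batch),
        "dim": int(dim),
        "seqlen": int(seqlen),
        "width": int(width),
    }


def avg_kernel_time_us(total_cuda_us, num_calls):
    """Average device time per kernel launch, in microseconds."""
    return total_cuda_us / num_calls


workload = parse_workload("cuda_B2_D64_S128_W2")
# Values copied from the trace: 4.096us total over 3 launches.
avg_us = avg_kernel_time_us(4.096, 3)
print(workload, round(avg_us, 3))
```

The result agrees with the "CUDA time avg" column of the `causal_conv1d_fwd_kernel` row (1.365us). In this trace the CPU-side total is dominated by the one-off "Activity Buffer Request" profiler entry rather than by the kernel launches themselves.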



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     144.319us      3789.89%     144.319us     144.319us             1  
                               hf_kernels_causal_conv1d         4.94%      83.592us        99.69%       1.687ms       1.687ms       0.000us         0.00%       5.088us       5.088us             1  
                                         CausalConv1dFn         5.57%      94.202us        94.76%       1.604ms     534.586us       0.000us         0.00%       5.088us       1.696us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.53%      25.920us        87.50%       1.481ms     493.624us       3.808us       100.00%       5.088us       1.696us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.808us       100.00%       3.808us       1.269us             3  
                                Activity Buffer Request        84.14%       1.424ms        84.14%       1.424ms       1.424ms       1.280us        33.61%       1.280us       1.280us             1  
                                       aten::empty_like         0.45%       7.561us         1.69%      28.682us       9.561us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.25%      21.121us         1.25%      21.121us       7.040us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.83%      30.901us         1.83%      30.901us      10.300us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.170us         0.31%       5.170us       5.170us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.693ms
Self CUDA time total: 3.808us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     118.942us      3149.95%     118.942us     118.942us             1  
                               hf_kernels_causal_conv1d         4.70%      79.942us        99.69%       1.694ms       1.694ms       0.000us         0.00%       5.024us       5.024us             1  
                                         CausalConv1dFn         4.32%      73.340us        94.98%       1.614ms     538.022us       0.000us         0.00%       5.024us       1.675us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.40%      23.852us        89.01%       1.513ms     504.182us       3.776us       100.00%       5.024us       1.675us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.776us       100.00%       3.776us       1.259us             3  
                                Activity Buffer Request        85.86%       1.459ms        85.86%       1.459ms       1.459ms       1.248us        33.05%       1.248us       1.248us             1  
                                       aten::empty_like         0.44%       7.502us         1.66%      28.182us       9.394us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.22%      20.680us         1.22%      20.680us       6.893us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.75%      29.690us         1.75%      29.690us       9.897us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.340us         0.31%       5.340us       5.340us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.699ms
Self CUDA time total: 3.776us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.814us      3251.63%     122.814us     122.814us             1  
                               hf_kernels_causal_conv1d         4.64%      85.642us        99.73%       1.840ms       1.840ms       0.000us         0.00%       5.025us       5.025us             1  
                                         CausalConv1dFn         3.90%      72.023us        95.09%       1.754ms     584.757us       0.000us         0.00%       5.025us       1.675us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.39%      25.651us        89.62%       1.653ms     551.112us       3.777us       100.00%       5.025us       1.675us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.777us       100.00%       3.777us       1.259us             3  
                                Activity Buffer Request        78.74%       1.453ms        78.74%       1.453ms       1.453ms       1.248us        33.04%       1.248us       1.248us             1  
                                       aten::empty_like         0.42%       7.802us         1.57%      28.911us       9.637us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.14%      21.109us         1.14%      21.109us       7.036us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.48%     174.913us         9.48%     174.913us      58.304us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       5.000us         0.27%       5.000us       5.000us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.845ms
Self CUDA time total: 3.777us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     123.008us      2597.30%     123.008us     123.008us             1  
                               hf_kernels_causal_conv1d         4.59%      83.953us        99.73%       1.825ms       1.825ms       0.000us         0.00%       6.337us       6.337us             1  
                                         CausalConv1dFn         3.99%      73.081us        95.14%       1.741ms     580.330us       0.000us         0.00%       6.337us       2.112us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.48%      27.090us        89.51%       1.638ms     546.026us       4.736us       100.00%       6.337us       2.112us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.736us       100.00%       4.736us       1.579us             3  
                                Activity Buffer Request        78.87%       1.443ms        78.87%       1.443ms       1.443ms       1.601us        33.80%       1.601us       1.601us             1  
                                       aten::empty_like         0.45%       8.280us         1.63%      29.831us       9.944us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.18%      21.551us         1.18%      21.551us       7.184us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.16%     167.714us         9.16%     167.714us      55.905us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       5.030us         0.27%       5.030us       5.030us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.830ms
Self CUDA time total: 4.736us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     113.566us      2381.84%     113.566us     113.566us             1  
                               hf_kernels_causal_conv1d        13.06%      81.391us        99.15%     617.944us     617.944us       0.000us         0.00%       6.400us       6.400us             1  
                                         CausalConv1dFn        11.13%      69.381us        86.09%     536.553us     178.851us       0.000us         0.00%       6.400us       2.133us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.09%      25.520us        70.57%     439.840us     146.613us       4.768us       100.00%       6.400us       2.133us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.768us       100.00%       4.768us       1.589us             3  
                                Activity Buffer Request        39.92%     248.796us        39.92%     248.796us     248.796us       1.632us        34.23%       1.632us       1.632us             1  
                                       aten::empty_like         1.16%       7.221us         4.39%      27.332us       9.111us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.23%      20.111us         3.23%      20.111us       6.704us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        26.56%     165.524us        26.56%     165.524us      55.175us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.85%       5.280us         0.85%       5.280us       5.280us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 623.224us
Self CUDA time total: 4.768us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     120.383us      1119.53%     120.383us     120.383us             1  
                               hf_kernels_causal_conv1d         4.38%      80.811us        99.69%       1.838ms       1.838ms       0.000us         0.00%      14.338us      14.338us             1  
                                         CausalConv1dFn         3.88%      71.502us        95.31%       1.758ms     585.854us       0.000us         0.00%      14.338us       4.779us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.42%      26.240us        89.89%       1.658ms     552.523us      10.753us       100.00%      14.338us       4.779us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.753us       100.00%      10.753us       3.584us             3  
                                Activity Buffer Request        79.47%       1.465ms        79.47%       1.465ms       1.465ms       3.585us        33.34%       3.585us       3.585us             1  
                                       aten::empty_like         0.42%       7.711us         1.54%      28.491us       9.497us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.13%      20.780us         1.13%      20.780us       6.927us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.00%     165.884us         9.00%     165.884us      55.295us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.720us         0.31%       5.720us       5.720us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.844ms
Self CUDA time total: 10.753us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     115.901us      1062.24%     115.901us     115.901us             1  
                               hf_kernels_causal_conv1d        13.49%      81.452us        99.17%     598.664us     598.664us       0.000us         0.00%      14.591us      14.591us             1  
                                         CausalConv1dFn        11.49%      69.393us        85.68%     517.212us     172.404us       0.000us         0.00%      14.591us       4.864us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.25%      25.660us        69.54%     419.779us     139.926us      10.911us       100.00%      14.591us       4.864us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.911us       100.00%      10.911us       3.637us             3  
                                Activity Buffer Request        38.07%     229.795us        38.07%     229.795us     229.795us       3.680us        33.73%       3.680us       3.680us             1  
                                       aten::empty_like         1.23%       7.430us         4.64%      28.040us       9.347us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.41%      20.610us         3.41%      20.610us       6.870us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        27.22%     164.324us        27.22%     164.324us      54.775us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.83%       5.020us         0.83%       5.020us       5.020us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 603.684us
Self CUDA time total: 10.911us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.031us      1126.74%     124.031us     124.031us             1  
                               hf_kernels_causal_conv1d         4.38%      80.211us        99.73%       1.825ms       1.825ms       0.000us         0.00%      14.688us      14.688us             1  
                                         CausalConv1dFn         3.92%      71.693us        95.35%       1.744ms     581.490us       0.000us         0.00%      14.688us       4.896us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.35%      24.770us        89.82%       1.643ms     547.796us      11.008us       100.00%      14.688us       4.896us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      11.008us       100.00%      11.008us       3.669us             3  
                                Activity Buffer Request        79.44%       1.453ms        79.44%       1.453ms       1.453ms       3.680us        33.43%       3.680us       3.680us             1  
                                       aten::empty_like         0.44%       8.110us         1.61%      29.390us       9.797us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.16%      21.280us         1.16%      21.280us       7.093us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.03%     165.165us         9.03%     165.165us      55.055us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       4.921us         0.27%       4.921us       4.921us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.830ms
Self CUDA time total: 11.008us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.078us      1080.72%     122.078us     122.078us             1  
                               hf_kernels_causal_conv1d        13.22%      78.432us        99.12%     587.944us     587.944us       0.000us         0.00%      15.072us      15.072us             1  
                                         CausalConv1dFn        12.12%      71.922us        85.89%     509.512us     169.837us       0.000us         0.00%      15.072us       5.024us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.25%      25.220us        69.07%     409.719us     136.573us      11.296us       100.00%      15.072us       5.024us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      11.296us       100.00%      11.296us       3.765us             3  
                                Activity Buffer Request        37.46%     222.215us        37.46%     222.215us     222.215us       3.776us        33.43%       3.776us       3.776us             1  
                                       aten::empty_like         1.25%       7.430us         4.70%      27.871us       9.290us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.45%      20.441us         3.45%      20.441us       6.814us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        27.36%     162.284us        27.36%     162.284us      54.095us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.88%       5.240us         0.88%       5.240us       5.240us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 593.184us
Self CUDA time total: 11.296us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     127.648us       252.30%     127.648us     127.648us             1  
                               hf_kernels_causal_conv1d         4.31%      79.103us        99.73%       1.830ms       1.830ms       0.000us         0.00%      84.257us      84.257us             1  
                                         CausalConv1dFn         3.94%      72.391us        95.42%       1.751ms     583.740us       0.000us         0.00%      84.257us      28.086us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.47%      26.941us        89.93%       1.650ms     550.139us      50.593us       100.00%      84.257us      28.086us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      50.593us       100.00%      50.593us      16.864us             3  
                                Activity Buffer Request        79.45%       1.458ms        79.45%       1.458ms       1.458ms      33.664us        66.54%      33.664us      33.664us             1  
                                       aten::empty_like         0.41%       7.590us         1.55%      28.411us       9.470us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.13%      20.821us         1.13%      20.821us       6.940us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.01%     165.403us         9.01%     165.403us      55.134us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       4.880us         0.27%       4.880us       4.880us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.835ms
Self CUDA time total: 50.593us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.347us       241.52%     124.347us     124.347us             1  
                               hf_kernels_causal_conv1d        13.88%      78.022us        99.09%     557.033us     557.033us       0.000us         0.00%      85.980us      85.980us             1  
                                         CausalConv1dFn        12.54%      70.483us        85.21%     479.011us     159.670us       0.000us         0.00%      85.980us      28.660us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.34%      24.401us        67.64%     380.208us     126.736us      51.486us       100.00%      85.980us      28.660us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      51.486us       100.00%      51.486us      17.162us             3  
                                Activity Buffer Request        34.82%     195.764us        34.82%     195.764us     195.764us      34.494us        67.00%      34.494us      34.494us             1  
                                       aten::empty_like         1.33%       7.470us         5.04%      28.320us       9.440us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.71%      20.850us         3.71%      20.850us       6.950us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        28.47%     160.043us        28.47%     160.043us      53.348us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.91%       5.110us         0.91%       5.110us       5.110us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 562.143us
Self CUDA time total: 51.486us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     121.728us      3142.18%     121.728us     121.728us             1  
                               hf_kernels_causal_conv1d         4.20%      76.603us        99.72%       1.818ms       1.818ms       0.000us         0.00%       5.123us       5.123us             1  
                                         CausalConv1dFn         3.96%      72.231us        95.52%       1.742ms     580.506us       0.000us         0.00%       5.123us       1.708us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.49%      27.119us        89.93%       1.640ms     546.545us       3.874us       100.00%       5.123us       1.708us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.874us       100.00%       3.874us       1.291us             3  
                                Activity Buffer Request        79.71%       1.453ms        79.71%       1.453ms       1.453ms       1.249us        32.24%       1.249us       1.249us             1  
                                       aten::empty_like         0.42%       7.681us         1.63%      29.652us       9.884us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.21%      21.971us         1.21%      21.971us       7.324us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.74%     159.334us         8.74%     159.334us      53.111us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       5.020us         0.28%       5.020us       5.020us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.823ms
Self CUDA time total: 3.874us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     112.862us      2867.43%     112.862us     112.862us             1  
                               hf_kernels_causal_conv1d        13.92%      73.542us        98.94%     522.552us     522.552us       0.000us         0.00%       5.216us       5.216us             1  
                                         CausalConv1dFn        13.17%      69.571us        85.02%     449.010us     149.670us       0.000us         0.00%       5.216us       1.739us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.04%      26.641us        66.59%     351.668us     117.223us       3.936us       100.00%       5.216us       1.739us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.936us       100.00%       3.936us       1.312us             3  
                                Activity Buffer Request        31.20%     164.773us        31.20%     164.773us     164.773us       1.280us        32.52%       1.280us       1.280us             1  
                                       aten::empty_like         1.39%       7.351us         5.26%      27.771us       9.257us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.87%      20.420us         3.87%      20.420us       6.807us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        30.34%     160.254us        30.34%     160.254us      53.418us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.06%       5.590us         1.06%       5.590us       5.590us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 528.142us
Self CUDA time total: 3.936us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     117.630us      2893.73%     117.630us     117.630us             1  
                               hf_kernels_causal_conv1d         4.22%      76.492us        99.73%       1.809ms       1.809ms       0.000us         0.00%       5.441us       5.441us             1  
                                         CausalConv1dFn         3.89%      70.602us        95.52%       1.732ms     577.480us       0.000us         0.00%       5.441us       1.814us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.32%      23.990us        90.04%       1.633ms     544.346us       4.065us       100.00%       5.441us       1.814us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.065us       100.00%       4.065us       1.355us             3  
                                Activity Buffer Request        79.17%       1.436ms        79.17%       1.436ms       1.436ms       1.376us        33.85%       1.376us       1.376us             1  
                                       aten::empty_like         0.43%       7.870us         1.59%      28.801us       9.600us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.15%      20.931us         1.15%      20.931us       6.977us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.54%     173.024us         9.54%     173.024us      57.675us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       4.840us         0.27%       4.840us       4.840us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.814ms
Self CUDA time total: 4.065us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     112.957us      2780.14%     112.957us     112.957us             1  
                               hf_kernels_causal_conv1d        13.63%      77.442us        99.02%     562.553us     562.553us       0.000us         0.00%       5.439us       5.439us             1  
                                         CausalConv1dFn        12.09%      68.663us        85.39%     485.111us     161.704us       0.000us         0.00%       5.439us       1.813us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.90%      27.850us        68.41%     388.648us     129.549us       4.063us       100.00%       5.439us       1.813us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.063us       100.00%       4.063us       1.354us             3  
                                Activity Buffer Request        32.06%     182.124us        32.06%     182.124us     182.124us       1.376us        33.87%       1.376us       1.376us             1  
                                       aten::empty_like         1.28%       7.270us         4.89%      27.800us       9.267us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.61%      20.530us         3.61%      20.530us       6.843us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        31.45%     178.674us        31.45%     178.674us      59.558us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.98%       5.590us         0.98%       5.590us       5.590us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 568.143us
Self CUDA time total: 4.063us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     118.623us      2193.88%     118.623us     118.623us             1  
                               hf_kernels_causal_conv1d         4.12%      74.582us        99.72%       1.807ms       1.807ms       0.000us         0.00%       7.231us       7.231us             1  
                                         CausalConv1dFn         3.94%      71.361us        95.60%       1.732ms     577.313us       0.000us         0.00%       7.231us       2.410us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.39%      25.271us        90.02%       1.631ms     543.639us       5.407us       100.00%       7.231us       2.410us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.407us       100.00%       5.407us       1.802us             3  
                                Activity Buffer Request        79.36%       1.438ms        79.36%       1.438ms       1.438ms       1.824us        33.73%       1.824us       1.824us             1  
                                       aten::empty_like         0.43%       7.860us         1.64%      29.661us       9.887us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.20%      21.801us         1.20%      21.801us       7.267us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.27%     167.954us         9.27%     167.954us      55.985us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       5.140us         0.28%       5.140us       5.140us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.812ms
Self CUDA time total: 5.407us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     112.607us      2057.50%     112.607us     112.607us             1  
                               hf_kernels_causal_conv1d        13.70%      73.872us        99.01%     534.033us     534.033us       0.000us         0.00%       7.361us       7.361us             1  
                                         CausalConv1dFn        13.12%      70.792us        85.31%     460.161us     153.387us       0.000us         0.00%       7.361us       2.454us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.78%      25.770us        67.08%     361.838us     120.613us       5.473us       100.00%       7.361us       2.454us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.473us       100.00%       5.473us       1.824us             3  
                                Activity Buffer Request        31.74%     171.214us        31.74%     171.214us     171.214us       1.888us        34.50%       1.888us       1.888us             1  
                                       aten::empty_like         1.37%       7.381us         5.10%      27.531us       9.177us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.74%      20.150us         3.74%      20.150us       6.717us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        30.56%     164.854us        30.56%     164.854us      54.951us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.99%       5.340us         0.99%       5.340us       5.340us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 539.373us
Self CUDA time total: 5.473us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     123.390us       706.22%     123.390us     123.390us             1  
                               hf_kernels_causal_conv1d         4.18%      75.923us        99.74%       1.812ms       1.812ms       0.000us         0.00%      23.328us      23.328us             1  
                                         CausalConv1dFn         3.97%      72.132us        95.56%       1.736ms     578.683us       0.000us         0.00%      23.328us       7.776us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.36%      24.789us        89.99%       1.635ms     544.959us      17.472us       100.00%      23.328us       7.776us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.472us       100.00%      17.472us       5.824us             3  
                                Activity Buffer Request        79.65%       1.447ms        79.65%       1.447ms       1.447ms       5.856us        33.52%       5.856us       5.856us             1  
                                       aten::empty_like         0.45%       8.169us         1.60%      29.040us       9.680us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.15%      20.871us         1.15%      20.871us       6.957us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.97%     163.034us         8.97%     163.034us      54.345us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.26%       4.790us         0.26%       4.790us       4.790us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.817ms
Self CUDA time total: 17.472us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     121.788us       680.84%     121.788us     121.788us             1  
                               hf_kernels_causal_conv1d        14.15%      75.583us        99.15%     529.782us     529.782us       0.000us         0.00%      23.904us      23.904us             1  
                                         CausalConv1dFn        14.61%      78.041us        85.01%     454.199us     151.400us       0.000us         0.00%      23.904us       7.968us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.06%      27.012us        64.90%     346.788us     115.596us      17.888us       100.00%      23.904us       7.968us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.888us       100.00%      17.888us       5.963us             3  
                                Activity Buffer Request        29.57%     158.003us        29.57%     158.003us     158.003us       6.016us        33.63%       6.016us       6.016us             1  
                                       aten::empty_like         1.39%       7.440us         5.50%      29.370us       9.790us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.10%      21.930us         4.10%      21.930us       7.310us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        30.28%     161.773us        30.28%     161.773us      53.924us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.85%       4.521us         0.85%       4.521us       4.521us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 534.303us
Self CUDA time total: 17.888us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     125.087us       696.82%     125.087us     125.087us             1  
                               hf_kernels_causal_conv1d         4.34%      78.522us        99.74%       1.806ms       1.806ms       0.000us         0.00%      23.998us      23.998us             1  
                                         CausalConv1dFn         4.03%      72.933us        95.40%       1.728ms     575.883us       0.000us         0.00%      23.998us       7.999us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.38%      25.019us        89.74%       1.625ms     541.689us      17.951us       100.00%      23.998us       7.999us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.951us       100.00%      17.951us       5.984us             3  
                                Activity Buffer Request        79.32%       1.436ms        79.32%       1.436ms       1.436ms       6.047us        33.69%       6.047us       6.047us             1  
                                       aten::empty_like         0.46%       8.289us         1.64%      29.650us       9.883us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.18%      21.361us         1.18%      21.361us       7.120us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.04%     163.685us         9.04%     163.685us      54.562us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.26%       4.781us         0.26%       4.781us       4.781us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.811ms
Self CUDA time total: 17.951us
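
Note on reading the table above: the `Self CUDA %` column can exceed 100%. It appears to be computed against `Self CUDA time total`, which here equals the kernel's own time (17.951us), while the top-level `hf_kernels_causal_conv1d` range covers 125.087us on the device timeline. This is an inference from the numbers, not something the trace states. A minimal check of that arithmetic, using values copied from the `cuda_B4_D2048_S512_W2` trace (the helper `pct_of_total` is a hypothetical name):

```python
# Values copied from the cuda_B4_D2048_S512_W2 trace, in microseconds.
self_cuda_total_us = 17.951   # "Self CUDA time total"
range_self_cuda_us = 125.087  # hf_kernels_causal_conv1d row, Self CUDA
activity_buffer_us = 6.047    # Activity Buffer Request row, Self CUDA
kernel_calls = 3              # "# of Calls" for causal_conv1d_fwd_kernel


def pct_of_total(value_us: float, total_us: float) -> float:
    """Return value as a percentage of total, rounded to two decimals."""
    return round(100.0 * value_us / total_us, 2)


# Assumption: the profiler divides each row's Self CUDA by the kernel-only total.
range_pct = pct_of_total(range_self_cuda_us, self_cuda_total_us)
buffer_pct = pct_of_total(activity_buffer_us, self_cuda_total_us)

# Average device time per kernel launch ("CUDA time avg" on the kernel row).
avg_kernel_us = round(self_cuda_total_us / kernel_calls, 3)

print(range_pct, buffer_pct, avg_kernel_us)
```

The three results reproduce the 696.82%, 33.69%, and 5.984us entries in the trace, which is consistent with the total counting only kernel time.
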



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     117.887us       630.82%     117.887us     117.887us             1  
                               hf_kernels_causal_conv1d        11.57%      72.803us        99.15%     623.975us     623.975us       0.000us         0.00%      24.960us      24.960us             1  
                                         CausalConv1dFn        11.13%      70.072us        87.58%     551.172us     183.724us       0.000us         0.00%      24.960us       8.320us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.22%      26.540us        71.97%     452.920us     150.973us      18.688us       100.00%      24.960us       8.320us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      18.688us       100.00%      18.688us       6.229us             3  
                                Activity Buffer Request        41.60%     261.806us        41.60%     261.806us     261.806us       6.272us        33.56%       6.272us       6.272us             1  
                                       aten::empty_like         1.19%       7.500us         4.48%      28.180us       9.393us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.29%      20.680us         3.29%      20.680us       6.893us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        26.15%     164.574us        26.15%     164.574us      54.858us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.85%       5.340us         0.85%       5.340us       5.340us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 629.315us
Self CUDA time total: 18.688us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d        11.55%      73.362us        99.20%     630.015us     630.015us       0.000us         0.00%     162.555us     162.555us             1  
                                         CausalConv1dFn        11.07%      70.302us        87.65%     556.653us     185.551us       0.000us         0.00%     162.555us      54.185us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.16%      26.411us        72.21%     458.550us     152.850us      97.949us       100.00%     162.555us      54.185us             3  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     127.645us       130.32%     127.645us     127.645us             1  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      97.949us       100.00%      97.949us      32.650us             3  
                                Activity Buffer Request        41.87%     265.926us        41.87%     265.926us     265.926us      64.606us        65.96%      64.606us      64.606us             1  
                                       aten::empty_like         1.16%       7.350us         4.38%      27.801us       9.267us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.22%      20.451us         3.22%      20.451us       6.817us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        26.17%     166.213us        26.17%     166.213us      55.404us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.80%       5.050us         0.80%       5.050us       5.050us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 635.065us
Self CUDA time total: 97.949us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d        11.83%      75.513us        99.19%     633.215us     633.215us       0.000us         0.00%     164.638us     164.638us             1  
                                         CausalConv1dFn        11.21%      71.532us        87.37%     557.702us     185.901us       0.000us         0.00%     164.638us      54.879us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.76%      23.990us        71.75%     458.009us     152.670us      99.103us       100.00%     164.638us      54.879us             3  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     132.254us       133.45%     132.254us     132.254us             1  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      99.103us       100.00%      99.103us      33.034us             3  
                                Activity Buffer Request        40.13%     256.155us        40.13%     256.155us     256.155us      65.535us        66.13%      65.535us      65.535us             1  
                                       aten::empty_like         1.16%       7.400us         4.41%      28.161us       9.387us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.25%      20.761us         3.25%      20.761us       6.920us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        27.86%     177.864us        27.86%     177.864us      59.288us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.81%       5.140us         0.81%       5.140us       5.140us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 638.355us
Self CUDA time total: 99.103us


impl                      wl                      p50(ms)  ok
hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W2     0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W4     0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S128_W2        0.04  True
hf_kernels_causal_conv1d  cuda_B2_D64_S128_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W2       0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W4       0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S512_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S512_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W2     0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W4     0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S128_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S128_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W2       0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W4       0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S512_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S512_W4        0.05  True
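
The workload names in the `wl` column encode the benchmark shape, and the rows are listed in plain string order, which is why `S2048` appears before `S512`. Below is a minimal sketch of a parser that recovers the fields and sorts numerically. It assumes `B`, `D`, `S`, and `W` stand for batch size, channel dimension, sequence length, and convolution width, which the log does not spell out. `Workload` and `parse_workload` are hypothetical names, not part of the benchmark code.

```python
import re
from typing import NamedTuple


class Workload(NamedTuple):
    """Shape fields decoded from a workload name (field meanings are assumed)."""
    device: str
    batch: int
    dim: int
    seqlen: int
    width: int


# Matches names such as "cuda_B4_D2048_S512_W4".
_WL_PATTERN = re.compile(
    r"^(?P<device>[a-z]+)_B(?P<B>\d+)_D(?P<D>\d+)_S(?P<S>\d+)_W(?P<W>\d+)$"
)


def parse_workload(name: str) -> Workload:
    """Split a workload name into its device and integer shape fields."""
    match = _WL_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized workload name: {name!r}")
    return Workload(
        device=match["device"],
        batch=int(match["B"]),
        dim=int(match["D"]),
        seqlen=int(match["S"]),
        width=int(match["W"]),
    )


names = [
    "cuda_B4_D2048_S128_W2",
    "cuda_B4_D2048_S2048_W2",
    "cuda_B4_D2048_S512_W2",
]
# Sorting on the parsed tuple orders sequence lengths numerically (128, 512, 2048).
ordered = sorted(names, key=parse_workload)
print(ordered)
```

Sorting on the parsed tuple rather than the raw string makes it easier to read trends along one axis, for example how timing changes as the sequence length grows.
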
UV Install Logs
Fetching 11 files: 100%|██████████| 11/11 [00:01<00:00, 5.94it/s]

Artifacts:

causal_conv1d.jsonl