HF Kernels - Causal Conv1D

GPU Info

Cell: nv | 0.24s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Wed Oct 29 14:27:09 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   33C    P0            109W /  350W |       0MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Causal Conv1D Benchmark
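
For orientation, here is a minimal pure-PyTorch sketch of the operation being benchmarked. It is not the kernel's implementation. It assumes the common layout of `x` as (batch, dim, seqlen) and `weight` as (dim, width), which would match workload names such as B2_D64_S128_W2 if those letters stand for batch, dim, sequence length and filter width. The name `causal_conv1d_reference` is made up for this note.

```python
import torch
import torch.nn.functional as F


def causal_conv1d_reference(x, weight, bias=None):
    """Depthwise causal 1D convolution (illustrative reference only).

    Assumed shapes: x (batch, dim, seqlen), weight (dim, width), bias (dim,).
    """
    dim, width = weight.shape
    seqlen = x.shape[-1]
    # groups=dim gives each channel its own filter (depthwise convolution).
    # padding=width-1 pads both ends; keeping only the first seqlen outputs
    # means out[..., t] depends on x[..., t-width+1 : t+1] and nothing later.
    out = F.conv1d(x, weight.unsqueeze(1), bias, padding=width - 1, groups=dim)
    return out[..., :seqlen]


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(2, 64, 128)   # B2_D64_S128
    w = torch.randn(64, 4)        # W4
    b = torch.randn(64)
    print(causal_conv1d_reference(x, w, b).shape)
```

A reference like this runs on CPU and can be used to sanity-check a fused kernel's output within a tolerance appropriate to the dtype.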

Cell: benchmark | 5.79s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the causal conv1d kernel
causal_conv1d = get_kernel("kernels-community/causal-conv1d")


def hf_kernels_causal_conv1d(input_tensor, weight, bias):
    """Thin wrapper so the benchmark harness can call the Hub kernel's forward function."""
    return causal_conv1d.causal_conv1d_fn(input_tensor, weight, bias)


run_benchmark(
    kernel_type=KernelTypeEnum.CAUSAL_CONV1D,
    impl_name="hf_kernels_causal_conv1d",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_causal_conv1d,
)
Running causal_conv1d benchmark on cuda with 24 workloads.

======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     151.393us      3724.31%     151.393us     151.393us             1  
                               hf_kernels_causal_conv1d         8.95%     166.324us        99.62%       1.852ms       1.852ms       0.000us         0.00%       5.505us       5.505us             1  
                                         CausalConv1dFn         6.05%     112.563us        90.67%       1.686ms     561.934us       0.000us         0.00%       5.505us       1.835us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.41%      26.172us        80.97%       1.505ms     501.826us       4.065us       100.00%       5.505us       1.835us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.065us       100.00%       4.065us       1.355us             3  
                                Activity Buffer Request        77.14%       1.434ms        77.14%       1.434ms       1.434ms       1.440us        35.42%       1.440us       1.440us             1  
                                       aten::empty_like         1.03%      19.059us         3.64%      67.761us      22.587us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.62%      48.702us         2.62%      48.702us      16.234us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.42%      45.061us         2.42%      45.061us      15.020us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.38%       7.150us         0.38%       7.150us       7.150us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.859ms
Self CUDA time total: 4.065us
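
A note on reading these tables: the percentages above 100% in the "Self CUDA %" column look odd at first. One reading that the numbers support (an assumption about how the profiler computes the column, not something stated in the output) is that each row's Self CUDA time is divided by "Self CUDA time total". The first `hf_kernels_causal_conv1d` row then appears to be a range spanning the whole call rather than kernel execution time. The sketch below redoes that arithmetic with values copied from the trace above.

```python
# Values copied from the cuda_B2_D64_S128_W2 trace, in microseconds.
self_cuda_total_us = 4.065    # "Self CUDA time total"
annotation_span_us = 151.393  # first hf_kernels_causal_conv1d row, Self CUDA
activity_buffer_us = 1.440    # "Activity Buffer Request" row, Self CUDA
kernel_self_cuda_us = 4.065   # causal_conv1d_fwd_kernel row, Self CUDA
kernel_calls = 3              # "# of Calls" for the kernel row


def pct_of_total(value_us, total_us):
    """Share of the reported Self CUDA total, as a percentage."""
    return 100.0 * value_us / total_us


annotation_pct = pct_of_total(annotation_span_us, self_cuda_total_us)
buffer_pct = pct_of_total(activity_buffer_us, self_cuda_total_us)
avg_kernel_us = kernel_self_cuda_us / kernel_calls

print(f"annotation row: {annotation_pct:.2f}%")
print(f"activity buffer row: {buffer_pct:.2f}%")
print(f"average kernel time: {avg_kernel_us:.3f}us per call")
```

Under that reading the results line up with the table's 3724.31%, 35.42% and 1.355us, so the figure to compare across workloads is the kernel row's "CUDA time avg", not the percentage on the first row.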



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     129.439us      3456.32%     129.439us     129.439us             1  
                               hf_kernels_causal_conv1d         5.79%      99.043us        99.68%       1.706ms       1.706ms       0.000us         0.00%       4.994us       4.994us             1  
                                         CausalConv1dFn         4.71%      80.562us        93.90%       1.607ms     535.793us       0.000us         0.00%       4.994us       1.665us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.47%      25.130us        87.50%       1.498ms     499.285us       3.745us       100.00%       4.994us       1.665us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.745us       100.00%       3.745us       1.248us             3  
                                Activity Buffer Request        84.17%       1.441ms        84.17%       1.441ms       1.441ms       1.249us        33.35%       1.249us       1.249us             1  
                                       aten::empty_like         0.47%       7.980us         1.69%      28.961us       9.654us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.23%      20.981us         1.23%      20.981us       6.994us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.86%      31.821us         1.86%      31.821us      10.607us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.32%       5.430us         0.32%       5.430us       5.430us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.712ms
Self CUDA time total: 3.745us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.098us      3285.62%     124.098us     124.098us             1  
                               hf_kernels_causal_conv1d         5.52%      95.683us        99.69%       1.728ms       1.728ms       0.000us         0.00%       5.057us       5.057us             1  
                                         CausalConv1dFn         4.48%      77.582us        94.17%       1.632ms     544.020us       0.000us         0.00%       5.057us       1.686us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.43%      24.830us        87.99%       1.525ms     508.322us       3.777us       100.00%       5.057us       1.686us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.777us       100.00%       3.777us       1.259us             3  
                                Activity Buffer Request        84.76%       1.469ms        84.76%       1.469ms       1.469ms       1.280us        33.89%       1.280us       1.280us             1  
                                       aten::empty_like         0.46%       7.920us         1.70%      29.511us       9.837us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.25%      21.591us         1.25%      21.591us       7.197us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.80%      31.261us         1.80%      31.261us      10.420us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.301us         0.31%       5.301us       5.301us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.733ms
Self CUDA time total: 3.777us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     129.729us      3378.36%     129.729us     129.729us             1  
                               hf_kernels_causal_conv1d         5.03%      97.232us        99.72%       1.927ms       1.927ms       0.000us         0.00%       5.120us       5.120us             1  
                                         CausalConv1dFn         4.11%      79.452us        94.69%       1.830ms     610.049us       0.000us         0.00%       5.120us       1.707us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.27%      24.481us        89.03%       1.721ms     573.588us       3.840us       100.00%       5.120us       1.707us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.840us       100.00%       3.840us       1.280us             3  
                                Activity Buffer Request        76.40%       1.477ms        76.40%       1.477ms       1.477ms       1.280us        33.33%       1.280us       1.280us             1  
                                       aten::empty_like         0.41%       7.951us         1.55%      29.931us       9.977us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.14%      21.980us         1.14%      21.980us       7.327us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        11.36%     219.575us        11.36%     219.575us      73.192us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       5.490us         0.28%       5.490us       5.490us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.933ms
Self CUDA time total: 3.840us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     126.080us      2644.30%     126.080us     126.080us             1  
                               hf_kernels_causal_conv1d         5.18%     102.863us        99.75%       1.979ms       1.979ms       0.000us         0.00%       6.368us       6.368us             1  
                                         CausalConv1dFn         3.95%      78.303us        94.57%       1.876ms     625.402us       0.000us         0.00%       6.368us       2.123us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.22%      24.140us        89.14%       1.768ms     589.491us       4.768us       100.00%       6.368us       2.123us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.768us       100.00%       4.768us       1.589us             3  
                                Activity Buffer Request        79.49%       1.577ms        79.49%       1.577ms       1.577ms       1.600us        33.56%       1.600us       1.600us             1  
                                       aten::empty_like         0.40%       7.900us         1.48%      29.430us       9.810us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.09%      21.530us         1.09%      21.530us       7.177us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.43%     167.184us         8.43%     167.184us      55.728us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.25%       4.910us         0.25%       4.910us       4.910us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.984ms
Self CUDA time total: 4.768us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     121.055us      2488.80%     121.055us     121.055us             1  
                               hf_kernels_causal_conv1d        13.09%      78.123us        99.20%     592.205us     592.205us       0.000us         0.00%       6.528us       6.528us             1  
                                         CausalConv1dFn        13.01%      77.643us        86.11%     514.082us     171.361us       0.000us         0.00%       6.528us       2.176us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.18%      24.929us        68.36%     408.089us     136.030us       4.864us       100.00%       6.528us       2.176us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.864us       100.00%       4.864us       1.621us             3  
                                Activity Buffer Request        36.63%     218.665us        36.63%     218.665us     218.665us       1.664us        34.21%       1.664us       1.664us             1  
                                       aten::empty_like         1.31%       7.839us         4.75%      28.350us       9.450us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.44%      20.511us         3.44%      20.511us       6.837us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        27.55%     164.495us        27.55%     164.495us      54.832us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.80%       4.790us         0.80%       4.790us       4.790us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 596.995us
Self CUDA time total: 4.864us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     128.031us      1201.49%     128.031us     128.031us             1  
                               hf_kernels_causal_conv1d         5.58%     105.873us        99.72%       1.893ms       1.893ms       0.000us         0.00%      14.208us      14.208us             1  
                                         CausalConv1dFn         4.13%      78.341us        94.14%       1.787ms     595.748us       0.000us         0.00%      14.208us       4.736us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.45%      27.570us        88.49%       1.680ms     559.957us      10.656us       100.00%      14.208us       4.736us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.656us       100.00%      10.656us       3.552us             3  
                                Activity Buffer Request        77.94%       1.480ms        77.94%       1.480ms       1.480ms       3.552us        33.33%       3.552us       3.552us             1  
                                       aten::empty_like         0.41%       7.812us         1.53%      29.032us       9.677us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.12%      21.220us         1.12%      21.220us       7.073us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.09%     172.624us         9.09%     172.624us      57.541us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       5.330us         0.28%       5.330us       5.330us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.898ms
Self CUDA time total: 10.656us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.524us      1119.66%     122.524us     122.524us             1  
                               hf_kernels_causal_conv1d        19.00%     100.263us        99.02%     522.563us     522.563us       0.000us         0.00%      14.623us      14.623us             1  
                                         CausalConv1dFn        14.56%      76.813us        80.02%     422.300us     140.767us       0.000us         0.00%      14.623us       4.874us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.04%      26.621us        60.06%     316.927us     105.642us      10.943us       100.00%      14.623us       4.874us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.943us       100.00%      10.943us       3.648us             3  
                                Activity Buffer Request        24.63%     129.993us        24.63%     129.993us     129.993us       3.680us        33.63%       3.680us       3.680us             1  
                                       aten::empty_like         1.53%       8.070us         5.41%      28.560us       9.520us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.88%      20.490us         3.88%      20.490us       6.830us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        30.38%     160.313us        30.38%     160.313us      53.438us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.98%       5.160us         0.98%       5.160us       5.160us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 527.723us
Self CUDA time total: 10.943us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     130.879us      1185.50%     130.879us     130.879us             1  
                               hf_kernels_causal_conv1d         6.10%     112.423us        99.71%       1.839ms       1.839ms       0.000us         0.00%      14.752us      14.752us             1  
                                         CausalConv1dFn         4.42%      81.553us        93.62%       1.726ms     575.457us       0.000us         0.00%      14.752us       4.917us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.34%      24.629us        87.45%       1.613ms     537.533us      11.040us       100.00%      14.752us       4.917us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      11.040us       100.00%      11.040us       3.680us             3  
                                Activity Buffer Request        77.44%       1.428ms        77.44%       1.428ms       1.428ms       3.712us        33.62%       3.712us       3.712us             1  
                                       aten::empty_like         0.46%       8.560us         1.75%      32.220us      10.740us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.28%      23.660us         1.28%      23.660us       7.887us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.67%     159.915us         8.67%     159.915us      53.305us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.29%       5.260us         0.29%       5.260us       5.260us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.844ms
Self CUDA time total: 11.040us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.988us      1097.16%     124.988us     124.988us             1  
                               hf_kernels_causal_conv1d        14.68%      75.042us        98.95%     505.802us     505.802us       0.000us         0.00%      15.232us      15.232us             1  
                                         CausalConv1dFn        15.20%      77.712us        84.27%     430.760us     143.587us       0.000us         0.00%      15.232us       5.077us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.71%      24.091us        63.54%     324.777us     108.259us      11.392us       100.00%      15.232us       5.077us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      11.392us       100.00%      11.392us       3.797us             3  
                                Activity Buffer Request        26.66%     136.263us        26.66%     136.263us     136.263us       3.840us        33.71%       3.840us       3.840us             1  
                                       aten::empty_like         1.46%       7.441us         5.53%      28.271us       9.424us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.08%      20.830us         4.08%      20.830us       6.943us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        32.17%     164.423us        32.17%     164.423us      54.808us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.05%       5.351us         1.05%       5.351us       5.351us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 511.153us
Self CUDA time total: 11.392us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     131.775us       262.12%     131.775us     131.775us             1  
                               hf_kernels_causal_conv1d         8.81%      77.263us        99.39%     871.362us     871.362us       0.000us         0.00%      83.680us      83.680us             1  
                                         CausalConv1dFn         8.68%      76.121us        90.57%     794.099us     264.700us       0.000us         0.00%      83.680us      27.893us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.02%      26.501us        78.58%     688.947us     229.649us      50.272us       100.00%      83.680us      27.893us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      50.272us       100.00%      50.272us      16.757us             3  
                                Activity Buffer Request        55.77%     488.972us        55.77%     488.972us     488.972us      33.408us        66.45%      33.408us      33.408us             1  
                                       aten::empty_like         0.92%       8.040us         3.31%      29.031us       9.677us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.39%      20.991us         2.39%      20.991us       6.997us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        19.79%     173.474us        19.79%     173.474us      57.825us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.61%       5.370us         0.61%       5.370us       5.370us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 876.732us
Self CUDA time total: 50.272us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     127.295us       247.23%     127.295us     127.295us             1  
                               hf_kernels_causal_conv1d        15.09%      77.332us        99.04%     507.562us     507.562us       0.000us         0.00%      86.016us      86.016us             1  
                                         CausalConv1dFn        14.68%      75.241us        83.95%     430.230us     143.410us       0.000us         0.00%      86.016us      28.672us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.05%      25.861us        63.40%     324.927us     108.309us      51.488us       100.00%      86.016us      28.672us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      51.488us       100.00%      51.488us      17.163us             3  
                                Activity Buffer Request        25.26%     129.463us        25.26%     129.463us     129.463us      34.528us        67.06%      34.528us      34.528us             1  
                                       aten::empty_like         1.67%       8.561us         5.87%      30.062us      10.021us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.20%      21.501us         4.20%      21.501us       7.167us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        33.09%     169.603us        33.09%     169.603us      56.534us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.96%       4.929us         0.96%       4.929us       4.929us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 512.491us
Self CUDA time total: 51.488us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     121.214us      3104.87%     121.214us     121.214us             1  
                               hf_kernels_causal_conv1d         8.71%      75.123us        99.37%     856.672us     856.672us       0.000us         0.00%       5.184us       5.184us             1  
                                         CausalConv1dFn         8.55%      73.741us        90.66%     781.549us     260.516us       0.000us         0.00%       5.184us       1.728us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         2.92%      25.150us        78.63%     677.857us     225.952us       3.904us       100.00%       5.184us       1.728us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.904us       100.00%       3.904us       1.301us             3  
                                Activity Buffer Request        56.24%     484.832us        56.24%     484.832us     484.832us       1.280us        32.79%       1.280us       1.280us             1  
                                       aten::empty_like         1.08%       9.311us         3.47%      29.951us       9.984us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.39%      20.640us         2.39%      20.640us       6.880us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        19.47%     167.875us        19.47%     167.875us      55.958us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.63%       5.440us         0.63%       5.440us       5.440us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 862.112us
Self CUDA time total: 3.904us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     121.438us      3086.10%     121.438us     121.438us             1  
                               hf_kernels_causal_conv1d        15.37%      74.422us        98.89%     478.921us     478.921us       0.000us         0.00%       5.183us       5.183us             1  
                                         CausalConv1dFn        15.69%      75.972us        83.52%     404.499us     134.833us       0.000us         0.00%       5.183us       1.728us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.44%      26.330us        61.72%     298.936us      99.645us       3.935us       100.00%       5.183us       1.728us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.935us       100.00%       3.935us       1.312us             3  
                                Activity Buffer Request        23.74%     114.963us        23.74%     114.963us     114.963us       1.248us        31.72%       1.248us       1.248us             1  
                                       aten::empty_like         1.57%       7.609us         6.11%      29.591us       9.864us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.54%      21.982us         4.54%      21.982us       7.327us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        32.55%     157.643us        32.55%     157.643us      52.548us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.11%       5.391us         1.11%       5.391us       5.391us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 484.312us
Self CUDA time total: 3.935us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     152.157us      3744.94%     152.157us     152.157us             1  
                               hf_kernels_causal_conv1d        10.88%      77.931us        99.21%     710.327us     710.327us       0.000us         0.00%       5.407us       5.407us             1  
                                         CausalConv1dFn        11.39%      81.522us        88.32%     632.396us     210.799us       0.000us         0.00%       5.407us       1.802us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.86%      27.639us        72.73%     520.742us     173.581us       4.063us       100.00%       5.407us       1.802us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.063us       100.00%       4.063us       1.354us             3  
                                Activity Buffer Request        44.05%     315.408us        44.05%     315.408us     315.408us       1.344us        33.08%       1.344us       1.344us             1  
                                       aten::empty_like         1.15%       8.200us         4.21%      30.132us      10.044us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.06%      21.932us         3.06%      21.932us       7.311us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        24.82%     177.695us        24.82%     177.695us      59.232us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.79%       5.681us         0.79%       5.681us       5.681us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 716.008us
Self CUDA time total: 4.063us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     119.936us      2951.18%     119.936us     119.936us             1  
                               hf_kernels_causal_conv1d        15.86%      75.552us        99.00%     471.672us     471.672us       0.000us         0.00%       5.440us       5.440us             1  
                                         CausalConv1dFn        16.03%      76.383us        83.14%     396.120us     132.040us       0.000us         0.00%       5.440us       1.813us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.35%      25.480us        61.26%     291.866us      97.289us       4.064us       100.00%       5.440us       1.813us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.064us       100.00%       4.064us       1.355us             3  
                                Activity Buffer Request        23.14%     110.243us        23.14%     110.243us     110.243us       1.376us        33.86%       1.376us       1.376us             1  
                                       aten::empty_like         1.53%       7.269us         5.85%      27.871us       9.290us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.32%      20.602us         4.32%      20.602us       6.867us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        32.77%     156.143us        32.77%     156.143us      52.048us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.00%       4.760us         1.00%       4.760us       4.760us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 476.432us
Self CUDA time total: 4.064us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     129.888us      2401.78%     129.888us     129.888us             1  
                               hf_kernels_causal_conv1d        13.50%     106.873us        99.32%     785.980us     785.980us       0.000us         0.00%       7.264us       7.264us             1  
                                         CausalConv1dFn        10.04%      79.422us        85.81%     679.107us     226.369us       0.000us         0.00%       7.264us       2.421us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.32%      26.310us        72.10%     570.564us     190.188us       5.408us       100.00%       7.264us       2.421us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.408us       100.00%       5.408us       1.803us             3  
                                Activity Buffer Request        48.81%     386.260us        48.81%     386.260us     386.260us       1.856us        34.32%       1.856us       1.856us             1  
                                       aten::empty_like         1.01%       7.981us         3.68%      29.121us       9.707us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.67%      21.140us         2.67%      21.140us       7.047us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        19.96%     157.994us        19.96%     157.994us      52.665us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.68%       5.410us         0.68%       5.410us       5.410us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 791.390us
Self CUDA time total: 5.408us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     118.463us      2151.92%     118.463us     118.463us             1  
                               hf_kernels_causal_conv1d        19.47%      96.181us        98.96%     488.812us     488.812us       0.000us         0.00%       7.393us       7.393us             1  
                                         CausalConv1dFn        15.19%      75.044us        79.49%     392.631us     130.877us       0.000us         0.00%       7.393us       2.464us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.31%      26.241us        58.39%     288.397us      96.132us       5.505us       100.00%       7.393us       2.464us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.505us       100.00%       5.505us       1.835us             3  
                                Activity Buffer Request        21.50%     106.222us        21.50%     106.222us     106.222us       1.888us        34.30%       1.888us       1.888us             1  
                                       aten::empty_like         1.50%       7.390us         5.91%      29.190us       9.730us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.41%      21.800us         4.41%      21.800us       7.267us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        31.57%     155.934us        31.57%     155.934us      51.978us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.04%       5.140us         1.04%       5.140us       5.140us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 493.952us
Self CUDA time total: 5.505us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     129.279us       741.28%     129.279us     129.279us             1  
                               hf_kernels_causal_conv1d         5.08%      91.861us        99.73%       1.805ms       1.805ms       0.000us         0.00%      23.296us      23.296us             1  
                                         CausalConv1dFn         4.24%      76.815us        94.65%       1.713ms     571.078us       0.000us         0.00%      23.296us       7.765us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.42%      25.791us        88.76%       1.607ms     535.516us      17.440us       100.00%      23.296us       7.765us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.440us       100.00%      17.440us       5.813us             3  
                                Activity Buffer Request        78.65%       1.424ms        78.65%       1.424ms       1.424ms       5.856us        33.58%       5.856us       5.856us             1  
                                       aten::empty_like         0.47%       8.500us         1.65%      29.870us       9.957us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.18%      21.370us         1.18%      21.370us       7.123us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.68%     157.163us         8.68%     157.163us      52.388us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       4.911us         0.27%       4.911us       4.911us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.810ms
Self CUDA time total: 17.440us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     139.324us       772.01%     139.324us     139.324us             1  
                               hf_kernels_causal_conv1d        18.68%      93.362us        99.02%     494.883us     494.883us       0.000us         0.00%      24.095us      24.095us             1  
                                         CausalConv1dFn        17.38%      86.843us        80.34%     401.521us     133.840us       0.000us         0.00%      24.095us       8.032us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.36%      26.789us        57.15%     285.628us      95.209us      18.047us       100.00%      24.095us       8.032us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      18.047us       100.00%      18.047us       6.016us             3  
                                Activity Buffer Request        20.49%     102.403us        20.49%     102.403us     102.403us       6.048us        33.51%       6.048us       6.048us             1  
                                       aten::empty_like         1.48%       7.399us         5.81%      29.050us       9.683us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.33%      21.651us         4.33%      21.651us       7.217us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        31.30%     156.436us        31.30%     156.436us      52.145us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.98%       4.890us         0.98%       4.890us       4.890us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 499.773us
Self CUDA time total: 18.047us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     135.103us       748.58%     135.103us     135.103us             1  
                               hf_kernels_causal_conv1d         5.37%      98.434us        99.69%       1.829ms       1.829ms       0.000us         0.00%      24.097us      24.097us             1  
                                         CausalConv1dFn         4.35%      79.821us        94.33%       1.730ms     576.697us       0.000us         0.00%      24.097us       8.032us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.36%      24.912us        88.33%       1.620ms     540.010us      18.048us       100.00%      24.097us       8.032us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      18.048us       100.00%      18.048us       6.016us             3  
                                Activity Buffer Request        77.78%       1.427ms        77.78%       1.427ms       1.427ms       6.049us        33.52%       6.049us       6.049us             1  
                                       aten::empty_like         0.47%       8.550us         1.65%      30.240us      10.080us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.18%      21.690us         1.18%      21.690us       7.230us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.19%     168.514us         9.19%     168.514us      56.171us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.620us         0.31%       5.620us       5.620us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.834ms
Self CUDA time total: 18.048us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     130.684us       694.54%     130.684us     130.684us             1  
                               hf_kernels_causal_conv1d        18.98%      97.223us        99.02%     507.183us     507.183us       0.000us         0.00%      25.120us      25.120us             1  
                                         CausalConv1dFn        14.58%      74.692us        80.04%     409.960us     136.653us       0.000us         0.00%      25.120us       8.373us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         6.51%      33.321us        59.71%     305.838us     101.946us      18.816us       100.00%      25.120us       8.373us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      18.816us       100.00%      18.816us       6.272us             3  
                                Activity Buffer Request        22.33%     114.353us        22.33%     114.353us     114.353us       6.304us        33.50%       6.304us       6.304us             1  
                                       aten::empty_like         1.71%       8.769us         5.75%      29.430us       9.810us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.03%      20.661us         4.03%      20.661us       6.887us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        30.88%     158.164us        30.88%     158.164us      52.721us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.98%       5.010us         0.98%       5.010us       5.010us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 512.193us
Self CUDA time total: 18.816us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         6.14%     112.394us        99.70%       1.825ms       1.825ms       0.000us         0.00%     162.754us     162.754us             1  
                                         CausalConv1dFn         4.41%      80.651us        93.56%       1.713ms     570.927us       0.000us         0.00%     162.754us      54.251us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.37%      25.010us        87.54%       1.603ms     534.193us      97.985us       100.00%     162.754us      54.251us             3  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     144.737us       147.71%     144.737us     144.737us             1  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      97.985us       100.00%      97.985us      32.662us             3  
                                Activity Buffer Request        77.36%       1.416ms        77.36%       1.416ms       1.416ms      64.769us        66.10%      64.769us      64.769us             1  
                                       aten::empty_like         0.49%       8.901us         1.61%      29.551us       9.850us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.13%      20.650us         1.13%      20.650us       6.883us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.82%     161.445us         8.82%     161.445us      53.815us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.30%       5.480us         0.30%       5.480us       5.480us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.831ms
Self CUDA time total: 97.985us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d        19.17%      96.654us        98.90%     498.573us     498.573us       0.000us         0.00%     163.900us     163.900us             1  
                                         CausalConv1dFn        15.33%      77.291us        79.73%     401.919us     133.973us       0.000us         0.00%     163.900us      54.633us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.17%      26.053us        58.73%     296.088us      98.696us      98.813us       100.00%     163.900us      54.633us             3  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     133.981us       135.59%     133.981us     133.981us             1  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      98.813us       100.00%      98.813us      32.938us             3  
                                Activity Buffer Request        22.39%     112.882us        22.39%     112.882us     112.882us      65.087us        65.87%      65.087us      65.087us             1  
                                       aten::empty_like         1.55%       7.820us         5.66%      28.540us       9.513us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.11%      20.720us         4.11%      20.720us       6.907us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        31.17%     157.153us        31.17%     157.153us      52.384us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.10%       5.550us         1.10%       5.550us       5.550us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 504.123us
Self CUDA time total: 98.813us


impl                      wl                      p50(ms)  ok
hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W2     0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W4     0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S128_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S128_W4        0.06  True
hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W2       0.06  True
hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W4       0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S512_W2        0.06  True
hf_kernels_causal_conv1d  cuda_B2_D64_S512_W4        0.06  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W2     0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W4     0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S128_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S128_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W2       0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W4       0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S512_W2        0.06  True
hf_kernels_causal_conv1d  cuda_B4_D64_S512_W4        0.05  True
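
The rows above are ordered lexicographically by workload label, so S2048 sorts before S512. A minimal sketch for ordering them numerically instead follows. It assumes the label fields mean B = batch size, D = channels, S = sequence length, and W = kernel width, which the output does not state. `parse_workload` and `natural_key` are hypothetical helpers, not part of the benchmark.

```python
import re

# Assumed field meanings (not stated in the output itself):
# B = batch size, D = channels, S = sequence length, W = kernel width.
_WL = re.compile(
    r"^(?P<device>[a-z]+)_B(?P<B>\d+)_D(?P<D>\d+)_S(?P<S>\d+)_W(?P<W>\d+)$"
)

def parse_workload(label):
    """Split a label such as 'cuda_B4_D2048_S512_W2' into its fields."""
    m = _WL.match(label)
    if m is None:
        raise ValueError(f"unrecognized workload label: {label!r}")
    fields = {k: int(v) for k, v in m.groupdict().items() if k != "device"}
    fields["device"] = m.group("device")
    return fields

def natural_key(label):
    """Sort key that compares the numeric fields as integers."""
    f = parse_workload(label)
    return (f["B"], f["D"], f["S"], f["W"])

labels = [
    "cuda_B2_D2048_S2048_W2",
    "cuda_B2_D2048_S512_W2",
    "cuda_B2_D2048_S128_W4",
]
print(sorted(labels, key=natural_key))
```

Sorting with `natural_key` places S128 first, then S512, then S2048.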
▶ UV Install Logs
Fetching 11 files: 100%|██████████| 11/11 [00:01<00:00, 6.21it/s]

Artifacts:

causal_conv1d.jsonl
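
The artifact's `.jsonl` extension suggests one JSON object per line. A minimal sketch for loading it with the standard library follows. The record schema is not shown here, so the code assumes no particular field names, and `read_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Return one parsed JSON value per non-empty line of the file."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records

# Only inspect the artifact if it has been downloaded next to this script.
artifact = Path("causal_conv1d.jsonl")
if artifact.exists():
    records = read_jsonl(artifact)
    print(f"{len(records)} records loaded")
    if records and isinstance(records[0], dict):
        # Print the keys rather than assuming a schema.
        print(sorted(records[0].keys()))
```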