HF Kernels - Causal Conv1D

GPU Info

Cell: nv | 0.22s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Mon Nov 10 21:57:49 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   27C    P0             77W /  350W |       0MiB /  46068MiB |     18%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Causal Conv1D Benchmark
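
For orientation, the operation being timed can be sketched in plain Python. This is a minimal reference under stated assumptions, not the kernel's implementation: it assumes the layout used by the upstream causal-conv1d package (input as batch x dim x seqlen, weight as dim x width, one bias per channel, no activation), which this page does not confirm. The name causal_conv1d_reference is hypothetical.

```python
def causal_conv1d_reference(x, weight, bias=None):
    """Depthwise causal 1D convolution on nested lists (assumed semantics).

    x:      [batch][dim][seqlen]
    weight: [dim][width]
    bias:   [dim] or None
    """
    out = []
    for sample in x:
        out_sample = []
        for d, channel in enumerate(sample):
            taps = weight[d]
            width = len(taps)
            b = bias[d] if bias is not None else 0.0
            row = []
            for t in range(len(channel)):
                acc = b
                for k, w in enumerate(taps):
                    src = t - (width - 1) + k  # index into the unpadded input
                    if src >= 0:               # steps before t=0 count as zero
                        acc += w * channel[src]
                row.append(acc)
            out_sample.append(row)
        out.append(out_sample)
    return out


x = [[[1.0, 2.0, 3.0, 4.0]]]   # batch=1, dim=1, seqlen=4
weight = [[10.0, 1.0]]         # width=2: the last tap multiplies the current step
bias = [0.5]
result = causal_conv1d_reference(x, weight, bias)
print(result)  # [[[1.5, 12.5, 23.5, 34.5]]]
```

Each output step depends only on the current and earlier input steps, which is what makes the convolution causal: changing the last input value leaves the first three outputs untouched.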

Cell: benchmark | 10.37s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Fetch the causal conv1d kernel from the Hugging Face Hub
causal_conv1d = get_kernel("kernels-community/causal-conv1d")


def hf_kernels_causal_conv1d(input_tensor, weight, bias):
    # Thin wrapper passed to run_benchmark as impl_func; forwards to the Hub kernel
    return causal_conv1d.causal_conv1d_fn(input_tensor, weight, bias)


run_benchmark(
    kernel_type=KernelTypeEnum.CAUSAL_CONV1D,
    impl_name="hf_kernels_causal_conv1d",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_causal_conv1d,
)
Running causal_conv1d benchmark on cuda with 24 workloads.
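
The trace headers below label each workload with a compact string such as cuda_B2_D64_S128_W2. The sketch below splits such a label into named fields. The field meanings (B = batch, D = dim, S = seqlen, W = kernel width) are an assumption based on the usual causal conv1d parameters; the page does not define them, and parse_workload is a hypothetical helper, not part of kernels-benchmark-tools.

```python
import re

# Assumed meaning of each letter; the page itself does not define them.
FIELDS = {"B": "batch", "D": "dim", "S": "seqlen", "W": "width"}


def parse_workload(label):
    """Split a label like 'cuda_B2_D64_S128_W2' into (device, shape dict)."""
    device, *parts = label.split("_")
    shape = {}
    for part in parts:
        m = re.fullmatch(r"([A-Z])(\d+)", part)
        if m is None or m.group(1) not in FIELDS:
            raise ValueError(f"unrecognized field {part!r} in {label!r}")
        shape[FIELDS[m.group(1)]] = int(m.group(2))
    return device, shape


device, shape = parse_workload("cuda_B2_D2048_S512_W4")
print(device, shape)
```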

======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     156.321us      3758.62%     156.321us     156.321us             1  
                               hf_kernels_causal_conv1d         6.87%     159.072us        99.36%       2.300ms       2.300ms       0.000us         0.00%       5.599us       5.599us             1  
                                         CausalConv1dFn         4.82%     111.622us        92.49%       2.141ms     713.785us       0.000us         0.00%       5.599us       1.866us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.19%      27.462us        84.76%       1.962ms     654.127us       4.159us       100.00%       5.599us       1.866us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.159us       100.00%       4.159us       1.386us             3  
                                Activity Buffer Request        81.39%       1.884ms        81.39%       1.884ms       1.884ms       1.440us        34.62%       1.440us       1.440us             1  
                                       aten::empty_like         0.94%      21.650us         2.91%      67.351us      22.450us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.97%      45.701us         1.97%      45.701us      15.234us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.18%      50.500us         2.18%      50.500us      16.833us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.64%      14.811us         0.64%      14.811us      14.811us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.315ms
Self CUDA time total: 4.159us
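
The columns in the trace above are related by simple arithmetic, which the sketch below re-derives from the printed values. The interpretation is an assumption rather than something the page states: the first hf_kernels_causal_conv1d row appears to be a GPU-side annotation range covering the whole profiled region, so its share is computed against the summed kernel time and can exceed 100%.

```python
# Values copied from the cuda_B2_D64_S128_W2 trace above.
kernel_cuda_total_us = 4.159    # causal_conv1d_fwd_kernel, "CUDA total"
kernel_calls = 3                # "# of Calls"
wrapper_span_us = 156.321       # first hf_kernels_causal_conv1d row, "Self CUDA"
self_cuda_total_us = 4.159      # "Self CUDA time total"

# "CUDA time avg" is the total divided by the number of calls.
avg_per_call_us = kernel_cuda_total_us / kernel_calls

# "Self CUDA %" is the row's self time relative to the summed total.
span_pct = 100 * wrapper_span_us / self_cuda_total_us

print(f"{avg_per_call_us:.3f} us per kernel launch")
print(f"{span_pct:.1f}% of summed kernel time")
```

Read this way, the figure to compare across workloads is the per-launch kernel time (about 1.4 us here), not the CPU totals, which are dominated by the one-off Activity Buffer Request row.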



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     123.455us      3297.41%     123.455us     123.455us             1  
                               hf_kernels_causal_conv1d         4.13%      83.101us        99.73%       2.009ms       2.009ms       0.000us         0.00%       4.992us       4.992us             1  
                                         CausalConv1dFn         3.66%      73.760us        95.61%       1.926ms     641.917us       0.000us         0.00%       4.992us       1.664us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.15%      23.071us        90.47%       1.822ms     607.420us       3.744us       100.00%       4.992us       1.664us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.744us       100.00%       3.744us       1.248us             3  
                                Activity Buffer Request        87.83%       1.769ms        87.83%       1.769ms       1.769ms       1.248us        33.33%       1.248us       1.248us             1  
                                       aten::empty_like         0.39%       7.860us         1.48%      29.730us       9.910us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.09%      21.870us         1.09%      21.870us       7.290us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.49%      30.082us         1.49%      30.082us      10.027us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       5.421us         0.27%       5.421us       5.421us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.014ms
Self CUDA time total: 3.744us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     119.263us      3185.44%     119.263us     119.263us             1  
                               hf_kernels_causal_conv1d         3.91%      78.640us        99.72%       2.003ms       2.003ms       0.000us         0.00%       4.992us       4.992us             1  
                                         CausalConv1dFn         3.57%      71.661us        95.80%       1.925ms     641.537us       0.000us         0.00%       4.992us       1.664us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.13%      22.781us        90.75%       1.823ms     607.693us       3.744us       100.00%       4.992us       1.664us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.744us       100.00%       3.744us       1.248us             3  
                                Activity Buffer Request        88.14%       1.771ms        88.14%       1.771ms       1.771ms       1.248us        33.33%       1.248us       1.248us             1  
                                       aten::empty_like         0.41%       8.160us         1.49%      29.872us       9.957us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.08%      21.712us         1.08%      21.712us       7.237us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.48%      29.670us         1.48%      29.670us       9.890us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       5.669us         0.28%       5.669us       5.669us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.009ms
Self CUDA time total: 3.744us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     121.790us      3253.81%     121.790us     121.790us             1  
                               hf_kernels_causal_conv1d         3.48%      76.970us        99.77%       2.208ms       2.208ms       0.000us         0.00%       4.991us       4.991us             1  
                                         CausalConv1dFn         3.33%      73.753us        96.30%       2.131ms     710.368us       0.000us         0.00%       4.991us       1.664us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.03%      22.770us        91.66%       2.029ms     676.184us       3.743us       100.00%       4.991us       1.664us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.743us       100.00%       3.743us       1.248us             3  
                                Activity Buffer Request        81.47%       1.803ms        81.47%       1.803ms       1.803ms       1.248us        33.34%       1.248us       1.248us             1  
                                       aten::empty_like         0.36%       7.858us         1.30%      28.800us       9.600us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.95%      20.942us         0.95%      20.942us       6.981us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.17%     202.863us         9.17%     202.863us      67.621us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.23%       4.991us         0.23%       4.991us       4.991us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.213ms
Self CUDA time total: 3.743us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     123.073us      2547.57%     123.073us     123.073us             1  
                               hf_kernels_causal_conv1d         3.82%      79.680us        99.75%       2.083ms       2.083ms       0.000us         0.00%       6.463us       6.463us             1  
                                         CausalConv1dFn         3.53%      73.692us        95.93%       2.003ms     667.744us       0.000us         0.00%       6.463us       2.154us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.17%      24.371us        90.98%       1.900ms     633.257us       4.831us       100.00%       6.463us       2.154us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.831us       100.00%       4.831us       1.610us             3  
                                Activity Buffer Request        81.73%       1.707ms        81.73%       1.707ms       1.707ms       1.632us        33.78%       1.632us       1.632us             1  
                                       aten::empty_like         0.42%       8.791us         1.43%      29.771us       9.924us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.00%      20.980us         1.00%      20.980us       6.993us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.08%     168.682us         8.08%     168.682us      56.227us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.25%       5.250us         0.25%       5.250us       5.250us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.088ms
Self CUDA time total: 4.831us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     113.883us      2373.55%     113.883us     113.883us             1  
                               hf_kernels_causal_conv1d        15.03%      75.250us        99.01%     495.717us     495.717us       0.000us         0.00%       6.430us       6.430us             1  
                                         CausalConv1dFn        13.70%      68.601us        83.98%     420.467us     140.156us       0.000us         0.00%       6.430us       2.143us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.03%      25.190us        64.69%     323.874us     107.958us       4.798us       100.00%       6.430us       2.143us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.798us       100.00%       4.798us       1.599us             3  
                                Activity Buffer Request        28.01%     140.222us        28.01%     140.222us     140.222us       1.632us        34.01%       1.632us       1.632us             1  
                                       aten::empty_like         1.45%       7.260us         5.59%      27.992us       9.331us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.14%      20.732us         4.14%      20.732us       6.911us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        31.65%     158.462us        31.65%     158.462us      52.821us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.99%       4.940us         0.99%       4.940us       4.940us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 500.657us
Self CUDA time total: 4.798us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.365us      1148.32%     122.365us     122.365us             1  
                               hf_kernels_causal_conv1d         3.51%      76.530us        99.77%       2.176ms       2.176ms       0.000us         0.00%      14.208us      14.208us             1  
                                         CausalConv1dFn         3.29%      71.713us        96.26%       2.099ms     699.771us       0.000us         0.00%      14.208us       4.736us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.11%      24.170us        91.65%       1.999ms     666.274us      10.656us       100.00%      14.208us       4.736us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.656us       100.00%      10.656us       3.552us             3  
                                Activity Buffer Request        82.90%       1.808ms        82.90%       1.808ms       1.808ms       3.552us        33.33%       3.552us       3.552us             1  
                                       aten::empty_like         0.37%       8.070us         1.32%      28.780us       9.593us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.95%      20.710us         0.95%      20.710us       6.903us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.64%     166.713us         7.64%     166.713us      55.571us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.23%       5.051us         0.23%       5.051us       5.051us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.181ms
Self CUDA time total: 10.656us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     115.676us      1057.08%     115.676us     115.676us             1  
                               hf_kernels_causal_conv1d        15.90%      75.141us        98.97%     467.777us     467.777us       0.000us         0.00%      14.654us      14.654us             1  
                                         CausalConv1dFn        14.89%      70.359us        83.07%     392.636us     130.879us       0.000us         0.00%      14.654us       4.885us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.95%      23.391us        62.24%     294.186us      98.062us      10.943us       100.00%      14.654us       4.885us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.943us       100.00%      10.943us       3.648us             3  
                                Activity Buffer Request        23.54%     111.281us        23.54%     111.281us     111.281us       3.711us        33.91%       3.711us       3.711us             1  
                                       aten::empty_like         1.66%       7.830us         5.94%      28.091us       9.364us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.29%      20.261us         4.29%      20.261us       6.754us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        33.75%     159.514us        33.75%     159.514us      53.171us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.03%       4.890us         1.03%       4.890us       4.890us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 472.667us
Self CUDA time total: 10.943us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     123.422us      1124.47%     123.422us     123.422us             1  
                               hf_kernels_causal_conv1d         3.69%      77.100us        99.75%       2.084ms       2.084ms       0.000us         0.00%      14.656us      14.656us             1  
                                         CausalConv1dFn         3.52%      73.471us        96.06%       2.007ms     668.988us       0.000us         0.00%      14.656us       4.885us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.13%      23.660us        90.70%       1.895ms     631.647us      10.976us       100.00%      14.656us       4.885us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.976us       100.00%      10.976us       3.659us             3  
                                Activity Buffer Request        81.81%       1.709ms        81.81%       1.709ms       1.709ms       3.680us        33.53%       3.680us       3.680us             1  
                                       aten::empty_like         0.81%      17.020us         1.85%      38.551us      12.850us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.03%      21.531us         1.03%      21.531us       7.177us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.76%     162.104us         7.76%     162.104us      54.035us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.25%       5.260us         0.25%       5.260us       5.260us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.089ms
Self CUDA time total: 10.976us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     117.952us      1044.29%     117.952us     117.952us             1  
                               hf_kernels_causal_conv1d        16.01%      73.960us        98.90%     456.837us     456.837us       0.000us         0.00%      15.071us      15.071us             1  
                                         CausalConv1dFn        15.53%      71.741us        82.89%     382.877us     127.626us       0.000us         0.00%      15.071us       5.024us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.93%      22.791us        61.20%     282.685us      94.228us      11.295us       100.00%      15.071us       5.024us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      11.295us       100.00%      11.295us       3.765us             3  
                                Activity Buffer Request        21.70%     100.232us        21.70%     100.232us     100.232us       3.776us        33.43%       3.776us       3.776us             1  
                                       aten::empty_like         1.73%       7.970us         6.16%      28.451us       9.484us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.43%      20.481us         4.43%      20.481us       6.827us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.57%     159.662us        34.57%     159.662us      53.221us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.10%       5.060us         1.10%       5.060us       5.060us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 461.897us
Self CUDA time total: 11.295us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     128.158us       256.57%     128.158us     128.158us             1  
                               hf_kernels_causal_conv1d         3.51%      75.280us        99.75%       2.140ms       2.140ms       0.000us         0.00%      83.102us      83.102us             1  
                                         CausalConv1dFn         3.36%      72.172us        96.24%       2.065ms     688.218us       0.000us         0.00%      83.102us      27.701us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.14%      24.540us        91.55%       1.964ms     654.657us      49.951us       100.00%      83.102us      27.701us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      49.951us       100.00%      49.951us      16.650us             3  
                                Activity Buffer Request        82.86%       1.778ms        82.86%       1.778ms       1.778ms      33.151us        66.37%      33.151us      33.151us             1  
                                       aten::empty_like         0.37%       7.920us         1.33%      28.510us       9.503us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.96%      20.590us         0.96%      20.590us       6.863us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.54%     161.824us         7.54%     161.824us      53.941us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.25%       5.290us         0.25%       5.290us       5.290us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.145ms
Self CUDA time total: 49.951us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     121.310us       261.10%     121.310us     121.310us             1  
                               hf_kernels_causal_conv1d        16.42%      74.560us        98.88%     448.987us     448.987us       0.000us         0.00%      75.933us      75.933us             1  
                                         CausalConv1dFn        15.28%      69.392us        82.46%     374.427us     124.809us       0.000us         0.00%      75.933us      25.311us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.01%      22.740us        60.80%     276.074us      92.025us      46.462us       100.00%      75.933us      25.311us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      46.462us       100.00%      46.462us      15.487us             3  
                                Activity Buffer Request        21.27%      96.581us        21.27%      96.581us      96.581us      29.471us        63.43%      29.471us      29.471us             1  
                                       aten::empty_like         1.63%       7.411us         6.38%      28.961us       9.654us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.75%      21.550us         4.75%      21.550us       7.183us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.52%     156.753us        34.52%     156.753us      52.251us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.12%       5.090us         1.12%       5.090us       5.090us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 454.077us
Self CUDA time total: 46.462us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     128.254us      3312.35%     128.254us     128.254us             1  
                               hf_kernels_causal_conv1d         3.31%      74.540us        99.77%       2.245ms       2.245ms       0.000us         0.00%       5.120us       5.120us             1  
                                         CausalConv1dFn         3.41%      76.802us        96.46%       2.170ms     723.418us       0.000us         0.00%       5.120us       1.707us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.08%      24.209us        91.78%       2.065ms     688.374us       3.872us       100.00%       5.120us       1.707us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.872us       100.00%       3.872us       1.291us             3  
                                Activity Buffer Request        83.69%       1.883ms        83.69%       1.883ms       1.883ms       1.248us        32.23%       1.248us       1.248us             1  
                                       aten::empty_like         0.34%       7.679us         1.26%      28.331us       9.444us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.92%      20.652us         0.92%      20.652us       6.884us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.01%     157.803us         7.01%     157.803us      52.601us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.23%       5.180us         0.23%       5.180us       5.180us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.250ms
Self CUDA time total: 3.872us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     117.470us      3059.11%     117.470us     117.470us             1  
                               hf_kernels_causal_conv1d        16.52%      75.490us        98.91%     451.907us     451.907us       0.000us         0.00%       5.056us       5.056us             1  
                                         CausalConv1dFn        15.55%      71.061us        82.39%     376.417us     125.472us       0.000us         0.00%       5.056us       1.685us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.27%      24.090us        60.40%     275.984us      91.995us       3.840us       100.00%       5.056us       1.685us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.840us       100.00%       3.840us       1.280us             3  
                                Activity Buffer Request        20.75%      94.821us        20.75%      94.821us      94.821us       1.216us        31.67%       1.216us       1.216us             1  
                                       aten::empty_like         1.80%       8.242us         6.43%      29.372us       9.791us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.62%      21.130us         4.62%      21.130us       7.043us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.38%     157.073us        34.38%     157.073us      52.358us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.09%       4.990us         1.09%       4.990us       4.990us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 456.897us
Self CUDA time total: 3.840us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     120.191us      2958.18%     120.191us     120.191us             1  
                               hf_kernels_causal_conv1d         3.64%      78.360us        99.76%       2.149ms       2.149ms       0.000us         0.00%       5.406us       5.406us             1  
                                         CausalConv1dFn         3.37%      72.531us        96.13%       2.071ms     690.275us       0.000us         0.00%       5.406us       1.802us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.05%      22.591us        91.41%       1.969ms     656.417us       4.063us       100.00%       5.406us       1.802us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.063us       100.00%       4.063us       1.354us             3  
                                Activity Buffer Request        83.09%       1.790ms        83.09%       1.790ms       1.790ms       1.343us        33.05%       1.343us       1.343us             1  
                                       aten::empty_like         0.37%       8.020us         1.35%      29.041us       9.680us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.98%      21.021us         0.98%      21.021us       7.007us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.27%     156.703us         7.27%     156.703us      52.234us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.24%       5.100us         0.24%       5.100us       5.100us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.154ms
Self CUDA time total: 4.063us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     120.509us      2988.81%     120.509us     120.509us             1  
                               hf_kernels_causal_conv1d        16.24%      73.950us        98.87%     450.317us     450.317us       0.000us         0.00%       5.376us       5.376us             1  
                                         CausalConv1dFn        17.23%      78.473us        82.64%     376.367us     125.456us       0.000us         0.00%       5.376us       1.792us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.08%      23.119us        59.28%     269.974us      89.991us       4.032us       100.00%       5.376us       1.792us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.032us       100.00%       4.032us       1.344us             3  
                                Activity Buffer Request        19.95%      90.851us        19.95%      90.851us      90.851us       1.344us        33.33%       1.344us       1.344us             1  
                                       aten::empty_like         1.73%       7.890us         6.13%      27.920us       9.307us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.40%      20.030us         4.40%      20.030us       6.677us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.25%     156.004us        34.25%     156.004us      52.001us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.13%       5.130us         1.13%       5.130us       5.130us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 455.447us
Self CUDA time total: 4.032us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.767us      2334.71%     124.767us     124.767us             1  
                               hf_kernels_causal_conv1d         3.64%      76.791us        99.75%       2.102ms       2.102ms       0.000us         0.00%       7.168us       7.168us             1  
                                         CausalConv1dFn         3.46%      72.920us        96.11%       2.025ms     674.997us       0.000us         0.00%       7.168us       2.389us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.08%      22.730us        91.24%       1.923ms     640.840us       5.344us       100.00%       7.168us       2.389us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.344us       100.00%       5.344us       1.781us             3  
                                Activity Buffer Request        82.66%       1.742ms        82.66%       1.742ms       1.742ms       1.824us        34.13%       1.824us       1.824us             1  
                                       aten::empty_like         0.40%       8.480us         1.40%      29.552us       9.851us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.00%      21.072us         1.00%      21.072us       7.024us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.51%     158.242us         7.51%     158.242us      52.747us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.25%       5.220us         0.25%       5.220us       5.220us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.107ms
Self CUDA time total: 5.344us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     114.399us      2127.96%     114.399us     114.399us             1  
                               hf_kernels_causal_conv1d        16.62%      75.320us        98.88%     448.097us     448.097us       0.000us         0.00%       7.200us       7.200us             1  
                                         CausalConv1dFn        15.04%      68.172us        82.26%     372.777us     124.259us       0.000us         0.00%       7.200us       2.400us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.05%      22.881us        60.95%     276.214us      92.071us       5.376us       100.00%       7.200us       2.400us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.376us       100.00%       5.376us       1.792us             3  
                                Activity Buffer Request        20.71%      93.851us        20.71%      93.851us      93.851us       1.824us        33.93%       1.824us       1.824us             1  
                                       aten::empty_like         1.68%       7.630us         6.27%      28.391us       9.464us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.58%      20.761us         4.58%      20.761us       6.920us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.19%     159.482us        35.19%     159.482us      53.161us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.12%       5.070us         1.12%       5.070us       5.070us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 453.167us
Self CUDA time total: 5.376us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     121.887us       696.30%     121.887us     121.887us             1  
                               hf_kernels_causal_conv1d         3.44%      74.640us        99.77%       2.162ms       2.162ms       0.000us         0.00%      23.361us      23.361us             1  
                                         CausalConv1dFn         3.19%      69.031us        96.32%       2.087ms     695.668us       0.000us         0.00%      23.361us       7.787us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.10%      23.730us        91.78%       1.989ms     662.904us      17.505us       100.00%      23.361us       7.787us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.505us       100.00%      17.505us       5.835us             3  
                                Activity Buffer Request        82.75%       1.793ms        82.75%       1.793ms       1.793ms       5.856us        33.45%       5.856us       5.856us             1  
                                       aten::empty_like         0.40%       8.582us         1.35%      29.262us       9.754us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.95%      20.680us         0.95%      20.680us       6.893us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.94%     172.113us         7.94%     172.113us      57.371us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.23%       5.069us         0.23%       5.069us       5.069us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.167ms
Self CUDA time total: 17.505us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     119.997us       664.91%     119.997us     119.997us             1  
                               hf_kernels_causal_conv1d        16.46%      76.510us        98.91%     459.857us     459.857us       0.000us         0.00%      24.063us      24.063us             1  
                                         CausalConv1dFn        14.99%      69.691us        82.45%     383.347us     127.782us       0.000us         0.00%      24.063us       8.021us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.12%      23.810us        61.53%     286.094us      95.365us      18.047us       100.00%      24.063us       8.021us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      18.047us       100.00%      18.047us       6.016us             3  
                                Activity Buffer Request        22.64%     105.271us        22.64%     105.271us     105.271us       6.016us        33.34%       6.016us       6.016us             1  
                                       aten::empty_like         1.59%       7.411us         5.93%      27.562us       9.187us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.33%      20.151us         4.33%      20.151us       6.717us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        33.77%     157.013us        33.77%     157.013us      52.338us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.09%       5.080us         1.09%       5.080us       5.080us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 464.937us
Self CUDA time total: 18.047us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     125.983us       701.78%     125.983us     125.983us             1  
                               hf_kernels_causal_conv1d         3.62%      75.400us        99.76%       2.076ms       2.076ms       0.000us         0.00%      23.968us      23.968us             1  
                                         CausalConv1dFn         3.51%      72.963us        96.14%       2.001ms     667.008us       0.000us         0.00%      23.968us       7.989us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.17%      24.320us        91.19%       1.898ms     632.703us      17.952us       100.00%      23.968us       7.989us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.952us       100.00%      17.952us       5.984us             3  
                                Activity Buffer Request        82.20%       1.711ms        82.20%       1.711ms       1.711ms       6.016us        33.51%       6.016us       6.016us             1  
                                       aten::empty_like         0.41%       8.499us         1.44%      29.950us       9.983us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.03%      21.451us         1.03%      21.451us       7.150us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.83%     162.893us         7.83%     162.893us      54.298us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.24%       4.969us         0.24%       4.969us       4.969us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.081ms
Self CUDA time total: 17.952us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     119.901us       639.40%     119.901us     119.901us             1  
                               hf_kernels_causal_conv1d        11.47%      73.600us        99.21%     636.820us     636.820us       0.000us         0.00%      25.088us      25.088us             1  
                                         CausalConv1dFn        11.28%      72.380us        87.74%     563.220us     187.740us       0.000us         0.00%      25.088us       8.363us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.65%      23.431us        72.11%     462.887us     154.296us      18.752us       100.00%      25.088us       8.363us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      18.752us       100.00%      18.752us       6.251us             3  
                                Activity Buffer Request        43.62%     280.014us        43.62%     280.014us     280.014us       6.336us        33.79%       6.336us       6.336us             1  
                                       aten::empty_like         1.22%       7.832us         4.35%      27.953us       9.318us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.13%      20.121us         3.13%      20.121us       6.707us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        24.84%     159.442us        24.84%     159.442us      53.147us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.79%       5.080us         0.79%       5.080us       5.080us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 641.900us
Self CUDA time total: 18.752us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d        11.42%      73.310us        99.16%     636.780us     636.780us       0.000us         0.00%     162.591us     162.591us             1  
                                         CausalConv1dFn        11.12%      71.382us        87.74%     563.470us     187.823us       0.000us         0.00%     162.591us      54.197us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.58%      22.989us        72.14%     463.287us     154.429us      97.631us       100.00%     162.591us      54.197us             3  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     130.208us       133.37%     130.208us     130.208us             1  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      97.631us       100.00%      97.631us      32.544us             3  
                                Activity Buffer Request        43.38%     278.604us        43.38%     278.604us     278.604us      64.960us        66.54%      64.960us      64.960us             1  
                                       aten::empty_like         1.24%       7.950us         4.48%      28.801us       9.600us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.25%      20.851us         3.25%      20.851us       6.950us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        25.18%     161.694us        25.18%     161.694us      53.898us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.84%       5.420us         0.84%       5.420us       5.420us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 642.200us
Self CUDA time total: 97.631us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d        13.89%      72.060us        98.98%     513.378us     513.378us       0.000us         0.00%     163.263us     163.263us             1  
                                         CausalConv1dFn        13.96%      72.421us        85.08%     441.318us     147.106us       0.000us         0.00%     163.263us      54.421us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.45%      23.099us        65.49%     339.676us     113.225us      98.623us       100.00%     163.263us      54.421us             3  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     130.111us       131.93%     130.111us     130.111us             1  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      98.623us       100.00%      98.623us      32.874us             3  
                                Activity Buffer Request        30.19%     156.612us        30.19%     156.612us     156.612us      64.640us        65.54%      64.640us      64.640us             1  
                                       aten::empty_like         1.62%       8.391us         5.63%      29.221us       9.740us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.02%      20.830us         4.02%      20.830us       6.943us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        30.84%     159.965us        30.84%     159.965us      53.322us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.02%       5.310us         1.02%       5.310us       5.310us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 518.688us
Self CUDA time total: 98.623us
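
Note on reading the traces: the `Self CUDA %` column exceeds 100% on the `hf_kernels_causal_conv1d` annotation row. The percentages appear to be taken relative to `Self CUDA time total`, which matches the kernel row alone and does not include the annotation range. The following is a minimal sketch of that arithmetic, using values copied from the `cuda_B4_D2048_S2048_W4` trace; the helper names are hypothetical and are not part of the benchmark script.

```python
# Values copied from the cuda_B4_D2048_S2048_W4 trace, in microseconds.
SELF_CUDA_TOTAL_US = 98.623  # "Self CUDA time total" line of that trace


def self_cuda_percent(self_cuda_us, total_us=SELF_CUDA_TOTAL_US):
    """Share of the reported Self CUDA total, rounded to two decimals like the table."""
    return round(100.0 * self_cuda_us / total_us, 2)


def cuda_time_avg(cuda_total_us, calls):
    """Per-call average, rounded to three decimals like the 'CUDA time avg' column."""
    return round(cuda_total_us / calls, 3)


if __name__ == "__main__":
    print("annotation range  :", self_cuda_percent(130.111), "%")
    print("activity buffer   :", self_cuda_percent(64.640), "%")
    print("conv1d fwd kernel :", self_cuda_percent(98.623), "%")
    print("kernel avg (3 calls):", cuda_time_avg(98.623, 3), "us")
```

Under this assumption the sketch reproduces the reported 131.93%, 65.54%, and 100.00% entries and the 32.874us per-call kernel average, so a value above 100% reflects the choice of denominator rather than a measurement error.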


impl                      wl                      p50(ms)  ok
hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W2     0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W4     0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S128_W2        0.04  True
hf_kernels_causal_conv1d  cuda_B2_D64_S128_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W2       0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W4       0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S512_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S512_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W2     0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W4     0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S128_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S128_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W2       0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W4       0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S512_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S512_W4        0.05  True
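
The workload names in the table are sorted as strings, so `S2048` lands before `S512`. The sketch below splits a name into numeric fields so rows can be ordered by size instead. It assumes the naming scheme `<device>_B<batch>_D<dim>_S<seqlen>_W<width>`; that reading of the letters is inferred from the names and is not confirmed by the benchmark script, and `Workload` and `parse_workload` are hypothetical helpers.

```python
import re
from typing import NamedTuple


class Workload(NamedTuple):
    device: str
    batch: int   # assumed meaning of "B"
    dim: int     # assumed meaning of "D"
    seqlen: int  # assumed meaning of "S"
    width: int   # assumed meaning of "W"


_WL_PATTERN = re.compile(
    r"^(?P<device>[a-z]+)_B(?P<batch>\d+)_D(?P<dim>\d+)"
    r"_S(?P<seqlen>\d+)_W(?P<width>\d+)$"
)


def parse_workload(name: str) -> Workload:
    """Split a workload name such as 'cuda_B4_D2048_S512_W2' into typed fields."""
    match = _WL_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized workload name: {name!r}")
    return Workload(
        device=match["device"],
        batch=int(match["batch"]),
        dim=int(match["dim"]),
        seqlen=int(match["seqlen"]),
        width=int(match["width"]),
    )


if __name__ == "__main__":
    names = [
        "cuda_B2_D64_S2048_W2",
        "cuda_B2_D64_S512_W2",
        "cuda_B2_D64_S128_W2",
    ]
    # Tuples compare field by field, so this orders by the numeric values.
    for name in sorted(names, key=parse_workload):
        print(name, parse_workload(name))
```

Sorting with `key=parse_workload` orders the three sample names as S128, S512, S2048, which makes it easier to compare timings as the sequence length grows.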

Artifacts:

causal_conv1d.jsonl