HF Kernels - Causal Conv1D

GPU Info

Cell: nv | 0.21s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Fri Oct 31 20:00:25 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   33C    P0             79W /  350W |       0MiB /  46068MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Causal Conv1D Benchmark
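
For orientation, the operation being timed can be sketched in plain Python. This is a minimal reference under stated assumptions, not the kernel's implementation: it assumes a depthwise layout with input of shape (batch, dim, seqlen), weight of shape (dim, width) and bias of shape (dim,), left zero-padding by width - 1, and no activation. The name `causal_conv1d_reference` is hypothetical.

```python
def causal_conv1d_reference(x, weight, bias=None):
    """Depthwise causal 1D convolution on nested lists (illustrative only).

    Assumed layouts (not taken from the benchmark tooling):
      x:      [batch][dim][seqlen]
      weight: [dim][width]
      bias:   [dim] or None
    """
    out = []
    for sample in x:
        out_sample = []
        for d, channel in enumerate(sample):
            taps = weight[d]
            width = len(taps)
            b = bias[d] if bias is not None else 0.0
            row = []
            for t in range(len(channel)):
                acc = b
                for k, w in enumerate(taps):
                    src = t - (width - 1) + k  # never reaches past position t
                    if src >= 0:  # positions before the start act as zeros
                        acc += w * channel[src]
                row.append(acc)
            out_sample.append(row)
        out.append(out_sample)
    return out


x = [[[1.0, 2.0, 3.0, 4.0]]]  # batch=1, dim=1, seqlen=4
weight = [[10.0, 1.0]]        # width=2: the last tap multiplies the current step
y = causal_conv1d_reference(x, weight, bias=[0.5])
print(y)  # [[[1.5, 12.5, 23.5, 34.5]]]
```

Each output position depends only on the current and earlier inputs, which is what makes the convolution causal.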

Cell: benchmark | 9.11s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the causal conv1d kernel from the Hugging Face Hub
causal_conv1d = get_kernel("kernels-community/causal-conv1d")


def hf_kernels_causal_conv1d(input_tensor, weight, bias):
    """Forward the benchmark inputs to the Hub kernel's causal_conv1d_fn."""
    return causal_conv1d.causal_conv1d_fn(input_tensor, weight, bias)


run_benchmark(
    kernel_type=KernelTypeEnum.CAUSAL_CONV1D,
    impl_name="hf_kernels_causal_conv1d",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_causal_conv1d,
)
Running causal_conv1d benchmark on cuda with 24 workloads.
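
The workload labels in the traces below, such as `cuda_B2_D64_S128_W2`, appear to encode the device plus four sizes. The sketch below decodes them, assuming B is the batch size, D the channel dimension, S the sequence length and W the filter width. That reading is inferred from the labels, not confirmed by the benchmark tooling, and `parse_workload` is a hypothetical helper.

```python
import re

# Assumed label layout: <device>_B<batch>_D<dim>_S<seqlen>_W<width>
_LABEL = re.compile(
    r"^(?P<device>[a-z]+)_B(?P<batch>\d+)_D(?P<dim>\d+)"
    r"_S(?P<seqlen>\d+)_W(?P<width>\d+)$"
)


def parse_workload(label):
    """Split a workload label into its device name and integer sizes."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized workload label: {label!r}")
    fields = match.groupdict()
    device = fields.pop("device")
    return {"device": device, **{name: int(value) for name, value in fields.items()}}


print(parse_workload("cuda_B2_D64_S128_W2"))
# {'device': 'cuda', 'batch': 2, 'dim': 64, 'seqlen': 128, 'width': 2}
```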

======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     180.703us      4446.43%     180.703us     180.703us             1  
                               hf_kernels_causal_conv1d         8.48%     160.534us        99.62%       1.886ms       1.886ms       0.000us         0.00%       5.504us       5.504us             1  
                                         CausalConv1dFn         6.47%     122.423us        91.15%       1.726ms     575.261us       0.000us         0.00%       5.504us       1.835us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.51%      28.612us        80.84%       1.531ms     510.207us       4.064us       100.00%       5.504us       1.835us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.064us       100.00%       4.064us       1.355us             3  
                                Activity Buffer Request        76.71%       1.452ms        76.71%       1.452ms       1.452ms       1.440us        35.43%       1.440us       1.440us             1  
                                       aten::empty_like         1.07%      20.220us         3.84%      72.741us      24.247us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.77%      52.521us         2.77%      52.521us      17.507us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.62%      49.571us         2.62%      49.571us      16.524us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.38%       7.101us         0.38%       7.101us       7.101us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.893ms
Self CUDA time total: 4.064us
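
The Self CUDA % values above 100% in this table follow from the denominator. Each row's Self CUDA time looks to be divided by the reported "Self CUDA time total", which here equals the kernel time alone (4.064us) and does not include the 180.703us recorded on the top `hf_kernels_causal_conv1d` row. The short check below reproduces the figures of this first trace. The numbers are copied from the table; the explanation of how the profiler attributes time is an assumption.

```python
# Values copied from the cuda_B2_D64_S128_W2 trace (microseconds).
self_cuda_total_us = 4.064        # "Self CUDA time total"
top_row_self_cuda_us = 180.703    # first hf_kernels_causal_conv1d row
kernel_self_cuda_us = 4.064       # causal_conv1d_fwd_kernel row
activity_buffer_us = 1.440        # Activity Buffer Request row
kernel_calls = 3


def pct_of_total(value_us, total_us):
    """Percentage of the reported Self CUDA total."""
    return 100.0 * value_us / total_us


top_row_pct = pct_of_total(top_row_self_cuda_us, self_cuda_total_us)
buffer_pct = pct_of_total(activity_buffer_us, self_cuda_total_us)
avg_kernel_us = kernel_self_cuda_us / kernel_calls
cuda_total_us = kernel_self_cuda_us + activity_buffer_us

print(f"{top_row_pct:.2f}%")     # 4446.43%
print(f"{buffer_pct:.2f}%")      # 35.43%
print(f"{avg_kernel_us:.3f}us")  # 1.355us
print(f"{cuda_total_us:.3f}us")  # 5.504us
```

Read this way, the per-call kernel time (about 1.355us here) is the figure to compare across workloads, while the Self CPU column is dominated by one-off profiler and launch overhead.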



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     125.791us      3331.33%     125.791us     125.791us             1  
                               hf_kernels_causal_conv1d         5.58%      96.392us        99.64%       1.721ms       1.721ms       0.000us         0.00%       5.056us       5.056us             1  
                                         CausalConv1dFn         4.40%      76.074us        94.06%       1.625ms     541.671us       0.000us         0.00%       5.056us       1.685us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.52%      26.231us        87.95%       1.519ms     506.473us       3.776us       100.00%       5.056us       1.685us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.776us       100.00%       3.776us       1.259us             3  
                                Activity Buffer Request        84.56%       1.461ms        84.56%       1.461ms       1.461ms       1.280us        33.90%       1.280us       1.280us             1  
                                       aten::empty_like         0.44%       7.590us         1.71%      29.520us       9.840us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.27%      21.930us         1.27%      21.930us       7.310us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.87%      32.290us         1.87%      32.290us      10.763us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.36%       6.200us         0.36%       6.200us       6.200us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.728ms
Self CUDA time total: 3.776us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     125.758us      3330.46%     125.758us     125.758us             1  
                               hf_kernels_causal_conv1d         5.23%      90.742us        99.66%       1.729ms       1.729ms       0.000us         0.00%       5.056us       5.056us             1  
                                         CausalConv1dFn         4.39%      76.092us        94.43%       1.638ms     546.081us       0.000us         0.00%       5.056us       1.685us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.50%      26.031us        88.31%       1.532ms     510.660us       3.776us       100.00%       5.056us       1.685us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.776us       100.00%       3.776us       1.259us             3  
                                Activity Buffer Request        84.98%       1.474ms        84.98%       1.474ms       1.474ms       1.280us        33.90%       1.280us       1.280us             1  
                                       aten::empty_like         0.47%       8.201us         1.74%      30.171us      10.057us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.27%      21.970us         1.27%      21.970us       7.323us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.83%      31.671us         1.83%      31.671us      10.557us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.34%       5.850us         0.34%       5.850us       5.850us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.735ms
Self CUDA time total: 3.776us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     127.584us      3350.42%     127.584us     127.584us             1  
                               hf_kernels_causal_conv1d         4.53%      88.983us        99.75%       1.962ms       1.962ms       0.000us         0.00%       5.088us       5.088us             1  
                                         CausalConv1dFn         3.93%      77.252us        95.23%       1.873ms     624.219us       0.000us         0.00%       5.088us       1.696us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.36%      26.710us        89.83%       1.766ms     588.805us       3.808us       100.00%       5.088us       1.696us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.808us       100.00%       3.808us       1.269us             3  
                                Activity Buffer Request        74.34%       1.462ms        74.34%       1.462ms       1.462ms       1.280us        33.61%       1.280us       1.280us             1  
                                       aten::empty_like         0.41%       8.060us         1.47%      28.990us       9.663us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.06%      20.930us         1.06%      20.930us       6.977us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        14.13%     277.777us        14.13%     277.777us      92.592us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.25%       4.831us         0.25%       4.831us       4.831us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.966ms
Self CUDA time total: 3.808us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     126.686us      2639.84%     126.686us     126.686us             1  
                               hf_kernels_causal_conv1d         4.55%      87.622us        99.73%       1.920ms       1.920ms       0.000us         0.00%       6.430us       6.430us             1  
                                         CausalConv1dFn         3.92%      75.482us        95.18%       1.832ms     610.789us       0.000us         0.00%       6.430us       2.143us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.44%      27.663us        89.66%       1.726ms     575.372us       4.799us       100.00%       6.430us       2.143us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.799us       100.00%       4.799us       1.600us             3  
                                Activity Buffer Request        74.49%       1.434ms        74.49%       1.434ms       1.434ms       1.631us        33.99%       1.631us       1.631us             1  
                                       aten::empty_like         0.42%       8.140us         1.60%      30.770us      10.257us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.18%      22.630us         1.18%      22.630us       7.543us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        13.74%     264.526us        13.74%     264.526us      88.175us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       5.120us         0.27%       5.120us       5.120us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.925ms
Self CUDA time total: 4.799us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     117.083us      2423.58%     117.083us     117.083us             1  
                               hf_kernels_causal_conv1d        12.24%      83.203us        99.28%     674.957us     674.957us       0.000us         0.00%       6.463us       6.463us             1  
                                         CausalConv1dFn        10.43%      70.911us        87.04%     591.754us     197.251us       0.000us         0.00%       6.463us       2.154us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.93%      26.710us        72.18%     490.682us     163.561us       4.831us       100.00%       6.463us       2.154us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.831us       100.00%       4.831us       1.610us             3  
                                Activity Buffer Request        32.42%     220.416us        32.42%     220.416us     220.416us       1.632us        33.78%       1.632us       1.632us             1  
                                       aten::empty_like         1.07%       7.270us         4.44%      30.161us      10.054us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.37%      22.891us         3.37%      22.891us       7.630us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.83%     243.556us        35.83%     243.556us      81.185us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.72%       4.870us         0.72%       4.870us       4.870us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 679.827us
Self CUDA time total: 4.831us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.381us      1167.35%     124.381us     124.381us             1  
                               hf_kernels_causal_conv1d         4.48%      85.542us        99.75%       1.904ms       1.904ms       0.000us         0.00%      14.271us      14.271us             1  
                                         CausalConv1dFn         3.83%      73.182us        95.27%       1.819ms     606.282us       0.000us         0.00%      14.271us       4.757us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.41%      26.960us        89.88%       1.716ms     571.988us      10.655us       100.00%      14.271us       4.757us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.655us       100.00%      10.655us       3.552us             3  
                                Activity Buffer Request        76.01%       1.451ms        76.01%       1.451ms       1.451ms       3.616us        33.94%       3.616us       3.616us             1  
                                       aten::empty_like         0.43%       8.120us         1.56%      29.700us       9.900us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.13%      21.580us         1.13%      21.580us       7.193us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        12.45%     237.787us        12.45%     237.787us      79.262us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.25%       4.860us         0.25%       4.860us       4.860us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.909ms
Self CUDA time total: 10.655us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.652us      1120.72%     122.652us     122.652us             1  
                               hf_kernels_causal_conv1d        12.91%      86.303us        99.27%     663.588us     663.588us       0.000us         0.00%      14.624us      14.624us             1  
                                         CausalConv1dFn        10.74%      71.821us        86.36%     577.285us     192.428us       0.000us         0.00%      14.624us       4.875us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.81%      25.480us        71.21%     476.023us     158.674us      10.944us       100.00%      14.624us       4.875us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.944us       100.00%      10.944us       3.648us             3  
                                Activity Buffer Request        32.82%     219.426us        32.82%     219.426us     219.426us       3.680us        33.63%       3.680us       3.680us             1  
                                       aten::empty_like         1.14%       7.591us         4.40%      29.441us       9.814us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.27%      21.850us         3.27%      21.850us       7.283us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.57%     231.117us        34.57%     231.117us      77.039us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.73%       4.900us         0.73%       4.900us       4.900us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 668.488us
Self CUDA time total: 10.944us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     130.430us      1181.43%     130.430us     130.430us             1  
                               hf_kernels_causal_conv1d         4.23%      79.341us        99.73%       1.871ms       1.871ms       0.000us         0.00%      14.784us      14.784us             1  
                                         CausalConv1dFn         4.03%      75.521us        95.50%       1.792ms     597.206us       0.000us         0.00%      14.784us       4.928us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.43%      26.810us        89.82%       1.685ms     561.675us      11.040us       100.00%      14.784us       4.928us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      11.040us       100.00%      11.040us       3.680us             3  
                                Activity Buffer Request        77.07%       1.446ms        77.07%       1.446ms       1.446ms       3.744us        33.91%       3.744us       3.744us             1  
                                       aten::empty_like         0.44%       8.272us         1.66%      31.072us      10.357us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.22%      22.800us         1.22%      22.800us       7.600us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        11.32%     212.286us        11.32%     212.286us      70.762us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       5.130us         0.27%       5.130us       5.130us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.876ms
Self CUDA time total: 11.040us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     120.097us      1060.18%     120.097us     120.097us             1  
                               hf_kernels_causal_conv1d        13.35%      76.301us        99.17%     566.674us     566.674us       0.000us         0.00%      15.168us      15.168us             1  
                                         CausalConv1dFn        12.80%      73.153us        85.81%     490.373us     163.458us       0.000us         0.00%      15.168us       5.056us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.71%      26.911us        68.00%     388.569us     129.523us      11.328us       100.00%      15.168us       5.056us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      11.328us       100.00%      11.328us       3.776us             3  
                                Activity Buffer Request        34.49%     197.075us        34.49%     197.075us     197.075us       3.840us        33.90%       3.840us       3.840us             1  
                                       aten::empty_like         1.29%       7.379us         5.01%      28.651us       9.550us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.72%      21.272us         3.72%      21.272us       7.091us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        28.80%     164.583us        28.80%     164.583us      54.861us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.83%       4.760us         0.83%       4.760us       4.760us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 571.434us
Self CUDA time total: 11.328us
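
Note on reading the trace above: the first hf_kernels_causal_conv1d row shows a Self CUDA % well above 100%. The printed figures are consistent with each row's Self CUDA time being divided by "Self CUDA time total", which in this trace equals the kernel time alone. The short sketch below just redoes that arithmetic with numbers copied from the table. The helper name self_cuda_percent is hypothetical, and the reading of how the percentage is formed is an assumption inferred from the printed values, not taken from the profiler's source.

```python
# Hypothetical helper: reproduce the "Self CUDA %" column from the printed
# values, assuming it is self CUDA time divided by "Self CUDA time total".
def self_cuda_percent(self_cuda_us: float, total_self_cuda_us: float) -> float:
    return 100.0 * self_cuda_us / total_self_cuda_us

# Values copied from the cuda_B2_D2048_S512_W4 trace.
annotation_self_cuda_us = 120.097   # first hf_kernels_causal_conv1d row
kernel_self_cuda_us = 11.328        # causal_conv1d_fwd_kernel row (3 calls)
total_self_cuda_us = 11.328         # "Self CUDA time total"
num_calls = 3

annotation_pct = self_cuda_percent(annotation_self_cuda_us, total_self_cuda_us)
kernel_pct = self_cuda_percent(kernel_self_cuda_us, total_self_cuda_us)
per_call_us = kernel_self_cuda_us / num_calls

print(f"annotation row: {annotation_pct:.2f}%")
print(f"kernel row:     {kernel_pct:.2f}%")
print(f"kernel avg:     {per_call_us:.3f}us per call")
```

Under that assumption the annotation row comes out at about 1060.18% and the kernel averages 3.776us per call, matching the table. For comparing configurations, the kernel row's "CUDA time avg" is likely the more useful figure than the annotation row.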



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     133.919us       265.71%     133.919us     133.919us             1  
                               hf_kernels_causal_conv1d         4.38%      80.552us        99.73%       1.836ms       1.836ms       0.000us         0.00%      83.873us      83.873us             1  
                                         CausalConv1dFn         4.09%      75.353us        95.35%       1.755ms     585.145us       0.000us         0.00%      83.873us      27.958us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.33%      24.410us        89.50%       1.648ms     549.264us      50.401us       100.00%      83.873us      27.958us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      50.401us       100.00%      50.401us      16.800us             3  
                                Activity Buffer Request        79.01%       1.455ms        79.01%       1.455ms       1.455ms      33.472us        66.41%      33.472us      33.472us             1  
                                       aten::empty_like         0.45%       8.369us         1.75%      32.290us      10.763us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.30%      23.921us         1.30%      23.921us       7.974us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.17%     168.764us         9.17%     168.764us      56.255us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       5.020us         0.27%       5.020us       5.020us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.841ms
Self CUDA time total: 50.401us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     131.005us       256.03%     131.005us     131.005us             1  
                               hf_kernels_causal_conv1d        11.69%      77.241us        99.25%     655.717us     655.717us       0.000us         0.00%      85.534us      85.534us             1  
                                         CausalConv1dFn        10.97%      72.503us        87.56%     578.476us     192.825us       0.000us         0.00%      85.534us      28.511us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.89%      25.692us        71.76%     474.103us     158.034us      51.167us       100.00%      85.534us      28.511us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      51.167us       100.00%      51.167us      17.056us             3  
                                Activity Buffer Request        43.08%     284.587us        43.08%     284.587us     284.587us      34.367us        67.17%      34.367us      34.367us             1  
                                       aten::empty_like         1.14%       7.549us         4.82%      31.870us      10.623us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.68%      24.321us         3.68%      24.321us       8.107us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        24.80%     163.824us        24.80%     163.824us      54.608us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.75%       4.929us         0.75%       4.929us       4.929us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 660.646us
Self CUDA time total: 51.167us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     118.686us      3040.89%     118.686us     118.686us             1  
                               hf_kernels_causal_conv1d        11.60%      73.750us        99.24%     631.216us     631.216us       0.000us         0.00%       5.183us       5.183us             1  
                                         CausalConv1dFn        11.30%      71.845us        87.65%     557.466us     185.822us       0.000us         0.00%       5.183us       1.728us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.22%      26.861us        71.87%     457.101us     152.367us       3.903us       100.00%       5.183us       1.728us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.903us       100.00%       3.903us       1.301us             3  
                                Activity Buffer Request        42.38%     269.577us        42.38%     269.577us     269.577us       1.280us        32.80%       1.280us       1.280us             1  
                                       aten::empty_like         1.23%       7.810us         4.48%      28.520us       9.507us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.26%      20.710us         3.26%      20.710us       6.903us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        25.26%     160.663us        25.26%     160.663us      53.554us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.76%       4.821us         0.76%       4.821us       4.821us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 636.037us
Self CUDA time total: 3.903us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     120.221us      3029.76%     120.221us     120.221us             1  
                               hf_kernels_causal_conv1d        13.01%      75.082us        99.09%     571.775us     571.775us       0.000us         0.00%       5.248us       5.248us             1  
                                         CausalConv1dFn        12.35%      71.241us        86.08%     496.693us     165.564us       0.000us         0.00%       5.248us       1.749us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.88%      28.181us        68.58%     395.720us     131.907us       3.968us       100.00%       5.248us       1.749us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.968us       100.00%       3.968us       1.323us             3  
                                Activity Buffer Request        36.26%     209.246us        36.26%     209.246us     209.246us       1.280us        32.26%       1.280us       1.280us             1  
                                       aten::empty_like         1.42%       8.172us         5.15%      29.732us       9.911us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.74%      21.560us         3.74%      21.560us       7.187us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        27.43%     158.293us        27.43%     158.293us      52.764us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.91%       5.270us         0.91%       5.270us       5.270us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 577.045us
Self CUDA time total: 3.968us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     117.374us      2843.36%     117.374us     117.374us             1  
                               hf_kernels_causal_conv1d        14.38%      74.792us        98.97%     514.843us     514.843us       0.000us         0.00%       5.504us       5.504us             1  
                                         CausalConv1dFn        13.25%      68.940us        84.59%     440.051us     146.684us       0.000us         0.00%       5.504us       1.835us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.99%      25.981us        65.51%     340.779us     113.593us       4.128us       100.00%       5.504us       1.835us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.128us       100.00%       4.128us       1.376us             3  
                                Activity Buffer Request        29.84%     155.214us        29.84%     155.214us     155.214us       1.376us        33.33%       1.376us       1.376us             1  
                                       aten::empty_like         1.55%       8.080us         5.83%      30.332us      10.111us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.28%      22.252us         4.28%      22.252us       7.417us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        30.68%     159.584us        30.68%     159.584us      53.195us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.03%       5.380us         1.03%       5.380us       5.380us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 520.223us
Self CUDA time total: 4.128us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     116.831us      2875.49%     116.831us     116.831us             1  
                               hf_kernels_causal_conv1d        13.78%      75.282us        99.09%     541.484us     541.484us       0.000us         0.00%       5.439us       5.439us             1  
                                         CausalConv1dFn        12.58%      68.741us        85.32%     466.202us     155.401us       0.000us         0.00%       5.439us       1.813us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.76%      26.021us        67.34%     367.980us     122.660us       4.063us       100.00%       5.439us       1.813us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.063us       100.00%       4.063us       1.354us             3  
                                Activity Buffer Request        33.52%     183.175us        33.52%     183.175us     183.175us       1.376us        33.87%       1.376us       1.376us             1  
                                       aten::empty_like         1.37%       7.489us         5.40%      29.481us       9.827us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.02%      21.992us         4.02%      21.992us       7.331us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        29.06%     158.784us        29.06%     158.784us      52.928us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.91%       4.951us         0.91%       4.951us       4.951us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 546.435us
Self CUDA time total: 4.063us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     119.806us      2228.53%     119.806us     119.806us             1  
                               hf_kernels_causal_conv1d        11.93%      76.073us        99.21%     632.507us     632.507us       0.000us         0.00%       7.200us       7.200us             1  
                                         CausalConv1dFn        11.21%      71.480us        87.28%     556.434us     185.478us       0.000us         0.00%       7.200us       2.400us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.13%      26.361us        71.46%     455.612us     151.871us       5.376us       100.00%       7.200us       2.400us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.376us       100.00%       5.376us       1.792us             3  
                                Activity Buffer Request        42.49%     270.867us        42.49%     270.867us     270.867us       1.824us        33.93%       1.824us       1.824us             1  
                                       aten::empty_like         1.24%       7.892us         4.60%      29.342us       9.781us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.36%      21.450us         3.36%      21.450us       7.150us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        24.84%     158.384us        24.84%     158.384us      52.795us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.79%       5.050us         0.79%       5.050us       5.050us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 637.557us
Self CUDA time total: 5.376us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     119.676us      2174.35%     119.676us     119.676us             1  
                               hf_kernels_causal_conv1d        14.25%      74.352us        99.01%     516.513us     516.513us       0.000us         0.00%       7.392us       7.392us             1  
                                         CausalConv1dFn        14.02%      73.122us        84.76%     442.161us     147.387us       0.000us         0.00%       7.392us       2.464us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.04%      26.281us        65.18%     340.038us     113.346us       5.504us       100.00%       7.392us       2.464us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.504us       100.00%       5.504us       1.835us             3  
                                Activity Buffer Request        30.19%     157.524us        30.19%     157.524us     157.524us       1.888us        34.30%       1.888us       1.888us             1  
                                       aten::empty_like         1.50%       7.800us         5.56%      29.001us       9.667us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.06%      21.201us         4.06%      21.201us       7.067us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        29.95%     156.233us        29.95%     156.233us      52.078us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.99%       5.180us         0.99%       5.180us       5.180us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 521.693us
Self CUDA time total: 5.504us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.798us       715.63%     124.798us     124.798us             1  
                               hf_kernels_causal_conv1d        11.85%      75.293us        99.15%     630.167us     630.167us       0.000us         0.00%      23.295us      23.295us             1  
                                         CausalConv1dFn        11.06%      70.310us        87.30%     554.874us     184.958us       0.000us         0.00%      23.295us       7.765us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.18%      26.540us        71.39%     453.732us     151.244us      17.439us       100.00%      23.295us       7.765us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.439us       100.00%      17.439us       5.813us             3  
                                Activity Buffer Request        42.20%     268.237us        42.20%     268.237us     268.237us       5.856us        33.58%       5.856us       5.856us             1  
                                       aten::empty_like         1.25%       7.951us         4.85%      30.832us      10.277us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.60%      22.881us         3.60%      22.881us       7.627us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        25.01%     158.955us        25.01%     158.955us      52.985us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.85%       5.410us         0.85%       5.410us       5.410us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 635.577us
Self CUDA time total: 17.439us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.252us       695.89%     124.252us     124.252us             1  
                               hf_kernels_causal_conv1d        15.28%      76.213us        99.04%     494.053us     494.053us       0.000us         0.00%      23.839us      23.839us             1  
                                         CausalConv1dFn        14.60%      72.841us        83.76%     417.840us     139.280us       0.000us         0.00%      23.839us       7.946us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.38%      26.851us        63.27%     315.607us     105.202us      17.855us       100.00%      23.839us       7.946us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.855us       100.00%      17.855us       5.952us             3  
                                Activity Buffer Request        26.40%     131.703us        26.40%     131.703us     131.703us       5.984us        33.51%       5.984us       5.984us             1  
                                       aten::empty_like         1.62%       8.090us         5.89%      29.392us       9.797us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.27%      21.302us         4.27%      21.302us       7.101us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        31.48%     157.053us        31.48%     157.053us      52.351us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.96%       4.810us         0.96%       4.810us       4.810us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 498.863us
Self CUDA time total: 17.855us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.253us       695.94%     124.253us     124.253us             1  
                               hf_kernels_causal_conv1d        14.09%      92.581us        99.22%     652.096us     652.096us       0.000us         0.00%      23.838us      23.838us             1  
                                         CausalConv1dFn        11.45%      75.254us        85.13%     559.515us     186.505us       0.000us         0.00%      23.838us       7.946us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.84%      25.251us        69.30%     455.481us     151.827us      17.854us       100.00%      23.838us       7.946us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.854us       100.00%      17.854us       5.951us             3  
                                Activity Buffer Request        41.42%     272.247us        41.42%     272.247us     272.247us       5.984us        33.52%       5.984us       5.984us             1  
                                       aten::empty_like         1.19%       7.849us         4.38%      28.780us       9.593us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.18%      20.931us         3.18%      20.931us       6.977us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        24.04%     157.983us        24.04%     157.983us      52.661us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.78%       5.140us         0.78%       5.140us       5.140us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 657.236us
Self CUDA time total: 17.854us
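
Note on reading the trace: the "Self CUDA %" column can exceed 100% (695.94% for the `hf_kernels_causal_conv1d` range). Judging from the numbers alone, percentages appear to be taken against "Self CUDA time total", which here equals the kernel row (17.854us), while the annotation range covers a longer stretch of the GPU timeline. The following minimal sketch re-derives a few figures from the cuda_B4_D2048_S512_W2 trace under that assumption; the helper names `pct_of_total` and `avg_per_call` are hypothetical and not part of the benchmark script.

```python
# Values copied from the cuda_B4_D2048_S512_W2 trace (microseconds).
SELF_CUDA_TOTAL_US = 17.854    # "Self CUDA time total"
KERNEL_CALLS = 3               # "# of Calls" for causal_conv1d_fwd_kernel
ANNOTATION_RANGE_US = 124.253  # Self CUDA of the hf_kernels_causal_conv1d range
ACTIVITY_BUFFER_US = 5.984     # Self CUDA of "Activity Buffer Request"


def pct_of_total(self_cuda_us: float, total_us: float = SELF_CUDA_TOTAL_US) -> float:
    """Percentage of a row's Self CUDA time relative to the reported total."""
    return 100.0 * self_cuda_us / total_us


def avg_per_call(total_us: float, calls: int) -> float:
    """Average time per call, as in the "CUDA time avg" column."""
    return total_us / calls


if __name__ == "__main__":
    print(f"kernel avg per call: {avg_per_call(SELF_CUDA_TOTAL_US, KERNEL_CALLS):.3f}us")
    print(f"annotation range:    {pct_of_total(ANNOTATION_RANGE_US):.2f}%")
    print(f"activity buffer:     {pct_of_total(ACTIVITY_BUFFER_US):.2f}%")
```

Under this reading, the per-kernel figure to compare across workloads is the "CUDA time avg" of the `causal_conv1d_fwd_kernel` row, not the annotation range.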



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     121.982us       651.61%     121.982us     121.982us             1  
                               hf_kernels_causal_conv1d        16.26%      76.273us        99.00%     464.343us     464.343us       0.000us         0.00%      25.088us      25.088us             1  
                                         CausalConv1dFn        15.20%      71.302us        82.74%     388.070us     129.357us       0.000us         0.00%      25.088us       8.363us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.49%      25.750us        61.15%     286.808us      95.603us      18.720us       100.00%      25.088us       8.363us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      18.720us       100.00%      18.720us       6.240us             3  
                                Activity Buffer Request        22.13%     103.813us        22.13%     103.813us     103.813us       6.368us        34.02%       6.368us       6.368us             1  
                                       aten::empty_like         1.75%       8.210us         6.39%      29.960us       9.987us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.64%      21.750us         4.64%      21.750us       7.250us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        33.53%     157.245us        33.53%     157.245us      52.415us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.00%       4.680us         1.00%       4.680us       4.680us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 469.023us
Self CUDA time total: 18.720us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         4.40%      80.973us        99.73%       1.837ms       1.837ms       0.000us         0.00%     162.749us     162.749us             1  
                                         CausalConv1dFn         4.14%      76.301us        95.33%       1.756ms     585.285us       0.000us         0.00%     162.749us      54.250us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.45%      26.730us        89.50%       1.648ms     549.474us      97.918us       100.00%     162.749us      54.250us             3  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     141.950us       144.97%     141.950us     141.950us             1  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      97.918us       100.00%      97.918us      32.639us             3  
                                Activity Buffer Request        78.99%       1.455ms        78.99%       1.455ms       1.455ms      64.831us        66.21%      64.831us      64.831us             1  
                                       aten::empty_like         0.45%       8.340us         1.69%      31.131us      10.377us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.24%      22.791us         1.24%      22.791us       7.597us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.06%     166.885us         9.06%     166.885us      55.628us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       4.980us         0.27%       4.980us       4.980us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.842ms
Self CUDA time total: 97.918us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d        16.07%      76.871us        98.94%     473.172us     473.172us       0.000us         0.00%     163.803us     163.803us             1  
                                         CausalConv1dFn        14.96%      71.532us        82.87%     396.301us     132.100us       0.000us         0.00%     163.803us      54.601us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.75%      27.501us        61.56%     294.418us      98.139us      98.685us       100.00%     163.803us      54.601us             3  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     133.180us       134.95%     133.180us     133.180us             1  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      98.685us       100.00%      98.685us      32.895us             3  
                                Activity Buffer Request        21.65%     103.543us        21.65%     103.543us     103.543us      65.118us        65.99%      65.118us      65.118us             1  
                                       aten::empty_like         1.52%       7.251us         6.35%      30.351us      10.117us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.83%      23.100us         4.83%      23.100us       7.700us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.16%     163.374us        34.16%     163.374us      54.458us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.06%       5.061us         1.06%       5.061us       5.061us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 478.233us
Self CUDA time total: 98.685us


impl                      wl                      p50(ms)  ok
hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W2     0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W4     0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S128_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S128_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W2       0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W4       0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S512_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S512_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W2     0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W4     0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S128_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S128_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W2       0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W4       0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S512_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S512_W4        0.05  True
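
The workload names in the summary pack the shape parameters into a single string, and the rows are listed in lexicographic order (so S2048 sorts before S512). The sketch below splits a name into its fields and sorts numerically instead. It assumes that B, D, S and W stand for batch size, channel dimension, sequence length and convolution width; this section does not spell the letters out, and `Workload` and `parse_workload` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple


class Workload(NamedTuple):
    device: str
    batch: int   # assumed meaning of "B"
    dim: int     # assumed meaning of "D"
    seqlen: int  # assumed meaning of "S"
    width: int   # assumed meaning of "W"


# Matches names such as "cuda_B4_D2048_S512_W2".
_WL_RE = re.compile(
    r"^(?P<device>[a-z]+)_B(?P<b>\d+)_D(?P<d>\d+)_S(?P<s>\d+)_W(?P<w>\d+)$"
)


def parse_workload(name: str) -> Workload:
    """Split a workload name into typed fields; raise on unexpected formats."""
    m = _WL_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized workload name: {name!r}")
    return Workload(m["device"], int(m["b"]), int(m["d"]), int(m["s"]), int(m["w"]))


if __name__ == "__main__":
    names = ["cuda_B2_D64_S128_W2", "cuda_B2_D64_S2048_W2", "cuda_B2_D64_S512_W2"]
    # Tuples compare field by field, so this orders by sequence length numerically.
    for wl in sorted(parse_workload(n) for n in names):
        print(wl)
```

Note that p50 is printed with two decimals in the summary, so differences between workloads smaller than 0.005 ms are not visible there; the profile traces give the finer-grained kernel times.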

UV Install Logs
Fetching 11 files: 100% | 11/11 [00:01<00:00, 7.98it/s]

Artifacts:

causal_conv1d.jsonl