HF Kernels - SwiGLU Activation

GPU Info

Cell: nv | 0.26s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Fri Oct 31 20:00:17 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   33C    P0            108W /  350W |       0MiB /  46068MiB |     88%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

SwiGLU Benchmark

Cell: benchmark | 4.19s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the activation kernel
activation = get_kernel("kernels-community/activation")


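def hf_kernels_swiglu(input_tensor):
    # The last dimension packs two halves, so the output is half as wide.
    hidden_dim = input_tensor.shape[-1] // 2
    out_shape = input_tensor.shape[:-1] + (hidden_dim,)
    # Preallocate the output buffer; the kernel writes its result into `out`.
    out = torch.empty(out_shape, dtype=input_tensor.dtype, device=input_tensor.device)
    return activation.silu_and_mul(out, input_tensor)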


run_benchmark(
    kernel_type=KernelTypeEnum.ACTIVATION,
    impl_name="hf_kernels_swiglu",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_swiglu,
)
Running activation benchmark on cuda with 9 workloads.

======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us     105.055us      2585.65%     105.055us     105.055us             1  
                                      hf_kernels_swiglu        11.41%     202.714us        99.64%       1.770ms       1.770ms       0.000us         0.00%       5.471us       5.471us             1  
                      _activation_beeaae6::silu_and_mul         1.18%      21.050us        84.47%       1.501ms     500.190us       4.063us       100.00%       5.471us       1.824us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.063us       100.00%       4.063us       1.354us             3  
                                Activity Buffer Request        80.70%       1.434ms        80.70%       1.434ms       1.434ms       1.408us        34.65%       1.408us       1.408us             1  
                                            aten::empty         3.76%      66.772us         3.76%      66.772us      22.257us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.58%      45.872us         2.58%      45.872us      15.291us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.36%       6.420us         0.36%       6.420us       6.420us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.776ms
Self CUDA time total: 4.063us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      61.119us      1540.69%      61.119us      61.119us             1  
                                      hf_kernels_swiglu         6.50%     104.811us        99.67%       1.607ms       1.607ms       0.000us         0.00%       5.279us       5.279us             1  
                      _activation_beeaae6::silu_and_mul         1.26%      20.331us        91.95%       1.482ms     494.073us       3.967us       100.00%       5.279us       1.760us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.967us       100.00%       3.967us       1.322us             3  
                                Activity Buffer Request        89.13%       1.437ms        89.13%       1.437ms       1.437ms       1.312us        33.07%       1.312us       1.312us             1  
                                            aten::empty         1.22%      19.632us         1.22%      19.632us       6.544us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.56%      25.120us         1.56%      25.120us       8.373us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.33%       5.360us         0.33%       5.360us       5.360us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.612ms
Self CUDA time total: 3.967us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      63.488us      1288.31%      63.488us      63.488us             1  
                                      hf_kernels_swiglu         6.89%     111.363us        99.67%       1.611ms       1.611ms       0.000us         0.00%       6.592us       6.592us             1  
                      _activation_beeaae6::silu_and_mul         1.36%      22.028us        91.47%       1.479ms     492.912us       4.928us       100.00%       6.592us       2.197us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.928us       100.00%       4.928us       1.643us             3  
                                Activity Buffer Request        88.52%       1.431ms        88.52%       1.431ms       1.431ms       1.664us        33.77%       1.664us       1.664us             1  
                                            aten::empty         1.30%      21.081us         1.30%      21.081us       7.027us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.59%      25.652us         1.59%      25.652us       8.551us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.33%       5.390us         0.33%       5.390us       5.390us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.617ms
Self CUDA time total: 4.928us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      68.000us      1585.82%      68.000us      68.000us             1  
                                      hf_kernels_swiglu         5.97%     106.915us        99.70%       1.784ms       1.784ms       0.000us         0.00%       5.760us       5.760us             1  
                      _activation_beeaae6::silu_and_mul         1.16%      20.770us        92.62%       1.658ms     552.564us       4.288us       100.00%       5.760us       1.920us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.288us       100.00%       4.288us       1.429us             3  
                                Activity Buffer Request        80.58%       1.442ms        80.58%       1.442ms       1.442ms       1.472us        34.33%       1.472us       1.472us             1  
                                            aten::empty         1.10%      19.770us         1.10%      19.770us       6.590us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        10.88%     194.785us        10.88%     194.785us      64.928us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.30%       5.350us         0.30%       5.350us       5.350us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.790ms
Self CUDA time total: 4.288us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      65.599us      1108.28%      65.599us      65.599us             1  
                                      hf_kernels_swiglu        18.75%      89.073us        98.88%     469.813us     469.813us       0.000us         0.00%       7.903us       7.903us             1  
                      _activation_beeaae6::silu_and_mul         4.69%      22.280us        76.20%     362.069us     120.690us       5.919us       100.00%       7.903us       2.634us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       5.919us       100.00%       5.919us       1.973us             3  
                                Activity Buffer Request        38.23%     181.645us        38.23%     181.645us     181.645us       1.984us        33.52%       1.984us       1.984us             1  
                                            aten::empty         3.93%      18.671us         3.93%      18.671us       6.224us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        33.28%     158.144us        33.28%     158.144us      52.715us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.12%       5.330us         1.12%       5.330us       5.330us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 475.143us
Self CUDA time total: 5.919us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      70.207us       906.60%      70.207us      70.207us             1  
                                      hf_kernels_swiglu         6.12%     106.261us        99.74%       1.733ms       1.733ms       0.000us         0.00%      10.336us      10.336us             1  
                      _activation_beeaae6::silu_and_mul         1.25%      21.782us        92.41%       1.606ms     535.254us       7.744us       100.00%      10.336us       3.445us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       7.744us       100.00%       7.744us       2.581us             3  
                                Activity Buffer Request        82.36%       1.431ms        82.36%       1.431ms       1.431ms       2.592us        33.47%       2.592us       2.592us             1  
                                            aten::empty         1.21%      21.081us         1.21%      21.081us       7.027us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.80%     152.893us         8.80%     152.893us      50.964us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.26%       4.511us         0.26%       4.511us       4.511us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.738ms
Self CUDA time total: 7.744us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      69.214us      1045.06%      69.214us      69.214us             1  
                                      hf_kernels_swiglu         7.00%     122.783us        99.73%       1.750ms       1.750ms       0.000us         0.00%       8.830us       8.830us             1  
                      _activation_beeaae6::silu_and_mul         1.22%      21.430us        91.58%       1.607ms     535.694us       6.623us       100.00%       8.830us       2.943us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.623us       100.00%       6.623us       2.208us             3  
                                Activity Buffer Request        81.74%       1.434ms        81.74%       1.434ms       1.434ms       2.207us        33.32%       2.207us       2.207us             1  
                                            aten::empty         1.15%      20.211us         1.15%      20.211us       6.737us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.62%     151.304us         8.62%     151.304us      50.435us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       4.780us         0.27%       4.780us       4.780us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.755ms
Self CUDA time total: 6.623us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      65.152us       692.52%      65.152us      65.152us             1  
                                      hf_kernels_swiglu        21.62%      91.474us        98.93%     418.571us     418.571us       0.000us         0.00%      12.576us      12.576us             1  
                      _activation_beeaae6::silu_and_mul         4.88%      20.631us        69.03%     292.067us      97.356us       9.408us       100.00%      12.576us       4.192us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       9.408us       100.00%       9.408us       3.136us             3  
                                Activity Buffer Request        28.63%     121.143us        28.63%     121.143us     121.143us       3.168us        33.67%       3.168us       3.168us             1  
                                            aten::empty         8.28%      35.030us         8.28%      35.030us      11.677us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.52%     150.293us        35.52%     150.293us      50.098us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.07%       4.530us         1.07%       4.530us       4.530us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 423.101us
Self CUDA time total: 9.408us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      67.197us       514.72%      67.197us      67.197us             1  
                                      hf_kernels_swiglu        22.39%      97.642us        98.93%     431.481us     431.481us       0.000us         0.00%      17.439us      17.439us             1  
                      _activation_beeaae6::silu_and_mul         4.99%      21.781us        71.94%     313.789us     104.596us      13.055us       100.00%      17.439us       5.813us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us      13.055us       100.00%      13.055us       4.352us             3  
                                Activity Buffer Request        32.48%     141.684us        32.48%     141.684us     141.684us       4.384us        33.58%       4.384us       4.384us             1  
                                            aten::empty         4.60%      20.050us         4.60%      20.050us       6.683us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.47%     150.324us        34.47%     150.324us      50.108us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.07%       4.681us         1.07%       4.681us       4.681us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 436.162us
Self CUDA time total: 13.055us


impl                     wl                  p50(ms)  ok
hf_kernels_swiglu        cuda_T128_D1024        0.03  True
hf_kernels_swiglu        cuda_T128_D2048        0.03  True
hf_kernels_swiglu        cuda_T128_D768         0.02  True
hf_kernels_swiglu        cuda_T256_D1024        0.03  True
hf_kernels_swiglu        cuda_T256_D2048        0.03  True
hf_kernels_swiglu        cuda_T256_D768         0.03  True
hf_kernels_swiglu        cuda_T512_D1024        0.03  True
hf_kernels_swiglu        cuda_T512_D2048        0.03  True
hf_kernels_swiglu        cuda_T512_D768         0.03  True
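
For comparison, the operation timed above can be sketched in plain PyTorch. This is a minimal reference, not the kernel's implementation: it assumes `silu_and_mul` follows the vLLM convention suggested by the `vllm::act_and_mul_kernel` entries in the traces, where the first half of the last dimension is passed through SiLU and multiplied by the second half. The name `reference_swiglu` is hypothetical and is not part of `kernels` or `kernels_benchmark_tools`.

```python
import torch
import torch.nn.functional as F


def reference_swiglu(input_tensor: torch.Tensor) -> torch.Tensor:
    # Assumption: the first half of the last dimension is the gate and the
    # second half is the value (the vLLM silu_and_mul convention).
    hidden_dim = input_tensor.shape[-1] // 2
    gate = input_tensor[..., :hidden_dim]
    value = input_tensor[..., hidden_dim:]
    # SiLU(gate) * value, computed with ordinary PyTorch ops.
    return F.silu(gate) * value


# Small CPU example: 4 rows, last dimension 16 -> output width 8.
x = torch.randn(4, 16)
y = reference_swiglu(x)
print(tuple(y.shape))  # (4, 8)
```

A reference like this is mainly useful for checking the fused kernel's output with `torch.allclose` on the same input, using tolerances loose enough for bfloat16.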

Artifacts:

activation.jsonl