HF Kernels - SwiGLU Activation

GPU Info

Cell: nv | 0.22s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Mon Nov 10 21:58:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   28C    P0             78W /  350W |       0MiB /  46068MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

SwiGLU Benchmark

Cell: benchmark | 8.29s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels import get_kernel
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark

# Load the activation kernel
activation = get_kernel("kernels-community/activation")


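def hf_kernels_swiglu(input_tensor):
    # The last dimension holds two halves, so the output is half as wide.
    hidden_dim = input_tensor.shape[-1] // 2
    out_shape = input_tensor.shape[:-1] + (hidden_dim,)
    # silu_and_mul writes its result into a preallocated output tensor.
    out = torch.empty(out_shape, dtype=input_tensor.dtype, device=input_tensor.device)
    return activation.silu_and_mul(out, input_tensor)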


run_benchmark(
    kernel_type=KernelTypeEnum.ACTIVATION,
    impl_name="hf_kernels_swiglu",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_swiglu,
)
Running activation benchmark on cuda with 9 workloads.

======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      81.151us      1892.96%      81.151us      81.151us             1  
                                      hf_kernels_swiglu         8.90%     185.545us        99.31%       2.071ms       2.071ms       0.000us         0.00%       5.727us       5.727us             1  
                      _activation_beeaae6::silu_and_mul         0.90%      18.858us        88.30%       1.842ms     613.846us       4.287us       100.00%       5.727us       1.909us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.287us       100.00%       4.287us       1.429us             3  
                                Activity Buffer Request        85.28%       1.779ms        85.28%       1.779ms       1.779ms       1.440us        33.59%       1.440us       1.440us             1  
                                            aten::empty         2.11%      44.080us         2.11%      44.080us      14.693us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.11%      44.091us         2.11%      44.091us      14.697us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.69%      14.370us         0.69%      14.370us      14.370us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.086ms
Self CUDA time total: 4.287us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      65.344us      1660.16%      65.344us      65.344us             1  
                                      hf_kernels_swiglu         4.80%      90.161us        99.69%       1.871ms       1.871ms       0.000us         0.00%       5.280us       5.280us             1  
                      _activation_beeaae6::silu_and_mul         1.05%      19.620us        93.88%       1.762ms     587.343us       3.936us       100.00%       5.280us       1.760us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.936us       100.00%       3.936us       1.312us             3  
                                Activity Buffer Request        91.30%       1.714ms        91.30%       1.714ms       1.714ms       1.344us        34.15%       1.344us       1.344us             1  
                                            aten::empty         1.01%      18.871us         1.01%      18.871us       6.290us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.53%      28.801us         1.53%      28.801us       9.600us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.880us         0.31%       5.880us       5.880us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.877ms
Self CUDA time total: 3.936us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      67.967us      1388.21%      67.967us      67.967us             1  
                                      hf_kernels_swiglu         4.59%      88.711us        99.72%       1.927ms       1.927ms       0.000us         0.00%       6.560us       6.560us             1  
                      _activation_beeaae6::silu_and_mul         0.94%      18.080us        94.11%       1.819ms     606.193us       4.896us       100.00%       6.560us       2.187us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.896us       100.00%       4.896us       1.632us             3  
                                Activity Buffer Request        91.80%       1.774ms        91.80%       1.774ms       1.774ms       1.664us        33.99%       1.664us       1.664us             1  
                                            aten::empty         1.02%      19.730us         1.02%      19.730us       6.577us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.37%      26.441us         1.37%      26.441us       8.814us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       5.470us         0.28%       5.470us       5.470us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.932ms
Self CUDA time total: 4.896us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      68.448us      1584.08%      68.448us      68.448us             1  
                                      hf_kernels_swiglu         4.10%      87.981us        99.77%       2.141ms       2.141ms       0.000us         0.00%       5.794us       5.794us             1  
                      _activation_beeaae6::silu_and_mul         0.89%      19.190us        94.80%       2.034ms     678.097us       4.321us       100.00%       5.794us       1.931us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.321us       100.00%       4.321us       1.440us             3  
                                Activity Buffer Request        83.35%       1.789ms        83.35%       1.789ms       1.789ms       1.473us        34.09%       1.473us       1.473us             1  
                                            aten::empty         0.87%      18.670us         0.87%      18.670us       6.223us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        10.55%     226.443us        10.55%     226.443us      75.481us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.23%       4.930us         0.23%       4.930us       4.930us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.146ms
Self CUDA time total: 4.321us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      66.816us      1128.46%      66.816us      66.816us             1  
                                      hf_kernels_swiglu         4.29%      87.791us        99.73%       2.043ms       2.043ms       0.000us         0.00%       7.906us       7.906us             1  
                      _activation_beeaae6::silu_and_mul         1.03%      21.101us        94.53%       1.936ms     645.491us       5.921us       100.00%       7.906us       2.635us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       5.921us       100.00%       5.921us       1.974us             3  
                                Activity Buffer Request        84.88%       1.739ms        84.88%       1.739ms       1.739ms       1.985us        33.52%       1.985us       1.985us             1  
                                            aten::empty         0.92%      18.779us         0.92%      18.779us       6.260us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.62%     176.604us         8.62%     176.604us      58.868us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       5.500us         0.27%       5.500us       5.500us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.049ms
Self CUDA time total: 5.921us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      63.807us       824.06%      63.807us      63.807us             1  
                                      hf_kernels_swiglu        17.99%      83.441us        98.85%     458.487us     458.487us       0.000us         0.00%      10.335us      10.335us             1  
                      _activation_beeaae6::silu_and_mul         4.27%      19.820us        76.93%     356.816us     118.939us       7.743us       100.00%      10.335us       3.445us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       7.743us       100.00%       7.743us       2.581us             3  
                                Activity Buffer Request        37.06%     171.903us        37.06%     171.903us     171.903us       2.592us        33.48%       2.592us       2.592us             1  
                                            aten::empty         3.93%      18.230us         3.93%      18.230us       6.077us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.60%     165.093us        35.60%     165.093us      55.031us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.15%       5.320us         1.15%       5.320us       5.320us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 463.807us
Self CUDA time total: 7.743us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      63.231us       959.06%      63.231us      63.231us             1  
                                      hf_kernels_swiglu        19.32%      83.900us        98.89%     429.436us     429.436us       0.000us         0.00%       8.802us       8.802us             1  
                      _activation_beeaae6::silu_and_mul         4.57%      19.830us        75.32%     327.085us     109.028us       6.593us       100.00%       8.802us       2.934us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.593us       100.00%       6.593us       2.198us             3  
                                Activity Buffer Request        34.73%     150.793us        34.73%     150.793us     150.793us       2.209us        33.51%       2.209us       2.209us             1  
                                            aten::empty         4.25%      18.451us         4.25%      18.451us       6.150us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        36.03%     156.462us        36.03%     156.462us      52.154us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.11%       4.800us         1.11%       4.800us       4.800us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 434.236us
Self CUDA time total: 6.593us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      68.544us       726.10%      68.544us      68.544us             1  
                                      hf_kernels_swiglu         4.25%      86.402us        99.73%       2.027ms       2.027ms       0.000us         0.00%      12.608us      12.608us             1  
                      _activation_beeaae6::silu_and_mul         1.00%      20.252us        94.52%       1.921ms     640.494us       9.440us       100.00%      12.608us       4.203us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       9.440us       100.00%       9.440us       3.147us             3  
                                Activity Buffer Request        85.77%       1.743ms        85.77%       1.743ms       1.743ms       3.168us        33.56%       3.168us       3.168us             1  
                                            aten::empty         0.96%      19.489us         0.96%      19.489us       6.496us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.76%     157.752us         7.76%     157.752us      52.584us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       5.440us         0.27%       5.440us       5.440us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.033ms
Self CUDA time total: 9.440us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      61.247us       467.96%      61.247us      61.247us             1  
                                      hf_kernels_swiglu        19.95%      80.811us        98.74%     399.916us     399.916us       0.000us         0.00%      17.504us      17.504us             1  
                      _activation_beeaae6::silu_and_mul         4.55%      18.440us        74.43%     301.465us     100.488us      13.088us       100.00%      17.504us       5.835us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us      13.088us       100.00%      13.088us       4.363us             3  
                                Activity Buffer Request        32.08%     129.932us        32.08%     129.932us     129.932us       4.416us        33.74%       4.416us       4.416us             1  
                                            aten::empty         4.36%      17.640us         4.36%      17.640us       5.880us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        37.80%     153.093us        37.80%     153.093us      51.031us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.26%       5.090us         1.26%       5.090us       5.090us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 405.006us
Self CUDA time total: 13.088us
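
Two things in the traces above can look odd at first: the top `hf_kernels_swiglu` row reports a Self CUDA % far above 100, and the `CUDA total` of `silu_and_mul` is larger than the kernel's own time. The arithmetic below uses only the printed cuda_T512_D2048 values. It suggests that percentages are taken relative to `Self CUDA time total` (which here equals the kernel time alone) and that the operator's `CUDA total` also includes the `Activity Buffer Request` entry. This is a reading of the numbers, not a statement about the profiler's internals.

```python
# Values copied from the cuda_T512_D2048 trace, in microseconds.
self_cuda_total = 13.088  # "Self CUDA time total", same as the kernel's Self CUDA
range_cuda = 61.247       # Self CUDA of the top-level hf_kernels_swiglu row
buffer_request = 4.416    # Self CUDA of "Activity Buffer Request"
num_calls = 3             # "# of Calls" for the kernel

# Percentages are relative to the Self CUDA total, so they can exceed 100.
range_pct = round(range_cuda / self_cuda_total * 100, 2)
buffer_pct = round(buffer_request / self_cuda_total * 100, 2)

# Per-call kernel time and the operator-level CUDA total.
kernel_avg = round(self_cuda_total / num_calls, 3)
op_cuda_total = round(self_cuda_total + buffer_request, 3)

print(range_pct, buffer_pct, kernel_avg, op_cuda_total)
```

The four results match the 467.96%, 33.74%, 4.363us, and 17.504us printed in that trace.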


impl                     wl                  p50(ms)  ok
hf_kernels_swiglu        cuda_T128_D1024        0.03  True
hf_kernels_swiglu        cuda_T128_D2048        0.03  True
hf_kernels_swiglu        cuda_T128_D768         0.02  True
hf_kernels_swiglu        cuda_T256_D1024        0.03  True
hf_kernels_swiglu        cuda_T256_D2048        0.03  True
hf_kernels_swiglu        cuda_T256_D768         0.03  True
hf_kernels_swiglu        cuda_T512_D1024        0.03  True
hf_kernels_swiglu        cuda_T512_D2048        0.03  True
hf_kernels_swiglu        cuda_T512_D768         0.03  True
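
The `ok` column presumably reflects a correctness check made by the benchmark harness, which is not shown here. As a minimal sketch of what such a check could compare against, the eager PyTorch function below assumes the vLLM convention suggested by the kernel name in the traces: SiLU is applied to the first half of the last dimension and multiplied by the second half. The name `reference_swiglu` and the example shape are hypothetical.

```python
import torch
import torch.nn.functional as F


def reference_swiglu(x: torch.Tensor) -> torch.Tensor:
    """Eager SwiGLU: silu(first half) * second half along the last dim.

    Assumption: the first half is the gated part, as in vLLM's silu_and_mul.
    """
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]


# Small CPU example; the shape is illustrative, not a benchmark workload.
x = torch.randn(4, 16)
y = reference_swiglu(x)
print(y.shape)
```

On a CUDA device, the output of `hf_kernels_swiglu` could be compared against this function with `torch.testing.assert_close`, using tolerances suited to the input dtype.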
UV Install Logs
Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 18.06it/s]

Artifacts:

activation.jsonl
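
The schema of `activation.jsonl` is not shown on this page. Assuming only that it follows the usual JSON Lines layout (one JSON object per line), a small reader like the following could be used to inspect it. The helper name `read_jsonl` is hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed JSON value per non-empty line of the file."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Inspect the artifact only if it is present in the working directory.
if Path("activation.jsonl").exists():
    for record in read_jsonl("activation.jsonl"):
        # Assumes each line is a JSON object; prints its field names only.
        print(sorted(record))
```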