HF Kernels - SwiGLU Activation

GPU Info

Cell: nv | 0.26s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Mon Oct 27 14:46:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   32C    P0            153W /  350W |       0MiB /  46068MiB |     75%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

SwiGLU Benchmark

Cell: benchmark | 4.32s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the activation kernel from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")


def hf_kernels_swiglu(input_tensor):
    # The last dimension holds the two halves consumed by silu_and_mul,
    # so the output is half as wide as the input.
    hidden_dim = input_tensor.shape[-1] // 2
    out_shape = input_tensor.shape[:-1] + (hidden_dim,)
    out = torch.empty(out_shape, dtype=input_tensor.dtype, device=input_tensor.device)
    # silu_and_mul writes its result into the preallocated `out` buffer
    return activation.silu_and_mul(out, input_tensor)


run_benchmark(
    kernel_type=KernelTypeEnum.ACTIVATION,
    impl_name="hf_kernels_swiglu",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_swiglu,
)
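
For orientation, the operation wrapped by `hf_kernels_swiglu` can be sketched in eager PyTorch. This is a minimal reference, not the kernel's implementation: `swiglu_reference` is a hypothetical name, and the half ordering (SiLU applied to the first half of the last dimension, multiplied by the second half) is an assumption based on the `vllm::act_and_mul_kernel` name that appears in the traces.

```python
import torch
import torch.nn.functional as F


def swiglu_reference(input_tensor: torch.Tensor) -> torch.Tensor:
    """Eager SwiGLU sketch: silu(first half) * second half of the last dim.

    Assumption: the fused kernel uses this same half ordering.
    """
    hidden_dim = input_tensor.shape[-1] // 2
    gate = input_tensor[..., :hidden_dim]  # half passed through SiLU
    up = input_tensor[..., hidden_dim:]    # half used as the multiplier
    return F.silu(gate) * up


# Small CPU example: the last dimension is halved, as in hf_kernels_swiglu.
x = torch.randn(4, 16)
y = swiglu_reference(x)
print(y.shape)  # torch.Size([4, 8])
```

On a CUDA machine, a sketch like this could serve as a correctness baseline by comparing its output with `hf_kernels_swiglu` on the same input via `torch.testing.assert_close`, with tolerances chosen for the dtype in use.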
Running activation benchmark on cuda with 9 workloads.

======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      80.128us      1940.62%      80.128us      80.128us             1  
                                      hf_kernels_swiglu        11.19%     199.383us        99.56%       1.774ms       1.774ms       0.000us         0.00%       5.634us       5.634us             1  
                      _activation_beeaae6::silu_and_mul         1.10%      19.601us        85.64%       1.526ms     508.618us       4.129us       100.00%       5.634us       1.878us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.129us       100.00%       4.129us       1.376us             3  
                                Activity Buffer Request        82.30%       1.466ms        82.30%       1.466ms       1.466ms       1.505us        36.45%       1.505us       1.505us             1  
                                            aten::empty         2.73%      48.641us         2.73%      48.641us      16.214us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.24%      39.931us         2.24%      39.931us      13.310us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.44%       7.891us         0.44%       7.891us       7.891us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.782ms
Self CUDA time total: 4.129us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      77.823us      1961.76%      77.823us      77.823us             1  
                                      hf_kernels_swiglu         7.28%     119.722us        99.70%       1.640ms       1.640ms       0.000us         0.00%       5.311us       5.311us             1  
                      _activation_beeaae6::silu_and_mul         1.57%      25.841us        91.18%       1.500ms     499.858us       3.967us       100.00%       5.311us       1.770us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.967us       100.00%       3.967us       1.322us             3  
                                Activity Buffer Request        87.74%       1.443ms        87.74%       1.443ms       1.443ms       1.344us        33.88%       1.344us       1.344us             1  
                                            aten::empty         1.24%      20.410us         1.24%      20.410us       6.803us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.86%      30.650us         1.86%      30.650us      10.217us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.30%       4.930us         0.30%       4.930us       4.930us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.645ms
Self CUDA time total: 3.967us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      67.487us      1369.46%      67.487us      67.487us             1  
                                      hf_kernels_swiglu         6.70%     107.400us        99.69%       1.598ms       1.598ms       0.000us         0.00%       6.592us       6.592us             1  
                      _activation_beeaae6::silu_and_mul         1.32%      21.191us        91.79%       1.471ms     490.438us       4.928us       100.00%       6.592us       2.197us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.928us       100.00%       4.928us       1.643us             3  
                                Activity Buffer Request        88.89%       1.425ms        88.89%       1.425ms       1.425ms       1.664us        33.77%       1.664us       1.664us             1  
                                            aten::empty         1.20%      19.281us         1.20%      19.281us       6.427us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.57%      25.210us         1.57%      25.210us       8.403us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       4.970us         0.31%       4.970us       4.970us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.603ms
Self CUDA time total: 4.928us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      75.265us      1768.03%      75.265us      75.265us             1  
                                      hf_kernels_swiglu         6.51%     118.032us        99.70%       1.807ms       1.807ms       0.000us         0.00%       5.697us       5.697us             1  
                      _activation_beeaae6::silu_and_mul         1.22%      22.071us        92.05%       1.668ms     556.119us       4.257us       100.00%       5.697us       1.899us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.257us       100.00%       4.257us       1.419us             3  
                                Activity Buffer Request        79.39%       1.439ms        79.39%       1.439ms       1.439ms       1.440us        33.83%       1.440us       1.440us             1  
                                            aten::empty         1.14%      20.640us         1.14%      20.640us       6.880us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        11.45%     207.513us        11.45%     207.513us      69.171us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.30%       5.350us         0.30%       5.350us       5.350us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.812ms
Self CUDA time total: 4.257us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      65.471us      1111.94%      65.471us      65.471us             1  
                                      hf_kernels_swiglu        19.52%      89.390us        98.84%     452.537us     452.537us       0.000us         0.00%       7.872us       7.872us             1  
                      _activation_beeaae6::silu_and_mul         5.02%      23.003us        75.04%     343.547us     114.516us       5.888us       100.00%       7.872us       2.624us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       5.888us       100.00%       5.888us       1.963us             3  
                                Activity Buffer Request        33.89%     155.152us        33.89%     155.152us     155.152us       1.984us        33.70%       1.984us       1.984us             1  
                                            aten::empty         4.28%      19.600us         4.28%      19.600us       6.533us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        36.13%     165.392us        36.13%     165.392us      55.131us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.16%       5.290us         1.16%       5.290us       5.290us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 457.827us
Self CUDA time total: 5.888us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      68.383us       879.52%      68.383us      68.383us             1  
                                      hf_kernels_swiglu         6.83%     118.711us        99.72%       1.734ms       1.734ms       0.000us         0.00%      10.367us      10.367us             1  
                      _activation_beeaae6::silu_and_mul         1.25%      21.741us        91.78%       1.596ms     531.855us       7.775us       100.00%      10.367us       3.456us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       7.775us       100.00%       7.775us       2.592us             3  
                                Activity Buffer Request        81.74%       1.421ms        81.74%       1.421ms       1.421ms       2.592us        33.34%       2.592us       2.592us             1  
                                            aten::empty         1.11%      19.311us         1.11%      19.311us       6.437us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.79%     152.752us         8.79%     152.752us      50.917us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       4.930us         0.28%       4.930us       4.930us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.739ms
Self CUDA time total: 7.775us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      70.527us      1069.89%      70.527us      70.527us             1  
                                      hf_kernels_swiglu         6.20%     108.691us        99.73%       1.749ms       1.749ms       0.000us         0.00%       8.800us       8.800us             1  
                      _activation_beeaae6::silu_and_mul         1.29%      22.622us        92.35%       1.619ms     539.785us       6.592us       100.00%       8.800us       2.933us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.592us       100.00%       6.592us       2.197us             3  
                                Activity Buffer Request        82.48%       1.446ms        82.48%       1.446ms       1.446ms       2.208us        33.50%       2.208us       2.208us             1  
                                            aten::empty         1.18%      20.650us         1.18%      20.650us       6.883us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.58%     150.492us         8.58%     150.492us      50.164us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       4.780us         0.27%       4.780us       4.780us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.753ms
Self CUDA time total: 6.592us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      66.591us       703.03%      66.591us      66.591us             1  
                                      hf_kernels_swiglu        22.91%      88.512us        98.75%     381.506us     381.506us       0.000us         0.00%      12.640us      12.640us             1  
                      _activation_beeaae6::silu_and_mul         5.22%      20.151us        70.42%     272.064us      90.688us       9.472us       100.00%      12.640us       4.213us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       9.472us       100.00%       9.472us       3.157us             3  
                                Activity Buffer Request        26.21%     101.241us        26.21%     101.241us     101.241us       3.168us        33.45%       3.168us       3.168us             1  
                                            aten::empty         5.42%      20.930us         5.42%      20.930us       6.977us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        39.00%     150.672us        39.00%     150.672us      50.224us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.25%       4.820us         1.25%       4.820us       4.820us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 386.326us
Self CUDA time total: 9.472us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      67.295us       514.21%      67.295us      67.295us             1  
                                      hf_kernels_swiglu        24.05%     101.492us        98.90%     417.266us     417.266us       0.000us         0.00%      17.503us      17.503us             1  
                      _activation_beeaae6::silu_and_mul         5.33%      22.480us        70.08%     295.684us      98.561us      13.087us       100.00%      17.503us       5.834us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us      13.087us       100.00%      13.087us       4.362us             3  
                                Activity Buffer Request        28.92%     122.012us        28.92%     122.012us     122.012us       4.416us        33.74%       4.416us       4.416us             1  
                                            aten::empty         4.76%      20.090us         4.76%      20.090us       6.697us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.83%     151.192us        35.83%     151.192us      50.397us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.10%       4.660us         1.10%       4.660us       4.660us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 421.926us
Self CUDA time total: 13.087us


impl                     wl                  p50(ms)  ok
hf_kernels_swiglu        cuda_T128_D1024        0.03  True
hf_kernels_swiglu        cuda_T128_D2048        0.03  True
hf_kernels_swiglu        cuda_T128_D768         0.03  True
hf_kernels_swiglu        cuda_T256_D1024        0.03  True
hf_kernels_swiglu        cuda_T256_D2048        0.03  True
hf_kernels_swiglu        cuda_T256_D768         0.03  True
hf_kernels_swiglu        cuda_T512_D1024        0.03  True
hf_kernels_swiglu        cuda_T512_D2048        0.03  True
hf_kernels_swiglu        cuda_T512_D768         0.03  True
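
The summary reports a p50 of 0.03 ms for every workload, while the traces show `vllm::act_and_mul_kernel` averaging between roughly 1.3 and 4.4 us per call. The sketch below puts the two side by side. The kernel times are copied from the "CUDA time avg" column of the traces above; `kernel_share` is a hypothetical helper, and treating p50 as end-to-end time per call is an assumption, since the harness internals are not shown here. Because p50 is rounded to two decimals, the percentages are approximate.

```python
# Average device time per call of vllm::act_and_mul_kernel, in microseconds,
# copied from the "CUDA time avg" column of the profile traces.
kernel_avg_us = {
    "cuda_T128_D768": 1.376,
    "cuda_T128_D1024": 1.322,
    "cuda_T128_D2048": 1.643,
    "cuda_T256_D768": 1.419,
    "cuda_T256_D1024": 1.963,
    "cuda_T256_D2048": 2.592,
    "cuda_T512_D768": 2.197,
    "cuda_T512_D1024": 3.157,
    "cuda_T512_D2048": 4.362,
}

# p50 from the summary table; identical for all nine workloads after rounding.
P50_MS = 0.03


def kernel_share(kernel_us: float, p50_ms: float) -> float:
    """Fraction of the reported p50 spent in the device kernel itself."""
    return kernel_us / (p50_ms * 1000.0)


for workload, us in kernel_avg_us.items():
    share = kernel_share(us, P50_MS)
    print(f"{workload:<18} {us:6.3f} us  {share:6.1%} of p50")
```

Under those assumptions the kernel accounts for roughly 4% to 15% of the reported p50, which would be consistent with the flat 0.03 ms being dominated by per-call overhead rather than device time.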
UV Install Logs

Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 19.14it/s]

Artifacts:

activation.jsonl