HF Kernels - SwiGLU Activation

GPU Info

Cell: nv | 0.21s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Wed Oct 29 15:50:40 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   28C    P0             78W /  350W |       0MiB /  46068MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

SwiGLU Benchmark

Cell: benchmark | 7.78s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the activation kernel
activation = get_kernel("kernels-community/activation")


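def hf_kernels_swiglu(input_tensor):
    # The last dimension packs two halves of width hidden_dim; the output keeps one.
    hidden_dim = input_tensor.shape[-1] // 2
    out_shape = input_tensor.shape[:-1] + (hidden_dim,)
    # Preallocate the output buffer that silu_and_mul receives as its first argument.
    out = torch.empty(out_shape, dtype=input_tensor.dtype, device=input_tensor.device)
    return activation.silu_and_mul(out, input_tensor)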


run_benchmark(
    kernel_type=KernelTypeEnum.ACTIVATION,
    impl_name="hf_kernels_swiglu",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_swiglu,
)
Running activation benchmark on cuda with 9 workloads.

======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      79.968us      1983.33%      79.968us      79.968us             1  
                                      hf_kernels_swiglu        10.58%     184.424us        99.57%       1.736ms       1.736ms       0.000us         0.00%       5.408us       5.408us             1  
                      _activation_beeaae6::silu_and_mul         1.26%      21.900us        86.25%       1.504ms     501.188us       4.032us       100.00%       5.408us       1.803us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.032us       100.00%       4.032us       1.344us             3  
                                Activity Buffer Request        82.49%       1.438ms        82.49%       1.438ms       1.438ms       1.376us        34.13%       1.376us       1.376us             1  
                                            aten::empty         2.74%      47.772us         2.74%      47.772us      15.924us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.50%      43.631us         2.50%      43.631us      14.544us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.43%       7.440us         0.43%       7.440us       7.440us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.743ms
Self CUDA time total: 4.032us
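
The percentages above 100% in the Self CUDA % column can look like an error. A short worked check, using only numbers copied from the cuda_T128_D768 trace, suggests that every percentage is taken relative to the "Self CUDA time total" footer, which here equals the kernel time alone. The variable names are illustrative and not part of any profiler API.

```python
# Values copied from the cuda_T128_D768 trace (all in microseconds).
kernel_self_cuda_us = 4.032   # act_and_mul_kernel, Self CUDA, summed over 3 calls
kernel_calls = 3
wrapper_cuda_us = 79.968      # first hf_kernels_swiglu row, Self CUDA
buffer_request_us = 1.376     # Activity Buffer Request, Self CUDA
self_cuda_total_us = 4.032    # "Self CUDA time total" footer


def pct(part_us: float, total_us: float) -> float:
    """Share of total_us taken by part_us, as a percentage rounded to 2 places."""
    return round(100.0 * part_us / total_us, 2)


# "CUDA time avg" is the summed time divided by the number of calls.
avg_kernel_us = round(kernel_self_cuda_us / kernel_calls, 3)
# Rows that are not kernels can exceed 100% because the footer only sums kernel time.
wrapper_pct = pct(wrapper_cuda_us, self_cuda_total_us)
buffer_pct = pct(buffer_request_us, self_cuda_total_us)

print(f"avg kernel time: {avg_kernel_us} us")
print(f"wrapper row:     {wrapper_pct} %")
print(f"buffer request:  {buffer_pct} %")
```

The three results match the 1.344us, 1983.33% and 34.13% printed in the trace, so the same reading should apply to the other eight traces.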



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      60.192us      1516.94%      60.192us      60.192us             1  
                                      hf_kernels_swiglu         5.66%      89.803us        99.62%       1.581ms       1.581ms       0.000us         0.00%       5.312us       5.312us             1  
                      _activation_beeaae6::silu_and_mul         1.35%      21.470us        92.79%       1.473ms     491.035us       3.968us       100.00%       5.312us       1.771us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.968us       100.00%       3.968us       1.323us             3  
                                Activity Buffer Request        89.86%       1.427ms        89.86%       1.427ms       1.427ms       1.344us        33.87%       1.344us       1.344us             1  
                                            aten::empty         1.17%      18.590us         1.17%      18.590us       6.197us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.58%      25.022us         1.58%      25.022us       8.341us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.38%       6.110us         0.38%       6.110us       6.110us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.588ms
Self CUDA time total: 3.968us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      65.535us      1338.54%      65.535us      65.535us             1  
                                      hf_kernels_swiglu         5.56%      88.483us        99.64%       1.586ms       1.586ms       0.000us         0.00%       6.528us       6.528us             1  
                      _activation_beeaae6::silu_and_mul         1.35%      21.452us        92.87%       1.478ms     492.822us       4.896us       100.00%       6.528us       2.176us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.896us       100.00%       4.896us       1.632us             3  
                                Activity Buffer Request        89.90%       1.431ms        89.90%       1.431ms       1.431ms       1.632us        33.33%       1.632us       1.632us             1  
                                            aten::empty         1.21%      19.310us         1.21%      19.310us       6.437us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.63%      25.910us         1.63%      25.910us       8.637us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.36%       5.661us         0.36%       5.661us       5.661us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.592ms
Self CUDA time total: 4.896us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      67.008us      1562.69%      67.008us      67.008us             1  
                                      hf_kernels_swiglu         4.93%      90.832us        99.72%       1.836ms       1.836ms       0.000us         0.00%       5.728us       5.728us             1  
                      _activation_beeaae6::silu_and_mul         1.23%      22.581us        93.74%       1.726ms     575.177us       4.288us       100.00%       5.728us       1.909us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.288us       100.00%       4.288us       1.429us             3  
                                Activity Buffer Request        81.40%       1.498ms        81.40%       1.498ms       1.498ms       1.440us        33.58%       1.440us       1.440us             1  
                                            aten::empty         1.04%      19.180us         1.04%      19.180us       6.393us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        11.11%     204.595us        11.11%     204.595us      68.198us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       5.180us         0.28%       5.180us       5.180us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.841ms
Self CUDA time total: 4.288us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      64.800us      1106.37%      64.800us      64.800us             1  
                                      hf_kernels_swiglu         5.65%      97.973us        99.69%       1.728ms       1.728ms       0.000us         0.00%       7.810us       7.810us             1  
                      _activation_beeaae6::silu_and_mul         1.27%      22.090us        92.96%       1.611ms     536.996us       5.857us       100.00%       7.810us       2.603us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       5.857us       100.00%       5.857us       1.952us             3  
                                Activity Buffer Request        82.37%       1.427ms        82.37%       1.427ms       1.427ms       1.953us        33.34%       1.953us       1.953us             1  
                                            aten::empty         1.09%      18.810us         1.09%      18.810us       6.270us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.31%     161.434us         9.31%     161.434us      53.811us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.300us         0.31%       5.300us       5.300us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.733ms
Self CUDA time total: 5.857us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      77.311us      1002.48%      77.311us      77.311us             1  
                                      hf_kernels_swiglu        20.04%      98.272us        98.88%     484.972us     484.972us       0.000us         0.00%      10.304us      10.304us             1  
                      _activation_beeaae6::silu_and_mul         4.97%      24.390us        74.66%     366.210us     122.070us       7.712us       100.00%      10.304us       3.435us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       7.712us       100.00%       7.712us       2.571us             3  
                                Activity Buffer Request        34.13%     167.415us        34.13%     167.415us     167.415us       2.592us        33.61%       2.592us       2.592us             1  
                                            aten::empty         4.18%      20.490us         4.18%      20.490us       6.830us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.56%     174.405us        35.56%     174.405us      58.135us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.12%       5.511us         1.12%       5.511us       5.511us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 490.483us
Self CUDA time total: 7.712us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      63.327us       965.35%      63.327us      63.327us             1  
                                      hf_kernels_swiglu        20.14%      83.823us        98.84%     411.400us     411.400us       0.000us         0.00%       8.768us       8.768us             1  
                      _activation_beeaae6::silu_and_mul         5.43%      22.601us        74.29%     309.187us     103.062us       6.560us       100.00%       8.768us       2.923us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.560us       100.00%       6.560us       2.187us             3  
                                Activity Buffer Request        32.27%     134.313us        32.27%     134.313us     134.313us       2.208us        33.66%       2.208us       2.208us             1  
                                            aten::empty         4.42%      18.390us         4.42%      18.390us       6.130us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        36.59%     152.273us        36.59%     152.273us      50.758us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.16%       4.810us         1.16%       4.810us       4.810us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 416.210us
Self CUDA time total: 6.560us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      69.952us       743.54%      69.952us      69.952us             1  
                                      hf_kernels_swiglu         5.37%      93.270us        99.70%       1.733ms       1.733ms       0.000us         0.00%      12.544us      12.544us             1  
                      _activation_beeaae6::silu_and_mul         1.28%      22.251us        93.17%       1.619ms     539.830us       9.408us       100.00%      12.544us       4.181us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       9.408us       100.00%       9.408us       3.136us             3  
                                Activity Buffer Request        83.02%       1.443ms        83.02%       1.443ms       1.443ms       3.136us        33.33%       3.136us       3.136us             1  
                                            aten::empty         1.17%      20.271us         1.17%      20.271us       6.757us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.87%     154.165us         8.87%     154.165us      51.388us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.30%       5.210us         0.30%       5.210us       5.210us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.738ms
Self CUDA time total: 9.408us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      65.278us       502.45%      65.278us      65.278us             1  
                                      hf_kernels_swiglu        20.56%      86.143us        98.78%     413.910us     413.910us       0.000us         0.00%      17.344us      17.344us             1  
                      _activation_beeaae6::silu_and_mul         5.61%      23.493us        73.70%     308.818us     102.939us      12.992us       100.00%      17.344us       5.781us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us      12.992us       100.00%      12.992us       4.331us             3  
                                Activity Buffer Request        31.64%     132.592us        31.64%     132.592us     132.592us       4.352us        33.50%       4.352us       4.352us             1  
                                            aten::empty         4.52%      18.949us         4.52%      18.949us       6.316us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        36.45%     152.733us        36.45%     152.733us      50.911us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.22%       5.130us         1.22%       5.130us       5.130us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 419.040us
Self CUDA time total: 12.992us
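
To see how kernel time grows with workload size, the sketch below tabulates the per-call act_and_mul_kernel averages from the nine traces against the product T x D. It assumes the workload labels encode a token count T and a width D; that reading is a guess from the names, not something the output states. Only ratios are compared, so a constant factor in the true element count would not change the result.

```python
# (T, D) -> average act_and_mul_kernel time per call in microseconds,
# copied from the "CUDA time avg" column of the nine traces.
kernel_avg_us = {
    (128, 768): 1.344, (128, 1024): 1.323, (128, 2048): 1.632,
    (256, 768): 1.429, (256, 1024): 1.952, (256, 2048): 2.571,
    (512, 768): 2.187, (512, 1024): 3.136, (512, 2048): 4.331,
}

BASE = (128, 768)  # smallest workload, used as the reference point


def relative_growth(shape: tuple[int, int]) -> tuple[float, float]:
    """Return (size ratio, time ratio) of a workload relative to BASE."""
    t, d = shape
    size_ratio = (t * d) / (BASE[0] * BASE[1])
    time_ratio = kernel_avg_us[shape] / kernel_avg_us[BASE]
    return size_ratio, time_ratio


# Print workloads from smallest to largest T x D.
for shape in sorted(kernel_avg_us, key=lambda s: s[0] * s[1]):
    size_ratio, time_ratio = relative_growth(shape)
    print(f"T{shape[0]}_D{shape[1]}: size x{size_ratio:.2f}, time x{time_ratio:.2f}")
```

The largest workload is about 10.7 times the size of the smallest but takes only about 3.2 times as long per call, which hints that the smaller shapes are bounded by fixed per-launch cost rather than by the amount of data.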


impl                     wl                  p50(ms)  ok
hf_kernels_swiglu        cuda_T128_D1024        0.03  True
hf_kernels_swiglu        cuda_T128_D2048        0.03  True
hf_kernels_swiglu        cuda_T128_D768         0.02  True
hf_kernels_swiglu        cuda_T256_D1024        0.03  True
hf_kernels_swiglu        cuda_T256_D2048        0.03  True
hf_kernels_swiglu        cuda_T256_D768         0.03  True
hf_kernels_swiglu        cuda_T512_D1024        0.03  True
hf_kernels_swiglu        cuda_T512_D2048        0.03  True
hf_kernels_swiglu        cuda_T512_D768         0.03  True
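
For readers who want to check the operation by hand, here is a dependency-free sketch of SwiGLU on a single row. The wrapper in the benchmark cell halves the last dimension, which matches this layout; which half passes through SiLU is an assumption here (first half gated, following the common `silu(x[..., :d]) * x[..., d:]` convention), and the function names are illustrative.

```python
import math


def silu(v: float) -> float:
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_row(row: list[float]) -> list[float]:
    """Apply SwiGLU to one row whose length is twice the output width.

    Assumption: the first half is the gate (passed through SiLU) and the
    second half is multiplied elementwise with the activated gate.
    """
    if len(row) % 2 != 0:
        raise ValueError("row length must be even")
    d = len(row) // 2
    gate, up = row[:d], row[d:]
    return [silu(g) * u for g, u in zip(gate, up)]


# Input of width 4 produces output of width 2.
example = swiglu_row([0.0, 1.0, 2.0, 3.0])
print(example)
```

A tensor version of the same arithmetic could serve as a baseline when comparing against the kernel's output in bfloat16, with a tolerance suited to that precision.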

Artifacts:

activation.jsonl