Compiled SwiGLU Activation

GPU Info

▼ code ▼ output ▶ uv-logs | Cell: nv | 4.01s | Raw GitHub
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Wed Oct 22 18:32:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:38:00.0 Off |                    0 |
| N/A   38C    P0             27W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      Off |   00000000:3A:00.0 Off |                    0 |
| N/A   36C    P0             28W /   72W |       1MiB /  23034MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      Off |   00000000:3C:00.0 Off |                    0 |
| N/A   37C    P0             27W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      Off |   00000000:3E:00.0 Off |                    0 |
| N/A   35C    P0             27W /   72W |       1MiB /  23034MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

SwiGLU Benchmark (torch.compile)

▼ code ▼ output ▶ uv-logs | Cell: benchmark | 48.06s | Raw GitHub
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
import sys
import kernels_benchmark_tools as kbt


def torch_swiglu_base(input_tensor):
    """Base PyTorch SwiGLU implementation"""
    d = input_tensor.shape[-1] // 2
    x1 = input_tensor[..., :d]
    x2 = input_tensor[..., d:]
    return torch.nn.functional.silu(x1) * x2
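# The input's last dim must be even: shape (B, T, 2D) in, (B, T, D) out.
# SiLU(z) = z * sigmoid(z), so this is the activation half of a LLaMA-style
# SwiGLU MLP (the up/gate projections are outside the benchmarked function).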


# Compile the function
compiled_swiglu = torch.compile(torch_swiglu_base, mode="max-autotune", fullgraph=True, dynamic=False)
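# mode="max-autotune" makes Inductor autotune Triton kernel configs and enables
# CUDA graphs where possible (hence the CUDAGraphNode.record entries in the
# traces below); fullgraph=True errors on any graph break instead of silently
# falling back to eager; dynamic=False specializes on input shapes, so each
# new shape triggers a recompile.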


# Register the implementation
kbt.add(
    "compiled_swiglu_max_autotune",
    compiled_swiglu,
    tags={"family": "torch", "backend": "compiled", "compile": "max-autotune"},
)
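# The name passed to kbt.add is what shows up in the "impl" column of the
# summary table; the tags appear to be free-form metadata attached to each
# result record (inferred from usage here, not from a documented API).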

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = "float32" if device == "cpu" else "bfloat16"

    # Generate workloads - using a subset for faster testing
    if device == "cuda":
        wl = list(kbt.activation.llama_workloads(dtype=dtype))[:3]
    else:
        wl = list(kbt.activation.cpu_workloads(dtype=dtype))[:3]
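    # Only the first three workloads are kept; per the traces below these are
    # llama_T512_D4096, llama_T512_D8192, and llama_T512_D11008.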

    print(f"Running SwiGLU benchmarks on {device} with {dtype}")
    print(f"Testing {len(wl)} workloads")

    # Run benchmark
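    # kbt.run appears to time each registered impl on every workload: gen
    # builds the inputs, ref supplies the eager reference output, cmp performs
    # the allclose check behind the "ok" column, and profile_trace=True emits
    # the profiler tables below (roles inferred from names and the output).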
    kbt.run(
        wl,
        jsonl="activation.jsonl",
        reps=5,
        warmup=2,
        gen=kbt.activation.gen_inputs,
        ref=kbt.activation.ref_swiglu,
        cmp=kbt.activation.cmp_allclose,
        profile_trace=True
    )

    kbt.summarize(["activation.jsonl"])
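    # summarize reads the JSONL back and prints the p50(ms)/ok table below.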
Running SwiGLU benchmarks on cuda with bfloat16
Testing 3 workloads

======================================================================
PROFILE TRACE: compiled_swiglu_max_autotune | llama_T512_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                           compiled_swiglu_max_autotune         0.00%       0.000us         0.00%       0.000us       0.000us       2.296ms      1293.96%       2.296ms       1.148ms             2  
                           compiled_swiglu_max_autotune         0.11%     188.163us        99.97%     170.504ms     170.504ms       0.000us         0.00%     223.743us     223.743us             1  
                             Torch-Compiled Region: 0/1         1.59%       2.709ms        99.83%     170.252ms      56.751ms      63.073us        35.54%     223.743us      74.581us             3  
                                   aten::_foreach_copy_         0.02%      40.801us         0.06%      94.471us      31.490us     111.167us        62.64%     111.167us      37.056us             3  
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     111.167us        62.64%     111.167us      37.056us             3  
                            triton_poi_fused_mul_silu_0         0.00%       0.000us         0.00%       0.000us       0.000us      63.073us        35.54%      63.073us      21.024us             3  
                                Activity Buffer Request         0.88%       1.493ms         0.88%       1.493ms       1.493ms      46.272us        26.07%      46.272us      46.272us             1  
                    CUDAGraphNode.record (dynamo_timed)         0.00%       0.000us         0.00%       0.000us       0.000us       3.904us         2.20%       3.904us       3.904us             1  
                    CUDAGraphNode.record (dynamo_timed)        96.58%     164.707ms        97.15%     165.693ms     165.693ms       0.000us         0.00%       3.231us       3.231us             1  
                                            aten::fill_         0.02%      34.792us         0.05%      79.243us      39.622us       3.231us         1.82%       3.231us       1.616us             2  
void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.231us         1.82%       3.231us       1.616us             2  
                               TorchDynamo Cache Lookup         0.04%      63.742us         0.04%      63.742us      21.247us       0.000us         0.00%       0.000us       0.000us             3  
                                      Pregraph bytecode         0.01%      10.491us         0.01%      10.491us       3.497us       0.000us         0.00%       0.000us       0.000us             3  
                 AOTDispatcher Runtime Wrapper Prologue         0.01%      25.570us         0.01%      25.570us       8.523us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.11%     186.823us         0.11%     186.823us      31.137us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaStreamIsCapturing         0.01%       9.469us         0.01%       9.469us       0.728us       0.000us         0.00%       0.000us       0.000us            13  
                               cudaEventRecordWithFlags         0.00%       5.160us         0.00%       5.160us       1.720us       0.000us         0.00%       0.000us       0.000us             3  
                                    cudaStreamWaitEvent         0.00%       4.431us         0.00%       4.431us       1.477us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.02%      38.771us         0.02%      38.771us      12.924us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.06%      98.121us         0.06%      98.121us      19.624us       0.000us         0.00%       0.000us       0.000us             5  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 170.547ms
Self CUDA time total: 177.471us



======================================================================
PROFILE TRACE: compiled_swiglu_max_autotune | llama_T512_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                           compiled_swiglu_max_autotune         0.00%       0.000us         0.00%       0.000us       0.000us       2.405ms       619.69%       2.405ms       1.203ms             2  
                           compiled_swiglu_max_autotune         0.08%     131.853us        99.91%     172.119ms     172.119ms       0.000us         0.00%     479.775us     479.775us             1  
                             Torch-Compiled Region: 0/3         1.36%       2.338ms        99.81%     171.946ms      57.315ms     128.384us        33.08%     479.775us     159.925us             3  
                                   aten::_foreach_copy_         0.02%      37.442us         0.05%      84.203us      28.068us     251.231us        64.72%     251.231us      83.744us             3  
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     251.231us        64.72%     251.231us      83.744us             3  
                            triton_poi_fused_mul_silu_0         0.00%       0.000us         0.00%       0.000us       0.000us     128.384us        33.08%     128.384us      42.795us             3  
                    CUDAGraphNode.record (dynamo_timed)         0.00%       0.000us         0.00%       0.000us       0.000us      92.256us        23.77%      92.256us      92.256us             1  
                                Activity Buffer Request         0.92%       1.587ms         0.92%       1.587ms       1.587ms      91.616us        23.60%      91.616us      91.616us             1  
                    CUDAGraphNode.record (dynamo_timed)        95.79%     165.024ms        96.96%     167.028ms     167.028ms       0.000us         0.00%       8.544us       8.544us             1  
                                            aten::fill_         0.02%      32.491us         0.04%      75.082us      37.541us       8.544us         2.20%       8.544us       4.272us             2  
void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us       8.544us         2.20%       8.544us       4.272us             2  
                               TorchDynamo Cache Lookup         0.02%      40.732us         0.02%      40.732us      13.577us       0.000us         0.00%       0.000us       0.000us             3  
                                      Pregraph bytecode         0.00%       8.310us         0.00%       8.310us       2.770us       0.000us         0.00%       0.000us       0.000us             3  
                 AOTDispatcher Runtime Wrapper Prologue         0.01%      16.041us         0.01%      16.041us       5.347us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.57%     983.751us         0.57%     983.751us     163.958us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaStreamIsCapturing         0.01%      11.222us         0.01%      11.222us       0.863us       0.000us         0.00%       0.000us       0.000us            13  
                               cudaEventRecordWithFlags         0.00%       3.600us         0.00%       3.600us       1.200us       0.000us         0.00%       0.000us       0.000us             3  
                                    cudaStreamWaitEvent         0.00%       2.570us         0.00%       2.570us       0.857us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.01%      12.031us         0.01%      12.031us       4.010us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.05%      89.352us         0.05%      89.352us      17.870us       0.000us         0.00%       0.000us       0.000us             5  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 172.272ms
Self CUDA time total: 388.159us



======================================================================
PROFILE TRACE: compiled_swiglu_max_autotune | llama_T512_D11008
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                           compiled_swiglu_max_autotune         0.00%       0.000us         0.00%       0.000us       0.000us       2.450ms       377.03%       2.450ms       1.225ms             2  
                           compiled_swiglu_max_autotune         0.08%     134.723us        99.92%     178.441ms     178.441ms       0.000us         0.00%     781.947us     781.947us             1  
                             Torch-Compiled Region: 0/5         1.30%       2.325ms        99.82%     178.262ms      59.421ms     197.791us        30.43%     781.947us     260.649us             3  
                                   aten::_foreach_copy_         0.02%      39.550us         0.05%      84.550us      28.183us     448.893us        69.07%     448.893us     149.631us             3  
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     448.893us        69.07%     448.893us     149.631us             3  
                            triton_poi_fused_mul_silu_0         0.00%       0.000us         0.00%       0.000us       0.000us     197.791us        30.43%     197.791us      65.930us             3  
                                Activity Buffer Request         0.76%       1.355ms         0.76%       1.355ms       1.355ms     132.031us        20.32%     132.031us     132.031us             1  
                    CUDAGraphNode.record (dynamo_timed)         0.00%       0.000us         0.00%       0.000us       0.000us       3.904us         0.60%       3.904us       3.904us             1  
                    CUDAGraphNode.record (dynamo_timed)        95.83%     171.134ms        97.55%     174.211ms     174.211ms       0.000us         0.00%       3.232us       3.232us             1  
                                            aten::fill_         0.02%      34.980us         0.04%      68.671us      34.336us       3.232us         0.50%       3.232us       1.616us             2  
void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.232us         0.50%       3.232us       1.616us             2  
                               TorchDynamo Cache Lookup         0.02%      44.150us         0.02%      44.150us      14.717us       0.000us         0.00%       0.000us       0.000us             3  
                                      Pregraph bytecode         0.01%       8.940us         0.01%       8.940us       2.980us       0.000us         0.00%       0.000us       0.000us             3  
                 AOTDispatcher Runtime Wrapper Prologue         0.01%      15.980us         0.01%      15.980us       5.327us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.20%     354.147us         0.20%     354.147us      59.024us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaStreamIsCapturing         0.01%      10.008us         0.01%      10.008us       0.770us       0.000us         0.00%       0.000us       0.000us            13  
                               cudaEventRecordWithFlags         0.00%       3.610us         0.00%       3.610us       1.203us       0.000us         0.00%       0.000us       0.000us             3  
                                    cudaStreamWaitEvent         0.00%       2.580us         0.00%       2.580us       0.860us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.01%      12.510us         0.01%      12.510us       4.170us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.04%      78.691us         0.04%      78.691us      15.738us       0.000us         0.00%       0.000us       0.000us             5  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 178.587ms
Self CUDA time total: 649.916us


impl                          wl                 p50(ms)  ok
compiled_swiglu_max_autotune  llama_T512_D11008     0.38  True
compiled_swiglu_max_autotune  llama_T512_D4096      2.38  True
compiled_swiglu_max_autotune  llama_T512_D8192      0.20  True

Artifacts:

activation.jsonl
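
To inspect the artifact outside the harness, the records can be read back with
the standard library. A minimal sketch, assuming each line of activation.jsonl
is a flat JSON object with impl, wl, and a p50-style latency key (the field
names are assumptions, not kernels-benchmark-tools' documented schema):

import json

# Print one line per benchmark record in activation.jsonl.
# NOTE: the keys "impl", "wl", and "p50_ms" are assumptions about the record
# layout, not a documented schema -- adjust them to the actual keys.
with open("activation.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        print(rec.get("impl"), rec.get("wl"), rec.get("p50_ms"))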