HuggingFace Kernels Flash Attention Benchmark

Cell: benchmark | 5.83s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the flash attention kernel from the Hugging Face Hub (downloaded on first use)
hf_kernels_flash_attn = get_kernel("kernels-community/flash-attn")


def hf_flash_attention(query, key, value):
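    """Non-causal flash attention forward pass via the HF Kernels flash-attn kernel.

    `fwd` returns several values; element 0 is used here as the attention output.
    """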
    return hf_kernels_flash_attn.fwd(query, key, value, is_causal=False)[0]


run_benchmark(
    kernel_type=KernelTypeEnum.ATTENTION,
    impl_name="hf_kernels_flash_attn",
    impl_tags={"family": "hf-kernels", "backend": "flash-attn", "compile": "none"},
    impl_func=hf_flash_attention,
)
Running attention benchmark on cuda with 6 workloads.

======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L128_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         3.51%     153.413us        41.11%       1.797ms       1.797ms       0.000us         0.00%       3.733ms       3.733ms             1  
                               _flash_attn_9e27194::fwd         1.62%      70.702us        37.60%       1.644ms     547.894us       2.785ms       100.00%       3.733ms       1.244ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       2.786ms       100.05%       2.786ms       2.786ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       2.785ms       100.00%       2.785ms     928.303us             3  
                                Activity Buffer Request        32.92%       1.439ms        32.92%       1.439ms       1.439ms     947.706us        34.03%     947.706us     947.706us             1  
                                 cudaDeviceGetAttribute         0.11%       4.891us         0.11%       4.891us       0.326us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.37%      16.181us         1.17%      51.061us      17.020us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.80%      34.880us         0.80%      34.880us      11.627us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.59%      25.681us         0.59%      25.681us       2.853us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.26%      11.340us         0.26%      11.340us       3.780us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.93%      40.731us         0.93%      40.731us      13.577us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        58.89%       2.575ms        58.89%       2.575ms       2.575ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.372ms
Self CUDA time total: 2.785ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L256_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         1.94%      86.682us        37.50%       1.676ms       1.676ms       0.000us         0.00%       3.929ms       3.929ms             1  
                               _flash_attn_9e27194::fwd         1.06%      47.570us        35.56%       1.589ms     529.734us       2.938ms       100.00%       3.929ms       1.310ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       2.939ms       100.05%       2.939ms       2.939ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       2.938ms       100.00%       2.938ms     979.209us             3  
                                Activity Buffer Request        32.66%       1.460ms        32.66%       1.460ms       1.460ms     991.166us        33.74%     991.166us     991.166us             1  
                                 cudaDeviceGetAttribute         0.10%       4.450us         0.10%       4.450us       0.297us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.19%       8.440us         0.55%      24.690us       8.230us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.36%      16.250us         0.36%      16.250us       5.417us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.51%      22.872us         0.51%      22.872us       2.541us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.07%       3.350us         0.07%       3.350us       1.117us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.60%      26.611us         0.60%      26.611us       8.870us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        62.50%       2.794ms        62.50%       2.794ms       2.794ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.469ms
Self CUDA time total: 2.938ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L320_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         2.38%     109.313us        36.70%       1.683ms       1.683ms       0.000us         0.00%       4.081ms       4.081ms             1  
                               _flash_attn_9e27194::fwd         1.05%      48.167us        34.31%       1.574ms     524.567us       3.048ms       100.00%       4.081ms       1.360ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       3.049ms       100.05%       3.049ms       3.049ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       3.048ms       100.00%       3.048ms       1.016ms             3  
                                Activity Buffer Request        31.46%       1.443ms        31.46%       1.443ms       1.443ms       1.033ms        33.90%       1.033ms       1.033ms             1  
                                 cudaDeviceGetAttribute         0.09%       4.231us         0.09%       4.231us       0.282us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.16%       7.250us         0.52%      23.960us       7.987us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.36%      16.710us         0.36%      16.710us       5.570us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.46%      21.300us         0.46%      21.300us       2.367us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.08%       3.561us         0.08%       3.561us       1.187us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.64%      29.473us         0.64%      29.473us       9.824us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        63.30%       2.903ms        63.30%       2.903ms       2.903ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.586ms
Self CUDA time total: 3.048ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L384_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         2.13%     103.094us        38.83%       1.884ms       1.884ms       0.000us         0.00%       4.165ms       4.165ms             1  
                               _flash_attn_9e27194::fwd         0.99%      47.838us        36.71%       1.781ms     593.521us       3.114ms       100.00%       4.165ms       1.388ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       3.116ms       100.05%       3.116ms       3.116ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       3.114ms       100.00%       3.114ms       1.038ms             3  
                                Activity Buffer Request        29.59%       1.435ms        29.59%       1.435ms       1.435ms       1.051ms        33.75%       1.051ms       1.051ms             1  
                                 cudaDeviceGetAttribute         0.08%       3.800us         0.08%       3.800us       0.253us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.16%       7.891us         0.53%      25.811us       8.604us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.37%      17.920us         0.37%      17.920us       5.973us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.45%      21.731us         0.45%      21.731us       2.415us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.08%       3.740us         0.08%       3.740us       1.247us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         4.99%     242.187us         4.99%     242.187us      80.729us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        61.17%       2.967ms        61.17%       2.967ms       2.967ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.851ms
Self CUDA time total: 3.114ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L448_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         2.00%     105.522us        34.61%       1.828ms       1.828ms       0.000us         0.00%       4.806ms       4.806ms             1  
                               _flash_attn_9e27194::fwd         0.94%      49.622us        32.62%       1.723ms     574.192us       3.597ms       100.00%       4.806ms       1.602ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       3.599ms       100.05%       3.599ms       3.599ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       3.597ms       100.00%       3.597ms       1.199ms             3  
                                Activity Buffer Request        27.37%       1.446ms        27.37%       1.446ms       1.446ms       1.209ms        33.59%       1.209ms       1.209ms             1  
                                 cudaDeviceGetAttribute         0.08%       3.991us         0.08%       3.991us       0.266us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.14%       7.250us         0.47%      24.620us       8.207us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.33%      17.370us         0.33%      17.370us       5.790us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.41%      21.681us         0.41%      21.681us       2.409us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.07%       3.770us         0.07%       3.770us       1.257us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         3.28%     173.384us         3.28%     173.384us      57.795us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        65.39%       3.453ms        65.39%       3.453ms       3.453ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.281ms
Self CUDA time total: 3.597ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L512_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         2.02%     107.892us        33.82%       1.810ms       1.810ms       0.000us         0.00%       4.930ms       4.930ms             1  
                               _flash_attn_9e27194::fwd         0.91%      48.918us        31.80%       1.702ms     567.268us       3.687ms       100.00%       4.930ms       1.643ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       3.689ms       100.04%       3.689ms       3.689ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       3.687ms       100.00%       3.687ms       1.229ms             3  
                                Activity Buffer Request        26.86%       1.437ms        26.86%       1.437ms       1.437ms       1.242ms        33.69%       1.242ms       1.242ms             1  
                                 cudaDeviceGetAttribute         0.07%       3.881us         0.07%       3.881us       0.259us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.14%       7.591us         0.49%      26.111us       8.704us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.35%      18.520us         0.35%      18.520us       6.173us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.39%      20.640us         0.39%      20.640us       2.293us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.07%       3.561us         0.07%       3.561us       1.187us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         3.01%     161.306us         3.01%     161.306us      53.769us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        66.18%       3.542ms        66.18%       3.542ms       3.542ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.351ms
Self CUDA time total: 3.687ms
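
As a cross-check on the six traces above, the per-call kernel time can be recovered by dividing each trace's "Self CUDA time total" by the 3 recorded calls to flash_fwd_kernel. The sketch below is illustrative only: the dictionary and helper names are made up for this example and are not part of kernels_benchmark_tools, and the values are copied by hand from the traces.

```python
# Self CUDA time totals (ms), copied by hand from the six profile traces above.
# Keys are the L values from the workload names. All names here are
# illustrative and not part of kernels_benchmark_tools.
SELF_CUDA_TOTAL_MS = {
    128: 2.785,
    256: 2.938,
    320: 3.048,
    384: 3.114,
    448: 3.597,
    512: 3.687,
}
KERNEL_CALLS = 3  # "# of Calls" reported for flash_fwd_kernel in every trace


def per_call_ms(total_ms: float, calls: int = KERNEL_CALLS) -> float:
    """Average GPU time of a single kernel launch, in milliseconds."""
    return total_ms / calls


per_call = {L: per_call_ms(t) for L, t in SELF_CUDA_TOTAL_MS.items()}
for L, ms in per_call.items():
    print(f"L{L}: {ms:.3f} ms per kernel call")
```

The resulting averages, roughly 0.93 ms to 1.23 ms, agree with the "CUDA time avg" column of the flash_fwd_kernel rows.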


impl                     wl                         p50(ms)  ok
hf_kernels_flash_attn    cuda_attn_L128_bfloat16     0.95  True
hf_kernels_flash_attn    cuda_attn_L256_bfloat16     1.00  True
hf_kernels_flash_attn    cuda_attn_L320_bfloat16     1.05  True
hf_kernels_flash_attn    cuda_attn_L384_bfloat16     1.06  True
hf_kernels_flash_attn    cuda_attn_L448_bfloat16     1.23  True
hf_kernels_flash_attn    cuda_attn_L512_bfloat16     1.23  True
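
To put the summary table in perspective, the p50 values can be compared against the shortest workload. This is a small sketch under one assumption: that the L in each workload name is a sequence-length parameter, which the output itself does not state. The variable names are illustrative.

```python
# p50 latencies (ms), copied by hand from the summary table above.
# Assumption: the L in each workload name is a sequence-length parameter.
P50_MS = {128: 0.95, 256: 1.00, 320: 1.05, 384: 1.06, 448: 1.23, 512: 1.23}

base_len = min(P50_MS)
base_ms = P50_MS[base_len]

# For each workload: (length ratio, latency ratio) relative to the shortest one.
scaling = {L: (L / base_len, ms / base_ms) for L, ms in P50_MS.items()}

for L, (len_ratio, lat_ratio) in scaling.items():
    print(f"L{L}: {len_ratio:.2f}x length -> {lat_ratio:.2f}x p50 latency")
```

Between L128 and L512 the L value grows 4x while p50 grows by about 1.29x (0.95 ms to 1.23 ms), so over this range latency rises much more slowly than L.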
Fetching 20 files: 100%|██████████| 20/20 [00:01<00:00, 11.87it/s]

Artifacts:

attention.jsonl
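
The artifact is a JSON Lines file, so it can be inspected with the standard library alone. The sketch below assumes only that each non-empty line is one JSON value. The record schema is not shown in this output, so the code prints the keys it finds instead of relying on particular field names. `load_jsonl` is a hypothetical helper, not part of kernels_benchmark_tools.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return one parsed JSON value per non-empty line of the file."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    artifact = Path("attention.jsonl")  # artifact name taken from the listing above
    if artifact.exists():
        records = load_jsonl(artifact)
        print(f"{len(records)} records")
        if records and isinstance(records[0], dict):
            # The schema is unknown here, so list the keys rather than assume them.
            print(sorted(records[0]))
```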