HF Kernels - Flash Attention 3

HuggingFace Kernels Flash Attention 3 Benchmark

▼ code ▼ output ▶ uv-logs | Cell: benchmark | 5.52s | Raw GitHub 🤗 HF
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
import sys
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the flash attention 3 kernel
hf_kernels_flash_attn3 = get_kernel("kernels-community/flash-attn3")


def hf_flash_attention3(query, key, value):
    return hf_kernels_flash_attn3.flash_attn_func(query, key, value, causal=False)[0]


run_benchmark(
    kernel_type=KernelTypeEnum.ATTENTION,
    impl_name="hf_kernels_flash_attn3",
    impl_tags={"family": "hf-kernels", "backend": "flash-attn3", "compile": "none"},
    impl_func=hf_flash_attention3,
)
Running attention benchmark on cuda with 6 workloads.

======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L128_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         3.72%     161.222us        44.67%       1.935ms       1.935ms       0.000us         0.00%       3.599ms       3.599ms             1  
                                          FlashAttnFunc         2.81%     121.834us        40.95%       1.774ms     591.218us       0.000us         0.00%       3.599ms       1.200ms             3  
                        _flash_attn3_48fe103_dirty::fwd         1.85%      79.992us        38.14%       1.652ms     550.607us       2.693ms       100.00%       3.599ms       1.200ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       2.695ms       100.05%       2.695ms       2.695ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.693ms       100.00%       2.693ms     897.759us             3  
                                Activity Buffer Request        33.93%       1.470ms        33.93%       1.470ms       1.470ms     905.439us        33.62%     905.439us     905.439us             1  
                                            aten::empty         1.00%      43.311us         1.00%      43.311us       7.219us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.32%      13.891us         0.32%      13.891us       4.630us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.04%      45.121us         1.04%      45.121us      15.040us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        55.33%       2.396ms        55.33%       2.396ms       2.396ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.331ms
Self CUDA time total: 2.693ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L256_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         2.17%      96.772us        39.76%       1.770ms       1.770ms       0.000us         0.00%       3.876ms       3.876ms             1  
                                          FlashAttnFunc         2.04%      90.694us        37.59%       1.674ms     557.834us       0.000us         0.00%       3.876ms       1.292ms             3  
                        _flash_attn3_48fe103_dirty::fwd         1.15%      51.142us        35.55%       1.583ms     527.603us       2.896ms       100.00%       3.876ms       1.292ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       2.898ms       100.05%       2.898ms       2.898ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.896ms       100.00%       2.896ms     965.387us             3  
                                Activity Buffer Request        33.04%       1.471ms        33.04%       1.471ms       1.471ms     979.809us        33.83%     979.809us     979.809us             1  
                                            aten::empty         0.58%      25.610us         0.58%      25.610us       4.268us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.12%       5.240us         0.12%       5.240us       1.747us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.67%      29.750us         0.67%      29.750us       9.917us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        60.24%       2.682ms        60.24%       2.682ms       2.682ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.452ms
Self CUDA time total: 2.896ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L320_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         2.19%      98.331us        39.82%       1.786ms       1.786ms       0.000us         0.00%       3.885ms       3.885ms             1  
                                          FlashAttnFunc         1.99%      89.333us        37.63%       1.688ms     562.551us       0.000us         0.00%       3.885ms       1.295ms             3  
                        _flash_attn3_48fe103_dirty::fwd         1.08%      48.311us        35.64%       1.598ms     532.773us       2.912ms       100.00%       3.885ms       1.295ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       2.914ms       100.05%       2.914ms       2.914ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.912ms       100.00%       2.912ms     970.802us             3  
                                Activity Buffer Request        33.18%       1.488ms        33.18%       1.488ms       1.488ms     972.637us        33.40%     972.637us     972.637us             1  
                                            aten::empty         0.57%      25.370us         0.57%      25.370us       4.228us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.13%       5.730us         0.13%       5.730us       1.910us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.69%      30.861us         0.69%      30.861us      10.287us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        60.18%       2.699ms        60.18%       2.699ms       2.699ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.485ms
Self CUDA time total: 2.912ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L384_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         2.51%     118.553us        41.81%       1.973ms       1.973ms       0.000us         0.00%       3.964ms       3.964ms             1  
                                          FlashAttnFunc         1.94%      91.662us        39.30%       1.855ms     618.205us       0.000us         0.00%       3.964ms       1.321ms             3  
                        _flash_attn3_48fe103_dirty::fwd         1.07%      50.373us        37.36%       1.763ms     587.651us       2.962ms       100.00%       3.964ms       1.321ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       2.964ms       100.05%       2.964ms       2.964ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.962ms       100.00%       2.962ms     987.401us             3  
                                Activity Buffer Request        30.92%       1.459ms        30.92%       1.459ms       1.459ms       1.002ms        33.82%       1.002ms       1.002ms             1  
                                            aten::empty         0.56%      26.451us         0.56%      26.451us       4.408us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.11%       5.270us         0.11%       5.270us       1.757us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         4.70%     221.845us         4.70%     221.845us      73.948us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        58.19%       2.746ms        58.19%       2.746ms       2.746ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.719ms
Self CUDA time total: 2.962ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L448_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         2.19%     114.453us        37.34%       1.953ms       1.953ms       0.000us         0.00%       4.662ms       4.662ms             1  
                                          FlashAttnFunc         1.73%      90.401us        35.15%       1.838ms     612.822us       0.000us         0.00%       4.662ms       1.554ms             3  
                        _flash_attn3_48fe103_dirty::fwd         0.97%      50.643us        33.42%       1.748ms     582.688us       3.490ms       100.00%       4.662ms       1.554ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       3.492ms       100.04%       3.492ms       3.492ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       3.490ms       100.00%       3.490ms       1.163ms             3  
                                Activity Buffer Request        28.44%       1.487ms        28.44%       1.487ms       1.487ms       1.171ms        33.56%       1.171ms       1.171ms             1  
                                            aten::empty         0.52%      27.271us         0.52%      27.271us       4.545us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.09%       4.950us         0.09%       4.950us       1.650us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         3.40%     178.024us         3.40%     178.024us      59.341us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        62.66%       3.277ms        62.66%       3.277ms       3.277ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.230ms
Self CUDA time total: 3.490ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L512_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         2.26%     115.663us        36.27%       1.854ms       1.854ms       0.000us         0.00%       4.679ms       4.679ms             1  
                                          FlashAttnFunc         2.25%     114.773us        34.01%       1.738ms     579.364us       0.000us         0.00%       4.679ms       1.560ms             3  
                        _flash_attn3_48fe103_dirty::fwd         1.02%      51.933us        31.76%       1.623ms     541.107us       3.499ms       100.00%       4.679ms       1.560ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       3.500ms       100.04%       3.500ms       3.500ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       3.499ms       100.00%       3.499ms       1.166ms             3  
                                Activity Buffer Request        26.80%       1.370ms        26.80%       1.370ms       1.370ms       1.181ms        33.75%       1.181ms       1.181ms             1  
                                            aten::empty         0.54%      27.681us         0.54%      27.681us       4.613us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.10%       5.079us         0.10%       5.079us       1.693us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         3.30%     168.813us         3.30%     168.813us      56.271us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        63.73%       3.257ms        63.73%       3.257ms       3.257ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.111ms
Self CUDA time total: 3.499ms


impl                     wl                  p50(ms)  ok
hf_kernels_flash_attn3   cuda_attn_L128_bfloat16     0.94  True
hf_kernels_flash_attn3   cuda_attn_L256_bfloat16     0.97  True
hf_kernels_flash_attn3   cuda_attn_L320_bfloat16     1.04  True
hf_kernels_flash_attn3   cuda_attn_L384_bfloat16     1.05  True
hf_kernels_flash_attn3   cuda_attn_L448_bfloat16     1.20  True
hf_kernels_flash_attn3   cuda_attn_L512_bfloat16     1.18  True
Fetching 4 files: 0%| | 0/4 [00:00<?, ?it/s] Fetching 4 files: 50%|█████ | 2/4 [00:01<00:01, 1.38it/s] Fetching 4 files: 100%|██████████| 4/4 [00:01<00:00, 2.75it/s]

Artifacts:

attention.jsonl