HF Kernels - Flash Attention 3

HuggingFace Kernels Flash Attention 3 Benchmark

▼ code ▼ output ▶ uv-logs | Cell: benchmark | 5.53s | Raw GitHub 🤗 HF
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
import sys
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the flash attention 3 kernel
hf_kernels_flash_attn3 = get_kernel("kernels-community/flash-attn3")


def hf_flash_attention3(query, key, value):
    return hf_kernels_flash_attn3.flash_attn_func(query, key, value, causal=False)[0]


run_benchmark(
    kernel_type=KernelTypeEnum.ATTENTION,
    impl_name="hf_kernels_flash_attn3",
    impl_tags={"family": "hf-kernels", "backend": "flash-attn3", "compile": "none"},
    impl_func=hf_flash_attention3,
)
Running attention benchmark on cuda with 6 workloads.

======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L128_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         3.85%     171.193us        46.01%       2.045ms       2.045ms       0.000us         0.00%       3.614ms       3.614ms             1  
                                          FlashAttnFunc         3.07%     136.295us        42.15%       1.874ms     624.570us       0.000us         0.00%       3.614ms       1.205ms             3  
                        _flash_attn3_48fe103_dirty::fwd         1.94%      86.341us        39.09%       1.737ms     579.138us       2.720ms       100.00%       3.614ms       1.205ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       2.722ms       100.05%       2.722ms       2.722ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.720ms       100.00%       2.720ms     906.698us             3  
                                Activity Buffer Request        34.72%       1.543ms        34.72%       1.543ms       1.543ms     893.600us        32.85%     893.600us     893.600us             1  
                                            aten::empty         1.07%      47.441us         1.07%      47.441us       7.907us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.31%      13.761us         0.31%      13.761us       4.587us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.05%      46.772us         1.05%      46.772us      15.591us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        53.99%       2.400ms        53.99%       2.400ms       2.400ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.445ms
Self CUDA time total: 2.720ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L256_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         2.41%     104.370us        41.13%       1.784ms       1.784ms       0.000us         0.00%       3.700ms       3.700ms             1  
                                          FlashAttnFunc         2.00%      86.685us        38.73%       1.679ms     559.738us       0.000us         0.00%       3.700ms       1.233ms             3  
                        _flash_attn3_48fe103_dirty::fwd         1.21%      52.631us        36.73%       1.593ms     530.843us       2.768ms       100.00%       3.700ms       1.233ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       2.769ms       100.06%       2.769ms       2.769ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.768ms       100.00%       2.768ms     922.559us             3  
                                Activity Buffer Request        34.10%       1.479ms        34.10%       1.479ms       1.479ms     932.127us        33.68%     932.127us     932.127us             1  
                                            aten::empty         0.60%      25.981us         0.60%      25.981us       4.330us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.12%       5.050us         0.12%       5.050us       1.683us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.70%      30.140us         0.70%      30.140us      10.047us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        58.87%       2.553ms        58.87%       2.553ms       2.553ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.336ms
Self CUDA time total: 2.768ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L320_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         2.29%     102.411us        40.10%       1.791ms       1.791ms       0.000us         0.00%       3.875ms       3.875ms             1  
                                          FlashAttnFunc         2.01%      89.903us        37.81%       1.688ms     562.801us       0.000us         0.00%       3.875ms       1.292ms             3  
                        _flash_attn3_48fe103_dirty::fwd         1.18%      52.613us        35.79%       1.599ms     532.834us       2.892ms       100.00%       3.875ms       1.292ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       2.893ms       100.05%       2.893ms       2.893ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.892ms       100.00%       2.892ms     963.972us             3  
                                Activity Buffer Request        33.24%       1.485ms        33.24%       1.485ms       1.485ms     983.097us        33.99%     983.097us     983.097us             1  
                                            aten::empty         0.58%      25.770us         0.58%      25.770us       4.295us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.11%       4.820us         0.11%       4.820us       1.607us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.69%      30.740us         0.69%      30.740us      10.247us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        59.90%       2.675ms        59.90%       2.675ms       2.675ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.466ms
Self CUDA time total: 2.892ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L384_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         2.68%     125.944us        42.11%       1.982ms       1.982ms       0.000us         0.00%       3.932ms       3.932ms             1  
                                          FlashAttnFunc         1.98%      92.983us        39.44%       1.856ms     618.639us       0.000us         0.00%       3.932ms       1.311ms             3  
                        _flash_attn3_48fe103_dirty::fwd         1.14%      53.661us        37.46%       1.763ms     587.645us       2.953ms       100.00%       3.932ms       1.311ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       2.954ms       100.06%       2.954ms       2.954ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.953ms       100.00%       2.953ms     984.176us             3  
                                Activity Buffer Request        30.48%       1.434ms        30.48%       1.434ms       1.434ms     979.803us        33.19%     979.803us     979.803us             1  
                                            aten::empty         0.58%      27.450us         0.58%      27.450us       4.575us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.11%       5.150us         0.11%       5.150us       1.717us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         5.15%     242.396us         5.15%     242.396us      80.799us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        57.89%       2.724ms        57.89%       2.724ms       2.724ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.706ms
Self CUDA time total: 2.953ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L448_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         2.36%     122.892us        37.59%       1.960ms       1.960ms       0.000us         0.00%       4.622ms       4.622ms             1  
                                          FlashAttnFunc         1.74%      90.533us        35.23%       1.837ms     612.429us       0.000us         0.00%       4.622ms       1.541ms             3  
                        _flash_attn3_48fe103_dirty::fwd         0.97%      50.750us        33.49%       1.747ms     582.252us       3.470ms       100.00%       4.622ms       1.541ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       3.472ms       100.05%       3.472ms       3.472ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       3.470ms       100.00%       3.470ms       1.157ms             3  
                                Activity Buffer Request        27.49%       1.433ms        27.49%       1.433ms       1.433ms       1.152ms        33.20%       1.152ms       1.152ms             1  
                                            aten::empty         0.51%      26.592us         0.51%      26.592us       4.432us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.10%       5.060us         0.10%       5.060us       1.687us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         4.43%     230.856us         4.43%     230.856us      76.952us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        62.41%       3.255ms        62.41%       3.255ms       3.255ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.215ms
Self CUDA time total: 3.470ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn3 | cuda_attn_L512_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 hf_kernels_flash_attn3         2.32%     120.892us        37.51%       1.951ms       1.951ms       0.000us         0.00%       4.639ms       4.639ms             1  
                                          FlashAttnFunc         1.74%      90.773us        35.18%       1.830ms     610.133us       0.000us         0.00%       4.639ms       1.546ms             3  
                        _flash_attn3_48fe103_dirty::fwd         0.99%      51.351us        33.44%       1.740ms     579.875us       3.468ms       100.00%       4.639ms       1.546ms             3  
                                 hf_kernels_flash_attn3         0.00%       0.000us         0.00%       0.000us       0.000us       3.469ms       100.05%       3.469ms       3.469ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       3.468ms       100.00%       3.468ms       1.156ms             3  
                                Activity Buffer Request        27.26%       1.418ms        27.26%       1.418ms       1.418ms       1.172ms        33.79%       1.172ms       1.172ms             1  
                                            aten::empty         0.51%      26.560us         0.51%      26.560us       4.427us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.10%       5.101us         0.10%       5.101us       1.700us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         4.58%     238.367us         4.58%     238.367us      79.456us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        62.49%       3.251ms        62.49%       3.251ms       3.251ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.202ms
Self CUDA time total: 3.468ms


impl                     wl                  p50(ms)  ok
hf_kernels_flash_attn3   cuda_attn_L128_bfloat16     0.92  True
hf_kernels_flash_attn3   cuda_attn_L256_bfloat16     0.96  True
hf_kernels_flash_attn3   cuda_attn_L320_bfloat16     1.01  True
hf_kernels_flash_attn3   cuda_attn_L384_bfloat16     1.03  True
hf_kernels_flash_attn3   cuda_attn_L448_bfloat16     1.20  True
hf_kernels_flash_attn3   cuda_attn_L512_bfloat16     1.18  True
Fetching 4 files: 0%| | 0/4 [00:00<?, ?it/s] Fetching 4 files: 50%|█████ | 2/4 [00:01<00:01, 1.42it/s] Fetching 4 files: 100%|██████████| 4/4 [00:01<00:00, 2.84it/s]

Artifacts:

attention.jsonl