Flash Attention Implementation

GPU Info

▼ code ▼ output ▶ uv-logs | Cell: nv | 0.28s | Raw GitHub
import subprocess

print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Wed Oct 29 14:25:53 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   27C    P8             21W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
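The same device details can also be queried from PyTorch itself, which is handy inside a benchmark script. A minimal sketch, assuming a CUDA-enabled torch build:

import torch

# Query the visible CUDA device directly (device 0 here).
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name)                                    # e.g. "NVIDIA L40S"
    print(f"{props.total_memory / 2**20:.0f} MiB total memory")
    print(f"{props.multi_processor_count} SMs")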

Flash Attention Benchmark

▼ code ▼ output ▶ uv-logs | Cell: benchmark | 32.77s | Raw GitHub
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark


def torch_flash(q, k, v):
    # Inputs arrive as (batch, seq_len, heads, head_dim); SDPA expects
    # (batch, heads, seq_len, head_dim), so transpose and make contiguous.
    qt, kt, vt = (x.transpose(1, 2).contiguous() for x in (q, k, v))
    # Pin SDPA to the FlashAttention backend instead of letting it auto-select.
    with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.FLASH_ATTENTION):
        o = torch.nn.functional.scaled_dot_product_attention(qt, kt, vt)
    # Transpose back to the original (batch, seq_len, heads, head_dim) layout.
    return o.transpose(1, 2).contiguous()


run_benchmark(
    kernel_type=KernelTypeEnum.ATTENTION,
    impl_name="torch_flash_ma",
    impl_tags={"family": "torch-sdpa", "backend": "FLASH", "compile": "max-autotune"},
    impl_func=torch_flash,
)
Running attention benchmark on cuda with 6 workloads.

======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L128_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       3.644ms       102.02%       3.644ms       3.644ms             1  
                                         torch_flash_ma         6.80%     356.846us        47.04%       2.468ms       2.468ms       0.000us         0.00%       3.612ms       3.612ms             1  
                     aten::scaled_dot_product_attention         0.82%      43.042us         4.47%     234.776us      78.259us       0.000us         0.00%       2.857ms     952.201us             3  
              aten::_scaled_dot_product_flash_attention         0.56%      29.330us         3.65%     191.734us      63.911us       0.000us         0.00%       2.857ms     952.201us             3  
                         aten::_flash_attention_forward         0.75%      39.581us         2.59%     135.674us      45.225us       2.857ms        79.97%       2.857ms     952.201us             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       2.857ms        79.97%       2.857ms     952.201us             3  
                                       aten::contiguous         0.27%      14.180us        34.32%       1.801ms     150.051us       0.000us         0.00%     755.680us      62.973us            12  
                                            aten::clone         0.74%      38.791us        34.04%       1.786ms     148.870us       0.000us         0.00%     755.680us      62.973us            12  
                                            aten::copy_         1.85%      97.030us        31.43%       1.649ms     137.429us     715.456us        20.03%     755.680us      62.973us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     715.456us        20.03%     715.456us      59.621us            12  
                                Activity Buffer Request        27.38%       1.437ms        27.38%       1.437ms       1.437ms      40.224us         1.13%      40.224us      40.224us             1  
                                        aten::transpose         1.47%      77.273us         1.96%     102.714us       4.280us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.48%      25.441us         0.48%      25.441us       1.060us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.70%      36.821us         2.35%     123.326us       8.222us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.93%     101.493us         1.93%     101.493us       4.229us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         2.70%     141.775us         2.70%     141.775us       9.452us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.35%      18.402us         0.35%      18.402us       6.134us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.05%       2.540us         0.05%       2.540us       0.423us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.17%       8.890us         0.17%       8.890us       2.963us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        52.96%       2.779ms        52.96%       2.779ms       2.779ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.247ms
Self CUDA time total: 3.572ms



======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L256_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         4.70%     246.528us        41.73%       2.189ms       2.189ms       0.000us         0.00%       3.817ms       3.817ms             1  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       3.772ms       100.28%       3.772ms       3.772ms             1  
                     aten::scaled_dot_product_attention         0.51%      26.610us         3.43%     180.143us      60.048us       0.000us         0.00%       2.999ms     999.573us             3  
              aten::_scaled_dot_product_flash_attention         0.37%      19.600us         2.93%     153.533us      51.178us       0.000us         0.00%       2.999ms     999.573us             3  
                         aten::_flash_attention_forward         0.63%      32.980us         2.12%     111.443us      37.148us       2.999ms        79.71%       2.999ms     999.573us             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       2.999ms        79.71%       2.999ms     999.573us             3  
                                       aten::contiguous         0.19%      10.030us        32.68%       1.715ms     142.893us       0.000us         0.00%     818.210us      68.184us            12  
                                            aten::clone         0.55%      29.002us        32.49%       1.705ms     142.057us       0.000us         0.00%     818.210us      68.184us            12  
                                            aten::copy_         2.09%     109.441us        30.74%       1.613ms     134.399us     763.297us        20.29%     818.210us      68.184us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     763.297us        20.29%     763.297us      63.608us            12  
                                Activity Buffer Request        26.94%       1.413ms        26.94%       1.413ms       1.413ms      54.913us         1.46%      54.913us      54.913us             1  
                                        aten::transpose         1.00%      52.652us         1.34%      70.433us       2.935us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.34%      17.781us         0.34%      17.781us       0.741us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.38%      19.980us         1.61%      84.581us       5.639us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.45%      76.201us         1.45%      76.201us       3.175us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         2.16%     113.102us         2.16%     113.102us       7.540us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.31%      16.430us         0.31%      16.430us       5.477us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.03%       1.751us         0.03%       1.751us       0.292us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.07%       3.771us         0.07%       3.771us       1.257us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        58.27%       3.058ms        58.27%       3.058ms       3.058ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.247ms
Self CUDA time total: 3.762ms



======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L320_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         4.50%     237.986us        41.18%       2.178ms       2.178ms       0.000us         0.00%       3.833ms       3.833ms             1  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       3.785ms       100.29%       3.785ms       3.785ms             1  
                     aten::scaled_dot_product_attention         0.46%      24.381us         3.40%     179.915us      59.972us       0.000us         0.00%       2.998ms     999.221us             3  
              aten::_scaled_dot_product_flash_attention         0.36%      19.171us         2.94%     155.534us      51.845us       0.000us         0.00%       2.998ms     999.221us             3  
                         aten::_flash_attention_forward         0.65%      34.259us         2.15%     113.691us      37.897us       2.998ms        79.44%       2.998ms     999.221us             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       2.998ms        79.44%       2.998ms     999.221us             3  
                                       aten::contiguous         0.19%       9.800us        32.38%       1.712ms     142.708us       0.000us         0.00%     835.263us      69.605us            12  
                                            aten::clone         0.53%      28.211us        32.20%       1.703ms     141.891us       0.000us         0.00%     835.263us      69.605us            12  
                                            aten::copy_         1.60%      84.650us        30.46%       1.611ms     134.247us     776.063us        20.56%     835.263us      69.605us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     776.063us        20.56%     776.063us      64.672us            12  
                                Activity Buffer Request        27.18%       1.437ms        27.18%       1.437ms       1.437ms      59.200us         1.57%      59.200us      59.200us             1  
                                        aten::transpose         0.99%      52.225us         1.33%      70.125us       2.922us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.34%      17.900us         0.34%      17.900us       0.746us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.37%      19.782us         1.60%      84.803us       5.654us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.45%      76.431us         1.45%      76.431us       3.185us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         2.16%     114.204us         2.16%     114.204us       7.614us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.30%      16.100us         0.30%      16.100us       5.367us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.03%       1.730us         0.03%       1.730us       0.288us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.07%       3.730us         0.07%       3.730us       1.243us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        58.82%       3.110ms        58.82%       3.110ms       3.110ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.288ms
Self CUDA time total: 3.774ms



======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L384_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         4.36%     241.837us        43.33%       2.405ms       2.405ms       0.000us         0.00%       3.884ms       3.884ms             1  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       3.837ms       100.27%       3.837ms       3.837ms             1  
                     aten::scaled_dot_product_attention         0.48%      26.802us         3.27%     181.715us      60.572us       0.000us         0.00%       3.042ms       1.014ms             3  
              aten::_scaled_dot_product_flash_attention         0.35%      19.308us         2.79%     154.913us      51.638us       0.000us         0.00%       3.042ms       1.014ms             3  
                         aten::_flash_attention_forward         0.60%      33.361us         2.03%     112.712us      37.571us       3.042ms        79.50%       3.042ms       1.014ms             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       3.042ms        79.50%       3.042ms       1.014ms             3  
                                       aten::contiguous         0.17%       9.659us        34.84%       1.934ms     161.162us       0.000us         0.00%     841.829us      70.152us            12  
                                            aten::clone         0.50%      27.830us        34.67%       1.924ms     160.357us       0.000us         0.00%     841.829us      70.152us            12  
                                            aten::copy_         1.56%      86.702us        32.55%       1.807ms     150.547us     784.548us        20.50%     841.829us      70.152us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     784.548us        20.50%     784.548us      65.379us            12  
                                Activity Buffer Request        25.45%       1.413ms        25.45%       1.413ms       1.413ms      57.281us         1.50%      57.281us      57.281us             1  
                                        aten::transpose         0.95%      52.620us         1.27%      70.404us       2.933us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.32%      17.784us         0.32%      17.784us       0.741us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.78%      43.221us         2.00%     111.194us       7.413us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.45%      80.673us         1.45%      80.673us       3.361us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         5.96%     331.078us         5.96%     331.078us      22.072us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.28%      15.800us         0.28%      15.800us       5.267us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.03%       1.730us         0.03%       1.730us       0.288us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.07%       3.850us         0.07%       3.850us       1.283us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        56.67%       3.146ms        56.67%       3.146ms       3.146ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.551ms
Self CUDA time total: 3.827ms



======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L448_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         4.46%     268.165us        40.09%       2.413ms       2.413ms       0.000us         0.00%       4.405ms       4.405ms             1  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       4.355ms       100.25%       4.355ms       4.355ms             1  
                     aten::scaled_dot_product_attention         0.46%      27.642us         3.64%     218.806us      72.935us       0.000us         0.00%       3.540ms       1.180ms             3  
              aten::_scaled_dot_product_flash_attention         0.75%      45.250us         3.18%     191.164us      63.721us       0.000us         0.00%       3.540ms       1.180ms             3  
                         aten::_flash_attention_forward         0.61%      36.651us         2.01%     120.923us      40.308us       3.540ms        81.48%       3.540ms       1.180ms             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       3.540ms        81.48%       3.540ms       1.180ms             3  
                                       aten::contiguous         0.18%      10.862us        31.11%       1.873ms     156.050us       0.000us         0.00%     865.606us      72.134us            12  
                                            aten::clone         0.51%      30.490us        30.93%       1.862ms     155.145us       0.000us         0.00%     865.606us      72.134us            12  
                                            aten::copy_         1.51%      90.931us        29.34%       1.766ms     147.155us     804.645us        18.52%     865.606us      72.134us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     804.645us        18.52%     804.645us      67.054us            12  
                                Activity Buffer Request        21.61%       1.300ms        21.61%       1.300ms       1.300ms      60.961us         1.40%      60.961us      60.961us             1  
                                        aten::transpose         0.99%      59.753us         1.30%      78.501us       3.271us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.31%      18.748us         0.31%      18.748us       0.781us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.35%      20.935us         1.45%      87.165us       5.811us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.32%      79.690us         1.32%      79.690us       3.320us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         6.67%     401.680us         6.67%     401.680us      26.779us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.27%      16.081us         0.27%      16.081us       5.360us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.03%       2.030us         0.03%       2.030us       0.338us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.06%       3.810us         0.06%       3.810us       1.270us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        59.91%       3.605ms        59.91%       3.605ms       3.605ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.018ms
Self CUDA time total: 4.344ms



======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L512_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         4.01%     246.839us        39.75%       2.447ms       2.447ms       0.000us         0.00%       4.458ms       4.458ms             1  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       4.407ms       100.23%       4.407ms       4.407ms             1  
                     aten::scaled_dot_product_attention         0.40%      24.621us         2.95%     181.474us      60.491us       0.000us         0.00%       3.579ms       1.193ms             3  
              aten::_scaled_dot_product_flash_attention         0.34%      20.980us         2.55%     156.853us      52.284us       0.000us         0.00%       3.579ms       1.193ms             3  
                         aten::_flash_attention_forward         0.58%      35.588us         1.84%     113.003us      37.668us       3.579ms        81.40%       3.579ms       1.193ms             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       3.579ms        81.40%       3.579ms       1.193ms             3  
                                       aten::contiguous         0.16%      10.061us        32.01%       1.971ms     164.244us       0.000us         0.00%     878.818us      73.235us            12  
                                            aten::clone         0.50%      30.903us        31.85%       1.961ms     163.406us       0.000us         0.00%     878.818us      73.235us            12  
                                            aten::copy_         1.35%      82.841us        30.27%       1.864ms     155.305us     817.634us        18.60%     878.818us      73.235us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     817.634us        18.60%     817.634us      68.136us            12  
                                Activity Buffer Request        23.50%       1.447ms        23.50%       1.447ms       1.447ms      61.184us         1.39%      61.184us      61.184us             1  
                                        aten::transpose         0.85%      52.630us         1.15%      70.790us       2.950us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.29%      18.160us         0.29%      18.160us       0.757us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.33%      20.456us         1.41%      86.700us       5.780us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.28%      78.794us         1.28%      78.794us       3.283us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         5.81%     357.919us         5.81%     357.919us      23.861us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.25%      15.401us         0.25%      15.401us       5.134us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.03%       1.632us         0.03%       1.632us       0.272us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.06%       3.720us         0.06%       3.720us       1.240us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        60.25%       3.709ms        60.25%       3.709ms       3.709ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.156ms
Self CUDA time total: 4.397ms


impl                     wl                  p50(ms)  ok
torch_flash_ma           cuda_attn_L128_bfloat16     1.22  True
torch_flash_ma           cuda_attn_L256_bfloat16     1.28  True
torch_flash_ma           cuda_attn_L320_bfloat16     1.29  True
torch_flash_ma           cuda_attn_L384_bfloat16     1.33  True
torch_flash_ma           cuda_attn_L448_bfloat16     1.47  True
torch_flash_ma           cuda_attn_L512_bfloat16     1.50  True
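The impl tags advertise compile: "max-autotune" (hence the _ma suffix), but torch_flash as written runs eagerly; whether the harness compiles the callable based on the tags is not shown here. A manual route, as a sketch, would wrap the implementation with torch.compile:

import torch

# Hypothetical compiled variant; mode="max-autotune" asks the compiler to
# spend extra compile time searching for faster kernel configurations.
torch_flash_compiled = torch.compile(torch_flash, mode="max-autotune")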

Artifacts:

attention.jsonl
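
The attention.jsonl artifact holds the raw measurements, one JSON record per line. A sketch for inspecting it, assuming only that each line parses as JSON (the exact schema is defined by kernels-benchmark-tools):

import json

# Print each benchmark record; field names depend on the
# kernels-benchmark-tools schema and are not assumed here.
with open("attention.jsonl") as f:
    for line in f:
        print(json.loads(line))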