xFormers Memory Efficient Attention

xFormers Benchmark

▼ code ▼ output ▶ uv-logs | Cell: benchmark | 33.71s | Raw GitHub
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "xformers",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
import sys
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
import xformers.ops as xops


def xformers_attention(q, k, v):
    """xFormers memory efficient attention"""
    # xFormers expects [batch, seq_len, heads, head_dim]
    return xops.memory_efficient_attention(q, k, v)


run_benchmark(
    kernel_type=KernelTypeEnum.ATTENTION,
    impl_name="xformers_meff",
    impl_tags={"family": "xformers", "backend": "memory_efficient", "compile": "none"},
    impl_func=xformers_attention,
)
Running attention benchmark on cuda with 6 workloads.

======================================================================
PROFILE TRACE: xformers_meff | cuda_attn_L128_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          xformers_meff        10.98%     488.134us        52.82%       2.349ms       2.349ms       0.000us         0.00%       3.539ms       3.539ms             1  
                             xformers_flash3::flash_fwd         4.45%     198.034us        41.02%       1.824ms     608.009us       0.000us         0.00%       3.539ms       1.180ms             3  
                                      flash_attn_3::fwd         1.81%      80.354us        36.57%       1.626ms     541.997us       2.647ms       100.00%       3.539ms       1.180ms             3  
                                          xformers_meff         0.00%       0.000us         0.00%       0.000us       0.000us       2.648ms       100.06%       2.648ms       2.648ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.647ms       100.00%       2.647ms     882.203us             3  
                                Activity Buffer Request        32.65%       1.452ms        32.65%       1.452ms       1.452ms     892.891us        33.74%     892.891us     892.891us             1  
                                            aten::empty         0.78%      34.470us         0.78%      34.470us       5.745us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.26%      11.370us         0.26%      11.370us       3.790us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.08%      47.851us         1.08%      47.851us      15.950us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::reshape         0.28%      12.261us         0.82%      36.420us       6.070us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.54%      24.159us         0.54%      24.159us       4.026us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        47.18%       2.098ms        47.18%       2.098ms       2.098ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.447ms
Self CUDA time total: 2.647ms



======================================================================
PROFILE TRACE: xformers_meff | cuda_attn_L256_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          xformers_meff         7.22%     318.208us        46.97%       2.070ms       2.070ms       0.000us         0.00%       3.700ms       3.700ms             1  
                             xformers_flash3::flash_fwd         3.33%     146.973us        39.20%       1.728ms     575.898us       0.000us         0.00%       3.700ms       1.233ms             3  
                                      flash_attn_3::fwd         1.20%      53.004us        35.87%       1.581ms     526.907us       2.767ms       100.00%       3.700ms       1.233ms             3  
                                          xformers_meff         0.00%       0.000us         0.00%       0.000us       0.000us       2.769ms       100.05%       2.769ms       2.769ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.767ms       100.00%       2.767ms     922.499us             3  
                                Activity Buffer Request        33.12%       1.459ms        33.12%       1.459ms       1.459ms     932.857us        33.71%     932.857us     932.857us             1  
                                            aten::empty         0.65%      28.790us         0.65%      28.790us       4.798us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.13%       5.860us         0.13%       5.860us       1.953us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.76%      33.580us         0.76%      33.580us      11.193us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::reshape         0.21%       9.291us         0.54%      23.901us       3.983us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.33%      14.610us         0.33%      14.610us       2.435us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        53.03%       2.337ms        53.03%       2.337ms       2.337ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.407ms
Self CUDA time total: 2.767ms



======================================================================
PROFILE TRACE: xformers_meff | cuda_attn_L320_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          xformers_meff         6.87%     306.279us        45.67%       2.036ms       2.036ms       0.000us         0.00%       3.803ms       3.803ms             1  
                             xformers_flash3::flash_fwd         3.28%     146.193us        38.29%       1.707ms     568.871us       0.000us         0.00%       3.803ms       1.268ms             3  
                                      flash_attn_3::fwd         1.22%      54.360us        35.01%       1.560ms     520.140us       2.841ms       100.00%       3.803ms       1.268ms             3  
                                          xformers_meff         0.00%       0.000us         0.00%       0.000us       0.000us       2.843ms       100.05%       2.843ms       2.843ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.841ms       100.00%       2.841ms     947.064us             3  
                                Activity Buffer Request        32.21%       1.435ms        32.21%       1.435ms       1.435ms     961.848us        33.85%     961.848us     961.848us             1  
                                            aten::empty         0.68%      30.200us         0.68%      30.200us       5.033us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.12%       5.560us         0.12%       5.560us       1.853us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.78%      34.863us         0.78%      34.863us      11.621us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::reshape         0.20%       8.808us         0.51%      22.610us       3.768us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.31%      13.802us         0.31%      13.802us       2.300us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        54.33%       2.422ms        54.33%       2.422ms       2.422ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.457ms
Self CUDA time total: 2.841ms



======================================================================
PROFILE TRACE: xformers_meff | cuda_attn_L384_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          xformers_meff         6.67%     311.798us        48.16%       2.253ms       2.253ms       0.000us         0.00%       3.854ms       3.854ms             1  
                             xformers_flash3::flash_fwd         3.68%     172.144us        40.98%       1.917ms     638.949us       0.000us         0.00%       3.854ms       1.285ms             3  
                                      flash_attn_3::fwd         1.19%      55.670us        37.30%       1.745ms     581.568us       2.881ms       100.00%       3.854ms       1.285ms             3  
                                          xformers_meff         0.00%       0.000us         0.00%       0.000us       0.000us       2.883ms       100.05%       2.883ms       2.883ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       2.881ms       100.00%       2.881ms     960.465us             3  
                                Activity Buffer Request        30.77%       1.440ms        30.77%       1.440ms       1.440ms     972.603us        33.75%     972.603us     972.603us             1  
                                            aten::empty         0.63%      29.580us         0.63%      29.580us       4.930us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.12%       5.801us         0.12%       5.801us       1.934us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         4.58%     214.036us         4.58%     214.036us      71.345us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::reshape         0.19%       9.019us         0.51%      24.051us       4.009us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.32%      15.032us         0.32%      15.032us       2.505us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        51.84%       2.425ms        51.84%       2.425ms       2.425ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.678ms
Self CUDA time total: 2.881ms



======================================================================
PROFILE TRACE: xformers_meff | cuda_attn_L448_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          xformers_meff         5.88%     304.576us        42.22%       2.188ms       2.188ms       0.000us         0.00%       4.552ms       4.552ms             1  
                             xformers_flash3::flash_fwd         2.84%     147.154us        35.91%       1.861ms     620.213us       0.000us         0.00%       4.552ms       1.517ms             3  
                                      flash_attn_3::fwd         1.02%      52.961us        33.07%       1.713ms     571.161us       3.412ms       100.00%       4.552ms       1.517ms             3  
                                          xformers_meff         0.00%       0.000us         0.00%       0.000us       0.000us       3.414ms       100.04%       3.414ms       3.414ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       3.412ms       100.00%       3.412ms       1.137ms             3  
                                Activity Buffer Request        27.95%       1.448ms        27.95%       1.448ms       1.448ms       1.140ms        33.41%       1.140ms       1.140ms             1  
                                            aten::empty         0.56%      29.272us         0.56%      29.272us       4.879us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.12%       6.180us         0.12%       6.180us       2.060us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         3.41%     176.624us         3.41%     176.624us      58.875us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::reshape         0.17%       9.052us         0.44%      22.882us       3.814us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.27%      13.830us         0.27%      13.830us       2.305us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        57.78%       2.994ms        57.78%       2.994ms       2.994ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.182ms
Self CUDA time total: 3.412ms



======================================================================
PROFILE TRACE: xformers_meff | cuda_attn_L512_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          xformers_meff         5.58%     285.697us        41.87%       2.143ms       2.143ms       0.000us         0.00%       4.544ms       4.544ms             1  
                             xformers_flash3::flash_fwd         2.91%     148.714us        35.83%       1.834ms     611.255us       0.000us         0.00%       4.544ms       1.515ms             3  
                                      flash_attn_3::fwd         1.04%      53.311us        32.92%       1.685ms     561.684us       3.402ms       100.00%       4.544ms       1.515ms             3  
                                          xformers_meff         0.00%       0.000us         0.00%       0.000us       0.000us       3.403ms       100.05%       3.403ms       3.403ms             1  
void cutlass::device_kernel<flash::enable_sm80_to_sm...         0.00%       0.000us         0.00%       0.000us       0.000us       3.402ms       100.00%       3.402ms       1.134ms             3  
                                Activity Buffer Request        27.78%       1.422ms        27.78%       1.422ms       1.422ms       1.142ms        33.57%       1.142ms       1.142ms             1  
                                            aten::empty         0.58%      29.640us         0.58%      29.640us       4.940us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.12%       5.990us         0.12%       5.990us       1.997us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         3.40%     174.134us         3.40%     174.134us      58.045us       0.000us         0.00%       0.000us       0.000us             3  
                                          aten::reshape         0.17%       8.543us         0.45%      23.191us       3.865us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.29%      14.648us         0.29%      14.648us       2.441us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        58.13%       2.975ms        58.13%       2.975ms       2.975ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.118ms
Self CUDA time total: 3.402ms


impl                     wl                  p50(ms)  ok
xformers_meff            cuda_attn_L128_bfloat16     1.00  True
xformers_meff            cuda_attn_L256_bfloat16     1.03  True
xformers_meff            cuda_attn_L320_bfloat16     1.08  True
xformers_meff            cuda_attn_L384_bfloat16     1.08  True
xformers_meff            cuda_attn_L448_bfloat16     1.25  True
xformers_meff            cuda_attn_L512_bfloat16     1.23  True
▶ UV Install Logs

Artifacts:

attention.jsonl