Flash Attention 3 builds compatible with `torch.compile`. See [this PR](https://github.com/Dao-AILab/flash-attention/pull/1769) by guilhermeleobas for more details. There is a build here for Torch 2.8.0 and a build for Torch Nightlies from 08/30 onward. A minimal `torch.compile` usage sketch is included at the end of this page.

To reproduce:

## Torch 2.8.0 Build

Compiled from `https://github.com/varunneal/flash-attention` on branch `guilhermeleobas/fa3-compile`.

Compilation commands:

```
pip install -U pip wheel setuptools ninja numpy packaging psutil
pip install torch==2.8.0

git clone https://github.com/varunneal/flash-attention
cd flash-attention/hopper
git switch fa3-compile

export MAX_JOBS=32
export FLASH_ATTENTION_FORCE_BUILD=TRUE   # skip prebuilt wheel fetch
export FLASH_ATTENTION_DISABLE_SM80=TRUE  # Hopper-only
export FLASH_ATTENTION_DISABLE_FP16=TRUE  # leave BF16, FP8

# Optional, for faster compilation time
export FLASH_ATTENTION_DISABLE_HDIM64=TRUE
export FLASH_ATTENTION_DISABLE_HDIM96=TRUE
export FLASH_ATTENTION_DISABLE_HDIM192=TRUE
export FLASH_ATTENTION_DISABLE_HDIM256=TRUE

python setup.py bdist_wheel
```

## Torch Nightlies Build

Compiled from `https://github.com/varunneal/flash-attention` on branch `stable`. This is a custom fork that combines [ABI Compatibility](https://github.com/Dao-AILab/flash-attention/pull/1791) with `torch.compile` compatibility. This build should be compatible with Torch Nightlies from 08/30 onward.

Compilation commands:

```
pip install -U pip wheel setuptools ninja numpy packaging psutil

# Any Torch Nightly after 08/30 should be alright
pip install --pre "torch==2.10.0.dev20250926+cu126" --index-url https://download.pytorch.org/whl/nightly/cu126

git clone https://github.com/varunneal/flash-attention
cd flash-attention/hopper
git switch stable

export MAX_JOBS=32
export FLASH_ATTENTION_FORCE_BUILD=TRUE   # skip prebuilt wheel fetch
export FLASH_ATTENTION_DISABLE_SM80=TRUE  # Hopper-only
export FLASH_ATTENTION_DISABLE_FP16=TRUE  # leave BF16, FP8

python setup.py bdist_wheel
```

## Tips for ARM builds

On an aarch64/ARM64 system, such as a GH200 server, building requires a bit of finesse. Try:

```
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
export MAX_JOBS=4
```

Please contact me if you would like me to build wheels for any other version of Python or Torch.
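
## Usage sketch with `torch.compile`

A minimal sketch of calling the kernel inside a `torch.compile`-d function, assuming one of the wheels above is installed on a Hopper GPU. The module name `flash_attn_interface` and the `flash_attn_func(q, k, v, causal=...)` call follow the upstream hopper package; the shapes, dtype, and `causal` flag below are illustrative, not prescriptive.

```python
import torch
from flash_attn_interface import flash_attn_func  # module shipped by the hopper build

@torch.compile(fullgraph=True)  # fullgraph=True: fail loudly if the op causes a graph break
def attention(q, k, v):
    out = flash_attn_func(q, k, v, causal=True)
    # Some FA3 versions return (out, softmax_lse); keep only the output tensor.
    return out[0] if isinstance(out, tuple) else out

# (batch, seqlen, nheads, headdim) in BF16, matching the build flags above
q, k, v = (torch.randn(2, 4096, 16, 128, device="cuda", dtype=torch.bfloat16) for _ in range(3))
print(attention(q, k, v).shape)  # expected: torch.Size([2, 4096, 16, 128])
```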