KERNELS COMMUNITY BENCHMARKS
This report aggregates latency and performance benchmarks across core model components.
Each section includes:
- A latency visualization
- Links to detailed implementation benchmarks
TABLE OF CONTENTS
- ACTIVATION FUNCTIONS
- FLASH ATTENTION
- DEFORMABLE DETR
- OPENAI-STYLE MOE
- CAUSAL CONV1D
- ROTARY POSITION EMBEDDINGS
- LAYER NORMALIZATION
RUN YOURSELF
To run the benchmarks locally, clone the repository and use uvx to build and run them:
Note: the benchmarks are designed for a machine with a compatible NVIDIA GPU and CUDA installed; other hardware may not work as expected.
git clone https://github.com/huggingface/kernels-benchmarks.git
cd kernels-benchmarks
uvx https://github.com/drbh/uvnote.git build benches
METHODOLOGY
Each benchmark is run with the Kernels Benchmarking Framework and follows these principles (a minimal timing sketch is given after the list):
- A reference implementation (usually PyTorch native) is included for baseline comparison
- Multiple input sizes and batch sizes are tested to reflect real-world usage
- Runs are repeatable via Python virtual environments and documented dependencies
- Results are collected and visualized using standardized scripts
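The sketch below illustrates the kind of CUDA-event timing loop such a comparison typically uses. It is not the framework's actual harness; the warmup count, iteration count, and median reduction are assumptions made for illustration.

```python
import torch

def bench_median_ms(fn, *args, warmup=10, iters=50):
    """Median latency of fn(*args) in milliseconds, measured with CUDA events."""
    for _ in range(warmup):              # warm up kernels and autotuning caches
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()         # wait so elapsed_time is valid
        times.append(start.elapsed_time(end))
    return sorted(times)[len(times) // 2]
```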
BENCHMARKS
ACTIVATION FUNCTIONS
| Implementation | Description | Source | HF | Bench |
|---|---|---|---|---|
| HF Kernels SwiGLU | HuggingFace kernels SwiGLU implementation | GitHub | HF | Bench |
| PyTorch SwiGLU | PyTorch native SwiGLU implementation | - | - | Bench |
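For orientation, the PyTorch-native SwiGLU baseline amounts to a gated SiLU. A minimal sketch, assuming the gate and up projections are packed along the last dimension (this is an illustration, not the kernels API):

```python
import torch
import torch.nn.functional as F

def swiglu_ref(x: torch.Tensor) -> torch.Tensor:
    # x packs the gate and up projections along the last dim: (..., 2 * d)
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up  # SwiGLU: silu(gate) multiplied elementwise by up
```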
FLASH ATTENTION
| Implementation | Description | Source | HF | Bench |
|---|---|---|---|---|
| Flash Attention | Torch SDPA Flash Attention implementation | - | - | Bench |
| HF Kernels Flash Attention 2 | HuggingFace kernels Flash Attention | GitHub | HF | Bench |
| HF Kernels Flash Attention 3 | HuggingFace kernels Flash Attention 3 | GitHub | HF | Bench |
| Memory Efficient Attention | Memory-efficient attention implementation | - | - | Bench |
| Sage Attention | Sage attention implementation | - | HF | Bench |
| xFormers | xFormers attention implementation | GitHub | - | Bench |
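The Torch SDPA baseline corresponds to PyTorch's scaled_dot_product_attention with the flash backend selected. A minimal sketch, assuming a recent PyTorch (with torch.nn.attention) and a supported CUDA GPU; the shapes and dtype are illustrative:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the flash-attention backend; errors out if unsupported.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```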
DEFORMABLE DETR
| Implementation | Description | Source | HF | Bench |
|---|---|---|---|---|
| HF Kernels Deformable DETR | HuggingFace kernels Deformable DETR implementation | GitHub | HF | Bench |
| PyTorch Deformable DETR | PyTorch native Deformable DETR implementation | - | - | Bench |
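For context, the core of Deformable DETR is multi-scale deformable attention: each query bilinearly samples a small set of offset locations from the feature maps and mixes them with learned weights. A heavily simplified single-level, single-head sketch (tensor layout and names are assumptions, not the benchmarked API):

```python
import torch
import torch.nn.functional as F

def deformable_sample_ref(value, ref_points, offsets, attn_weights):
    # value: (B, C, H, W) single-level feature map
    # ref_points: (B, Q, 2) reference points, normalized to [0, 1] as (x, y)
    # offsets: (B, Q, P, 2) learned offsets for P sampling points per query
    # attn_weights: (B, Q, P) attention weights, softmaxed over P
    loc = ref_points[:, :, None, :] + offsets      # absolute locations in [0, 1]
    grid = 2.0 * loc - 1.0                         # grid_sample expects [-1, 1]
    sampled = F.grid_sample(value, grid, mode="bilinear",
                            padding_mode="zeros", align_corners=False)
    # sampled: (B, C, Q, P) -> weight and sum over the P sampling points
    return (sampled * attn_weights[:, None, :, :]).sum(dim=-1)  # (B, C, Q)
```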
OPENAI-STYLE MOE
| Implementation | Description | Source | HF | Bench |
|---|---|---|---|---|
| GptOssExperts | GPT OSS reference OpenAI-style MoE | - | - | Bench |
| Binned PyTorch | Binned PyTorch OpenAI-style MoE implementation | - | - | Bench |
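For context, an OpenAI-style MoE layer routes each token to its top-k experts and sums the expert outputs weighted by softmax-normalized router scores. A naive loop-over-experts sketch, purely illustrative (names and the exact routing normalization are assumptions):

```python
import torch
import torch.nn.functional as F

def moe_ref(x, router_weight, experts, top_k=2):
    # x: (tokens, hidden); router_weight: (hidden, num_experts)
    # experts: list of callables, each mapping (n, hidden) -> (n, hidden)
    logits = x @ router_weight                          # (tokens, num_experts)
    scores, expert_idx = torch.topk(logits, top_k, dim=-1)
    scores = F.softmax(scores, dim=-1)                  # normalize over chosen experts
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_ids, slot = (expert_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue                                    # no tokens routed to this expert
        out[token_ids] += scores[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
    return out
```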
CAUSAL CONV1D
| Implementation | Description | Source | HF | Bench |
|---|---|---|---|---|
| HF Kernels Causal Conv1D | HuggingFace kernels implementation | GitHub | HF | Bench |
| PyTorch Causal Conv1D | PyTorch native implementation | - | - | Bench |
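The PyTorch baseline here is essentially a depthwise conv1d with left padding, so each output position depends only on past inputs. A minimal sketch (depthwise weight layout assumed; the fused kernel may also apply an activation):

```python
import torch
import torch.nn.functional as F

def causal_conv1d_ref(x, weight, bias=None):
    # x: (batch, channels, seq); weight: (channels, 1, kernel) depthwise filters
    kernel = weight.shape[-1]
    x = F.pad(x, (kernel - 1, 0))    # left-pad so output t depends only on inputs <= t
    return F.conv1d(x, weight, bias, groups=x.shape[1])
```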
ROTARY POSITION EMBEDDINGS
| Implementation | Description | Source | HF | Bench |
|---|---|---|---|---|
| HF Kernels Rotary | HuggingFace kernels implementation | GitHub | HF | Bench |
| PyTorch Rotary | PyTorch native implementation | - | - | Bench |
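For reference, the rotate-half formulation of rotary position embeddings used by most Hugging Face models. A minimal sketch; the cos/sin tables are assumed precomputed per position and broadcastable to q and k:

```python
import torch

def rotate_half(x):
    # Split the head dimension in two and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: (..., seq, head_dim); cos, sin broadcastable to the same shape
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```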
LAYER NORMALIZATION
| Implementation | Description | Source | HF | Bench |
|---|---|---|---|---|
| HF Kernels Layer Norm | HuggingFace kernels implementation | GitHub | HF | Bench |
| PyTorch Layer Norm | PyTorch native implementation | - | - | Bench |
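The baseline is PyTorch's built-in layer norm; spelling out the math shows what the fused kernels compute. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def layer_norm_ref(x, weight, bias, eps=1e-5):
    # Built-in: normalize over the last dimension, then scale and shift.
    return F.layer_norm(x, (x.shape[-1],), weight, bias, eps)

def layer_norm_manual(x, weight, bias, eps=1e-5):
    # Same computation, written out explicitly.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mu) / torch.sqrt(var + eps) * weight + bias
```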