KERNELS COMMUNITY BENCHMARKS

This report aggregates latency and performance benchmarks across core model components.
Each section includes:
- A latency visualization
- Links to detailed implementation benchmarks

TABLE OF CONTENTS

- Run Yourself
- Methodology
- Benchmarks
  - Activation Functions
  - Flash Attention
  - Deformable DETR
  - OpenAI-Style MoE
  - Causal Conv1D
  - Rotary Position Embeddings
  - Layer Normalization

RUN YOURSELF

To run the benchmarks locally, clone the repository and use uvx to build and run them:

Note: the benchmarks are designed to run on a machine with a compatible NVIDIA GPU and CUDA installed; other hardware may not work as expected.


git clone https://github.com/huggingface/kernels-benchmarks.git
cd kernels-benchmarks
uvx https://github.com/drbh/uvnote.git build benches

METHODOLOGY

Each benchmark is run with the Kernels Benchmarking Framework and follows these principles:
- a reference implementation (usually PyTorch native) is included for baseline comparison
- multiple input sizes and batch sizes are tested to reflect real-world usage
- runs are repeatable via Python virtual environments and documented dependencies
- results are collected and visualized using standardized scripts
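
For orientation, the sketch below shows the general measurement pattern used for entries in this report: warm-up iterations followed by averaged CUDA-event timing of a candidate callable against a PyTorch-native baseline. It is illustrative only; the time_ms helper and the GELU workload are placeholders, not the framework's actual API.

# Minimal sketch of the measurement pattern (not the framework's actual API).
import torch

def time_ms(fn, *args, warmup=10, iters=100):
    # Warm up to exclude compilation / cache effects, then time with CUDA events.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average latency in milliseconds

x = torch.randn(8, 4096, 4096, device="cuda", dtype=torch.float16)
baseline_ms = time_ms(torch.nn.functional.gelu, x)  # PyTorch-native baseline
print(f"baseline: {baseline_ms:.3f} ms")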


BENCHMARKS

Note: Latency values are measured in milliseconds (ms). Lower values indicate better performance.

ACTIVATION FUNCTIONS

Activation Latency
Implementation     | Description                                | Source | HF | Bench
HF Kernels SwiGLU  | HuggingFace kernels SwiGLU implementation  | GitHub | HF | Bench
PyTorch SwiGLU     | PyTorch native SwiGLU implementation       | -      | -  | Bench
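
For reference, the PyTorch-native SwiGLU baseline is essentially the following sketch; the gate/up split convention (first half gates the second half) is an assumption and may differ between implementations.

# Minimal PyTorch-native SwiGLU sketch: silu(gate) * up over a split hidden dim.
import torch
import torch.nn.functional as F

def swiglu(x: torch.Tensor) -> torch.Tensor:
    gate, up = x.chunk(2, dim=-1)   # assumed split: first half = gate, second half = up
    return F.silu(gate) * up

x = torch.randn(4, 1024, 2 * 4096, device="cuda", dtype=torch.float16)
y = swiglu(x)                        # shape (4, 1024, 4096)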


FLASH ATTENTION

Flash Attention Latency
Implementation                | Description                                | Source | HF | Bench
Flash Attention               | Torch SDPA Flash Attention implementation  | -      | -  | Bench
HF Kernels Flash Attention 2  | HuggingFace kernels Flash Attention        | GitHub | HF | Bench
HF Kernels Flash Attention 3  | HuggingFace kernels Flash Attention 3      | GitHub | HF | Bench
Memory Efficient Attention    | Memory efficient attention implementation  | -      |    | Bench
Sage Attention                | Sage attention implementation              |        | HF | Bench
xFormers                      | xFormers attention implementation          | GitHub | -  | Bench
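
The Torch SDPA row corresponds to PyTorch's built-in scaled_dot_product_attention, which dispatches to a Flash Attention kernel when the inputs allow it. A minimal sketch is shown below; the shapes are illustrative, and forcing the flash backend via torch.nn.attention.sdpa_kernel requires a recent PyTorch release.

# Illustrative SDPA call; shapes and dtype chosen so the flash backend is eligible.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(2, 16, 1024, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):       # restrict dispatch to the flash backend
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)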


DEFORMABLE DETR

Deformable DETR Latency
Implementation              | Description                                         | Source | HF | Bench
HF Kernels Deformable DETR  | HuggingFace kernels Deformable DETR implementation  | GitHub | HF | Bench
PyTorch Deformable DETR     | PyTorch native Deformable DETR implementation       | -      | -  | Bench
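
As a rough illustration of what this op computes, the sketch below implements single-scale deformable-attention sampling with F.grid_sample. The benchmarked implementations are multi-scale and handle per-level shapes and offsets, so treat this as a simplified sketch under assumed tensor layouts, not the reference code.

# Simplified single-scale deformable attention: bilinearly sample the value map at
# predicted locations and combine the samples with learned attention weights.
import torch
import torch.nn.functional as F

def deformable_attn_single_scale(value, spatial_shape, sampling_locations, attention_weights):
    # value: (N, H*W, M, D) flattened feature map with M heads of dim D
    # sampling_locations: (N, Lq, M, P, 2) normalized (x, y) in [0, 1]
    # attention_weights: (N, Lq, M, P), softmaxed over the P sampling points
    N, _, M, D = value.shape
    H, W = spatial_shape
    Lq, P = sampling_locations.shape[1], sampling_locations.shape[3]
    v = value.permute(0, 2, 3, 1).reshape(N * M, D, H, W)
    grid = 2 * sampling_locations - 1                        # grid_sample expects [-1, 1]
    grid = grid.permute(0, 2, 1, 3, 4).reshape(N * M, Lq, P, 2)
    sampled = F.grid_sample(v, grid, mode="bilinear", padding_mode="zeros", align_corners=False)
    w = attention_weights.permute(0, 2, 1, 3).reshape(N * M, 1, Lq, P)
    out = (sampled * w).sum(-1)                               # (N*M, D, Lq)
    return out.reshape(N, M * D, Lq).permute(0, 2, 1)         # (N, Lq, M*D)

v = torch.randn(1, 32 * 32, 8, 32)
loc = torch.rand(1, 100, 8, 4, 2)
attn = torch.softmax(torch.randn(1, 100, 8, 4), dim=-1)
out = deformable_attn_single_scale(v, (32, 32), loc, attn)    # (1, 100, 256)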


OPENAI-STYLE MOE

OpenAI MoE Latency
Implementation  | Description                                     | Source | HF | Bench
GptOssExperts   | GPT OSS reference OpenAI-style MoE              |        |    | Bench
Binned PyTorch  | Binned PyTorch OpenAI-style MoE implementation  | -      | -  | Bench
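
The sketch below illustrates an OpenAI-style top-k routed MoE forward pass in plain PyTorch, looping over experts. The GELU expert MLP, the tensor shapes, and the top_k value are illustrative assumptions, not the GptOssExperts or binned implementations.

# Minimal top-k routed MoE sketch: route each token to its top-k experts,
# run each expert on its assigned tokens, and combine weighted outputs.
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, w1, w2, top_k=2):
    # x: (tokens, d_model); router_w: (d_model, n_experts)
    # w1: (n_experts, d_model, d_ff); w2: (n_experts, d_ff, d_model)
    logits = x @ router_w                                    # (tokens, n_experts)
    weights, experts = torch.topk(logits.softmax(-1), top_k, dim=-1)
    weights = weights / weights.sum(-1, keepdim=True)        # renormalize top-k weights
    out = torch.zeros_like(x)
    for e in range(router_w.shape[1]):
        token_idx, slot = (experts == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        h = F.gelu(x[token_idx] @ w1[e]) @ w2[e]             # illustrative expert MLP
        out.index_add_(0, token_idx, h * weights[token_idx, slot, None])
    return out

x = torch.randn(256, 512)
out = moe_forward(x, torch.randn(512, 8), torch.randn(8, 512, 2048), torch.randn(8, 2048, 512))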


CAUSAL CONV1D

Causal Conv1D Latency
Implementation            | Description                          | Source | HF | Bench
HF Kernels Causal Conv1D  | HuggingFace kernels implementation   | GitHub | HF | Bench
PyTorch Causal Conv1D     | PyTorch native implementation        | -      | -  | Bench
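
A PyTorch-native causal conv1d can be sketched as a depthwise convolution with left-only padding, so each output position sees only current and past timesteps; the depthwise layout and shapes below are assumptions based on common usage of this kernel.

# Causal (left-padded) depthwise conv1d sketch in plain PyTorch.
import torch
import torch.nn.functional as F

def causal_conv1d(x, weight, bias=None):
    # x: (batch, channels, seq_len); weight: (channels, 1, kernel_size) for a depthwise conv
    k = weight.shape[-1]
    x = F.pad(x, (k - 1, 0))                   # pad on the left only, so the conv is causal
    return F.conv1d(x, weight, bias, groups=weight.shape[0])

x = torch.randn(2, 512, 1024)
w = torch.randn(512, 1, 4)
y = causal_conv1d(x, w)                        # (2, 512, 1024)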


ROTARY POSITION EMBEDDINGS

Rotary Position Embeddings Latency
Implementation     | Description                          | Source | HF | Bench
HF Kernels Rotary  | HuggingFace kernels implementation   | GitHub | HF | Bench
PyTorch Rotary     | PyTorch native implementation        | -      | -  | Bench
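
A minimal rotary position embedding (RoPE) sketch in plain PyTorch is shown below; the half-split rotation convention and the base of 10000 are assumptions and may differ from the benchmarked implementations.

# RoPE sketch: rotate pairs of features by position-dependent angles.
import torch

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq, head_dim); cos/sin: (seq, head_dim // 2)
    x1, x2 = x.chunk(2, dim=-1)                # assumed convention: split head dim in half
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

seq, dim = 1024, 64
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
angles = torch.outer(torch.arange(seq).float(), inv_freq)   # (seq, dim // 2)
q = torch.randn(2, 16, seq, dim)
q_rot = apply_rope(q, angles.cos(), angles.sin())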


LAYER NORMALIZATION

Layer Norm Latency
Implementation         | Description                          | Source | HF | Bench
HF Kernels Layer Norm  | HuggingFace kernels implementation   | GitHub | HF | Bench
PyTorch Layer Norm     | PyTorch native implementation        | -      | -  | Bench
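
The PyTorch-native baseline here is essentially F.layer_norm over the last dimension; the sketch below shows the call and the equivalent explicit computation. Shapes and the eps value are illustrative.

# Layer norm baseline: F.layer_norm, plus the same computation written out explicitly.
import torch
import torch.nn.functional as F

x = torch.randn(4, 1024, 4096, device="cuda", dtype=torch.float16)
weight = torch.ones(4096, device="cuda", dtype=torch.float16)
bias = torch.zeros(4096, device="cuda", dtype=torch.float16)

y = F.layer_norm(x, (4096,), weight, bias, eps=1e-5)

# Explicit equivalent, computed in float32 for numerical stability.
xf = x.float()
mean = xf.mean(-1, keepdim=True)
var = xf.var(-1, unbiased=False, keepdim=True)
y_manual = ((xf - mean) / torch.sqrt(var + 1e-5) * weight.float() + bias.float()).to(x.dtype)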