KERNELS COMMUNITY BENCHMARKS

This report aggregates latency and performance benchmarks across core model components.
Each section includes:
- A latency visualization
- Links to detailed implementation benchmarks

TABLE OF CONTENTS

- Run Yourself
- Methodology
- Benchmarks
  - Activation Functions
  - Flash Attention
  - Deformable DETR
  - OpenAI-Style MoE
  - Causal Conv1D
  - Rotary Position Embeddings
  - Layer Normalization

RUN YOURSELF

To run the benchmarks locally, clone the repository and use uvx to build and run them:

Note: the benchmarks are designed to run on a machine with a compatible NVIDIA GPU and CUDA installed; other hardware may not work as expected.


git clone https://github.com/huggingface/kernels-benchmarks.git
cd kernels-benchmarks
uvx https://github.com/drbh/uvnote.git build benches

METHODOLOGY

Each benchmark is run with the Kernels Benchmarking Framework and follows these principles:
- a reference implementation (usually PyTorch native) is included for baseline comparison
- multiple input sizes and batch sizes are tested to reflect real-world usage
- runs are repeatable via Python virtual environments and documented dependencies
- results are collected and visualized using standardized scripts
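
For orientation, the sketch below shows the general measurement pattern used for entries in this report: warm-up iterations followed by averaged CUDA-event timing of a candidate callable against a PyTorch-native baseline. It is illustrative only; the time_ms helper and the GELU workload are placeholders, not the framework's actual API.

# Minimal sketch of the measurement pattern (not the framework's actual API).
import torch

def time_ms(fn, *args, warmup=10, iters=100):
    # Warm up to exclude compilation / cache effects, then time with CUDA events.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average latency in milliseconds

x = torch.randn(8, 4096, 4096, device="cuda", dtype=torch.float16)
baseline_ms = time_ms(torch.nn.functional.gelu, x)  # PyTorch-native baseline
print(f"baseline: {baseline_ms:.3f} ms")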


BENCHMARKS

Note: Latency values are measured in milliseconds (ms). Lower values indicate better performance.

ACTIVATION FUNCTIONS

Activation Latency
Implementation     | Description                                | Source | HF | Bench
HF Kernels SwiGLU  | HuggingFace kernels SwiGLU implementation  | GitHub | HF | Bench
PyTorch SwiGLU     | PyTorch native SwiGLU implementation       | -      | -  | Bench
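
For reference, the PyTorch-native SwiGLU baseline is essentially the following sketch; the gate/up split convention (first half gates the second half) is an assumption and may differ between implementations.

# Minimal PyTorch-native SwiGLU sketch: silu(gate) * up over a split hidden dim.
import torch
import torch.nn.functional as F

def swiglu(x: torch.Tensor) -> torch.Tensor:
    gate, up = x.chunk(2, dim=-1)   # assumed split: first half = gate, second half = up
    return F.silu(gate) * up

x = torch.randn(4, 1024, 2 * 4096, device="cuda", dtype=torch.float16)
y = swiglu(x)                        # shape (4, 1024, 4096)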


FLASH ATTENTION

Flash Attention Latency
Implementation                | Description                                | Source | HF | Bench
Flash Attention               | Torch SDPA Flash Attention implementation  | -      | -  | Bench
HF Kernels Flash Attention 2  | HuggingFace kernels Flash Attention        | GitHub | HF | Bench
HF Kernels Flash Attention 3  | HuggingFace kernels Flash Attention 3      | GitHub | HF | Bench
Memory Efficient Attention    | Memory efficient attention implementation  | -      |    | Bench
Sage Attention                | Sage attention implementation              |        | HF | Bench
xFormers                      | xFormers attention implementation          | GitHub | -  | Bench
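
The Torch SDPA row corresponds to PyTorch's built-in scaled_dot_product_attention, which dispatches to a Flash Attention kernel when the inputs allow it. A minimal sketch is shown below; the shapes are illustrative, and forcing the flash backend via torch.nn.attention.sdpa_kernel requires a recent PyTorch release.

# Illustrative SDPA call; shapes and dtype chosen so the flash backend is eligible.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(2, 16, 1024, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):       # restrict dispatch to the flash backend
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)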


DEFORMABLE DETR

Deformable DETR Latency
Implementation              | Description                                         | Source | HF | Bench
HF Kernels Deformable DETR  | HuggingFace kernels Deformable DETR implementation  | GitHub | HF | Bench
PyTorch Deformable DETR     | PyTorch native Deformable DETR implementation       | -      | -  | Bench
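
As a rough illustration of what this op computes, the sketch below implements single-scale deformable-attention sampling with F.grid_sample. The benchmarked implementations are multi-scale and handle per-level shapes and offsets, so treat this as a simplified sketch under assumed tensor layouts, not the reference code.

# Simplified single-scale deformable attention: bilinearly sample the value map at
# predicted locations and combine the samples with learned attention weights.
import torch
import torch.nn.functional as F

def deformable_attn_single_scale(value, spatial_shape, sampling_locations, attention_weights):
    # value: (N, H*W, M, D) flattened feature map with M heads of dim D
    # sampling_locations: (N, Lq, M, P, 2) normalized (x, y) in [0, 1]
    # attention_weights: (N, Lq, M, P), softmaxed over the P sampling points
    N, _, M, D = value.shape
    H, W = spatial_shape
    Lq, P = sampling_locations.shape[1], sampling_locations.shape[3]
    v = value.permute(0, 2, 3, 1).reshape(N * M, D, H, W)
    grid = 2 * sampling_locations - 1                        # grid_sample expects [-1, 1]
    grid = grid.permute(0, 2, 1, 3, 4).reshape(N * M, Lq, P, 2)
    sampled = F.grid_sample(v, grid, mode="bilinear", padding_mode="zeros", align_corners=False)
    w = attention_weights.permute(0, 2, 1, 3).reshape(N * M, 1, Lq, P)
    out = (sampled * w).sum(-1)                               # (N*M, D, Lq)
    return out.reshape(N, M * D, Lq).permute(0, 2, 1)         # (N, Lq, M*D)

v = torch.randn(1, 32 * 32, 8, 32)
loc = torch.rand(1, 100, 8, 4, 2)
attn = torch.softmax(torch.randn(1, 100, 8, 4), dim=-1)
out = deformable_attn_single_scale(v, (32, 32), loc, attn)    # (1, 100, 256)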


OPENAI-STYLE MOE

OpenAI MoE Latency
Implementation  | Description                                     | Source | HF | Bench
GptOssExperts   | GPT OSS reference OpenAI-style MoE              |        |    | Bench
Binned PyTorch  | Binned PyTorch OpenAI-style MoE implementation  | -      | -  | Bench
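
The sketch below illustrates an OpenAI-style top-k routed MoE forward pass in plain PyTorch, looping over experts. The GELU expert MLP, the tensor shapes, and the top_k value are illustrative assumptions, not the GptOssExperts or binned implementations.

# Minimal top-k routed MoE sketch: route each token to its top-k experts,
# run each expert on its assigned tokens, and combine weighted outputs.
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, w1, w2, top_k=2):
    # x: (tokens, d_model); router_w: (d_model, n_experts)
    # w1: (n_experts, d_model, d_ff); w2: (n_experts, d_ff, d_model)
    logits = x @ router_w                                    # (tokens, n_experts)
    weights, experts = torch.topk(logits.softmax(-1), top_k, dim=-1)
    weights = weights / weights.sum(-1, keepdim=True)        # renormalize top-k weights
    out = torch.zeros_like(x)
    for e in range(router_w.shape[1]):
        token_idx, slot = (experts == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        h = F.gelu(x[token_idx] @ w1[e]) @ w2[e]             # illustrative expert MLP
        out.index_add_(0, token_idx, h * weights[token_idx, slot, None])
    return out

x = torch.randn(256, 512)
out = moe_forward(x, torch.randn(512, 8), torch.randn(8, 512, 2048), torch.randn(8, 2048, 512))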


CAUSAL CONV1D

Causal Conv1D Latency
Implementation            | Description                          | Source | HF | Bench
HF Kernels Causal Conv1D  | HuggingFace kernels implementation   | GitHub | HF | Bench
PyTorch Causal Conv1D     | PyTorch native implementation        | -      | -  | Bench
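
A PyTorch-native causal conv1d can be sketched as a depthwise convolution with left-only padding, so each output position sees only current and past timesteps; the depthwise layout and shapes below are assumptions based on common usage of this kernel.

# Causal (left-padded) depthwise conv1d sketch in plain PyTorch.
import torch
import torch.nn.functional as F

def causal_conv1d(x, weight, bias=None):
    # x: (batch, channels, seq_len); weight: (channels, 1, kernel_size) for a depthwise conv
    k = weight.shape[-1]
    x = F.pad(x, (k - 1, 0))                   # pad on the left only, so the conv is causal
    return F.conv1d(x, weight, bias, groups=weight.shape[0])

x = torch.randn(2, 512, 1024)
w = torch.randn(512, 1, 4)
y = causal_conv1d(x, w)                        # (2, 512, 1024)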


ROTARY POSITION EMBEDDINGS

Rotary Position Embeddings Latency
Implementation     | Description                          | Source | HF | Bench
HF Kernels Rotary  | HuggingFace kernels implementation   | GitHub | HF | Bench
PyTorch Rotary     | PyTorch native implementation        | -      | -  | Bench
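
A minimal rotary position embedding (RoPE) sketch in plain PyTorch is shown below; the half-split rotation convention and the base of 10000 are assumptions and may differ from the benchmarked implementations.

# RoPE sketch: rotate pairs of features by position-dependent angles.
import torch

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq, head_dim); cos/sin: (seq, head_dim // 2)
    x1, x2 = x.chunk(2, dim=-1)                # assumed convention: split head dim in half
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

seq, dim = 1024, 64
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
angles = torch.outer(torch.arange(seq).float(), inv_freq)   # (seq, dim // 2)
q = torch.randn(2, 16, seq, dim)
q_rot = apply_rope(q, angles.cos(), angles.sin())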


LAYER NORMALIZATION

Layer Norm Latency
Implementation         | Description                          | Source | HF | Bench
HF Kernels Layer Norm  | HuggingFace kernels implementation   | GitHub | HF | Bench
PyTorch Layer Norm     | PyTorch native implementation        | -      | -  | Bench
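
The PyTorch-native baseline here is essentially F.layer_norm over the last dimension; the sketch below shows the call and the equivalent explicit computation. Shapes and the eps value are illustrative.

# Layer norm baseline: F.layer_norm, plus the same computation written out explicitly.
import torch
import torch.nn.functional as F

x = torch.randn(4, 1024, 4096, device="cuda", dtype=torch.float16)
weight = torch.ones(4096, device="cuda", dtype=torch.float16)
bias = torch.zeros(4096, device="cuda", dtype=torch.float16)

y = F.layer_norm(x, (4096,), weight, bias, eps=1e-5)

# Explicit equivalent, computed in float32 for numerical stability.
xf = x.float()
mean = xf.mean(-1, keepdim=True)
var = xf.var(-1, unbiased=False, keepdim=True)
y_manual = ((xf - mean) / torch.sqrt(var + 1e-5) * weight.float() + bias.float()).to(x.dtype)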