KERNELS COMMUNITY BENCHMARKS

This report aggregates latency benchmarks for core model components across kernel implementations.
Each section includes:
- A latency visualization
- Links to detailed implementation benchmarks

TABLE OF CONTENTS

- Methodology
- Layer Normalization
- Rotary Position Embeddings
- Flash Attention
- Causal Conv1D
- Activation Functions

METHODOLOGY

Each benchmark is run with the Kernels Benchmarking Framework and follows these principles (see the measurement sketch below):
- A reference implementation (usually PyTorch native) is included for baseline comparison
- Multiple input sizes and batch sizes are tested to reflect real-world usage
- Runs are repeatable via Python virtual environments and documented dependencies
- Results are collected and visualized using standardized scripts


Note: Latency values are measured in milliseconds (ms). Lower values indicate better performance.
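As a rough illustration of the measurement loop described above, the sketch below times a PyTorch-native operation with CUDA events and reports average latency in milliseconds. It is a minimal, assumed setup, not the framework's actual API.

    import torch
    import torch.nn.functional as F

    def time_fn(fn, *args, warmup=10, iters=100):
        # Warm up so lazy initialization and caching do not skew the timing.
        for _ in range(warmup):
            fn(*args)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(*args)
        end.record()
        torch.cuda.synchronize()
        # Average latency per call, in milliseconds (lower is better).
        return start.elapsed_time(end) / iters

    x = torch.randn(8, 2048, 4096, device="cuda", dtype=torch.float16)
    print(f"PyTorch GELU baseline: {time_fn(F.gelu, x):.3f} ms")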

LAYER NORMALIZATION

Figure: Layer Norm Latency

Implementations compared:
- HF Kernels Layer Norm: HuggingFace kernels implementation
- PyTorch Layer Norm: PyTorch native implementation
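For context, the PyTorch-native baseline corresponds to a call like the one below; the shapes are illustrative, not the exact benchmark configuration.

    import torch
    import torch.nn.functional as F

    hidden = 4096
    x = torch.randn(4, 2048, hidden, device="cuda", dtype=torch.float16)
    weight = torch.ones(hidden, device="cuda", dtype=torch.float16)
    bias = torch.zeros(hidden, device="cuda", dtype=torch.float16)

    # Normalize over the last (hidden) dimension, as in transformer blocks.
    out = F.layer_norm(x, normalized_shape=(hidden,), weight=weight, bias=bias, eps=1e-5)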


ROTARY POSITION EMBEDDINGS

Figure: Rotary Position Embeddings Latency

Implementations compared:
- HF Kernels Rotary: HuggingFace kernels implementation
- PyTorch Rotary: PyTorch native implementation
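A minimal sketch of the rotary operation under test, assuming the common half-rotation formulation; the kernels implementations expose their own interfaces.

    import torch

    def apply_rotary(x, cos, sin):
        # x: (batch, seq, heads, head_dim); rotate the two halves of the last dim.
        x1, x2 = x.chunk(2, dim=-1)
        return x * cos + torch.cat((-x2, x1), dim=-1) * sin

    head_dim, seq_len = 128, 2048
    inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (seq, head_dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)                        # (seq, head_dim)
    cos = emb.cos()[None, :, None, :]
    sin = emb.sin()[None, :, None, :]

    q = torch.randn(1, seq_len, 32, head_dim)
    q_rotated = apply_rotary(q, cos, sin)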


FLASH ATTENTION

Figure: Flash Attention Latency

Implementations compared:
- Flash Attention: Flash Attention implementation
- HF Kernels Flash Attention: HuggingFace kernels Flash Attention
- HF Kernels Flash Attention 3: HuggingFace kernels Flash Attention 3
- Memory Efficient Attention: Memory efficient attention implementation
- Sage Attention: Sage attention implementation
- xFormers: xFormers attention implementation
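All of the listed kernels compute (or closely approximate) the same scaled dot-product attention and differ mainly in tiling and memory traffic. As a stand-in for the operation being timed, a PyTorch sketch using torch.nn.functional.scaled_dot_product_attention:

    import torch
    import torch.nn.functional as F

    batch, heads, seq_len, head_dim = 4, 32, 2048, 128
    q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # Causal attention; PyTorch dispatches to a fused backend when one is available.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)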


CAUSAL CONV1D

Figure: Causal Conv1D Latency

Implementations compared:
- HF Kernels Causal Conv1D: HuggingFace kernels implementation
- PyTorch Causal Conv1D: PyTorch native implementation
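A PyTorch-native sketch of the causal 1D convolution being compared, implemented with explicit left padding so no output position sees future inputs; the kernels interface itself may differ.

    import torch
    import torch.nn.functional as F

    batch, channels, seq_len, kernel_size = 4, 1024, 4096, 4
    x = torch.randn(batch, channels, seq_len, device="cuda", dtype=torch.float16)
    weight = torch.randn(channels, 1, kernel_size, device="cuda", dtype=torch.float16)

    # Left-pad by (kernel_size - 1) so the convolution is causal, then run it depthwise.
    x_padded = F.pad(x, (kernel_size - 1, 0))
    out = F.conv1d(x_padded, weight, groups=channels)   # shape: (batch, channels, seq_len)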


ACTIVATION FUNCTIONS

Figure: Activation Latency

Implementations compared:
- HF Kernels SwiGLU: HuggingFace kernels SwiGLU implementation
- PyTorch SwiGLU: PyTorch native SwiGLU implementation
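For context, a PyTorch-native SwiGLU sketch: the fused projection is split into gate and up halves, and SiLU of the gate scales the up half. The exact layout used by the benchmarked kernels may differ.

    import torch
    import torch.nn.functional as F

    hidden = 4096
    x = torch.randn(4, 2048, 2 * hidden, device="cuda", dtype=torch.float16)

    # SwiGLU: silu(gate) * up, halving the last dimension.
    gate, up = x.chunk(2, dim=-1)
    out = F.silu(gate) * up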