KERNELS COMMUNITY BENCHMARKS

This report aggregates latency benchmarks for core model components across kernel implementations.
Each section includes:
- A latency visualization
- Links to detailed implementation benchmarks

TABLE OF CONTENTS

- Methodology
- Layer Normalization
- Rotary Position Embeddings
- Flash Attention
- Causal Conv1D
- Activation Functions

METHODOLOGY

Each benchmark is run with the Kernels Benchmarking Framework and follows these principles (see the measurement sketch below):
- A reference implementation (usually PyTorch native) is included for baseline comparison
- Multiple input sizes and batch sizes are tested to reflect real-world usage
- Runs are repeatable via Python virtual environments and documented dependencies
- Results are collected and visualized using standardized scripts


Note: Latency values are measured in milliseconds (ms). Lower values indicate better performance.
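As a rough illustration of the measurement loop described above, the sketch below times a PyTorch-native operation with CUDA events and reports average latency in milliseconds. It is a minimal, assumed setup, not the framework's actual API.

    import torch
    import torch.nn.functional as F

    def time_fn(fn, *args, warmup=10, iters=100):
        # Warm up so lazy initialization and caching do not skew the timing.
        for _ in range(warmup):
            fn(*args)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(*args)
        end.record()
        torch.cuda.synchronize()
        # Average latency per call, in milliseconds (lower is better).
        return start.elapsed_time(end) / iters

    x = torch.randn(8, 2048, 4096, device="cuda", dtype=torch.float16)
    print(f"PyTorch GELU baseline: {time_fn(F.gelu, x):.3f} ms")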

LAYER NORMALIZATION

Figure: Layer Norm Latency

Implementations compared:
- HF Kernels Layer Norm: HuggingFace kernels implementation
- PyTorch Layer Norm: PyTorch native implementation
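For context, the PyTorch-native baseline corresponds to a call like the one below; the shapes are illustrative, not the exact benchmark configuration.

    import torch
    import torch.nn.functional as F

    hidden = 4096
    x = torch.randn(4, 2048, hidden, device="cuda", dtype=torch.float16)
    weight = torch.ones(hidden, device="cuda", dtype=torch.float16)
    bias = torch.zeros(hidden, device="cuda", dtype=torch.float16)

    # Normalize over the last (hidden) dimension, as in transformer blocks.
    out = F.layer_norm(x, normalized_shape=(hidden,), weight=weight, bias=bias, eps=1e-5)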


ROTARY POSITION EMBEDDINGS

Figure: Rotary Position Embeddings Latency

Implementations compared:
- HF Kernels Rotary: HuggingFace kernels implementation
- PyTorch Rotary: PyTorch native implementation
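A minimal sketch of the rotary operation under test, assuming the common half-rotation formulation; the kernels implementations expose their own interfaces.

    import torch

    def apply_rotary(x, cos, sin):
        # x: (batch, seq, heads, head_dim); rotate the two halves of the last dim.
        x1, x2 = x.chunk(2, dim=-1)
        return x * cos + torch.cat((-x2, x1), dim=-1) * sin

    head_dim, seq_len = 128, 2048
    inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (seq, head_dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)                        # (seq, head_dim)
    cos = emb.cos()[None, :, None, :]
    sin = emb.sin()[None, :, None, :]

    q = torch.randn(1, seq_len, 32, head_dim)
    q_rotated = apply_rotary(q, cos, sin)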


FLASH ATTENTION

Figure: Flash Attention Latency

Implementations compared:
- Flash Attention: Flash Attention implementation
- HF Kernels Flash Attention: HuggingFace kernels Flash Attention
- HF Kernels Flash Attention 3: HuggingFace kernels Flash Attention 3
- Memory Efficient Attention: Memory efficient attention implementation
- Sage Attention: Sage attention implementation
- xFormers: xFormers attention implementation
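All of the listed kernels compute (or closely approximate) the same scaled dot-product attention and differ mainly in tiling and memory traffic. As a stand-in for the operation being timed, a PyTorch sketch using torch.nn.functional.scaled_dot_product_attention:

    import torch
    import torch.nn.functional as F

    batch, heads, seq_len, head_dim = 4, 32, 2048, 128
    q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # Causal attention; PyTorch dispatches to a fused backend when one is available.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)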


CAUSAL CONV1D

Figure: Causal Conv1D Latency

Implementations compared:
- HF Kernels Causal Conv1D: HuggingFace kernels implementation
- PyTorch Causal Conv1D: PyTorch native implementation
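A PyTorch-native sketch of the causal 1D convolution being compared, implemented with explicit left padding so no output position sees future inputs; the kernels interface itself may differ.

    import torch
    import torch.nn.functional as F

    batch, channels, seq_len, kernel_size = 4, 1024, 4096, 4
    x = torch.randn(batch, channels, seq_len, device="cuda", dtype=torch.float16)
    weight = torch.randn(channels, 1, kernel_size, device="cuda", dtype=torch.float16)

    # Left-pad by (kernel_size - 1) so the convolution is causal, then run it depthwise.
    x_padded = F.pad(x, (kernel_size - 1, 0))
    out = F.conv1d(x_padded, weight, groups=channels)   # shape: (batch, channels, seq_len)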


ACTIVATION FUNCTIONS

Figure: Activation Latency

Implementations compared:
- HF Kernels SwiGLU: HuggingFace kernels SwiGLU implementation
- PyTorch SwiGLU: PyTorch native SwiGLU implementation
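For context, a PyTorch-native SwiGLU sketch: the fused projection is split into gate and up halves, and SiLU of the gate scales the up half. The exact layout used by the benchmarked kernels may differ.

    import torch
    import torch.nn.functional as F

    hidden = 4096
    x = torch.randn(4, 2048, 2 * hidden, device="cuda", dtype=torch.float16)

    # SwiGLU: silu(gate) * up, halving the last dimension.
    gate, up = x.chunk(2, dim=-1)
    out = F.silu(gate) * up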