KERNELS COMMUNITY BENCHMARKS
This report aggregates latency benchmarks across core model components, comparing community kernel implementations against native baselines.
Each section includes:
- A latency visualization
- Links to detailed implementation benchmarks
TABLE OF CONTENTS
- METHODOLOGY
- LAYER NORMALIZATION
- ROTARY POSITION EMBEDDINGS
- FLASH ATTENTION
- CAUSAL CONV1D
- ACTIVATION FUNCTIONS
- NOTES
METHODOLOGY
Each benchmark is run with the Kernels Benchmarking Framework and follows these principles:
- A reference implementation (usually PyTorch native) is included as a baseline for comparison
- Multiple input sizes and batch sizes are tested to reflect real-world usage
- Runs are reproducible via Python virtual environments and documented dependencies
- Results are collected and visualized with standardized scripts
Note: Latency values are measured in milliseconds (ms). Lower values indicate better performance.
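As a rough illustration of the measurement procedure described above, the sketch below times an operation over several warmup and measurement iterations and reports the median latency in milliseconds. The helper name `measure_latency_ms` and the sizes used are placeholders, not part of the framework's actual API.

```python
import time
import torch

def measure_latency_ms(fn, *args, warmup=10, iters=50):
    """Return the median latency of fn(*args) in milliseconds."""
    for _ in range(warmup):              # warmup: caches, autotuning, clocks
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        if torch.cuda.is_available():
            torch.cuda.synchronize()     # wait for the kernel to finish
        times.append((time.perf_counter() - start) * 1000.0)
    times.sort()
    return times[len(times) // 2]

# Hypothetical usage: sweep several input sizes, as described in the methodology.
device = "cuda" if torch.cuda.is_available() else "cpu"
for batch, seq in [(1, 512), (8, 2048), (32, 4096)]:
    x = torch.randn(batch, seq, 4096, device=device)
    ref_ms = measure_latency_ms(torch.nn.functional.gelu, x)  # stand-in reference op
    print(f"batch={batch} seq={seq}: reference {ref_ms:.3f} ms")
```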
LAYER NORMALIZATION
| Implementation | Description |
|---|---|
| HF Kernels Layer Norm | HuggingFace kernels implementation |
| PyTorch Layer Norm | PyTorch native implementation |
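For reference, the PyTorch-native baseline in this comparison typically reduces to a single call to `torch.nn.functional.layer_norm`; the shapes below are illustrative only.

```python
import torch
import torch.nn.functional as F

batch, seq, hidden = 8, 2048, 4096               # illustrative sizes
x = torch.randn(batch, seq, hidden)
weight = torch.ones(hidden)
bias = torch.zeros(hidden)

# PyTorch native layer norm: normalize over the last (hidden) dimension.
out = F.layer_norm(x, normalized_shape=(hidden,), weight=weight, bias=bias, eps=1e-5)
```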
ROTARY POSITION EMBEDDINGS
| Implementation | Description |
|---|---|
| HF Kernels Rotary | HuggingFace kernels implementation |
| PyTorch Rotary | PyTorch native implementation |
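PyTorch has no built-in rotary embedding, so the native baseline is usually a hand-written rotation along the lines of the sketch below. This uses the common half-rotation formulation; the benchmarked reference may differ in layout or detail.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (batch, seq, heads, head_dim)."""
    batch, seq, heads, head_dim = x.shape
    # Per-pair rotation frequencies and per-position angles.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq, dtype=torch.float32), inv_freq)  # (seq, head_dim // 2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2 :]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 128, 8, 64)   # illustrative sizes
q_rot = rotary_embed(q)
```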
FLASH ATTENTION
| Implementation | Description |
|---|---|
| Flash Attention | Flash Attention implementation |
| HF Kernels Flash Attention | HuggingFace kernels Flash Attention |
| HF Kernels Flash Attention 3 | HuggingFace kernels Flash Attention 3 |
| Memory Efficient Attention | Memory efficient attention implementation |
| Sage Attention | Sage attention implementation |
| xFormers | xFormers attention implementation |
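The implementations above are drop-in variants of scaled dot-product attention. As a point of reference, the PyTorch-native path is `torch.nn.functional.scaled_dot_product_attention`, which itself dispatches to fused backends where available; a minimal causal call looks like this (shapes are illustrative).

```python
import torch
import torch.nn.functional as F

batch, heads, seq, head_dim = 8, 32, 2048, 64    # illustrative sizes
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# Causal scaled dot-product attention; PyTorch selects a backend
# (flash / memory-efficient / math) based on device, dtype, and shapes.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```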
CAUSAL CONV1D
| Implementation | Description |
|---|---|
| HF Kernels Causal Conv1D | HuggingFace kernels implementation |
| PyTorch Causal Conv1D | PyTorch native implementation |
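A PyTorch-native baseline for causal 1D convolution typically left-pads the sequence by `kernel_size - 1` so that no position sees future inputs. The sketch below assumes a depthwise configuration and illustrative sizes; the benchmarked reference may differ.

```python
import torch
import torch.nn.functional as F

batch, channels, width = 8, 4096, 2048   # illustrative sizes
kernel_size = 4
x = torch.randn(batch, channels, width)
# Depthwise weights: one filter per channel (groups=channels), a common configuration.
weight = torch.randn(channels, 1, kernel_size)

# Left-pad by kernel_size - 1 so output[t] depends only on inputs <= t (causal).
x_padded = F.pad(x, (kernel_size - 1, 0))
out = F.conv1d(x_padded, weight, groups=channels)
assert out.shape == x.shape
```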
ACTIVATION FUNCTIONS
| Implementation | Description |
|---|---|
| HF Kernels SwiGLU | HuggingFace kernels SwiGLU implementation |
| PyTorch SwiGLU | PyTorch native SwiGLU implementation |
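For reference, a PyTorch-native SwiGLU applies SiLU to a gate projection and multiplies element-wise with an up projection; a fused kernel typically computes the same result in one pass. A minimal sketch, with assumed projection sizes:

```python
import torch
import torch.nn.functional as F

batch, seq, hidden = 8, 2048, 4096        # illustrative sizes
x = torch.randn(batch, seq, 2 * hidden)   # concatenated gate and up projections

# SwiGLU: silu(gate) * up, element-wise.
gate, up = x.chunk(2, dim=-1)
out = F.silu(gate) * up
```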