SageAttention

thu-ml · Apache-2.0

Inference · 3.3K Stars · 407 Forks · 92 views

Quantized attention that achieves a 2-5x speedup over FlashAttention without degrading end-to-end metrics across language, image, and video models. Accepted at ICLR 2025, ICML 2025, and NeurIPS 2025 (Spotlight).

Key Features

INT8 quantization for QK^T computations and FP8/FP16 for PV operations, achieving 2-5x speedup over FlashAttention
Three iterative versions: SageAttention v1 (baseline 8-bit), v2 (per-thread INT4 + outlier smoothing), and v3 (microscaling FP4 attention)
Optimized kernels across Hopper (H100/H800/H20), Ada/Ampere (RTX4090/RTX3090/A100/A800/L40/L20), and Blackwell (RTX5090) GPUs
On RTX5090, reaches 560 TFLOPS, a 2.7x speedup over FlashAttention2 in benchmarks
Maintains end-to-end accuracy across language models, image models (Flux/SD), and video models (CogVideoX, etc.)
Drop-in replacement API compatible with PyTorch attention call sites, so integration requires minimal code changes
Three-paper academic backing: ICLR 2025, ICML 2025, and NeurIPS 2025 Spotlight
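
To make the INT8 QK^T idea from the feature list concrete, here is a minimal pure-Python sketch of symmetric per-row INT8 quantization applied to attention scores. It is an illustration under stated assumptions, not SageAttention's implementation: the project's kernels run on GPU tensor cores and use finer-grained scaling and outlier smoothing, none of which is reproduced here. The function names (quantize_rows_int8, scores_int8, and so on) are hypothetical and exist only for this example.

```python
import math
import random


def quantize_rows_int8(mat):
    """Symmetric per-row quantization: each row gets its own scale so that
    its largest magnitude maps to 127. Returns (integer rows, scales)."""
    q_rows, scales = [], []
    for row in mat:
        max_abs = max(abs(x) for x in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def scores_fp(Q, K):
    """Reference Q K^T in floating point."""
    return [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]


def scores_int8(Q, K):
    """Q K^T with integer multiply-accumulate, then dequantization by the
    product of the two per-row scales."""
    Qi, q_scales = quantize_rows_int8(Q)
    Ki, k_scales = quantize_rows_int8(K)
    out = []
    for qi, sq in zip(Qi, q_scales):
        row = []
        for ki, sk in zip(Ki, k_scales):
            acc = sum(a * b for a, b in zip(qi, ki))  # exact integer sum
            row.append(acc * sq * sk)                 # back to real scale
        out.append(row)
    return out


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
head_dim = 64  # assumed head dimension for the demo
Q = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(4)]
K = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(6)]

sm_scale = 1.0 / math.sqrt(head_dim)
ref = [softmax([s * sm_scale for s in row]) for row in scores_fp(Q, K)]
approx = [softmax([s * sm_scale for s in row]) for row in scores_int8(Q, K)]

max_err = max(abs(a - b) for ra, rb in zip(ref, approx) for a, b in zip(ra, rb))
print(f"max abs difference in attention weights: {max_err:.5f}")
```

The integer accumulation is what maps onto fast INT8 hardware paths; the per-row scales are applied once per output element afterwards. The sketch only shows that attention weights stay close to the floating-point reference on random data and makes no claim about speed.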

Related Projects

Ollama

llama.cpp

vLLM

Unsloth