Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
FlashAttention is the open-source library from Dao-AILab that makes transformer attention fast and memory-efficient without any approximation. With 22.7k GitHub stars, 2.5k forks, and the freshly released FlashAttention-4 v4.0.0.beta4 on March 5, 2026, it remains the foundational attention kernel powering virtually every major LLM training and inference stack in production today.

## Why FlashAttention Matters

The attention mechanism is the computational bottleneck of every transformer model. Standard attention scales quadratically with sequence length in both time and memory, making long-context LLMs prohibitively expensive. FlashAttention solves this by rethinking how attention is computed at the hardware level, using IO-aware tiling to minimize memory reads and writes between GPU SRAM and HBM. The result is 2-4x faster training and 5-20x memory savings, with mathematically identical outputs to standard attention.

Every major LLM framework (PyTorch, Hugging Face Transformers, vLLM, TensorRT-LLM, DeepSpeed) integrates FlashAttention as its default attention backend. When you run inference on Claude, GPT, Gemini, or Llama, FlashAttention kernels are almost certainly doing the heavy lifting.

## Key Features

### IO-Aware Tiling Algorithm

FlashAttention's core innovation is computing attention in tiles that fit in GPU SRAM, avoiding the costly materialization of the full N×N attention matrix in HBM. By fusing the softmax normalization with the matrix multiplications into a single kernel pass, it eliminates multiple round trips to slow global memory. This approach delivers exact results, not an approximation, while being dramatically faster than naive implementations.

### Multi-Architecture GPU Support

The library supports NVIDIA GPUs from Ampere through Hopper and the new Blackwell architecture (SM80, SM90, SM100), plus AMD MI200/MI250/MI300 GPUs via ROCm with both Composable Kernel and Triton backends.
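The online-softmax tiling described under "IO-Aware Tiling Algorithm" above can be sketched in plain NumPy. This is an illustration of the algorithm only, not the fused CUDA kernel; the function name `tiled_attention`, the tile size, and the shapes are all arbitrary choices for the example. The key point it demonstrates is that the full N×N score matrix is never materialized, yet the result is exact:

```python
import numpy as np

def tiled_attention(q, k, v, tile=128):
    """Exact attention computed over key/value tiles with a running
    (online) softmax, so only one tile of scores exists at a time."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full(q.shape[0], -np.inf)          # running row-wise max
    l = np.zeros(q.shape[0])                  # running softmax denominator
    acc = np.zeros_like(q, dtype=np.float64)  # running weighted-value sum
    for start in range(0, k.shape[0], tile):
        kj = k[start:start + tile]
        vj = v[start:start + tile]
        s = (q @ kj.T) * scale                # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])        # tile-local softmax numerator
        corr = np.exp(m - m_new)              # rescale earlier accumulators
        l = l * corr + p.sum(axis=1)
        acc = acc * corr[:, None] + p @ vj
        m = m_new
    return acc / l[:, None]
```

The rescaling step (`corr`) is what makes this exact rather than approximate: whenever a new tile raises the running maximum, previously accumulated numerators and denominators are corrected to the new reference point, reproducing the result of one global softmax.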
The library handles FP16, BF16, and FP8 datatypes, enabling mixed-precision training and quantized inference across hardware vendors.

### FlashAttention-4 with CuTeDSL

The latest FA4 release brings torch.compile support, ABI-stable builds, and CUDA 13 compatibility. It introduces block sparsity for structured sparse attention patterns, paged attention for efficient KV cache management during inference, variable-length sequence batching, and deterministic modes for reproducible training. The SM100 backward pass optimizations specifically target Blackwell GPUs.

### Advanced Attention Variants

Beyond standard scaled dot-product attention, FlashAttention supports multi-query attention (MQA), grouped-query attention (GQA), causal masking for autoregressive models, sliding-window local attention, ALiBi positional encoding, rotary embeddings, dropout, and custom score modifiers via a flexible `score_mod` API. This covers virtually every attention variant used in modern LLM architectures.

## Technical Architecture

| Component | Details |
|-----------|---------|
| Core Algorithm | IO-aware tiled attention with online softmax |
| Supported GPUs | NVIDIA Ampere/Ada/Hopper/Blackwell, AMD MI200/MI300 |
| Datatypes | FP16, BF16, FP8 |
| CUDA Requirement | 12.0+ (13.0 for FA4 Blackwell) |
| ROCm Requirement | 6.0+ |
| Head Dimensions | Up to 256 |
| Installation | `pip install flash-attn` |
| Versions | FA2 (stable v2.8.3), FA3 (Hopper beta), FA4 (Hopper+Blackwell beta) |

## Version History

FlashAttention has evolved through four major generations. FA1 (2022) introduced the IO-aware tiling concept. FA2 (2023) improved parallelism and added support for head dimensions up to 256. FA3 (2024) was specifically optimized for Hopper GPUs with FP8 support. FA4 (2026) extends to Blackwell GPUs, adds torch.compile integration, and introduces the `score_mod` API for custom attention patterns.

## Real-World Impact

FlashAttention is not just an academic contribution; it is production infrastructure.
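That infrastructure role is easiest to see one level down, in PyTorch: `torch.nn.functional.scaled_dot_product_attention` dispatches to FlashAttention kernels when a supported GPU and dtype are available, and falls back to a plain math implementation otherwise, with the same results either way. A minimal sketch (shapes follow PyTorch's `(batch, heads, seq_len, head_dim)` convention; backend selection is automatic, so this also runs on CPU):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim)
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)

# One call; PyTorch routes to FlashAttention kernels on supported
# CUDA GPUs and to a math fallback elsewhere.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```

Because the dispatch is internal, most model code never mentions FlashAttention at all, which is exactly the "used without realizing it" effect described in this section.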
PyTorch 2.0+ ships with `torch.nn.functional.scaled_dot_product_attention` backed by FlashAttention kernels. Hugging Face Transformers automatically enables it when available. vLLM and TensorRT-LLM use it for high-throughput inference serving. The library has become so fundamental that most practitioners use it without realizing it.

## Limitations

- Installation requires the CUDA toolkit and compilation, which can be painful on some systems
- FA4 beta features are not yet production-stable
- AMD ROCm support lags behind NVIDIA CUDA in feature parity
- 955 open issues reflect the complexity of supporting diverse GPU architectures
- Head dimensions beyond 256 are not yet supported

## Conclusion

FlashAttention is one of the most impactful open-source contributions to the LLM ecosystem. By making exact attention 2-4x faster and dramatically more memory-efficient, it has enabled the long-context models and large-batch training runs that define the current generation of AI. The FA4 beta release with Blackwell GPU support and torch.compile integration ensures it will remain the attention kernel of choice as hardware evolves.