Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
FlashInfer is a high-performance kernel library and kernel generator for LLM serving that delivers state-of-the-art inference performance across diverse GPU architectures. It provides unified APIs for attention mechanisms, GEMM operations, and Mixture-of-Experts (MoE) computations, with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM. FlashInfer is integrated into leading LLM serving frameworks such as SGLang, vLLM, and MLC Engine, powering some of the largest-scale LLM deployments in production.

The library features a JIT compilation system that generates optimized CUDA kernels at runtime, adapting to specific hardware configurations and workload patterns. Key capabilities include paged KV-cache attention for efficient memory management, fused RoPE positional encoding, support for the GQA/MQA/MHA attention variants, and distributed inference primitives. FlashInfer supports FP16, BF16, FP8, and INT4 quantization formats for flexible precision-performance tradeoffs.

The kernel-generator architecture lets users compose custom attention kernels by specifying components such as masking, positional encoding, and score transformation. Built with CUDA and Python, FlashInfer is designed both for research experimentation and for production deployment at scale on NVIDIA GPUs.
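To make the paged KV-cache idea concrete, here is a minimal plain-Python sketch of the underlying bookkeeping: a per-sequence page table maps a logical token position to a physical (page, slot) pair, so the cache can grow in fixed-size pages scattered anywhere in GPU memory. This is illustrative only and is not FlashInfer's API; the names `PAGE_SIZE` and `locate` are hypothetical.

```python
# Hypothetical sketch of paged KV-cache addressing (not FlashInfer's API).
PAGE_SIZE = 16  # tokens stored per KV-cache page

def locate(page_table: list[int], token_pos: int) -> tuple[int, int]:
    """Map a sequence's logical token position to a physical (page_id, slot)."""
    page_idx, slot = divmod(token_pos, PAGE_SIZE)
    return page_table[page_idx], slot

# A 40-token sequence occupies 3 pages, scattered in physical memory.
page_table = [7, 2, 11]
assert locate(page_table, 0) == (7, 0)    # first token: page 7, slot 0
assert locate(page_table, 20) == (2, 4)   # token 20: page 2, slot 4
assert locate(page_table, 39) == (11, 7)  # last token: page 11, slot 7
```

Because pages need not be contiguous, the serving framework can allocate and free cache memory per page rather than reserving a maximum-length buffer per sequence, which is what makes the scheme memory-efficient under many concurrent requests.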