Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
FlashInfer is a high-performance kernel library and kernel generator for LLM serving that delivers state-of-the-art inference performance across diverse GPU architectures. It provides unified APIs for attention mechanisms, GEMM operations, and Mixture-of-Experts (MoE) computations, with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM. FlashInfer is integrated into leading LLM serving frameworks such as SGLang, vLLM, and MLC Engine, powering some of the largest-scale LLM deployments in production.

The library features a JIT compilation system that generates optimized CUDA kernels at runtime, adapting to specific hardware configurations and workload patterns. Key capabilities include paged KV-cache attention for efficient memory management, fused RoPE positional encoding, support for the GQA/MQA/MHA attention variants, and distributed inference primitives. FlashInfer supports FP16, BF16, FP8, and INT4 quantization formats for flexible precision-performance tradeoffs.

The kernel-generator architecture lets users compose custom attention kernels by specifying components such as masking, positional encoding, and score transformation. Built with CUDA and Python, FlashInfer is designed both for research experimentation and for production deployment at scale on NVIDIA GPUs.
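To make the paged KV-cache idea concrete, here is a minimal plain-Python sketch of the underlying bookkeeping: a per-sequence page table maps a logical token position to a physical (page, slot) pair, so the cache can grow in fixed-size pages scattered anywhere in GPU memory. This is illustrative only and is not FlashInfer's API; the names `PAGE_SIZE` and `locate` are hypothetical.

```python
# Hypothetical sketch of paged KV-cache addressing (not FlashInfer's API).
PAGE_SIZE = 16  # tokens stored per KV-cache page

def locate(page_table: list[int], token_pos: int) -> tuple[int, int]:
    """Map a sequence's logical token position to a physical (page_id, slot)."""
    page_idx, slot = divmod(token_pos, PAGE_SIZE)
    return page_table[page_idx], slot

# A 40-token sequence occupies 3 pages, scattered in physical memory.
page_table = [7, 2, 11]
assert locate(page_table, 0) == (7, 0)    # first token: page 7, slot 0
assert locate(page_table, 20) == (2, 4)   # token 20: page 2, slot 4
assert locate(page_table, 39) == (11, 7)  # last token: page 11, slot 7
```

Because pages need not be contiguous, the serving framework can allocate and free cache memory per page rather than reserving a maximum-length buffer per sequence, which is what makes the scheme memory-efficient under many concurrent requests.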