Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
SGLang (Structured Generation Language) is a high-throughput, low-latency inference engine for large language models and multimodal models, developed by the LMSYS team. With 26,600 GitHub stars and over 12,000 commits, it has become the de facto open-source infrastructure standard for LLM deployment in 2026, running across more than 400,000 GPUs at organizations including xAI, NVIDIA, AMD, LinkedIn, Google Cloud, and AWS.

The framework's flagship innovation is RadixAttention, a prefix-caching mechanism that automatically reuses KV-cache activations across requests sharing common prefixes, such as system prompts, RAG context, and few-shot examples. Compared to frameworks without automatic KV-cache reuse, this delivers up to 5x faster inference and 6x higher throughput on real-world workloads; a minimal sketch of the underlying idea appears below. In February 2026, SGLang unlocked a 25x inference performance improvement on NVIDIA GB300 NVL72 hardware, and independent 2026 benchmarks consistently rank it at the top of open-source inference engines, at approximately 16,200 tokens per second on H100 GPUs, a 29% throughput advantage over vLLM.

SGLang supports all major model architectures (Llama, Qwen, DeepSeek, GPT variants, and diffusion models) and runs on NVIDIA GB200/H100/A100 GPUs, AMD MI-series accelerators, Intel CPUs, Google TPUs, and Huawei Ascend NPUs. It provides OpenAI-compatible APIs for drop-in replacement of existing inference stacks (see the client example below) and also serves as the inference backbone for RL post-training frameworks including verl and Tunix.
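To make the prefix-reuse idea concrete, here is a minimal, self-contained Python sketch. It models the cache as a plain token-level trie and only counts how many leading tokens a new request can reuse; the names (`PrefixCache`, `longest_prefix`) are illustrative, and the real RadixAttention is a radix tree over GPU KV-cache blocks with scheduling and eviction policies that this toy omits.

```python
# Toy illustration of prefix caching: NOT SGLang's implementation.
# A real engine attaches GPU KV-cache blocks to tree nodes; here we
# only measure how much of a new request's prefix is already cached.

from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        # Children keyed by token id; a real engine would also hold
        # references to the KV-cache pages computed for this token.
        self.children: Dict[int, "TrieNode"] = {}


class PrefixCache:
    """Token-level trie: the longest matched prefix can skip prefill."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: List[int]) -> None:
        """Record a request's tokens so later requests can reuse them."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def longest_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched


if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = [101, 7, 42, 9]      # shared system-prompt tokens
    req_a = system_prompt + [55, 66]
    req_b = system_prompt + [77, 88, 99]

    cache.insert(req_a)                   # first request populates the cache
    hit = cache.longest_prefix(req_b)     # second request reuses the prefix
    print(f"reused {hit} of {len(req_b)} tokens")  # reused 4 of 7 tokens
```

The same principle explains why workloads with long shared system prompts or RAG context benefit most: the deeper the shared prefix, the larger the fraction of prefill computation that can be skipped.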
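Because the server speaks the OpenAI API, pointing an existing stack at SGLang is typically a one-line change of `base_url`. The snippet below is a sketch that assumes an SGLang server launched locally on its default port (30000); the model path and prompt are placeholders.

```python
# Drop-in use of SGLang's OpenAI-compatible endpoint via the standard
# openai client. Assumes a server was started locally, for example:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
# The port and model name below are illustrative defaults.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",  # point the client at SGLang
    api_key="EMPTY",                       # a local server needs no real key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize RadixAttention in one sentence."},
    ],
    temperature=0.2,
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Note that repeated calls with the same system message are exactly the pattern RadixAttention exploits: every request after the first shares the system-prompt prefix and reuses its cached KV activations.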