Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Explore the latest AI open-source projects from GitHub and HuggingFace.
SGLang is a high-performance serving framework for large language models and multimodal models, maintained by the sgl-project community and backed by the LMSYS team. It has grown into one of the most widely deployed open-source inference engines, with the project reporting that it powers trillions of tokens per day in production. The framework joined the PyTorch ecosystem in 2025 and received an Open Source AI Grant, signals of both technical maturity and broad adoption. ## What It Does SGLang sits between your model weights and your application, turning a raw transformer checkpoint into a fast, OpenAI-compatible serving endpoint. Its goal is throughput and latency at scale: serving many concurrent requests cheaply while keeping per-request latency low. It targets the hard part of running LLMs in production — batching, caching, and GPU utilization — rather than training or fine-tuning. ## Core Techniques The engine is built around a few well-known performance ideas, implemented carefully. RadixAttention provides automatic prefix caching, so requests that share a common prompt prefix (system prompts, few-shot examples, multi-turn chats) reuse previously computed key-value cache instead of recomputing it. A zero-overhead batch scheduler overlaps CPU scheduling with GPU compute to keep the accelerator busy. Continuous batching, a cache-aware load balancer, and fast structured-output decoding (constrained JSON and grammar-guided generation) round out the serving stack. For very large models, SGLang supports tensor, pipeline, and expert parallelism, plus prefill/decode disaggregation for Mixture-of-Experts deployments. ## Models and Hardware SGLang is known for day-0 support of major open model releases — including DeepSeek V3/R1, Qwen, Llama, and gpt-oss — often shipping optimized kernels the same day a model launches. It runs across a wide hardware range: NVIDIA GPUs (with specific tuning for recent datacenter parts), AMD Instinct accelerators, and a JAX backend that enables execution on TPUs. Beyond text, it serves vision-language and diffusion models, broadening its use beyond pure chat workloads. ## Usability For teams already comfortable with Python and GPU serving, getting started is straightforward: install the package, launch a server pointed at a Hugging Face model, and call it through an OpenAI-compatible API. Extensive documentation, a public roadmap, weekly dev meetings, and an active Slack community lower the operational learning curve. The trade-off is that SGLang is infrastructure, not a turnkey app — it assumes you can provision GPUs and reason about parallelism and memory. ## Considerations The main caveats are inherent to high-performance inference. Squeezing maximum throughput often requires tuning parallelism strategies and batch settings for your specific hardware, and the most advanced features (large-scale expert parallelism, disaggregated serving) target multi-GPU and multi-node setups that smaller teams may not need. The project moves quickly, so APIs and optimal configurations evolve between releases. For organizations serving open models at scale and looking to cut GPU cost per token, SGLang is a mature, actively developed option released under the Apache-2.0 license.