Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
llm-d is a Kubernetes-native, high-performance distributed LLM inference serving stack designed for production deployments. It integrates vLLM as the model server engine, the Kubernetes Inference Gateway for control plane orchestration, and an intelligent Envoy-based inference scheduler that makes routing decisions with awareness of prefix cache state, KV cache occupancy, SLA requirements, and load distribution.

Key capabilities include disaggregated serving that splits prefill and decode phases across independent instances, wide expert parallelism for large MoE models like DeepSeek-R1, tiered KV prefix caching that offloads to CPU/SSD/remote storage, and workload autoscaling with scale-to-zero support.

The v0.5.1 release (March 2026) validated approximately 3,100 tokens per second per B200 decode GPU and up to 50,000 output tokens per second on a 16x16 B200 prefill/decode topology, achieving an order-of-magnitude TTFT reduction versus round-robin baselines. The project is backed by Red Hat, KServe, and the Kubernetes ML community, with optimizations contributed directly back to upstream vLLM.
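To illustrate the idea behind cache-aware routing, here is a minimal sketch of how a scheduler might score candidate decode pods. The `Endpoint` fields, the scoring formula, and the weights are all illustrative assumptions for this sketch, not llm-d's actual scheduler code or API:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    # Hypothetical per-pod signals a cache-aware scheduler could observe.
    name: str
    prefix_cache_hit: float  # fraction of the prompt already in this pod's prefix cache (0..1)
    kv_occupancy: float      # fraction of KV-cache blocks currently in use (0..1)
    queue_depth: int         # number of requests already waiting on this pod

def score(ep: Endpoint, w_cache: float = 1.0, w_kv: float = 0.5, w_queue: float = 0.1) -> float:
    # Reward prefix-cache reuse (avoids recomputing prefill work);
    # penalize near-full KV caches and long queues. Weights are made up.
    return w_cache * ep.prefix_cache_hit - w_kv * ep.kv_occupancy - w_queue * ep.queue_depth

def pick(endpoints: list[Endpoint]) -> Endpoint:
    # Route the request to the highest-scoring pod.
    return max(endpoints, key=score)

pods = [
    Endpoint("decode-0", prefix_cache_hit=0.9, kv_occupancy=0.8, queue_depth=4),
    Endpoint("decode-1", prefix_cache_hit=0.1, kv_occupancy=0.3, queue_depth=1),
]
print(pick(pods).name)  # decode-0: its cache hit outweighs its higher load here
```

The point of weighting signals rather than round-robining is visible in the example: a busier pod can still win when it already holds most of the prompt's prefix in cache, which is what drives the TTFT reduction the project reports.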