Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
llm-d is a Kubernetes-native, high-performance distributed LLM inference serving stack designed for production deployments. It integrates vLLM as the model server engine, the Kubernetes Inference Gateway for control plane orchestration, and an intelligent Envoy-based inference scheduler that makes routing decisions with awareness of prefix cache state, KV cache occupancy, SLA requirements, and load distribution.

Key capabilities include disaggregated serving that splits prefill and decode phases across independent instances, wide expert parallelism for large MoE models like DeepSeek-R1, tiered KV prefix caching that offloads to CPU/SSD/remote storage, and workload autoscaling with scale-to-zero support.

The v0.5.1 release (March 2026) validated approximately 3,100 tokens per second per B200 decode GPU and up to 50,000 output tokens per second on a 16x16 B200 prefill/decode topology, achieving an order-of-magnitude TTFT reduction versus round-robin baselines. The project is backed by Red Hat, KServe, and the Kubernetes ML community, with optimizations contributed directly back to upstream vLLM.
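To illustrate the idea behind cache-aware routing, here is a minimal sketch of how a scheduler might score candidate decode pods. The `Endpoint` fields, the scoring formula, and the weights are all illustrative assumptions for this sketch, not llm-d's actual scheduler code or API:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    # Hypothetical per-pod signals a cache-aware scheduler could observe.
    name: str
    prefix_cache_hit: float  # fraction of the prompt already in this pod's prefix cache (0..1)
    kv_occupancy: float      # fraction of KV-cache blocks currently in use (0..1)
    queue_depth: int         # number of requests already waiting on this pod

def score(ep: Endpoint, w_cache: float = 1.0, w_kv: float = 0.5, w_queue: float = 0.1) -> float:
    # Reward prefix-cache reuse (avoids recomputing prefill work);
    # penalize near-full KV caches and long queues. Weights are made up.
    return w_cache * ep.prefix_cache_hit - w_kv * ep.kv_occupancy - w_queue * ep.queue_depth

def pick(endpoints: list[Endpoint]) -> Endpoint:
    # Route the request to the highest-scoring pod.
    return max(endpoints, key=score)

pods = [
    Endpoint("decode-0", prefix_cache_hit=0.9, kv_occupancy=0.8, queue_depth=4),
    Endpoint("decode-1", prefix_cache_hit=0.1, kv_occupancy=0.3, queue_depth=1),
]
print(pick(pods).name)  # decode-0: its cache hit outweighs its higher load here
```

The point of weighting signals rather than round-robining is visible in the example: a busier pod can still win when it already holds most of the prompt's prefix in cache, which is what drives the TTFT reduction the project reports.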