Mooncake

kvcache-ai (Moonshot AI) | Apache-2.0

Inference | 5.5K Stars | 800 Forks | 108 views

Mooncake is the open-source serving platform that powers Kimi, the leading LLM service operated by Moonshot AI. Originally published as an architecture paper in 2024, the project has steadily grown into a full reference implementation and as of May 2026 sits at over 5,400 GitHub stars and 800 forks. Maintained under the kvcache-ai organization, Mooncake is one of the most concrete looks at how a top-tier production LLM service is actually built underneath the API.

## A KV-Cache-Centric Architecture

Mooncake is designed around a single observation: the KV cache is the most valuable artifact a serving stack produces, and treating it as a first-class, disaggregated resource unlocks substantial efficiency gains. Rather than treating each GPU replica as a self-contained black box, Mooncake exposes the KV cache as a pooled, networked resource that any replica can read from or write to. This perspective is what enables most of the platform's distinguishing features.

## Disaggregated Prefill and Decode

LLM inference has two very different phases: prefill is compute-bound and benefits from large batches on high-FLOP GPUs, while decode is memory-bandwidth-bound and benefits from tight latency control. Mooncake splits these phases onto separate GPU pools, sized and scheduled independently. The handoff between phases is mediated by the KV cache, which is transferred over high-speed networking instead of recomputed. This disaggregation is one of the techniques that lets Kimi sustain very long contexts at reasonable cost.

## RDMA-Based KV Cache Transfer

Moving KV caches between machines used to be a non-starter because of bandwidth and latency. Mooncake addresses this with an RDMA-based transfer engine that moves cache blocks at near-line-rate between nodes, with carefully tuned scheduling to overlap transfer with compute. This is what makes pooled KV storage practical at the scale of a production chatbot rather than just a research demo.

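
To see why the network fabric matters for handing a KV cache from prefill to decode, a rough back-of-the-envelope estimate helps. The sketch below uses the standard size formula for an attention KV cache (one key and one value tensor per layer). The model shape, context length, link speed, and efficiency factor are all hypothetical values chosen for illustration; none of them come from Mooncake or Kimi.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence.

    The leading factor 2 accounts for storing both keys and values per layer.
    bytes_per_elem=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes, link_gbps, efficiency=0.9):
    """Idealized time to move num_bytes over a link rated in gigabits per second.

    efficiency is an assumed fraction of line rate actually achieved.
    """
    return num_bytes * 8 / (link_gbps * 1e9 * efficiency)


# Hypothetical model shape and a hypothetical 200 Gb/s link.
size = kv_cache_bytes(num_tokens=32_000, num_layers=32, num_kv_heads=8, head_dim=128)
seconds = transfer_seconds(size, link_gbps=200)

print(f"KV cache: {size / 1e9:.2f} GB, transfer: {seconds:.3f} s")
```

Under these assumed numbers the cache for a single 32,000-token sequence is about 4.2 GB and takes roughly 0.19 s to move at 90% of a 200 Gb/s link. On a much slower link the same transfer would take seconds, which is why transferring the cache instead of recomputing it depends on a fast fabric.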
## Integration with vLLM and SGLang

Although Mooncake originated inside Moonshot, the open-source release is engine-agnostic. The project provides adapters for both vLLM and SGLang, letting operators plug Mooncake's KV pool and scheduler into the inference engine they already use. This positioning mirrors a broader industry trend: serving platforms are decoupling from inference engines, with each layer specializing in what it does best.

## Conversation Cache and Long-Context Optimization

Kimi is known for very long context windows, and Mooncake reflects this in features dedicated to multi-turn conversation reuse and long-document workloads. The platform aggressively caches prior turns, supports partial prefix reuse, and exposes APIs for clients to hint at expected context reuse, all of which compound to keep marginal cost low even as conversations grow.

## Production-Grade Scheduling

Mooncake includes a global scheduler that places requests across prefill and decode pools based on cache locality, current load, and SLO class. The scheduler exposes hooks for SLO-aware admission control and back-pressure, which are essential when running a public LLM service with mixed free-tier and paid-tier traffic.

## Limitations

Mooncake's value proposition is strongest at scale: small deployments running a handful of replicas on a single node will see most of its features as overhead rather than gain. The RDMA-based transfer engine assumes a high-quality network fabric, which is not always available outside well-engineered datacenters. Documentation is improving but still trails the pace of the codebase, so adopters should expect to read source code and recent papers to fully understand tuning knobs.
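
The interplay between prefix reuse and the scheduling described under Production-Grade Scheduling can be illustrated with a toy placement policy. This is a minimal sketch, not Mooncake's scheduler or API: `PrefillNode`, `prefix_keys`, `place`, the block size, the cost formula, and the `max_cost` admission threshold are all hypothetical names and choices made for this example. It chain-hashes token blocks so that a cached block implies the whole prefix before it is cached, then picks the node with the lowest combination of uncached prefill work and queued load.

```python
from dataclasses import dataclass, field

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny for readability


def prefix_keys(tokens):
    """Chain-hash each full block so a key identifies the whole prefix up to it."""
    keys, prev = [], 0
    full = len(tokens) - len(tokens) % BLOCK_TOKENS
    for i in range(0, full, BLOCK_TOKENS):
        prev = hash((prev, tuple(tokens[i:i + BLOCK_TOKENS])))
        keys.append(prev)
    return keys


@dataclass
class PrefillNode:
    name: str
    queued_tokens: int = 0                      # stand-in for current load
    cached: set = field(default_factory=set)    # block keys this node holds

    def reusable_tokens(self, keys):
        """Count tokens covered by the longest cached prefix (partial reuse)."""
        hits = 0
        for key in keys:
            if key not in self.cached:
                break
            hits += 1
        return hits * BLOCK_TOKENS


def place(tokens, nodes, load_weight=0.5, max_cost=None):
    """Pick the node with the least uncached work plus weighted load.

    Returns None when even the best node exceeds max_cost, a crude
    stand-in for SLO-aware admission control.
    """
    keys = prefix_keys(tokens)

    def cost(node):
        uncached = len(tokens) - node.reusable_tokens(keys)
        return uncached + load_weight * node.queued_tokens

    best = min(nodes, key=cost)
    if max_cost is not None and cost(best) > max_cost:
        return None
    best.queued_tokens += len(tokens) - best.reusable_tokens(keys)
    best.cached.update(keys)
    return best


nodes = [PrefillNode("a"), PrefillNode("b")]
turn1 = list(range(8))
turn2 = turn1 + [100, 101, 102, 103]   # follow-up turn sharing the first 8 tokens
first = place(turn1, nodes)
second = place(turn2, nodes)
print(first.name, second.name)
```

In this toy run the follow-up turn lands on the same node as the first turn because only its four new tokens need prefill there, while an unrelated request would be steered to the less loaded node. A real scheduler would also have to account for decode-pool placement, cache eviction, and per-class SLO budgets, which this sketch omits.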

Key Features

KV-cache-centric architecture that treats the cache as a pooled, disaggregated resource
Disaggregated prefill and decode running on independently sized GPU pools
RDMA-based KV transfer engine for near-line-rate cache movement between nodes
Engine-agnostic adapters for both vLLM and SGLang
Aggressive multi-turn conversation cache for long-context chat workloads
Global scheduler with SLO-aware placement and admission control
Reference implementation of the production Kimi serving stack
Apache-2.0 licensed and maintained under the kvcache-ai community organization

Related Projects

Ollama

llama.cpp

vLLM

Unsloth