Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
LMCache is an open-source KV cache layer designed to dramatically reduce time-to-first-token and increase throughput for LLM serving systems. As of late May 2026, the project has surpassed 8,300 GitHub stars and is being actively developed alongside vLLM, with which it integrates deeply. Rather than reimplementing yet another inference engine, LMCache focuses on a narrow but critical bottleneck: ensuring that any text already processed by the model never has to be processed again, regardless of which replica originally computed it.

## Why a Dedicated KV Cache Layer Matters

In production LLM workloads, the same content is processed over and over: shared system prompts, retrieved documents in RAG pipelines, multi-turn conversation history, and code context in agentic loops. Without a shared cache, every replica recomputes these tokens for every request, wasting GPU cycles and inflating latency. LMCache stores the resulting KV pairs and lets any replica reuse them, turning expensive prefill work into a near-instant memory lookup.

## Cross-Replica and Cross-Tier Storage

LMCache treats GPU memory, CPU memory, local disk, and remote object storage as a unified cache hierarchy. Hot KV blocks stay in GPU HBM, warm blocks spill to CPU RAM, and cold blocks are persisted to NVMe or S3-compatible backends. The library handles transparent promotion and eviction across these tiers, so operators get the latency of in-memory caching with the capacity of cheap storage. This tiered design is what makes long-context and conversation-heavy workloads economically viable at scale.

## Tight vLLM Integration

LMCache ships as a first-class plugin for vLLM, with one-line activation in the engine config. Once enabled, vLLM consults LMCache before every prefill, fetches matching KV blocks if available, and only computes the uncached suffix.
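The lookup-then-compute-the-suffix flow can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not LMCache's or vLLM's actual code: `ToyKVStore`, `chunk_keys`, `plan_prefill`, and the tiny `CHUNK_SIZE` are hypothetical names chosen for illustration, and strings stand in for real KV tensors. It assumes the cache is keyed per fixed-size token chunk, with each key hashing the whole prefix up to that chunk.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # hypothetical value, kept tiny so the example is easy to follow


def chunk_keys(tokens: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk; each key hashes the entire prefix so far."""
    keys = []
    running = hashlib.sha256()
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


class ToyKVStore:
    """In-memory stand-in for a shared KV cache (illustrative only)."""

    def __init__(self) -> None:
        self._blocks: Dict[str, str] = {}

    def put(self, tokens: List[int]) -> None:
        # Store a placeholder per full chunk; a real system would store KV tensors.
        for i, key in enumerate(chunk_keys(tokens)):
            self._blocks.setdefault(key, f"kv-block-{i}")

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        # Walk chunk by chunk and stop at the first miss.
        matched = 0
        for key in chunk_keys(tokens):
            if key not in self._blocks:
                break
            matched += CHUNK_SIZE
        return matched


def plan_prefill(store: ToyKVStore, tokens: List[int]) -> Tuple[List[int], List[int]]:
    """Split a request into (tokens served from cache, tokens still to prefill)."""
    hit = store.longest_cached_prefix(tokens)
    return tokens[:hit], tokens[hit:]


if __name__ == "__main__":
    store = ToyKVStore()
    system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
    store.put(system_prompt + [9, 10])           # first request warms the cache
    cached, todo = plan_prefill(store, system_prompt + [99, 100, 101])
    print(len(cached), "tokens reused;", len(todo), "tokens to prefill")
```

Because each key covers the full prefix, a chunk only matches when everything before it matches too, which is why the walk can stop at the first miss.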
The integration covers prefix caching, partial-prefix matches, and disaggregated prefill workflows, which separate the compute-heavy prefill stage from the latency-sensitive decode stage onto different GPU pools.

## Cache-Aware Routing and Distributed Sharing

Because LMCache exposes a network-accessible cache service, multiple inference replicas can share the same KV blocks. Combined with a cache-aware router, requests can be steered to the replica most likely to have a warm cache for the request prefix, or to any replica that can pull missing blocks from a peer over RDMA or TCP. This eliminates the cold-start penalty when traffic shifts between replicas and unlocks horizontal scaling without sacrificing cache hit rates.

## Hardware Coverage

LMCache supports both NVIDIA CUDA and AMD ROCm GPUs, with growing coverage for emerging accelerators. The project ships optimized kernels for KV block serialization, compression, and transfer, allowing it to keep up with the bandwidth demands of modern H100 and MI300X deployments. Apache 2.0 licensing makes it safe to embed in commercial inference stacks.

## Real-World Impact

Teams adopting LMCache typically report 2x to 10x improvements in time-to-first-token for prompts with substantial shared context, and meaningful cost reductions on long-context workloads where prefill dominates the bill. RAG pipelines, code assistants with large repository context, and multi-turn chatbots benefit the most, because their workloads are exactly the pattern LMCache is optimized for.

## Limitations

LMCache is most valuable when prompts genuinely share prefixes; workloads with highly unique short prompts see little benefit and may even pay a small overhead for cache lookups. The disaggregated prefill mode and distributed cache features add operational complexity that smaller deployments may not need.
Because the project is relatively young, some advanced features and integrations are still stabilizing from release to release, so production users should pin specific versions and test upgrades carefully.
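To make the cache-aware routing idea from the routing section concrete, here is a rough sketch of one possible policy: send each request to the replica with the longest warm prefix, breaking ties by the fewest in-flight requests. Everything here is an assumption for illustration; `ToyRouter`, `prefix_chunks`, and the chunk size are hypothetical and do not reflect LMCache's router or its API. The sketch also leaves out the peer-to-peer block transfer path mentioned above.

```python
from typing import Dict, List, Set, Tuple


def prefix_chunks(tokens: List[int], chunk_size: int) -> List[Tuple[int, ...]]:
    """Return the token prefix ending at each full chunk boundary."""
    return [tuple(tokens[:end]) for end in range(chunk_size, len(tokens) + 1, chunk_size)]


class ToyRouter:
    """Tracks which prefixes each replica has seen (illustrative only)."""

    def __init__(self, replicas: List[str], chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size
        self.cached: Dict[str, Set[Tuple[int, ...]]] = {r: set() for r in replicas}
        self.in_flight: Dict[str, int] = {r: 0 for r in replicas}

    def warm_tokens(self, replica: str, tokens: List[int]) -> int:
        # Count how many leading tokens this replica is believed to have cached.
        hit = 0
        for prefix in prefix_chunks(tokens, self.chunk_size):
            if prefix not in self.cached[replica]:
                break
            hit = len(prefix)
        return hit

    def route(self, tokens: List[int]) -> str:
        # Prefer the longest warm prefix, then the least-loaded replica.
        best = max(
            self.cached,
            key=lambda r: (self.warm_tokens(r, tokens), -self.in_flight[r]),
        )
        self.in_flight[best] += 1
        # After serving, the chosen replica is assumed to hold these prefixes.
        self.cached[best].update(prefix_chunks(tokens, self.chunk_size))
        return best

    def finish(self, replica: str) -> None:
        self.in_flight[replica] -= 1


if __name__ == "__main__":
    router = ToyRouter(["replica-a", "replica-b"])
    shared = [1, 2, 3, 4, 5, 6, 7, 8]            # e.g. a shared system prompt
    print(router.route(shared + [20, 21]))        # cold everywhere: first replica wins the tie
    print(router.route(shared + [30, 31, 32]))    # same prefix: routed back to the warm replica
    print(router.route([9, 9, 9, 9, 9]))          # unrelated prompt: goes to the idle replica
```

A production router would also need to account for eviction, since a replica may no longer hold a prefix it served earlier; this sketch optimistically assumes nothing is ever evicted.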