LMCache - Open Source | Evermx

License: Apache-2.0

Category: Inference, 8.4K stars, 1.2K forks

LMCache is an open-source KV cache layer designed to dramatically reduce time-to-first-token and increase throughput for LLM serving systems. As of late May 2026, the project has surpassed 8,300 GitHub stars and is being actively developed alongside vLLM, with which it integrates deeply. Rather than reimplementing yet another inference engine, LMCache focuses on a narrow but critical bottleneck: ensuring that any text already processed by the model never has to be processed again, regardless of which replica originally computed it.

## Why a Dedicated KV Cache Layer Matters

In production LLM workloads, the same content is processed over and over: shared system prompts, retrieved documents in RAG pipelines, multi-turn conversation history, and code context in agentic loops. Without a shared cache, every replica recomputes these tokens for every request, wasting GPU cycles and inflating latency. LMCache stores the resulting KV pairs and lets any replica reuse them, turning expensive prefill work into a near-instant memory lookup.

## Cross-Replica and Cross-Tier Storage

LMCache treats GPU memory, CPU memory, local disk, and remote object storage as a unified cache hierarchy. Hot KV blocks stay in GPU HBM, warm blocks spill to CPU RAM, and cold blocks are persisted to NVMe or S3-compatible backends. The library handles transparent promotion and eviction across these tiers, so operators get the latency of in-memory caching with the capacity of cheap storage. This tiered design is what makes long-context and conversation-heavy workloads economically viable at scale.

## Tight vLLM Integration

LMCache ships as a first-class plugin for vLLM, with one-line activation in the engine config. Once enabled, vLLM consults LMCache before every prefill, fetches matching KV blocks if available, and only computes the uncached suffix.
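
The lookup-then-compute-the-suffix flow can be illustrated with a small, self-contained sketch. This is not LMCache's or vLLM's actual code: the chunk size, the hashing scheme, and the names `chunk_keys`, `record_prefill`, and `split_prompt` are all assumptions made for illustration. The idea shown is that each chunk key hashes the whole prefix up to that chunk, so a key only matches when every earlier token matches too.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_TOKENS = 4  # hypothetical chunk size, kept tiny for readability


def chunk_keys(tokens: List[int]) -> List[str]:
    """One key per full chunk; each key hashes the entire prefix up to that chunk."""
    keys = []
    running = hashlib.sha256()
    for end in range(CHUNK_TOKENS, len(tokens) + 1, CHUNK_TOKENS):
        chunk = tokens[end - CHUNK_TOKENS:end]
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


def record_prefill(tokens: List[int], store: Dict[str, bytes]) -> None:
    """Pretend the prompt was prefilled and save a placeholder per full chunk."""
    for key in chunk_keys(tokens):
        store.setdefault(key, b"kv-placeholder")  # stands in for real KV tensors


def split_prompt(tokens: List[int], store: Dict[str, bytes]) -> Tuple[int, List[int]]:
    """Return (tokens covered by cached chunks, uncached suffix still to prefill)."""
    hit = 0
    for i, key in enumerate(chunk_keys(tokens)):
        if key not in store:
            break
        hit = (i + 1) * CHUNK_TOKENS
    return hit, tokens[hit:]


store: Dict[str, bytes] = {}
record_prefill([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], store)  # two full chunks cached
hit, suffix = split_prompt([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102], store)
print(hit, suffix)  # 8 [99, 100, 101, 102]
```

Only the four-token suffix would need prefill here. The sketch matches at whole-chunk granularity from the start of the prompt; it does not attempt the partial-prefix matching the integration is described as supporting.
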
The integration covers prefix caching, partial-prefix matches, and disaggregated prefill workflows, which separate the compute-heavy prefill stage from the latency-sensitive decode stage onto different GPU pools.

## Cache-Aware Routing and Distributed Sharing

Because LMCache exposes a network-accessible cache service, multiple inference replicas can share the same KV blocks. Combined with a cache-aware router, requests can be steered to the replica most likely to have a warm cache for the request prefix, or to any replica that can pull missing blocks from a peer over RDMA or TCP. This eliminates the cold-start penalty when traffic shifts between replicas and unlocks horizontal scaling without sacrificing cache hit rates.

## Hardware Coverage

LMCache supports both NVIDIA CUDA and AMD ROCm GPUs, with growing coverage for emerging accelerators. The project ships optimized kernels for KV block serialization, compression, and transfer, allowing it to keep up with the bandwidth demands of modern H100 and MI300X deployments. Apache 2.0 licensing makes it safe to embed in commercial inference stacks.

## Real-World Impact

Teams adopting LMCache typically report 2x to 10x improvements in time-to-first-token for prompts with substantial shared context, and meaningful cost reductions on long-context workloads where prefill dominates the bill. RAG pipelines, code assistants with large repository context, and multi-turn chatbots benefit the most, because their workloads are exactly the pattern LMCache is optimized for.

## Limitations

LMCache is most valuable when prompts genuinely share prefixes; workloads with highly unique short prompts see little benefit and may even pay a small overhead for cache lookups. The disaggregated prefill mode and distributed cache features add operational complexity that smaller deployments may not need.
Because the project is relatively young, some advanced features and integrations are still stabilizing from release to release, so production users should pin specific versions and test upgrades carefully.
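
To make the promotion and eviction behavior described under Cross-Replica and Cross-Tier Storage more concrete, here is a toy multi-tier LRU cache. It is a sketch under stated assumptions, not LMCache's implementation: the class `TieredCache`, its methods, and the choice of plain LRU with promote-on-hit are all hypothetical. Tier 0 stands in for GPU HBM, and later tiers for CPU RAM, disk, and so on.

```python
from collections import OrderedDict
from typing import List, Optional


class TieredCache:
    """Toy multi-tier LRU (hypothetical): tier 0 is fastest, evictions cascade down."""

    def __init__(self, capacities: List[int]):
        # One LRU-ordered dict per tier; capacities are counted in blocks.
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level: int, key: str, value: bytes) -> None:
        if level >= len(self.tiers):
            return  # fell off the last tier, so the block is dropped
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used in this tier
        while len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)    # demote one tier down

    def put(self, key: str, value: bytes) -> None:
        for tier in self.tiers:
            tier.pop(key, None)  # drop any stale copy before re-inserting
        self._insert(0, key, value)

    def get(self, key: str) -> Optional[bytes]:
        for tier in self.tiers:
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promote to the fastest tier on a hit
                return value
        return None

    def tier_of(self, key: str) -> Optional[int]:
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return None


cache = TieredCache([1, 1, 1])  # one block each in "GPU", "CPU", "disk"
for name in ("a", "b", "c"):
    cache.put(name, name.encode())
print(cache.tier_of("a"), cache.tier_of("c"))  # 2 0
cache.get("a")                                  # hit in the slowest tier
print(cache.tier_of("a"), cache.tier_of("b"))  # 0 2
```

A real hierarchy would move blocks asynchronously and size tiers in bytes rather than block counts; the sketch only shows the ordering logic.
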

## Key Features

- Tiered KV cache spanning GPU HBM, CPU RAM, local NVMe, and remote object storage
- First-class vLLM plugin with one-line activation in the engine config
- Distributed KV block sharing across replicas over RDMA or TCP
- Cache-aware routing to steer requests to replicas with warm prefixes
- Disaggregated prefill support that splits prefill and decode onto separate pools
- Hardware coverage for NVIDIA CUDA and AMD ROCm with optimized transfer kernels
- Substantial TTFT and throughput gains for RAG, long-context, and multi-turn workloads
- Apache 2.0 license suitable for embedding in commercial inference stacks
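
The cache-aware routing feature listed above can be approximated by a simple scoring rule: prefer the replica that already holds the longest run of the request's leading chunks, and break ties by current load. The sketch below assumes that rule; the function names, the opaque chunk keys, and the per-replica key sets are hypothetical and do not reflect how any particular router is built.

```python
from typing import Dict, List, Set


def cached_prefix_chunks(request_keys: List[str], cached: Set[str]) -> int:
    """Count how many leading chunks of the request a replica already holds."""
    count = 0
    for key in request_keys:
        if key not in cached:
            break  # a prefix must be contiguous from the start
        count += 1
    return count


def pick_replica(
    request_keys: List[str],
    replica_caches: Dict[str, Set[str]],
    load: Dict[str, int],
) -> str:
    """Choose the replica with the longest warm prefix, then the lowest load."""
    return max(
        replica_caches,
        key=lambda name: (
            cached_prefix_chunks(request_keys, replica_caches[name]),
            -load.get(name, 0),
        ),
    )


replicas = {"r1": {"k1"}, "r2": {"k1", "k2"}, "r3": set()}
in_flight = {"r1": 3, "r2": 5, "r3": 1}
print(pick_replica(["k1", "k2", "k3"], replicas, in_flight))  # r2
print(pick_replica(["other"], replicas, in_flight))           # r3
```

The second call shows the fallback: when no replica has a warm prefix, the rule degrades to least-loaded routing. A production router would also need to weigh the cost of pulling missing blocks from a peer, which this sketch ignores.
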
