LMCache is an open-source LLM serving engine extension that dramatically accelerates large language model inference by intelligently caching and reusing KV (key-value) cache data across distributed infrastructure. Developed under the Apache 2.0 license, it addresses one of the costliest bottlenecks in LLM production deployments: the repeated computation of prefill tokens for identical or overlapping input text.

## The KV Cache Problem in LLM Serving

Every time an LLM processes a prompt, it generates a KV cache: a set of intermediate computations stored in GPU memory that allow the model to attend over previously seen tokens efficiently. In production environments, many requests share substantial text overlap: a RAG pipeline sending the same system prompt and retrieved documents with every query, a multi-turn chatbot replaying the full conversation history each turn, or a batch of users accessing a shared knowledge base. Without KV cache reuse, each request recomputes this overlap from scratch on the GPU, burning compute cycles and increasing time-to-first-token (TTFT). LMCache solves this by storing computed KV caches and reusing them across requests, instances, and even time windows.

## Core Architecture

LMCache introduces a multi-tier cache storage hierarchy that spans GPU VRAM, CPU RAM, NVMe SSDs, and remote storage backends including Redis, S3-compatible stores, Weka, and Valkey. When a new request arrives, LMCache checks whether any portion of its input already has a cached KV representation. If found, the cached data is retrieved and injected into the serving engine, bypassing prefill computation for those tokens entirely.

The key architectural innovation is that LMCache supports reuse of **any** overlapping text, not just exact prefixes. This makes it significantly more powerful than the native prefix caching built into vLLM and other serving frameworks, which requires the reusable text to appear at the start of the prompt.
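The tiered lookup-and-promote pattern described above can be sketched in a few lines. This is an illustration of the general idea only; the class, tier names, and eviction policy here are hypothetical and are not LMCache's actual data structures.

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative multi-tier KV cache: fast tiers are searched first,
    LRU evictions cascade to slower tiers, and hits are promoted.
    (A sketch of the concept, not LMCache's real internals.)"""

    def __init__(self, tier_capacities):
        # Tiers ordered fastest -> slowest, e.g. GPU VRAM, CPU RAM, NVMe, remote.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in tier_capacities]

    def put(self, chunk_hash, kv_blob, tier=0):
        name, cap, store = self.tiers[tier]
        store[chunk_hash] = kv_blob
        store.move_to_end(chunk_hash)
        # Evict least-recently-used entries downward to the next (slower) tier.
        while len(store) > cap:
            evicted_hash, evicted_blob = store.popitem(last=False)
            if tier + 1 < len(self.tiers):
                self.put(evicted_hash, evicted_blob, tier + 1)

    def get(self, chunk_hash):
        for name, _cap, store in self.tiers:
            if chunk_hash in store:
                blob = store.pop(chunk_hash)
                self.put(chunk_hash, blob, 0)  # promote the hit to the fastest tier
                return blob, name
        return None, None  # full miss: the caller must run prefill

# Usage: cache the KV data for a prompt chunk, then look it up later.
cache = TieredKVCache([("gpu", 2), ("cpu", 4), ("disk", 100)])
cache.put("chunk-a", b"kv-bytes-a")
blob, hit_tier = cache.get("chunk-a")
```

The point of the sketch is the control flow a real system needs: a hit in any tier avoids prefill, and capacity pressure in a fast tier spills entries to a slower one instead of discarding them.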
## Integration and Compatibility

LMCache integrates tightly with vLLM v1 as its primary target, supporting CPU offloading, disaggregated prefill across nodes, and P2P cache sharing between serving instances. It also integrates with SGLang for KV cache offloading. The framework ships as a pip package and is compatible with the latest vLLM releases.

Major cloud and infrastructure providers have adopted LMCache in production, including Google Cloud, GMI Cloud, CoreWeave, and NVIDIA Dynamo. The project is backed by Tensormesh and works with KServe for Kubernetes-native deployments.

## Performance Benchmarks

In standard benchmarks, LMCache delivers 3-10x reductions in time-to-first-token and corresponding increases in throughput for workloads with text reuse. Multi-round QA and RAG pipelines benefit most, since they naturally reuse large portions of context across requests. For a typical RAG deployment where each query shares the same 4,000-token document context, LMCache eliminates the prefill cost for those tokens on all but the first request.

The v0.3.15 release in March 2026 brought improvements to the NIXL storage backend and enhanced async loading pipelines, further reducing cache retrieval overhead at scale.

## Cache Compression

LMCache includes CacheGen, a KV cache compression module that reduces the storage footprint of cached data. This enables longer retention periods on limited-capacity CPU RAM or disk, increasing the probability that a given request finds a cache hit. Compression adds minimal overhead compared to the prefill computation it saves.

## Disaggregated Prefill

For large-scale deployments, LMCache supports disaggregated prefill architectures in which prefill computation is offloaded to dedicated nodes separate from the decode servers. The system then transfers the resulting KV cache via P2P mechanisms to the decode server, allowing each node type to be scaled independently based on workload characteristics.
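The vLLM integration is typically enabled by pointing vLLM at LMCache's KV-connector and supplying a small LMCache config. The flag and key names below follow the pattern in recent LMCache/vLLM releases, but option names evolve between versions, so treat this as a sketch and verify against the current documentation.

```shell
# Sketch of wiring LMCache into vLLM via the KV-connector interface.
# Model name and config values are example choices, not recommendations.

# lmcache_config.yaml might contain:
#   chunk_size: 256            # tokens per cached KV chunk
#   local_cpu: true            # enable the CPU RAM cache tier
#   max_local_cpu_size: 5.0    # CPU cache budget in GB

LMCACHE_CONFIG_FILE=lmcache_config.yaml \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'
```

The `kv_role` setting controls whether an instance produces KV caches, consumes them, or both, which is also how the disaggregated prefill topology described below is expressed.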
## Community and Adoption

With 7,400 GitHub stars, 963 forks, and 28 releases as of March 2026, LMCache has established itself as the standard KV cache optimization layer for production vLLM deployments. The project maintains bi-weekly community meetings and an active Discord, and its Apache 2.0 license makes it suitable for commercial applications without restriction.