Definition

What it is

A high-performance distributed **KV-cache offloading** layer for LLM inference, built to maximize prefix reuse across vLLM and other inference engines. Repository: [github.com/LMCache/LMCache](https://github.com/LMCache/LMCache). During prefill, LMCache captures the KV tensors computed for a prompt's prefix tokens, persists them to a tiered, distributed hierarchy (CPU memory → local NVMe → S3-compatible object storage), and loads them back when the same prefix recurs instead of recomputing them, which lowers Time-to-First-Token (TTFT) for repeated long-context queries. The **L2 Serde components** explicitly support S3 backends for datacenter-wide KV-cache persistence.
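The lookup path described above can be sketched in a few lines. This is a minimal illustration, not LMCache's actual API: `TieredKVCache`, `chunk_keys`, and `CHUNK_SIZE` are hypothetical names, plain dictionaries stand in for the CPU, NVMe, and S3 tiers, and the 4-token chunk size is chosen only to keep the example small.

```python
import hashlib
from typing import Dict, List, Optional, Sequence, Tuple

CHUNK_SIZE = 4  # hypothetical; tokens per cached chunk, tiny for illustration


def chunk_keys(tokens: Sequence[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """One key per full chunk; each key hashes the entire prefix up to that chunk."""
    keys: List[str] = []
    hasher = hashlib.sha256()
    full = len(tokens) - len(tokens) % chunk_size  # ignore a trailing partial chunk
    for start in range(0, full, chunk_size):
        chunk = tokens[start:start + chunk_size]
        hasher.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(hasher.copy().hexdigest())
    return keys


class TieredKVCache:
    """Toy stand-in for a CPU -> NVMe -> S3 hierarchy (fastest tier first)."""

    def __init__(self, tier_names: Sequence[str] = ("cpu", "nvme", "s3")) -> None:
        self.tiers: List[Tuple[str, Dict[str, bytes]]] = [(n, {}) for n in tier_names]

    def put(self, key: str, value: bytes, tier: str = "s3") -> None:
        for name, store in self.tiers:
            if name == tier:
                store[key] = value
                return
        raise KeyError(tier)

    def get(self, key: str) -> Optional[Tuple[str, bytes]]:
        """Return (tier that served the hit, value), promoting the entry upward."""
        for i, (name, store) in enumerate(self.tiers):
            if key in store:
                value = store[key]
                for _, faster in self.tiers[:i]:
                    faster[key] = value  # copy into every faster tier
                return name, value
        return None

    def longest_cached_prefix(self, tokens: Sequence[int]) -> int:
        """Number of leading tokens whose KV chunks are all present in some tier."""
        hit_tokens = 0
        for idx, key in enumerate(chunk_keys(tokens)):
            if self.get(key) is None:
                break
            hit_tokens = (idx + 1) * CHUNK_SIZE
        return hit_tokens


cache = TieredKVCache()
prompt_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for key in chunk_keys(prompt_a):
    cache.put(key, b"kv-bytes", tier="s3")  # pretend prefill wrote these to S3

prompt_b = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]  # shares the first 8 tokens
reused = cache.longest_cached_prefix(prompt_b)
first_hit_tier = cache.get(chunk_keys(prompt_a)[0])[0]
```

Because each key covers the whole prefix up to its chunk, a hit on chunk *n* implies that all earlier chunks match as well, so only the tokens after the last hit need a real prefill. The promotion step in `get` is one possible policy, assumed here for illustration.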

Why it exists

As prompts grow to millions of tokens, the "prefill tax" (the compute required to process the input sequence before generating the first token) becomes prohibitive, and every request that repeats a prefix pays it again. Caching the computed KV tensors replaces that repeated compute with a read. The read is not free: per NetApp's framing, moving a bit through the memory hierarchy costs about an order of magnitude more energy than the computation performed on it, which is why tier placement matters. LMCache makes the cache durable, distributed, and S3-resident so the cost of computing each prefix is amortized across the entire inference fleet.
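To give a sense of how much data such a cache has to hold and move, the standard KV-cache size estimate is two tensors (K and V) per layer, each of `num_kv_heads × head_dim` elements per token. The sketch below applies that formula to a hypothetical model shape; the numbers are illustrative assumptions and do not come from LMCache or from any specific model card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions used only for this estimate."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int  # 2 for fp16/bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_element)


def kv_bytes_for_prefix(shape: ModelShape, num_tokens: int) -> int:
    return kv_bytes_per_token(shape) * num_tokens


# Assumed shape: 32 layers, 8 KV heads, head dimension 128, fp16 storage.
shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)
per_token = kv_bytes_per_token(shape)                      # 131072 bytes = 128 KiB
prefix_gib = kv_bytes_for_prefix(shape, 100_000) / 2**30   # about 12.2 GiB
```

Under these assumptions a single 100,000-token prefix occupies roughly 12 GiB, which suggests why a cache limited to GPU or CPU memory fills quickly and why slower, larger tiers become attractive.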

Primary use cases

Disaggregated prefill architectures, multi-node P2P CPU memory sharing for KV cache, persistent KV-cache pools backed by S3, prefix-reuse acceleration for vLLM deployments, CoreWeave-Cohere-style enterprise LLM serving.

Recent developments

Latest signals

Latest release: v0.4.7 (current stable, June 13, 2026). The 0.4.8 line is still pre-release (rc). Per LMCache/LMCache releases.
LMCache paper at arXiv 2510.09665 formalized the enterprise KV-cache layer. "LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference" gives an academic framing for the production-cache patterns LMCache has shipped and serves as a first-class reference for the engineering community. Per arXiv 2510.09665.
Now supports both vLLM and SGLang. LMCache has extended beyond vLLM to also intercept and persist KV caches from SGLang, making it the cross-engine KV-cache substrate for the two leading open-source inference engines. Per arXiv 2510.09665.
NetApp ONTAP S3 is the canonical shared-storage backend. NetApp published an end-to-end deployment guide for vLLM + LMCache + ONTAP S3, demonstrating both engines hitting the same offloaded KV-cache entries via S3. NetApp is the first major enterprise-storage vendor to ship a reference architecture. Per NetApp, "KV Cache Offloading with vLLM, LMCache, and ONTAP S3."
vLLM production-stack ships LMCache as the canonical KV-cache offloader. vllm-project/production-stack includes LMCache and the KV-cache-offloading tutorial as a first-class deployment pattern, which confirms LMCache as the reference for "vLLM + persistent KV cache." Per GitHub, vllm-project/production-stack KV cache offload tutorial.
KServe also supports LMCache for Hugging Face vLLM backends. KServe, the Kubernetes-native model-serving project, added LMCache integration for the HF vLLM runtime, so KV cache offloading now sits inside the standard K8s ML-serving stack. Per KServe, "KV Cache Offloading with Huggingface vLLM Backend."
Tiered storage: CPU memory → local NVMe → S3, justified by the energy cost of data movement. Production deployments use the full tiered hierarchy. NetApp's argument is that moving a single bit through memory costs an order of magnitude more energy than computing on it locally, so a KV tensor computed once and shared through tiers down to S3 amortizes that cost across the entire fleet. Per NetApp, "Engineering Inference: KV Cache, Shared Storage, Economics of AI."