Guide 38

KV-Cache Persistence to S3 — LMCache, SGLang, and Mooncake

Problem Framing

As LLM prompts grow into hundreds of thousands or millions of tokens, the Prefill Tax (the compute cost of processing input before generating the first output token) dominates serving cost. KV-cache persistence removes redundant prefill by storing computed KV tensors after the first pass and fetching them on later requests that share the same prefix. Doing this durably across an inference fleet means keeping those tensors in shared storage, typically S3-compatible object storage. This guide maps the 2026 KV-cache persistence stack.

Relevant Nodes

  • Topics: AI Memory Infrastructure, Inference Locality, GPU + Object Storage Convergence
  • Technologies: LMCache, SGLang, Mooncake, Vestige, NIXL (NVIDIA Inference Transfer Library), Inference Context Memory Storage (ICMS)
  • Standards: S3 API
  • Architectures: Tiered Storage, Separation of Storage and Compute
  • Pain Points: Prefill Tax, Memory Wall, High Cloud Inference Cost

Decision Path

  1. Quantify the prefill savings opportunity:

    • Measure prefill-to-decode compute ratio on your workload. If prefill is >50% of compute, KV-cache persistence is high-leverage.
    • Identify prefix overlap — system prompts, few-shot examples, persistent agent context. The more prefix that recurs, the more KV-cache persistence helps.
    • Workloads with low prefix reuse (one-shot queries, no shared system prompt) get little benefit and can skip KV-cache persistence.
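
The two measurements in step 1 can be sketched in a few lines of Python. This is a minimal illustration, not a profiler: `prefill_share` and `reusable_prefix_fraction` are hypothetical helper names, and the inputs (per-phase GPU seconds and tokenized prompts) are assumed to come from your own serving logs.

```python
def prefill_share(prefill_seconds: float, decode_seconds: float) -> float:
    """Fraction of total compute time spent in prefill."""
    total = prefill_seconds + decode_seconds
    return prefill_seconds / total if total else 0.0


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_prefix_fraction(prompts: list[list[int]]) -> float:
    """Share of all prompt tokens that repeat a prefix of an earlier prompt.

    Assumes an ideal cache that never evicts, so this is an upper bound
    on the prefill work a persistent KV cache could skip.
    """
    seen: list[list[int]] = []
    reusable = 0
    total = 0
    for tokens in prompts:
        best = max((common_prefix_len(tokens, prev) for prev in seen), default=0)
        reusable += best
        total += len(tokens)
        seen.append(tokens)
    return reusable / total if total else 0.0
```

For the prompts `[1, 2, 3, 4]`, `[1, 2, 3, 9]` and `[7, 8]`, three of the ten tokens repeat an earlier prefix, so `reusable_prefix_fraction` returns 0.3. The pairwise comparison is quadratic in the number of prompts, so run it on a sample of the log rather than the full history.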
  2. Option A — LMCache (intercept and offload):

    • Best for: vLLM-based deployments where prefix-reuse is significant and the KV-cache pool needs to survive across nodes and restarts.
    • Architecture: Intercepts prefix tokens during prefill, serializes computed KV tensors via L2 Serde components, and writes them to a distributed hierarchy (CPU memory → local NVMe → S3-compatible object storage). When the same prefix recurs, it fetches the serialized tensors from the fastest tier that holds them, falling back to S3, instead of recomputing them.
    • Integration: vLLM integrates with LMCache via dynamic connectors. Production deployment: CoreWeave + Cohere.
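
The intercept-and-offload pattern can be sketched as a chunk-keyed lookup across tiers. This is an illustration of the idea only and does not use LMCache's actual API: `chunk_keys` and `TieredKVStore` are hypothetical names, plain dicts stand in for CPU memory, NVMe and S3, the 4-token chunk size is chosen for readability, and hashing the whole prefix into each chunk key is an assumption about how such a store might be keyed.

```python
import hashlib

CHUNK_TOKENS = 4  # illustrative only; real chunks would be much larger


def chunk_keys(tokens, chunk_tokens=CHUNK_TOKENS):
    """One key per full chunk. Each key hashes the entire prefix up to and
    including that chunk, so a key only matches when all earlier tokens match."""
    h = hashlib.sha256()
    keys = []
    for end in range(chunk_tokens, len(tokens) + 1, chunk_tokens):
        chunk = tokens[end - chunk_tokens:end]
        h.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(h.copy().hexdigest())
    return keys


class TieredKVStore:
    def __init__(self):
        # Fastest tier first; dict insertion order defines the search order.
        self.tiers = {"cpu": {}, "nvme": {}, "s3": {}}

    def put(self, key, blob, tier="s3"):
        self.tiers[tier][key] = blob

    def get(self, key):
        """Return (tier_name, blob) from the fastest tier holding the key."""
        for name, tier in self.tiers.items():
            if key in tier:
                blob = tier[key]
                self.tiers["cpu"][key] = blob  # promote so the next hit is fast
                return name, blob
        return None, None

    def lookup_prefix(self, tokens):
        """Number of leading tokens whose serialized KV chunks are stored."""
        hit = 0
        for i, key in enumerate(chunk_keys(tokens)):
            if self.get(key)[0] is None:
                break
            hit = (i + 1) * CHUNK_TOKENS
        return hit
```

A serving engine would prefill only the tokens after `lookup_prefix(tokens)` and load the rest from the store. Tokens past the last full chunk are always recomputed in this sketch.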
  3. Option B — SGLang with RadixAttention:

    • Best for: Workloads with deeply structured prefix overlap (multi-tenant serving with shared system prompts, function-calling pipelines with shared schema prefixes).
    • Architecture: Uses a radix tree to identify and share KV-cache state across requests with overlapping prefixes. Evicts cold cache lines to remote storage (S3-compatible).
    • When it wins: When overlap goes beyond a single shared system prompt into branching, multi-level prefixes (for example, a shared schema followed by per-tenant instructions), which is the case the radix tree is built to exploit.
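
The prefix-sharing idea can be sketched with a token-level prefix tree. This is a deliberate simplification, not SGLang's implementation: it is an uncompressed trie (a radix tree would merge single-child chains into one edge), it stores no KV tensors, and it has no eviction. `PrefixNode` and `PrefixTree` are hypothetical names.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixNode


class PrefixTree:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        """Record a request's token sequence as a path from the root."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())

    def match_len(self, tokens):
        """Length of the longest stored prefix of `tokens`.

        In a serving engine this is the number of tokens whose KV state
        could be reused instead of recomputed.
        """
        node = self.root
        n = 0
        for t in tokens:
            node = node.children.get(t)
            if node is None:
                break
            n += 1
        return n
```

After inserting `[1, 2, 3, 4]` and `[1, 2, 5]`, the two sequences share the path for `[1, 2]` and branch below it, so a new request `[1, 2, 5, 6]` matches three tokens even though it never appeared before.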
  4. Option C — Mooncake (disaggregated prefill at scale):

    • Best for: High-scale LLM serving where prefill compute and decode compute should run on different hardware (Moonshot AI's serving pattern for Kimi).
    • Architecture: Formal disaggregated prefill — separate prefill compute pools from decode compute pools, with KV-cache state transferred between them via DRAM, NVMe, or S3-compatible object storage.
    • When it wins: When workload economics favor different hardware for prefill (high compute, low memory) vs decode (high memory, lower compute) — typical for very-large-context serving.
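
Whether disaggregation pays off depends heavily on how much KV state has to move between pools. A rough sizing sketch follows, assuming standard multi-head or grouped-query attention where each layer stores one K and one V tensor per token. The model shape and the 10 GB/s link speed in the usage note are hypothetical, and `kv_cache_bytes` and `transfer_seconds` are illustrative helpers, not part of Mooncake.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    """Size of the KV cache for one sequence.

    The leading 2 counts the K tensor and the V tensor in every layer;
    dtype_bytes=2 corresponds to fp16 or bf16.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


def transfer_seconds(num_bytes, gigabytes_per_second):
    """Time to move the cache over a link, ignoring latency and protocol overhead."""
    return num_bytes / (gigabytes_per_second * 1e9)
```

For a hypothetical 32-layer model with 8 KV heads of dimension 128 in fp16, each token costs 131,072 bytes (128 KiB), so a 100,000-token context is about 13.1 GB and takes roughly 1.3 seconds to move at 10 GB/s. Comparing that figure with the measured time to re-run prefill on the same context indicates which transfer tier is worth using.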
  5. Add NIXL + ICMS for tier-3.5 KV-cache pools (NVIDIA stack):

    • For NVIDIA-anchored deployments, the NVIDIA Inference Transfer Library (NIXL) orchestrates KV-cache movement across tiers; Inference Context Memory Storage (ICMS / CMX) is the dedicated hardware tier between local NVMe and cold S3.
    • Together they let inference engines spill KV-cache from GPU HBM → CPU DRAM → CXL pool → ICMS → S3 automatically based on access patterns.
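
The access-pattern-driven spill described above can be sketched as a chain of least-recently-used (LRU) tiers. This is a conceptual model only and does not call NIXL or any NVIDIA API: `SpillingCache` is a hypothetical class, capacities are counted in entries rather than bytes, and LRU is an assumed policy standing in for whatever heuristics a real stack applies.

```python
from collections import OrderedDict

# Fastest to slowest, following the tier chain named in the text.
TIER_ORDER = ["hbm", "dram", "cxl", "icms", "s3"]


class SpillingCache:
    def __init__(self, capacities):
        # capacities: max entries per tier; the last tier is treated as unbounded.
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in TIER_ORDER}

    def put(self, key, value, tier_index=0):
        """Insert into a tier and demote its LRU entries downward on overflow."""
        name = TIER_ORDER[tier_index]
        tier = self.tiers[name]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        is_last = tier_index == len(TIER_ORDER) - 1
        cap = None if is_last else self.capacities.get(name)
        while cap is not None and len(tier) > cap:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self.put(old_key, old_value, tier_index + 1)

    def get(self, key):
        """Return (tier_found_in, value) and promote the entry back to HBM."""
        for name in TIER_ORDER:
            if key in self.tiers[name]:
                value = self.tiers[name].pop(key)
                self.put(key, value)
                return name, value
        return None, None
```

With one slot per tier, inserting `a`, `b` and `c` in that order leaves `c` in HBM, `b` in DRAM and `a` in the CXL pool. Reading `a` then pulls it back to HBM and pushes the other two down one tier each.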

What Changed Over Time

  • 2024: KV-cache was ephemeral — discarded between requests. Inference engines re-ran prefill on every call.
  • Mid-2025: LMCache published as an experimental layer demonstrating distributed KV-cache persistence (CoreWeave + Cohere production case study).
  • Late 2025: SGLang shipped RadixAttention, which depends on remote storage backends for evictions, recognizing that KV-cache state belongs in object storage at scale.
  • 2026: Mooncake formalized disaggregated prefill in open-source form. NIXL + ICMS framed the hardware-software stack around durable KV-cache pools.
  • Forward: KV-cache will become a first-class resource type with its own SLOs, observability, and cost-tracking — likely a dedicated catalog node alongside lakehouse data.

Sources