KV-Cache Disaggregation
Definition
An architectural pattern that decouples LLM inference compute from inference **state** (the KV-cache), so that this state can live in tiered host-attached and network-attached memory (CPU DRAM, CXL DRAM, NVMe, S3-compatible object storage) rather than solely in GPU HBM. KV-Cache Disaggregation encompasses three sub-patterns: prefill/decode pool separation (compute-phase disaggregation), tiered KV-cache memory (storage-tier disaggregation), and cross-node prefix-cache federation (geographic disaggregation).
GPU HBM is the scarcest, most expensive memory tier in an LLM-serving stack. As context windows grew from 4k to 1M+ tokens and multi-agent workflows generated large volumes of derived state, KV-cache demand came to exceed HBM capacity by 10x to 100x. Keeping all KV-cache in HBM is economically infeasible, while serving it only from CPU DRAM throttles decode throughput, because every decode step must read the sequence's cached keys and values over a link with far less bandwidth than HBM. Disaggregation is the structural fix: HBM holds the hot working set, DRAM holds warm prefix caches, NVMe holds long-tail prefixes, and S3-compatible object storage holds cross-session episodic context. This tiering is what makes the economics of long-context serving work.
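The capacity pressure described above can be made concrete with a back-of-the-envelope calculation. The sketch below uses the standard per-token KV-cache size for a transformer with grouped-query attention (two tensors, key and value, per layer per KV head). The model shape and the 80 GB HBM figure are hypothetical values chosen only for illustration; they are not taken from any source cited in this entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, chosen only for illustration."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int  # 2 for FP16/BF16


def kv_bytes_per_token(m: ModelShape) -> int:
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_elem


def kv_bytes(m: ModelShape, num_tokens: int) -> int:
    # KV-cache size grows linearly with the number of cached tokens.
    return kv_bytes_per_token(m) * num_tokens


# Assumed shape: 80 layers, 8 KV heads, head dimension 128, 16-bit values.
shape = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_bytes_per_token(shape)      # 327,680 bytes (320 KiB) per token
one_million = kv_bytes(shape, 1_000_000)   # 327,680,000,000 bytes (about 305 GiB)

HBM_BYTES = 80 * 10**9                     # assumed: one accelerator with 80 GB of HBM
ratio = one_million / HBM_BYTES            # about 4.1x, for a single sequence

print(f"{per_token} B/token, {one_million / 1024**3:.1f} GiB at 1M tokens, "
      f"{ratio:.1f}x one accelerator's HBM")
```

Under these assumptions a single 1M-token sequence already needs roughly four times the HBM of one accelerator, before model weights, activations, or any concurrent sessions are counted, which is why the lower tiers have to absorb most of the cached state.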
Recent developments
- The Mooncake architecture paper formalized the pattern. Moonshot AI's paper "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving" is the canonical reference; every major serving runtime has now adopted at least compute-phase disaggregation (prefill/decode pool separation). Per arXiv 2407.00079 (Mooncake).
- LMCache + ObjectCache + Mooncake provide the connective software tissue. LMCache handles compression and chunked prefetch between tiers; ObjectCache provides layerwise S3 retrieval; Mooncake orchestrates RDMA-accelerated cross-node prefix shipment. Per LMCache tech report.
- NVIDIA ICMS + BlueField-4 elevate the pattern to rack-scale. ICMS treats inference context as a rack-scale resource pool rather than per-accelerator state, with BlueField-4 DPUs orchestrating S3-over-RDMA access to dedicated CMX NVMe enclosures. Per NVIDIA Developer — BlueField-4 ICMS announcement.
- The economic argument is now settled. Long-context serving (1M+ tokens) at competitive cost per token is only feasible with disaggregated KV-cache; the pattern is no longer a research curiosity but a production requirement for any frontier-model serving operation. Per LMCache tech report.
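The tiering and prefix reuse described in this entry can be sketched as a small in-memory model. This is a minimal illustration under stated assumptions, not the API of LMCache, ObjectCache, or Mooncake: the names `Tier`, `TieredKVCache`, and `chunk_keys` are hypothetical, plain Python dictionaries stand in for HBM, DRAM, and NVMe, string blobs stand in for KV tensors, and LRU eviction with demotion to the next slower tier is an assumed policy.

```python
import hashlib
from collections import OrderedDict

CHUNK_TOKENS = 4  # deliberately tiny so the example is easy to follow


def chunk_keys(token_ids, chunk_tokens=CHUNK_TOKENS):
    """Return one key per full chunk; each key hashes the entire prefix so far."""
    keys = []
    h = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_tokens
    for start in range(0, full, chunk_tokens):
        for t in token_ids[start:start + chunk_tokens]:
            h.update(t.to_bytes(4, "little"))
        keys.append(h.hexdigest())  # running digest, so identical prefixes share keys
    return keys


class Tier:
    """One storage tier with a fixed entry capacity and LRU ordering."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> KV blob, least recently used first

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key]
        return None

    def put(self, key, blob):
        """Insert an entry; return the evicted (key, blob) pair, or None."""
        self.entries[key] = blob
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)
        return None


class TieredKVCache:
    def __init__(self, tiers):
        self.tiers = tiers  # ordered fastest first

    def put(self, key, blob, level=0):
        # Evictions cascade downward; the slowest tier drops its victim.
        while level < len(self.tiers):
            evicted = self.tiers[level].put(key, blob)
            if evicted is None:
                return
            key, blob = evicted
            level += 1

    def get(self, key):
        for level, tier in enumerate(self.tiers):
            blob = tier.get(key)
            if blob is not None:
                if level > 0:
                    del tier.entries[key]
                    self.put(key, blob)  # promote the hit to the fastest tier
                return blob, tier.name
        return None, None

    def longest_cached_prefix(self, token_ids):
        """Number of leading tokens whose KV chunks are cached in some tier."""
        hits = 0
        for key in chunk_keys(token_ids):
            blob, _ = self.get(key)
            if blob is None:
                break
            hits += 1
        return hits * CHUNK_TOKENS


cache = TieredKVCache([Tier("hbm", 2), Tier("dram", 2), Tier("nvme", 4)])
prompt = list(range(12))  # three full chunks
for i, key in enumerate(chunk_keys(prompt)):
    cache.put(key, f"kv-chunk-{i}")

first_key = chunk_keys(prompt)[0]
first_lookup = cache.get(first_key)   # found in "dram", then promoted
second_lookup = cache.get(first_key)  # now served from "hbm"
reusable = cache.longest_cached_prefix(prompt + [99, 100, 101, 102])
```

Keying each chunk by a hash of the whole prefix, rather than of the chunk alone, means a chunk is reused only when everything before it also matches, which is the condition under which cached keys and values remain valid. A real system would move tensors over PCIe, RDMA, or object-storage requests instead of dictionary operations, and would prefetch chunks ahead of the decode step rather than fetching on demand.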