Hierarchical KV Cache Architecture
Definition
A four-tier storage architecture for the LLM-inference KV cache, layering: **(L1)** the active working-set KV cache in GPU HBM; **(L2)** pinned CPU DRAM as a hot intermediary across the PCIe bus; **(L3)** local NVMe (often with GPUDirect Storage) for long-context payloads that exceed DRAM; **(L4)** a remote/distributed tier (Mooncake's pooled cluster DRAM+NVMe or S3-compatible object storage) as the durable, globally accessible repository. The architecture pairs with chunked-prefetch logic that uses idle intervals in the inference queue to stage required prefix caches from L3/L4 up to L1/L2 *before* the compute step demands them.
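The lookup-and-stage behaviour described above can be sketched in a few lines. This is a toy model only: the `TieredKVCache` class, its method names, and the use of plain dicts for each tier are illustrative assumptions, not the API of LMCache, Mooncake, or NIXL. Real systems move tensors over PCIe, GPUDirect Storage, or RDMA and also handle eviction, which is omitted here.

```python
from typing import Dict, List, Optional, Tuple

# Fastest tier first; the names are illustrative labels for this sketch.
TIER_NAMES = ["L1_HBM", "L2_DRAM", "L3_NVME", "L4_REMOTE"]


class TieredKVCache:
    """Toy four-tier KV cache; each tier is a plain dict keyed by chunk id."""

    def __init__(self) -> None:
        self.tiers: List[Dict[str, bytes]] = [{} for _ in TIER_NAMES]

    def put(self, key: str, value: bytes, tier: int) -> None:
        """Store a chunk directly in the given tier (0 = L1 ... 3 = L4)."""
        self.tiers[tier][key] = value

    def locate(self, key: str) -> Optional[int]:
        """Return the index of the fastest tier holding `key`, or None."""
        for idx, tier in enumerate(self.tiers):
            if key in tier:
                return idx
        return None

    def get(self, key: str) -> Optional[Tuple[bytes, str]]:
        """Read a chunk and copy it up to L1 so the next decode step hits HBM.

        Returns the value and the name of the tier that served the read.
        """
        idx = self.locate(key)
        if idx is None:
            return None
        value = self.tiers[idx][key]
        self.tiers[0][key] = value
        return value, TIER_NAMES[idx]

    def prefetch(self, keys: List[str], target: int = 1) -> List[str]:
        """Stage chunks from slower tiers up to `target` (default L2).

        Intended to be called while the inference queue is idle.
        Returns the keys that were actually staged.
        """
        staged = []
        for key in keys:
            idx = self.locate(key)
            if idx is not None and idx > target:
                self.tiers[target][key] = self.tiers[idx][key]
                staged.append(key)
        return staged


cache = TieredKVCache()
cache.put("prefix-0", b"kv-bytes", tier=3)   # chunk starts in the remote tier
cache.prefetch(["prefix-0"])                 # staged to L2 during idle time
value, served_from = cache.get("prefix-0")   # read is served from L2, not L4
```

Because `prefetch` has already copied the chunk into DRAM, the later `get` never touches the remote tier on the critical path, which is the point of staging ahead of the compute step.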
Single-tier KV-cache management forces an unwinnable trade-off: either keep everything in HBM, where capacity hits a wall at tens of GB per accelerator, or fetch on demand from slower tiers, where decode-stage latency explodes. The hierarchical pattern lets each tier do what it is good at: HBM serves the immediate decode step, DRAM serves the next-1-second window, NVMe serves the next-1-minute window, and S3-compatible object storage serves the multi-session durable archive. With chunked prefetch hiding the L4→L1 staging behind GPU idle time, the architecture can deliver near-HBM-latency reads for cache sizes orders of magnitude larger than HBM capacity.
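A rough calculation shows how quickly HBM capacity becomes the limit. The sketch below uses the standard per-token KV-cache size for a transformer with grouped-query attention (keys plus values, per layer, per KV head). The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and the 128K-token context are hypothetical values chosen for illustration, not figures from the sources cited here.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache stored per token.

    The leading factor of 2 covers the separate key and value tensors
    that each layer keeps.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(context_tokens: int, **model: int) -> float:
    """KV-cache size in GiB for one sequence of `context_tokens` tokens."""
    return context_tokens * kv_bytes_per_token(**model) / 2**30


# Hypothetical model shape, assumed for illustration only.
model = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_bytes_per_token(**model)      # 131072 bytes = 128 KiB
one_seq = kv_cache_gib(128 * 1024, **model)  # 16.0 GiB for a single sequence
```

Under these assumptions a single 128K-token sequence already consumes 16 GiB, so a handful of concurrent long-context sessions would exceed the tens of GB available per accelerator.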
Recent developments
- LMCache, Mooncake, and NIXL together provide the connective software. LMCache spans L1-L4 with chunked transfer and prefetch heuristics; Mooncake pools L4 DRAM and NVMe across cluster nodes with RDMA aggregation; NIXL is the unified transfer primitive that moves data between tiers. Per the LMCache tech report and arXiv 2407.00079 (Mooncake).
- A 256-token chunk size is the operational default. Individual cache pages are too small for efficient S3 PUT/GET, so middleware groups them into ~256-token chunks for parallel-friendly object-storage I/O. Per arXiv 2510.09665 (LMCache).
- CacheGen and SnapMLA add compression at the L4 boundary. Compressing the KV cache 3-10x before sending it to S3 makes the L4 tier economically viable for long-context production serving. Per the LMCache + CacheGen blog.
- NVIDIA ICMS adds a hardware L3.5 tier. BlueField-4-managed CMX NVMe enclosures act as a pod-scale shared tier between local NVMe (L3) and remote object storage (L4), with DPU-orchestrated zero-copy access. Per NVIDIA Developer (BlueField-4 ICMS).
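The 256-token chunking mentioned in the list above can be illustrated with a small sketch. It splits a token sequence into fixed-size chunks and gives each chunk a key that chains the previous chunk's hash, so a key identifies the entire prefix up to that chunk and identical prefixes map to identical object names. The `chunk_keys` function, the SHA-256 chaining, and the choice to leave a trailing partial chunk unkeyed are assumptions made for illustration; this is not LMCache's actual key scheme.

```python
import hashlib
from typing import List, Sequence, Tuple

CHUNK_TOKENS = 256  # chunk size quoted in the list above


def chunk_keys(token_ids: Sequence[int],
               chunk_tokens: int = CHUNK_TOKENS) -> List[Tuple[str, int, int]]:
    """Split a token sequence into full chunks and return (key, start, end).

    Each key hashes the previous chunk's digest together with this chunk's
    tokens, so it depends on every token from position 0 up to `end`.
    A trailing partial chunk gets no key (an assumption of this sketch:
    it would stay in the faster tiers only).
    """
    keys: List[Tuple[str, int, int]] = []
    prev_digest = b""
    full_len = len(token_ids) - len(token_ids) % chunk_tokens
    for start in range(0, full_len, chunk_tokens):
        end = start + chunk_tokens
        h = hashlib.sha256()
        h.update(prev_digest)
        h.update(b",".join(str(t).encode() for t in token_ids[start:end]))
        prev_digest = h.digest()
        keys.append((h.hexdigest(), start, end))
    return keys


# Two requests sharing their first 512 tokens produce the same two keys,
# so the second request can fetch those chunks instead of recomputing them.
request_a = list(range(600))
request_b = list(range(512)) + [7, 7, 7]
shared = [k for k, _, _ in chunk_keys(request_a)] == \
         [k for k, _, _ in chunk_keys(request_b)]
```

Each key can then serve as an object name for one PUT or GET, which lets a storage client issue the chunk transfers in parallel.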