Hierarchical KV Cache Architecture

Definition

What it is

A four-tier storage hierarchy for the KV cache used in LLM inference:

  • **(L1)** GPU HBM holds the active working set.
  • **(L2)** Pinned CPU DRAM acts as a hot intermediary, reached across the PCIe bus.
  • **(L3)** Local NVMe (often with GPUDirect Storage) holds long-context payloads that exceed DRAM.
  • **(L4)** A remote or distributed tier (Mooncake's pooled cluster DRAM and NVMe, or S3-compatible object storage) serves as the durable, globally accessible repository.

The hierarchy is paired with chunked-prefetch logic that uses idle intervals in the inference queue to stage the required prefix caches from L3/L4 up to L1/L2 *before* the compute step needs them.
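
The tier lookup and prefetch behaviour described above can be sketched in a few lines. This is a toy in-memory model, not the API of LMCache, Mooncake, or NIXL: the class `TieredKVCache`, its method names, the LRU demotion policy, and the per-tier capacities are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy four-tier cache: index 0 = L1 (HBM) ... index 3 = L4 (remote).

    Hypothetical sketch; real systems move tensors, not Python objects.
    """

    def __init__(self, capacities):
        # capacities: max chunks per tier; None means unbounded (e.g. L4).
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def put(self, key, value, tier):
        store = self.tiers[tier]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used
        cap = self.capacities[tier]
        # Demote least-recently-used chunks to the next slower tier.
        while cap is not None and len(store) > cap:
            old_key, old_val = store.popitem(last=False)
            if tier + 1 < len(self.tiers):
                self.put(old_key, old_val, tier + 1)

    def locate(self, key):
        """Return the index of the fastest tier holding key, or None."""
        for i, store in enumerate(self.tiers):
            if key in store:
                return i
        return None

    def get(self, key):
        """Read a chunk, promoting it to L1 if it was found lower down."""
        tier = self.locate(key)
        if tier is None:
            return None
        value = self.tiers[tier].pop(key)
        self.put(key, value, 0)
        return value

    def prefetch(self, keys, target_tier=1):
        """Stage chunks from slower tiers up to target_tier ahead of use."""
        staged = []
        for key in keys:
            tier = self.locate(key)
            if tier is not None and tier > target_tier:
                value = self.tiers[tier].pop(key)
                self.put(key, value, target_tier)
                staged.append(key)
        return staged
```

In this sketch a scheduler would call `prefetch` during an idle interval with the chunk keys of queued requests, so that the later `get` on the decode path finds them in L2 rather than L4.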

Why it exists

Single-tier KV-cache management forces a losing trade-off: either keep everything in HBM, where capacity tops out at tens to low hundreds of GB per accelerator, or fetch on demand from slower tiers, where decode-stage latency balloons. The hierarchical pattern lets each tier do what it is good at: HBM serves the immediate decode step, DRAM serves roughly the next second of work, NVMe serves roughly the next minute, and the L4 tier (for example S3) serves as the durable multi-session archive. When chunked prefetch hides the L4→L1 staging behind GPU idle time, reads approach HBM latency for caches orders of magnitude larger than HBM capacity.
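
A quick back-of-the-envelope calculation shows why HBM alone runs out. The per-token formula below is the standard one for attention KV caches (one key and one value vector per layer and KV head). The model shape used here (80 layers, 8 KV heads, head dimension 128, 2-byte elements) is an illustrative assumption, not a figure taken from this page.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_bytes_for_context(context_tokens, **model):
    return context_tokens * kv_bytes_per_token(**model)


# Assumed model shape, for illustration only.
model = dict(num_layers=80, num_kv_heads=8, head_dim=128)

per_token = kv_bytes_per_token(**model)
one_seq = kv_bytes_for_context(128 * 1024, **model)

print(per_token / 1024, "KiB per token")                   # 320.0
print(one_seq / 2**30, "GiB for one 128k-token sequence")  # 40.0
```

Under these assumptions a single 128k-token sequence needs about 40 GiB of KV cache, so a handful of concurrent long-context sessions already exceeds what one accelerator can hold alongside the model weights.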

Recent developments

Latest signals
  • LMCache, Mooncake, and NIXL together provide the connective software. LMCache spans L1-L4 with chunked transfer and prefetch heuristics; Mooncake pools L4 DRAM and NVMe across cluster nodes with RDMA aggregation; NIXL is the unified transfer primitive that moves data between tiers. Per the LMCache tech report and arXiv 2407.00079 (Mooncake).
  • A 256-token chunk size is the operational default. Individual cache pages are too small for efficient S3 PUT/GET, so middleware groups them into ~256-token chunks that suit parallel object-storage I/O. Per arXiv 2510.09665 (LMCache).
  • CacheGen and SnapMLA add compression at the L4 boundary. Compressing the KV cache 3-10x before sending it to S3 makes the L4 tier economically viable for long-context production serving. Per the LMCache + CacheGen blog.
  • NVIDIA ICMS adds a hardware L3.5 tier. BlueField-4-managed CMX NVMe enclosures act as a pod-scale shared tier between local NVMe (L3) and remote object storage (L4), with DPU-orchestrated zero-copy access. Per NVIDIA Developer (BlueField-4 ICMS).
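
The 256-token chunking mentioned above can be illustrated with a minimal sketch. The keying scheme here (a SHA-256 hash chained over each chunk and everything before it, with a trailing partial chunk left uncached) is an assumption chosen so that requests sharing a prefix produce the same object keys. It is not taken from LMCache's source, and `chunk_keys` is a hypothetical helper.

```python
import hashlib

CHUNK_TOKENS = 256  # chunk size quoted in the list above


def chunk_keys(token_ids, chunk_tokens=CHUNK_TOKENS):
    """Yield (key, chunk) for each full chunk of token_ids.

    Each key hashes the chunk together with the previous chunk's digest,
    so a key identifies the whole prefix up to and including that chunk.
    A trailing partial chunk is skipped (assumption: only full chunks
    are worth storing as objects).
    """
    prev = b""
    full = len(token_ids) - len(token_ids) % chunk_tokens
    for start in range(0, full, chunk_tokens):
        chunk = token_ids[start:start + chunk_tokens]
        h = hashlib.sha256(prev)
        h.update(b",".join(str(t).encode() for t in chunk))
        prev = h.digest()
        yield h.hexdigest(), chunk


# Two prompts that share their first 512 tokens share their first two keys.
prompt_a = list(range(600))
prompt_b = list(range(512)) + [9999] * 300
keys_a = [k for k, _ in chunk_keys(prompt_a)]
keys_b = [k for k, _ in chunk_keys(prompt_b)]
print(keys_a == keys_b[:2])
```

Chaining the hash means a chunk is only reused when everything before it matches, which is the condition under which a cached KV prefix is valid for a new request.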