Hierarchical KV-Cache Tier Topology — From GPU HBM to S3
Problem Framing
Single-tier KV-cache management forces an unwinnable trade-off: either keep everything in HBM, where capacity runs out at tens of GB per accelerator, or fetch on demand from slower tiers, where decode-stage latency explodes. The 2026 answer is to extend the memory hierarchy outward: HBM holds the immediate working set, CPU DRAM holds the next-1-second window, local NVMe holds the next-1-minute window, and remote / distributed tiers (Mooncake's pooled cluster DRAM+NVMe or S3-compatible object storage) hold the durable archive. The Memory Wall demanded this; the chunked-prefetch software stack delivers it.
Relevant Nodes
- Topics: AI Memory Infrastructure, LLM Serving
- Technologies: vLLM, TensorRT-LLM, LMCache, Mooncake, NIXL, CacheGen, SnapMLA
- Architectures: Hierarchical KV Cache Architecture, KV-Cache Disaggregation, Prefill-Decode Disaggregation, ObjectCache, Memory Efficient Attention
- Pain Points: Memory Wall, Prefill Tax, KV Cache Memory Footprint
Decision Path
Map the four tiers explicitly.
- L1 — GPU HBM: active working-set KV-cache for the current decode step. PagedAttention-managed via vLLM or TensorRT-LLM.
- L2 — Pinned CPU DRAM: hot intermediary across PCIe; zero-copy candidate for the next 1-second window.
- L3 — Local NVMe (optionally GPUDirect Storage): long-context payloads exceeding DRAM. Provides the next-1-minute window cheaply.
- L4 — Remote / distributed: Mooncake's pooled DRAM+NVMe across cluster nodes OR S3-compatible object storage. The durable, globally accessible archive.
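The four-tier map above can be written down as a small placement table. The sketch below is illustrative only: the `Tier` class, the `place` function, and the 0.05-second stand-in for "the current decode step" are assumptions made for this example, not part of vLLM, LMCache, or Mooncake. Only the 1-second and 1-minute horizons come from the tier descriptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    medium: str
    horizon_s: float  # hold blocks expected to be reused within this many seconds


# L2 and L3 horizons follow the 1-second and 1-minute windows in the text.
# The 0.05 s value for "the current decode step" is an assumed placeholder.
TIERS = (
    Tier("L1", "GPU HBM", 0.05),
    Tier("L2", "pinned CPU DRAM", 1.0),
    Tier("L3", "local NVMe", 60.0),
    Tier("L4", "remote pool or object storage", math.inf),
)


def place(seconds_until_reuse: float) -> Tier:
    """Return the fastest tier whose horizon covers the predicted reuse time."""
    if seconds_until_reuse < 0:
        raise ValueError("seconds_until_reuse must be non-negative")
    for tier in TIERS:
        if seconds_until_reuse <= tier.horizon_s:
            return tier
    return TIERS[-1]  # fallback for inputs such as NaN


if __name__ == "__main__":
    for t in (0.01, 0.5, 30.0, 3600.0):
        tier = place(t)
        print(f"reuse in {t:>7}s -> {tier.name} ({tier.medium})")
```

A real scheduler would derive the reuse estimate from request state and tier capacity rather than a single number; the table only makes the time-window framing concrete.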
Add an L3.5 hardware tier if you're at hyperscale. NVIDIA BlueField-4 + CMX NVMe enclosures act as a pod-scale shared tier between L3 and L4, with DPU-orchestrated S3-over-RDMA. This is ICMS; it changes pod economics by removing CPU bottlenecks on cache movement.
Use a 256-token chunk size as the operational default for L4 writes. Individual KV-cache pages are too small for efficient S3 PUT/GET. LMCache and ObjectCache both group pages into ~256-token chunks. Larger chunks improve write throughput but reduce cache reuse on divergent generation paths — past 512 tokens you lose more from cache misses than you gain from I/O efficiency.
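To make the reuse trade-off concrete, here is a minimal sketch of prefix-chained chunk keys. It assumes a hypothetical key scheme (SHA-256 over the previous chunk's digest plus the current chunk's token IDs); this is not the documented key format of LMCache or ObjectCache. Because each key depends on every token before it, two requests share object keys only for the chunks that lie entirely inside their common prefix, which is why larger chunks lose more reuse at a divergence point.

```python
import hashlib
from typing import List, Sequence

CHUNK_TOKENS = 256  # operational default from the text


def chunk_keys(token_ids: Sequence[int], chunk_tokens: int = CHUNK_TOKENS) -> List[str]:
    """Return one prefix-chained key per full chunk (hypothetical key scheme).

    A trailing partial chunk gets no key; in this sketch it is assumed to stay
    in the faster tiers until it fills up.
    """
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    keys: List[str] = []
    prev_digest = b""
    full_len = len(token_ids) - len(token_ids) % chunk_tokens
    for start in range(0, full_len, chunk_tokens):
        chunk = token_ids[start:start + chunk_tokens]
        h = hashlib.sha256()
        h.update(prev_digest)  # chain in the whole prefix so far
        h.update(b"".join(int(t).to_bytes(4, "little") for t in chunk))
        prev_digest = h.digest()
        keys.append(prev_digest.hex())
    return keys


if __name__ == "__main__":
    shared = list(range(512))
    a = chunk_keys(shared + list(range(512, 768)))
    b = chunk_keys(shared + [7] * 256)
    reusable = sum(1 for x, y in zip(a, b) if x == y)
    print(f"{reusable} of {len(a)} chunks reusable")
```

With 256-token chunks, two requests that diverge at token 512 still share their first two chunks. With a 1024-token chunk size the same pair would share nothing, since neither sequence fills a single chunk.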
Compress at the L4 boundary. CacheGen (compression + streaming) and SnapMLA (FP8 quantization of MLA latents) make S3 economically viable for production prefix caches. Without compression, the storage and network costs sink the economics.
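A quick sizing calculation shows why precision matters at this boundary. The sketch below assumes a standard attention layout that stores one key and one value vector per layer, per KV head, per token. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, and MLA-style models store a compressed latent instead, so their absolute numbers differ. The halving from 16-bit to 8-bit elements holds either way.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV-cache per token: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def chunk_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes in one chunk of `tokens` tokens, before any entropy coding."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(layers=32, kv_heads=8, head_dim=128)
    fp16 = chunk_bytes(256, bytes_per_elem=2, **shape)
    fp8 = chunk_bytes(256, bytes_per_elem=1, **shape)
    print(f"256-token chunk at 16-bit: {fp16 / 2**20:.0f} MiB")
    print(f"256-token chunk at  8-bit: {fp8 / 2**20:.0f} MiB")
```

Under these assumptions a single 256-token chunk is 32 MiB at 16-bit precision and 16 MiB at 8-bit, so every stored prefix and every L4 fetch moves half the bytes before any further compression is applied.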
Pair with Prefill-Decode Disaggregation to amortize the L4 fetch. ObjectCache showed that layerwise S3 retrieval can hide round-trip latency behind decode compute when layers execute sequentially: the KV data for the next layer is fetched while the current layer computes. Combined with disaggregated prefill / decode pools (Wave 2), the L4 → L1 staging happens on a different worker than the one that needs the data at decode time, so the fetch overlaps with useful work instead of stalling decode.
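The overlap argument can be checked with a small timing model. This is a simplified sketch, not ObjectCache's implementation: it assumes a single fetch stream that retrieves layers in order, and compute for a layer that starts only once both the previous layer's compute and that layer's fetch have finished. The function names and the per-layer timings are hypothetical.

```python
from typing import Sequence


def serial_latency(fetch_s: Sequence[float], compute_s: Sequence[float]) -> float:
    """Fetch everything first, then compute: no overlap."""
    return float(sum(fetch_s) + sum(compute_s))


def pipelined_latency(fetch_s: Sequence[float], compute_s: Sequence[float]) -> float:
    """Layerwise pipeline: fetch layer i+1 while layer i computes."""
    if len(fetch_s) != len(compute_s):
        raise ValueError("need one fetch time and one compute time per layer")
    fetch_done = 0.0    # time at which the current layer's KV has arrived
    compute_done = 0.0  # time at which the previous layer finished computing
    for fetch, compute in zip(fetch_s, compute_s):
        fetch_done += fetch  # fetches run back to back on one stream
        compute_done = max(compute_done, fetch_done) + compute
    return compute_done


if __name__ == "__main__":
    layers = 4
    fetch = [2.0] * layers    # hypothetical per-layer fetch time (ms)
    compute = [3.0] * layers  # hypothetical per-layer compute time (ms)
    print("serial:   ", serial_latency(fetch, compute))
    print("pipelined:", pipelined_latency(fetch, compute))
```

In this model, when per-layer compute is at least as long as per-layer fetch, only the first layer's fetch is exposed. When fetch is slower than compute, the pipeline becomes fetch-bound and the overlap only hides the compute, which is one reason compression at the L4 boundary matters.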
Pick a connective-tissue stack. LMCache for the chunked-prefetch heuristics + tier transitions; Mooncake for the pooled L4; NIXL as the unified transfer primitive (the open-source successor to per-vendor RDMA glue).
What Changed Over Time
- 2024: KV-cache lived entirely in HBM. Long-context serving was economically infeasible.
- 2025: LMCache + Mooncake formalized cross-tier caching. PagedAttention became the standard L1 manager.
- 2026: ObjectCache demonstrated layerwise S3 retrieval; CacheGen + SnapMLA brought compression to the L4 boundary; NVIDIA ICMS made the L3.5 tier physical.
- Forward: NIXL standardizes transfers across tiers; predictive prefetch driven by LLM attention maps is forecast in academic work but not yet in production stacks.