Guide 45

Hierarchical KV-Cache Tier Topology — From GPU HBM to S3

Problem Framing

Single-tier KV-cache management forces an unwinnable trade-off: either keep everything in HBM, where capacity runs out at tens of GB per accelerator, or fetch on demand from slower tiers, where decode-stage latency explodes. The 2026 answer is to extend the memory hierarchy outward: HBM holds the immediate working set, CPU DRAM holds the next-1-second window, local NVMe holds the next-1-minute window, and remote / distributed tiers (Mooncake's pooled cluster DRAM+NVMe or S3-compatible object storage) hold the durable archive. The Memory Wall demanded this; the chunked-prefetch software stack delivers it.
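
To make the "tens of GB" figure concrete, the sketch below applies the standard KV-cache sizing arithmetic for multi-head or grouped-query attention (one key and one value vector per layer, per KV head, per token). The ModelShape class and the example numbers are assumptions chosen for illustration; they do not come from this guide, and MLA-style latent caches are sized differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model shape; values used below are illustrative only."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for FP16/BF16, 1 for FP8


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Factor of 2: one key vector plus one value vector per layer per KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_value)


def kv_bytes(shape: ModelShape, context_tokens: int,
             concurrent_sequences: int = 1) -> int:
    # Total footprint grows linearly in both context length and batch size.
    return kv_bytes_per_token(shape) * context_tokens * concurrent_sequences


# Assumed example: 80 layers, 8 KV heads (grouped-query attention),
# head_dim 128, FP16 values.
example = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128,
                     bytes_per_value=2)

per_token = kv_bytes_per_token(example)        # 327,680 bytes (320 KiB)
long_context = kv_bytes(example, 128_000)      # 41,943,040,000 bytes (~39 GiB)
one_chunk = kv_bytes(example, 256)             # 83,886,080 bytes (80 MiB)
```

Under these assumed numbers, a single 128k-token sequence already consumes roughly 39 GiB, which is why the working set cannot stay in HBM alone once several long sequences are in flight.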

Relevant Nodes

  • Topics: AI Memory Infrastructure, LLM Serving
  • Technologies: vLLM, TensorRT-LLM, LMCache, Mooncake, NIXL, CacheGen, SnapMLA
  • Architectures: Hierarchical KV Cache Architecture, KV-Cache Disaggregation, Prefill-Decode Disaggregation, ObjectCache, Memory Efficient Attention
  • Pain Points: Memory Wall, Prefill Tax, KV Cache Memory Footprint

Decision Path

  1. Map the four tiers explicitly.

    • L1 — GPU HBM: active working-set KV-cache for the current decode step. PagedAttention-managed via vLLM or TensorRT-LLM.
    • L2 — Pinned CPU DRAM: hot intermediary across PCIe; zero-copy candidate for the next 1-second window.
    • L3 — Local NVMe (optionally GPUDirect Storage): long-context payloads exceeding DRAM. Provides the next-1-minute window cheaply.
    • L4 — Remote / distributed: Mooncake's pooled DRAM+NVMe across cluster nodes OR S3-compatible object storage. The durable, globally accessible archive.
  2. Add an L3.5 hardware tier if you're at hyperscale. NVIDIA BlueField-4 + CMX NVMe enclosures act as a pod-scale shared tier between L3 and L4, with DPU-orchestrated S3-over-RDMA. NVIDIA calls this design ICMS; because the DPU rather than the host CPU orchestrates cache movement, it removes the CPU bottleneck and changes pod economics.

  3. Use a 256-token chunk size as the operational default for L4 writes. Individual KV-cache pages are too small for efficient S3 PUT/GET, so LMCache and ObjectCache both group pages into ~256-token chunks. Larger chunks improve write throughput but reduce cache reuse on divergent generation paths, because a chunk is reusable only when every token in it and before it matches. Past 512 tokens you lose more from cache misses than you gain from I/O efficiency.

  4. Compress at the L4 boundary. CacheGen (compression + streaming) and SnapMLA (FP8 quantization of MLA latents) make S3 economically viable for production prefix caches. Without compression, the storage and network costs sink the economics.

  5. Pair with Prefill-Decode Disaggregation to amortize the L4 fetch. ObjectCache showed that layerwise S3 retrieval can hide round-trip latency behind decode compute if compute is layer-sequential. Combined with disaggregated prefill / decode pools (Wave 2), the L4 → L1 staging happens on a different worker than the one that needs the data at decode time, so latency is fully overlapped.

  6. Pick a connective-tissue stack: LMCache for the chunked-prefetch heuristics and tier transitions, Mooncake for the pooled L4, and NIXL as the unified transfer primitive (the open-source successor to per-vendor RDMA glue).
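
The tier map and the 256-token chunking in the decision path can be sketched as a toy model. Everything here is hypothetical: chunk_keys, TieredKVStore, and longest_cached_prefix are illustrative names, not the LMCache, Mooncake, or NIXL API, and plain dicts stand in for HBM, DRAM, NVMe, and the remote tier. The sketch assumes prefix-chained hashing for chunk keys and oldest-first demotion, which is one simple policy among many.

```python
import hashlib
from typing import Dict, List, Optional, Sequence, Tuple

CHUNK_TOKENS = 256  # operational default for L4 writes


def chunk_keys(token_ids: Sequence[int],
               chunk_tokens: int = CHUNK_TOKENS) -> List[str]:
    """One key per full chunk; each key hashes the whole prefix up to that chunk."""
    keys: List[str] = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_tokens  # drop the partial tail
    for start in range(0, full, chunk_tokens):
        for tok in token_ids[start:start + chunk_tokens]:
            running.update(tok.to_bytes(4, "little"))
        keys.append(running.hexdigest())  # digest so far; hashing continues after
    return keys


class TieredKVStore:
    """Toy four-tier store: index 0 is L1 (HBM), index 3 is L4 (remote archive)."""

    TIER_NAMES = ("L1-HBM", "L2-DRAM", "L3-NVMe", "L4-remote")

    def __init__(self, capacities: Sequence[int]):
        # Max chunk counts for L1..L3; L4 is treated as unbounded.
        self.capacities = list(capacities)
        self.tiers: List[Dict[str, bytes]] = [dict() for _ in self.TIER_NAMES]

    def put(self, key: str, payload: bytes, tier: int = 0) -> None:
        """Insert at `tier`; if that tier overflows, demote its oldest chunk."""
        store = self.tiers[tier]
        store.pop(key, None)
        store[key] = payload  # dicts keep insertion order: first item is oldest
        if tier < len(self.capacities) and len(store) > self.capacities[tier]:
            old_key = next(iter(store))
            self.put(old_key, store.pop(old_key), tier + 1)

    def get(self, key: str) -> Optional[Tuple[str, bytes]]:
        """Return (tier where the chunk was found, payload) and promote it to L1."""
        last = len(self.tiers) - 1
        for idx, store in enumerate(self.tiers):
            if key in store:
                # The archive keeps its copy; faster tiers hand theirs over.
                payload = store[key] if idx == last else store.pop(key)
                self.put(key, payload, tier=0)
                return self.TIER_NAMES[idx], payload
        return None


def longest_cached_prefix(store: TieredKVStore, token_ids: Sequence[int]) -> int:
    """Number of leading tokens whose chunks are present in some tier."""
    hits = 0
    for key in chunk_keys(token_ids):
        if store.get(key) is None:
            break
        hits += 1
    return hits * CHUNK_TOKENS
```

Because each key covers the entire prefix, two requests that share their first 512 tokens share exactly two chunk keys and diverge from the third onward. That is the reuse-versus-throughput tension behind the chunk-size choice: a larger chunk means a single differing token invalidates more cached work.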

What Changed Over Time

  • 2024: KV-cache lived entirely in HBM. Long-context serving was economically infeasible.
  • 2025: LMCache + Mooncake formalized cross-tier caching. PagedAttention became the standard L1 manager.
  • 2026: ObjectCache demonstrated layerwise S3 retrieval; CacheGen + SnapMLA brought compression to the L4 boundary; NVIDIA ICMS made the L3.5 tier physical.
  • Forward: NIXL standardizes transfers across tiers; predictive prefetch driven by LLM attention maps remains a forecast from academic work rather than a shipped feature.

Sources