Memory Wall | LLMS3

Definition

What it is

The architectural ceiling created by the diverging trajectories of compute throughput (which has scaled rapidly across GPU generations) and memory bandwidth and latency (which have improved much more slowly). For AI inference at scale, the result is an upper bound on tokens per second that additional compute alone cannot raise: generating each token requires streaming model weights and KV-cache data out of memory, so the bottleneck has migrated from FLOPs to memory access. Naming this constraint as a first-class pain point reframes architectural decisions across the stack: ICMS/CMX tiers, CXL memory pooling, KV-cache persistence to S3, and disaggregated prefill are all responses to the Memory Wall.
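The ceiling described above can be illustrated with a roofline-style estimate. The following is a minimal sketch, not a benchmark: it assumes each decode step reads all model weights once plus each sequence's KV cache, and that a step takes as long as the slower of memory traffic and arithmetic. The function name and every number (a 70B-parameter model at 2 bytes per weight, 3 TB/s of memory bandwidth, 1 PFLOP/s of compute, 1 GB of KV cache per sequence) are hypothetical values chosen for illustration.

```python
def decode_tokens_per_second(weight_bytes, kv_bytes_per_seq, batch_size,
                             mem_bandwidth, peak_flops, flops_per_token):
    """Roofline-style upper bound on decode throughput (tokens/s).

    Simplifying assumption: one decode step streams all weights once
    plus the KV cache of every sequence in the batch, and produces
    one token per sequence.
    """
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    memory_time = bytes_per_step / mem_bandwidth           # seconds
    compute_time = batch_size * flops_per_token / peak_flops
    step_time = max(memory_time, compute_time)             # slower side wins
    return batch_size / step_time


# Hypothetical accelerator and model; none of these figures come from a datasheet.
baseline = dict(
    weight_bytes=140e9,      # 70e9 parameters * 2 bytes
    kv_bytes_per_seq=1e9,    # assumed KV-cache size per sequence
    batch_size=1,
    mem_bandwidth=3e12,      # bytes/s
    peak_flops=1e15,         # FLOP/s
    flops_per_token=140e9,   # roughly 2 FLOPs per parameter
)

base = decode_tokens_per_second(**baseline)
more_compute = decode_tokens_per_second(**{**baseline, "peak_flops": 2e15})
more_bandwidth = decode_tokens_per_second(**{**baseline, "mem_bandwidth": 6e12})

print(f"baseline:        {base:.1f} tokens/s")
print(f"2x compute:      {more_compute:.1f} tokens/s")
print(f"2x mem bandwidth: {more_bandwidth:.1f} tokens/s")
```

Under these assumptions, doubling compute leaves throughput unchanged while doubling memory bandwidth doubles it, which is the behavior the definition attributes to the Memory Wall. Larger batches amortize the weight reads and shift the balance back toward compute, so the sketch applies most directly to small-batch, latency-sensitive decoding.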

Recent developments

Latest signals

2026 hardware revolution focuses on memory bandwidth, not compute cores. LLM performance in 2026 is defined by abundant compute power but finite memory bandwidth; chip vendors are responding by prioritizing bandwidth per die over additional compute, inverting the 2010s scaling playbook. Per TrendForce, "Memory Wall Bottleneck."
HBM consumption growing 70%+ YoY through 2026. The growth is driven by next-generation platforms (B300, GB300, R100/R200, VR100/VR200); HBM is the rate-limiting input to inference-cluster capacity expansion. Per TrendForce, "Memory Wall."
TurboQuant compresses KV-cache to 3.5 bits per value. Google's TurboQuant attacks the capacity dimension of the memory wall by compressing KV data in fast caches to as little as 3.5 bits per value, which extends the effective HBM footprint without architectural changes. Per SemiAnalysis, "Scaling the Memory Wall: HBM Roadmap."
Inference-hardware annual sales projected to grow 6× over 5 years. Annual sales of inference hardware are projected to grow by up to sixfold over the next five years, but the cost of serving state-of-the-art models, driven by intense HBM and end-to-end latency requirements, may price out some businesses. Per Data Exchange, "Breaking the Memory Wall" (interview).
Network latency and memory trump compute, say Google engineers. Google engineers explicitly frame the 2026 inference crisis as a memory and network problem, not a compute problem. The architectural responses (CXL pools, disaggregated prefill, KV offloading to S3) target memory access patterns rather than FLOPs. Per SDxCentral, "AI Inference Crisis: Network Latency + Memory Trump Compute."
HBM + CXL + new GPU playbook. DataCenterKnowledge's analysis traces the 2026 GPU-and-memory playbook: HBM for the hot working set, CXL pools for the warm tier, and NVMe plus S3 for the cold tier. Every tier transition carries latency and energy costs that the application absorbs. Per DataCenterKnowledge, "Scaling the Memory Wall: HBM, CXL, GPU Playbook."
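The capacity argument behind the KV-compression signal in the list above comes down to simple size accounting. The sketch below is not TurboQuant's algorithm; it only computes how many bytes a KV cache occupies at a given bit width. The function name and the model shape (80 layers, 8 KV heads, head dimension 128, a 32,768-token context) are hypothetical, and the calculation ignores any per-block metadata such as scales that a real quantization scheme may store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bits_per_value):
    """Bytes needed to hold one sequence's KV cache.

    Counts two tensors (keys and values) per layer and ignores
    quantization metadata, which is an assumption of this sketch.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * context_tokens
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only to make the arithmetic concrete.
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128, context_tokens=32768)

fp16_bytes = kv_cache_bytes(**shape, bits_per_value=16)
compressed_bytes = kv_cache_bytes(**shape, bits_per_value=3.5)

GIB = 2 ** 30
print(f"16-bit KV cache:  {fp16_bytes / GIB:.2f} GiB per sequence")
print(f"3.5-bit KV cache: {compressed_bytes / GIB:.2f} GiB per sequence")
print(f"reduction:        {fp16_bytes / compressed_bytes:.2f}x")
```

Going from 16 bits to 3.5 bits per value shrinks the cache by a factor of 16 / 3.5, or about 4.6, regardless of model shape. In terms of the tiering playbook above, that means roughly 4.6 times as many cached tokens fit in the HBM hot tier before data has to spill to the CXL or NVMe tiers.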

Connections 27

Outbound 2

scoped_to (2)

AI Memory Infrastructure, Inference Locality

Inbound 25
