Memory Wall
Definition
The architectural ceiling created by the diverging trajectories of compute throughput, which has scaled rapidly across GPU generations, and memory bandwidth and latency, which have improved much more slowly. For AI inference at scale, the result is an upper bound on tokens per second that additional compute alone cannot raise: each decoding step must read model weights and KV-cache state from memory, so the bottleneck has migrated from FLOPs to memory access. Naming this constraint as a first-class pain point reframes architectural decisions across the stack: ICMS/CMX tiers, CXL memory pooling, KV-cache persistence to S3, and disaggregated prefill are all responses to the Memory Wall.
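The bound described above can be illustrated with a rough roofline-style estimate. This is a minimal sketch under simplifying assumptions, not a performance model of any particular system: it assumes each decode step reads all model weights once plus each sequence's KV cache, and that memory traffic and compute overlap perfectly. The names `DecodeStep`, `tokens_per_second_bound`, and `is_memory_bound` are hypothetical, and the numbers in the usage example are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class DecodeStep:
    """Hypothetical per-step cost model for autoregressive decoding."""
    weight_bytes: float      # bytes of model weights read per decode step
    kv_bytes_per_seq: float  # bytes of KV cache read per sequence per step
    flops_per_token: float   # arithmetic work per generated token
    batch_size: int          # sequences decoded together in one step


def step_times(step: DecodeStep, mem_bandwidth: float, peak_flops: float):
    """Return (memory_time, compute_time) in seconds for one decode step.

    mem_bandwidth is in bytes/second, peak_flops in FLOP/second.
    """
    bytes_moved = step.weight_bytes + step.batch_size * step.kv_bytes_per_seq
    memory_time = bytes_moved / mem_bandwidth
    compute_time = step.batch_size * step.flops_per_token / peak_flops
    return memory_time, compute_time


def tokens_per_second_bound(step: DecodeStep, mem_bandwidth: float,
                            peak_flops: float) -> float:
    """Upper bound on tokens/second: the slower resource sets the step time."""
    memory_time, compute_time = step_times(step, mem_bandwidth, peak_flops)
    return step.batch_size / max(memory_time, compute_time)


def is_memory_bound(step: DecodeStep, mem_bandwidth: float,
                    peak_flops: float) -> bool:
    """True when memory traffic, not arithmetic, limits the step."""
    memory_time, compute_time = step_times(step, mem_bandwidth, peak_flops)
    return memory_time >= compute_time


if __name__ == "__main__":
    # Placeholder figures chosen for easy arithmetic, not real hardware specs.
    step = DecodeStep(weight_bytes=1e10, kv_bytes_per_seq=0.0,
                      flops_per_token=2e10, batch_size=1)
    for bandwidth, flops in [(1e12, 1e15), (1e12, 2e15), (2e12, 1e15)]:
        bound = tokens_per_second_bound(step, bandwidth, flops)
        print(f"bandwidth={bandwidth:.0e} B/s, peak={flops:.0e} FLOP/s: "
              f"{bound:.1f} tokens/s")
```

In this toy configuration, doubling `peak_flops` leaves the bound unchanged while doubling `mem_bandwidth` doubles it, which is the behavior the Memory Wall describes. Larger batches amortize the weight reads, but in this model the per-sequence KV-cache traffic grows with the batch, so the memory term does not go away.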