Inference Locality

Definition

What it is

The architectural shift toward minimizing data movement between storage and inference compute — placing computation as close as physically possible to where the data lives, often inside the storage fabric itself (DPUs, in-network compute, edge tiers). Operationalizes the "data gravity" principle: bring the model to the data, not the inverse.

Why it exists

Moving a bit through the memory hierarchy can cost roughly an order of magnitude more energy than the computation performed on it. Traditional POSIX file systems, which stage data through CPU bounce buffers, are inefficient for modern AI workloads. Inference Locality names the body of techniques that collapse that distance: GPU-aware storage, compute-near-storage, edge inference caching, and the new **Inference Context Memory Storage (ICMS)** / **Context Memory eXtension (CMX)** tier that sits between Tier 3 SSDs and Tier 4 cold S3.
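The idea of an intermediate context tier can be illustrated with a small sketch. This is not an ICMS/CMX implementation: the class names, tier labels (`gpu_hbm`, `cmx_flash`, `cold_s3`), and capacities are hypothetical, and LRU is an assumed eviction policy chosen only for simplicity. It shows a lookup that walks tiers from nearest to farthest, promotes a hit to the nearest tier, and demotes evicted blocks one level outward.

```python
from collections import OrderedDict


class Tier:
    """One level of a hypothetical context-memory hierarchy (LRU eviction)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity      # max number of KV blocks held
        self.blocks = OrderedDict()   # key -> bytes, oldest first

    def get(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        return None

    def put(self, key, value):
        """Insert a block; return the evicted (key, value) pair, or None."""
        self.blocks[key] = value
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            return self.blocks.popitem(last=False)  # evict least recent
        return None


class TieredKVCache:
    """Tiers are ordered nearest-to-compute first (an assumption)."""

    def __init__(self, tiers):
        self.tiers = tiers

    def _insert(self, level, key, value):
        # Cascade evictions outward: a block pushed out of one tier
        # lands in the next-farther tier.
        while level < len(self.tiers):
            evicted = self.tiers[level].put(key, value)
            if evicted is None:
                return
            key, value = evicted
            level += 1
        # A block evicted from the last tier is dropped in this sketch.

    def put(self, key, value):
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, name_of_tier_that_served_it), or None on a miss."""
        for level, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if level > 0:
                    # Promote the block to the nearest tier.
                    del tier.blocks[key]
                    self._insert(0, key, value)
                return value, tier.name
        return None
```

With tiers of capacity 1, 2, and 10, writing blocks `a` then `b` pushes `a` out to the middle tier; the next read of `a` is served from there and pulls it back next to compute.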

Primary use cases

Zero-copy KV-cache streaming via DPU-attached flash, GPU-aware S3-RDMA data planes, edge inference caching for sensor-adjacent RAG, sovereign cloud alignment for regulated inference, in-storage attention offloading (Computing-in-Memory).
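As a host-side analogy for the bounce-buffer versus zero-copy distinction, the sketch below contrasts a conventional `read()` (which copies bytes into a new buffer) with a memory-mapped view that is consumed in place. It uses only the Python standard library; the function names are hypothetical, and nothing here touches a GPU, DPU, or RDMA path. It only illustrates the access pattern.

```python
import mmap


def read_with_copy(path, offset, length):
    """Conventional path: the kernel copies file data into a fresh bytes object."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def checksum_zero_copy(path, offset, length):
    """Map the file and sum a window of it without materialising a copy."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            with memoryview(mm) as view:
                window = view[offset:offset + length]  # a view, not a copy
                try:
                    return sum(window)
                finally:
                    # Views must be released before the mapping can close.
                    window.release()
```

Both functions see the same bytes; the difference is that the second never allocates a buffer for the window, which is the property zero-copy data planes aim to preserve end to end.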

Recent developments

Latest signals

Beluga (CXL-based KV-cache memory architecture): 89.6% reduction in Time-To-First-Token (TTFT) and 7.35× throughput improvement vs RDMA-based solutions. Described as the first system enabling GPUs to directly access large-scale memory pools through CXL switches. Per arXiv 2511.20172, Beluga: CXL-Based KVCache Architecture.
CXL 4.0 enables 100+ TB shared memory pools across racks, giving AI inference workloads cache-coherent access to shared memory at that scale. Commercial CXL memory pools reaching 100 TiB became available in 2025; larger deployments are planned for 2026. Per Introl, CXL 4.0 Infrastructure Planning Guide.
NVIDIA Blackwell supports CXL; Phase 3 memory pooling is slated for 2026-2027. In that phase, CXL switches provide shared memory pools where multiple hosts access common memory resources, and disaggregated memory architecture for inference is expected to become mainstream. Per Penguin Solutions, Why AI Needs CXL.
CXL pooling improves inference throughput 4.8× and reduces TTFT 82.7% vs a non-pooled baseline. These are production-side numbers, separate from the Beluga research results. Per Introl, CXL 4.0 Planning Guide.
Marvell Structera S: CXL switching aimed at the AI memory wall. Marvell's 2026 Structera S product line targets the AI memory-wall scaling problem with CXL switching, making commercial silicon available for the architecture pattern that Beluga and similar research demonstrated. Per Marvell, Structera S CXL Switching for AI Memory Wall.
Processing-near-memory for 1M-token LLM inference. A research direction beyond CXL pooling: processing-near-memory architectures scaled for 1M-token LLM inference workloads, the next architectural step when even pooled memory isn't enough. Per arXiv 2511.00321, Scalable Processing-Near-Memory for 1M-Token LLM Inference.
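To see why long contexts push KV-cache out of local GPU memory and toward pooled or near-memory designs, a back-of-the-envelope sizing helps. The sketch below uses the standard KV-cache footprint for a decoder-only transformer (keys plus values, per layer, per KV head). The model shape in the usage note (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical configuration chosen for illustration, not a figure from any of the sources cited above.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Bytes needed to hold keys and values for `seq_len` tokens.

    The leading factor of 2 accounts for storing both K and V.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch


def to_gib(n_bytes):
    """Convert bytes to binary gigabytes (GiB)."""
    return n_bytes / 2**30
```

For the assumed shape, each token costs 2 × 32 × 8 × 128 × 2 = 131,072 bytes (128 KiB), so a single 1,000,000-token sequence needs 131,072,000,000 bytes, roughly 122 GiB, before any batching.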

Connections 14

Outbound 3

scoped_to 3

Object Storage S3 Object Storage for AI Data Pipelines

Inbound 11

optimizes_for 3

AIStor MCP Server, MinIO MemKV, Local Object Transport Accelerator (LOTA)

scoped_to 7

NVIDIA BlueField-4, Inference Context Memory Storage (ICMS), NIXL (NVIDIA Inference Transfer Library), MemVerge, Local Object Transport Accelerator (LOTA), Memory Wall, Prefill Tax

enables 1

NVIDIA BlueField-4
