Inference Locality
Definition
The architectural shift toward minimizing data movement between storage and inference compute by placing computation as close as physically possible to where the data lives, often inside the storage fabric itself (DPUs, in-network compute, edge tiers). It operationalizes the "data gravity" principle: bring the model to the data, not the data to the model.
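The "bring the model to the data" rule can be read as a comparison of bytes moved under each placement. The following is a minimal sketch under stated assumptions, not a real scheduler: the `Job` fields, the function names, and the premise that the caller sits on the compute side (so near-data results must travel back) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Sizes in bytes; every field is a hypothetical illustration value."""
    data_bytes: int    # input data resident on the storage tier
    model_bytes: int   # model weights resident on the compute tier
    result_bytes: int  # inference output returned to the caller


def bytes_moved(job: Job, run_near_data: bool) -> int:
    """Bytes crossing the storage/compute boundary for one placement."""
    if run_near_data:
        # Ship the model to the storage side and send back only the results.
        return job.model_bytes + job.result_bytes
    # Ship the input data to the compute side (assumed to host the caller).
    return job.data_bytes


def choose_placement(job: Job) -> str:
    """Pick the placement that moves fewer bytes across the boundary."""
    near = bytes_moved(job, run_near_data=True)
    far = bytes_moved(job, run_near_data=False)
    return "near-data" if near <= far else "near-compute"
```

With a large corpus and a comparatively small model, `choose_placement` returns `"near-data"`; with a small input and a large model it returns `"near-compute"`. A real placement decision would also weigh compute capacity, latency, and how often the model can be reused once shipped.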
Moving a bit through the memory hierarchy can cost an order of magnitude more energy than the computation performed on it. Traditional POSIX file-system I/O, which stages data through CPU bounce buffers, is therefore highly inefficient for modern AI workloads. Inference Locality names the body of techniques that collapse this distance: GPU-aware storage, compute-near-storage, edge inference caching, and the new **Inference Context Memory Storage (ICMS)** / **Context Memory eXtension (CMX)** tier that sits between Tier 3 SSDs and Tier 4 cold S3 storage.
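One way to picture the context-memory tier is as an extra level in a nearest-first lookup. The sketch below is a toy model only: the position of ICMS/CMX between SSD and cold object storage comes from the text above, while the other tier names, the `TieredContextStore` class, and the promote-on-read policy are assumptions made for illustration.

```python
from typing import Optional

# Tier order from nearest to farthest from the accelerator. Only the
# placement of ICMS/CMX between SSD and cold storage is taken from the
# text; the remaining tier names are assumed for illustration.
TIERS = ["gpu_hbm", "host_dram", "local_ssd", "icms_cmx", "cold_object_store"]


class TieredContextStore:
    """Toy KV-block store that searches the nearest tier first."""

    def __init__(self) -> None:
        self._tiers = {name: {} for name in TIERS}

    def put(self, tier: str, key: str, block: bytes) -> None:
        """Place a block in a named tier."""
        self._tiers[tier][key] = block

    def locate(self, key: str) -> Optional[str]:
        """Return the nearest tier holding the block, or None."""
        for name in TIERS:
            if key in self._tiers[name]:
                return name
        return None

    def get(self, key: str) -> Optional[bytes]:
        """Fetch a block and promote a copy to the nearest tier."""
        tier = self.locate(key)
        if tier is None:
            return None
        block = self._tiers[tier][key]
        self._tiers[TIERS[0]][key] = block  # promote for the next access
        return block
```

A short usage example:

```python
store = TieredContextStore()
store.put("icms_cmx", "session-1", b"kv-block")
store.locate("session-1")  # "icms_cmx" before the first read
store.get("session-1")     # reads the block and promotes a copy
store.locate("session-1")  # "gpu_hbm" after promotion
```

The point of the sketch is that a hit in the context-memory tier avoids a trip to cold storage; eviction, capacity limits, and the actual transport are deliberately left out.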
Examples include zero-copy KV-cache streaming via DPU-attached flash, GPU-aware S3-RDMA data planes, edge inference caching for sensor-adjacent RAG, sovereign cloud alignment for regulated inference, and in-storage attention offloading (Computing-in-Memory).