GPU + Object Storage Convergence

Definition

What it is

The set of technologies eliminating CPU bounce-buffers between object storage and GPU memory — establishing direct memory access paths from S3-compatible storage to GPU VRAM via RDMA, GPUDirect Storage, and the cuObject library's `x-amz-rdma-token` extension. Includes CXL 3.0 rack-scale coherent memory fabrics and Distributed Page Caches that treat the entire cluster's DRAM as a single cache budget.
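To make the bounce-buffer idea concrete, the toy sketch below counts the memory copies on a staged path versus a direct path. It is a host-side illustration in plain Python only: the `vram` bytearray is a stand-in for GPU memory, and `CopyStats`, `staged_read`, and `direct_read` are hypothetical names, not part of GPUDirect Storage or cuObject.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CopyStats:
    """Tracks how many copies a transfer path performed."""
    copies: int = 0
    bytes_moved: int = 0


def _copy(dst: bytearray, src: bytes | bytearray, stats: CopyStats) -> None:
    # One memory-to-memory copy of the whole payload.
    dst[:len(src)] = src
    stats.copies += 1
    stats.bytes_moved += len(src)


def staged_read(payload: bytes, vram: bytearray) -> CopyStats:
    """Bounce-buffer path: storage -> host staging buffer -> destination."""
    stats = CopyStats()
    staging = bytearray(len(payload))  # host-side bounce buffer (assumed)
    _copy(staging, payload, stats)     # copy 1: payload lands in host memory
    _copy(vram, staging, stats)        # copy 2: host memory -> stand-in VRAM
    return stats


def direct_read(payload: bytes, vram: bytearray) -> CopyStats:
    """Direct path: the payload is placed in the destination exactly once."""
    stats = CopyStats()
    _copy(vram, payload, stats)
    return stats
```

In this model the staged path moves every byte twice and the direct path moves it once; the real saving also includes the CPU time that the staging copy would otherwise consume.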

Why it exists

Traditional POSIX file-system I/O stages every transfer in a CPU bounce buffer in host memory before copying it to the GPU, which adds latency and burns CPU cycles on modern AI workloads. NVIDIA GPUDirect Storage is reported to cut latency from ~15µs to under 2µs, reduce CPU utilization by up to 45%, and raise aggregate throughput. Extending this to S3-compatible object storage (via cuObject's separation of control plane from data plane) lets training pipelines stream hundreds of GB/s from S3 buckets directly into GPU VRAM: only the S3 control messages still travel over HTTP, while the object payload bypasses the host TCP/IP stack.
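The control-plane/data-plane split can be sketched as a small in-process simulation. This is not the cuObject API: the token format, the buffer registry, and the functions `register_buffer`, `build_get_request`, and `serve` are all assumptions made for illustration. The only detail taken from the text is that the S3 request carries an `x-amz-rdma-token` header, and the sketch assumes that token identifies a client-registered destination buffer.

```python
from __future__ import annotations

import secrets

# Hypothetical stand-in for RDMA-registered memory: token -> destination buffer.
REGISTERED_BUFFERS = {}

# Hypothetical object store contents, keyed by (bucket, key).
OBJECTS = {("train", "shard-0000"): b"example payload"}


def register_buffer(size: int) -> tuple[str, bytearray]:
    """Client side: reserve a destination buffer and get an opaque token."""
    token = secrets.token_hex(8)  # made-up token format
    buf = bytearray(size)
    REGISTERED_BUFFERS[token] = buf
    return token, buf


def build_get_request(bucket: str, key: str, token: str) -> dict:
    """Control plane: an ordinary S3-style GET plus the RDMA token header."""
    return {
        "method": "GET",
        "path": f"/{bucket}/{key}",
        "headers": {"x-amz-rdma-token": token},
    }


def serve(request: dict) -> dict:
    """Server side: place the payload in the client's buffer, not the body."""
    _, bucket, key = request["path"].split("/", 2)
    payload = OBJECTS[(bucket, key)]
    dest = REGISTERED_BUFFERS[request["headers"]["x-amz-rdma-token"]]
    dest[:len(payload)] = payload  # simulated data-plane transfer
    # The control-plane response carries status only, no object bytes.
    return {"status": 200, "body": b"", "bytes_placed": len(payload)}
```

A caller would register a buffer, send the request, and then read the object from its own buffer rather than from the response body, which is the property the split is meant to provide.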

Primary use cases

Zero-copy retrieval pipelines from S3 to GPU VRAM (Cloudian + NVIDIA, VAST + DASE, MinIO + GDS), object-to-VRAM streaming for training-data loaders, CXL-based distributed page caches for shared vector indices and KV-caches, AI-memory fabrics that dissolve the strict host-RAM-vs-object-storage boundary.

Recent developments

Latest signals

NVIDIA cuObject v1.0.0 GA shipped (R1.16 release cycle). Both cuObjClient (GET/PUT APIs with an RDMA data path) and cuObjServer (the RDMA-accelerated server side for S3-compatible object storage) reached 1.0, making cuObject the reference toolkit for production GPUDirect-to-object-storage deployments. Per NVIDIA: cuObject server v1.0.0 release notes; NVIDIA: cuObject client v1.0.0 release notes.
MinIO AIStor + GPUDirect RDMA tech preview: 200+ GB/s sustained, 45% CPU reduction. First open-source S3-compatible storage to ship GPUDirect RDMA support, measured at 3× faster than non-RDMA flash with a 45% reduction in GPU-server CPU utilization. On the tech-preview track toward GA. Per MinIO: AIStor + NVIDIA GPUDirect RDMA for S3.
Cloudian shipped direct GPU-to-object-storage at GA. First commercial S3-compatible vendor to ship GPUDirect support at GA; Cloudian claims "AI storage barrier shattered", with throughput on par with fast block storage. A production-ready alternative to lab-grade integrations. Per Cloudian: Direct GPU-to-Object Storage with GPUDirect.
LMCache feature request filed: S3-over-RDMA data plane via cuObject. The request proposes wiring cuObject in as the RDMA transport for the S3 backend of LMCache (the KV-cache library). Convergence signal: not just training-data loaders but also inference KV-cache pools want zero-copy S3-to-VRAM paths. Per GitHub: LMCache Issue #2875: S3-over-RDMA via cuObject.
DC (Dynamically Connected) transport is the cuObject default. Unlike Reliable Connection (RC) transport, DC does not require pre-establishing a connection between every client/server pair, so it scales to many-clients/many-servers fabrics without N² connection state. RDMA negotiation rides on the `x-amz-rdma-token` HTTP header in the S3 control plane. Per NVIDIA: cuObject GPUDirect Storage for Objects docs.
GPUDirect Storage drops latency 15µs → <2µs; ~45% CPU reduction. These are the baseline numbers across NVIDIA, Cloudian, MinIO, and VAST measurements. Convergence pattern: bypass the kernel TCP stack and the bounce buffer, and replace them with RDMA reads directly into VRAM. Per NVIDIA: GPUDirect Storage docs; Cloudian: GPUDirect blog.
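The connection-state argument for DC can be checked with simple arithmetic. The sketch below assumes one RC connection per client/server pair and a single DC endpoint per node, which is a simplification (real fabrics may use several queue pairs per peer), and the function names are illustrative only.

```python
def rc_connection_count(clients: int, servers: int) -> int:
    """Pre-established connections if every client pairs with every server."""
    return clients * servers


def dc_endpoint_count(clients: int, servers: int) -> int:
    """Endpoints if each node keeps one shared endpoint (assumed model)."""
    return clients + servers


if __name__ == "__main__":
    for clients, servers in [(8, 4), (256, 32), (1000, 100)]:
        rc = rc_connection_count(clients, servers)
        dc = dc_endpoint_count(clients, servers)
        print(f"{clients} clients x {servers} servers: RC={rc}, DC={dc}")
```

Under these assumptions the RC count grows with the product of clients and servers, while the DC count grows with their sum, which is the scaling difference described above.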

Connections 10

Outbound 3

scoped_to 3

Object Storage S3 Object Storage for AI Data Pipelines

Inbound 7

scoped_to 6

NVIDIA BlueField-4, NIXL (NVIDIA Inference Transfer Library), MemVerge, NVIDIA cuObject, Cloudian HyperStore, CXL 3.0

enables 1

CXL 3.0
