NVIDIA GPUDirect RDMA for S3

Summary

What it is

NVIDIA's client/server library stack released November 2025 that moves S3-compatible object data directly from storage-node memory to GPU high-bandwidth memory over RoCE v2 or InfiniBand — bypassing the host OS kernel and TCP/IP stack. Client libraries run on GPU nodes (or offload to BlueField-3 DPUs via the ROS2 / SmartNIC pattern); server libraries ship in object-storage controllers from MinIO AIStor, Cloudian HyperStore, Dell ObjectScale, and HPE Alletra Storage MP X10000.

Where it fits

At 400GbE and above, TCP interrupt handling and packet copies between kernel and user space starve GPUs of data during training. Kernel-bypass RDMA shifts transport off the host CPU, keeping GPUs compute-bound instead of waiting on I/O. In the span of a year this moved from research curiosity to AI-factory prerequisite; by 2026 it is a standard checkbox on enterprise on-prem AI storage.

Misconceptions / Traps

Not a cloud-S3 accelerator. Public S3 over HTTPS does not expose RDMA — this applies to on-prem and colo object stores with RDMA-capable controllers.
Control plane and data plane are separate. The gRPC control channel for namespace resolution is low-bandwidth; bulk data moves beneath it over RDMA on UCX/libfabric. DPU offload is optional: without it the client stays on the host, which still works.
Benchmarks favor DPU offload (ROS2 pattern) when multi-tenant isolation or inline encryption matters. Running TCP on a SmartNIC lags badly — RDMA is the mandatory prerequisite for the offload to pay off.
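
The control-plane / data-plane split described above can be pictured with a small in-memory simulation. This is only a sketch under loud assumptions: `StorageNode`, `BufferDescriptor`, `resolve`, and `rdma_read` are hypothetical names invented for illustration, not NVIDIA's library API, and plain Python byte buffers stand in for registered storage-node memory and GPU HBM. No real gRPC or RDMA call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferDescriptor:
    """Hypothetical handle from the control plane: where an object's bytes live."""
    region_id: int  # stands in for a registered memory region / remote key
    offset: int
    length: int


class StorageNode:
    """Simulated storage node holding object bytes in 'registered' regions."""

    def __init__(self):
        self._regions = {}  # region_id -> bytes
        self._index = {}    # (bucket, key) -> BufferDescriptor

    def put(self, bucket, key, data):
        region_id = len(self._regions)
        self._regions[region_id] = bytes(data)
        self._index[(bucket, key)] = BufferDescriptor(region_id, 0, len(data))

    def resolve(self, bucket, key):
        # Control plane: a small metadata lookup, no payload bytes returned.
        return self._index[(bucket, key)]

    def rdma_read(self, desc, dest):
        # Data plane: stand-in for a one-sided read into a caller-owned buffer.
        src = self._regions[desc.region_id]
        dest[:desc.length] = src[desc.offset:desc.offset + desc.length]
        return desc.length


def get_object(node, bucket, key):
    desc = node.resolve(bucket, key)     # low-bandwidth control step
    gpu_buffer = bytearray(desc.length)  # placeholder for a GPU HBM allocation
    node.rdma_read(desc, gpu_buffer)     # bulk transfer step
    return gpu_buffer
```

The point of the shape is that `resolve` only ever returns a few bytes of metadata, so it can ride a slow channel, while the payload takes a separate path that never passes through the control channel.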

Key Connections

implements RDMA (RoCE v2 / InfiniBand) — the underlying transport
augments GPU-Direct Storage Pipeline — the data-to-HBM path for S3 sources
augments RDMA-Accelerated Object Access — productizes the arch pattern
bypasses S3 API — routes around HTTP/TCP while preserving S3 semantics
solves Cold Scan Latency — checkpoint loads land at near-local-memory speed
solves High Cloud Inference Cost — reclaims CPU cycles lost to interrupt handling
scoped_to Object Storage for AI Data Pipelines

Definition

What it is

NVIDIA's RDMA client/server library stack for S3-compatible object storage, introduced November 2025, that moves data directly from a storage node's memory to GPU high-bandwidth memory over RoCE v2 or InfiniBand fabrics — bypassing the host OS kernel, the TCP/IP stack, and user-space copy chains. A storage-side server library integrates into object-storage controllers (MinIO AIStor, Cloudian HyperStore, Dell ObjectScale, HPE Alletra Storage MP X10000); the client library runs on GPU compute nodes or is offloaded to a BlueField-3 DPU using the ROS2 (RDMA-first Object Storage) pattern.

Why it exists

At 400GbE and above, TCP interrupt handling and kernel-space packet copies starve GPUs of training data. The host CPU cannot feed PCIe fast enough, GPUs idle, and the economics of the AI factory collapse. Kernel-bypass RDMA shifts the entire data-transport burden off the CPU, keeping expensive GPUs compute-bound during massive checkpoint loads and distributed training reads.
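
To put rough numbers on the 400GbE claim, here is a back-of-the-envelope calculation. The checkpoint size (a hypothetical 1-trillion-parameter model stored at 2 bytes per parameter, weights only) and the 50% efficiency factor are illustrative assumptions, not measurements from the sources listed on this page.

```python
def line_rate_bytes_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gb/s to bytes per second (decimal units)."""
    return gbit_per_s * 1e9 / 8


def load_seconds(total_bytes: float, gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Time to move total_bytes over one link at a given fraction of line rate."""
    return total_bytes / (line_rate_bytes_per_s(gbit_per_s) * efficiency)


# Assumption: 1e12 parameters at 2 bytes each (e.g. bf16), weights only.
checkpoint_bytes = 1e12 * 2

ideal = load_seconds(checkpoint_bytes, 400)          # link fully utilised
degraded = load_seconds(checkpoint_bytes, 400, 0.5)  # assumed 50% efficiency

print(f"400GbE line rate: {line_rate_bytes_per_s(400) / 1e9:.0f} GB/s")
print(f"ideal load: {ideal:.0f} s, at 50% efficiency: {degraded:.0f} s")
```

Under these assumptions a single 400GbE link tops out at 50 GB/s, so any shortfall from line rate translates directly into extra seconds of GPU idle time on every checkpoint load.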

Primary use cases

Multi-trillion-parameter LLM pre-training, synchronized checkpoint loading across thousands of GPU nodes, high-throughput RAG vector ingestion at training scale, SmartNIC-offloaded object storage on BlueField-3 DPUs, and on-prem AI factories where standard HTTP S3 is the throughput-per-watt bottleneck.

Connections 12

Outbound 12

scoped_to 2

Object Storage
Object Storage for AI Data Pipelines

implements 1

RDMA (RoCE v2 / InfiniBand)

depends_on 1

RDMA (RoCE v2 / InfiniBand)

augments 2

GPU-Direct Storage Pipeline
RDMA-Accelerated Object Access

enables 2

Checkpoint/Artifact Lake on Object Storage
Training Data Streaming from Object Storage

bypasses 1

S3 API

solves 3

Cold Scan Latency
High Cloud Inference Cost
Data Loading Bottleneck

Resources 4

Blog · High

blogs.nvidia.com/blog/s3-compatible-ai-storage/

NVIDIA's announcement of RDMA for S3-compatible storage (November 2025) with architecture diagrams of the GPU HBM ← network fabric ← storage-node memory path.

Blog · High

www.min.io/blog/minio-aistor-with-nvidia-gpudirect-r-rdma-fo...

MinIO AIStor integration write-up — most detailed vendor-side explanation of the server library and why TCP/IP overhead is the bottleneck RDMA eliminates.

Paper · High

arxiv.org/html/2509.13997v1

ROS2 ("RDMA-first Object Storage") arXiv paper: the research basis for BlueField-3 DPU offload of the DAOS client, including FIO/DFS benchmarks showing that DPU-offloaded RDMA matches host-side performance.

Blog · Medium

pt.hi-network.com/nvidia-brings-rdma-acceleration-to-s3-obje...

Press coverage enumerating the enterprise storage partners (Cloudian HyperStore, Dell ObjectScale, HPE Alletra Storage MP X10000) that embedded the server library natively.
