NVIDIA GPUDirect RDMA for S3

Summary

What it is

NVIDIA's client/server library stack, released in November 2025, that moves S3-compatible object data directly from storage-node memory to GPU high-bandwidth memory over RoCE v2 or InfiniBand, bypassing the host OS kernel and TCP/IP stack. Client libraries run on GPU nodes (or offload to BlueField-3 DPUs via the ROS2 / SmartNIC pattern); server libraries ship in object-storage controllers from MinIO AIStor, Cloudian HyperStore, Dell ObjectScale, and HPE Alletra Storage MP X10000.

Where it fits

At 400GbE and above, TCP interrupt handling and kernel-to-user packet copies starve GPUs during training. Kernel-bypass RDMA shifts transport off the host CPU, keeping GPUs compute-bound instead of waiting on I/O. This moved from research curiosity to AI-factory prerequisite in the span of a year; by 2026 it is a standard checkbox for enterprise on-prem AI storage.
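
A rough back-of-envelope makes the scale concrete. Every constant below is an illustrative assumption, not a measured or vendor-published figure:

```python
# Why TCP on the host CPU cannot keep up at 400GbE.
# All constants are illustrative assumptions, not measurements.

LINK_GBPS = 400                          # 400GbE line rate
MTU_BYTES = 1500                         # standard Ethernet frames
BYTES_PER_SEC = LINK_GBPS * 1e9 / 8      # 50 GB/s of data to move

packets_per_sec = BYTES_PER_SEC / MTU_BYTES
print(f"{packets_per_sec / 1e6:.1f}M packets/s")    # ~33.3M packets/s

# Assume ~2,000 CPU cycles per packet for interrupt handling, TCP
# processing, and the kernel-to-user copy (order-of-magnitude guess).
CYCLES_PER_PACKET = 2_000
CPU_GHZ = 3.0
cores_saturated = packets_per_sec * CYCLES_PER_PACKET / (CPU_GHZ * 1e9)
print(f"~{cores_saturated:.0f} cores doing nothing but moving bytes")  # ~22
```

Jumbo frames and TSO/GRO soften the per-packet cost, but the per-byte copy cost remains; kernel bypass removes both the interrupt load and the copy.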

Misconceptions / Traps
  • Not a cloud-S3 accelerator. Public S3 over HTTPS does not expose RDMA — this applies to on-prem and colo object stores with RDMA-capable controllers.
  • Control plane and data plane are separate. The gRPC control channel for namespace resolution is low-bandwidth; bulk data moves beneath it over RDMA via UCX/libfabric (see the sketch after this list). Skipping the DPU offload means keeping the client on the host, which still works.
  • Benchmarks favor DPU offload (ROS2 pattern) when multi-tenant isolation or inline encryption matters. Running plain TCP on a SmartNIC lags badly; RDMA is the prerequisite that makes the offload pay off.
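
A minimal sketch of that control/data split, assuming hypothetical placeholder names (S3RdmaControlStub, ResolveRequest, rdma_read); the shipped library's actual API is not reproduced here:

```python
# Sketch only: S3RdmaControlStub, ResolveRequest, and rdma_read are
# hypothetical placeholders, not the real client library's API. The
# shape is the point: gRPC resolves the name, RDMA moves the bytes.
import grpc  # real library; carries only low-bandwidth control traffic

def fetch_object(bucket: str, key: str, gpu_buffer) -> int:
    # Control plane: resolve the S3 name to an RDMA-addressable
    # location (remote address + rkey) on a storage node.
    channel = grpc.insecure_channel("storage-ctrl.local:50051")
    stub = S3RdmaControlStub(channel)                      # hypothetical
    loc = stub.Resolve(ResolveRequest(bucket=bucket, key=key))

    # Data plane: a one-sided RDMA READ over UCX/libfabric pulls the
    # object from the storage node's registered memory straight into
    # the GPU buffer, never entering the host kernel's TCP/IP stack.
    rdma_read(remote_addr=loc.remote_addr, rkey=loc.rkey,  # hypothetical
              length=loc.length, dest=gpu_buffer)
    return loc.length
```
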
Key Connections
  • implements RDMA (RoCE v2 / InfiniBand) — the underlying transport
  • augments GPUDirect Storage Pipeline — the data-to-HBM path for S3 sources
  • augments RDMA-Accelerated Object Access — productizes the arch pattern
  • bypasses S3 API — routes around HTTP/TCP while preserving S3 semantics
  • solves Cold Scan Latency — checkpoint loads land at near-wire speed
  • solves High Cloud Inference Cost — reclaims CPU cycles lost to interrupt handling
  • scoped_to Object Storage for AI Data Pipelines

Definition

What it is

NVIDIA's RDMA client/server library stack for S3-compatible object storage, introduced in November 2025, that moves data directly from a storage node's memory to GPU high-bandwidth memory over RoCE v2 or InfiniBand fabrics, bypassing the host OS kernel, the TCP/IP stack, and user-space copy chains. A storage-side server library integrates into object-storage controllers (MinIO AIStor, Cloudian HyperStore, Dell ObjectScale, HPE Alletra Storage MP X10000); the client library runs on GPU compute nodes or is offloaded to a BlueField-3 DPU using the ROS2 (RDMA-first Object Storage) pattern.
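
What the two paths look like from application code: boto3, numpy, and cupy are real libraries, while ObjectRdmaClient is a hypothetical stand-in for the NVIDIA client library, whose interface is not reproduced here:

```python
import boto3          # conventional HTTP S3 client
import numpy as np
import cupy as cp     # GPU (HBM) array allocation

def load_via_http(bucket: str, key: str) -> cp.ndarray:
    # Conventional path: TCP/HTTP lands the object in host DRAM
    # (after a kernel-to-user copy), then a second copy crosses
    # PCIe into HBM.
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return cp.asarray(np.frombuffer(body, dtype=np.uint8))

def load_via_rdma(bucket: str, key: str, nbytes: int) -> cp.ndarray:
    # RDMA path: the destination is allocated in HBM and registered
    # with the NIC, so the transfer runs storage-node memory -> HBM
    # in one hop, with no host bounce buffer.
    dest = cp.empty(nbytes, dtype=cp.uint8)
    client = ObjectRdmaClient(endpoint="storage.rdma.local")   # hypothetical
    client.get_object_into(bucket=bucket, key=key, dest=dest)  # hypothetical
    return dest
```

In the DPU-offload variant the same client logic runs on the BlueField-3's Arm cores instead of the host, which is what the ROS2 / SmartNIC pattern refers to.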

Why it exists

At 400GbE and above, TCP interrupt handling and kernel-to-user packet copies starve GPUs of training data. The host CPU cannot feed the PCIe pipeline fast enough, GPUs idle, and the economics of the AI factory collapse. Kernel-bypass RDMA shifts the entire data-transport burden off the CPU, keeping expensive GPUs compute-bound during massive checkpoint loads and distributed training reads.
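
To put numbers on "massive checkpoint loads" (sizes and throughputs below are illustrative assumptions, not benchmarks):

```python
# Illustrative arithmetic only: checkpoint size and effective
# throughputs are assumptions chosen for scale, not measured results.

PARAMS = 1e12                 # a 1T-parameter model
BYTES_PER_PARAM = 2           # bf16 weights; optimizer state would add more
checkpoint_bytes = PARAMS * BYTES_PER_PARAM          # 2 TB of weights

for label, gbps in [("HTTP S3 over host TCP", 40),
                    ("S3 over RDMA", 320)]:
    seconds = checkpoint_bytes / (gbps * 1e9 / 8)
    print(f"{label:>21}: {seconds:5.0f} s to restore at {gbps} Gb/s effective")
# -> roughly 400 s vs 50 s; multiply by every synchronized restart
#    across thousands of nodes and the GPU-idle cost compounds.
```
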

Primary use cases

  • Multi-trillion-parameter LLM pre-training
  • Synchronized checkpoint loading across thousands of GPU nodes
  • High-throughput RAG vector ingestion at training scale
  • SmartNIC-offloaded object storage on BlueField-3 DPUs
  • On-prem AI factories where standard HTTP S3 is the throughput-per-watt bottleneck
