NVIDIA GPUDirect RDMA for S3
NVIDIA's client/server library stack released November 2025 that moves S3-compatible object data directly from storage-node memory to GPU high-bandwidth memory over RoCE v2 or InfiniBand — bypassing the host OS kernel and TCP/IP stack. Client libraries run on GPU nodes (or offload to BlueField-3 DPUs via the ROS2 / SmartNIC pattern); server libraries ship in object-storage controllers from MinIO AIStor, Cloudian HyperStore, Dell ObjectScale, and HPE Alletra Storage MP X10000.
Summary
At 400GbE and above, TCP interrupt handling and kernel-to-user-space packet copies starve GPUs during training. Kernel-bypass RDMA shifts transport off the host CPU, keeping GPUs compute-bound instead of waiting on I/O. This moved from research curiosity to AI-factory prerequisite in the span of a year; by 2026 it is a standard checkbox on enterprise on-prem AI storage.
- Not a cloud-S3 accelerator. Public S3 over HTTPS does not expose RDMA — this applies to on-prem and colo object stores with RDMA-capable controllers.
- Control plane and data plane are separate. The gRPC control channel for namespace resolution is low-bandwidth; RDMA runs beneath on UCX/libfabric. Skipping the DPU offload means keeping the client on the host, which still works.
- Benchmarks favor DPU offload (ROS2 pattern) when multi-tenant isolation or inline encryption matters. Running TCP on a SmartNIC lags badly — RDMA is the mandatory prerequisite for the offload to pay off.
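The caveats above amount to a small decision rule for where the S3 client runs. The sketch below is one illustrative reading of them, not NVIDIA's or any vendor's logic. The `Site` fields, the `client_placement` function, and the returned labels are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical description of a deployment; field names are illustrative."""
    public_cloud_s3: bool      # managed S3 reached over standard HTTPS
    rdma_fabric: bool          # RoCE v2 or InfiniBand end to end
    dpu_available: bool        # e.g. a BlueField-3 in each GPU node
    multi_tenant: bool         # tenants need isolation from each other
    inline_encryption: bool    # data must be encrypted in the data path


def client_placement(site: Site) -> str:
    """Return where the S3 client would run under the caveats in the text."""
    # Public-cloud S3 exposes no RDMA, and without an RDMA fabric there is
    # nothing for the fast path (or a DPU offload) to build on.
    if site.public_cloud_s3 or not site.rdma_fabric:
        return "host-tcp"
    # DPU offload is favored when isolation or inline encryption matters.
    if site.dpu_available and (site.multi_tenant or site.inline_encryption):
        return "dpu-rdma"
    # Otherwise the client stays on the host and still uses RDMA.
    return "host-rdma"


on_prem = Site(
    public_cloud_s3=False,
    rdma_fabric=True,
    dpu_available=True,
    multi_tenant=True,
    inline_encryption=False,
)
print(client_placement(on_prem))
```

The ordering of the checks reflects the text's claim that RDMA is a precondition for the offload: a DPU without an RDMA fabric never selects the offloaded path.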
implements: RDMA (RoCE v2 / InfiniBand), the underlying transport
augments: GPU-Direct Storage Pipeline, the data-to-HBM path for S3 sources
augments: RDMA-Accelerated Object Access, productizes the arch pattern
bypasses: S3 API, routes around HTTP/TCP while preserving S3 semantics
solves: Cold Scan Latency, checkpoint loads land at near-local-memory speed
solves: High Cloud Inference Cost, reclaims CPU cycles lost to interrupt handling
scoped_to: Object Storage for AI Data Pipelines
Definition
NVIDIA's RDMA client/server library stack for S3-compatible object storage, introduced November 2025, that moves data directly from a storage node's memory to GPU high-bandwidth memory over RoCE v2 or InfiniBand fabrics — bypassing the host OS kernel, the TCP/IP stack, and user-space copy chains. A storage-side server library integrates into object-storage controllers (MinIO AIStor, Cloudian HyperStore, Dell ObjectScale, HPE Alletra Storage MP X10000); the client library runs on GPU compute nodes or is offloaded to a BlueField-3 DPU using the ROS2 (RDMA-first Object Storage) pattern.
At 400GbE and above, TCP interrupt handling and kernel-space packet copies starve GPUs of training data. The host CPU cannot feed PCIe fast enough, GPUs idle, and the economics of the AI factory collapse. Kernel-bypass RDMA shifts the entire data-transport burden off the CPU, keeping expensive GPUs compute-bound during massive checkpoint loads and distributed training reads.
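To put the 400GbE figure in units that compare directly with storage throughput: a 400 Gbit/s link carries at most 50 GB/s before any protocol overhead. The sketch below does that conversion and estimates how long a payload takes to move at a given fraction of line rate. The 1 TB payload and the efficiency fractions are placeholder assumptions for illustration, not measurements of any transport.

```python
def line_rate_gb_per_s(link_gbit_per_s: float) -> float:
    """Convert a nominal link speed in Gbit/s to GB/s (decimal units, no overhead)."""
    return link_gbit_per_s / 8.0


def transfer_seconds(payload_gb: float, link_gbit_per_s: float, efficiency: float) -> float:
    """Seconds to move payload_gb when the path sustains `efficiency` of line rate."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return payload_gb / (line_rate_gb_per_s(link_gbit_per_s) * efficiency)


PAYLOAD_GB = 1000.0  # hypothetical 1 TB checkpoint, chosen only for illustration

# Assumed efficiency fractions; real values depend on the transport and host.
for efficiency in (0.25, 0.5, 0.9):
    seconds = transfer_seconds(PAYLOAD_GB, 400.0, efficiency)
    print(f"{efficiency:.0%} of line rate: {seconds:.1f} s")
```

The point of the sweep is only that load time scales inversely with the fraction of line rate the data path can sustain, which is the quantity kernel bypass is meant to raise.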
Typical use cases: multi-trillion-parameter LLM pre-training, synchronized checkpoint loading across thousands of GPU nodes, high-throughput RAG vector ingestion at training scale, SmartNIC-offloaded object storage on BlueField-3 DPUs, and on-prem AI factories where standard HTTP S3 is the throughput-per-watt bottleneck.
Recent developments
- NVIDIA cuObject — GPUDirect Storage for Objects shipped January 2026. This is the public-API surface that formalizes the RDMA fast path between S3-compatible object storage and GPU memory. Before cuObject, GPUDirect for object storage existed as an internal NVIDIA library and a handful of vendor-specific integrations — cuObject standardizes the API so any object-storage vendor can plug their server-side library into the same client surface. Strategic implication: the GPU-direct path stops being a per-vendor differentiator and becomes a baseline integration target. Storage vendors who don't wire up cuObject will lose AI-training workloads to those who do.
- MinIO AIStor × NVIDIA STX integration (announced at GTC 2026). First commercial announcement of a vendor-supported on-prem object-storage tier targeting the NVIDIA STX reference architecture. The stack rides NVIDIA DOCA (Data Center Infrastructure-on-a-Chip Architecture) on BlueField-4 DPUs, moving the storage client off the host CPU entirely and freeing host cycles for actual training work. GPUDirect RDMA for S3 is currently in tech preview; BlueField-4 GA is expected H2 2026. The integration also bundles hardware-accelerated erasure coding and zero-copy data transfer alongside the GPU memory path, so it is not just an RDMA transport swapped in for HTTP; it is a full data-plane re-architecture for AI training I/O.
- Validation pattern — Meta already proved the deployment shape. The MinIO/STX configuration follows the same architectural pattern Meta ran in production to feed 2,048 H100 GPUs simultaneously during foundation-model pre-training, sustaining 192 GB/s of streamed training data straight from object storage to GPU HBM and cutting wall-clock training time by 3.8× versus the prior CPU-mediated TCP path. What's new with cuObject + STX is the commercial packaging — you no longer need Meta-scale internal engineering to reproduce that result; the vendor stack delivers it.
- What this means for operators. If you're running on-prem AI training and your data path is still HTTP-over-TCP S3, the gap between you and a cuObject-enabled stack is now roughly 3-4× on training throughput. Two prerequisites for adoption: (1) RDMA-capable fabric end-to-end (RoCE v2 or InfiniBand, not optional), and (2) a storage tier that's wired into cuObject. Public-cloud managed S3 over standard HTTPS is not a viable substrate; this is an on-prem / colo / private-cloud play. Watch for AWS / Azure / GCP to respond either with their own proprietary fast paths or with bare-metal instance types that expose the underlying RDMA fabric to tenants.
- Head-to-head: software-defined RDMA vs purpose-built appliance (June 2026). A published throughput comparison frames the two design philosophies emerging in the GPUDirect-for-S3 race. Software-defined: MinIO AIStor sustains ~45-50 GB/s single-node GET over 400GbE RDMA with GPU-server CPU utilization at ~1% (the kernel-bypass payoff), scaling to ~900 GB/s per 20-node rack (~18 kW). Appliance: Dell Lightning FS claims 150 GB/s per 1RU enclosure, ~6 TB/s per rack (40 enclosures, ~32 kW). The trade is density-per-watt (appliance wins on raw rack throughput) vs commodity-hardware flexibility (software-defined runs on the NICs and servers you already buy). The numbers are vendor-adjacent, since the source is a MinIO-authored benchmark, but the ~1% CPU figure is the architecturally important one: it is what frees the host to keep GPUs fed. Source: MinIO AIStor vs Dell AI Data Platform benchmark.
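The rack-level figures in the head-to-head item follow from the per-unit numbers by multiplication, and throughput per kilowatt falls out of the same inputs. The sketch below redoes that arithmetic using only the vendor-reported values quoted in that item. It takes the low end of the 45-50 GB/s single-node range, which is what the ~900 GB/s rack figure implies. The `RackConfig` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackConfig:
    """One rack as described in the comparison; all values are vendor-reported."""
    name: str
    units: int                 # nodes or enclosures per rack
    gb_per_s_per_unit: float   # sustained throughput per node/enclosure
    rack_kw: float             # approximate rack power draw

    @property
    def rack_gb_per_s(self) -> float:
        return self.units * self.gb_per_s_per_unit

    @property
    def gb_per_s_per_kw(self) -> float:
        return self.rack_gb_per_s / self.rack_kw


# 45 GB/s is the low end of the quoted 45-50 GB/s single-node range.
software = RackConfig("MinIO AIStor (software-defined)", 20, 45.0, 18.0)
appliance = RackConfig("Dell Lightning FS (appliance)", 40, 150.0, 32.0)

for cfg in (software, appliance):
    print(f"{cfg.name}: {cfg.rack_gb_per_s:.0f} GB/s per rack, "
          f"{cfg.gb_per_s_per_kw:.1f} GB/s per kW")
```

On these inputs the appliance works out to 6,000 GB/s per rack and 187.5 GB/s per kW, against 900 GB/s and 50 GB/s per kW for the software-defined rack, which is the density gap the comparison describes.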
Resources
NVIDIA's announcement of RDMA for S3-compatible storage (November 2025) with architecture diagrams of the GPU HBM ← network fabric ← storage-node memory path.
MinIO AIStor integration write-up — most detailed vendor-side explanation of the server library and why TCP/IP overhead is the bottleneck RDMA eliminates.
ROS2 ("RDMA-first Object Storage") arXiv paper: the research basis for BlueField-3 DPU offload of the DAOS client, including FIO/DFS benchmarks showing DPU-offloaded RDMA matches host-side performance.
Press coverage enumerating the enterprise storage partners (Cloudian HyperStore, Dell ObjectScale, HPE Alletra Storage MP X10000) who embedded the server library natively.