RDMA (RoCE v2 / InfiniBand)
A family of network transports for direct memory-to-memory data transfer between machines. Transfers bypass the operating system kernel and keep the host CPU out of the data path, which minimizes latency and maximizes throughput. In early 2026, NVIDIA shipped RDMA client and server libraries for S3-compatible storage as part of the CUDA Toolkit, marking the transition from niche technical preview to standard "AI Factory" infrastructure.
Summary
RDMA is the transport that high-performance storage fabrics use when they need microsecond-level access. Object storage systems serving AI/ML workloads use RDMA to bring storage access times close to those of local NVMe, which enables GPU-direct data paths. With the NVIDIA CUDA Toolkit integration, GPU clusters can access S3-compatible storage over RDMA without custom driver work.
- RDMA requires specialized network infrastructure. RoCE v2 runs on lossless Ethernet, which requires PFC/ECN configuration; InfiniBand requires dedicated switches and HCAs.
- RDMA performance is highly sensitive to network configuration. Incorrect QoS, PFC, or ECN settings can push performance below that of standard TCP.
- The NVIDIA CUDA Toolkit RDMA libraries target S3-compatible storage specifically; not all object stores support the required RDMA transport yet.
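The configuration sensitivity described in the list above can be made concrete with a small consistency check. The following is a minimal sketch, not a vendor tool: the `PortQos` record, its field names, and the use of priority 3 are all hypothetical and only stand in for whatever a real switch or NIC actually reports. It flags the two misconfigurations named above (PFC missing on the RDMA priority, ECN disabled) plus a priority mismatch along the path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortQos:
    """Hypothetical, simplified view of one port's QoS settings."""
    name: str
    pfc_priorities: frozenset  # 802.1p priorities with PFC enabled
    ecn_enabled: bool
    rdma_priority: int         # priority that RoCE v2 traffic is mapped to


def find_lossless_problems(ports):
    """Return human-readable problems that would break a lossless RoCE v2 path."""
    problems = []

    # Every port on the path should map RDMA traffic to the same priority.
    priorities = {p.rdma_priority for p in ports}
    if len(priorities) > 1:
        problems.append(f"RDMA priority differs across ports: {sorted(priorities)}")

    for p in ports:
        # PFC must cover the priority that carries RDMA traffic.
        if p.rdma_priority not in p.pfc_priorities:
            problems.append(f"{p.name}: PFC not enabled on RDMA priority {p.rdma_priority}")
        # ECN must be on so congestion is signalled before PFC pauses pile up.
        if not p.ecn_enabled:
            problems.append(f"{p.name}: ECN disabled")
    return problems


# Example path: two hosts and one leaf switch (all values are made up).
ports = [
    PortQos("host-a", frozenset({3}), True, 3),
    PortQos("leaf-1", frozenset(), True, 3),
    PortQos("host-b", frozenset({3}), False, 3),
]
problems = find_lossless_problems(ports)
for line in problems:
    print(line)
```

In this made-up path the check reports that `leaf-1` has no PFC on the RDMA priority and that `host-b` has ECN disabled. A real audit would read these settings from the devices themselves rather than from hand-written records.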
- enables: RDMA-Accelerated Object Access (the transport protocol for microsecond object access)
- enables: GPU-Direct Storage Pipeline (direct storage-to-GPU data path)
- scoped_to: Object Storage (underlying transport for high-performance storage)
Definition
Network transport protocols enabling Remote Direct Memory Access — transferring data directly between application memory on different servers without involving the CPU or OS kernel, achieving microsecond-level latency. In early 2026, NVIDIA released RDMA client and server libraries for S3-compatible storage as part of the CUDA Toolkit, transitioning RDMA from niche technical preview to a standard component of "AI Factory" infrastructure.
HTTP/TCP-based S3 access introduces millisecond-scale latency. For internal object storage data paths (inter-node replication, erasure-coding reconstruction), RDMA removes most of the kernel and protocol-stack overhead, bringing storage fabric performance closer to local memory access.
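A back-of-the-envelope calculation shows why the jump from milliseconds to microseconds matters for dependent operations, where each request must finish before the next can start. The sketch below assumes 1 ms for an HTTP/TCP request and 5 µs for an RDMA operation; both figures are illustrative assumptions, not measurements.

```python
ASSUMED_HTTP_LATENCY_S = 1e-3  # assumption: 1 ms per HTTP/TCP request
ASSUMED_RDMA_LATENCY_S = 5e-6  # assumption: 5 µs per RDMA operation


def max_dependent_ops_per_second(latency_s: float) -> float:
    """Upper bound on one-at-a-time operations per second at a given latency."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / latency_s


http_rate = max_dependent_ops_per_second(ASSUMED_HTTP_LATENCY_S)
rdma_rate = max_dependent_ops_per_second(ASSUMED_RDMA_LATENCY_S)

print(f"HTTP/TCP: {http_rate:,.0f} dependent ops/s")
print(f"RDMA:     {rdma_rate:,.0f} dependent ops/s")
print(f"Ratio:    {rdma_rate / http_rate:,.0f}x")
```

Under these assumptions a single chain of dependent requests tops out at roughly 1,000 operations per second over HTTP/TCP and roughly 200,000 over RDMA. Pipelining and parallel requests raise both numbers, so this is a bound on serial work only, not a throughput prediction.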
High-performance inter-node replication, erasure-coding reconstruction acceleration, AI/ML storage fabric, low-latency data movement within storage clusters, NVIDIA CUDA Toolkit GPU-direct S3 access.
Recent developments
- Ethernet (RoCE v2) is winning the AI fabric war. Broadcom's March 2026 earnings confirmed that ~70% of new AI infrastructure deployments are choosing Ethernet-based fabrics over InfiniBand. Meta, Microsoft, and AWS are converging on RoCE v2 for operational reasons (existing Ethernet skills and open-vendor sourcing). Per Rack2Cloud, "InfiniBand vs RoCEv2: Why Ethernet Is Winning", and NetPilot, "RoCEv2 vs InfiniBand: AI Cluster Networking Compared 2026".
- Performance gap narrowed: InfiniBand 1–2 µs vs RoCE v2 2–5 µs. RoCE v2 reaches 85–95% of InfiniBand's training throughput for tier-2/3 deployments (256–1,024 GPUs). InfiniBand still wins at frontier scale (>10K GPUs), where the latency delta compounds. Per FirstPassLab, "RoCE vs InfiniBand for AI Data Center Networking 2026".
- Ultra Ethernet Consortium (UEC) is emerging as the third option. UEC builds next-generation Ethernet specifically for AI workloads, with built-in reliability that eliminates PFC entirely, adaptive AI-tuned congestion control, and native RDMA. Production deployments are expected from H2 2026 onward. Per Stordis, "Ultra Ethernet vs InfiniBand, RoCE and TCP - AI", and Medium, "From InfiniBand to Ultra Ethernet: Why AI Networks Rethought RDMA".
- Lossless Ethernet (PFC + ECN) is the table-stakes RoCE v2 deployment pattern. Production RoCE v2 requires Priority Flow Control and Explicit Congestion Notification across the fabric. Getting this configuration right is the load-bearing operational task that separates working RoCE v2 deployments from broken ones. Per Intelligent Visibility, "Lossless Ethernet Design Guide for AI Fabrics 2026".
- iWARP is dead; RoCE v2, InfiniBand, and UEC are the three options. Intelligent Visibility's RDMA-for-Storage guide explicitly retires iWARP from the production option set: the option that competed with RoCE v2 and InfiniBand a decade ago now has effectively zero new deployments. Per Intelligent Visibility, "RDMA for Storage Ethernet: RoCE vs iWARP".
- RDMA-for-S3 is the convergence point. cuObject, MinIO AIStor, Cloudian, and VAST all wire it via RoCE v2 or InfiniBand. Cross-vendor signal: every major S3-RDMA implementation in 2026 sits on top of RoCE v2 or InfiniBand. The protocol layer is settled; the storage-vendor implementations are where the differentiation lives. Per Distributed AI Fabrics, "InfiniBand, RDMA, Lossless Ethernet Strategy Guide".
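The list above notes that the latency delta between InfiniBand and RoCE v2 compounds at frontier scale. The sketch below illustrates one way that can happen, under loud assumptions: it models a single flat ring all-reduce, which takes 2(N-1) sequential steps for N participants, and it uses the midpoints of the quoted ranges (1.5 µs and 3.5 µs) as per-step latencies. Production collective libraries typically use hierarchical or tree algorithms, so the absolute numbers overstate the effect; only the growth with cluster size is the point.

```python
ASSUMED_IB_LATENCY_S = 1.5e-6    # assumption: midpoint of the quoted 1–2 µs
ASSUMED_ROCE_LATENCY_S = 3.5e-6  # assumption: midpoint of the quoted 2–5 µs


def ring_allreduce_latency_term(num_gpus: int, per_step_latency_s: float) -> float:
    """Latency-only cost of one flat ring all-reduce: 2 * (N - 1) sequential steps."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return 2 * (num_gpus - 1) * per_step_latency_s


def latency_gap(num_gpus: int) -> float:
    """Extra latency per all-reduce on RoCE v2 versus InfiniBand under this model."""
    roce = ring_allreduce_latency_term(num_gpus, ASSUMED_ROCE_LATENCY_S)
    ib = ring_allreduce_latency_term(num_gpus, ASSUMED_IB_LATENCY_S)
    return roce - ib


for n in (256, 1024, 10_000):
    print(f"{n:>6} GPUs: extra latency per all-reduce = {latency_gap(n) * 1e3:.2f} ms")
```

With these assumed values the gap is about 1 ms per all-reduce at 256 GPUs and about 40 ms at 10,000 GPUs. In this simplified model the gap grows linearly with the number of participants, which is consistent with the claim that a small per-message difference matters far more at very large scale.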
Resources
NVIDIA networking solutions page covering InfiniBand and RoCE products for RDMA-accelerated data center communication.
Authoritative whitepaper on deploying RoCE v2 in data centers, covering lossless Ethernet configuration and performance analysis.