Guide 40

GPUDirect to S3 — cuObject, RDMA, and the Zero-Copy Pipeline

Problem Framing

Traditional AI training and inference pipelines move object-storage data through the CPU as a bounce buffer: S3 → NIC → CPU → PCIe → GPU. Every hop adds latency and burns CPU cycles. NVIDIA GPUDirect Storage (GDS) eliminates the CPU bounce buffer for block and file storage via cuFile; cuObject extends the same pattern to S3-compatible object storage using a control-plane / data-plane split keyed on the x-amz-rdma-token HTTP header. The result: object payloads move from S3 storage directly into GPU VRAM over RDMA, with vendor-reported sustained throughput above 200 GB/s. This guide maps out when to adopt the GPU-direct-from-S3 pattern.

Relevant Nodes

Topics: GPU + Object Storage Convergence, Inference Locality, Object Storage for AI Data Pipelines
Technologies: NVIDIA cuObject, NVIDIA BlueField-4, NIXL (NVIDIA Inference Transfer Library), MinIO, Cloudian, VAST Data, RustFS
Standards: S3 API (with x-amz-rdma-token extension), RDMA / RoCE v2, InfiniBand, CXL 3.0, NVMe-oF (including NVMe/TCP)
Architectures: GPU-Direct Storage Pipeline, Tiered Storage, Decoupled Vector Search
Pain Points: Data Loading Bottleneck, Memory Wall, Cold Scan Latency, Prefill Tax

Decision Path

Confirm the workload is data-loading-bound (not compute-bound):
- Measure GPU utilization during S3-reading phases. If utilization is <50% and the bottleneck is data movement, GPU-direct-from-S3 is high-leverage.
- Workloads where compute fully saturates the GPU (small models, low-batch-size inference) won't benefit — the bottleneck is elsewhere.
- Training data loaders, large-model checkpoint streaming, and KV-cache pool transfers are the highest-leverage targets.
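The utilization check above can be sketched as a small classifier. This is a minimal sketch, assuming GPU utilization samples (in percent) were already collected during the S3-reading phase by whatever monitoring is in place; the function name and the returned labels are hypothetical, and the 50% threshold is this guide's rule of thumb rather than a hard limit.

```python
from statistics import mean


def classify_loading_phase(util_samples, threshold_pct=50.0):
    """Label a data-loading phase from GPU utilization samples (0-100)."""
    if not util_samples:
        raise ValueError("need at least one utilization sample")
    avg = mean(util_samples)
    # Below the threshold the GPU is mostly idle during reads. That only
    # suggests a data-loading bottleneck; confirm with I/O or NIC metrics.
    if avg < threshold_pct:
        label = "likely-data-loading-bound"
    else:
        label = "likely-compute-bound"
    return label, avg


# Hypothetical samples taken while a data loader was reading from S3.
label, avg = classify_loading_phase([22, 35, 41, 18, 30])
print(label, round(avg, 1))
```

A low average alone does not prove that storage is the cause; small kernels or synchronization stalls produce the same signature, so pair this check with storage-side throughput numbers.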
Verify the storage backend supports cuObject / GDS over S3:
- Cloudian HyperStore — sustained >200 GB/s reported with GPUDirect for Object Storage integration.
- VAST Data with DASE architecture — pushes S3 over RDMA natively.
- MinIO — native S3 GDS implementation for massive parallel throughput on training datasets.
- RustFS — Apache 2.0 alternative with the same drop-in pattern; cuObject support tracking upstream.
- Self-hosted MinIO clusters on commodity hardware can also achieve high throughput; the bottleneck shifts to NIC speed (200/400 Gbps Ethernet, InfiniBand, or RoCE v2).
Plan the fabric:
- InfiniBand — lowest latency, highest throughput; the de-facto fabric for AI training clusters.
- RoCE v2 (RDMA over Converged Ethernet) — sufficient for most production workloads, cheaper than InfiniBand, integrates with standard Ethernet switching.
- NVMe/TCP (NVMe-oF over a TCP transport) — fallback for environments without RDMA fabrics; expect lower throughput but workable for batch workloads.
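Because the bottleneck shifts to NIC speed, a rough link budget helps when sizing the fabric. The sketch below is plain arithmetic: it converts link speed from Gbit/s to GB/s and counts how many links a target aggregate throughput needs. The `efficiency` parameter is an assumed fudge factor for protocol overhead, not a measured value, and the function name is hypothetical.

```python
import math


def links_needed(target_gbytes_per_s, link_gbits_per_s, efficiency=1.0):
    """Number of NIC links needed to carry a target aggregate throughput."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    # 8 bits per byte: a 400 Gbit/s link carries at most 50 GB/s.
    per_link_gbytes = link_gbits_per_s / 8 * efficiency
    return math.ceil(target_gbytes_per_s / per_link_gbytes)


# Links required for 200 GB/s aggregate at the two Ethernet speeds mentioned above.
for speed in (200, 400):
    print(
        f"{speed} Gbps:",
        links_needed(200, speed), "links at line rate,",
        links_needed(200, speed, efficiency=0.9), "at an assumed 90% efficiency",
    )
```

The takeaway is that 200 GB/s is an aggregate, multi-link figure: even at 400 Gbps per port it takes several ports across the storage and GPU nodes to sustain it.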
Understand the cuObject protocol:
- Control plane: The GPU application initiates a standard S3 GET/PUT via a modified S3 SDK. The SDK appends specific metadata tags, notably x-amz-rdma-token, to the HTTP request.
- Fabric negotiation: On token verification, the storage gateway initiates a Dynamic Connection (DC) transport over InfiniBand or RoCE v2.
- Data plane: An RDMA_READ or RDMA_WRITE streams the S3 object payload directly into GPU VRAM, bypassing the host CPU's TCP/IP stack entirely.
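The three phases above can be illustrated with a toy, in-process simulation. This is not the cuObject API and performs no RDMA: the `Gateway` class, `make_token`, `build_get_request`, the HMAC-based token format, and the shared secret are all hypothetical stand-ins, and a `bytearray` plays the role of a registered GPU buffer. It only shows the shape of the flow: a normal GET carrying the x-amz-rdma-token header, token verification, then a payload written into the caller's buffer instead of an HTTP response body.

```python
import hashlib
import hmac

SECRET = b"demo-shared-secret"  # hypothetical; stands in for real token issuance


def make_token(bucket, key):
    # Hypothetical token format: an HMAC over the object path.
    return hmac.new(SECRET, f"{bucket}/{key}".encode(), hashlib.sha256).hexdigest()


def build_get_request(bucket, key):
    """Control plane: an ordinary S3 GET plus the RDMA token header."""
    return {
        "method": "GET",
        "path": f"/{bucket}/{key}",
        "headers": {"x-amz-rdma-token": make_token(bucket, key)},
    }


class Gateway:
    """Toy storage gateway that keeps objects in memory."""

    def __init__(self, objects):
        self.objects = objects

    def handle(self, request, dest_buffer):
        bucket, key = request["path"].lstrip("/").split("/", 1)
        token = request["headers"].get("x-amz-rdma-token", "")
        # Fabric negotiation: verify the token before opening the data path.
        if not hmac.compare_digest(token, make_token(bucket, key)):
            return {"status": 403, "bytes": 0}
        payload = self.objects[(bucket, key)]
        if len(payload) > len(dest_buffer):
            return {"status": 413, "bytes": 0}  # destination buffer too small
        # Data plane: place the payload straight into the caller's buffer
        # (standing in for an RDMA write into GPU memory).
        dest_buffer[: len(payload)] = payload
        return {"status": 200, "bytes": len(payload)}


gateway = Gateway({("train", "shard-0"): b"tensor-bytes"})
gpu_buffer = bytearray(64)  # stand-in for registered GPU memory
resp = gateway.handle(build_get_request("train", "shard-0"), gpu_buffer)
print(resp, bytes(gpu_buffer[: resp["bytes"]]))
```

The design point the simulation preserves is that the HTTP exchange carries only metadata and authorization, while the bytes travel on a separate path into memory the client registered in advance.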
Layer in NIXL + ICMS for the full Tier 3.5 architecture:
- NIXL (NVIDIA Inference Transfer Library) coordinates data movement between storage tiers, GPUs, and inference engines.
- ICMS / CMX (Inference Context Memory Storage / Context Memory eXtension) — the dedicated Tier 3.5 storage layer between local SSDs (Tier 3) and cold S3 (Tier 4), hosted by NVIDIA BlueField-4 DPUs.
- Together: NIXL automatically spills KV-cache and agentic state from GPU HBM → CXL → NVMe → ICMS → S3 based on access patterns, with cuObject handling the S3 transfers at line rate.
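The spill behaviour described above can be sketched as a chain of LRU caches, one per tier. This is an illustrative model only, not NIXL's API or its actual placement policy: the `TieredCache` class, the per-tier entry counts, and the "promote to HBM on access" rule are assumptions chosen to keep the example small. Only the tier order comes from this guide.

```python
from collections import OrderedDict

TIERS = ["HBM", "CXL", "NVMe", "ICMS", "S3"]  # order taken from the guide


class TieredCache:
    def __init__(self, capacities):
        # capacities: max entries for each bounded tier, fastest first.
        # Tiers beyond the list (here S3) are treated as unbounded.
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in TIERS]

    def put(self, key, value, level=0):
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # most recently used goes last
        cap = self.capacities[level] if level < len(self.capacities) else None
        while cap is not None and len(tier) > cap:
            old_key, old_val = tier.popitem(last=False)  # least recently used
            self.put(old_key, old_val, level + 1)  # spill one tier down

    def get(self, key):
        for tier in self.tiers:
            if key in tier:
                value = tier.pop(key)
                self.put(key, value, 0)  # assumed policy: promote on access
                return value
        raise KeyError(key)

    def where(self, key):
        for name, tier in zip(TIERS, self.tiers):
            if key in tier:
                return name
        return None


cache = TieredCache([2, 2, 2, 2])  # hypothetical capacities, in entries
for i in range(6):
    cache.put(f"kv{i}", i)
print({f"kv{i}": cache.where(f"kv{i}") for i in range(6)})
```

Older entries drift toward the slower tiers as new ones arrive, and a later `get` pulls an entry back to the top, which in turn pushes colder entries down. A real orchestrator would weigh transfer cost and object size, not just recency.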

What Changed Over Time

By 2023: GPUDirect Storage (GDS) was established for block and file storage via cuFile. Object storage was excluded from the GPU-direct pattern.
Mid-2025: cuObject library released, extending GDS semantics to S3 via the x-amz-rdma-token mechanism.
2026: Cloudian, VAST, and MinIO integrated cuObject, with RustFS tracking support upstream; Cloudian reported sustained 200+ GB/s. NVIDIA BlueField-4 announced as the DPU substrate for AI-native storage (the ICMS / CMX tier). NIXL formalized cross-tier data orchestration.
Forward: CXL 3.0 will further dissolve the host-RAM-vs-object-storage boundary. Distributed Page Caches over CXL.mem will let inference clusters share KV-cache state at sub-microsecond latency, with S3 as the cold-durable tier behind them.
