Definition

What it is

NVIDIA's CUDA library extending **GPUDirect Storage (GDS)** semantics to S3-compatible object storage. Where the original GDS targeted block and file storage via `cuFile`, cuObject enables high-performance **RDMA transfers over S3 APIs** by separating the control plane from the data plane:

1. **Control plane handshake**: the GPU application initiates a standard S3 GET/PUT via a modified S3 SDK. The SDK appends specific metadata tags, notably **`x-amz-rdma-token`**, to the HTTP request.
2. **Fabric negotiation**: on token verification, the storage gateway initiates a Dynamic Connection (DC) transport over InfiniBand or RoCE v2.
3. **Data plane streaming**: an RDMA_READ or RDMA_WRITE streams the S3 object payload directly into GPU VRAM, bypassing the host CPU's TCP/IP stack entirely.
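The control-plane step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the cuObject SDK: the only name taken from the text is the `x-amz-rdma-token` header. The token layout (base64-encoded JSON with `addr`, `len`, `rkey`), the helper names, and the `storage.example` endpoint are all hypothetical, and S3 request signing is omitted. The request is built but never sent.

```python
import base64
import json
import urllib.request

# Header name taken from the description above; everything else is illustrative.
RDMA_TOKEN_HEADER = "x-amz-rdma-token"


def encode_rdma_token(gpu_addr: int, length: int, rkey: int) -> str:
    """Pack a HYPOTHETICAL RDMA descriptor into a header-safe string.

    The real token format is defined by cuObject; this base64(JSON)
    layout is an assumption made only for the sketch.
    """
    descriptor = {"addr": gpu_addr, "len": length, "rkey": rkey}
    raw = json.dumps(descriptor, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_rdma_token(token: str) -> dict:
    """Inverse of encode_rdma_token, as a storage gateway might do before verifying."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


def build_get_request(endpoint: str, bucket: str, key: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an S3-style GET that carries the RDMA token.

    Authentication headers (e.g. SigV4) are deliberately left out.
    """
    request = urllib.request.Request(f"{endpoint}/{bucket}/{key}", method="GET")
    # urllib stores header names capitalized; HTTP header names are case-insensitive.
    request.add_header(RDMA_TOKEN_HEADER, token)
    return request


if __name__ == "__main__":
    token = encode_rdma_token(gpu_addr=0x7F0000000000, length=4096, rkey=42)
    req = build_get_request("http://storage.example", "training-data", "shard-0001.bin", token)
    print(req.get_method(), req.full_url)
    print(req.header_items())
```

In this model the HTTP exchange carries only the descriptor; the object payload itself would move later over the RDMA data plane, which the sketch does not attempt to represent.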

Why it exists

Object storage was historically the slow tier: HTTP and REST request handling, with each payload streamed into CPU memory and then copied to the GPU. cuObject inverts that. Because it treats object payloads as RDMA-streamable, S3-backed training-data loaders and KV-cache pools can match the throughput characteristics of fast block storage. Cloudian, VAST Data, and MinIO all integrate cuObject; Cloudian reports sustained **>200 GB/s** throughput on GPU-attached S3 fabrics.

Primary use cases

Zero-copy S3-to-VRAM training-data loading, KV-cache pool transfers across the cluster fabric, large-model checkpoint streaming, object-storage-backed inference fabrics, and S3-over-RDMA for self-hosted deployments (MinIO and similar) adopting GPU-direct paths.

Recent developments

Latest signals

cuObject v1.0.0 GA in R1.16 release. Both client and server libraries shipped to General Availability under the GPUDirect Storage 1.16 release umbrella — the reference toolkit for GPUDirect-to-S3 deployments. Per NVIDIA — cuObject client v1.0.0 release notes and NVIDIA — cuObject server v1.0.0 release notes.
DC (Dynamic Connection) transport is the production default. Unlike Reliable Connection (RC) transport, DC does not pre-establish a connection between every client/server pair (on the order of N² connections), so it scales to large fabrics. RDMA metadata rides in the `x-amz-rdma-token` S3 header during the HTTP control-plane handshake. Per NVIDIA — cuObject GPUDirect Storage for Objects docs.
Multi-vendor adoption: MinIO, Cloudian, VAST. MinIO AIStor + GPUDirect tech preview hit 200+ GB/s sustained; Cloudian shipped at GA; VAST's DASE architecture also integrates cuObject. Cross-vendor adoption suggests cuObject is becoming the de facto interop layer for S3-RDMA. Per MinIO — AIStor + NVIDIA GPUDirect RDMA blog and Cloudian — Direct GPU-to-Object Storage.
45% GPU-server CPU reduction; 3× throughput vs non-RDMA flash. Reproducible numbers across vendor benchmarks — the CPU bypass is the dominant operational win, not just the throughput delta. Frees CPU cycles for the training-pipeline orchestration layer. Per MinIO — AIStor + GPUDirect RDMA performance.
KV-cache adoption emerging via LMCache integration request. LMCache filed a feature request to add S3-over-RDMA via cuObject — extends cuObject's value prop beyond training-data ingest into inference-time KV-cache pool transfers. Inference workloads now want what training got first. Per GitHub — LMCache Issue #2875: S3-over-RDMA via cuObject.
RoCE v2 + InfiniBand transports both supported. Production deployments can land on either the InfiniBand fabric (HPC-flavored deployments) or RoCE v2 over Ethernet (mainstream datacenter fabric) without code changes — cuObject abstracts the underlying RDMA transport. Per NVIDIA — cuObject GPUDirect Storage docs.
cuObject v1.2.0 (May 2026, CUDA 13.3) closed the object-vs-POSIX gap. Splits the control plane (HTTP S3 requests) from the data plane, using x-amz-rdma-token headers to push/pull data straight into GPU memory over Dynamically Connected RDMA — removing POSIX scratch staging and reaching wire-speed object→VRAM. Per NVIDIA — cuObject Server Release Notes.
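The scaling argument for Dynamic Connection transport noted above (avoiding N² pre-established connections) can be made concrete with a little arithmetic. The sketch below uses a deliberately simplified model that is an assumption of this example: one Reliable Connection per client/server pair versus one on-demand DC endpoint per node. Real deployments may run several queue pairs per process, so treat the numbers as orders of magnitude only.

```python
def rc_connections(clients: int, servers: int) -> int:
    """Reliable Connection model: one pre-established connection per client/server pair."""
    return clients * servers


def dc_endpoints(clients: int, servers: int) -> int:
    """Dynamic Connection model (simplified): one endpoint per node, connected on demand."""
    return clients + servers


if __name__ == "__main__":
    # Compare how connection state grows as the fabric gets larger.
    for n in (8, 64, 512):
        print(f"{n} clients x {n} servers: "
              f"RC={rc_connections(n, n)} connections, "
              f"DC={dc_endpoints(n, n)} endpoints")
```

Under this model, connection state grows quadratically for RC and linearly for DC, which is the property that lets DC scale to large fabrics.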