Technology

NVIDIA cuObject



Definition

What it is

cuObject is NVIDIA's CUDA library that extends **GPUDirect Storage (GDS)** semantics to S3-compatible object storage. Where the original GDS targeted block and file storage via `cuFile`, cuObject enables high-performance **RDMA transfers over S3 APIs** by separating the control plane from the data plane:

1. **Control plane handshake**: the GPU application initiates a standard S3 GET/PUT via a modified S3 SDK. The SDK appends specific metadata tags, notably **`x-amz-rdma-token`**, to the HTTP request.
2. **Fabric negotiation**: once the token is verified, the storage gateway initiates a Dynamic Connection (DC) transport over InfiniBand or RoCE v2.
3. **Data plane streaming**: an RDMA_READ or RDMA_WRITE streams the S3 object payload directly into GPU VRAM, so the payload never traverses the host CPU's TCP/IP stack.
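The control-plane/data-plane split described above can be illustrated with a small simulation. This is a toy sketch, not the cuObject API: only the header name `x-amz-rdma-token` comes from the text. The token format (an HMAC over the object path), the `ToyGateway` class, the `GpuBuffer` stand-in, and the bucket and key names are all hypothetical, and the "RDMA" step is simulated with an in-memory copy.

```python
import hashlib
import hmac
from dataclasses import dataclass

RDMA_TOKEN_HEADER = "x-amz-rdma-token"  # header name taken from the description


@dataclass
class GpuBuffer:
    """Hypothetical stand-in for a registered GPU memory region (host bytes here)."""
    data: bytearray


def make_token(secret: bytes, bucket: str, key: str) -> str:
    # Assumed token format: HMAC-SHA256 over "bucket/key". The real format is unknown.
    return hmac.new(secret, f"{bucket}/{key}".encode(), hashlib.sha256).hexdigest()


def build_get_request(secret: bytes, bucket: str, key: str) -> dict:
    # Step 1 (control plane): an ordinary S3-style GET plus the RDMA token header.
    return {
        "method": "GET",
        "path": f"/{bucket}/{key}",
        "headers": {RDMA_TOKEN_HEADER: make_token(secret, bucket, key)},
    }


class ToyGateway:
    """Hypothetical storage gateway holding objects in a dict keyed by (bucket, key)."""

    def __init__(self, secret: bytes, objects: dict):
        self.secret = secret
        self.objects = objects

    def handle(self, request: dict, dest: GpuBuffer) -> int:
        _, bucket, key = request["path"].split("/", 2)
        token = request["headers"].get(RDMA_TOKEN_HEADER, "")
        # Step 2 (fabric negotiation): verify the token before opening any data path.
        if not hmac.compare_digest(token, make_token(self.secret, bucket, key)):
            raise PermissionError("RDMA token rejected")
        payload = self.objects[(bucket, key)]
        if len(payload) > len(dest.data):
            raise ValueError("destination buffer too small")
        # Step 3 (data plane): simulated by writing the payload straight into the
        # destination buffer, with no HTTP response body involved.
        dest.data[: len(payload)] = payload
        return len(payload)


if __name__ == "__main__":
    secret = b"demo-secret"
    gateway = ToyGateway(secret, {("train", "shards/0001.bin"): b"abcdef"})
    buf = GpuBuffer(bytearray(8))
    n = gateway.handle(build_get_request(secret, "train", "shards/0001.bin"), buf)
    print(n, bytes(buf.data[:n]))
```

The point of the sketch is the ordering: the HTTP-style request carries only metadata and the token, and the payload moves on a separate path after verification succeeds.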

Why it exists

Object storage was historically the slow tier: requests went through HTTP and REST handling, and payloads were streamed as bytes into CPU memory before being copied into the GPU. cuObject inverts that model: by treating object payloads as RDMA-streamable, it lets S3-backed training-data loaders and KV-cache pools match the throughput characteristics of fast block storage. Cloudian, VAST Data, and MinIO all integrate cuObject; Cloudian reports sustained **>200 GB/s** throughput on GPU-attached S3 fabrics.
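For a rough sense of scale, the quoted throughput can be turned into a transfer-time estimate. The sketch below assumes decimal units (1 GB = 10^9 bytes) and treats 200 GB/s as a sustained rate. The 1 TB checkpoint size is a hypothetical example, not a figure from any vendor.

```python
GB = 10**9   # decimal gigabyte, assumed
TB = 10**12  # decimal terabyte, assumed


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Idealized transfer time: size divided by sustained rate, ignoring setup cost."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s


if __name__ == "__main__":
    reported_rate = 200 * GB  # lower bound of the throughput quoted above
    checkpoint = 1 * TB       # hypothetical checkpoint size
    print(transfer_seconds(checkpoint, reported_rate))  # 5.0 seconds
```

At that rate a 1 TB object streams in about five seconds under these idealized assumptions; real transfers would also pay for the handshake and fabric negotiation.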

Primary use cases

- Zero-copy S3-to-VRAM training-data loading
- KV-cache pool transfers across the cluster fabric
- Large-model checkpoint streaming
- Object-storage-backed inference fabrics
- S3 over RDMA for self-hosted deployments, such as `MinIO`, that adopt GPU-direct paths
