Guide 9

Object Storage for AI/ML Training Pipelines

Problem Framing

AI/ML training workloads are becoming the dominant consumers of object storage bandwidth. A single LLM training run may read tens of terabytes of tokenized data per epoch, write multi-gigabyte checkpoints every few minutes, and pull feature vectors from embedding stores for augmentation — all against S3-compatible storage. The challenge is architectural: S3's per-request latency, throughput ceilings, and pricing model interact badly with GPU-driven workloads that stall when starved for data.

Engineers building ML training infrastructure on object storage face a cascade of decisions: how to stream training data without pre-downloading entire datasets, where to persist checkpoints without blocking training, whether to use GPU-Direct Storage to bypass CPU bottlenecks, and when to insert a cache tier between S3 and compute. Getting these wrong means GPUs sitting idle waiting for data — the most expensive form of waste in modern infrastructure.

Relevant Nodes

  • Topics: Object Storage for AI Data Pipelines, Directory Buckets / Hot Object Storage
  • Technologies: S3 Express One Zone, GeeseFS, VAST Data, Pure Storage FlashBlade
  • Standards: NVMe-oF / NVMe over TCP
  • Architectures: GPU-Direct Storage Pipeline, Training Data Streaming from Object Storage, Checkpoint/Artifact Lake on Object Storage, Feature/Embedding Store on Object Storage, NVMe-backed Object Tier, Cache-Fronted Object Storage, Online Embedding Refresh Pipeline
  • Pain Points: Cold Retrieval Latency, Small Files Amplification

Decision Path

  1. Decide your training data access pattern. The fundamental fork:

    • Pre-download to local NVMe: Simplest. Copy dataset to instance storage before training. Works when dataset fits on local disk and you can tolerate the copy time. Falls apart at multi-TB scale or when iterating rapidly on data.
    • Stream from S3: Use MosaicML Streaming, PyTorch DataPipes, or NVIDIA DALI S3 plugin to read training data directly from S3 during training. Dataset can exceed local disk. Trade-off: training throughput depends on S3 read bandwidth.
    • FUSE mount (GeeseFS): Present S3 as a POSIX filesystem. Useful when training code expects file paths rather than S3 URIs. GeeseFS is optimized for sequential read patterns common in training. Adds latency vs. local disk.
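The streaming option above can be sketched as a shard iterator that reads large objects in byte-range chunks rather than downloading them whole. This is a minimal stand-alone sketch, not the MosaicML Streaming or DataPipes API: `range_get` here reads from a local file as a stand-in for an S3 GET with a `Range` header, and the fixed-size-sample layout is an assumption for illustration.

```python
def range_get(path, start, length):
    """Fetch a byte range from one object. Stand-in for an S3 GET with
    a Range header (boto3: get_object(..., Range=f"bytes={start}-{end}"))."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)

class ShardStream:
    """Iterate fixed-size samples out of large shard objects, issuing a
    few big sequential range reads per shard instead of one GET per sample."""

    def __init__(self, shards, sample_size, chunk_samples=64):
        self.shards = shards                       # list of (key, total_bytes)
        self.sample_size = sample_size
        self.chunk = sample_size * chunk_samples   # bytes fetched per request

    def __iter__(self):
        for key, total in self.shards:
            offset = 0
            while offset < total:
                blob = range_get(key, offset, min(self.chunk, total - offset))
                for i in range(0, len(blob), self.sample_size):
                    yield blob[i:i + self.sample_size]
                offset += len(blob)
```

Because each shard is consumed sequentially, prefetching the next chunk while the GPU processes the current one is a natural extension; the production libraries named above do exactly that.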
  2. Choose your checkpoint strategy:

    • Direct to S3 Standard: Simple, durable, cost-effective for infrequent (e.g. hourly) checkpoints. Per-request latency is 50-200ms, and a multi-gigabyte checkpoint adds transfer time on top; acceptable when checkpoint frequency is low.
    • S3 Express One Zone for hot checkpoints: Single-digit ms latency. Use for frequent checkpoints (every few minutes) where standard S3 latency would stall training. Trade-off: single-AZ durability, higher cost per GB.
    • Local NVMe + async upload: Write checkpoints to local NVMe instantly, upload to S3 asynchronously. Lowest training disruption. Risk: lose the checkpoint if the instance dies before upload completes.
  3. Evaluate GPU-Direct Storage:

    • Use GPU-Direct Storage (GDS) when training data is on NVMe-backed storage and you need to eliminate CPU-mediated copies. GDS streams data from NVMe directly into GPU memory via DMA. Requires NVIDIA GPUDirect Storage support and compatible NVMe hardware.
    • Skip GDS if your bottleneck is S3 network throughput rather than CPU copy overhead, or if you are using standard S3 (GDS does not work over HTTP).
  4. Decide on a cache tier:

    • No cache: Acceptable when S3 bandwidth meets training throughput needs and data is read sequentially (each sample read once per epoch).
    • Alluxio or similar distributed cache: Insert between S3 and compute when multiple training jobs read the same data, or when S3 read latency causes GPU stalls. Cache absorbs repeat reads.
    • NVMe-backed object tier: Use S3 Express One Zone or self-hosted NVMe-backed MinIO as a hot tier for frequently accessed training data and checkpoints.
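The cache-tier idea reduces to a read-through cache: check local disk first, and only on a miss call out to object storage. A minimal sketch, assuming a pluggable `fetch(key) -> bytes` callable standing in for an S3 GET (systems like Alluxio add distribution, eviction, and consistency on top of this core loop):

```python
import hashlib
from pathlib import Path

class ReadThroughCache:
    """Local-disk read-through cache in front of object storage: repeat
    reads of the same key are served from NVMe instead of S3."""

    def __init__(self, cache_dir, fetch):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch            # fetch(key) -> bytes, e.g. an S3 GET
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        # Hash the key so arbitrary S3 URIs map to safe filenames.
        path = self.dir / hashlib.sha256(key.encode()).hexdigest()
        if path.exists():
            self.hits += 1
            return path.read_bytes()
        self.misses += 1
        data = self.fetch(key)
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(data)         # write-then-rename so concurrent readers
        tmp.rename(path)              # never observe a partially written entry
        return data
```

The hit/miss counters matter operationally: if the hit rate stays near zero (single job, single epoch, sequential reads), the cache tier is pure overhead, which is the "no cache" case above.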
  5. Handle the small files problem for training data:

    • ML datasets often consist of millions of small files (images, audio clips, text chunks). Each S3 GET has per-request overhead.
    • Pack into larger archives: Use WebDataset (tar shards), TFRecord, or Lance format to bundle small files into sequential-read-friendly containers.
    • Use byte-range GETs to read subsets of large files without downloading the entire object. (S3 Select, once an option here, is no longer available to new AWS customers.)
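The archive-packing step can be illustrated with plain tar shards, the container WebDataset uses. A minimal sketch using only the standard library; the function names and the jpg/txt pairing are illustrative, not the WebDataset API:

```python
import io
import tarfile

def pack_shard(samples, shard_path):
    """Pack many small (name, bytes) samples into one tar shard,
    turning millions of tiny GETs into a few large sequential reads."""
    with tarfile.open(shard_path, "w") as tar:
        for name, data in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

def iter_shard(shard_path):
    """Stream samples back out of a shard in the order they were packed."""
    with tarfile.open(shard_path, "r") as tar:
        for member in tar:
            yield member.name, tar.extractfile(member).read()
```

Because tar preserves insertion order, a training loader can consume a shard front-to-back with one large sequential read per shard, which is exactly the access pattern object storage serves best.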

What Changed Over Time

  • Early ML training on S3 was almost exclusively pre-download: copy data to HDFS or local disk, then train. Streaming was unreliable and slow.
  • MosaicML Streaming, NVIDIA DALI S3 plugin, and PyTorch DataPipes made streaming from S3 production-viable, enabling training on datasets too large for local storage.
  • AWS launched S3 Express One Zone (2023) to address the latency gap for hot-path workloads like checkpointing, reducing first-byte latency from ~100ms to single-digit ms.
  • GPU-Direct Storage moved from an HPC niche to mainstream AI infrastructure as training clusters adopted NVMe-oF fabrics.
  • The cost of idle GPUs ($2-30+/hour per GPU) has made storage I/O optimization a first-order economic concern, not an afterthought.

Sources