Architecture

Training Data Streaming from Object Storage


Summary

What it is

Streaming training data directly from S3 into GPU training loops during ML model training, avoiding the need to download entire datasets to local storage before training begins.

Where it fits

As training datasets grow to multi-TB scale, pre-downloading to local NVMe becomes impractical. Streaming from S3 enables training to start immediately and handle datasets larger than local storage — at the cost of depending on network throughput.
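A minimal sketch of the streaming pattern described above, assuming a file-like streaming body such as the one boto3's `get_object` returns (the bucket, key, record size, and `feed_to_gpu` names are all illustrative, not from the source):

```python
import io
from typing import BinaryIO, Iterator

def stream_records(body: BinaryIO, record_size: int) -> Iterator[bytes]:
    """Yield fixed-size records from a streaming body without ever holding
    the whole dataset in memory or on local disk."""
    while True:
        chunk = body.read(record_size)
        if len(chunk) < record_size:  # EOF or trailing partial record
            break
        yield chunk

# Hypothetical S3 usage (illustrative names; requires boto3 and credentials):
#   import boto3
#   body = boto3.client("s3").get_object(
#       Bucket="train-data", Key="shard-0000.bin")["Body"]
#   for record in stream_records(body, record_size=4096):
#       feed_to_gpu(record)
```

Because the reader only pulls one record at a time, training can begin as soon as the first bytes arrive rather than after an hours-long download.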

Misconceptions / Traps
  • Streaming requires sufficient network bandwidth. If S3 throughput cannot keep up with GPU consumption rate, GPUs idle and training wall-clock time increases. Benchmark throughput before committing to streaming.
  • Data shuffling is harder when streaming. Random access to S3 is expensive; streaming libraries use buffer-and-shuffle techniques that provide approximate randomness.
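The buffer-and-shuffle technique mentioned above can be sketched as follows. This is a generic illustration of the idea, not any particular library's implementation: keep a fixed-size buffer, and for each incoming item emit a randomly chosen buffered item in its place. Randomness is only approximate, and improves as the buffer grows relative to the stream.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def buffered_shuffle(stream: Iterable[T], buffer_size: int,
                     seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a sequential stream with O(buffer_size) memory."""
    rng = random.Random(seed)
    buf: list[T] = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)  # fill the buffer first
        else:
            # Swap the new item into a random slot, emit the evicted one.
            idx = rng.randrange(buffer_size)
            buf[idx], item = item, buf[idx]
            yield item
    rng.shuffle(buf)  # drain the remainder in random order
    yield from buf
```

Every input item is emitted exactly once, so the output is a permutation of the input; but items that arrive early can only appear early, which is why the randomness is approximate.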
Key Connections
  • scoped_to Object Storage for AI Data Pipelines — training data loading pattern
  • depends_on S3 API — data read from S3 during training
  • constrained_by Cold Scan Latency — first-epoch data loading is latency-bound
  • GeeseFS enables Training Data Streaming from Object Storage — POSIX access layer

Definition

What it is

Streaming training data directly from S3 into GPU memory during model training, avoiding the need to pre-download entire datasets to local storage. Enables training on datasets larger than local disk.

Why it exists

AI training datasets routinely exceed local storage capacity (tens to hundreds of terabytes). Streaming from S3 decouples dataset size from local disk capacity, enables dynamic data sampling, and eliminates the hours-long pre-download step.

Primary use cases

Large-scale distributed training, dynamic data sampling during training, training on datasets exceeding local storage, multi-node training with shared S3 data.
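For multi-node training against shared S3 data, each worker typically reads a disjoint subset of the object keys so ranks never fetch the same shard twice. A minimal sketch, assuming keys are listed up front (the function name and key format are illustrative):

```python
def shard_keys(keys: list[str], rank: int, world_size: int) -> list[str]:
    """Deterministically assign each worker a disjoint, strided slice of the
    S3 key list; every key is read by exactly one rank."""
    return keys[rank::world_size]

# Illustrative usage: rank 0 of 4 workers gets keys 0, 4, 8, ...
# shard_keys([f"shard-{i:04d}.bin" for i in range(10)], rank=0, world_size=4)
```

Strided assignment keeps shard counts balanced to within one key per rank without any coordination between workers.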
