Training Data Streaming from Object Storage
Summary
Streaming training data directly from S3 into GPU training loops, avoiding the need to download entire datasets to local storage before training begins.
As training datasets grow to multi-TB scale, pre-downloading to local NVMe becomes impractical. Streaming from S3 enables training to start immediately and handle datasets larger than local storage — at the cost of depending on network throughput.
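Before adopting streaming, it is worth measuring whether sustained S3 read throughput exceeds the rate at which the GPUs consume data. A minimal sketch of such a benchmark harness is below; `read_chunk` is a hypothetical stand-in for whatever performs the actual read (e.g. a wrapper around sequential ranged GETs), not a specific S3 client call.

```python
import time

def measure_throughput_mb_s(read_chunk, n_chunks):
    """Measure sustained read throughput in MB/s.

    `read_chunk` is any callable returning bytes -- in practice a
    wrapper around S3 reads. Compare the result against the aggregate
    rate the GPUs consume bytes at; if it is lower, streaming will
    stall training.
    """
    start = time.perf_counter()
    total = 0
    for _ in range(n_chunks):
        total += len(read_chunk())  # bytes actually delivered
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e6

# Example with an in-memory source standing in for the network:
mb_per_s = measure_throughput_mb_s(lambda: b"x" * 1_000_000, 5)
```

Running the same harness against the real bucket, from the training instances themselves, gives the number to compare with GPU consumption rate.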
- Streaming requires sufficient network bandwidth. If S3 throughput cannot keep up with GPU consumption rate, GPUs idle and training wall-clock time increases. Benchmark throughput before committing to streaming.
- Data shuffling is harder when streaming. Random access to S3 is expensive; streaming libraries use buffer-and-shuffle techniques that provide approximate randomness.
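The buffer-and-shuffle technique mentioned above can be sketched in a few lines. This is a generic illustration of the idea, not the implementation any particular streaming library uses: a fixed-size buffer is filled from the stream, then each incoming item displaces a randomly chosen buffered item, which is yielded. Randomness is only approximate because an item can never be yielded more than `buffer_size` positions earlier than it arrived.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=None):
    """Approximate shuffle for a streamed sequence.

    Fills a fixed-size buffer, then repeatedly yields a random
    buffered element and replaces it with the next incoming item.
    Larger buffers give better randomness at the cost of memory.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # Drain what remains in fully random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(100), buffer_size=10, seed=0))
```

Every element passes through exactly once, so one epoch still sees the full dataset; only the ordering is approximately random.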
- scoped_to: Object Storage for AI Data Pipelines — training data loading pattern
- depends_on: S3 API — data read from S3 during training
- constrained_by: Cold Scan Latency — first-epoch data loading is latency-bound
- GeeseFS enables Training Data Streaming from Object Storage — POSIX access layer
Definition
Streaming training data directly from S3 into GPU memory during model training, avoiding the need to pre-download entire datasets to local storage. Enables training on datasets larger than local disk.
AI training datasets routinely exceed local storage capacity (10s-100s of TB). Streaming from S3 decouples dataset size from local disk, enables dynamic data sampling, and eliminates the hours-long pre-download step.
Typical use cases: large-scale distributed training, dynamic data sampling during training, training on datasets exceeding local storage, and multi-node training sharing a single S3 dataset.
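For the multi-node case, each rank must read a disjoint slice of the shared S3 dataset so that no sample is duplicated across nodes. A minimal sketch of that sharding pattern follows; `S3IterableDataset` and `fetch` are hypothetical names, with `fetch` standing in for an S3 GET (e.g. via boto3) so the sketch stays self-contained.

```python
class S3IterableDataset:
    """Sketch of a sharded streaming dataset.

    Shards a key list across ranks, then fetches each object lazily.
    Mirrors the PyTorch IterableDataset pattern without depending on
    torch; in real use, `fetch` would issue an S3 GET per key.
    """
    def __init__(self, keys, fetch, rank=0, world_size=1):
        self.keys = keys
        self.fetch = fetch
        self.rank = rank
        self.world_size = world_size

    def __iter__(self):
        # Stride by world_size so each rank streams a disjoint,
        # roughly equal-sized slice of the shared dataset.
        for key in self.keys[self.rank::self.world_size]:
            yield self.fetch(key)

keys = [f"shard-{i}" for i in range(10)]
ds = S3IterableDataset(keys, fetch=lambda k: k, rank=0, world_size=2)
```

Combined with a shuffle buffer per rank, this yields the usual streaming-training setup: disjoint shards per node, approximate shuffling within each node's stream.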
Connections: 5 (4 outbound, 1 inbound)

Resources
- SageMaker documentation on streaming training data from S3 using Fast File Mode and Pipe Mode for efficient GPU utilization.
- PyTorch DataPipes documentation for building streaming data pipelines from S3 and other remote sources.
- MosaicML Streaming library documentation for deterministic, resumable data streaming from S3 for distributed training.