Training Data Streaming from Object Storage
Streaming training data directly from S3 into GPU training loops during ML model training, avoiding the need to download entire datasets to local storage before training begins.
Summary
As training datasets grow to multi-TB scale, pre-downloading to local NVMe becomes impractical. Streaming from S3 enables training to start immediately and handle datasets larger than local storage — at the cost of depending on network throughput.
- Streaming requires sufficient network bandwidth. If S3 throughput cannot keep up with GPU consumption rate, GPUs idle and training wall-clock time increases. Benchmark throughput before committing to streaming.
- Data shuffling is harder when streaming. Random access to S3 is expensive; streaming libraries use buffer-and-shuffle techniques that provide approximate randomness.
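The buffer-and-shuffle idea mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used by any particular streaming library; the function name `buffered_shuffle` and its parameters are hypothetical. It reads samples sequentially (cheap on S3) and holds only `buffer_size` of them in memory, so the output order is approximately random rather than a uniform permutation.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield the items of `stream` in approximately random order.

    Hypothetical helper for illustration only. At most `buffer_size`
    items are held in memory at any time.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be >= 1")
    rng = random.Random(seed)  # seeded so the order is reproducible
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Steady state: emit a random buffered item and put the
        # incoming item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    # Stand-in for a sequential read of sample IDs from object storage.
    print(list(buffered_shuffle(range(20), buffer_size=5, seed=42)))
```

A larger buffer gives better mixing at the cost of memory and a longer fill phase before the first sample is emitted. A sample can never appear much earlier than its position in the stream, which is why the randomness is only approximate.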
- scoped_to: Object Storage for AI Data Pipelines (training data loading pattern)
- depends_on: S3 API (data read from S3 during training)
- constrained_by: Cold Scan Latency (first-epoch data loading is latency-bound)
- GeeseFS enables: Training Data Streaming from Object Storage (POSIX access layer)
Definition
Streaming training data directly from S3 into GPU memory during model training, avoiding the need to pre-download entire datasets to local storage. Enables training on datasets larger than local disk.
AI training datasets routinely exceed local storage capacity (tens to hundreds of TB). Streaming from S3 decouples dataset size from local disk, enables dynamic data sampling, and eliminates the hours-long pre-download step.
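A back-of-the-envelope calculation makes the trade-off concrete. The sketch below estimates how long a pre-download would take and what sustained read rate streaming would need instead. All numbers (50 TB dataset, 10 Gbit/s link, 5,000 samples/s, 200 KB per sample) are hypothetical placeholders, and the helper names are illustrative; real S3 throughput should be measured rather than assumed.

```python
def transfer_hours(dataset_tb: float, link_gbit_per_s: float) -> float:
    """Hours needed to copy `dataset_tb` terabytes at a sustained link rate."""
    dataset_bits = dataset_tb * 1e12 * 8          # TB -> bytes -> bits
    seconds = dataset_bits / (link_gbit_per_s * 1e9)
    return seconds / 3600


def required_mb_per_s(samples_per_s: float, avg_sample_kb: float) -> float:
    """Sustained read rate (MB/s) needed to keep the training loop fed."""
    return samples_per_s * avg_sample_kb / 1000   # KB/s -> MB/s


if __name__ == "__main__":
    # Hypothetical workload: 50 TB dataset over a 10 Gbit/s link.
    print(f"pre-download: {transfer_hours(50, 10):.1f} h")       # about 11.1 h

    # Hypothetical consumption: 5,000 samples/s at 200 KB each.
    need = required_mb_per_s(5000, 200)
    have = 10 * 1e9 / 8 / 1e6                                    # 10 Gbit/s in MB/s
    print(f"streaming needs {need:.0f} MB/s of {have:.0f} MB/s available")
```

Under these assumptions the pre-download costs roughly eleven hours before the first training step, while streaming needs about 1,000 MB/s of a 1,250 MB/s link. That leaves little headroom, which is the situation where GPUs start to idle.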
Typical use cases: large-scale distributed training, dynamic data sampling during training, training on datasets that exceed local storage, and multi-node training against shared S3 data.
Recent developments
- MosaicML StreamingDataset (mosaicml-streaming) v0.11.0 is the production-canonical PyTorch streaming dataset. Drop-in replacement for PyTorch's IterableDataset that streams from S3, GCS, Azure, OCI, and Databricks UC Volumes, plus any S3-compatible store (Cloudflare R2, CoreWeave, Backblaze B2). Per PyPI (mosaicml-streaming) and GitHub (mosaicml/streaming).
- Partitions samples across nodes/GPUs/workers to eliminate redundant downloads. StreamingDataset's deduplication discipline is the load-bearing piece: naive object-storage streaming downloads the same sample N times when N workers each pull independently, whereas StreamingDataset partitions deterministically so each sample downloads exactly once. Per Databricks Blog, "MosaicML StreamingDataset: Fast Streaming from Cloud Storage".
- 2026 streaming-loader landscape: WebDataset, Megatron-Energon, MosaicML MDS, Lightning LitData. Four production-credible streaming loaders converged in 2026. Yin's benchmark compares data-prep efficiency + cloud-streaming perf + fault tolerance across all four — no single winner; pick by workload shape. Per Substack — Multimodal Dataloaders Go Brrrrrrr (Haoli Yin).
- Multi-cloud + S3-compatible deployment is now table stakes. All four major loaders (WebDataset, MDS, Energon, LitData) support both AWS S3 + arbitrary S3-compatible stores. The "training jobs lock you into AWS" objection is no longer credible — every loader is portable. Per GitHub — mosaicml/streaming.
- Fault tolerance is the new differentiator. With training jobs running on 1000+ GPUs for weeks, dataset-streaming fault tolerance (worker crash / network blip recovery) became the load-bearing engineering work in 2026. MDS + Energon ship rigorous fault tolerance; WebDataset's simpler design wins on speed but trails on resilience. Per Substack — Multimodal Dataloaders Go Brrrrrrr.
- Databricks-managed Mosaic AI Training pattern documented for both AWS + Azure. Databricks documents the loader-pattern end-to-end for both AWS + Azure — the pattern transcended its MosaicML origin into a standard managed-platform primitive. Per Databricks — Load Data using Mosaic Streaming and Microsoft Learn — Load Data using Mosaic Streaming on Azure Databricks.
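The deterministic partitioning idea from the list above (each sample downloaded exactly once across nodes, GPUs, and workers) can be illustrated with a simple strided split. This is a sketch of the general technique only; it is not StreamingDataset's actual algorithm, which also has to account for shards and shuffling. The function name and parameters are hypothetical.

```python
def partition_samples(
    num_samples: int,
    num_nodes: int,
    ranks_per_node: int,
    workers_per_rank: int,
    node: int,
    rank: int,
    worker: int,
) -> range:
    """Return the sample indices owned by one dataloader worker.

    Hypothetical helper: every worker computes its own slice from the
    same inputs, so no coordination is needed and no two workers ever
    fetch the same sample.
    """
    if not (0 <= node < num_nodes and 0 <= rank < ranks_per_node
            and 0 <= worker < workers_per_rank):
        raise ValueError("node, rank, or worker index out of range")
    world = num_nodes * ranks_per_node * workers_per_rank
    # Flatten (node, rank, worker) into a single global worker ID.
    global_worker = (node * ranks_per_node + rank) * workers_per_rank + worker
    # Strided assignment: worker k owns samples k, k + world, k + 2*world, ...
    return range(global_worker, num_samples, world)


if __name__ == "__main__":
    # 2 nodes x 4 GPUs x 3 workers = 24 workers sharing 100 samples.
    mine = partition_samples(100, 2, 4, 3, node=1, rank=0, worker=2)
    print(list(mine))
```

Because the assignment depends only on the topology and the worker's own indices, it is reproducible across restarts. A resumable loader built on this idea would additionally need to record how far each worker had advanced.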
Resources
SageMaker documentation on streaming training data from S3 using Fast File Mode and Pipe Mode for efficient GPU utilization.
PyTorch DataPipes documentation for building streaming data pipelines from S3 and other remote sources.
MosaicML Streaming library documentation for deterministic, resumable data streaming from S3 for distributed training.