Architecture

Training Data Streaming from Object Storage

Summary

What it is

Streaming training data directly from S3 into the GPU training loop, so that the full dataset does not have to be downloaded to local storage before training begins.

Where it fits

As training datasets grow to multi-TB scale, pre-downloading to local NVMe becomes impractical. Streaming from S3 enables training to start immediately and handle datasets larger than local storage — at the cost of depending on network throughput.
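The dependence on network throughput can be estimated with simple arithmetic before any benchmark is run. The sketch below is illustrative only: the function names, the 80% headroom factor, and the example job (8 GPUs, 2,000 samples/s per GPU, 150 kB per sample) are assumptions, not figures from a real workload.

```python
def required_read_throughput_gbps(num_gpus, samples_per_sec_per_gpu, avg_sample_bytes):
    """Aggregate read rate, in gigabits per second, needed to keep every GPU fed."""
    bytes_per_sec = num_gpus * samples_per_sec_per_gpu * avg_sample_bytes
    return bytes_per_sec * 8 / 1e9


def gpus_will_stall(required_gbps, measured_gbps, headroom=0.8):
    """True if demand exceeds the usable share of the measured throughput.

    headroom is an assumed safety factor: only 80% of the measured rate is
    treated as sustainable, to leave room for retries and latency spikes.
    """
    return required_gbps > measured_gbps * headroom


# Hypothetical job: 8 GPUs, 2,000 samples/s per GPU, 150 kB average sample.
need = required_read_throughput_gbps(8, 2000, 150_000)
print(f"{need:.1f} Gbps required")                 # 19.2 Gbps required
print(gpus_will_stall(need, measured_gbps=25.0))   # False: 19.2 <= 25 * 0.8
print(gpus_will_stall(need, measured_gbps=20.0))   # True:  19.2 >  20 * 0.8
```

The measured figure should come from a read benchmark against the actual bucket, run from the training nodes, using object sizes similar to the dataset's.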

Misconceptions / Traps

Streaming requires sufficient network bandwidth. If S3 throughput cannot keep up with the rate at which the GPUs consume data, the GPUs sit idle and training wall-clock time increases. Benchmark throughput before committing to streaming.
Data shuffling is harder when streaming. Random access to S3 is expensive, so streaming libraries use buffer-and-shuffle techniques that only approximate a full random shuffle.
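The buffer-and-shuffle idea can be sketched in a few lines. This is a generic illustration, not the implementation used by any particular streaming library; shuffle_buffer and its parameters are hypothetical names. Samples arrive in storage order, are held in a fixed-size buffer, and one is emitted at random each time the buffer fills.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a sequential stream using bounded memory."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it (O(1) removal).
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # The stream is exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


# Sequential "reads" of 20 samples, shuffled through a 5-element buffer.
shuffled = list(shuffle_buffer(range(20), buffer_size=5, seed=42))
print(shuffled)
```

The randomness is only approximate because a sample can be emitted at most buffer_size - 1 positions earlier than it arrived; samples that are far apart in storage order never swap places unless the buffer is large. A larger buffer gives a better shuffle at the cost of more memory and a longer warm-up before the first batch.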

Key Connections

scoped_to Object Storage for AI Data Pipelines — training data loading pattern
depends_on S3 API — data read from S3 during training
constrained_by Cold Scan Latency — first-epoch data loading is latency-bound
GeeseFS enables Training Data Streaming from Object Storage — POSIX access layer

Definition

What it is

Streaming training data directly from S3 into GPU memory during model training, avoiding the need to pre-download entire datasets to local storage. Enables training on datasets larger than local disk.

Why it exists

AI training datasets routinely exceed local storage capacity (tens to hundreds of TB). Streaming from S3 decouples dataset size from local disk capacity, enables dynamic data sampling, and eliminates the hours-long pre-download step.

Primary use cases

Large-scale distributed training, dynamic data sampling during training, training on datasets exceeding local storage, multi-node training with shared S3 data.
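One common reading of dynamic data sampling is weighted mixing of several streams, where the mixing weights can be changed without re-materializing the dataset. The sketch below assumes that reading; mix_streams and the stream names are hypothetical, and plain Python ranges stand in for streams that would be read from S3.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple


def mix_streams(
    streams: Dict[str, Iterable],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, object]]:
    """Interleave several streams, drawing the next source by weight."""
    rng = random.Random(seed)
    active = {name: iter(stream) for name, stream in streams.items()}
    while active:
        names = list(active)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(active[name])
        except StopIteration:
            # This source is exhausted; keep sampling from the others.
            del active[name]


# Stand-ins for two S3-backed streams: a large corpus and a small one.
streams = {"web": range(1000), "code": range(100)}
mixed = list(mix_streams(streams, weights={"web": 0.9, "code": 0.1}, seed=7))
print(mixed[:5])
```

Because the samples are pulled lazily, changing the weights between epochs (or mid-run) only changes which objects are read next; nothing has to be copied or rewritten on local disk.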

Recent developments

Latest signals

MosaicML StreamingDataset (mosaicml-streaming) v0.11.0 is the canonical production-grade PyTorch streaming dataset. It is a drop-in replacement for PyTorch's IterableDataset that streams from S3, GCS, Azure, OCI, and Databricks UC Volumes, as well as any S3-compatible store (Cloudflare R2, CoreWeave, Backblaze B2). Per PyPI: mosaicml-streaming and GitHub: mosaicml/streaming.
StreamingDataset partitions samples across nodes, GPUs, and workers to eliminate redundant downloads. This is its key property: naive object-storage streaming downloads the same sample N times when N workers each pull independently, whereas StreamingDataset partitions deterministically so that each sample is downloaded exactly once. Per Databricks Blog: MosaicML StreamingDataset: Fast Streaming from Cloud Storage.
2026 streaming-loader landscape: WebDataset, Megatron-Energon, MosaicML MDS, Lightning LitData. In 2026 the field converged on four production-credible streaming loaders. Yin's benchmark compares data-preparation efficiency, cloud-streaming performance, and fault tolerance across all four and finds no single winner; the right choice depends on workload shape. Per Substack: Multimodal Dataloaders Go Brrrrrrr (Haoli Yin).
Multi-cloud and S3-compatible deployment is now table stakes. All four major loaders (WebDataset, MDS, Energon, LitData) support both AWS S3 and arbitrary S3-compatible stores. The objection that "training jobs lock you into AWS" is no longer credible, because every loader is portable. Per GitHub: mosaicml/streaming.
Fault tolerance is the new differentiator. With training jobs running on 1000+ GPUs for weeks, fault tolerance in dataset streaming (recovery from worker crashes and transient network failures) became the critical engineering concern in 2026. MDS and Energon ship rigorous fault tolerance; WebDataset's simpler design wins on speed but trails on resilience. Per Substack: Multimodal Dataloaders Go Brrrrrrr.
Databricks-managed Mosaic AI Training pattern documented for both AWS and Azure. Databricks documents the loader pattern end to end on both clouds, so the pattern has moved beyond its MosaicML origin to become a standard managed-platform primitive. Per Databricks: Load Data using Mosaic Streaming and Microsoft Learn: Load Data using Mosaic Streaming on Azure Databricks.
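The deterministic partitioning described above can be illustrated with a simple strided scheme. This is a minimal sketch under stated assumptions, not StreamingDataset's actual partitioning algorithm; partition_samples and its parameters are hypothetical. Each (node, GPU, worker) slot computes its own share of sample indices from its coordinates alone, so no coordination is needed and no sample is fetched twice.

```python
def partition_samples(num_samples, num_nodes, gpus_per_node, workers_per_gpu,
                      node, gpu, worker):
    """Return the sample indices owned by one (node, gpu, worker) slot."""
    total_slots = num_nodes * gpus_per_node * workers_per_gpu
    # Flatten the three coordinates into a single slot number.
    slot = (node * gpus_per_node + gpu) * workers_per_gpu + worker
    # Strided assignment: slot s owns samples s, s + total, s + 2 * total, ...
    return range(slot, num_samples, total_slots)


# Hypothetical cluster: 2 nodes x 4 GPUs x 2 dataloader workers, 1,000 samples.
owned = [
    partition_samples(1000, 2, 4, 2, node, gpu, worker)
    for node in range(2) for gpu in range(4) for worker in range(2)
]
downloads = sum(len(r) for r in owned)
print(downloads)  # 1000: every sample is fetched by exactly one worker
```

By contrast, if each of the 16 workers iterated over the full index range independently, the same 1,000 samples would be downloaded 16,000 times in total.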