Checkpoint/Artifact Lake on Object Storage
Using S3 as the durable repository for ML model checkpoints, trained model artifacts, training logs, and experiment metadata. A centralized, versioned artifact store on object storage.
Summary
ML training produces large, versioned artifacts (checkpoints can be tens of GB each). S3 provides the scalable, durable storage that keeps these artifacts accessible across experiments, teams, and clusters — serving as the "source of truth" for model lineage.
- Checkpoint frequency has a direct cost impact. Frequent checkpointing (every N steps) generates significant storage volume. Implement retention policies to garbage-collect old checkpoints.
- S3 write latency affects training throughput if checkpointing is synchronous. Use asynchronous checkpoint uploads to avoid GPU idle time during saves.
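The asynchronous-upload point above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes the training loop first writes each checkpoint to fast local disk, and that `upload_fn` is a caller-supplied function (hypothetical here) that copies one file to the object store. The class name `AsyncCheckpointUploader` is also illustrative.

```python
import os
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional

# Hypothetical signature: upload_fn(local_path, key) copies one file to the store.
UploadFn = Callable[[str, str], None]


class AsyncCheckpointUploader:
    """Uploads checkpoint files on a background thread, one at a time."""

    def __init__(self, upload_fn: UploadFn) -> None:
        self._upload_fn = upload_fn
        # A single worker keeps uploads ordered and bounds bandwidth use.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def submit(self, local_path: str, key: str) -> None:
        # Blocks only if the previous upload is still running, so at most
        # one upload is in flight; also surfaces that upload's error, if any.
        self.wait()
        self._pending = self._pool.submit(self._upload_and_clean, local_path, key)

    def _upload_and_clean(self, local_path: str, key: str) -> None:
        self._upload_fn(local_path, key)
        os.remove(local_path)  # free local disk only after a successful upload

    def wait(self) -> None:
        pending, self._pending = self._pending, None
        if pending is not None:
            pending.result()  # re-raises any exception from the upload

    def close(self) -> None:
        self.wait()
        self._pool.shutdown()
```

With boto3, `upload_fn` could be something like `lambda path, key: s3.upload_file(path, bucket, key)`, where `s3` and `bucket` are supplied by the caller. The training loop calls `submit()` right after the local save and returns to computing; `close()` at the end of the run makes sure the last checkpoint has landed.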
- scoped_to: Object Storage for AI Data Pipelines (ML artifact management)
- depends_on: S3 API (artifacts stored in S3)
- constrained_by: Egress Cost (downloading checkpoints across regions/clouds is expensive)
Definition
Using S3 as the durable, versioned repository for ML training checkpoints, model weights, pipeline artifacts, and experiment metadata — with lifecycle policies for retention and cost management.
ML training produces frequent checkpoints (every N steps) and final model artifacts. These must be durable, versioned, and shareable across teams. S3 provides cheap, durable, HTTP-accessible storage with versioning, making it the natural checkpoint and artifact repository.
Typical uses: ML training checkpoint storage, model registry artifact storage, experiment tracking metadata, pipeline artifact management.
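The lifecycle policies mentioned in the definition can be expressed as an S3 lifecycle configuration. The sketch below builds one as a plain dictionary. The function name, the rule ID, and the idea of keeping intermediate checkpoints under their own prefix (so final model artifacts are not expired) are assumptions for illustration; the retention periods are parameters, not recommendations.

```python
from typing import Any, Dict


def checkpoint_lifecycle_rules(
    prefix: str,
    expire_after_days: int,
    noncurrent_expire_days: int,
) -> Dict[str, Any]:
    """Build an S3 lifecycle configuration for objects under `prefix`."""
    return {
        "Rules": [
            {
                "ID": "expire-old-checkpoints",  # illustrative rule name
                "Status": "Enabled",
                # Applies only to keys that start with this prefix.
                "Filter": {"Prefix": prefix},
                # Delete current object versions after N days.
                "Expiration": {"Days": expire_after_days},
                # On versioned buckets, also delete superseded versions.
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": noncurrent_expire_days
                },
                # Clean up parts left behind by interrupted uploads.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1},
            }
        ]
    }
```

The resulting dictionary can be passed to boto3 as `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)`. Because the rule expires everything under the prefix, final model weights should live under a different prefix or be excluded by a separate rule.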
Recent developments
- PyTorch DCP on S3: 72× faster than the PyTorch 1.13 baseline, validated at 30B+ parameters. PyTorch Distributed Checkpoint (DCP) with an S3 backend (after the one-line consistency fix) is now the canonical checkpoint path for large-model training. IBM published production-scale numbers showing that FSDP + DCP + S3 works at 30B parameters and beyond. Per PyTorch Blog: Performant Distributed Checkpointing in Production with IBM.
- Object storage is typically an order of magnitude cheaper than shared filesystems for checkpoint backing. The economic argument that drove the DCP + S3 work: replacing dedicated parallel filesystems (Lustre, GPFS) with S3-class object storage cuts checkpoint storage cost roughly 10× while enabling more frequent checkpoints. Per PyTorch Blog: Performant Distributed Checkpointing.
- AWS S3 Connector for PyTorch (awslabs/s3-connector-for-pytorch) ships first-class S3StorageWriter / S3StorageReader. These are drop-in storage backends for DCP and the official AWS pattern for PyTorch training on S3. Per GitHub: awslabs/s3-connector-for-pytorch.
- SageMaker HyperPod managed tiered checkpointing. AWS shipped a managed tiered-checkpoint service that integrates with PyTorch DCP. Adopting it takes "a few lines" of code changes in the training script; SageMaker handles tier promotion and S3 backing. Per AWS ML Blog: SageMaker HyperPod Managed Tiered Checkpointing.
- PyTorch Lightning cloud-checkpoint pattern is the production default for smaller jobs. Lightning's save_checkpoint integrates with fsspec backends: point it at s3://... and it Just Works for the single-node to small-cluster cohort. Per Lightning: Cloud-Based Checkpoints Advanced.
- MinIO + Amazon S3 Connector reference: open-source path to the identical pattern on-prem. MinIO's blog walks through PyTorch DCP via the AWS S3 Connector pointed at a MinIO endpoint: same code path, on-prem deployment. The connector being S3-API-compatible (not AWS-specific) is the key portability story. Per MinIO Blog: Model Checkpointing using Amazon's S3 Connector for PyTorch and MinIO.
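The DCP-plus-connector pattern from the list above might look roughly like this. It is a sketch under stated assumptions: `torch` and `s3torchconnector` are installed, the process group is already initialized when running distributed, and the bucket, run ID, and key layout produced by `checkpoint_uri` are hypothetical. The helper names `save_sharded` and `load_sharded` are illustrative, not part of either library.

```python
def checkpoint_uri(bucket: str, run_id: str, step: int) -> str:
    """Hypothetical key layout: one prefix per run, one sub-prefix per step."""
    return f"s3://{bucket}/{run_id}/step-{step:08d}/"


def save_sharded(state_dict: dict, uri: str, region: str) -> None:
    """Write a distributed checkpoint under `uri` via DCP."""
    # Imported lazily so this module loads without the optional dependencies.
    import torch.distributed.checkpoint as dcp
    from s3torchconnector.dcp import S3StorageWriter

    writer = S3StorageWriter(region=region, path=uri)
    dcp.save(state_dict=state_dict, storage_writer=writer)


def load_sharded(state_dict: dict, uri: str, region: str) -> None:
    """Load a distributed checkpoint from `uri` in place into `state_dict`."""
    import torch.distributed.checkpoint as dcp
    from s3torchconnector.dcp import S3StorageReader

    reader = S3StorageReader(region=region, path=uri)
    dcp.load(state_dict=state_dict, storage_reader=reader)
```

A training loop would call `save_sharded(model.state_dict(), checkpoint_uri(bucket, run_id, step), region)` at each checkpoint step. Zero-padding the step number keeps keys in step order when listed lexicographically, which makes "latest checkpoint" lookups and retention sweeps simpler.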
Resources
SageMaker documentation on saving model checkpoints to S3 during training for fault tolerance and resume.
PyTorch checkpointing documentation covering state dict serialization patterns used with S3-backed artifact stores.
MLflow artifact stores documentation showing how to persist ML artifacts, models, and metadata to S3.