Architecture

Checkpoint/Artifact Lake on Object Storage

Summary

What it is

Using S3 as the durable repository for ML model checkpoints, trained model artifacts, training logs, and experiment metadata. A centralized, versioned artifact store on object storage.

Where it fits

ML training produces large, versioned artifacts (checkpoints can be tens of GB each). S3 provides the scalable, durable storage that keeps these artifacts accessible across experiments, teams, and clusters — serving as the "source of truth" for model lineage.
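To get a feel for the volumes involved, here is a back-of-the-envelope sketch. The run length, interval, and checkpoint size are hypothetical inputs chosen for illustration, not figures from any cited source.

```python
def checkpoint_volume_gb(total_steps: int, every_n_steps: int, checkpoint_gb: float) -> float:
    """Total GB written over a run if every checkpoint is kept."""
    if every_n_steps <= 0:
        raise ValueError("every_n_steps must be positive")
    num_checkpoints = total_steps // every_n_steps
    return num_checkpoints * checkpoint_gb


# Hypothetical run: 100,000 steps, one 20 GB checkpoint every 1,000 steps.
baseline = checkpoint_volume_gb(100_000, 1_000, 20.0)
# Same run, checkpointing twice as often.
doubled = checkpoint_volume_gb(100_000, 500, 20.0)
print(baseline, doubled)  # 2000.0 4000.0
```

Halving the interval doubles the stored volume, which is why retention rules matter as much as raw capacity.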

Misconceptions / Traps

Checkpoint frequency has a direct cost impact. Frequent checkpointing (every N steps) generates significant storage volume. Implement retention policies to garbage-collect old checkpoints.
S3 write latency affects training throughput if checkpointing is synchronous. Use asynchronous checkpoint uploads to avoid GPU idle time during saves.
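The asynchronous-upload idea can be sketched with the standard library alone. This is a minimal illustration, not a production implementation: AsyncCheckpointUploader and copy_to_fake_bucket are hypothetical names, and a local directory stands in for the bucket. In a real job the upload function would wrap an S3 client call (for example boto3's upload_file, assuming boto3 is installed and configured).

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


class AsyncCheckpointUploader:
    """Hands finished checkpoint files to a background thread for upload."""

    def __init__(self, upload_fn, max_workers=1):
        self._upload_fn = upload_fn  # callable(local_path, remote_key)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = []

    def submit(self, local_path, remote_key):
        """Queue an upload and return immediately."""
        future = self._pool.submit(self._upload_fn, local_path, remote_key)
        self._pending.append(future)
        return future

    def wait(self):
        """Block until queued uploads finish; re-raise the first failure."""
        for future in self._pending:
            future.result()
        self._pending.clear()

    def close(self):
        self.wait()
        self._pool.shutdown()


# Demo: a temporary directory plays the role of the bucket (assumption).
bucket_dir = tempfile.mkdtemp()
local_dir = tempfile.mkdtemp()


def copy_to_fake_bucket(local_path, remote_key):
    shutil.copy(local_path, os.path.join(bucket_dir, remote_key))


uploader = AsyncCheckpointUploader(copy_to_fake_bucket)
for step in (100, 200):
    path = os.path.join(local_dir, f"step_{step}.pt")
    with open(path, "wb") as f:
        f.write(b"fake weights")  # stand-in for the real save call
    uploader.submit(path, f"step_{step}.pt")
    # Training would continue here while the upload runs in the background.
uploader.close()
print(sorted(os.listdir(bucket_dir)))  # ['step_100.pt', 'step_200.pt']
```

One caveat with this design: the local file must not be overwritten or deleted until its upload has completed, so each checkpoint needs its own local path or a wait() before reuse.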

Key Connections

scoped_to Object Storage for AI Data Pipelines — ML artifact management
depends_on S3 API — artifacts stored in S3
constrained_by Egress Cost — downloading checkpoints across regions/clouds is expensive

Definition

What it is

Using S3 as the durable, versioned repository for ML training checkpoints, model weights, pipeline artifacts, and experiment metadata — with lifecycle policies for retention and cost management.
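As an illustration of what such a lifecycle policy might look like, the sketch below builds a rule in the shape the S3 lifecycle API accepts. The prefix, rule ID, day counts, and the helper name checkpoint_lifecycle_rules are hypothetical choices, not recommendations.

```python
import json


def checkpoint_lifecycle_rules(prefix: str, expire_after_days: int, noncurrent_days: int) -> dict:
    """Build a lifecycle configuration that expires objects under one prefix."""
    return {
        "Rules": [
            {
                "ID": "expire-old-checkpoints",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Delete current objects this many days after creation.
                "Expiration": {"Days": expire_after_days},
                # On versioned buckets, also clean up superseded versions.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            }
        ]
    }


config = checkpoint_lifecycle_rules("checkpoints/", expire_after_days=30, noncurrent_days=7)
print(json.dumps(config, indent=2))

# Assuming boto3 is installed and credentials are configured, the dict can be
# applied with (bucket name is a placeholder):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=config)
```

Scoping the rule to a checkpoint prefix keeps it from expiring final model artifacts, which usually live under a different prefix and need a longer retention period.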

Why it exists

ML training produces frequent checkpoints (every N steps) and final model artifacts. These must be durable, versioned, and shareable across teams. S3 provides cheap, durable, HTTP-accessible storage with versioning, making it the natural checkpoint and artifact repository.
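A minimal sketch of how step-numbered checkpoints might be laid out and pruned follows. The key layout (runs/<run_id>/checkpoints/step_<n>/) and both function names are assumptions made for illustration; the selection logic is pure Python and does not talk to S3.

```python
import re

STEP_KEY = re.compile(r"step_(\d+)/")


def checkpoint_key(run_id: str, step: int) -> str:
    """Zero-padded step so lexicographic listing order matches step order."""
    return f"runs/{run_id}/checkpoints/step_{step:08d}/"


def select_for_deletion(keys, keep_last: int, keep_every: int = 0):
    """Return the checkpoint prefixes a retention pass would delete.

    Keeps the newest `keep_last` checkpoints and, if `keep_every` > 0,
    any checkpoint whose step is a multiple of `keep_every`.
    """
    by_step = {}
    for key in keys:
        match = STEP_KEY.search(key)
        if match:
            by_step[int(match.group(1))] = key
    ordered = sorted(by_step)
    keep = set(ordered[-keep_last:]) if keep_last > 0 else set()
    if keep_every > 0:
        keep |= {step for step in ordered if step % keep_every == 0}
    return [by_step[step] for step in ordered if step not in keep]


keys = [checkpoint_key("exp1", step) for step in range(1000, 6001, 1000)]
for key in select_for_deletion(keys, keep_last=2, keep_every=3000):
    print(key)
# runs/exp1/checkpoints/step_00001000/
# runs/exp1/checkpoints/step_00002000/
# runs/exp1/checkpoints/step_00004000/
```

Keeping a sparse set of milestone checkpoints alongside the most recent few is one common compromise between resumability and storage cost.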

Primary use cases

ML training checkpoint storage, model registry artifact storage, experiment tracking metadata, pipeline artifact management.

Recent developments

Latest signals

PyTorch DCP on S3: 72× faster than the PyTorch 1.13 baseline, validated to 30B+ parameters. PyTorch Distributed Checkpoint (DCP) with an S3 backend (after the one-line consistency fix) is now the canonical checkpoint path for large-model training. IBM published production-scale numbers showing that FSDP + DCP + S3 works at 30B parameters and beyond. Per PyTorch Blog, "Performant Distributed Checkpointing in Production with IBM".
Object storage is typically an order of magnitude cheaper than shared filesystems for checkpoint backing. This is the economic argument that drove the DCP + S3 work: replacing dedicated parallel filesystems (Lustre, GPFS) with S3-class object storage cuts checkpoint storage cost roughly 10× while enabling more frequent checkpoints. Per PyTorch Blog, "Performant Distributed Checkpointing".
AWS S3 Connector for PyTorch (awslabs/s3-connector-for-pytorch) ships first-class S3StorageWriter / S3StorageReader. These are drop-in storage backends for DCP and the official AWS pattern for PyTorch training on S3. Per GitHub, awslabs/s3-connector-for-pytorch.
SageMaker HyperPod managed tiered checkpointing. AWS shipped a managed tiered-checkpoint service that integrates with PyTorch DCP. Adopting it is described as "a few lines" of change to a training script; SageMaker handles tier promotion and S3 backing. Per AWS ML Blog, "SageMaker HyperPod Managed Tiered Checkpointing".
PyTorch Lightning's cloud-checkpoint pattern is the production default for smaller jobs. Lightning's save_checkpoint integrates with fsspec backends: point it at s3://... and it works for single-node to small-cluster jobs. Per Lightning, "Cloud-Based Checkpoints Advanced".
MinIO + Amazon S3 Connector reference: an open-source path to the same pattern on-prem. MinIO's blog walks through PyTorch DCP via the AWS S3 Connector pointed at a MinIO endpoint: same code path, on-prem deployment. Because the connector targets the S3 API rather than AWS specifically, the pattern is portable. Per MinIO Blog, "Model Checkpointing using Amazon's S3 Connector for PyTorch and MinIO".