Topic

Data Versioning

Summary

What it is

Techniques for tracking and managing changes to datasets stored in object storage over time, including snapshots, branching, and rollback.

Where it fits

S3 objects cannot be modified in place: a write to an existing key replaces the whole object. Data versioning layers change history on top of that model, ranging from S3's built-in object versioning, to table format snapshots (for example, Iceberg), to Git-like branching with lakeFS.

Misconceptions / Traps

  • S3 object versioning and dataset versioning are different things. S3 versioning tracks changes to individual objects; dataset versioning (Iceberg snapshots, lakeFS branches) tracks the logical state of a dataset, which usually spans many objects.
  • Versioning has storage cost implications. Every retained snapshot or object version keeps its underlying data in storage, so expiration and garbage collection policies are essential at scale.
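Both traps can be illustrated with a minimal sketch. It is a toy in-memory model, not the S3, Iceberg, or lakeFS API: `ToyBucket`, `expire`, and the dict-based snapshot manifest are hypothetical names introduced only for illustration. The bucket keeps a per-object version history, while a snapshot is a separate record of which version of each object makes up the dataset at one point in time. Expiring a snapshot frees only the object versions that no retained snapshot still references.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBucket:
    """Hypothetical in-memory stand-in for a versioned bucket (not the S3 API)."""

    # key -> list of (version_id, payload), oldest first
    versions: dict = field(default_factory=dict)
    counter: int = 0

    def put(self, key, payload):
        """Store a new version of `key`; earlier versions are kept, not overwritten."""
        self.counter += 1
        version_id = f"v{self.counter}"
        self.versions.setdefault(key, []).append((version_id, payload))
        return version_id

    def stored_bytes(self):
        """Total bytes held across all retained versions of all keys."""
        return sum(len(p) for vs in self.versions.values() for _, p in vs)


def expire(bucket, retained_snapshots):
    """Delete object versions that no retained snapshot references.

    A snapshot here is a plain dict mapping key -> version_id (an assumed,
    simplified manifest). Returns the number of bytes freed.
    """
    live = {(k, v) for snap in retained_snapshots for k, v in snap.items()}
    freed = 0
    for key in list(bucket.versions):
        kept = []
        for version_id, payload in bucket.versions[key]:
            if (key, version_id) in live:
                kept.append((version_id, payload))
            else:
                freed += len(payload)
        if kept:
            bucket.versions[key] = kept
        else:
            del bucket.versions[key]
    return freed


bucket = ToyBucket()
a1 = bucket.put("part-0.parquet", b"aaaa")
b1 = bucket.put("part-1.parquet", b"bbbb")
snapshot_1 = {"part-0.parquet": a1, "part-1.parquet": b1}

# A data correction rewrites one file; the other is shared between snapshots.
a2 = bucket.put("part-0.parquet", b"AAAAAA")
snapshot_2 = {"part-0.parquet": a2, "part-1.parquet": b1}

before = bucket.stored_bytes()        # all three object versions are retained
freed = expire(bucket, [snapshot_2])  # drop snapshot_1, keep snapshot_2
after = bucket.stored_bytes()
print(before, freed, after)
```

In this sketch the object-level history alone cannot say which versions belong together; only the snapshot manifests can. The unchanged file is stored once and shared by both snapshots, so storage grows with the data that changed rather than with the number of snapshots, and it shrinks only when expiry actually runs.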

Key Connections

  • scoped_to Object Storage, S3 — versioning operates on S3-stored data

Definition

What it is

Techniques for tracking and managing changes to datasets stored in object storage over time, including snapshots, branching, and rollback.

Why it exists

S3 objects cannot be modified in place; a write to an existing key replaces the whole object. Representing logical change over time (schema evolution, data corrections, reprocessing) therefore requires explicit versioning mechanisms built on top of the storage layer.
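One way such a mechanism can work is sketched below, under stated assumptions: `Snapshot` and `ToyTable` are hypothetical names, and the model is a simplification rather than the actual metadata layout of Iceberg or lakeFS. Data files are only ever added under new keys, each commit records the full set of files that make up the dataset, and rollback just moves a pointer back to an earlier snapshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    files: frozenset  # object keys that make up the dataset at this point
    note: str


class ToyTable:
    """Hypothetical snapshot log layered over write-once object keys."""

    def __init__(self):
        self.snapshots = {}     # snapshot_id -> Snapshot
        self.current_id = None  # pointer to the snapshot readers see
        self._next_id = 1

    def commit(self, add=(), remove=(), note=""):
        """Record a new snapshot derived from the current one."""
        if self.current_id is None:
            base = frozenset()
        else:
            base = self.snapshots[self.current_id].files
        files = (base - frozenset(remove)) | frozenset(add)
        snap = Snapshot(self._next_id, self.current_id, files, note)
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id
        self._next_id += 1
        return snap.snapshot_id

    def rollback(self, snapshot_id):
        """Point readers at an earlier snapshot; no data files are touched."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id

    def files(self):
        """Sorted object keys visible in the current snapshot."""
        if self.current_id is None:
            return []
        return sorted(self.snapshots[self.current_id].files)


table = ToyTable()
s1 = table.commit(add=["data/a.parquet"], note="initial load")
# A correction writes a new object and swaps it in; the old object is untouched.
s2 = table.commit(add=["data/a-fixed.parquet"], remove=["data/a.parquet"],
                  note="data correction")
files_after_fix = table.files()

table.rollback(s1)
files_after_rollback = table.files()
print(files_after_fix, files_after_rollback)
```

Because the correction never overwrites `data/a.parquet`, rolling back costs only a metadata update. The trade-off is that the superseded file stays in storage until some expiry process removes it.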

Relationships

Outbound Relationships

Resources