Branching / Tagging
The catalog-level capability to create lightweight named references (branches and tags) to specific table states, enabling isolated experimentation, safe schema changes, and reproducible analysis without duplicating data files on S3.
Summary
Branching and tagging bring Git-like version control semantics to lakehouse metadata. A branch creates an isolated workspace where writes do not affect the main table state; a tag creates an immutable named snapshot for reproducibility. Both operate on metadata only — data files on S3 are shared.
- Branches do not copy data. They are metadata pointers. Creating a branch is nearly free; the cost comes from writes to the branch that create new data files.
- Not all catalogs support branching. Iceberg defines branch and tag references natively in its table spec, but Glue Catalog and Hive Metastore do not expose catalog-level branch APIs. Nessie and Polaris do.
- Merging branches in a lakehouse is not as mature as Git merging. Conflict resolution is table-level, not row-level, and concurrent modifications to the same table on different branches require careful handling.
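The table-level conflict rule in the last point can be illustrated with a toy model. This is a minimal sketch, not how any particular catalog implements merging: the `Catalog` class, the `MergeConflict` exception, and the version strings such as `"v1"` are hypothetical names invented for illustration. Each branch is modeled as a map from table name to a metadata pointer, and a merge fails as soon as the same table changed on both sides.

```python
from dataclasses import dataclass, field


class MergeConflict(Exception):
    """Raised when the same table changed on both branches."""


@dataclass
class Catalog:
    # branch name -> {table name -> metadata pointer (opaque version string)}
    branches: dict = field(default_factory=dict)

    def create_branch(self, name, source):
        # Copies the pointer map only; no data files are touched.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, pointer):
        self.branches[branch][table] = pointer

    def merge(self, source, target, base):
        """Merge source into target; `base` is the pointer map both started from."""
        src, tgt = self.branches[source], self.branches[target]
        merged = dict(tgt)
        for table in set(src) | set(tgt):
            src_changed = src.get(table) != base.get(table)
            tgt_changed = tgt.get(table) != base.get(table)
            if src_changed and tgt_changed and src.get(table) != tgt.get(table):
                # Table-level granularity: no attempt to reconcile rows.
                raise MergeConflict(table)
            if src_changed:
                if table in src:
                    merged[table] = src[table]
                else:
                    merged.pop(table, None)
        self.branches[target] = merged


catalog = Catalog({"main": {"orders": "v1", "users": "v1"}})

# Different tables changed on each branch: the merge succeeds.
base = dict(catalog.branches["main"])
catalog.create_branch("etl", "main")
catalog.commit("etl", "orders", "v2")
catalog.commit("main", "users", "v2")
catalog.merge("etl", "main", base)
after_clean_merge = dict(catalog.branches["main"])

# The same table changed on both branches: the merge is rejected.
base = dict(catalog.branches["main"])
catalog.create_branch("fix", "main")
catalog.commit("fix", "orders", "v3a")
catalog.commit("main", "orders", "v3b")
try:
    catalog.merge("fix", "main", base)
    conflict = None
except MergeConflict as exc:
    conflict = str(exc)
```

In this model the second merge is rejected even if the two writers touched disjoint rows of `orders`, which is the practical consequence of table-level rather than row-level conflict resolution.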
- scoped_to: Data Versioning, Table Formats (version control for table state)
- enabled_by: Project Nessie, Apache Polaris (catalogs that support branching)
- enabled_by: Apache Iceberg (Iceberg spec supports branch and tag references)
- enables: Time Travel (tags provide named time-travel targets)
Definition
The practice of creating named branches or tags on table metadata in an S3-based lakehouse, enabling isolated reads and writes against a logical copy of the data without physically duplicating files on S3.
Testing schema changes, running what-if analyses, or validating pipeline changes on production data requires isolation. Physical copies of S3 data are expensive and slow. Branching at the catalog/metadata level provides zero-copy isolation by forking the metadata pointers, not the data.
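To make "forking the metadata pointers, not the data" concrete, here is a small in-memory sketch. The `Snapshot` and `Table` classes and the `s3://bucket/...` keys are hypothetical and only loosely mirror how table formats track state; the point is that creating a branch or tag adds a name for an existing snapshot, and only a later write adds new files.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset  # S3 object keys visible in this table state


@dataclass
class Table:
    snapshots: dict = field(default_factory=dict)  # snapshot_id -> Snapshot
    branches: dict = field(default_factory=dict)   # branch name -> snapshot_id
    tags: dict = field(default_factory=dict)       # tag name -> snapshot_id

    def append(self, branch, new_files):
        # A write creates a new snapshot that reuses the parent's files.
        parent = self.snapshots.get(self.branches.get(branch))
        existing = parent.files if parent else frozenset()
        snap = Snapshot(len(self.snapshots) + 1, existing | frozenset(new_files))
        self.snapshots[snap.snapshot_id] = snap
        self.branches[branch] = snap.snapshot_id

    def create_branch(self, name, source="main"):
        # Zero-copy: the new branch points at the source's current snapshot.
        self.branches[name] = self.branches[source]

    def create_tag(self, name, branch="main"):
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists and is immutable")
        self.tags[name] = self.branches[branch]

    def files_on(self, ref):
        snapshot_id = self.branches.get(ref, self.tags.get(ref))
        return self.snapshots[snapshot_id].files

    def stored_files(self):
        # Every distinct object that actually has to exist on S3.
        return frozenset().union(*(s.files for s in self.snapshots.values()))


t = Table()
t.append("main", {"s3://bucket/t/a.parquet", "s3://bucket/t/b.parquet"})
t.create_tag("release-1", "main")
t.create_branch("experiment", "main")
stored_after_branch = len(t.stored_files())  # branch and tag added no files

t.append("experiment", {"s3://bucket/t/c.parquet"})
```

After the write to `experiment`, `main` and the `release-1` tag still see the two original files, the branch sees three, and only one new object was stored.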
Use cases: safe schema evolution testing, data pipeline CI/CD, and isolated experimentation on production datasets.
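For the pipeline CI/CD case, one common pattern is to write to a branch, validate it, and publish only if the checks pass. The sketch below assumes a Spark session with the Iceberg SQL extensions enabled. The function name, its parameters, the `staged_view` source, and the `id IS NULL` check are hypothetical placeholders, and the exact SQL accepted may vary with the Iceberg and Spark versions in use.

```python
def write_audit_publish(spark, catalog, table, branch, staged_view):
    """Stage rows on a branch, check them, then fast-forward main to the branch.

    `spark` is expected to behave like a pyspark.sql.SparkSession.
    `table` is the identifier without the catalog prefix, e.g. "db.orders".
    """
    full_name = f"{catalog}.{table}"
    branch_ref = f"{full_name}.`branch_{branch}`"

    # Create the branch: a metadata-only operation.
    spark.sql(f"ALTER TABLE {full_name} CREATE BRANCH `{branch}`")

    # Write to the branch; main is unaffected.
    spark.sql(f"INSERT INTO {branch_ref} SELECT * FROM {staged_view}")

    # Audit step (placeholder check on a hypothetical `id` column).
    bad_rows = spark.sql(
        f"SELECT count(*) AS n FROM {branch_ref} WHERE id IS NULL"
    ).collect()[0]["n"]

    if bad_rows:
        # Discard the branch; main never saw the bad rows.
        spark.sql(f"ALTER TABLE {full_name} DROP BRANCH `{branch}`")
        return False

    # Publish by moving main forward to the branch head.
    spark.sql(
        f"CALL {catalog}.system.fast_forward("
        f"table => '{table}', branch => 'main', to => '{branch}')"
    )
    return True
```

A call might look like `write_audit_publish(spark, "prod", "db.orders", "audit", "staged_orders")`, where all four names are illustrative.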
Resources
Project Nessie provides Git-like branching and tagging for Iceberg catalogs, enabling isolated development and zero-copy table snapshots.
lakeFS provides Git-like branch, commit, and merge operations at the data lake level, enabling branching workflows on S3-hosted data.
Delta Lake CLONE documentation for creating shallow (zero-copy) or deep (full-copy) clones of tables for isolated development and testing.