Architecture

Branching / Tagging

The catalog-level capability to create lightweight named references (branches and tags) to specific table states, enabling isolated experimentation, safe schema changes, and reproducible analysis without duplicating data files on S3.

Summary

Where it fits

Branching and tagging bring Git-like version control semantics to lakehouse metadata. A branch creates an isolated workspace where writes do not affect the main table state; a tag creates an immutable named snapshot for reproducibility. Both operate on metadata only — data files on S3 are shared.
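
The pointer model described above can be sketched in a few lines of Python. This is a toy in-memory model, not the API of Iceberg, Nessie, or any catalog: the `Snapshot` and `Table` classes, their method names, and the `s3://example-bucket/...` paths are all hypothetical, chosen only to show that creating a branch or tag copies a pointer while the data files stay shared.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable table state: an id plus the data files it references."""
    snapshot_id: int
    files: frozenset


@dataclass
class Table:
    """Toy table (hypothetical): snapshots plus named references."""
    snapshots: dict = field(default_factory=dict)
    branches: dict = field(default_factory=dict)  # name -> snapshot id, movable
    tags: dict = field(default_factory=dict)      # name -> snapshot id, fixed

    def commit(self, branch, new_files):
        """Append files on one branch; other references are not touched."""
        parent = self.snapshots.get(self.branches.get(branch))
        inherited = parent.files if parent else frozenset()
        snap = Snapshot(len(self.snapshots) + 1, inherited | frozenset(new_files))
        self.snapshots[snap.snapshot_id] = snap
        self.branches[branch] = snap.snapshot_id
        return snap

    def create_branch(self, name, source="main"):
        """Metadata-only: copy a pointer, write no data files."""
        self.branches[name] = self.branches[source]

    def create_tag(self, name, source="main"):
        """Metadata-only: pin a name to the current snapshot of a branch."""
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = self.branches[source]

    def files(self, ref):
        """Resolve a branch or tag name to the set of data files it sees."""
        snapshot_id = self.branches.get(ref, self.tags.get(ref))
        return self.snapshots[snapshot_id].files


table = Table()
table.commit("main", ["s3://example-bucket/data/a.parquet"])
table.create_tag("q1-report")        # named, fixed reference to this state
table.create_branch("experiment")    # isolated workspace, no files copied
table.commit("experiment", ["s3://example-bucket/data/b.parquet"])
```

After these calls, `experiment` resolves to both files, while `main` and the `q1-report` tag still resolve to the original one. The file `a.parquet` exists once and is referenced by all three names.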

Misconceptions / Traps
  • Branches do not copy data. They are metadata pointers. Creating a branch is nearly free; the cost comes from writes to the branch that create new data files.
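  • Not all catalogs support branching. Iceberg defines branch and tag references natively in its spec, scoped to a single table's metadata, but Glue Catalog and Hive Metastore do not expose branch APIs of their own. Nessie and Polaris do; confirm the exact behavior against the catalog version you run.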
  • Merging branches in a lakehouse is not as mature as Git merging. Conflict resolution is table-level, not row-level, and concurrent modifications to the same table on different branches require careful handling.
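
To make the table-level conflict point concrete, here is a minimal sketch of a fast-forward-only merge check. It assumes the simplest possible rule (a merge succeeds only if the target branch has not moved since the source branch forked) and is not the merge algorithm of any particular catalog. The `fast_forward` function, the `MergeConflict` exception, and the `refs` / `parents` dictionaries are hypothetical names for illustration.

```python
class MergeConflict(Exception):
    """Raised when the target branch moved after the source branch forked."""


def fast_forward(refs, parents, source, target="main"):
    """Move `target` to `source`'s snapshot if target's head is an ancestor of it.

    refs:    branch name -> snapshot id
    parents: snapshot id -> parent snapshot id (None for the first snapshot)
    """
    target_head = refs[target]
    cursor = refs[source]
    while cursor is not None:
        if cursor == target_head:
            refs[target] = refs[source]  # metadata-only pointer move
            return refs[target]
        cursor = parents.get(cursor)
    raise MergeConflict(
        f"{target!r} has commits that {source!r} does not; "
        "the whole table conflicts, regardless of which rows changed"
    )


parents = {1: None, 2: 1, 3: 1}  # snapshots 2 and 3 both descend from 1

# Case 1: main is still at the fork point, so the merge is a pointer move.
refs = {"main": 1, "dev": 2}
fast_forward(refs, parents, "dev")

# Case 2: main advanced to snapshot 3 after dev forked; the merge is refused.
diverged = {"main": 3, "dev": 2}
try:
    fast_forward(diverged, parents, "dev")
    conflict_raised = False
except MergeConflict:
    conflict_raised = True
```

The check never inspects rows: two branches that touched entirely different rows of the same table still conflict under this rule, which is the sense in which conflict resolution is table-level.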

Key Connections
  • scoped_to Data Versioning, Table Formats — version control for table state
  • enabled_by Project Nessie, Apache Polaris — catalogs that support branching
  • enabled_by Apache Iceberg — Iceberg spec supports branch and tag references
  • enables Time Travel — tags provide named time-travel targets

Definition

What it is

The practice of creating named branches or tags on table metadata in an S3-based lakehouse, enabling isolated reads and writes against a logical copy of the data without physically duplicating files on S3.

Why it exists

Testing schema changes, running what-if analyses, or validating pipeline changes on production data requires isolation. Physical copies of S3 data are expensive and slow. Branching at the catalog/metadata level provides zero-copy isolation by forking the metadata pointers, not the data.

Primary use cases

Safe schema evolution testing, data pipeline CI/CD, isolated experimentation on production datasets.
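
The pipeline CI/CD use case can be sketched as a write, audit, publish loop: the pipeline writes to a branch, checks run against the branch, and the main reference moves only if every check passes. The code below is an illustrative in-memory model under stated assumptions, not a catalog client. The `write_audit_publish` function, the `refs` dictionary, the branch name `ci-run`, and the sample rows and checks are all hypothetical, and the sketch ignores concurrent writes to the target while the run is in progress.

```python
def write_audit_publish(refs, transform, checks, branch="ci-run", target="main"):
    """Run `transform` on an isolated branch and publish only if all checks pass.

    refs maps a reference name to an immutable table state (a tuple of rows).
    Returns True if the target was updated, False if the branch was discarded.
    """
    refs[branch] = refs[target]                    # fork: copy the pointer only
    refs[branch] = tuple(transform(refs[branch]))  # the write lands on the branch
    passed = all(check(refs[branch]) for check in checks)
    if passed:
        refs[target] = refs[branch]                # publish: move the target pointer
    del refs[branch]                               # drop the branch either way
    return passed


def no_negative_amounts(rows):
    """Example audit: every row must have a non-negative amount."""
    return all(row["amount"] >= 0 for row in rows)


refs = {"main": ({"id": 1, "amount": 10},)}

# A bad pipeline change is caught on the branch and never reaches main.
bad_published = write_audit_publish(
    refs, lambda rows: rows + ({"id": 2, "amount": -5},), [no_negative_amounts]
)

# A good change passes the audit and is published.
good_published = write_audit_publish(
    refs, lambda rows: rows + ({"id": 2, "amount": 5},), [no_negative_amounts]
)
```

Readers of `main` never observe the failed run, because the rejected rows only ever existed on the `ci-run` branch.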
