Standard

Apache Hudi Spec

Summary

What it is

The specification for managing incremental data processing on object storage — record-level upserts, deletes, change logs, and timeline-based metadata.

Where it fits

The Hudi spec defines how to efficiently mutate individual records in datasets stored on object stores such as S3. It is the specification behind Hudi's Copy-on-Write and Merge-on-Read table types, and its timeline abstraction records every action performed on a table as an ordered sequence of instants.
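
A minimal PySpark sketch of how the table-type choice surfaces when writing a Hudi table. It assumes a Spark session with the Hudi bundle on the classpath; the table name, path, and columns are illustrative, not from the spec itself.

    # Minimal sketch: write a Hudi table and choose its table type.
    # Assumes a Spark session with the Hudi bundle jar; names are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("u1", "2024-01-01 00:00:00", "alice")],
        ["uuid", "ts", "name"],
    )

    hudi_options = {
        "hoodie.table.name": "users",
        "hoodie.datasource.write.recordkey.field": "uuid",  # record key used for upserts
        "hoodie.datasource.write.precombine.field": "ts",   # newest ts wins on key collision
        # COPY_ON_WRITE rewrites base files on update;
        # MERGE_ON_READ appends row-level changes to log files and compacts later.
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    }

    df.write.format("hudi").options(**hudi_options).mode("overwrite").save("/tmp/hudi/users")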

Misconceptions / Traps

  • The Hudi spec's timeline model is conceptually different from Iceberg's snapshot model and Delta's transaction log. Understanding the timeline abstraction is a prerequisite for operating Hudi tables; the sketch after this list shows how the timeline appears on disk.
  • The RFC-based evolution model means the spec is a living document. Breaking changes can be introduced via RFCs.
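
A hedged sketch of what the timeline looks like on disk: instants are stored as files under the table's .hoodie directory, named <instant_time>.<action>[.<state>], where a completed instant carries no state suffix. The path is illustrative and the exact layout varies across Hudi versions.

    # Sketch: list a Hudi table's timeline instants from the .hoodie directory.
    # File names follow <instant_time>.<action>[.<state>]; a completed instant
    # has no state suffix. Path is illustrative; layout varies by Hudi version.
    import os

    timeline_dir = "/tmp/hudi/users/.hoodie"

    for name in sorted(os.listdir(timeline_dir)):
        parts = name.split(".")
        if parts and parts[0].isdigit():  # instant times are numeric timestamps
            instant_time = parts[0]
            action = parts[1] if len(parts) > 1 else "?"
            state = parts[2] if len(parts) > 2 else "completed"
            print(f"{instant_time}  action={action}  state={state}")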

Key Connections

  • enables Lakehouse Architecture — makes incremental processing possible on data lakes
  • Apache Hudi depends_on Apache Hudi Spec
  • scoped_to Table Formats, Lakehouse

Definition

What it is

A specification for managing incremental data processing on object storage — defining record-level upserts, deletes, change logs, and timeline-based metadata.

Why it exists

Traditional data lake patterns supported only append operations. The Hudi spec defines how to efficiently update and delete individual records in S3-stored datasets, which is essential for change data capture (CDC), compliance-driven deletion, and data correction workflows.
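
A hedged sketch of the record-level mutations this enables, using the Hudi Spark datasource; the table, path, and fields continue the illustrative example above and are assumptions, not part of the spec.

    # Sketch: record-level upsert and delete via the Hudi Spark datasource.
    # Continues the illustrative users table; assumes the Hudi bundle jar.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("hudi-mutations").getOrCreate()

    hudi_options = {
        "hoodie.table.name": "users",
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.precombine.field": "ts",
    }

    # Upsert: rows whose record key already exists are updated in place.
    updates = spark.createDataFrame(
        [("u1", "2024-01-02 00:00:00", "alice-updated")],
        ["uuid", "ts", "name"],
    )
    (updates.write.format("hudi")
        .options(**hudi_options)
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("/tmp/hudi/users"))

    # Delete: only the record keys matter; matching records are removed.
    schema = StructType([
        StructField("uuid", StringType()),
        StructField("ts", StringType()),
        StructField("name", StringType()),
    ])
    deletes = spark.createDataFrame([("u1", "2024-01-03 00:00:00", None)], schema)
    (deletes.write.format("hudi")
        .options(**hudi_options)
        .option("hoodie.datasource.write.operation", "delete")
        .mode("append")
        .save("/tmp/hudi/users"))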

Primary use cases

Change data capture into S3, record-level updates and deletes without full partition rewrites, and incremental queries that return only records changed since a given timeline instant.
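
A hedged sketch of an incremental read: the query-type and begin-instant options below are the Hudi Spark datasource's incremental query knobs; the instant value and path are illustrative.

    # Sketch: incremental query returning only records changed after a given
    # timeline instant. The begin instant value and path are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

    changed = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20240101000000000")
        .load("/tmp/hudi/users"))

    changed.show()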

Relationships

Outbound Relationships

  • enables Lakehouse Architecture
  • scoped_to Table Formats, Lakehouse

Inbound Relationships

  • Apache Hudi depends_on Apache Hudi Spec

Resources