Standard

Apache Hudi Spec

Summary

What it is

The specification for managing incremental data processing on object storage — record-level upserts, deletes, change logs, and timeline-based metadata.

Where it fits

The Hudi spec defines how to efficiently mutate individual records in datasets stored on object stores such as S3. It is the specification behind Hudi's Copy-on-Write and Merge-on-Read table types, and its timeline abstraction records every action performed on a table as an ordered sequence of instants.
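
A minimal PySpark sketch of how the table-type choice surfaces when writing a Hudi table. It assumes a Spark session with the Hudi bundle on the classpath; the table name, path, and columns are illustrative, not from the spec itself.

    # Minimal sketch: write a Hudi table and choose its table type.
    # Assumes a Spark session with the Hudi bundle jar; names are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("u1", "2024-01-01 00:00:00", "alice")],
        ["uuid", "ts", "name"],
    )

    hudi_options = {
        "hoodie.table.name": "users",
        "hoodie.datasource.write.recordkey.field": "uuid",  # record key used for upserts
        "hoodie.datasource.write.precombine.field": "ts",   # newest ts wins on key collision
        # COPY_ON_WRITE rewrites base files on update;
        # MERGE_ON_READ appends row-level changes to log files and compacts later.
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    }

    df.write.format("hudi").options(**hudi_options).mode("overwrite").save("/tmp/hudi/users")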

Misconceptions / Traps

  • The Hudi spec's timeline model is conceptually different from Iceberg's snapshot model and Delta's transaction log. Understanding the timeline abstraction is a prerequisite for operating Hudi tables; the sketch after this list shows how the timeline appears on disk.
  • The RFC-based evolution model means the spec is a living document. Breaking changes can be introduced via RFCs.
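
A hedged sketch of what the timeline looks like on disk: instants are stored as files under the table's .hoodie directory, named <instant_time>.<action>[.<state>], where a completed instant carries no state suffix. The path is illustrative and the exact layout varies across Hudi versions.

    # Sketch: list a Hudi table's timeline instants from the .hoodie directory.
    # File names follow <instant_time>.<action>[.<state>]; a completed instant
    # has no state suffix. Path is illustrative; layout varies by Hudi version.
    import os

    timeline_dir = "/tmp/hudi/users/.hoodie"

    for name in sorted(os.listdir(timeline_dir)):
        parts = name.split(".")
        if parts and parts[0].isdigit():  # instant times are numeric timestamps
            instant_time = parts[0]
            action = parts[1] if len(parts) > 1 else "?"
            state = parts[2] if len(parts) > 2 else "completed"
            print(f"{instant_time}  action={action}  state={state}")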

Key Connections

  • enables Lakehouse Architecture — makes incremental processing possible on data lakes
  • Apache Hudi depends_on Apache Hudi Spec
  • scoped_to Table Formats, Lakehouse

Definition

What it is

A specification for managing incremental data processing on object storage — defining record-level upserts, deletes, change logs, and timeline-based metadata.

Why it exists

Traditional data lake patterns supported only append operations. The Hudi spec defines how to efficiently update and delete individual records in S3-stored datasets, which is essential for change data capture (CDC), compliance-driven deletion, and data correction workflows.
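
A hedged sketch of the record-level mutations this enables, using the Hudi Spark datasource; the table, path, and fields continue the illustrative example above and are assumptions, not part of the spec.

    # Sketch: record-level upsert and delete via the Hudi Spark datasource.
    # Continues the illustrative users table; assumes the Hudi bundle jar.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("hudi-mutations").getOrCreate()

    hudi_options = {
        "hoodie.table.name": "users",
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.precombine.field": "ts",
    }

    # Upsert: rows whose record key already exists are updated in place.
    updates = spark.createDataFrame(
        [("u1", "2024-01-02 00:00:00", "alice-updated")],
        ["uuid", "ts", "name"],
    )
    (updates.write.format("hudi")
        .options(**hudi_options)
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("/tmp/hudi/users"))

    # Delete: only the record keys matter; matching records are removed.
    schema = StructType([
        StructField("uuid", StringType()),
        StructField("ts", StringType()),
        StructField("name", StringType()),
    ])
    deletes = spark.createDataFrame([("u1", "2024-01-03 00:00:00", None)], schema)
    (deletes.write.format("hudi")
        .options(**hudi_options)
        .option("hoodie.datasource.write.operation", "delete")
        .mode("append")
        .save("/tmp/hudi/users"))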

Primary use cases

Change data capture into S3, record-level updates and deletes without full partition rewrites, and incremental queries that return only records changed since a given timeline instant.
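
A hedged sketch of an incremental read: the query-type and begin-instant options below are the Hudi Spark datasource's incremental query knobs; the instant value and path are illustrative.

    # Sketch: incremental query returning only records changed after a given
    # timeline instant. The begin instant value and path are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

    changed = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20240101000000000")
        .load("/tmp/hudi/users"))

    changed.show()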

Relationships

Outbound Relationships

  • enables Lakehouse Architecture
  • scoped_to Table Formats, Lakehouse

Inbound Relationships

  • Apache Hudi depends_on Apache Hudi Spec

Resources