Technology

Apache Hudi

Summary

What it is

A table format and data management framework optimized for incremental data processing — upserts, deletes, and change data capture — on object storage.

Where it fits

Hudi targets record-level mutations on data stored in object storage such as S3. Iceberg and Delta Lake also support updates and deletes, but Hudi was designed around CDC ingestion and near-real-time upserts from the start, which makes it a common choice for pipelines that must update individual records frequently.
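
The record-level semantics described above can be illustrated with a toy, in-memory model. This is not Hudi code and does not reflect Hudi's implementation: the event shape (`id`, `ts`, `op`, `value`) and the function name `apply_cdc` are assumptions made up for this sketch. It shows only the idea of keeping the newest event per record key within a batch and then applying it as an upsert or a delete.

```python
from typing import Any, Dict, Iterable

# Hypothetical CDC event shape for this sketch:
#   {"id": <record key>, "ts": <ordering value>, "op": "u" or "d", "value": ...}
Event = Dict[str, Any]


def apply_cdc(table: Dict[Any, Event], events: Iterable[Event]) -> Dict[Any, Event]:
    """Return a new table with a batch of CDC events applied.

    Toy model only: real table formats do this against files, not dicts.
    """
    # Step 1: within the batch, keep only the newest event per record key.
    latest: Dict[Any, Event] = {}
    for ev in events:
        prev = latest.get(ev["id"])
        if prev is None or ev["ts"] >= prev["ts"]:
            latest[ev["id"]] = ev

    # Step 2: apply each surviving event as an upsert or a delete.
    result = dict(table)  # copy so the input table is left untouched
    for key, ev in latest.items():
        if ev["op"] == "d":
            result.pop(key, None)
        else:
            result[key] = ev
    return result
```

Only the keys that appear in the batch are touched; every other row is carried over unchanged, which is the property that distinguishes an upsert from an append-only or full-overwrite load.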

Misconceptions / Traps

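Hudi has two table types with very different performance profiles. Copy-on-Write (CoW) rewrites the affected base files on every update, which favors read-heavy workloads. Merge-on-Read (MoR) appends updates to log files and merges them at read or compaction time, which favors write-heavy workloads. Choosing the wrong one is a common early mistake.
Hudi's operational complexity (compaction scheduling, cleaning policies, indexing) is higher than that of Iceberg or Delta. Budget for operational overhead.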
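
As a rough illustration of where these choices surface, the sketch below builds a dictionary of Hudi write options for either table type. The option keys are taken from the Hudi configuration reference as of the 0.x releases and should be checked against the version in use; the function name `table_service_options` and the default values are assumptions for this example, not recommendations.

```python
from typing import Dict


def table_service_options(table_type: str,
                          max_delta_commits: int = 5,
                          commits_retained: int = 10) -> Dict[str, str]:
    """Build table-type, index, cleaning, and compaction options (illustrative)."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unknown Hudi table type: {table_type}")

    opts = {
        # Table type is fixed when the table is first written.
        "hoodie.datasource.write.table.type": table_type,
        # Index used to locate existing records for upserts.
        "hoodie.index.type": "BLOOM",
        # Cleaning: how many past commits' file versions to keep.
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": str(commits_retained),
    }
    if table_type == "MERGE_ON_READ":
        # Compaction only applies to MoR, where log files accumulate.
        opts["hoodie.compact.inline"] = "true"
        opts["hoodie.compact.inline.max.delta.commits"] = str(max_delta_commits)
    return opts


# Assumed usage with an existing PySpark DataFrame `df` (not executed here):
#   df.write.format("hudi").options(**table_service_options("MERGE_ON_READ")) ...
```

Inline compaction is the simplest setup to reason about, but it runs as part of the write and adds to write latency; whether that trade-off is acceptable depends on the pipeline.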

Key Connections

implements Lakehouse Architecture — provides incremental processing on lakes
depends_on Apache Hudi Spec, Apache Parquet — specification and data format
solves Legacy Ingestion Bottlenecks (incremental ingestion), Schema Evolution
scoped_to Table Formats, Lakehouse

Definition

Why it exists

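Many real-world data pipelines need to update and delete individual records, not just append. Hudi brings record-level operations to data lakes without requiring a rewrite of the whole table or partition: Copy-on-Write tables rewrite only the files that contain changed records, and Merge-on-Read tables defer even that by writing changes to log files that are compacted later.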
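
A minimal sketch of how an upsert is typically configured through the Spark datasource follows. The option keys are the ones documented for Hudi 0.x releases and may differ in other versions; the helper name `upsert_options` and the example table, column, and path names are hypothetical.

```python
from typing import Dict


def upsert_options(table_name: str,
                   record_key: str,
                   precombine_field: str,
                   partition_field: str) -> Dict[str, str]:
    """Build Hudi Spark datasource options for an upsert write (illustrative)."""
    return {
        "hoodie.table.name": table_name,
        # "upsert" updates rows whose key already exists and inserts the rest.
        "hoodie.datasource.write.operation": "upsert",
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to pick the latest version when a key appears twice.
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Column that determines the partition path.
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


# Assumed usage with an existing PySpark DataFrame `df` (not executed here;
# the bucket and path are placeholders):
#   opts = upsert_options("orders", "order_id", "updated_at", "order_date")
#   df.write.format("hudi").options(**opts).mode("append").save("s3a://my-bucket/hudi/orders")
```

Setting the operation to `delete` instead of `upsert` removes the records whose keys appear in the incoming DataFrame.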

Primary use cases

Change data capture (CDC) into data lakes, incremental ETL pipelines, near-real-time analytics on S3 data.
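
For the incremental ETL case, a downstream job usually reads only the records that changed after a given commit instead of rescanning the table. The sketch below builds the read options for such an incremental query; the keys follow the Hudi 0.x Spark datasource documentation and should be verified for the version in use, and the helper name `incremental_read_options` and the sample instant values are assumptions.

```python
from typing import Dict, Optional


def incremental_read_options(begin_instant: str,
                             end_instant: Optional[str] = None) -> Dict[str, str]:
    """Build Hudi read options for an incremental query (illustrative)."""
    opts = {
        # Ask for changes only, rather than a full snapshot of the table.
        "hoodie.datasource.query.type": "incremental",
        # Return records written after this commit instant.
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound; when omitted, reads up to the latest commit.
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


# Assumed usage with an existing SparkSession `spark` (not executed here;
# the path is a placeholder):
#   opts = incremental_read_options("20240101000000")
#   changes = spark.read.format("hudi").options(**opts).load("s3a://my-bucket/hudi/orders")
```

A job built this way would persist the last instant it processed and pass it as `begin_instant` on the next run.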

Relationships

Outbound Relationships

scoped_to

Table Formats, Lakehouse

implements

Lakehouse Architecture

depends_on

Apache Hudi Spec, Apache Parquet

solves

Legacy Ingestion Bottlenecks, Schema Evolution

Resources

Docs (High)

hudi.apache.org/docs/overview

Official Apache Hudi documentation covering table types (CoW/MoR), indexing, compaction, and storage layer configuration for S3.

GitHub (High)

github.com/apache/hudi

Main Hudi source repository including the core storage engine, Spark/Flink integrations, and the S3-compatible file system layer.

Docs (Medium)

hudi.apache.org/docs/s3_hoodie

Hudi's dedicated AWS S3 configuration guide covering S3A filesystem setup, IAM roles, and performance tuning.

Spec (High)

github.com/apache/hudi/tree/master/rfc

The Hudi RFC directory contains formal design documents for major features and format changes, serving as living specification amendments.