Apache Hudi
Summary
What it is
A table format and data management framework optimized for incremental data processing — upserts, deletes, and change data capture — on object storage.
Where it fits
Hudi targets record-level mutations on data stored in S3 and other object stores. Iceberg and Delta Lake also support row-level updates, but Hudi was designed around CDC ingestion and near-real-time upserts, which makes it a common choice for pipelines that frequently update individual records.
Misconceptions / Traps
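- Hudi has two table types with very different performance profiles. Copy-on-Write (CoW) rewrites the affected data files on each update, which favors read performance; Merge-on-Read (MoR) appends updates to log files and merges them at read time or during compaction, which favors write latency. Choosing the wrong one is a common early mistake.
- Hudi's operational complexity (compaction scheduling, cleaning policies, indexing) is higher than that of Iceberg or Delta. Budget for the operational overhead.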
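To make the two traps above concrete, here is a minimal sketch of how the table type and the main operational knobs might be expressed as writer options. The helper name `hudi_write_options`, the field names (`id`, `updated_at`, `region`), and all numeric values are illustrative assumptions; the option keys are Hudi configuration names as commonly documented, and should be checked against the configuration reference for the Hudi version in use.

```python
def hudi_write_options(table_name, table_type="COPY_ON_WRITE",
                       record_key="id", precombine="updated_at",
                       partition_field="region"):
    """Build a dict of Hudi writer options (hypothetical helper, values illustrative)."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unknown table type: {table_type}")

    opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        # Field that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Indexing: how incoming keys are mapped to existing files.
        "hoodie.index.type": "BLOOM",
        # Cleaning: how many commits of old file versions to retain.
        "hoodie.cleaner.commits.retained": "10",
    }

    if table_type == "MERGE_ON_READ":
        # Compaction only applies to MoR: merge log files into base files
        # inline after a number of delta commits (assumed value).
        opts["hoodie.compact.inline"] = "true"
        opts["hoodie.compact.inline.max.delta.commits"] = "5"

    return opts


cow = hudi_write_options("orders")
mor = hudi_write_options("orders", table_type="MERGE_ON_READ")
# Keys that only the Merge-on-Read configuration needs.
print(sorted(set(mor) - set(cow)))
```

With PySpark and the Hudi bundle on the classpath, a dict like this would typically be passed as `df.write.format("hudi").options(**opts).mode("append").save(path)`. The sketch only builds the options, so it runs without Spark.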
Key Connections
- implements: Lakehouse Architecture (provides incremental processing on lakes)
- depends_on: Apache Hudi Spec, Apache Parquet (specification and data format)
- solves: Legacy Ingestion Bottlenecks (incremental ingestion), Schema Evolution
- scoped_to: Table Formats, Lakehouse
Definition
Why it exists
Many real-world data pipelines need to update and delete individual records, not just append. Hudi brings record-level operations to data lakes without requiring a full rewrite of the table or of whole partitions.
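As a conceptual illustration of record-level upserts and deletes, the following toy model applies a batch of change records to an in-memory table keyed by record key, keeping the version with the highest ordering value. It is not how Hudi is implemented; the function `apply_changes`, the `op` field, and the `id`/`ts` field names are assumptions made for the sketch.

```python
def apply_changes(table, changes, key="id", ordering="ts"):
    """Return a new table snapshot with the change batch applied.

    table:   dict mapping record key -> record dict
    changes: iterable of record dicts, each with an "op" of "upsert" or "delete"
    """
    result = dict(table)
    for change in changes:
        k = change[key]
        current = result.get(k)
        # Skip changes older than the version already stored for this key.
        if current is not None and change[ordering] < current[ordering]:
            continue
        if change.get("op") == "delete":
            # Toy simplification: no tombstone is kept, so a late, older
            # upsert for this key would re-insert the record.
            result.pop(k, None)
        else:
            result[k] = {f: v for f, v in change.items() if f != "op"}
    return result


table = {
    1: {"id": 1, "name": "a", "ts": 10},
    2: {"id": 2, "name": "b", "ts": 10},
}
changes = [
    {"op": "upsert", "id": 1, "name": "a2", "ts": 20},     # newer: replaces id 1
    {"op": "upsert", "id": 1, "name": "stale", "ts": 5},   # older: ignored
    {"op": "delete", "id": 2, "ts": 30},                   # removes id 2
    {"op": "upsert", "id": 3, "name": "c", "ts": 15},      # new key: inserted
]
updated = apply_changes(table, changes)
print(sorted(updated))
```

Only the records named in the batch are touched; every other record is carried over unchanged, which is the property that distinguishes record-level operations from append-only ingestion.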
Primary use cases
Change data capture (CDC) into data lakes, incremental ETL pipelines, near-real-time analytics on S3 data.
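For the incremental ETL use case, a downstream job usually reads only the records committed after a known point instead of rescanning the table. The sketch below builds reader options for such an incremental query. The helper name and the instant value are hypothetical; the option keys are the ones commonly documented for the Hudi Spark datasource and may differ between Hudi versions.

```python
def incremental_read_options(begin_instant):
    """Build reader options for a Hudi incremental query (hypothetical helper).

    begin_instant: commit instant to read after, as a timestamp string.
    """
    if not begin_instant.isdigit():
        raise ValueError("begin_instant must be a numeric timestamp string")
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


# Illustrative instant; a real job would persist the last instant it processed.
read_opts = incremental_read_options("20240101000000")
print(read_opts["hoodie.datasource.query.type"])
```

In a PySpark job these options would typically be passed as `spark.read.format("hudi").options(**read_opts).load(path)`, where `path` is the table's base path on object storage.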
Relationships
Outbound Relationships
scoped_to, implements, depends_on
Resources
Official Apache Hudi documentation covering table types (CoW/MoR), indexing, compaction, and storage layer configuration for S3.
Main Hudi source repository including the core storage engine, Spark/Flink integrations, and the S3-compatible file system layer.
Hudi's dedicated AWS S3 configuration guide covering S3A filesystem setup, IAM roles, and performance tuning.
The Hudi RFC directory contains formal design documents for major features and format changes, serving as living specification amendments.