Apache Hudi
A table format and data management framework optimized for incremental data processing — upserts, deletes, and change data capture — on object storage.
Summary
Hudi occupies the niche of record-level mutations on data in S3. Where Iceberg and Delta focus on batch analytics, Hudi's strength is CDC ingestion and near-real-time upserts, which makes it a natural fit for pipelines that need to update individual records.
- Hudi has two table types (Copy-on-Write and Merge-on-Read) with very different performance profiles. Choosing the wrong one is a common early mistake.
- Hudi's operational complexity (compaction scheduling, cleaning policies, indexing) is higher than Iceberg or Delta. Budget for operational overhead.
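To make the two points above concrete, here is a minimal sketch of how the table type and the table-service settings might be expressed as Spark datasource write options. The option keys are written as they appear in Hudi's configuration reference to the best of our knowledge and should be checked against the release in use. The helper name `hudi_write_options`, the field names (`id`, `updated_at`, `region`), and the numeric values are hypothetical placeholders, not recommendations.

```python
from typing import Dict

COPY_ON_WRITE = "COPY_ON_WRITE"
MERGE_ON_READ = "MERGE_ON_READ"


def hudi_write_options(table_name: str, table_type: str) -> Dict[str, str]:
    """Build write options for an upsert into a Hudi table (hypothetical helper)."""
    if table_type not in (COPY_ON_WRITE, MERGE_ON_READ):
        raise ValueError(f"unknown table type: {table_type}")

    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        # Assumed column names in the incoming data.
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.partitionpath.field": "region",
        # Indexing: how incoming keys are mapped to existing files.
        "hoodie.index.type": "BLOOM",
        # Cleaning: how many past commits to retain (placeholder value).
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": "10",
    }
    if table_type == MERGE_ON_READ:
        # Merge-on-Read appends updates to log files, so compaction must be
        # scheduled to fold them back into base files (placeholder cadence).
        options["hoodie.compact.inline"] = "true"
        options["hoodie.compact.inline.max.delta.commits"] = "5"
    return options
```

With a Spark session that has the Hudi bundle on its classpath, the resulting dictionary would typically be passed along the lines of `df.write.format("hudi").options(**opts).mode("append").save(path)`. Only the Merge-on-Read branch adds compaction settings, which is one reason its operational profile differs from Copy-on-Write.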
implements: Lakehouse Architecture (provides incremental processing on lakes)
depends_on: Apache Hudi Spec, Apache Parquet (specification and data format)
solves: Legacy Ingestion Bottlenecks (incremental ingestion), Schema Evolution
scoped_to: Table Formats, Lakehouse
Definition
A table format and data management framework optimized for incremental data processing — upserts, deletes, and change data capture — on object storage.
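Many real-world data pipelines need to update and delete individual records, not just append. Hudi brings record-level operations to data lakes without rewriting whole tables or partitions: Copy-on-Write rewrites only the files that contain changed records, and Merge-on-Read appends changes to log files that are merged with the base files later.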
Typical use cases: change data capture (CDC) into data lakes, incremental ETL pipelines, and near-real-time analytics on S3 data.
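The record-level idea can be illustrated with a toy model of a merge-on-read style read path. This is a conceptual sketch only, not Hudi's implementation: the function name `merge_on_read`, the `id` record key, the `ts` ordering field, and the `_deleted` tombstone flag are all assumptions made for the example, and real Hudi file layouts, indexes, and merge logic are considerably more involved.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def merge_on_read(base: List[Record], log: List[Record]) -> List[Record]:
    """Combine base records with logged changes at read time (toy model).

    `id` is the record key and `ts` is the ordering field. A log entry
    carrying `_deleted: True` acts as a tombstone for its key.
    """
    latest: Dict[Any, Record] = {r["id"]: r for r in base}
    for entry in log:
        current = latest.get(entry["id"])
        # Keep the entry only if it is at least as new as what we hold,
        # so late-arriving stale changes do not overwrite newer data.
        if current is None or entry["ts"] >= current["ts"]:
            latest[entry["id"]] = entry
    # Tombstones win the ordering contest but are hidden from readers.
    live = [r for r in latest.values() if not r.get("_deleted")]
    return sorted(live, key=lambda r: r["id"])


base_file = [
    {"id": 1, "ts": 1, "v": "a"},
    {"id": 2, "ts": 1, "v": "b"},
]
change_log = [
    {"id": 1, "ts": 2, "v": "a2"},        # update
    {"id": 2, "ts": 2, "_deleted": True},  # delete
    {"id": 3, "ts": 2, "v": "c"},          # insert
    {"id": 1, "ts": 0, "v": "stale"},      # late, older change: ignored
]
snapshot = merge_on_read(base_file, change_log)
```

In this model the base records are never modified; the update, delete, and insert exist only in the change log until something like a compaction step folds them in. That is the trade-off behind the two table types: cheaper writes in exchange for merge work at read or compaction time.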
Recent developments
- Hudi 1.0.x — stable artifact line bundling Spark 3.x / Scala 2.12. As of 2026, the 1.0.x release stream is the recommended production line. Its bundled artifacts pin Spark 3.x and Scala 2.12, which removes the dependency conflicts that plagued earlier minor versions, where Spark/Scala compatibility had to be wired manually. Teams running production pipelines on Hudi should pin to this line.
- Hudi 1.1.x — active development branch. As of mid-2026, the 1.1.x branch is where new features land: improved indexing, performance work on the upsert path, and incremental query optimizations. Not recommended for production yet, but worth watching for teams planning their next version-bump window. Expect the 1.1 → 1.2 transition to bring Spark 4.x compatibility.
- CDC-shape benchmark positioning — Hudi leads on record-level upsert throughput. The dominant cross-format benchmark in 2026 (Hudi vs. Iceberg V3 vs. Delta Lake vs. Paimon under change-data-capture workloads) consistently rates Hudi's record-level upsert path as the strongest of the four, because Hudi was designed from day one with upserts as a first-class operation rather than retrofitting them onto a snapshot-isolation model. The catch: Iceberg V3's deletion vectors, stored in Puffin files, have closed most of the gap on UPDATE/DELETE throughput in the past 12 months, so Hudi's advantage is narrower than it used to be. Decision matrix:
  - Pick Hudi when upserts dominate the write pattern and you need sub-second commit latency.
  - Pick Iceberg V3 when you need the broader catalog/engine ecosystem (DuckDB, Trino, and Snowflake all read it natively).
  - Pick Delta when CDC frequency is high and metadata stability under concurrent-writer pressure matters more than upsert throughput.
- AI-shaped indexing — Hudi's pluggable index over vector embeddings. Per Onehouse's lakehouse comparison, Hudi's distinguishing feature for AI workloads in 2026 is its multi-modal pluggable indexing subsystem sitting in the cloud metadata table — Bloom, R-tree, and bitmap indexes can be created asynchronously directly over vector embedding columns. Combined with Multi-Version Concurrency Control and Non-Blocking Concurrency Control, this provides record-level conflict resolution for workloads where thousands of concurrent AI agents write memory states simultaneously. The implication: Hudi is the table format of choice when the lakehouse doubles as the agent memory store.
- Hudi 1.0.2 released — Spark 3.5/4.0 + Java 17/21 support, bug fixes. Per the Apache Hudi 1.0.2 release notes, the latest patch added Spark 4.0 cross-compilation, Java 17 and 21 runtime support, and a long list of MVCC and clustering fixes. Production teams on the 1.0.x line should pin to 1.0.2: earlier 1.0.x patch releases had table-service deadlocks under heavy concurrent-writer pressure, which affected at least one platform team.
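For teams following the pinning advice in the items above, a minimal sketch of what a pinned Spark configuration might look like is shown here. The Maven coordinate pattern and the Spark settings follow the Hudi Spark quick-start guide as best we know; the helper names are hypothetical, and the defaults assume that a bundle for Spark 3.5 / Scala 2.12 is published for the chosen Hudi version.

```python
from typing import Dict


def hudi_bundle_coordinate(hudi_version: str = "1.0.2",
                           spark_minor: str = "3.5",
                           scala_binary: str = "2.12") -> str:
    """Return the Maven coordinate of a Hudi Spark bundle (hypothetical helper)."""
    return (
        f"org.apache.hudi:hudi-spark{spark_minor}-bundle_{scala_binary}"
        f":{hudi_version}"
    )


def spark_conf_for_hudi(hudi_version: str = "1.0.2") -> Dict[str, str]:
    """Spark settings that pin one Hudi bundle version (hypothetical helper)."""
    return {
        # Pin the exact bundle so Spark/Scala compatibility is not left to chance.
        "spark.jars.packages": hudi_bundle_coordinate(hudi_version),
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.extensions":
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    }
```

Each key/value pair can be applied with `SparkSession.builder.config(key, value)` or passed as `--conf key=value` to `spark-submit`. Keeping the version in one place makes a later bump a one-line change.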
Resources
Official Apache Hudi documentation covering table types (CoW/MoR), indexing, compaction, and storage layer configuration for S3.
Main Hudi source repository including the core storage engine, Spark/Flink integrations, and the S3-compatible file system layer.
Hudi's dedicated AWS S3 configuration guide covering S3A filesystem setup, IAM roles, and performance tuning.
The Hudi RFC directory contains formal design documents for major features and format changes, serving as living specification amendments.