Technology

Apache Hudi

A table format and data management framework optimized for incremental data processing — upserts, deletes, and change data capture — on object storage.


Summary

Where it fits

Hudi targets record-level mutations on data stored in S3 and other object stores. Where Iceberg and Delta Lake focus primarily on batch analytics, Hudi's strength is CDC ingestion and near-real-time upserts, which makes it a natural choice for pipelines that must update individual records.

Misconceptions / Traps
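  • Hudi has two table types with very different performance profiles: Copy-on-Write rewrites the affected data files at write time, which favors read performance, while Merge-on-Read appends changes to log files and merges them later, which favors write latency. Choosing the wrong one is a common early mistake.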
  • Hudi's operational complexity (compaction scheduling, cleaning policies, indexing) is higher than Iceberg or Delta. Budget for operational overhead.
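
To make the two traps above more concrete, here is a minimal sketch of how the table type and the table services might show up in write options. The option keys are standard Hudi configuration names, but the helper `hudi_write_options`, its parameters, and every value chosen here are assumptions for illustration; check them against the configuration reference for the Hudi version in use.

```python
VALID_TABLE_TYPES = {"COPY_ON_WRITE", "MERGE_ON_READ"}


def hudi_write_options(table_name, table_type, record_key, precombine_field):
    """Build Hudi write options (hypothetical helper, illustrative values only)."""
    if table_type not in VALID_TABLE_TYPES:
        raise ValueError(f"unknown table type: {table_type!r}")

    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Indexing and cleaning need a decision for both table types.
        "hoodie.index.type": "BLOOM",
        "hoodie.cleaner.commits.retained": "10",
    }
    if table_type == "MERGE_ON_READ":
        # Merge-on-Read accumulates log files, so compaction must be planned.
        # Here (an assumption): compact inline after every 5 delta commits.
        options["hoodie.compact.inline"] = "true"
        options["hoodie.compact.inline.max.delta.commits"] = "5"
    return options


def write_upsert(df, base_path, options):
    """Write a Spark DataFrame with the given options.

    Needs a Spark session with a Hudi bundle on the classpath; not run here.
    """
    df.write.format("hudi").options(**options).mode("append").save(base_path)
```

The point of the sketch is that a Merge-on-Read table carries extra compaction settings that a Copy-on-Write table does not, which is one source of the operational overhead mentioned above.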

Key Connections
  • implements Lakehouse Architecture — provides incremental processing on lakes
  • depends_on Apache Hudi Spec, Apache Parquet — specification and data format
  • solves Legacy Ingestion Bottlenecks (incremental ingestion), Schema Evolution
  • scoped_to Table Formats, Lakehouse

Definition

Why it exists

Many real-world data pipelines need to update and delete individual records, not just append. Hudi brings record-level operations to data lakes without requiring a full rewrite of the table or its partitions.
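
The record-level semantics described above can be sketched with a toy in-memory model. This is not how Hudi is implemented; it only illustrates the idea of applying upserts and deletes by record key, with an ordering field deciding which version wins. The function `apply_changes`, the `_deleted` flag, and the field names are hypothetical.

```python
def apply_changes(table, changes, key="id", ordering="ts"):
    """Apply upserts and deletes to an in-memory table keyed by `key`.

    `table` maps key -> record dict. Each change is a record dict; a change
    whose `_deleted` flag is True removes that key. A change older than the
    stored record (compared on `ordering`) is ignored.
    """
    result = dict(table)  # copy so the input table is left untouched
    for change in changes:
        k = change[key]
        current = result.get(k)
        if current is not None and change[ordering] < current[ordering]:
            continue  # stale change: the stored record is newer
        if change.get("_deleted", False):
            result.pop(k, None)  # record-level delete
        else:
            # record-level insert or update; drop the control flag if present
            result[k] = {f: v for f, v in change.items() if f != "_deleted"}
    return result


table = {
    1: {"id": 1, "name": "a", "ts": 10},
    2: {"id": 2, "name": "b", "ts": 10},
}
changes = [
    {"id": 1, "name": "a2", "ts": 20},        # update
    {"id": 2, "ts": 15, "_deleted": True},    # delete
    {"id": 3, "name": "c", "ts": 5},          # insert
    {"id": 1, "name": "stale", "ts": 12},     # late-arriving, ignored
]
merged = apply_changes(table, changes)
```

Only the keys named in `changes` are touched; every other record is carried over as-is, which is the property that separates record-level mutation from append-only ingestion.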

Primary use cases

Change data capture (CDC) into data lakes, incremental ETL pipelines, near-real-time analytics on S3 data.
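
For the incremental ETL use case, a downstream job typically reads only the records that changed after a given commit instant. The sketch below builds read options for such a query. The keys are the commonly documented Hudi datasource option names, but they have shifted between releases, so treat them, the helper names, and the sample instant as assumptions to verify against the version in use.

```python
def incremental_read_options(begin_instant, end_instant=None):
    """Build read options for an incremental query (hypothetical helper)."""
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound; without it the query reads up to the latest commit.
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def read_incremental(spark, base_path, begin_instant, end_instant=None):
    """Load changed records as a Spark DataFrame.

    Needs a SparkSession with a Hudi bundle on the classpath; not run here.
    """
    options = incremental_read_options(begin_instant, end_instant)
    return spark.read.format("hudi").options(**options).load(base_path)


# Sample instant in Hudi's timestamp-style format (illustrative value).
opts = incremental_read_options("20260101000000000")
```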

Recent developments

Latest signals
  • Hudi 1.0.x: stable artifact line bundling Spark 3.x / Scala 2.12. As of 2026, the 1.0.x release stream is the recommended production line. Its bundled artifacts pin Spark 3.x and Scala 2.12, which avoids the dependency conflicts of earlier releases, where Spark/Scala compatibility had to be wired manually. Teams running production pipelines on Hudi should pin to this line.
  • Hudi 1.1.x: active development branch. As of mid-2026, the 1.1.x branch is where new features land: improved indexing, performance work on the upsert path, and incremental query optimizations. It is not yet recommended for production, but it is worth watching for teams planning their next version-bump window. Expect the 1.1 → 1.2 transition to bring Spark 4.x compatibility.
  • CDC-shape benchmark positioning: Hudi leads on record-level upsert throughput. The dominant cross-format benchmark in 2026 (Hudi vs. Iceberg V3 vs. Delta Lake vs. Paimon under change-data-capture workloads) consistently rates Hudi's record-level upsert path as the strongest of the four, because Hudi was designed from day one with upserts as a first-class operation rather than retrofitting them onto a snapshot-isolation model. The catch: Iceberg V3's deletion vectors, which are stored in Puffin files, have closed most of the gap on UPDATE/DELETE throughput in the past 12 months, so Hudi's advantage is narrower than it used to be. Decision matrix: pick Hudi when upserts dominate the write pattern and you need sub-second commit latency; pick Iceberg V3 when you need the broader catalog/engine ecosystem (DuckDB, Trino, and Snowflake all read it natively); pick Delta when CDC frequency is high and metadata stability under concurrent-writer pressure matters more than upsert throughput.
  • AI-shaped indexing: Hudi's pluggable index over vector embeddings. Per Onehouse's lakehouse comparison, Hudi's distinguishing feature for AI workloads in 2026 is its multi-modal pluggable indexing subsystem, which lives in the metadata table: Bloom, R-tree, and bitmap indexes can be created asynchronously directly over vector embedding columns. Combined with Multi-Version Concurrency Control (MVCC) and Non-Blocking Concurrency Control (NBCC), this provides record-level conflict resolution for workloads where thousands of concurrent AI agents write memory states simultaneously. The implication, on this account, is that Hudi is the table format of choice when the lakehouse doubles as the agent memory store.
  • Hudi 1.0.2 released: Spark 3.5/4.0 and Java 17/21 support, bug fixes. Per the Apache Hudi 1.0.2 release notes, the latest patch added Spark 4.0 cross-compilation, Java 17 and 21 runtime support, and a long list of MVCC and clustering fixes. Production teams on the 1.0.x line should pin to 1.0.2: earlier 1.0.x patch releases had table-service deadlocks under heavy concurrent-writer pressure that reportedly affected at least one platform team.
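
The pinning advice in the signals above can be expressed as a small helper that builds the Maven coordinate of a Hudi Spark bundle and rejects versions outside the intended release line. The coordinate pattern follows the naming used for published Hudi Spark bundles; the helper names, the default versions, and the guard on the release line are assumptions for illustration.

```python
def hudi_spark_bundle(hudi_version, spark_line="3.5", scala_version="2.12",
                      required_line="1.0"):
    """Return the Maven coordinate of a Hudi Spark bundle (hypothetical helper).

    Raises ValueError if `hudi_version` is not in `required_line`, so an
    accidental bump to another release line fails fast.
    """
    if not hudi_version.startswith(required_line + "."):
        raise ValueError(
            f"{hudi_version} is outside the pinned {required_line}.x line"
        )
    return (
        f"org.apache.hudi:hudi-spark{spark_line}-bundle_{scala_version}"
        f":{hudi_version}"
    )


def spark_conf_for_hudi(hudi_version):
    """Spark settings that pull in the pinned bundle (illustrative subset)."""
    return {
        "spark.jars.packages": hudi_spark_bundle(hudi_version),
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    }


coordinate = hudi_spark_bundle("1.0.2")
```

The resulting coordinate can be passed to `spark-submit --packages` or set through `spark.jars.packages`, keeping the Hudi, Spark, and Scala versions pinned in one place.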
