Guide 17

Real-Time Lakehouse: Paimon vs. Hudi

Problem Framing

Building a real-time lakehouse on S3 requires a table format optimized for high-frequency writes and low-latency reads. Apache Hudi and Apache Paimon are the two primary contenders. Hudi has years of production maturity; Paimon brings a fundamentally different architecture (LSM-tree vs. Hudi's log-based merge-on-read). Engineers implementing CDC pipelines or streaming analytics on S3 must understand the architectural differences, performance characteristics, and ecosystem trade-offs to make an informed choice.

Relevant Nodes

  • Topics: Table Formats, Lakehouse
  • Technologies: Apache Paimon, Apache Hudi, Apache Iceberg, Apache Flink, Apache Spark, Flink CDC, Estuary Flow
  • Standards: S3 API, Apache Parquet
  • Architectures: Lakehouse Architecture, LSM-tree on S3, Deletion Vector
  • Pain Points: Small Files Problem, Cold Scan Latency

Decision Path

  1. Understand the architectural difference:

    • Hudi (MoR — Merge-on-Read): Writes log files alongside base data files. Reads merge logs at query time. Compaction merges logs into base files periodically.
    • Paimon (LSM-tree): Writes sorted runs to S3 as immutable files across multiple levels. Compaction merges levels. Reads scan the active levels.
    • Both avoid copy-on-write for updates. The difference is in how they organize and compact the incremental data.
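The two read paths can be contrasted with a toy sketch. This is illustrative Python, not real Hudi or Paimon code: records are key-value pairs, and newer writes win on key conflicts.

```python
# Toy contrast of the two read paths (illustrative only, not real Hudi/Paimon code).

def mor_read(base_file, log_files):
    """Hudi-style merge-on-read: apply log records on top of the base file."""
    merged = dict(base_file)           # base Parquet data
    for log in log_files:              # log files in write order, oldest first
        merged.update(log)             # each log record overrides earlier state
    return merged

def lsm_read(levels):
    """Paimon-style LSM read: newer (shallower) levels shadow older ones."""
    merged = {}
    for level in reversed(levels):     # start at the deepest (oldest) level
        merged.update(level)           # shallower levels override on key clash
    return merged

base = {"k1": "v1", "k2": "v2"}
logs = [{"k2": "v2a"}, {"k3": "v3"}]
levels = [{"k2": "v2a", "k3": "v3"},   # level 0: newest sorted run
          {"k1": "v1", "k2": "v2"}]    # level 1: older, compacted data
assert mor_read(base, logs) == lsm_read(levels)  # same logical table state
```

Both reach the same logical table; the formats differ in where the merge work happens (query-time log replay vs. level-ordered scan) and in how compaction reshapes the incremental files.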
  2. Choose Apache Paimon when:

    • Your primary compute engine is Apache Flink. Paimon was originally Flink Table Store, and the Flink integration is first-class.
    • You need minute-level data visibility for CDC workloads on S3.
    • You want a streaming-first architecture where batch is the secondary use case.
    • Your workload is append-heavy with moderate update rates.
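For the CDC case, the core semantics are primary-key upserts applied in stream order. A minimal sketch, assuming Flink-style changelog op codes (`+I` insert, `+U` update, `-D` delete) and last-write-wins merging; this is not Paimon's API, just the merge behavior a primary-key table provides:

```python
# Hypothetical sketch of primary-key merge semantics for a CDC changelog:
# each change is (op, key, value), applied in arrival order.

def apply_changelog(changes):
    table = {}
    for op, key, value in changes:
        if op == "-D":
            table.pop(key, None)       # delete removes the row
        else:                          # "+I" insert and "+U" update both upsert
            table[key] = value
    return table

cdc = [("+I", 1, "alice"), ("+I", 2, "bob"),
       ("+U", 1, "alicia"), ("-D", 2, None)]
assert apply_changelog(cdc) == {1: "alicia"}
```

Minute-level visibility then comes down to how often these merged results are committed as snapshots on S3.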
  3. Choose Apache Hudi when:

    • You need broad engine support. Hudi works with Spark, Flink, Trino, Presto, and more.
    • You have existing Hudi infrastructure and expertise (Hudi has years of production history at Uber, Robinhood, etc.).
    • You need record-level upserts with efficient index lookups.
    • You want Flink support without switching formats: Hudi 1.1's Flink-native writer narrows the performance gap with Paimon.
  4. Consider Iceberg as an alternative:

    • Iceberg is not streaming-first, but with its Flink connector and deletion vectors it can handle moderate update workloads.
    • If your primary workload is batch with occasional streaming, Iceberg's broader ecosystem may outweigh Paimon's streaming optimization.
    • XTable or UniForm can bridge: write with Paimon/Hudi, read as Iceberg.
  5. Compaction economics:

    • Both formats require compaction. On S3, compaction means reading files, merging, and writing new files — consuming compute and I/O.
    • Paimon's LSM compaction is more predictable (level-based), while Hudi's compaction is workload-dependent (based on log file accumulation).
    • Budget compaction compute as a first-class operational concern, not an afterthought.
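A back-of-envelope model makes the budgeting concrete. The numbers below are assumptions to adjust for your workload, not benchmarks: each compaction cycle reads its input files, merges them, and writes new files, so the data moved is roughly read bytes plus write bytes (plus per-file S3 request costs, ignored here).

```python
# Back-of-envelope compaction cost model (assumed numbers, adjust per workload).

def compaction_cost_gb(input_files, avg_file_mb, write_amplification=1.0):
    read_gb = input_files * avg_file_mb / 1024
    write_gb = read_gb * write_amplification  # merged output, often ~= input size
    return read_gb + write_gb

# e.g. merging 200 small 16 MB files per cycle moves ~6.25 GB through compute
cost = compaction_cost_gb(input_files=200, avg_file_mb=16)
assert round(cost, 2) == 6.25
```

Run this against your expected small-file arrival rate and compaction frequency to size the compaction cluster before the small files problem sizes it for you.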

What Changed Over Time

  • Hudi (2016, Uber) pioneered record-level upserts on data lakes, initially on HDFS, then S3.
  • Paimon (2022, originally Flink Table Store) brought LSM-tree architecture purpose-built for object storage streaming.
  • Hudi 1.0 (2024) introduced a major rewrite with improved indexing and non-blocking compaction.
  • Hudi 1.1 (late 2025) added Flink-native writers, directly competing with Paimon on its home turf.
  • The convergence trend: Hudi is becoming more streaming-capable, Paimon is gaining batch engine support. The gap is narrowing.

Sources