Guide 17

Real-Time Lakehouse: Paimon vs. Hudi

Problem Framing

Building a real-time lakehouse on S3 requires a table format optimized for high-frequency writes and low-latency reads. Apache Hudi and Apache Paimon are the two primary contenders. Hudi has years of production maturity; Paimon brings a fundamentally different architecture (LSM-tree vs. Hudi's log-based merge-on-read). Engineers implementing CDC pipelines or streaming analytics on S3 must understand the architectural differences, performance characteristics, and ecosystem trade-offs to make an informed choice.

Relevant Nodes

  • Topics: Table Formats, Lakehouse
  • Technologies: Apache Paimon, Apache Hudi, Apache Iceberg, Apache Flink, Apache Spark, Flink CDC, Estuary Flow
  • Standards: S3 API, Apache Parquet
  • Architectures: Lakehouse Architecture, LSM-tree on S3, Deletion Vector
  • Pain Points: Small Files Problem, Cold Scan Latency

Decision Path

  1. Understand the architectural difference:

    • Hudi (MoR — Merge-on-Read): Writes log files alongside base data files. Reads merge logs at query time. Compaction merges logs into base files periodically.
    • Paimon (LSM-tree): Writes sorted runs to S3 as immutable files across multiple levels. Compaction merges levels. Reads scan the active levels.
    • Both avoid copy-on-write for updates. The difference is in how they organize and compact the incremental data.
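The two read paths can be contrasted with a toy sketch. This is illustrative Python, not real Hudi or Paimon code: records are key-value pairs, and newer writes win on key conflicts.

```python
# Toy contrast of the two read paths (illustrative only, not real Hudi/Paimon code).

def mor_read(base_file, log_files):
    """Hudi-style merge-on-read: apply log records on top of the base file."""
    merged = dict(base_file)           # base Parquet data
    for log in log_files:              # log files in write order, oldest first
        merged.update(log)             # each log record overrides earlier state
    return merged

def lsm_read(levels):
    """Paimon-style LSM read: newer (shallower) levels shadow older ones."""
    merged = {}
    for level in reversed(levels):     # start at the deepest (oldest) level
        merged.update(level)           # shallower levels override on key clash
    return merged

base = {"k1": "v1", "k2": "v2"}
logs = [{"k2": "v2a"}, {"k3": "v3"}]
levels = [{"k2": "v2a", "k3": "v3"},   # level 0: newest sorted run
          {"k1": "v1", "k2": "v2"}]    # level 1: older, compacted data
assert mor_read(base, logs) == lsm_read(levels)  # same logical table state
```

Both reach the same logical table; the formats differ in where the merge work happens (query-time log replay vs. level-ordered scan) and in how compaction reshapes the incremental files.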
  2. Choose Apache Paimon when:

    • Your primary compute engine is Apache Flink. Paimon was originally Flink Table Store, and the Flink integration is first-class.
    • You need minute-level data visibility for CDC workloads on S3.
    • You want a streaming-first architecture where batch is the secondary use case.
    • Your workload is append-heavy with moderate update rates.
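For the CDC case, the core semantics are primary-key upserts applied in stream order. A minimal sketch, assuming Flink-style changelog op codes (`+I` insert, `+U` update, `-D` delete) and last-write-wins merging; this is not Paimon's API, just the merge behavior a primary-key table provides:

```python
# Hypothetical sketch of primary-key merge semantics for a CDC changelog:
# each change is (op, key, value), applied in arrival order.

def apply_changelog(changes):
    table = {}
    for op, key, value in changes:
        if op == "-D":
            table.pop(key, None)       # delete removes the row
        else:                          # "+I" insert and "+U" update both upsert
            table[key] = value
    return table

cdc = [("+I", 1, "alice"), ("+I", 2, "bob"),
       ("+U", 1, "alicia"), ("-D", 2, None)]
assert apply_changelog(cdc) == {1: "alicia"}
```

Minute-level visibility then comes down to how often these merged results are committed as snapshots on S3.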
  3. Choose Apache Hudi when:

    • You need broad engine support. Hudi works with Spark, Flink, Trino, Presto, and more.
    • You have existing Hudi infrastructure and expertise (Hudi has years of production history at Uber, Robinhood, etc.).
    • You need record-level upserts with efficient index lookups.
    • You want Flink support without switching formats: Hudi 1.1's Flink-native writer narrows the performance gap with Paimon.
  4. Consider Iceberg as an alternative:

    • Iceberg is not streaming-first, but with its Flink connector and deletion vectors it can handle moderate update workloads.
    • If your primary workload is batch with occasional streaming, Iceberg's broader ecosystem may outweigh Paimon's streaming optimization.
    • XTable or UniForm can bridge: write with Paimon/Hudi, read as Iceberg.
  5. Compaction economics:

    • Both formats require compaction. On S3, compaction means reading files, merging, and writing new files — consuming compute and I/O.
    • Paimon's LSM compaction is more predictable (level-based), while Hudi's compaction is workload-dependent (based on log file accumulation).
    • Budget compaction compute as a first-class operational concern, not an afterthought.
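A back-of-envelope model makes the budgeting concrete. The numbers below are assumptions to adjust for your workload, not benchmarks: each compaction cycle reads its input files, merges them, and writes new files, so the data moved is roughly read bytes plus write bytes (plus per-file S3 request costs, ignored here).

```python
# Back-of-envelope compaction cost model (assumed numbers, adjust per workload).

def compaction_cost_gb(input_files, avg_file_mb, write_amplification=1.0):
    read_gb = input_files * avg_file_mb / 1024
    write_gb = read_gb * write_amplification  # merged output, often ~= input size
    return read_gb + write_gb

# e.g. merging 200 small 16 MB files per cycle moves ~6.25 GB through compute
cost = compaction_cost_gb(input_files=200, avg_file_mb=16)
assert round(cost, 2) == 6.25
```

Run this against your expected small-file arrival rate and compaction frequency to size the compaction cluster before the small files problem sizes it for you.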

What Changed Over Time

  • Hudi (2016, Uber) pioneered record-level upserts on data lakes, initially on HDFS, then S3.
  • Paimon (2022, originally Flink Table Store) brought LSM-tree architecture purpose-built for object storage streaming.
  • Hudi 1.0 (2024) introduced a major rewrite with improved indexing and non-blocking compaction.
  • Hudi 1.1 (late 2025) added Flink-native writers, directly competing with Paimon on its home turf.
  • The convergence trend: Hudi is becoming more streaming-capable, Paimon is gaining batch engine support. The gap is narrowing.

Sources