Apache Paimon
An Apache top-level streaming lakehouse table format built on LSM-tree architecture, designed for high-frequency real-time writes and sub-minute data visibility on object storage.
Summary
Iceberg and Delta are batch-first designs with streaming added later, whereas Paimon is streaming-first. Its LSM-tree layout on S3 gives minute-level data visibility for CDC workloads, which makes it a natural fit for Flink-based real-time pipelines that write to object storage.
- Paimon's strength is Flink integration. Spark support is improving but lags significantly behind Flink in maturity and performance.
- Higher metadata complexity than Iceberg. The LSM-tree compaction process adds operational overhead that batch-oriented formats do not have.
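The compaction overhead mentioned above can be illustrated with a toy model. This is a minimal sketch, not Paimon's actual compaction strategy: the `ToyLsmTable` class and its `compaction_trigger` threshold are hypothetical names chosen for illustration. Each flush adds one sorted run, and once enough runs pile up they are all merged, which rewrites rows that were already written once.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLsmTable:
    """Toy model: each flush adds one sorted run; compaction merges all runs."""

    compaction_trigger: int = 5  # hypothetical threshold, not a real Paimon option
    runs: list = field(default_factory=list)  # each run is a dict: key -> value
    rows_flushed: int = 0    # rows written by ingestion
    rows_compacted: int = 0  # rows rewritten by compaction

    def flush(self, batch: dict) -> None:
        """Write one batch as a new sorted run, then compact if too many runs exist."""
        self.runs.append(dict(sorted(batch.items())))
        self.rows_flushed += len(batch)
        if len(self.runs) >= self.compaction_trigger:
            self.compact()

    def compact(self) -> None:
        """Merge all runs into one; newer runs overwrite older values per key."""
        merged = {}
        for run in self.runs:  # oldest first, so later runs win
            merged.update(run)
        self.rows_compacted += len(merged)
        self.runs = [dict(sorted(merged.items()))]

    def write_amplification(self) -> float:
        """Total rows written (flush + compaction) per row ingested."""
        if self.rows_flushed == 0:
            return 0.0
        return (self.rows_flushed + self.rows_compacted) / self.rows_flushed


table = ToyLsmTable(compaction_trigger=3)
for i in range(6):
    # Each batch updates the same 4 keys, like a hot CDC stream.
    table.flush({f"k{j}": i for j in range(4)})

print(len(table.runs), table.rows_flushed, table.rows_compacted)  # 2 24 8
print(round(table.write_amplification(), 2))                       # 1.33
```

In this model a lower trigger keeps fewer runs for readers to merge but rewrites data more often, while a higher trigger does the opposite. Tuning that trade-off is the kind of operational work a batch-oriented format does not ask for.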
- depends_on: S3 API (stores data as objects on S3)
- depends_on: Apache Parquet (data file format)
- enables: Lakehouse Architecture (streaming-first lakehouse design)
- competes_with: Apache Hudi (both target real-time ingestion workloads)
Definition
An Apache top-level table format built on LSM-tree (Log-Structured Merge-tree) architecture, designed for high-frequency streaming writes and real-time analytics on object storage. Originally developed as Flink Table Store.
Traditional lakehouse table formats (Iceberg, Delta, Hudi) were designed primarily for batch workloads with streaming bolted on. Paimon is built streaming-first, using LSM-trees on S3 to enable minute-level data visibility for CDC and real-time analytics without the write amplification penalty of copy-on-write.
Typical use cases: real-time CDC ingestion into the lakehouse, streaming analytics with minute-level visibility, and high-frequency update workloads on S3.
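To make the contrast with copy-on-write concrete, here is a minimal sketch of merge-on-read over sorted runs, assuming a primary-key table where every change carries a monotonically increasing sequence number. The record layout, the `"upsert"`/`"delete"` operation names, and the `merge_on_read` function are illustrative assumptions, not Paimon's file format or API. Each CDC change is appended as part of a small new run instead of rewriting the file that holds the old row, and readers resolve the latest version per key.

```python
from typing import Optional

# A changelog record: (sequence_number, primary_key, op, value).
# op is "upsert" or "delete"; these names are illustrative only.
Record = tuple[int, str, str, Optional[dict]]


def merge_on_read(sorted_runs: list[list[Record]]) -> dict[str, dict]:
    """Merge several runs into the current table state.

    For each primary key, the record with the highest sequence number wins;
    if that record is a delete, the key is absent from the result.
    """
    latest: dict[str, Record] = {}
    for run in sorted_runs:
        for rec in run:
            seq, key, _op, _value = rec
            if key not in latest or seq > latest[key][0]:
                latest[key] = rec
    return {
        key: value
        for key, (_seq, _key, op, value) in sorted(latest.items())
        if op == "upsert" and value is not None
    }


# Three small runs, each written by one streaming checkpoint.
run_1 = [(1, "order-1", "upsert", {"status": "created"}),
         (2, "order-2", "upsert", {"status": "created"})]
run_2 = [(3, "order-1", "upsert", {"status": "paid"})]
run_3 = [(4, "order-2", "delete", None)]

state = merge_on_read([run_1, run_2, run_3])
print(state)  # {'order-1': {'status': 'paid'}}
```

The update to `order-1` costs one appended record here, whereas a copy-on-write layout would rewrite the whole data file containing that key. The price is paid at read time, where runs have to be merged.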
Recent developments
- 40 million rows/sec at ByteDance, TikTok, and Alibaba: Paimon at hyperscale streaming. Per Alibaba Cloud's 2025 Paimon writeup, individual Paimon tables in production at ByteDance, TikTok, and Alibaba Group sustain 40 million rows per second of streaming writes and cut end-to-end CDC latency from hours to seconds. The LSM-tree-on-Parquet design mimics the write path of LSM-based storage engines (such as RocksDB and ClickHouse's MergeTree family) while retaining object-storage economics, a combination no other open table format ships.
- Multimodal in one pipeline — native Lance file integration. Paimon now integrates the Lance file format (the LanceDB columnar format optimized for ML blob access patterns) directly, so vectors, text corpora, and binary image data co-reside in the same streaming table without forking the pipeline into separate storage silos. For multimodal AI lakehouses, this is the architectural primitive that finally collapses the "structured + vector + blob" three-silo problem into one writeable lakehouse table.
- Iceberg V3 deletion-vector bridge — analytical engines read Paimon as Iceberg. Per the same Alibaba writeup, Paimon now uses Iceberg V3 deletion vectors to automatically generate Iceberg-compatible snapshots from its LSM layer. This means Trino and StarRocks can read the same data Paimon is streaming into, without an additional ETL hop. Result: Paimon owns the ingestion side; Iceberg owns the analytical side; one physical layout serves both — the foundation of the Chinese-cloud "real-time AI lakehouse" architecture exported via Aliyun OSS.
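The deletion-vector idea in the last item can be sketched as follows. This is a simplified illustration under stated assumptions: real deletion vectors are compact bitmaps stored alongside table metadata, while this sketch uses plain Python sets, and the file names and the `scan` function are hypothetical. The point it shows is that a reader only needs to skip marked row positions per file, with no key-based merge across files.

```python
def scan(files: dict[str, list[dict]],
         deletion_vectors: dict[str, set[int]]) -> list[dict]:
    """Read every data file, skipping row positions marked as deleted."""
    visible = []
    for name, rows in files.items():
        deleted = deletion_vectors.get(name, set())
        visible.extend(row for pos, row in enumerate(rows) if pos not in deleted)
    return visible


# Two data files: the second one carries a newer version of id=1.
files = {
    "file-a.parquet": [{"id": 1, "status": "created"},
                       {"id": 2, "status": "created"}],
    "file-b.parquet": [{"id": 1, "status": "paid"}],
}
# Position 0 of file-a is superseded, so it is marked as deleted.
deletion_vectors = {"file-a.parquet": {0}}

rows = scan(files, deletion_vectors)
print(rows)
# [{'id': 2, 'status': 'created'}, {'id': 1, 'status': 'paid'}]
```

Because the stale row is masked by position rather than resolved by comparing keys, an engine that understands deletion vectors can read the files as an ordinary snapshot.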
Connections 13
Outbound 8: scoped_to 2, depends_on 2, competes_with 1, augments 1
Inbound 5: competes_with 1, enables 2, reads_from 1, depends_on 1
Resources 3
Official Apache Paimon documentation covering LSM-tree architecture, streaming ingestion, and Flink integration.
Source repository with architecture docs, performance benchmarks, and connector guides.
Technical overview of Paimon's streaming lakehouse design and how it compares to Hudi for CDC workloads.