Apache Paimon

A streaming lakehouse table format and Apache top-level project, built on an LSM-tree architecture and designed for high-frequency real-time writes and sub-minute data visibility on object storage.

Summary

Where it fits

Iceberg and Delta Lake are batch-first formats with streaming support added later; Paimon is streaming-first. Its LSM-tree design on S3 gives CDC workloads minute-level data visibility, which makes it a strong fit for Flink-based real-time pipelines that write to object storage.

Misconceptions / Traps

Paimon's strength is its Flink integration. Spark support is improving but still lags well behind Flink in maturity and performance.
Paimon has higher metadata complexity than Iceberg. LSM-tree compaction adds operational overhead that batch-oriented formats do not have.
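
To make the compaction overhead concrete, here is a minimal Python sketch of what merging sorted runs involves. It assumes a simplified model in which each run is a list of (key, sequence, value) tuples sorted by (key, sequence), and the record with the highest sequence number wins. The function name `compact` and the tuple layout are illustrative only; they are not Paimon's API or on-disk format.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def compact(runs):
    """Merge sorted runs into one run, keeping the newest record per key.

    Each run is a list of (key, sequence, value) tuples sorted by
    (key, sequence). Toy model of LSM compaction, not Paimon's code.
    """
    # heapq.merge streams the already-sorted runs in (key, sequence) order.
    merged = heapq.merge(*runs, key=itemgetter(0, 1))
    result = []
    for _, group in groupby(merged, key=itemgetter(0)):
        # Within a key, records arrive in ascending sequence; keep the last.
        newest = None
        for newest in group:
            pass
        result.append(newest)
    return result

old_run = [("a", 1, "v1"), ("b", 2, "v1"), ("c", 3, "v1")]
new_run = [("b", 4, "v2"), ("d", 5, "v1")]
compacted = compact([old_run, new_run])
```

Every compaction of this kind reads and rewrites all of its input runs, which is the background work an operator has to budget for and that a purely batch-oriented format does not schedule in the same way.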

Key Connections

depends_on S3 API — stores data as objects on S3
depends_on Apache Parquet — data file format
enables Lakehouse Architecture — streaming-first lakehouse design
competes_with Apache Hudi — both target real-time ingestion workloads

Definition

What it is

A table format and Apache top-level project built on an LSM-tree (log-structured merge-tree) architecture, designed for high-frequency streaming writes and real-time analytics on object storage. Originally developed as Flink Table Store.

Why it exists

Established lakehouse table formats such as Iceberg and Delta Lake were designed primarily for batch workloads, with streaming support added later. Paimon is built streaming-first: it uses LSM-trees on S3 to give CDC and real-time analytics minute-level data visibility without the write amplification penalty of copy-on-write.
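
A back-of-the-envelope sketch of that write amplification gap, in Python. The file size, row size, and number of touched files are hypothetical values chosen for illustration, and the model is deliberately crude: it counts only the bytes written at ingest time and ignores the compaction work an LSM design pays later.

```python
def copy_on_write_bytes(files_touched, file_size_bytes):
    """Bytes written when every touched data file is rewritten in full."""
    return files_touched * file_size_bytes

def lsm_append_bytes(rows_changed, row_size_bytes):
    """Bytes written when changes are appended as a new sorted run."""
    return rows_changed * row_size_bytes

# Hypothetical workload: 1,000 updated rows of 200 bytes each,
# scattered across 50 data files of 128 MiB.
MIB = 1024 * 1024
cow = copy_on_write_bytes(files_touched=50, file_size_bytes=128 * MIB)
lsm = lsm_append_bytes(rows_changed=1_000, row_size_bytes=200)
amplification = cow / lsm
```

Under these assumptions the copy-on-write path writes several orders of magnitude more bytes per commit. The LSM path does not make that cost disappear; it defers it to compaction, where it can be batched and amortized across many commits.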

Primary use cases

Real-time CDC ingestion into the lakehouse, streaming analytics with minute-level visibility, and high-frequency update workloads on S3.
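
As an illustration of the CDC use case, the following Python sketch applies a changelog to a primary-key table. It assumes events tagged with Flink-style row kinds (+I insert, -U update-before, +U update-after, -D delete) and models the table as a plain dict. The function `apply_changelog` and the sample rows are hypothetical and only show the upsert semantics, not how Paimon stores or merges data.

```python
def apply_changelog(table, events):
    """Apply CDC events to a dict keyed by primary key.

    Each event is (row_kind, key, row). Row kinds follow Flink's
    changelog convention: +I insert, -U update-before,
    +U update-after, -D delete.
    """
    for kind, key, row in events:
        if kind in ("+I", "+U"):
            table[key] = row        # upsert the latest image of the row
        elif kind == "-D":
            table.pop(key, None)    # delete; ignore keys never seen
        elif kind == "-U":
            continue                # the following +U carries the new image
        else:
            raise ValueError(f"unknown row kind: {kind}")
    return table

events = [
    ("+I", 1, {"name": "ada", "city": "London"}),
    ("+I", 2, {"name": "lin", "city": "Hangzhou"}),
    ("-U", 1, {"name": "ada", "city": "London"}),
    ("+U", 1, {"name": "ada", "city": "Paris"}),
    ("-D", 2, {"name": "lin", "city": "Hangzhou"}),
]
state = apply_changelog({}, events)
```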

Recent developments

Latest signals

40 million rows/sec at ByteDance, TikTok, and Alibaba: Paimon at hyperscale streaming. Per Alibaba Cloud's 2025 Paimon writeup, individual Paimon tables in production at ByteDance, TikTok, and Alibaba Group sustain 40 million rows per second of streaming writes and cut end-to-end CDC latency from hours to seconds. The LSM-tree-on-Parquet design mimics the write characteristics of LSM-style storage engines (RocksDB, ClickHouse's MergeTree) while retaining object-storage economics, a combination that sets it apart from batch-first table formats.
Multimodal in one pipeline: native Lance file integration. Paimon now integrates the Lance file format (the columnar format from LanceDB, optimized for ML blob access patterns) directly, so vectors, text corpora, and binary image data can live in the same streaming table instead of forking the pipeline into separate storage silos. For multimodal AI lakehouses, this collapses the "structured + vector + blob" three-silo problem into one writable lakehouse table.
Iceberg V3 deletion-vector bridge: analytical engines read Paimon as Iceberg. Per the same Alibaba writeup, Paimon now uses Iceberg V3 deletion vectors to automatically generate Iceberg-compatible snapshots from its LSM layer, so Trino and StarRocks can read the same data Paimon is streaming into without an additional ETL hop. Paimon owns the ingestion side, Iceberg owns the analytical side, and one physical layout serves both. This is the foundation of the Chinese-cloud "real-time AI lakehouse" architecture exported via Aliyun OSS.
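
The idea behind a deletion vector can be sketched in a few lines of Python: deleted rows are recorded as positions within a base file, and a reader filters them out at scan time instead of rewriting the file. The sketch below uses a plain set of row positions; real implementations use compressed bitmaps, and the function and variable names here are illustrative assumptions rather than any engine's API.

```python
def read_with_deletion_vector(rows, deleted_positions):
    """Yield rows from a base file whose positions are not marked deleted."""
    for position, row in enumerate(rows):
        if position not in deleted_positions:
            yield row

# Hypothetical base file with five rows; positions 1 and 3 were deleted
# by later writes, so only the deletion vector changes, not the file.
base_file = ["row-0", "row-1", "row-2", "row-3", "row-4"]
deletion_vector = {1, 3}
visible = list(read_with_deletion_vector(base_file, deletion_vector))
```

Because the base file is immutable and only the small position set changes, any engine that understands the same deletion-vector format can read a consistent snapshot without a separate copy of the data.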