Architecture

CDC into Lakehouse

The architecture pattern of capturing row-level changes (inserts, updates, deletes) from operational databases and applying them to tables in an S3-based lakehouse, maintaining a near-real-time replica of transactional data.


Summary


Where it fits

CDC into Lakehouse is the bridge between OLTP and OLAP worlds. It enables analytics on operational data without impacting source databases, using tools like Debezium for capture, Kafka/Redpanda for transport, and Flink/Spark/Hudi for applying changes to Iceberg or Delta tables on S3.
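The unit of work flowing through this pipeline is the change event Debezium emits per modified row. A minimal sketch of its envelope (the "op"/"before"/"after" field names follow Debezium's documented event format; the payload values here are invented for illustration, and real events also carry "source" metadata such as LSN, table name, and transaction markers):

```python
import json

# Simplified Debezium change event for an UPDATE on an orders row.
# op codes: "c" = create, "u" = update, "d" = delete, "r" = snapshot read.
raw = json.dumps({
    "op": "u",
    "before": {"id": 42, "status": "pending"},   # row image prior to the change
    "after":  {"id": 42, "status": "shipped"},   # row image after the change
    "ts_ms": 1700000000000,                      # connector processing timestamp
})

event = json.loads(raw)
print(event["op"], event["after"]["status"])     # u shipped
```

Whatever applies these events to Iceberg or Delta tables (Flink, Spark, or a managed sink connector) keys on the primary key inside "after" (or "before", for deletes) to decide which row to upsert or remove.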

Misconceptions / Traps
  • CDC replication is not instantaneous. End-to-end latency includes WAL read delay, Kafka transit, and sink write batching. "Near-real-time" typically means minutes, not milliseconds.
  • Handling deletes in a lakehouse requires table formats that support row-level deletes (Iceberg position/equality delete files, Hudi merge-on-read tables). Append-only designs cannot faithfully replicate CDC streams.
  • Schema changes in the source database must be handled by every component in the CDC pipeline. A missing column in the Kafka schema registry or a rejected evolution in Iceberg will break the pipeline.
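The delete trap above is easy to demonstrate in a few lines: replaying a CDC stream into a table keyed by primary key removes rows on delete events, while an append-only sink can only accumulate them. An in-memory sketch (event shape mirrors the Debezium op/before/after envelope; column names are invented):

```python
events = [
    {"op": "c", "before": None, "after": {"id": 1, "v": "a"}},
    {"op": "c", "before": None, "after": {"id": 2, "v": "b"}},
    {"op": "u", "before": {"id": 1, "v": "a"}, "after": {"id": 1, "v": "a2"}},
    {"op": "d", "before": {"id": 2, "v": "b"}, "after": None},
]

# Faithful replica: apply each event by primary key, honoring deletes.
replica = {}
for e in events:
    if e["op"] == "d":
        replica.pop(e["before"]["id"])
    else:
        replica[e["after"]["id"]] = e["after"]

# Append-only sink: can only add rows, so the delete is silently lost.
append_only = [e["after"] for e in events if e["after"] is not None]

print(len(replica))      # 1 row remains (id=2 was deleted)
print(len(append_only))  # 3 rows, diverging from the source table
```

This is exactly the gap that Iceberg delete files and Hudi merge-on-read close at the table-format level: they let the sink express "remove this row" without rewriting whole data files.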
Key Connections
  • depends_on Debezium — the dominant open-source CDC capture tool
  • depends_on Kafka Tiered Storage, Redpanda — transport layer for CDC events
  • scoped_to Lakehouse, S3 — target is S3-based lakehouse tables
  • enables Apache Hudi, Apache Iceberg — table formats that support upserts

Definition

What it is

An architecture pattern that captures row-level changes from source databases using change data capture tools (Debezium, Flink CDC) and applies them to Iceberg, Delta, or Hudi tables on S3, maintaining a near-real-time replica in the lakehouse.

Why it exists

Batch ETL from OLTP databases to S3 introduces hours of latency and requires full-table scans. CDC into Lakehouse provides continuous, incremental replication that keeps lakehouse tables current with source databases at a fraction of the compute cost.
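The cost argument can be made concrete with back-of-envelope arithmetic: a full-table batch refresh rescans every row, while CDC moves only the rows that actually changed. The table size and change rate below are illustrative assumptions, not benchmarks:

```python
table_rows = 100_000_000      # rows in the source table (assumed)
daily_change_rate = 0.02      # 2% of rows change per day (assumed)

batch_rows_scanned = table_rows                       # full-table scan per refresh
cdc_rows_moved = int(table_rows * daily_change_rate)  # only changed rows flow through CDC

print(batch_rows_scanned // cdc_rows_moved)           # 50x fewer rows moved
```

The ratio scales inversely with the change rate, which is why CDC pays off most on large, slowly changing tables and least on small tables that churn completely every day.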

Primary use cases

Real-time database replication to S3 lakehouses, streaming upserts to Iceberg tables, operational analytics on fresh data.
