CDC into Lakehouse
The architecture pattern of capturing row-level changes (inserts, updates, deletes) from operational databases and applying them to tables in an S3-based lakehouse, maintaining a near-real-time replica of transactional data.
Summary
CDC into Lakehouse is the bridge between OLTP and OLAP worlds. It enables analytics on operational data without impacting source databases, using tools like Debezium for capture, Kafka/Redpanda for transport, and Flink/Spark/Hudi for applying changes to Iceberg or Delta tables on S3.
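On the capture side, Debezium typically runs as a Kafka Connect source connector. A hedged sketch of a Postgres connector registration is shown below; the connector name, hostnames, credentials, topic prefix, and table list are all placeholders, and property names follow Debezium 2.x conventions:

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "orders-db.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "changeme",
    "database.dbname": "orders",
    "topic.prefix": "oltp",
    "table.include.list": "public.orders,public.order_items",
    "slot.name": "lakehouse_cdc"
  }
}
```

Each included table then emits change events to its own topic (here `oltp.public.orders`), which the Flink/Spark sink consumes and applies to the lakehouse table.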
- CDC replication is not instantaneous. End-to-end latency includes WAL read delay, Kafka transit, and sink write batching. "Near-real-time" typically means minutes, not milliseconds.
- Handling deletes in a lakehouse requires table formats that support row-level deletes (Iceberg position/equality deletes, Hudi's merge-on-read (MOR) tables). Append-only designs cannot faithfully replicate CDC streams.
- Schema changes in the source database must be handled by every component in the CDC pipeline. A missing column in the Kafka schema registry or a rejected evolution in Iceberg will break the pipeline.
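The delete-handling point above can be made concrete. The sketch below applies Debezium-style change events (the `op`, `before`, `after`, and `ts_ms` envelope fields are Debezium's) to a keyed in-memory replica standing in for an Iceberg/Hudi table; note that the `d` event removes a row, which an append-only sink has no way to express:

```python
# Apply Debezium-style change events to a keyed replica.
# The dict stands in for a lakehouse table that supports row-level deletes.

def apply_event(table: dict, event: dict) -> None:
    """Apply one change event. op: 'c'=create, 'u'=update, 'd'=delete, 'r'=snapshot read."""
    op = event["op"]
    if op in ("c", "u", "r"):
        # Creates, updates, and snapshot reads all carry the new row in "after";
        # keying by primary key makes them idempotent upserts.
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":
        # Deletes carry only the old row in "before".
        table.pop(event["before"]["id"], None)

events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "alice"}, "ts_ms": 1000},
    {"op": "u", "before": {"id": 1, "name": "alice"}, "after": {"id": 1, "name": "alicia"}, "ts_ms": 2000},
    {"op": "c", "before": None, "after": {"id": 2, "name": "bob"}, "ts_ms": 3000},
    {"op": "d", "before": {"id": 2, "name": "bob"}, "after": None, "ts_ms": 4000},
]

replica = {}
for e in events:
    apply_event(replica, e)

print(replica)  # {1: {'id': 1, 'name': 'alicia'}}
```

A real sink does the same thing durably: upserts become merge-on-read log records or copy-on-write rewrites, and deletes become delete files or tombstones in the table format.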
- depends_on: Debezium — the dominant open-source CDC capture tool
- depends_on: Kafka Tiered Storage, Redpanda — transport layer for CDC events
- scoped_to: Lakehouse, S3 — target is S3-based lakehouse tables
- enables: Apache Hudi, Apache Iceberg — table formats that support upserts
Definition
An architecture pattern that captures row-level changes from source databases using change data capture tools (Debezium, Flink CDC) and applies them to Iceberg, Delta, or Hudi tables on S3, maintaining a near-real-time replica in the lakehouse.
Batch ETL from OLTP databases to S3 introduces hours of latency and requires full-table scans. CDC into Lakehouse provides continuous, incremental replication that keeps lakehouse tables current with source databases at a fraction of the compute cost.
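Part of that cost saving comes from compacting each micro-batch before touching the table: the sink keeps only the latest change per key (ordered by the source log position), so the table sees one write per key rather than one per event. A minimal sketch, assuming a Debezium-like envelope with a hypothetical `source.lsn` field as the ordering key:

```python
# Compact a micro-batch of change events: keep only the newest event per
# primary key, ordered by the source log sequence number (LSN), before
# merging into the lakehouse table.

def compact_batch(events: list) -> list:
    latest = {}
    for e in events:
        # Deletes have after=None, so fall back to "before" for the key.
        key = (e["after"] or e["before"])["id"]
        if key not in latest or e["source"]["lsn"] > latest[key]["source"]["lsn"]:
            latest[key] = e
    return list(latest.values())

batch = [
    {"op": "c", "before": None, "after": {"id": 7, "qty": 1}, "source": {"lsn": 100}},
    {"op": "u", "before": {"id": 7, "qty": 1}, "after": {"id": 7, "qty": 2}, "source": {"lsn": 101}},
    {"op": "u", "before": {"id": 7, "qty": 2}, "after": {"id": 7, "qty": 5}, "source": {"lsn": 102}},
    {"op": "c", "before": None, "after": {"id": 8, "qty": 9}, "source": {"lsn": 103}},
]

compacted = compact_batch(batch)
# Two rows survive: the final state of id=7 and the insert of id=8.
```

Engines like Spark and Flink perform this dedup-then-merge step internally when applying CDC streams; the ordering column must come from the source log (LSN, binlog position, SCN), not arrival time, or out-of-order delivery can resurrect stale rows.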
Typical uses: real-time database replication to S3 lakehouses, streaming upserts to Iceberg tables, and operational analytics on fresh data.
Resources
Debezium is the leading open-source CDC engine for capturing database changes and streaming them into lakehouse destinations on S3.
Hudi DeltaStreamer documentation for ingesting CDC change streams into Hudi tables on S3 with merge-on-read semantics.
Apache Flink's streaming ETL use case documentation covering real-time CDC ingestion into Iceberg and Hudi tables on object storage.