Architecture

CDC into Lakehouse

The architecture pattern of capturing row-level changes (inserts, updates, deletes) from operational databases and applying them to tables in an S3-based lakehouse, maintaining a near-real-time replica of transactional data.


Summary


Where it fits

CDC into Lakehouse is the bridge between OLTP and OLAP worlds. It enables analytics on operational data without impacting source databases, using tools like Debezium for capture, Kafka/Redpanda for transport, and Flink/Spark/Hudi for applying changes to Iceberg or Delta tables on S3.
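The unit of work flowing through this pipeline is the change event Debezium emits per modified row. A minimal sketch of its envelope (the "op"/"before"/"after" field names follow Debezium's documented event format; the payload values here are invented for illustration, and real events also carry "source" metadata such as LSN, table name, and transaction markers):

```python
import json

# Simplified Debezium change event for an UPDATE on an orders row.
# op codes: "c" = create, "u" = update, "d" = delete, "r" = snapshot read.
raw = json.dumps({
    "op": "u",
    "before": {"id": 42, "status": "pending"},   # row image prior to the change
    "after":  {"id": 42, "status": "shipped"},   # row image after the change
    "ts_ms": 1700000000000,                      # connector processing timestamp
})

event = json.loads(raw)
print(event["op"], event["after"]["status"])     # u shipped
```

Whatever applies these events to Iceberg or Delta tables (Flink, Spark, or a managed sink connector) keys on the primary key inside "after" (or "before", for deletes) to decide which row to upsert or remove.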

Misconceptions / Traps
  • CDC replication is not instantaneous. End-to-end latency includes WAL read delay, Kafka transit, and sink write batching. "Near-real-time" typically means minutes, not milliseconds.
  • Handling deletes in a lakehouse requires table formats that support row-level deletes (Iceberg position/equality delete files, Hudi merge-on-read tables). Append-only designs cannot faithfully replicate CDC streams.
  • Schema changes in the source database must be handled by every component in the CDC pipeline. A missing column in the Kafka schema registry or a rejected evolution in Iceberg will break the pipeline.
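The delete trap above is easy to demonstrate in a few lines: replaying a CDC stream into a table keyed by primary key removes rows on delete events, while an append-only sink can only accumulate them. An in-memory sketch (event shape mirrors the Debezium op/before/after envelope; column names are invented):

```python
events = [
    {"op": "c", "before": None, "after": {"id": 1, "v": "a"}},
    {"op": "c", "before": None, "after": {"id": 2, "v": "b"}},
    {"op": "u", "before": {"id": 1, "v": "a"}, "after": {"id": 1, "v": "a2"}},
    {"op": "d", "before": {"id": 2, "v": "b"}, "after": None},
]

# Faithful replica: apply each event by primary key, honoring deletes.
replica = {}
for e in events:
    if e["op"] == "d":
        replica.pop(e["before"]["id"])
    else:
        replica[e["after"]["id"]] = e["after"]

# Append-only sink: can only add rows, so the delete is silently lost.
append_only = [e["after"] for e in events if e["after"] is not None]

print(len(replica))      # 1 row remains (id=2 was deleted)
print(len(append_only))  # 3 rows, diverging from the source table
```

This is exactly the gap that Iceberg delete files and Hudi merge-on-read close at the table-format level: they let the sink express "remove this row" without rewriting whole data files.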
Key Connections
  • depends_on Debezium — the dominant open-source CDC capture tool
  • depends_on Kafka Tiered Storage, Redpanda — transport layer for CDC events
  • scoped_to Lakehouse, S3 — target is S3-based lakehouse tables
  • enables Apache Hudi, Apache Iceberg — table formats that support upserts

Definition

What it is

An architecture pattern that captures row-level changes from source databases using change data capture tools (Debezium, Flink CDC) and applies them to Iceberg, Delta, or Hudi tables on S3, maintaining a near-real-time replica in the lakehouse.

Why it exists

Batch ETL from OLTP databases to S3 introduces hours of latency and requires full-table scans. CDC into Lakehouse provides continuous, incremental replication that keeps lakehouse tables current with source databases at a fraction of the compute cost.
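The cost argument can be made concrete with back-of-envelope arithmetic: a full-table batch refresh rescans every row, while CDC moves only the rows that actually changed. The table size and change rate below are illustrative assumptions, not benchmarks:

```python
table_rows = 100_000_000      # rows in the source table (assumed)
daily_change_rate = 0.02      # 2% of rows change per day (assumed)

batch_rows_scanned = table_rows                       # full-table scan per refresh
cdc_rows_moved = int(table_rows * daily_change_rate)  # only changed rows flow through CDC

print(batch_rows_scanned // cdc_rows_moved)           # 50x fewer rows moved
```

The ratio scales inversely with the change rate, which is why CDC pays off most on large, slowly changing tables and least on small tables that churn completely every day.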

Primary use cases

Real-time database replication to S3 lakehouses, streaming upserts to Iceberg tables, operational analytics on fresh data.
