Debezium
An open-source distributed platform for change data capture (CDC) that streams row-level changes from databases (PostgreSQL, MySQL, MongoDB, and others) into event streams, enabling real-time ingestion into S3-based lakehouses.
Summary
Debezium sits at the ingestion boundary between operational databases and the S3 data lake. It captures INSERT, UPDATE, and DELETE events from database transaction logs and publishes them to Kafka, from which downstream connectors write to S3 in Parquet or Iceberg format.
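To make the event flow concrete, here is a minimal sketch of how a downstream consumer might fold a stream of change events into current table state. It assumes the events have already been deserialized into dictionaries carrying the `op`, `before`, and `after` fields of Debezium's change event envelope; the `apply_change` helper, the `id` key column, and the sample rows are hypothetical and exist only for illustration.

```python
from typing import Any

Row = dict[str, Any]


def apply_change(table: dict[Any, Row], event: dict[str, Any], key: str = "id") -> None:
    """Apply one change event to an in-memory table keyed by the `key` column."""
    op = event["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if op in ("c", "r", "u"):
        row = event["after"]        # new row image
        table[row[key]] = row       # upsert by primary key
    elif op == "d":
        row = event["before"]       # last known row image (assumed to be present)
        table.pop(row[key], None)   # remove the row if we have it
    else:
        raise ValueError(f"unhandled op: {op!r}")


# Hypothetical sample events, in commit order.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "b@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "c@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "c@example.com"}, "after": None},
]

table: dict[Any, Row] = {}
for event in events:
    apply_change(table, event)
```

After the loop, only row 1 remains, with its updated email. A lakehouse writer does the same thing at scale, typically as a merge or upsert into an Iceberg, Delta, or Hudi table rather than a dictionary.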
- Debezium captures changes but does not write directly to S3. It requires a downstream sink (Kafka Connect S3 Sink, Flink, or a table format writer) to land data on object storage.
- CDC from databases generates many small events. Without batching and compaction downstream, this creates the small files problem on S3.
- Schema changes in the source database propagate through Debezium as schema change events. If the lakehouse layer does not handle schema evolution, the pipeline can break.
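The small files point can be illustrated with a simple micro-batching buffer. This is a sketch under stated assumptions, not how any particular sink is implemented: the `MicroBatcher` class, its thresholds, and the `sink` callback are hypothetical stand-ins for whatever actually writes a Parquet file or table commit to S3.

```python
import json
from typing import Any, Callable

Event = dict[str, Any]


class MicroBatcher:
    """Buffer change events and hand them to `sink` in larger batches."""

    def __init__(
        self,
        sink: Callable[[list[Event]], None],
        max_events: int = 1000,
        max_bytes: int = 128 * 1024 * 1024,
    ) -> None:
        self._sink = sink              # hypothetical writer, e.g. one file per call
        self._max_events = max_events
        self._max_bytes = max_bytes
        self._buffer: list[Event] = []
        self._bytes = 0

    def add(self, event: Event) -> None:
        self._buffer.append(event)
        # Approximate the batch size by the JSON-encoded size of each event.
        self._bytes += len(json.dumps(event).encode("utf-8"))
        if len(self._buffer) >= self._max_events or self._bytes >= self._max_bytes:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        self._sink(self._buffer)
        self._buffer = []              # start a fresh list; the sink keeps the old one
        self._bytes = 0


written: list[list[Event]] = []
batcher = MicroBatcher(written.append, max_events=3)
for i in range(7):
    batcher.add({"op": "c", "before": None, "after": {"id": i}})
batcher.flush()  # flush the final partial batch
batch_sizes = [len(batch) for batch in written]
```

Seven events produce three writes instead of seven. A production pipeline would usually also flush on a time interval so that low-traffic tables do not wait indefinitely, and would still rely on table-level compaction to merge the files that batching alone leaves behind.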
- scoped_to: S3, Lakehouse (CDC ingestion into S3-based lakehouses)
- enables: CDC into Lakehouse (the primary architecture pattern Debezium feeds)
- used_by: Apache Flink, Apache Spark (stream processors that consume Debezium events)
- depends_on: Kafka Tiered Storage, Redpanda (message brokers that transport CDC events)
Definition
An open-source distributed platform for change data capture (CDC) that streams row-level changes from databases (PostgreSQL, MySQL, MongoDB, and others) into downstream systems such as Kafka, which can then land data into S3-based lakehouses.
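As a rough illustration of how a source database is wired to Kafka, the sketch below builds the JSON body that would be posted to a Kafka Connect worker to register a Debezium PostgreSQL connector. The property names follow the Debezium 2.x PostgreSQL connector as best understood here and should be checked against the documentation for the version in use; the helper function, host, credentials, topic prefix, and table names are all hypothetical.

```python
import json
from typing import Any


def postgres_connector_payload(
    name: str,
    host: str,
    dbname: str,
    user: str,
    password: str,
    topic_prefix: str,
    tables: list[str],
) -> dict[str, Any]:
    """Build a Kafka Connect registration payload for a Debezium PostgreSQL source."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",            # logical decoding plugin
            "database.hostname": host,
            "database.port": "5432",
            "database.user": user,
            "database.password": password,
            "database.dbname": dbname,
            "topic.prefix": topic_prefix,         # prefix for the Kafka topic names
            "table.include.list": ",".join(tables),
        },
    }


# Hypothetical values for illustration only.
payload = postgres_connector_payload(
    name="orders-cdc",
    host="db.internal.example",
    dbname="shop",
    user="debezium",
    password="change-me",
    topic_prefix="shop",
    tables=["public.orders", "public.customers"],
)
body = json.dumps(payload, indent=2)
```

The resulting `body` is what would be sent to the Connect worker's `POST /connectors` endpoint; the request itself is omitted so the example stays runnable without a cluster.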
Getting data from transactional databases into S3-based data lakes has traditionally required batch ETL with full table scans. Debezium instead reads changes from the transaction log as they happen, enabling near-real-time ingestion into lakehouse tables with far less load on the source database than repeated scans.
Real-time database replication to S3 lakehouses, CDC-driven Iceberg/Delta/Hudi ingestion, event-sourced data pipelines.
Resources
Official Debezium documentation for the leading open-source CDC platform that captures database changes for streaming into S3-based lakehouses.
Debezium source repository with connectors for MySQL, PostgreSQL, MongoDB, and other databases feeding CDC pipelines to object storage.
Debezium blog on the Iceberg sink connector enabling direct CDC-to-Iceberg ingestion without intermediate staging.