Debezium
An open-source distributed platform for change data capture (CDC) that streams row-level changes from databases (PostgreSQL, MySQL, MongoDB, and others) into event streams, enabling real-time ingestion into S3-based lakehouses.
Summary
Debezium sits at the ingestion boundary between operational databases and the S3 data lake. It captures INSERT, UPDATE, and DELETE events from database transaction logs and publishes them to Kafka, from which downstream connectors write to S3 in Parquet or Iceberg format.
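To make the capture step concrete, here is a minimal sketch of how a downstream consumer might interpret one change event. It assumes the standard Debezium envelope fields `before`, `after`, and `op` (with `c` for create, `u` for update, `d` for delete, and `r` for snapshot reads). The function name `apply_change_event`, the in-memory dict standing in for a target table, and the `id` key column are illustrative assumptions, not part of Debezium.

```python
from typing import Any, Dict, Optional

Row = Dict[str, Any]


def apply_change_event(
    table: Dict[Any, Row], event: Dict[str, Any], key_field: str = "id"
) -> None:
    """Apply one Debezium-style change event to an in-memory table.

    `table` maps primary-key values to row images. `key_field` is a
    hypothetical primary-key column name chosen for this sketch.
    """
    # Events serialized with an embedded schema wrap the envelope in
    # "payload"; otherwise the envelope is the event itself.
    payload = event.get("payload", event)
    op = payload["op"]
    before: Optional[Row] = payload.get("before")
    after: Optional[Row] = payload.get("after")

    if op in ("c", "r", "u"):
        # Create, snapshot read, or update: upsert the new row image.
        if after is None:
            raise ValueError(f"op {op!r} requires an 'after' row image")
        table[after[key_field]] = after
    elif op == "d":
        # Delete: the key comes from the old row image.
        if before is None:
            raise ValueError("op 'd' requires a 'before' row image")
        table.pop(before[key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
```

A real sink would apply the same upsert and delete logic to a Parquet or Iceberg writer rather than a dict, and would usually take the key from the Kafka message key rather than from the row image.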
- Debezium captures changes but does not write directly to S3. It requires a downstream sink (Kafka Connect S3 Sink, Flink, or a table format writer) to land data on object storage.
- CDC from databases generates many small events. Without batching and compaction downstream, this creates the small files problem on S3.
- Schema changes in the source database propagate through Debezium as schema change events. If the lakehouse layer does not handle schema evolution, the pipeline breaks.
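The small files concern above is usually handled by buffering events and writing them in larger groups. The following is a rough sketch of that idea only: `EventBatcher`, its thresholds, and the `flush` callback are hypothetical names for illustration and do not correspond to any Debezium or Kafka Connect API. Real sinks expose their own size- and time-based rotation settings.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]


class EventBatcher:
    """Buffer small CDC events and hand them to `flush` in larger batches.

    A batch is flushed when it reaches `max_events`, or when its oldest
    event is at least `max_age_seconds` old at the time of the next add.
    """

    def __init__(
        self,
        flush: Callable[[List[Event]], None],
        max_events: int = 10_000,
        max_age_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so the age rule can be tested
        self._buffer: List[Event] = []
        self._opened_at: Optional[float] = None

    def add(self, event: Event) -> None:
        if not self._buffer:
            # First event of a new batch starts the age timer.
            self._opened_at = self._clock()
        self._buffer.append(event)
        too_many = len(self._buffer) >= self._max_events
        too_old = self._clock() - self._opened_at >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        """Emit the current batch, if any, and start a fresh one."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._opened_at = None
        self._flush(batch)
```

In this sketch the age check only runs when a new event arrives, so a production writer would also flush on a timer and on shutdown. Batching reduces the number of objects written, but table-format compaction is still needed to merge the files that do land.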
scoped_to: S3, Lakehouse (CDC ingestion into S3-based lakehouses)
enables: CDC into Lakehouse (the primary architecture pattern Debezium feeds)
used_by: Apache Flink, Apache Spark (stream processors that consume Debezium events)
depends_on: Kafka Tiered Storage, Redpanda (message brokers that transport CDC events)
Definition
An open-source distributed platform for change data capture (CDC) that streams row-level changes from databases (PostgreSQL, MySQL, MongoDB, and others) into downstream systems such as Kafka, which can then land data into S3-based lakehouses.
Getting data from transactional databases into S3-based data lakes traditionally requires batch ETL with full table scans. Debezium reads changes from the transaction log as they happen, enabling near-real-time ingestion into lakehouse tables with far less load on the source database than repeated full scans.
Typical uses: real-time database replication to S3 lakehouses, CDC-driven Iceberg/Delta/Hudi ingestion, and event-sourced data pipelines.
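As an illustration of how a database is wired into this flow, the sketch below builds the JSON body that could be posted to the Kafka Connect REST API (`POST /connectors`) to register a Debezium PostgreSQL source connector. The property names are standard Debezium PostgreSQL connector settings. The helper name `build_postgres_connector_request` and every value in the example call (host, credentials, table list, topic prefix) are placeholders assumed for this sketch.

```python
import json
from typing import Dict


def build_postgres_connector_request(
    name: str,
    host: str,
    dbname: str,
    user: str,
    password: str,
    topic_prefix: str,
    tables: str,
    port: int = 5432,
) -> Dict[str, object]:
    """Build a Kafka Connect registration body for a Debezium PostgreSQL source."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            # pgoutput is PostgreSQL's built-in logical decoding plugin.
            "plugin.name": "pgoutput",
            "database.hostname": host,
            "database.port": str(port),
            "database.user": user,
            "database.password": password,
            "database.dbname": dbname,
            # Change topics are named <topic.prefix>.<schema>.<table>.
            "topic.prefix": topic_prefix,
            # Comma-separated schema.table names to capture.
            "table.include.list": tables,
        },
    }


# Placeholder values for illustration only.
request_body = build_postgres_connector_request(
    name="inventory-connector",
    host="postgres.example.internal",
    dbname="inventory",
    user="debezium",
    password="change-me",
    topic_prefix="inventory",
    tables="public.orders,public.customers",
)
print(json.dumps(request_body, indent=2))
```

This only covers the source side. Landing the resulting topics on S3 still requires a separate sink connector or stream processor, configured independently.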
Recent developments
- Latest release: 3.5.2.Final (GA June 2, 2026), per the debezium/debezium releases page. This entry tracks the upstream stable release line.
- Position vs Flink CDC stabilizes: Debezium is the Kafka-Connect-anchored choice; Flink CDC is the no-Kafka path. Per RisingWave's Debezium alternatives survey (April 2026) and Conduktor's Debezium CDC implementation guide, Debezium remains the most widely deployed CDC tool in 2026 — driven by mature Kafka Connect ecosystem integration, well-understood operational shape, and broad source-database coverage. The 2026 framing positions Debezium as the answer when Kafka is already in the stack; Flink CDC is the answer when teams want to skip the Kafka hop entirely and write directly to Iceberg/Paimon/Hudi. Both paths are valid; the choice now comes down to whether you want Kafka as the durable streaming substrate or are willing to delegate that durability to the lakehouse table format.
Resources
Official Debezium documentation for the leading open-source CDC platform that captures database changes for streaming into S3-based lakehouses.
Debezium source repository with connectors for MySQL, PostgreSQL, MongoDB, and other databases feeding CDC pipelines to object storage.
Debezium blog on the Iceberg sink connector enabling direct CDC-to-Iceberg ingestion without intermediate staging.