Technology

Debezium

An open-source distributed platform for change data capture (CDC) that captures row-level changes from databases (PostgreSQL, MySQL, MongoDB, and others) and publishes them as event streams, enabling near-real-time ingestion into S3-based lakehouses.

Summary

Where it fits

Debezium sits at the ingestion boundary between operational databases and the S3 data lake. It captures INSERT, UPDATE, and DELETE events from database transaction logs and publishes them to Kafka, from which downstream connectors write them to S3 as Parquet files or into Iceberg tables.
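To illustrate what a downstream consumer does with those events, here is a minimal sketch that replays Debezium-style change events into an in-memory table. It relies on the documented event envelope fields `op` (`c` create, `u` update, `d` delete, `r` snapshot read), `before`, and `after`. The `orders` rows, the `id` key field, and the function name are hypothetical, and the sketch assumes delete events carry the key in `before`.

```python
def apply_change_event(table, event, key_field="id"):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    # With schemas enabled in the JSON converter, the envelope sits under "payload".
    payload = event.get("payload", event)
    op = payload["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update: upsert the new row image
        row = payload["after"]
        table[row[key_field]] = row
    elif op == "d":  # delete: remove the row identified by the old row image
        row = payload["before"]
        table.pop(row[key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return table


# Hypothetical event sequence for an "orders" table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

orders = {}
for event in events:
    apply_change_event(orders, event)
```

After the replay, `orders` holds only row 1 with status `paid`. A real lakehouse writer would express the same upsert and delete semantics as a merge into an Iceberg, Delta, or Hudi table rather than a dict.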

Misconceptions / Traps
  • Debezium captures changes but does not write directly to S3. It requires a downstream sink (Kafka Connect S3 Sink, Flink, or a table format writer) to land data on object storage.
  • CDC from databases generates many small events. Without batching and compaction downstream, this creates the small files problem on S3.
  • Schema changes in the source database propagate through Debezium as schema change events. If the lakehouse layer does not handle schema evolution, the pipeline can break.
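
The small files trap above is usually addressed by buffering events and writing one object per batch instead of one per event. The following is a minimal sketch of that idea, not how any particular sink implements it: `EventBatcher`, its thresholds, and the `flush` callback (standing in for an object-storage write) are all hypothetical.

```python
import json


class EventBatcher:
    """Buffer change events and hand them to `flush` as one newline-delimited blob."""

    def __init__(self, flush, max_events=1000, max_bytes=64 * 1024 * 1024):
        self._flush = flush          # hypothetical callback, e.g. an S3 upload
        self.max_events = max_events
        self.max_bytes = max_bytes
        self._buffer = []
        self._size = 0

    def add(self, event):
        encoded = json.dumps(event)
        self._buffer.append(encoded)
        self._size += len(encoded.encode("utf-8"))
        # Flush when either the event-count or the byte-size threshold is reached.
        if len(self._buffer) >= self.max_events or self._size >= self.max_bytes:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        self._flush("\n".join(self._buffer))
        self._buffer = []
        self._size = 0


written = []  # stands in for objects landed on S3
batcher = EventBatcher(written.append, max_events=3)
for i in range(7):
    batcher.add({"op": "c", "after": {"id": i}})
batcher.flush()  # flush the final partial batch
```

Seven events produce three objects (3, 3, and 1 events) instead of seven. Production sinks typically also flush on a time interval so that slow-moving tables are not held back indefinitely, and table formats still need periodic compaction.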

Key Connections
  • scoped_to S3, Lakehouse — CDC ingestion into S3-based lakehouses
  • enables CDC into Lakehouse — the primary architecture pattern Debezium feeds
  • used_by Apache Flink, Apache Spark — stream processors that consume Debezium events
  • depends_on Kafka Tiered Storage, Redpanda — message brokers that transport CDC events

Definition

What it is

An open-source distributed platform for change data capture (CDC) that streams row-level changes from databases (PostgreSQL, MySQL, MongoDB, and others) into downstream systems such as Kafka, which can then land data into S3-based lakehouses.
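
As a sketch of how a source database gets wired to Kafka, the snippet below builds the JSON body that would be sent to the Kafka Connect REST API (`POST /connectors`) to register a Debezium PostgreSQL connector. The property names are Debezium connector options (`topic.prefix` applies to Debezium 2.x and later). The host, credentials, database, and table names are placeholder assumptions, and the helper functions are hypothetical.

```python
import json


def build_postgres_connector_request(name, host, dbname, user, password,
                                     topic_prefix, tables, port=5432):
    """Build a Kafka Connect registration body for a Debezium PostgreSQL connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",          # PostgreSQL's built-in logical decoding plugin
            "tasks.max": "1",
            "database.hostname": host,
            "database.port": str(port),
            "database.user": user,
            "database.password": password,
            "database.dbname": dbname,
            "topic.prefix": topic_prefix,       # namespace for the Kafka topics
            "table.include.list": ",".join(tables),
        },
    }


def topic_for(topic_prefix, qualified_table):
    """Default topic name for a table given as '<schema>.<table>'."""
    return f"{topic_prefix}.{qualified_table}"


# Placeholder values; none of these come from a real deployment.
request = build_postgres_connector_request(
    name="orders-cdc",
    host="db.example.internal",
    dbname="shop",
    user="debezium",
    password="change-me",
    topic_prefix="shop",
    tables=["public.orders", "public.customers"],
)
body = json.dumps(request, indent=2)
orders_topic = topic_for("shop", "public.orders")
```

With this configuration, changes to `public.orders` would be expected on the topic `shop.public.orders`, which a downstream sink can then consume and land on S3.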

Why it exists

Getting data from transactional databases into S3-based data lakes traditionally required batch ETL with full table scans. Debezium captures changes from the transaction log as they happen, enabling near-real-time ingestion into lakehouse tables with far less load on the source database than repeated full scans.

Primary use cases

Real-time database replication to S3 lakehouses, CDC-driven Iceberg/Delta/Hudi ingestion, event-sourced data pipelines.

Recent developments

Latest signals
  • Latest release: 3.5.2.Final (GA June 2, 2026). Tracking the upstream stable release line. Per debezium/debezium releases.
  • Positioning against Flink CDC has stabilized: Debezium is the Kafka Connect-anchored choice, and Flink CDC is the no-Kafka path. Per RisingWave's Debezium alternatives survey (April 2026) and Conduktor's Debezium CDC implementation guide, Debezium remains the most widely deployed CDC tool in 2026, driven by mature Kafka Connect ecosystem integration, a well-understood operational model, and broad source-database coverage. The 2026 framing positions Debezium as the answer when Kafka is already in the stack, and Flink CDC as the answer when teams want to skip the Kafka hop and write directly to Iceberg/Paimon/Hudi. Both paths are valid; the choice comes down to whether Kafka should be the durable streaming substrate or whether that durability is delegated to the lakehouse table format.
