Technology

Debezium

An open-source distributed platform for change data capture (CDC) that streams row-level changes from databases (PostgreSQL, MySQL, MongoDB, and others) into event streams, enabling real-time ingestion into S3-based lakehouses.


Summary

Where it fits

Debezium sits at the ingestion boundary between operational databases and the S3 data lake. It captures INSERT, UPDATE, and DELETE events from database transaction logs and publishes them to Kafka, from which downstream connectors write to S3 in Parquet or Iceberg format.
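Each captured change arrives as an event envelope with `before` and `after` row images, a `source` block, and an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read). A minimal sketch of routing a decoded event to a sink action (the event values and the `route` helper are illustrative, not part of Debezium's API):

```python
import json

# A simplified Debezium change event (field names follow Debezium's
# documented envelope; the row values are illustrative).
raw = json.dumps({
    "before": None,
    "after": {"id": 42, "email": "a@example.com"},
    "source": {"connector": "postgresql", "table": "users"},
    "op": "c",  # c=create, u=update, d=delete, r=snapshot read
    "ts_ms": 1700000000000,
})

def route(event: dict) -> str:
    """Map a Debezium op code to the action a downstream sink would take."""
    op = event["op"]
    table = event["source"]["table"]
    if op in ("c", "r", "u"):
        return f"upsert into {table}: {event['after']}"
    if op == "d":
        return f"delete from {table}: {event['before']}"
    raise ValueError(f"unknown op {op!r}")

print(route(json.loads(raw)))
```

Deletes carry only a `before` image (the row no longer exists), which is why a sink must handle them separately from inserts and updates.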

Misconceptions / Traps
  • Debezium captures changes but does not write directly to S3. It requires a downstream sink (Kafka Connect S3 Sink, Flink, or a table format writer) to land data on object storage.
  • CDC from databases generates many small events. Without batching and compaction downstream, this creates the small files problem on S3.
  • Schema changes in the source database propagate through Debezium as schema change events. If the lakehouse layer does not handle schema evolution, the pipeline breaks.
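The small-files trap above is usually addressed by micro-batching events before each object write. A sketch of the idea, assuming illustrative thresholds (real sinks such as the Kafka Connect S3 sink expose equivalent size- and time-based settings):

```python
import time

class MicroBatcher:
    """Buffer CDC events and flush them in bulk, so each flush produces
    one larger S3 object instead of one tiny object per event.
    Thresholds here are illustrative, not defaults of any real sink."""

    def __init__(self, max_events=1000, max_age_s=60.0, flush=print):
        self.max_events = max_events
        self.max_age_s = max_age_s
        self.flush_fn = flush  # e.g. write one Parquet file to S3
        self.buffer = []
        self.opened_at = None

    def add(self, event):
        if not self.buffer:
            self.opened_at = time.monotonic()
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_events
                or time.monotonic() - self.opened_at >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
```

Even with batching, long-running tables still accumulate files, which is why lakehouse table formats pair CDC ingestion with periodic compaction jobs.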
Key Connections
  • scoped_to S3, Lakehouse — CDC ingestion into S3-based lakehouses
  • enables CDC into Lakehouse — the primary architecture pattern Debezium feeds
  • used_by Apache Flink, Apache Spark — stream processors that consume Debezium events
  • depends_on Kafka Tiered Storage, Redpanda — message brokers that transport CDC events

Definition

What it is

An open-source distributed platform for change data capture (CDC) that streams row-level changes from databases (PostgreSQL, MySQL, MongoDB, and others) into downstream systems such as Kafka, which can then land data into S3-based lakehouses.

Why it exists

Getting data from transactional databases into S3-based data lakes traditionally required batch ETL with full table scans. Debezium instead reads changes from the database's transaction log as they happen, enabling near-real-time ingestion into lakehouse tables with minimal load on the source database.
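The contrast with batch ETL can be sketched as incrementally applying a change stream to table state, keyed by primary key, rather than re-scanning the whole source table. The event shape mirrors Debezium's envelope; the `apply_changes` helper and the `id` key are assumptions for illustration:

```python
def apply_changes(table: dict, events):
    """Incrementally apply CDC events (Debezium-style op/before/after)
    to an in-memory replica keyed by primary key."""
    for e in events:
        if e["op"] in ("c", "r", "u"):   # insert, snapshot read, update
            table[e["after"]["id"]] = e["after"]
        elif e["op"] == "d":            # delete: only a before-image exists
            table.pop(e["before"]["id"], None)
    return table

state = {}
apply_changes(state, [
    {"op": "c", "before": None, "after": {"id": 1, "name": "a"}},
    {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "b"}},
    {"op": "d", "before": {"id": 1, "name": "b"}, "after": None},
])
print(state)  # → {} (the row was created, updated, then deleted)
```

Processing only the rows that changed is what lets CDC pipelines keep lakehouse tables fresh without repeated full-table extracts.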

Primary use cases

Real-time database replication to S3 lakehouses, CDC-driven Iceberg/Delta/Hudi ingestion, event-sourced data pipelines.
