Technology

Flink CDC

Apache Flink connectors for reading database change logs (MySQL binlog, PostgreSQL WAL) and streaming them directly into lakehouse formats on S3 without an intermediate message broker.


Summary


Where it fits

Flink CDC removes Kafka from the CDC pipeline. Instead of Database → Debezium → Kafka → Flink → S3, the architecture becomes Database → Flink CDC → S3. This reduces latency, operational complexity, and infrastructure costs for database-to-lakehouse replication.
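The shortened pipeline can be sketched in Flink SQL: a `mysql-cdc` source table reads the binlog directly, and a single continuous `INSERT INTO` job writes the change stream to a Paimon table on S3. This is a hedged sketch — hostnames, credentials, the schema, and the S3 path are placeholders, and in practice the Paimon table would usually be defined through a Paimon catalog rather than a temporary table:

```sql
-- Source: read the MySQL binlog directly (no Debezium/Kafka hop).
CREATE TEMPORARY TABLE orders_src (
  id     BIGINT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector'     = 'mysql-cdc',
  'hostname'      = 'mysql.example.com',
  'port'          = '3306',
  'username'      = 'cdc_user',
  'password'      = '<secret>',
  'database-name' = 'shop',
  'table-name'    = 'orders'
);

-- Sink: a Paimon table stored on S3 (path is a placeholder).
CREATE TEMPORARY TABLE orders_lake (
  id     BIGINT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'paimon',
  'path'      = 's3://lake-bucket/shop/orders'
);

-- One continuous job replaces Database → Debezium → Kafka → Flink → S3.
INSERT INTO orders_lake SELECT * FROM orders_src;
```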

Misconceptions / Traps
  • Eliminating Kafka also eliminates its replay buffer. If the Flink job fails, replay must come from the database logs, which may have limited retention.
  • Flink task-manager memory can become a bottleneck under high-throughput change streams (large state, in-flight buffers), so capacity planning for Flink CDC is critical.
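Because the database log becomes the only replay buffer once Kafka is removed, two common mitigations are frequent Flink checkpoints (so a restarted job resumes close to the failure point) and generous log retention on the source. A sketch of both knobs; the interval and retention values are illustrative, not recommendations:

```sql
-- Flink SQL client: checkpoint often so a restarted job resumes
-- from a recent binlog position instead of replaying a long window.
SET 'execution.checkpointing.interval' = '60s';

-- MySQL side (run against the source database, MySQL 8.0+):
-- keep binlogs long enough to cover any plausible job outage.
SET GLOBAL binlog_expire_logs_seconds = 604800;  -- 7 days
```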
Key Connections
  • depends_on Apache Flink — runs as Flink connectors
  • enables Apache Paimon, Apache Iceberg, Apache Hudi — writes CDC data directly to lakehouse formats
  • scoped_to Table Formats — ingestion framework for S3-based table formats

Definition

What it is

A set of Apache Flink connectors that read database change logs (MySQL binlog, PostgreSQL WAL, MongoDB oplog) and stream them directly into lakehouse table formats on S3, without requiring an intermediate message broker.

Why it exists

Traditional CDC pipelines require Kafka or a similar message queue between the source database and the lake. Flink CDC eliminates this intermediate layer by reading change logs directly and writing to Iceberg, Paimon, or Hudi on S3, reducing operational complexity and latency.

Primary use cases

Database-to-lakehouse replication without Kafka, real-time data mirroring from operational databases to S3, streaming CDC ingestion into Iceberg or Paimon tables.
