Technology

Flink CDC

Apache Flink connectors for reading database change logs (MySQL binlog, PostgreSQL WAL) and streaming them directly into lakehouse formats on S3 without an intermediate message broker.

Summary

Where it fits

Flink CDC removes Kafka from the CDC pipeline. Instead of Database → Debezium → Kafka → Flink → S3, the architecture becomes Database → Flink CDC → S3. This reduces latency, operational complexity, and infrastructure costs for database-to-lakehouse replication.

Misconceptions / Traps

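Eliminating Kafka also eliminates its replay buffer. If the Flink job fails or falls behind, it can only resume from the database's own logs (binlog, WAL), which often have limited retention. If the needed log position has already been purged, the job cannot resume incrementally and the affected tables must be snapshotted again.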
Memory usage can be significant under high-throughput workloads. Capacity planning for Flink CDC is critical.
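
The first trap above reduces to a simple comparison: incremental recovery works only while the database still retains the log position recorded in the last successful checkpoint. The following is a minimal sketch of that check, not part of Flink CDC. The helper name `can_resume_from_logs` and the `safety_margin` parameter are hypothetical, and retention is modelled purely as a time window, whereas real databases may also purge by size or hold logs through replication slots.

```python
from datetime import datetime, timedelta


def can_resume_from_logs(last_checkpoint: datetime,
                         now: datetime,
                         log_retention: timedelta,
                         safety_margin: timedelta = timedelta(0)) -> bool:
    """Return True if the source's change log should still cover the checkpoint.

    Hypothetical helper: assumes retention is a pure time window.
    """
    outage = now - last_checkpoint
    return outage + safety_margin <= log_retention


# Example: the job's last checkpoint is 60 hours old.
checkpoint = datetime(2024, 1, 1, 0, 0)
now = datetime(2024, 1, 3, 12, 0)

print(can_resume_from_logs(checkpoint, now, timedelta(days=7)))    # True
print(can_resume_from_logs(checkpoint, now, timedelta(hours=48)))  # False
```

When the check fails, the only safe option is a fresh snapshot, so log retention on the source should be sized against the longest outage the team expects to tolerate.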

Key Connections

depends_on Apache Flink — runs as Flink connectors
enables Apache Paimon, Apache Iceberg, Apache Hudi — writes CDC data directly to lakehouse formats
scoped_to Table Formats — ingestion framework for S3-based table formats

Definition

What it is

A set of Apache Flink connectors that read database change logs (MySQL binlog, PostgreSQL WAL, MongoDB oplog) and stream them directly into lakehouse table formats on S3, without requiring an intermediate message broker.

Why it exists

Traditional CDC pipelines typically place Kafka or a similar message queue between the source database and the lake. Flink CDC eliminates this intermediate layer by reading change logs directly and writing to Iceberg, Paimon, or Hudi on S3, reducing operational complexity and latency.

Primary use cases

Database-to-lakehouse replication without Kafka, real-time data mirroring from operational databases to S3, streaming CDC ingestion into Iceberg or Paimon tables.
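
Conceptually, mirroring a table means applying each change event to the target, keyed by primary key. The sketch below illustrates that idea in plain Python. It is not Flink CDC code: the event shape (`op`, `key`, `after`) and the Debezium-style op codes (`c` create, `r` snapshot read, `u` update, `d` delete) are assumptions made for illustration, and real Iceberg, Paimon, or Hudi sinks implement upserts and deletes through their own table mechanics.

```python
from typing import Any, Dict, Iterable

ChangeEvent = Dict[str, Any]  # assumed shape: {"op": ..., "key": ..., "after": ...}


def apply_changes(mirror: Dict[Any, dict],
                  events: Iterable[ChangeEvent]) -> Dict[Any, dict]:
    """Apply change events in order so `mirror` converges to the source table."""
    for ev in events:
        op, key = ev["op"], ev["key"]
        if op in ("c", "r", "u"):
            # Inserts, snapshot reads, and updates all become an upsert.
            mirror[key] = ev["after"]
        elif op == "d":
            # Deleting a key that is already absent is a no-op.
            mirror.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return mirror


events = [
    {"op": "c", "key": 1, "after": {"id": 1, "name": "a"}},
    {"op": "u", "key": 1, "after": {"id": 1, "name": "b"}},
    {"op": "c", "key": 2, "after": {"id": 2, "name": "x"}},
    {"op": "d", "key": 2, "after": None},
]
print(apply_changes({}, events))  # {1: {'id': 1, 'name': 'b'}}
```

Because every operation is an upsert or an idempotent delete, replaying the same events after a restart leaves the mirror unchanged, which is what makes replay from the database log safe.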

Recent developments

Latest signals

Sub-second end-to-end latency; no-Kafka pipelines positioned as the primary CDC option. Per the official Flink CDC docs, the current release ships an incremental snapshot algorithm (no source-database lock), schema evolution with automatic downstream table creation and DDL application, a full streaming pipeline with sub-second end-to-end latency, and SQL-shaped transformations (projection, filtering, computed columns). The "skip the Kafka hop" framing is now the canonical reason to pick Flink CDC over Debezium + Kafka Connect for Iceberg/Paimon/Hudi sinks.
Operational tradeoffs documented in CDC tooling surveys. Per RisingWave's CDC tools comparison (April 2026), Flink CDC's strengths (sub-second latency, no Kafka required, full Java/SQL transformations) come with real costs: JVM expertise, checkpoint configuration, RocksDB state backend tuning, JobManager / TaskManager cluster operations, and a SQL surface area narrower than RisingWave's. Decision framing: pick Flink CDC when transformation capability and operational depth justify the JVM overhead; pick a SQL-shaped streaming database when the workload fits the SQL surface area cleanly.
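
To make the "no source-database lock" claim for the incremental snapshot more concrete, the sketch below shows the general idea behind chunked, watermark-based snapshotting: split the table into primary-key ranges, read each range without a lock, then replay the log events captured while the read was in progress. This is a conceptual illustration under simplifying assumptions (integer keys, in-memory rows). It is not Flink CDC's actual implementation, and the names `split_into_chunks` and `read_chunk` are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
LogEvent = Tuple[str, int, Row]  # (op, primary key, row image); op is "upsert" or "delete"


def split_into_chunks(min_key: int, max_key: int,
                      chunk_size: int) -> List[Tuple[int, int]]:
    """Split [min_key, max_key] into half-open ranges of at most chunk_size keys."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(lo, min(lo + chunk_size, max_key + 1))
            for lo in range(min_key, max_key + 1, chunk_size)]


def read_chunk(table_rows: Dict[int, Row],
               log_between_watermarks: List[LogEvent],
               lo: int, hi: int) -> Dict[int, Row]:
    """Read one key range without a lock, then replay the log events recorded
    between the low and high watermark so the chunk reflects the high watermark."""
    chunk = {k: v for k, v in table_rows.items() if lo <= k < hi}
    for op, key, row in log_between_watermarks:
        if not (lo <= key < hi):
            continue  # the event belongs to a different chunk
        if op == "delete":
            chunk.pop(key, None)
        else:
            chunk[key] = row
    return chunk


print(split_into_chunks(1, 10, 4))  # [(1, 5), (5, 9), (9, 11)]

rows = {1: {"v": "a"}, 2: {"v": "b"}, 6: {"v": "c"}}
log = [("upsert", 2, {"v": "b2"}), ("delete", 1, {}), ("upsert", 6, {"v": "c2"})]
print(read_chunk(rows, log, 1, 5))  # {2: {'v': 'b2'}}
```

Because chunks are independent key ranges, they can be read in parallel and retried individually, which is the main practical benefit over a single locked full-table scan.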