Guide 26

CDC Failure Modes — What Breaks When Streaming Database Logs to S3

Problem Framing

Change Data Capture into a lakehouse is marketed as a turnkey path to "real-time analytics," but production CDC pipelines routinely fail from schema drift in the source database, out-of-order events in Kafka, unhandled hard deletes, and compaction debt that accumulates faster than maintenance jobs can clear it. These failure modes are predictable and structurally inherent to the architecture. Engineers need a failure-mode catalogue, idempotent pipeline design patterns, and monitoring strategies that catch problems before they corrupt downstream tables.

Relevant Nodes

  • Topics: S3, Table Formats
  • Technologies: Debezium, Flink CDC, Apache Paimon, Apache Hudi
  • Architectures: CDC into Lakehouse, Compaction
  • Pain Points: Schema Evolution, Small Files Problem, Read / Write Amplification, Legacy Ingestion Bottlenecks

Decision Path

  1. Choose your CDC source connector. Debezium reads database WAL (write-ahead log) and publishes change events to Kafka. Flink CDC reads WAL directly into Flink without Kafka as an intermediary. The choice affects your failure surface:

    • Debezium + Kafka: more mature, but Kafka adds a layer of replication lag and topic-management overhead, and a Schema Registry is effectively required for schema evolution.
    • Flink CDC: lower latency, fewer moving parts, but less community tooling and harder to debug.
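For the Debezium path, a minimal PostgreSQL connector configuration, expressed as the JSON payload you would submit to the Kafka Connect REST API, might look like the sketch below. All hostnames, credentials, table names, and the secret reference are placeholders, not values from this guide:

```python
# Hypothetical Debezium PostgreSQL connector config. Every hostname,
# credential, and table name here is a placeholder.
import json

connector_config = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",              # logical decoding plugin
        "database.hostname": "db.example.internal",
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "${secrets:cdc/password}",
        "database.dbname": "orders",
        "topic.prefix": "prod.orders",          # Kafka topic namespace
        "table.include.list": "public.orders,public.order_items",
        # Schema Registry integration so downstream consumers can
        # version schemas and detect drift (step 3):
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

print(json.dumps(connector_config, indent=2))
```

The Flink CDC path replaces all of this with a source definition inside the Flink job itself, which is precisely the "fewer moving parts, harder to debug" trade-off noted above.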
  2. Configure merge-on-read vs. copy-on-write for target tables. Merge-on-read (MoR) writes change logs as delta files and merges at query time — fast writes, slower reads. Copy-on-write (CoW) rewrites affected data files on each commit — slow writes, fast reads.

    • MoR is preferred for high-velocity CDC where write throughput matters more than query latency.
    • CoW is preferred when downstream consumers are dashboards or BI tools that cannot tolerate merge overhead.
    • Apache Hudi and Paimon support both modes. Iceberg supports MoR via positional deletes (v2) or deletion vectors (v3).
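To make the MoR read-path cost concrete, here is a minimal sketch of what a merge-on-read query must do at query time: apply an ordered changelog of upserts and deletes on top of base-file rows. The row layout and op names are illustrative, not any table format's actual on-disk spec:

```python
# Sketch of query-time merge for a merge-on-read table. The reader pays
# this merge cost on every query; copy-on-write pays it once at write time.

def merge_on_read(base_rows, delta_log):
    """base_rows: {pk: row_dict}; delta_log: ordered list of (op, pk, row)."""
    merged = dict(base_rows)
    for op, pk, row in delta_log:
        if op == "delete":
            merged.pop(pk, None)        # hard delete removes the key
        else:                           # "upsert": latest write wins
            merged[pk] = row
    return merged

base = {1: {"id": 1, "status": "new"}, 2: {"id": 2, "status": "new"}}
deltas = [
    ("upsert", 1, {"id": 1, "status": "shipped"}),
    ("delete", 2, None),
    ("upsert", 3, {"id": 3, "status": "new"}),
]
result = merge_on_read(base, deltas)
# key 1 updated, key 2 deleted, key 3 inserted
```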
  3. Handle schema drift. Source databases change schemas (add columns, widen types, rename fields) without coordinating with downstream consumers. Configure your pipeline to handle drift:

    • Enable schema evolution on the target table format (Iceberg and Delta can apply additive schema changes automatically when the writer is configured to merge schemas).
    • Reject breaking changes (column drops, type narrowing) and route them to a dead-letter queue for manual review.
    • Integrate Schema Registry (Confluent or AWS Glue) to version schemas and enforce compatibility.
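The additive-vs-breaking triage above can be sketched as a small classifier. The type-widening table is illustrative; a real pipeline would delegate this check to the Schema Registry's compatibility rules:

```python
# Sketch of drift triage: additive changes are applied, breaking changes
# (column drops, type narrowing) are routed to a dead-letter queue.
# The widening whitelist below is an assumption, not any registry's spec.

WIDENING_OK = {("int", "long"), ("float", "double"), ("int", "double")}

def classify_drift(old_schema, new_schema):
    """Schemas are {column: type}. Returns ("apply" | "dlq", reason)."""
    for col, old_type in old_schema.items():
        if col not in new_schema:
            return "dlq", f"column dropped: {col}"
        new_type = new_schema[col]
        if new_type != old_type and (old_type, new_type) not in WIDENING_OK:
            return "dlq", f"incompatible type change on {col}: {old_type} -> {new_type}"
    added = sorted(set(new_schema) - set(old_schema))
    return "apply", f"additive columns: {added}" if added else "no change"
```

Anything classified as "dlq" should carry the offending event batch and both schema versions so manual review has full context.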
  4. Design idempotent consumers. CDC events may be delivered more than once (at-least-once semantics in Kafka) or arrive out of order. Idempotent consumers must:

    • Deduplicate by primary key + event timestamp (or a monotonic sequence number such as the WAL LSN, since timestamps can collide).
    • Use upsert semantics (not append) to handle replayed events.
    • Maintain a checkpoint or watermark to resume from the last committed offset after failure.
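The three requirements above fit together in one small state machine. This is a minimal in-memory sketch; in production the state and offset would live in the Flink checkpoint or the target table, and the event fields shown are assumptions:

```python
# Sketch of an idempotent upsert consumer: duplicates and out-of-order
# replays are discarded by comparing a (event_ts, offset) version per
# primary key; state plus last committed offset form the checkpoint.

class IdempotentConsumer:
    def __init__(self):
        self.state = {}            # pk -> row
        self.versions = {}         # pk -> (event_ts, offset) last applied
        self.committed_offset = -1

    def apply(self, event):
        """event: {pk, event_ts, offset, op: 'upsert'|'delete', row}."""
        pk = event["pk"]
        version = (event["event_ts"], event["offset"])
        if pk in self.versions and version <= self.versions[pk]:
            return False                      # duplicate or stale replay
        if event["op"] == "delete":
            self.state.pop(pk, None)          # honor hard deletes
        else:
            self.state[pk] = event["row"]     # upsert, never blind append
        self.versions[pk] = version
        self.committed_offset = max(self.committed_offset, event["offset"])
        return True
```

Replaying the same event twice is a no-op on the second delivery, which is exactly the property at-least-once Kafka delivery forces you to provide.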
  5. Monitor compaction debt. Every CDC commit produces small files (at least one per partition per checkpoint interval). Without aggressive compaction, file counts grow linearly with time, degrading query performance and inflating S3 GET costs.

    • Set compaction frequency based on write velocity — high-velocity tables may need compaction every 15 minutes.
    • Alert on file count per partition exceeding a threshold (e.g., 1,000 files).
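A compaction-debt check can be as simple as counting data files per partition from the table's metadata and flagging partitions over budget. `files` below stands in for whatever file listing your table format's metadata API exposes:

```python
# Sketch of a per-partition file-count alert for compaction debt.
from collections import Counter

def compaction_alerts(files, max_files_per_partition=1000):
    """files: iterable of (partition, file_path). Returns partitions over budget."""
    counts = Counter(partition for partition, _ in files)
    return {p: n for p, n in counts.items() if n > max_files_per_partition}

files = [("dt=2024-01-01", f"part-{i}.parquet") for i in range(1500)] + \
        [("dt=2024-01-02", f"part-{i}.parquet") for i in range(40)]
alerts = compaction_alerts(files)
# only dt=2024-01-01 exceeds the 1,000-file threshold
```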
  6. Plan backfill strategy for historical re-ingestion. When a CDC pipeline fails and misses events, or when a new table is onboarded, you need to backfill historical data from the source database. This is a full-table scan, not a WAL read, and produces different file sizes and partition distributions than streaming CDC.

    • Run backfills as separate batch jobs with appropriate parallelism.
    • Compact backfill output before enabling streaming CDC to avoid mixed file sizes.
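The "separate batch job with appropriate parallelism" usually starts by slicing the source table's primary-key range into contiguous chunks that workers scan independently. A minimal sketch of the range math, with the actual scan and write elided:

```python
# Sketch of backfill chunking: split [min_pk, max_pk] into half-open
# slices so the full-table scan can run as N parallel batch tasks.

def backfill_ranges(min_pk, max_pk, parallelism):
    """Return [(lo, hi)] half-open slices covering [min_pk, max_pk]."""
    span = max_pk - min_pk + 1
    step = -(-span // parallelism)            # ceiling division
    return [
        (lo, min(lo + step, max_pk + 1))
        for lo in range(min_pk, max_pk + 1, step)
    ]

ranges = backfill_ranges(0, 999_999, 8)
# 8 slices of 125,000 keys: (0, 125000), (125000, 250000), ...
```

Each slice's output should then go through compaction before the streaming CDC feed is switched on, per the bullet above, so the table does not start life with a bimodal file-size distribution.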
  7. Test failure recovery end-to-end. Simulate source database failover, Kafka partition rebalance, Flink checkpoint failure, and S3 throttling. Verify that the pipeline recovers without data loss or duplication. Document the recovery runbook.

What Changed Over Time

  • Early CDC (pre-2020) used batch-based approaches: periodic full dumps or query-based change detection. Latency was minutes to hours.
  • Debezium (2017) popularized WAL-based CDC with Kafka Connect, enabling near-real-time change streaming.
  • Table formats added native CDC support: Hudi's record-level upserts (original design), Iceberg's row-level deletes (v2), Paimon's changelog-native design.
  • Flink CDC (2021) eliminated the Kafka intermediary for Flink-based pipelines, reducing operational complexity.
  • The dominant failure mode has shifted from "data not arriving" to "data arriving but corrupting the target" — schema drift, out-of-order events, and compaction debt are now the primary operational challenges.

Sources