Technology

Apache Flink

A distributed stream processing framework that processes data in real-time, with S3 as checkpoint store, state backend, and output sink.

Summary

Where it fits

Flink is the streaming complement to Spark's batch processing. In S3-based data platforms, Flink continuously ingests data into lakehouse tables (Apache Iceberg, Delta Lake) and uses S3 as durable storage for fault-tolerant checkpoints.

Misconceptions / Traps
  • Flink streaming writes to S3 inherently produce small files (typically one file per checkpoint interval per writer subtask). Compaction is mandatory, either via the table format or a separate job.
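  • Flink's S3 filesystem plugins require careful configuration. Flink ships two of them: flink-s3-fs-hadoop, registered under s3:// and s3a://, and flink-s3-fs-presto, registered under s3:// and s3p://. As of the Flink 1.x documentation, the Presto-based plugin is the one recommended for checkpointing and the Hadoop-based plugin is the one that supports the file sink. Choosing the wrong implementation for a path (s3:// vs s3a:// vs s3p://) can cause silent failures.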
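
The scheme-to-plugin relationship in the last trap can be checked before a job is deployed. The following is a minimal sketch, not part of Flink: the helper names plugins_for and check_uri are hypothetical, and the scheme table assumes the two bundled plugins register the schemes listed in the Flink filesystem documentation.

```python
from urllib.parse import urlparse

# Assumption: schemes registered by the two bundled S3 plugins,
# as listed in the Flink filesystem documentation.
PLUGIN_SCHEMES = {
    "flink-s3-fs-hadoop": {"s3", "s3a"},
    "flink-s3-fs-presto": {"s3", "s3p"},
}


def plugins_for(uri):
    """Return the names of plugins that register the URI's scheme, sorted."""
    scheme = urlparse(uri).scheme.lower()
    return sorted(
        name for name, schemes in PLUGIN_SCHEMES.items() if scheme in schemes
    )


def check_uri(uri, installed):
    """Classify a URI against the set of installed plugin names.

    Returns the single matching plugin name, "unresolved" if no installed
    plugin registers the scheme, or "ambiguous" if more than one does.
    """
    candidates = [p for p in plugins_for(uri) if p in installed]
    if not candidates:
        return "unresolved"
    if len(candidates) > 1:
        return "ambiguous"
    return candidates[0]


if __name__ == "__main__":
    installed = {"flink-s3-fs-hadoop"}  # hypothetical deployment
    for uri in ("s3a://example-bucket/out", "s3p://example-bucket/checkpoints"):
        print(uri, "->", check_uri(uri, installed))
```

Using the explicit s3a:// and s3p:// schemes, rather than the shared s3:// scheme, keeps the choice of implementation visible in the path itself when both plugins are installed.
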

Key Connections
  • used_by Medallion Architecture, Lakehouse Architecture — streaming data into lakehouse layers
  • constrained_by Small Files Problem — streaming writes produce many small files
  • scoped_to S3, Data Lake
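
The Small Files Problem constraint listed above can be made concrete with a rough estimate. This is a back-of-the-envelope sketch that assumes exactly one file per writer per checkpoint interval, which is a simplification; partitioned writes usually produce more. The function names and the example numbers are illustrative and do not come from a real deployment.

```python
SECONDS_PER_DAY = 86_400


def files_per_day(writers, checkpoint_interval_s):
    """Files written per day, assuming one file per writer per checkpoint."""
    checkpoints_per_day = int(SECONDS_PER_DAY // checkpoint_interval_s)
    return writers * checkpoints_per_day


def avg_file_size_mb(ingest_mb_per_s, writers, checkpoint_interval_s):
    """Average file size when ingest is spread evenly across writers."""
    return ingest_mb_per_s * checkpoint_interval_s / writers


if __name__ == "__main__":
    # Hypothetical job: 8 writers, 60 s checkpoints, 2 MB/s total ingest.
    print(files_per_day(8, 60))          # 11520 files per day
    print(avg_file_size_mb(2.0, 8, 60))  # 15.0 MB per file
```

Under these assumptions a modest job writes more than eleven thousand files of about 15 MB each per day, which is why compaction is treated as mandatory rather than optional.
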

Definition

What it is

A distributed stream processing framework that processes data in real-time, with S3 serving as a checkpoint store, state backend, and output sink.

Why it exists

Batch processing alone cannot satisfy requirements for fresh data. Flink enables continuous processing of streaming data, with S3 as the durable layer for checkpoints (fault tolerance) and as the final destination for processed output.

Primary use cases

Real-time data ingestion into S3-backed lakehouses, streaming ETL with S3 sink, checkpoint storage on S3 for fault tolerance.
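
For the checkpoint-storage use case, the settings involved are small. The sketch below renders two configuration entries as "key: value" lines. It assumes the option names state.checkpoints.dir and execution.checkpointing.interval from the Flink configuration documentation; option names have been renamed across releases, so check the configuration reference for the Flink version in use. The bucket, prefix, and helper names are hypothetical.

```python
def checkpoint_config(bucket, prefix, interval="60s", scheme="s3p"):
    """Build a minimal checkpoint-on-S3 configuration as a dict.

    Assumption: option names follow the Flink configuration docs;
    the bucket and prefix are placeholders supplied by the caller.
    """
    path = f"{scheme}://{bucket}/{prefix.strip('/')}"
    return {
        "state.checkpoints.dir": path,
        "execution.checkpointing.interval": interval,
    }


def render(conf):
    """Render the dict as one 'key: value' line per entry."""
    return "\n".join(f"{key}: {value}" for key, value in conf.items())


if __name__ == "__main__":
    print(render(checkpoint_config("example-bucket", "flink/checkpoints")))
```

The default scheme here is s3p:// only because the Presto-based plugin is the one the Flink 1.x documentation recommends for checkpointing; pass a different scheme if another filesystem implementation is installed.
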

Recent developments

Latest signals
  • Flink 1.20.4 ships 41 bug fixes; Flink 2.3 is next in line. Per the Apache Flink 1.20.4 release announcement (April 22, 2026), the 1.20 LTS line continues to receive maintenance, with 41 bug fixes and minor improvements in this release. The April 2026 community update flags Flink 2.3 as the next major release, alongside Materialized Tables work, Flink CDC 3.6.0, and a Flink Agents 0.2.1 line that brings agentic-LLM execution into the Flink runtime. The project is broadening from "stream processor" toward "stream-native compute substrate" for AI workloads.
  • Native S3 FileSystem benchmarks at ~2× Presto S3 for checkpoint throughput. Per the Apache Flink wiki benchmark (February 2026), Native S3 checkpoints sustain ~190 MB/s versus ~89 MB/s for Presto S3, roughly a 2× advantage that matters most for stateful jobs with high checkpoint frequency. According to RisingWave's Flink comparison, combining this with ForSt (Flink 2.0's tiered state backend) brings recovery times below 10 seconds even for large stateful jobs, which suggests the structural bet against keeping all state in RAM is paying off.
  • Alibaba's Realtime Compute for Flink adds AIOps and fine-grained permissions (April 20, 2026). Per Alibaba Cloud's release notes, the managed Realtime Compute service now ships AIOps integration, end-to-end observability, and fine-grained permission controls. This productizes Flink along the enterprise-governance lines that Databricks and Snowflake have established on the analytical side. The competitive frame: managed Flink is being repositioned from "compute primitive" to "governed platform tier."
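
To put the throughput figures from the Native S3 benchmark item in concrete terms, the sketch below converts them into checkpoint write times. It assumes a full (non-incremental) checkpoint written at the sustained rate quoted above; the 10,000 MB state size is a hypothetical example, not a benchmark result.

```python
def checkpoint_write_seconds(state_size_mb, throughput_mb_per_s):
    """Time to write a full checkpoint at a sustained throughput."""
    return state_size_mb / throughput_mb_per_s


if __name__ == "__main__":
    state_mb = 10_000  # hypothetical state size
    native = checkpoint_write_seconds(state_mb, 190)  # Native S3 figure
    presto = checkpoint_write_seconds(state_mb, 89)   # Presto S3 figure
    print(f"Native S3: {native:.1f} s")          # about 52.6 s
    print(f"Presto S3: {presto:.1f} s")          # about 112.4 s
    print(f"Speedup: {presto / native:.2f}x")    # about 2.13x
```

The gap matters most when the write time approaches the checkpoint interval, since a checkpoint that takes longer than the interval delays the next one.
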
