Apache Flink
Summary
What it is
A distributed framework for processing data streams in real time, with S3 as checkpoint store, state backend, and output sink.
Where it fits
Flink is the streaming complement to Spark's batch processing. In the S3 world, Flink continuously ingests data into lakehouse tables (Iceberg, Delta) and uses S3 for fault-tolerant checkpointing.
Misconceptions / Traps
- Flink streaming writes to S3 inherently produce small files: each parallel writer rolls at least one file per checkpoint interval. Compaction is mandatory, either through the table format's maintenance procedures or a separate job.
- Flink's S3 support requires careful plugin configuration. Flink ships two filesystem implementations: flink-s3-fs-hadoop (registers s3a://), needed for file sinks, and flink-s3-fs-presto (registers s3p://), recommended for checkpoints; both also claim the generic s3:// scheme. Loading the wrong plugin, or relying on the ambiguous s3:// scheme with both installed, leads to hard-to-diagnose failures.
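A minimal configuration sketch for the filesystem-plugin trap above, assuming the standard Flink plugin layout; the bucket name is a placeholder:

```yaml
# Filesystem plugins are loaded from isolated plugin directories,
# not from the main classpath. Assumed setup steps:
#   mkdir -p $FLINK_HOME/plugins/s3-fs-presto
#   cp $FLINK_HOME/opt/flink-s3-fs-presto-*.jar $FLINK_HOME/plugins/s3-fs-presto/
#
# flink-conf.yaml: use the explicit s3p:// (Presto) scheme for checkpoints,
# and s3a:// (Hadoop) for file sinks, so the intended implementation is
# unambiguous even if both plugins are installed.
state.checkpoints.dir: s3p://my-flink-bucket/checkpoints
# s3.access-key / s3.secret-key can be set here, but IAM roles are preferable.
```

Using the explicit schemes (s3a://, s3p://) rather than plain s3:// makes the plugin choice visible in the configuration itself.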
Key Connections
- used_by: Medallion Architecture, Lakehouse Architecture (streaming data into lakehouse layers)
- constrained_by: Small Files Problem (streaming writes produce many small files)
- scoped_to: S3, Data Lake
Definition
What it is
A distributed framework for processing data streams in real time, with S3 serving as a checkpoint store, state backend, and output sink.
Why it exists
Batch processing alone cannot satisfy requirements for fresh data. Flink enables continuous processing of streaming data, with S3 as the durable layer for checkpoints (fault tolerance) and as the final destination for processed output.
Primary use cases
Real-time data ingestion into S3-backed lakehouses, streaming ETL with S3 sink, checkpoint storage on S3 for fault tolerance.
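The checkpoint-storage use case can be sketched as a flink-conf.yaml fragment; the bucket name and intervals are illustrative assumptions, not defaults:

```yaml
# Fault tolerance via S3: RocksDB holds working state locally,
# while periodic checkpoints and savepoints are persisted durably to S3.
state.backend: rocksdb
state.checkpoints.dir: s3p://my-flink-bucket/checkpoints
state.savepoints.dir: s3p://my-flink-bucket/savepoints
execution.checkpointing.interval: 60s
execution.checkpointing.mode: EXACTLY_ONCE
```

On recovery after a failure, Flink restores operator state from the latest completed checkpoint in S3, which is what makes S3 the durable layer for fault tolerance described above.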
Relationships
Outbound Relationships
constrained_by: Small Files Problem (streaming writes produce many small files)
Resources
Official Apache Flink documentation covering the distributed stream and batch processing framework.
Primary Flink repository with the Java/Scala source for the DataStream API, Table API, and all connectors.
Flink's dedicated S3 filesystem documentation covers S3 configuration for checkpoints, savepoints, and high-availability storage.