Spark Structured Streaming
Apache Spark's stream processing API that enables continuous, micro-batch, or near-real-time ingestion of data streams into S3-backed tables using the same DataFrame/SQL abstractions as batch Spark.
Summary
Spark Structured Streaming is the streaming ingestion layer for Spark-centric lakehouses. It reads from Kafka, Kinesis, or file streams, applies transformations, and writes to Iceberg, Delta, or Hudi tables on S3 using exactly-once semantics via checkpoint state.
- Micro-batch processing is not true event-at-a-time streaming. Each micro-batch trigger (e.g., a 10-second processing-time trigger) adds latency; for consistent sub-second latency, Flink is typically a better fit.
- Checkpoint state is stored on S3 or HDFS. Corrupted or lost checkpoints require manual recovery and may cause data duplication or loss.
- Each micro-batch produces a new set of files on S3. Without compaction, this is a primary source of the small files problem.
- scoped_to: S3, Lakehouse — streaming ingestion into S3-based tables
- depends_on: Apache Spark — the Spark runtime
- enables: Apache Iceberg, Delta Lake — writes streaming data to table formats
- constrained_by: Small Files Problem — micro-batches produce many small files
Definition
Apache Spark's streaming engine that processes continuous data streams using the same DataFrame/Dataset API as batch Spark, with support for writing streaming results directly to Iceberg, Delta, and Hudi tables on S3.
Batch-only pipelines introduce latency between data arrival and data availability. Structured Streaming enables micro-batch or continuous processing that lands results into S3-based lakehouse tables with exactly-once guarantees, bridging the gap between real-time and batch.
Streaming ingestion into Iceberg/Delta tables on S3, real-time ETL, continuous aggregation pipelines writing to object storage.
Resources
Official Apache Spark guide for Structured Streaming, the micro-batch and continuous processing engine for streaming data into S3-based lakehouses.
Iceberg's Spark Structured Streaming integration guide covering streaming writes with exactly-once semantics to Iceberg tables on S3.
Apache Spark source repository with the Structured Streaming module, S3A filesystem integration, and table format connectors.