Spark Structured Streaming

Apache Spark's stream processing API that enables continuous, micro-batch, or near-real-time ingestion of data streams into S3-backed tables using the same DataFrame/SQL abstractions as batch Spark.

Summary

Where it fits

Spark Structured Streaming is the streaming ingestion layer for Spark-centric lakehouses. It reads from Kafka, Kinesis, or file streams, applies transformations, and writes to Iceberg, Delta, or Hudi tables on S3, achieving end-to-end exactly-once semantics through checkpointed source offsets combined with idempotent or transactional sinks.
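A minimal PySpark sketch of that flow, assuming a running Spark session with the Kafka connector and an Iceberg catalog (here called `glue`) already configured; the brokers, topic, table, and checkpoint path are all hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-iceberg").getOrCreate()

# Hypothetical schema for the JSON payload carried in the Kafka value field.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", TimestampType()),
])

# Read from Kafka, parse the value column into typed fields.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Append into an S3-backed Iceberg table; the checkpoint location holds the
# offset and commit state that makes recovery exactly-once.
query = (
    events.writeStream
    .outputMode("append")
    .option("checkpointLocation", "s3://bucket/checkpoints/events")
    .toTable("glue.db.events")
)
```

This is a configuration-style sketch, not a runnable standalone script: it assumes a cluster with the Kafka and Iceberg dependencies on the classpath and an S3-accessible catalog.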

Misconceptions / Traps
  • Micro-batch processing is not true event-at-a-time streaming. Each batch adds scheduling and execution latency, and any configured trigger interval (e.g., every 10 seconds) adds more. For consistent sub-second latency, Flink is typically a better fit.
  • Checkpoint state is stored on S3 or HDFS. Corrupted or lost checkpoints require manual recovery and may cause data duplication or loss.
  • Each micro-batch produces a new set of files on S3. Without compaction, this is a primary source of the small files problem.
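The last point is easy to quantify with back-of-envelope arithmetic (the trigger interval and partition count below are illustrative, not from the source):

```python
SECONDS_PER_DAY = 24 * 60 * 60

def files_per_day(trigger_interval_s: int, output_partitions: int) -> int:
    """Estimate daily file count: each micro-batch writes up to one
    file per output partition (ignoring empty batches)."""
    batches = SECONDS_PER_DAY // trigger_interval_s
    return batches * output_partitions

# A 10-second trigger with Spark's default of 200 shuffle partitions:
print(files_per_day(10, 200))  # → 1728000 files per day before compaction
```

Repartitioning before the write (or scheduled compaction on the table) is the usual mitigation.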
Key Connections
  • scoped_to S3, Lakehouse — streaming ingestion into S3-based tables
  • depends_on Apache Spark — the Spark runtime
  • enables Apache Iceberg, Delta Lake — writes streaming data to table formats
  • constrained_by Small Files Problem — micro-batches produce many small files

Definition

What it is

Apache Spark's streaming engine that processes continuous data streams using the same DataFrame/Dataset API as batch Spark, with support for writing streaming results directly to Iceberg, Delta, and Hudi tables on S3.

Why it exists

Batch-only pipelines introduce latency between data arrival and data availability. Structured Streaming enables micro-batch or continuous processing that lands results into S3-based lakehouse tables with exactly-once guarantees, bridging the gap between real-time and batch.
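The trigger setting is the knob that trades latency against file count and cluster load. A sketch of two common modes, assuming `events` is an existing streaming DataFrame and the table/checkpoint names are placeholders:

```python
# Fixed-cadence micro-batches: latency is roughly the trigger interval
# plus batch execution time.
periodic = (events.writeStream
            .trigger(processingTime="10 seconds")
            .option("checkpointLocation", "s3://bucket/ckpt/periodic")
            .toTable("glue.db.periodic"))

# Process all currently available data, then stop (Spark 3.3+):
# lets an incremental batch job reuse streaming checkpoints.
backfill = (events.writeStream
            .trigger(availableNow=True)
            .option("checkpointLocation", "s3://bucket/ckpt/backfill")
            .toTable("glue.db.backfill"))
```

This is a configuration fragment, not a standalone script; it requires a live Spark session and sinks to run.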

Primary use cases

Streaming ingestion into Iceberg/Delta tables on S3, real-time ETL, continuous aggregation pipelines writing to object storage.
