Spark Structured Streaming

Apache Spark's stream processing API that enables continuous, micro-batch, or near-real-time ingestion of data streams into S3-backed tables using the same DataFrame/SQL abstractions as batch Spark.

Summary

Where it fits

Spark Structured Streaming is the streaming ingestion layer for Spark-centric lakehouses. It reads from Kafka, Kinesis, or file streams, applies transformations, and writes to Iceberg, Delta, or Hudi tables on S3, achieving end-to-end exactly-once semantics through checkpointed source offsets combined with idempotent or transactional sinks.
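A minimal PySpark sketch of that flow, assuming a running Spark session with the Kafka connector and an Iceberg catalog (here called `glue`) already configured; the brokers, topic, table, and checkpoint path are all hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-iceberg").getOrCreate()

# Hypothetical schema for the JSON payload carried in the Kafka value field.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", TimestampType()),
])

# Read from Kafka, parse the value column into typed fields.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Append into an S3-backed Iceberg table; the checkpoint location holds the
# offset and commit state that makes recovery exactly-once.
query = (
    events.writeStream
    .outputMode("append")
    .option("checkpointLocation", "s3://bucket/checkpoints/events")
    .toTable("glue.db.events")
)
```

This is a configuration-style sketch, not a runnable standalone script: it assumes a cluster with the Kafka and Iceberg dependencies on the classpath and an S3-accessible catalog.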

Misconceptions / Traps
  • Micro-batch processing is not true event-at-a-time streaming. Each batch adds scheduling and execution latency, and any configured trigger interval (e.g., every 10 seconds) adds more. For consistent sub-second latency, Flink is typically a better fit.
  • Checkpoint state is stored on S3 or HDFS. Corrupted or lost checkpoints require manual recovery and may cause data duplication or loss.
  • Each micro-batch produces a new set of files on S3. Without compaction, this is a primary source of the small files problem.
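The last point is easy to quantify with back-of-envelope arithmetic (the trigger interval and partition count below are illustrative, not from the source):

```python
SECONDS_PER_DAY = 24 * 60 * 60

def files_per_day(trigger_interval_s: int, output_partitions: int) -> int:
    """Estimate daily file count: each micro-batch writes up to one
    file per output partition (ignoring empty batches)."""
    batches = SECONDS_PER_DAY // trigger_interval_s
    return batches * output_partitions

# A 10-second trigger with Spark's default of 200 shuffle partitions:
print(files_per_day(10, 200))  # → 1728000 files per day before compaction
```

Repartitioning before the write (or scheduled compaction on the table) is the usual mitigation.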
Key Connections
  • scoped_to S3, Lakehouse — streaming ingestion into S3-based tables
  • depends_on Apache Spark — the Spark runtime
  • enables Apache Iceberg, Delta Lake — writes streaming data to table formats
  • constrained_by Small Files Problem — micro-batches produce many small files

Definition

What it is

Apache Spark's streaming engine that processes continuous data streams using the same DataFrame/Dataset API as batch Spark, with support for writing streaming results directly to Iceberg, Delta, and Hudi tables on S3.

Why it exists

Batch-only pipelines introduce latency between data arrival and data availability. Structured Streaming enables micro-batch or continuous processing that lands results into S3-based lakehouse tables with exactly-once guarantees, bridging the gap between real-time and batch.
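The trigger setting is the knob that trades latency against file count and cluster load. A sketch of two common modes, assuming `events` is an existing streaming DataFrame and the table/checkpoint names are placeholders:

```python
# Fixed-cadence micro-batches: latency is roughly the trigger interval
# plus batch execution time.
periodic = (events.writeStream
            .trigger(processingTime="10 seconds")
            .option("checkpointLocation", "s3://bucket/ckpt/periodic")
            .toTable("glue.db.periodic"))

# Process all currently available data, then stop (Spark 3.3+):
# lets an incremental batch job reuse streaming checkpoints.
backfill = (events.writeStream
            .trigger(availableNow=True)
            .option("checkpointLocation", "s3://bucket/ckpt/backfill")
            .toTable("glue.db.backfill"))
```

This is a configuration fragment, not a standalone script; it requires a live Spark session and sinks to run.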

Primary use cases

Streaming ingestion into Iceberg/Delta tables on S3, real-time ETL, continuous aggregation pipelines writing to object storage.
