Spark Structured Streaming
Apache Spark's stream processing API that enables continuous, micro-batch, or near-real-time ingestion of data streams into S3-backed tables using the same DataFrame/SQL abstractions as batch Spark.
Summary
Spark Structured Streaming is the streaming ingestion layer for Spark-centric lakehouses. It reads from Kafka, Kinesis, or file streams, applies transformations, and writes to Iceberg, Delta, or Hudi tables on S3. Exactly-once results depend on three things working together: replayable sources, offsets tracked in checkpoint state, and idempotent or transactional sink commits.
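The exactly-once behavior described above relies on the sink tolerating replays: after a failure, Spark re-runs the uncommitted micro-batch from the checkpointed offsets, so the sink must not apply the same batch twice. The following is a minimal pure-Python simulation of that idea, not Spark code; `IdempotentSink` and its fields are hypothetical names introduced only for illustration.

```python
class IdempotentSink:
    """Toy sink that ignores a micro-batch it has already committed.

    Hypothetical illustration only: real table formats track commits
    in their own transaction metadata rather than an in-memory set.
    """

    def __init__(self):
        self.rows = []
        self.committed_batches = set()

    def write(self, batch_id, rows):
        # A replayed batch carries the same batch_id, so it is skipped.
        if batch_id in self.committed_batches:
            return False
        self.rows.extend(rows)
        self.committed_batches.add(batch_id)
        return True


sink = IdempotentSink()
sink.write(0, ["a", "b"])
sink.write(1, ["c"])
# Simulate a restart that replays batch 1 from the checkpointed offsets.
replayed = sink.write(1, ["c"])
```

After the replay, `sink.rows` still holds each record once. If the checkpoint itself is lost, the batch ids restart and this protection no longer applies, which is why lost checkpoints can lead to duplication.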
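- Micro-batch processing is not true event-at-a-time streaming. Each micro-batch carries scheduling and commit overhead, and a configured processing-time trigger (e.g., every 10 seconds) adds up to that interval of delay before new data is picked up. Without an explicit trigger, Spark starts the next micro-batch as soon as the previous one finishes. For sub-second latency, Flink has typically been a better fit.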
- Checkpoint state is stored on S3 or HDFS. Corrupted or lost checkpoints require manual recovery and may cause data duplication or loss.
- Each micro-batch produces a new set of files on S3. Without compaction, this is a primary source of the small files problem.
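To make the small-files concern concrete, the arithmetic can be sketched as below. This is a rough estimate under stated assumptions: every trigger finds new data, and each micro-batch writes a fixed number of files (for example, one per output task). The function name and the figure of 8 files per batch are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def files_per_day(trigger_seconds: int, files_per_batch: int) -> int:
    """Estimate files written per day when every micro-batch has data."""
    batches_per_day = SECONDS_PER_DAY // trigger_seconds
    return batches_per_day * files_per_batch


# Assumed workload: 8 output files per micro-batch.
ten_second_trigger = files_per_day(10, 8)   # 8640 batches * 8 files
one_minute_trigger = files_per_day(60, 8)   # 1440 batches * 8 files
```

Under these assumptions a 10-second trigger yields 69,120 files per day and a 1-minute trigger yields 11,520, so a longer trigger interval trades latency for fewer files, and compaction remains necessary either way.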
scoped_to: S3, Lakehouse (streaming ingestion into S3-based tables)
depends_on: Apache Spark (the Spark runtime)
enables: Apache Iceberg, Delta Lake (writes streaming data to table formats)
constrained_by: Small Files Problem (micro-batches produce many small files)
Definition
Apache Spark's streaming engine that processes continuous data streams using the same DataFrame/Dataset API as batch Spark, with support for writing streaming results directly to Iceberg, Delta, and Hudi tables on S3.
Batch-only pipelines introduce latency between data arrival and data availability. Structured Streaming enables micro-batch or continuous processing that lands results into S3-based lakehouse tables with exactly-once guarantees, bridging the gap between real-time and batch.
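A minimal PySpark sketch of such a pipeline is shown below, assuming a Kafka source and an Iceberg sink. The broker address, topic, table name, and checkpoint path are placeholders, the function names are hypothetical, and a `SparkSession` already configured with the Kafka and Iceberg connectors is assumed. PySpark is imported lazily so the option helper can be used without a Spark installation.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Build reader options for Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_ingest(spark, options: Dict[str, str], table: str, checkpoint: str):
    """Start a streaming query that appends Kafka records to a table.

    Assumes `spark` has the Kafka and Iceberg packages on its classpath.
    """
    from pyspark.sql.functions import col  # lazy import: needs PySpark

    events = (
        spark.readStream.format("kafka")
        .options(**options)
        .load()
        # Kafka delivers key and value as binary; cast them for storage.
        .select(
            col("key").cast("string").alias("key"),
            col("value").cast("string").alias("value"),
            col("timestamp"),
        )
    )
    return (
        events.writeStream.format("iceberg")
        .outputMode("append")
        .trigger(processingTime="1 minute")
        # Offsets and commit progress are tracked under this path.
        .option("checkpointLocation", checkpoint)
        .toTable(table)
    )


# Placeholder values for illustration only.
opts = kafka_source_options("broker:9092", "events")
```

A call such as `start_ingest(spark, opts, "db.events", "s3a://bucket/checkpoints/events")` would return a `StreamingQuery`. The checkpoint location must stay stable across restarts for the query to resume from its recorded offsets.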
Typical use cases include streaming ingestion into Iceberg/Delta tables on S3, real-time ETL, and continuous aggregation pipelines that write to object storage.
Recent developments
- Real-Time Mode reaches GA on Databricks — sub-second end-to-end latency. Per the Databricks announcement (March 19, 2026), Spark Structured Streaming's Real-Time Mode (RTM) is now generally available, delivering millisecond-tier processing latency rather than the historical micro-batch trigger floor. Public-Preview origins go back to August 2025.
- End-to-end benchmark: Spark RTM 92% faster than Flink on feature computation. Per the databricks-solutions/latency-benchmarks repo, the same workload posts Spark RTM p99 ≈14ms vs Flink p99 ≈45ms on Query B, with both engines effectively tied (~1-3ms) on stateless Query A. The takeaway drawn from these results is that the historical "Spark for batch, Flink for streaming" split now has a credible Spark answer at the millisecond tier, which reduces the operational pressure to run two streaming engines.
Resources
Official Apache Spark guide for Structured Streaming, the micro-batch and continuous processing engine for streaming data into S3-based lakehouses.
Iceberg's Spark Structured Streaming integration guide covering streaming writes with exactly-once semantics to Iceberg tables on S3.
Apache Spark source repository with the Structured Streaming module, S3A filesystem integration, and table format connectors.