Architecture

Batch vs Streaming

The architectural decision between processing S3 data in periodic batch jobs (hourly/daily) versus continuous streaming ingestion, and the tradeoffs each approach introduces for latency, cost, complexity, and file organization.

Summary

What it is

Where it fits

Batch vs streaming is the fundamental ingestion architecture choice for S3-based lakehouses. Batch produces larger, well-sized files but with higher latency. Streaming produces fresher data but generates many small files requiring compaction. Most production lakehouses use a hybrid approach.
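The file-size tradeoff above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the ingest rate, the writer count, and the flush intervals are all hypothetical values chosen for the example, and it assumes every writer flushes exactly one file per interval.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def small_file_profile(throughput_mb_per_s: float,
                       flush_interval_s: float,
                       writers: int = 1) -> dict:
    """Estimate daily file count and average file size for a flush interval.

    Assumes (hypothetically) that each writer emits exactly one file per
    flush interval and that data is spread evenly across writers.
    """
    files_per_day = writers * SECONDS_PER_DAY / flush_interval_s
    avg_file_mb = throughput_mb_per_s * flush_interval_s / writers
    return {"files_per_day": files_per_day, "avg_file_mb": avg_file_mb}


# Same ingest rate, two cadences (all numbers are made up for illustration).
streaming = small_file_profile(throughput_mb_per_s=0.5, flush_interval_s=10, writers=2)
batch = small_file_profile(throughput_mb_per_s=0.5, flush_interval_s=3600, writers=2)

print(streaming)  # many small files per day
print(batch)      # few large files per day
```

Under these assumptions the 10-second cadence yields files of a few megabytes in the tens of thousands per day, while the hourly cadence yields a few dozen files of several hundred megabytes each.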

Misconceptions / Traps

"Real-time" streaming into S3 is constrained by S3's eventual consistency for overwrite scenarios and by the minimum practical file size. Sub-second latency to S3 is achievable but creates extreme small file problems.
Batch is not inherently cheaper. Large batch jobs that scan terabytes on a scheduled cadence may cost more than a steady stream of small writes, depending on compute pricing.
The choice is not binary. Hybrid architectures (streaming for ingestion, batch for compaction and aggregation) are the norm in mature lakehouses.
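The point that batch is not inherently cheaper can be checked with a rough daily cost model. This is a minimal sketch, not a pricing calculator: the node counts, run durations, per-hour compute price, and per-1,000-PUT price below are hypothetical placeholders, and the model ignores storage, reads, and compaction compute.

```python
from dataclasses import dataclass


@dataclass
class DailyCost:
    compute: float   # cost of compute node-hours per day
    requests: float  # cost of S3 PUT requests per day

    @property
    def total(self) -> float:
        return self.compute + self.requests


def daily_cost(node_hours: float, price_per_node_hour: float,
               put_requests: float, price_per_1k_puts: float) -> DailyCost:
    """Combine compute and request costs for one day of ingestion."""
    return DailyCost(
        compute=node_hours * price_per_node_hour,
        requests=put_requests / 1000 * price_per_1k_puts,
    )


# Hypothetical batch job: 24 hourly runs, 0.5 h each, on 20 nodes.
batch_cost = daily_cost(node_hours=24 * 0.5 * 20, price_per_node_hour=0.40,
                        put_requests=48, price_per_1k_puts=0.005)

# Hypothetical streaming job: 4 nodes running all day, many small writes.
streaming_cost = daily_cost(node_hours=24 * 4, price_per_node_hour=0.40,
                            put_requests=17280, price_per_1k_puts=0.005)

print(f"batch total:     {batch_cost.total:.2f}")
print(f"streaming total: {streaming_cost.total:.2f}")
```

With these made-up inputs the batch pipeline costs more per day because compute dominates and request charges are negligible. Different cluster sizes or prices can flip the result, which is why the comparison has to be run with real numbers.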

Key Connections

scoped_to Lakehouse, S3 — ingestion architecture for S3-based data
constrains Small Files Problem — streaming creates small files; batch creates large files
relates_to Compaction — streaming requires compaction to maintain file sizes
relates_to Event-Driven Ingestion — streaming is one form of event-driven architecture
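The link between streaming and compaction can be sketched as a planning step that groups small files into rewrite jobs near a target size. The code below is a simplified greedy grouping written for illustration; it is not the algorithm of any particular table format or engine, and `plan_compaction` and the 512 MB target are hypothetical.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[float],
                    target_mb: float = 512.0) -> List[List[float]]:
    """Greedily group files so each group's total stays at or under target_mb.

    A file larger than target_mb still gets a group of its own.
    """
    groups: List[List[float]] = []
    current: List[float] = []
    current_size = 0.0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current group when the next file would overflow it.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 300 streaming-written files of 5 MB each (hypothetical input).
groups = plan_compaction([5.0] * 300, target_mb=512.0)
for i, group in enumerate(groups):
    print(f"job {i}: {len(group)} files, {sum(group):.0f} MB")
```

Each group would then be rewritten as a single larger file by a scheduled batch job, which is the batch half of the hybrid pattern.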

Definition

What it is

The architectural decision between processing S3-stored data in scheduled batch intervals versus continuous streaming, and the hybrid patterns (micro-batch, lambda, kappa) that combine both approaches.

Why it exists

Batch processing is simpler to operate and often cheaper, but it introduces latency. Streaming provides freshness but adds complexity and tends to produce small files on S3 that then need compaction. The choice depends on business latency requirements, cost constraints, and operational maturity.

Primary use cases

Choosing ingestion cadence for lakehouse tables, designing hybrid pipelines with streaming ingestion and batch compaction, evaluating cost-latency tradeoffs.

Recent developments

Latest signals

"2026 is the year real-time CDC replaces batch ETL" — IOMETE 2026 framing. Streaming-first lakehouse architecture (Kafka + CDC + Iceberg) crosses the production-default threshold in 2026; batch ETL retreats to the long-tail of low-priority workloads. Per IOMETE — Streaming-First Lakehouse Architecture: Why 2026 Is the Year Real-Time CDC Replaces Batch ETL.
Iceberg + Kafka + Flink is the canonical unified batch-stream architecture. Single Iceberg table accessible to both batch (Spark/Trino nightly) and streaming (Flink real-time) jobs — eliminates data duplication + lets organizations stop maintaining parallel batch + stream pipelines. Per Kai Waehner — Data Streaming Meets Lakehouse: Apache Iceberg for Unified Real-Time + Batch Analytics.
Flink wins low-latency + checkpointing + failure-recovery + batch-stream unification. Sub-millisecond latency, millions of events/sec, event-driven + incremental-state-snapshots architecture — Flink is the recommended engine for unified workloads where latency matters. Per Onehouse — Apache Flink vs Kafka Streams vs Spark Structured Streaming.
Spark Structured Streaming + Delta Lake medallion is the developer-experience winner. Spark SQL remains the most-widely-adopted big-data SQL dialect; Delta Lake's medallion (bronze/silver/gold) pattern is the canonical streaming-medallion architecture for Spark-centric shops. Per Confluent — Spark Streaming vs Apache Flink.
Apache Fluss formalizes streaming + lakehouse unification. Fluss (incubating) is the 2026 effort to unify streaming storage + lakehouse storage at the substrate level — designed from scratch for the unified architecture rather than bolting streaming onto a lakehouse. Per Apache Fluss — Towards a Unified Streaming + Lakehouse Architecture.
Flink Materialized Tables: declarative unified stream + batch ETL. Alibaba's Flink Materialized Table feature lets practitioners declare materialized views once + Flink runs them as either batch or streaming based on freshness SLA — the architectural decision "batch or streaming?" becomes a property of the SLA, not the pipeline. Per Alibaba Cloud — Flink Materialized Table: Building Unified Stream + Batch ETL.
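The idea that batch versus streaming becomes a property of the freshness SLA can be sketched as a small selection function. This is an illustration of the concept only, not Flink's API or its actual selection logic: `choose_refresh_mode` and the 30-minute cutoff are assumptions made for the example.

```python
from datetime import timedelta

# Hypothetical cutoff: freshness targets tighter than this run as streaming.
STREAMING_THRESHOLD = timedelta(minutes=30)


def choose_refresh_mode(freshness: timedelta,
                        threshold: timedelta = STREAMING_THRESHOLD) -> str:
    """Pick an execution mode from the declared freshness target."""
    if freshness <= timedelta(0):
        raise ValueError("freshness must be a positive duration")
    return "streaming" if freshness < threshold else "batch"


# The same table definition lands in different modes depending on its SLA.
print(choose_refresh_mode(timedelta(minutes=3)))  # streaming
print(choose_refresh_mode(timedelta(hours=6)))    # batch
```

The design point is that the pipeline author declares only the freshness target, and the platform, not the pipeline code, decides how to meet it.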