Batch vs Streaming

Summary

What it is

The architectural decision between processing S3 data in periodic batch jobs (hourly/daily) versus continuous streaming ingestion, and the tradeoffs each approach introduces for latency, cost, complexity, and file organization.

Where it fits

Batch vs streaming is the fundamental ingestion architecture choice for S3-based lakehouses. Batch produces larger, well-sized files but with higher latency. Streaming produces fresher data but generates many small files requiring compaction. Most production lakehouses use a hybrid approach.
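
The file-count side of this tradeoff is simple arithmetic. The sketch below is illustrative only: the 2 MB/s ingest rate, the flush intervals, and the function names are assumptions, not measurements from any real pipeline.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def files_per_day(flush_interval_s: float, writers: int = 1) -> int:
    """Objects written per day if every writer flushes one file per interval."""
    return int(SECONDS_PER_DAY / flush_interval_s) * writers


def mb_per_flush(ingest_mb_per_s: float, flush_interval_s: float, writers: int = 1) -> float:
    """Data each writer accumulates between flushes, assuming an even spread."""
    return ingest_mb_per_s * flush_interval_s / writers


if __name__ == "__main__":
    rate = 2.0  # MB/s, hypothetical ingest rate
    cadences = [("streaming, 10 s flush", 10), ("micro-batch, 5 min", 300), ("hourly batch", 3600)]
    for label, interval in cadences:
        print(label, files_per_day(interval), "flushes/day,", mb_per_flush(rate, interval), "MB each")
```

Under these assumptions a 10-second flush writes 8,640 objects of 20 MB per day, while an hourly job accumulates 7,200 MB per run, which it can split into a handful of well-sized files.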

Misconceptions / Traps
  • "Real-time" streaming into S3 is constrained by per-request overhead and the minimum practical file size, not by consistency: S3 has provided strong read-after-write consistency, including for overwrites, since December 2020. Sub-second latency to S3 is achievable but creates an extreme small-files problem.
  • Batch is not inherently cheaper. Large batch jobs that scan terabytes on a scheduled cadence may cost more than a steady stream of small writes, depending on compute pricing.
  • The choice is not binary. Hybrid architectures (streaming for ingestion, batch for compaction and aggregation) are the norm in mature lakehouses.
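
The batch half of a hybrid design is often a compaction job that rewrites many small streamed files into fewer large ones. The following is a minimal sketch of the planning step only, using a greedy first-fit grouping. The 128 MB target and the name plan_compaction are assumptions for illustration; table formats and engines ship their own compaction planners, which this does not reproduce.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[float], target_mb: float = 128.0) -> List[List[float]]:
    """Group undersized files so each group's total stays at or below target_mb.

    Greedy first-fit over files sorted largest first. Files already at or
    above the target are skipped because rewriting them gains nothing.
    """
    groups: List[List[float]] = []
    totals: List[float] = []
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already well-sized, leave in place
        for i, total in enumerate(totals):
            if total + size <= target_mb:
                groups[i].append(size)
                totals[i] += size
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
            totals.append(size)
    return groups
```

Each returned group would be read and rewritten as a single output file. A group containing only one file is not worth rewriting and could be filtered out by the caller.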

Key Connections
  • scoped_to Lakehouse, S3 — ingestion architecture for S3-based data
  • constrains Small Files Problem — streaming creates small files; batch creates large files
  • relates_to Compaction — streaming requires compaction to maintain file sizes
  • relates_to Event-Driven Ingestion — streaming is one form of event-driven architecture

Definition

What it is

The architectural decision between processing S3-stored data in scheduled batch intervals versus continuous streaming, and the hybrid patterns (micro-batch, lambda, kappa) that combine both approaches.
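
Of the hybrid patterns, micro-batch is the easiest to show in a few lines: records arrive continuously but are written as one object once a size or age limit is reached. The class below is a hedged sketch. MicroBatchBuffer and its parameters are hypothetical names, and the actual upload to S3 is deliberately left out.

```python
from typing import List, Optional


class MicroBatchBuffer:
    """Accumulates records and releases them as one batch on a size or age limit."""

    def __init__(self, max_bytes: int, max_age_s: float) -> None:
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self._records: List[bytes] = []
        self._bytes = 0
        self._first_ts: Optional[float] = None  # arrival time of the oldest buffered record

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer one record; return the whole batch if a limit was reached, else None."""
        if self._first_ts is None:
            self._first_ts = now
        self._records.append(record)
        self._bytes += len(record)
        too_big = self._bytes >= self.max_bytes
        too_old = now - self._first_ts >= self.max_age_s
        if too_big or too_old:
            return self.flush()
        return None

    def flush(self) -> List[bytes]:
        """Hand back everything buffered so far and reset the buffer."""
        batch = self._records
        self._records, self._bytes, self._first_ts = [], 0, None
        return batch
```

The caller would write each returned batch as a single object. Note that the age limit is only checked when a record arrives, so a real writer would also need a timer to flush an idle buffer.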

Why it exists

Batch processing is usually simpler to operate and often, though not always, cheaper, but it introduces latency. Streaming provides freshness but adds operational complexity and tends to produce small files on S3 that later need compaction. The choice depends on business latency requirements, cost constraints, and operational maturity.

Primary use cases

Choosing ingestion cadence for lakehouse tables, designing hybrid pipelines with streaming ingestion and batch compaction, evaluating cost-latency tradeoffs.
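
One way to reason about cadence is to ask whether a well-sized file can fill up within the latency requirement. The sketch below encodes that single check. CadencePlan, plan_cadence, and the 128 MB default target are assumptions made for illustration, and the model ignores cost and partitioning entirely.

```python
from dataclasses import dataclass


@dataclass
class CadencePlan:
    flush_interval_s: float
    expected_file_mb: float
    needs_compaction: bool


def plan_cadence(latency_sla_s: float, ingest_mb_per_s: float,
                 target_file_mb: float = 128.0) -> CadencePlan:
    """Pick a flush interval that meets the latency SLA and report the file size it yields."""
    if latency_sla_s <= 0 or ingest_mb_per_s <= 0:
        raise ValueError("latency_sla_s and ingest_mb_per_s must be positive")
    # Time needed to accumulate one target-sized file at the given ingest rate.
    fill_time_s = target_file_mb / ingest_mb_per_s
    if fill_time_s <= latency_sla_s:
        # A full-sized file is ready before the SLA expires: flush when full.
        return CadencePlan(fill_time_s, target_file_mb, needs_compaction=False)
    # The SLA forces a flush before the file is full: expect undersized files.
    return CadencePlan(latency_sla_s, ingest_mb_per_s * latency_sla_s, needs_compaction=True)
```

For example, at a hypothetical 2 MB/s a 10-second SLA yields 20 MB files and flags compaction, whereas an hourly SLA lets each file reach the full target size before it is written.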
