Apache Spark
A distributed compute engine for large-scale data processing — batch ETL, streaming, SQL, and machine learning — over S3-stored data.
Summary
Spark is the workhorse of the S3 data ecosystem. It is the primary engine for building and maintaining lakehouse tables (Iceberg, Delta, Hudi), running ETL pipelines, and processing data at petabyte scale.
- Spark's S3 access goes through the Hadoop S3A connector, not a native S3 client. S3A configuration (committers, credential providers, connection pooling) is a common source of operational issues.
- Spark produces small files by default when writing with high parallelism. Use coalesce, repartition, or table format compaction to control output file sizes.
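The file-size point above can be made concrete with a little arithmetic. The sketch below picks a partition count from an estimated output size and a target file size, then applies coalesce or repartition before writing. The 256 MB target, the helper names, and the idea of passing in an estimated output size are illustrative assumptions, not Spark defaults.

```python
import math

# Assumed target file size; a few hundred MB is a common rule of thumb,
# but the right value depends on the table format and query engine.
TARGET_FILE_BYTES = 256 * 1024 * 1024


def target_partition_count(total_bytes: int,
                           target_file_bytes: int = TARGET_FILE_BYTES) -> int:
    """Return how many output partitions give files of about target_file_bytes."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_with_bounded_files(df, output_path: str, estimated_bytes: int) -> None:
    """Write a PySpark DataFrame as Parquet with an explicit partition count.

    estimated_bytes is the caller's guess at the output size (hypothetical input;
    Spark does not supply it here). Uses df.rdd, so this targets a classic
    (non-Connect) session.
    """
    current = df.rdd.getNumPartitions()
    wanted = target_partition_count(estimated_bytes)
    # coalesce() merges partitions without a full shuffle but cannot add any;
    # repartition() shuffles, can raise the count, and evens out sizes.
    sized = df.coalesce(wanted) if wanted < current else df.repartition(wanted)
    sized.write.parquet(output_path)
```

Without a partitionBy clause, each partition is written by one task as one file, so the partition count is effectively the file count. For tables managed by Iceberg, Delta, or Hudi, the format's own compaction is usually the better long-term control.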
used_by: Lakehouse Architecture, Medallion Architecture (the primary compute engine)
constrained_by: Small Files Problem (high parallelism produces many small output files)
scoped_to: S3, Data Lake
Definition
A distributed compute engine for large-scale data processing, supporting batch ETL, streaming, SQL, and machine learning workloads over S3-stored data.
Single-machine processing cannot handle petabyte-scale data. Spark distributes computation across clusters while reading from and writing to S3, making it the workhorse of most data lake and lakehouse architectures.
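Because that S3 access goes through the Hadoop S3A connector, session setup for S3 workloads mostly comes down to a handful of S3A settings. The following is a minimal sketch, not a recommended production configuration: the helper names and the default values (200 connections, the magic committer) are assumptions, and the committer classes shown require Spark's optional spark-hadoop-cloud module on the classpath.

```python
def s3a_settings(credentials_provider: str,
                 committer: str = "magic",
                 max_connections: int = 200) -> dict:
    """Build a dict of Spark settings for S3A access (illustrative values only).

    credentials_provider is a fully qualified class name; which one applies
    depends on the Hadoop and AWS SDK versions in use, so it is left to the caller.
    """
    if committer not in {"magic", "directory", "partitioned"}:
        raise ValueError(f"unknown S3A committer: {committer}")
    return {
        # How S3A obtains AWS credentials.
        "spark.hadoop.fs.s3a.aws.credentials.provider": credentials_provider,
        # Size of the HTTP connection pool shared by S3A clients.
        "spark.hadoop.fs.s3a.connection.maximum": str(max_connections),
        # S3A committer used instead of rename-based commits.
        "spark.hadoop.fs.s3a.committer.name": committer,
        # Bindings that route Spark SQL writes through the S3A committer.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }


def build_session(app_name: str, settings: dict):
    """Create a SparkSession with the given settings (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import: only needed at call time

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the settings in a plain dict makes them easy to review and diff, which helps because committer and credential misconfiguration tends to surface only at write time.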
Typical uses include batch ETL pipelines on S3 data, lakehouse data transformations, large-scale ML feature engineering, and streaming data into S3 via Structured Streaming.
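For the Structured Streaming case, a minimal sketch of a Parquet sink on S3 might look like the following. The function name, the paths, and the one-minute trigger are hypothetical; the only firm requirement is that a file sink needs a checkpoint location.

```python
def start_s3_parquet_sink(stream_df,
                          output_path: str,
                          checkpoint_path: str,
                          trigger_interval: str = "1 minute"):
    """Start a streaming query that appends Parquet files under output_path.

    stream_df is a streaming PySpark DataFrame (for example, one read from Kafka).
    Both paths would normally be s3a:// URIs.
    """
    return (
        stream_df.writeStream
        .format("parquet")
        .outputMode("append")  # file sinks only support append
        .option("path", output_path)
        # Checkpoints track progress so a restart resumes where it left off.
        .option("checkpointLocation", checkpoint_path)
        # Each micro-batch writes new files, so a longer interval means
        # fewer, larger files at the cost of higher latency.
        .trigger(processingTime=trigger_interval)
        .start()
    )
```

The trigger interval is the main lever here: short intervals lower latency but feed the small-files problem, so streaming sinks are usually paired with periodic compaction.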
Recent developments
- Spark 4.0.0, the flagship release, shipped in May 2025. The 4.0 line is the largest Spark release in years. Headline additions: Spark Connect (a client-server protocol that decouples client code from cluster JVMs, so clients can run Spark workloads from Python, Go, Swift, or Rust without packaging the full Spark JAR), a redesigned 1.5 MB Python client (down from the historical 200+ MB), SQL scripting with stored-procedure-style control flow, pipe syntax (the |> operator) for chainable SQL operations (think Unix pipes for SQL), and the VARIANT data type for semi-structured, JSON-shaped fields. The new Go, Swift, and Rust clients extend Spark beyond its long-standing Scala, Java, Python, and R APIs and signal a serious commitment to Spark Connect as the long-term integration surface.
- Spark 3.5.8 maintenance release (January 15, 2026). Bug fixes + security patches on the 3.5 LTS line. Operators not ready to upgrade to 4.0 should pin to 3.5.8 — it's the current stable point for the LTS branch.
- Spark 4.2 Real-Time Mode (RTM): sub-100ms streaming latency. A major streaming-side investment: RTM in 4.2 targets O(100ms) end-to-end latency for Kafka-sourced streams, closing most of the gap to dedicated streaming engines like Flink. Practical implication: shops that previously ran two engines (Spark for batch, Flink for streaming) can consolidate on Spark for many use cases.
- Spark 4.2 vs Flink 1.16: Flink still wins on Delta Lake streaming. A head-to-head benchmark measured 0.81s end-to-end latency on Flink versus 7.15s on Spark for CDC-shaped streaming workloads into Delta Lake. Spark's RTM closed most of the general-purpose streaming gap, but for Delta Lake streaming under high commit frequency, Flink remains meaningfully faster. Stay on Flink when CDC into Delta Lake is the dominant write pattern. Move to Spark when running batch, ML, and streaming on a single engine is more valuable than pure streaming latency.
- Snowpark Connect for Spark v1.24.0 (April 2026). Snowflake's actively maintained Spark adapter lets existing Spark client code run on Snowflake's engine via Spark Connect. Strategic implication: Snowflake is hedging on Spark as an API surface the same way it hedged on Iceberg as a table format. The rivalry-with-Databricks framing is incomplete: both are layering on the other's primitives faster than either is walking away.
- Industry adoption — Spark is the engine layer of choice. Per Gartner Peer Insights (4.5/5 from 47 reviews), enterprise praise focuses on deployment ease, SQL compatibility, in-memory processing, and real-time capabilities; the most-cited operational complaints are high memory consumption, Python performance issues (Spark Connect helps on this dimension), and slowdown with small files (the recurring lakehouse pain point). Per the 2025-2026 Iceberg ecosystem survey, Spark commands 96.4% engine adoption for Iceberg workloads — with Trino at 60.7% and DuckDB / Flink rising. Spark is structurally the lakehouse default; the version question is 4.0 vs 3.5.x LTS, not "should we use Spark."
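Three of the Spark 4.0 additions listed above (Spark Connect, pipe syntax, and VARIANT) can be sketched together. This is an illustration under stated assumptions: the orders table and its columns are hypothetical, and the remote URL is whatever address a Spark Connect server is listening on (sc://localhost:15002 is the conventional local default).

```python
# Pipe syntax: each |> step consumes the result of the previous one.
# The `orders` table and its columns are hypothetical.
PIPE_QUERY = """
FROM orders
|> WHERE status = 'shipped'
|> AGGREGATE SUM(amount) AS total GROUP BY customer_id
|> ORDER BY total DESC
|> LIMIT 10
"""

# VARIANT: parse a JSON string once, then extract typed fields by path.
VARIANT_QUERY = """
SELECT variant_get(parse_json('{"user": {"id": 42}}'), '$.user.id', 'int') AS user_id
"""


def run_over_connect(remote_url: str, query: str):
    """Run a SQL query through a Spark Connect server and return the rows.

    Requires a Spark 4.x Python client and a reachable Spark Connect
    endpoint, for example "sc://localhost:15002".
    """
    from pyspark.sql import SparkSession  # lazy import: only needed at call time

    spark = SparkSession.builder.remote(remote_url).getOrCreate()
    try:
        return spark.sql(query).collect()
    finally:
        spark.stop()
```

A call such as run_over_connect("sc://localhost:15002", VARIANT_QUERY) needs only the thin client on the calling machine; query planning and execution happen on the server side.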
Resources
Official Apache Spark documentation covering the unified analytics engine for large-scale data processing.
Primary Spark repository with the full source for Spark SQL, Structured Streaming, MLlib, and all data source connectors.
Spark's cloud integration guide covers S3A connector configuration, credential providers, and performance tuning for S3-based workloads.
Spark uses Hadoop's S3A connector under the hood; this is the authoritative reference for S3 access configuration, committers, and troubleshooting.