Apache Spark
Summary
What it is
A distributed compute engine for large-scale data processing — batch ETL, streaming, SQL, and machine learning — over S3-stored data.
Where it fits
Spark is the workhorse of the S3 data ecosystem. It is the primary engine for building and maintaining lakehouse tables (Iceberg, Delta, Hudi), running ETL pipelines, and processing data at petabyte scale.
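A minimal sketch of that role, assuming PySpark with an Iceberg catalog already configured for the cluster; the bucket, catalog, and table names are hypothetical:

```python
from pyspark.sql import SparkSession

# Hypothetical bucket and table names; assumes an Iceberg catalog named "lake"
# is already configured for the cluster (spark.sql.catalog.lake and related settings).
spark = SparkSession.builder.appName("lakehouse-etl").getOrCreate()

# Read raw JSON events landed in S3.
raw = spark.read.json("s3a://example-raw-bucket/events/2024/")

# Clean and append into an Iceberg lakehouse table backed by S3.
(raw.filter("event_type IS NOT NULL")
    .writeTo("lake.analytics.events")
    .append())
```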
Misconceptions / Traps
- Spark's S3 access goes through the Hadoop S3A connector, not a native S3 client. S3A configuration (committers, credential providers, connection pooling) is a common source of operational issues; see the configuration sketch after this list.
- Spark writes one output file per task by default, so high-parallelism writes produce many small files. Use coalesce, repartition, or table-format compaction to control output file sizes; a write sketch follows below.
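A hedged sketch of S3A-related settings commonly set on a Spark session. The values are illustrative, not tuned recommendations; the magic committer requires a Spark build with the hadoop-cloud module, and the credential provider class names depend on the Hadoop and AWS SDK versions in use:

```python
from pyspark.sql import SparkSession

# Illustrative S3A settings; values are placeholders, not recommendations.
spark = (SparkSession.builder
    .appName("s3a-config-example")
    # Use the S3A "magic" committer instead of rename-based commits.
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    # Credential provider chain (class names vary by Hadoop/AWS SDK version).
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.EnvironmentVariableCredentialsProvider,"
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    # Larger connection pool for heavy S3 traffic.
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .getOrCreate())
```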
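And a companion sketch of the small-file mitigation from the second item, with illustrative paths and partition counts:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each write task emits its own file, so a 2,000-task write yields roughly 2,000 files.
df = spark.read.parquet("s3a://example-raw-bucket/clicks/")

# repartition() shuffles into a fixed number of larger partitions before the write.
(df.repartition(64)
   .write.mode("overwrite")
   .parquet("s3a://example-curated-bucket/clicks/"))

# coalesce() avoids a full shuffle but can only reduce the partition count.
(df.coalesce(16)
   .write.mode("append")
   .parquet("s3a://example-curated-bucket/clicks_small/"))
```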
Key Connections
- used_by: Lakehouse Architecture, Medallion Architecture (the primary compute engine)
- constrained_by: Small Files Problem (high parallelism produces many small output files)
- scoped_to: S3, Data Lake
Definition
What it is
A distributed compute engine for large-scale data processing, supporting batch ETL, streaming, SQL, and machine learning workloads over S3-stored data.
Why it exists
Single-machine processing cannot handle petabyte-scale data. Spark distributes computation across clusters while reading from and writing to S3, making it the workhorse of most data lake and lakehouse architectures.
Primary use cases
Batch ETL pipelines on S3 data, lakehouse data transformations, large-scale ML feature engineering, streaming data into S3 via Structured Streaming.
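A minimal sketch of the last use case, streaming into S3 via Structured Streaming. It assumes the spark-sql-kafka package is on the classpath; the broker, topic, bucket paths, and trigger interval are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-s3").getOrCreate()

# Read a Kafka topic (hypothetical broker and topic names).
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load())

# Write micro-batches to S3 as Parquet; checkpoints also land in S3.
query = (events.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream
    .format("parquet")
    .option("path", "s3a://example-stream-bucket/events/")
    .option("checkpointLocation", "s3a://example-stream-bucket/_checkpoints/events/")
    .trigger(processingTime="1 minute")
    .start())

query.awaitTermination()
```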
Relationships
Outbound Relationships
- constrained_by: Small Files Problem (high parallelism produces many small output files)
Inbound Relationships
- used_by: Lakehouse Architecture, Medallion Architecture (the primary compute engine)
Resources
- Official Apache Spark documentation covering the unified analytics engine for large-scale data processing.
- Primary Spark repository with the full source for Spark SQL, Structured Streaming, MLlib, and all data source connectors.
- Spark's cloud integration guide covering S3A connector configuration, credential providers, and performance tuning for S3-based workloads.
- Hadoop S3A documentation: Spark uses Hadoop's S3A connector under the hood, and this is the authoritative reference for S3 access configuration, committers, and troubleshooting.