Guide 1

How S3 Shapes Lakehouse Design

Problem Framing

Every lakehouse architecture sits on object storage — almost always S3 or an S3-compatible store. But S3 is not a database, and its constraints fundamentally shape how lakehouses are designed. Engineers building lakehouses need to understand which S3 behaviors are features, which are limitations, and how table formats work around both.

Relevant Nodes

  • Topics: S3, Object Storage, Lakehouse, Table Formats
  • Technologies: AWS S3, MinIO, Ceph, Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Trino, DuckDB, ClickHouse, StarRocks
  • Standards: S3 API, Apache Parquet, Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec
  • Architectures: Lakehouse Architecture, Separation of Storage and Compute, Medallion Architecture
  • Pain Points: Lack of Atomic Rename, Cold Scan Latency, Small Files Problem, Metadata Overhead at Scale, Object Listing Performance

Decision Path

  1. Choose your S3 layer. AWS S3 for managed convenience, MinIO for self-hosted control, Ceph for unified block, file, and object storage. This choice determines the consistency model, available features, and egress economics, while the client API stays essentially the same (see the boto3 sketch after this decision path).

  2. Choose a table format. This is the most consequential decision (an Iceberg setup sketch follows the decision path):

    • Iceberg if you need multi-engine access (Spark + Trino + Flink reading the same tables), hidden partitioning, and broad community adoption.
    • Delta Lake if you are in the Databricks ecosystem and want tight Spark integration with streaming+batch unification.
    • Hudi if your primary workload is CDC ingestion with record-level upserts.
    • All three use Parquet as the data file format; they differ in metadata structure, commit protocol, and partition management.

  3. Understand the S3 constraints you are inheriting (the first is illustrated by the commit sketch below):

    • No atomic rename → table commits require workarounds (a DynamoDB-backed log store for Delta, an atomic catalog pointer swap for Iceberg). Plan for this complexity.
    • LIST is slow → table formats reduce listing dependency through manifests, but metadata itself grows and must be maintained.
    • Cold scan latency → first queries are slow. Metadata-driven pruning (partition pruning, column statistics) is essential, not optional.
    • Small files → streaming writes and high-parallelism batch jobs produce small files by default. Compaction is mandatory.

  4. Choose your query engines. Separation of storage and compute means multiple engines can read the same S3 data (see the DuckDB sketch below):

    • Spark for batch ETL and large-scale transformations
    • Trino for interactive federated queries
    • DuckDB for single-machine ad-hoc exploration
    • StarRocks/ClickHouse for low-latency dashboards

  5. Plan metadata operations. Snapshot expiration, orphan file cleanup, manifest merging, and compaction are operational requirements, not optional maintenance tasks. At scale, these consume significant compute (see the maintenance sketch below).
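
The sketches below illustrate several of the decision-path steps. First, the S3-layer choice from step 1: because AWS S3, MinIO, and Ceph's RGW all expose the S3 API, the client code barely changes. This is a minimal boto3 sketch; the endpoint URL, bucket name, and credentials are placeholders, not values from any real deployment.

  import boto3

  # AWS S3: default endpoint, credentials from the standard AWS credential chain.
  aws_s3 = boto3.client("s3", region_name="us-east-1")

  # MinIO or Ceph RGW: the same API, just pointed at the self-hosted endpoint.
  minio_s3 = boto3.client(
      "s3",
      endpoint_url="http://minio.internal:9000",    # hypothetical endpoint
      aws_access_key_id="minio-access-key",         # placeholder credentials
      aws_secret_access_key="minio-secret-key",
  )

  # The same LIST call works against either store.
  resp = minio_s3.list_objects_v2(Bucket="lakehouse", Prefix="warehouse/")
  for obj in resp.get("Contents", []):
      print(obj["Key"], obj["Size"])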
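
For step 2, a minimal PySpark sketch of standing up an Iceberg table on S3. The catalog name (lake), bucket, and schema are illustrative, and it assumes the matching iceberg-spark-runtime jar is already on the Spark classpath; the configuration keys are standard Iceberg Spark settings.

  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("iceberg-on-s3")
      .config("spark.sql.extensions",
              "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.lake.type", "hadoop")   # or a REST / Glue catalog
      .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
      .getOrCreate()
  )

  # Hidden partitioning: readers filter on ts, and Iceberg maps the predicate to
  # day-level partitions without a separate partition column in the schema.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS lake.analytics.events (
          event_id BIGINT,
          user_id  BIGINT,
          ts       TIMESTAMP,
          payload  STRING
      )
      USING iceberg
      PARTITIONED BY (days(ts))
  """)

Once the table is registered in a shared catalog, Trino, Flink, or DuckDB can read it as well, which is the multi-engine property step 4 relies on.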
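
For step 3's first constraint, here is the shape of the commit workaround as a sketch rather than any format's exact protocol: data and metadata objects are written under unique keys and never renamed, and the commit itself is a conditional write of a single pointer in a store that does support compare-and-swap, DynamoDB in this example. The pointer table name and key layout are hypothetical.

  import boto3
  from botocore.exceptions import ClientError

  dynamodb = boto3.client("dynamodb")

  def commit_snapshot(table_name: str, expected_version: int, new_metadata_key: str) -> bool:
      """Advance the table's metadata pointer only if no one else committed first."""
      try:
          dynamodb.put_item(
              TableName="lakehouse_pointers",          # hypothetical pointer table
              Item={
                  "table_name": {"S": table_name},
                  "version": {"N": str(expected_version + 1)},
                  "metadata_key": {"S": new_metadata_key},
              },
              # Optimistic concurrency: fail if the stored version is not the one we read.
              ConditionExpression="attribute_not_exists(version) OR version = :v",
              ExpressionAttributeValues={":v": {"N": str(expected_version)}},
          )
          return True
      except ClientError as err:
          if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
              return False   # lost the race: re-read the pointer, rebase, retry
          raise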
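
For step 4, a sketch of a second engine reading the same storage with no cluster involved: DuckDB scanning a Parquet prefix directly over S3. The bucket, path, and credentials are placeholders; for governed Iceberg tables you would go through the catalog or DuckDB's iceberg extension instead, but a raw-Parquet scan keeps the example minimal.

  import duckdb

  con = duckdb.connect()
  con.sql("INSTALL httpfs;")
  con.sql("LOAD httpfs;")
  con.sql("SET s3_region = 'us-east-1';")
  con.sql("SET s3_access_key_id = 'placeholder';")       # placeholder credentials
  con.sql("SET s3_secret_access_key = 'placeholder';")

  # Ad-hoc exploration straight against the lake: no ingestion step, no cluster.
  daily_counts = con.sql("""
      SELECT date_trunc('day', ts) AS day, count(*) AS events
      FROM read_parquet('s3://my-bucket/raw/events/*.parquet')
      GROUP BY 1
      ORDER BY 1
  """).fetchall()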
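
Finally, for step 5, the maintenance work expressed as Iceberg's built-in Spark procedures, reusing the catalog and table names from the Iceberg sketch above. The file-size target and timestamp are example values; the cadence and thresholds have to be tuned per table.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()   # same Iceberg-configured session as above

  # Compact the small files that streaming and high-parallelism writes leave behind
  # (target here is ~512 MB per file).
  spark.sql("""
      CALL lake.system.rewrite_data_files(
          table => 'analytics.events',
          options => map('target-file-size-bytes', '536870912')
      )
  """)

  # Merge small manifests so query planning does not read thousands of metadata files.
  spark.sql("CALL lake.system.rewrite_manifests(table => 'analytics.events')")

  # Expire old snapshots: bounds time travel and lets unreferenced data files be removed.
  spark.sql("""
      CALL lake.system.expire_snapshots(
          table => 'analytics.events',
          older_than => TIMESTAMP '2024-01-01 00:00:00'
      )
  """)

  # Delete objects no snapshot references (failed writes, already-expired snapshots).
  spark.sql("CALL lake.system.remove_orphan_files(table => 'analytics.events')")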

What Changed Over Time

  • Early data lakes on S3 had no table semantics — raw Parquet files with Hive-style partitioning and no transactions.
  • Table formats (Hudi in 2016, Iceberg in 2018, Delta Lake in 2019; Iceberg graduated to an Apache top-level project in 2020) added ACID transactions, schema evolution, and time travel.
  • AWS S3 moved from eventual to strong consistency (December 2020), eliminating a class of bugs but not the atomic rename gap.
  • Iceberg is emerging as the de facto standard, with Databricks adding Iceberg support alongside Delta.
  • Metadata management (catalogs, compaction, GC) has shifted from "nice to have" to a core operational requirement as lakehouse deployments have matured.

Sources