Lakehouse Architecture

A unified architecture combining data lake storage (files on S3) with warehouse capabilities (ACID, schema enforcement, SQL access) by using a table format as the bridge layer.

Summary

Where it fits

Lakehouse Architecture is the dominant architectural pattern in the S3 ecosystem. It removes the need to keep a separate warehouse copy of the data by adding reliability (ACID transactions, schema enforcement) directly to S3-stored data through table formats.

Misconceptions / Traps
  • A lakehouse does not mean "no data warehouse." It means the warehouse capabilities are applied to data lake storage. Some workloads may still benefit from a dedicated OLAP engine.
  • Lakehouse performance depends heavily on metadata management. Without regular maintenance (snapshot expiration, orphan file cleanup), metadata accumulates and query planning degrades.
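
The maintenance point above can be illustrated with a small in-memory sketch. It does not use the API of any real table format: `Snapshot`, `expire_snapshots`, and `find_orphans` are hypothetical names, and the object keys and timestamps are made up. It also simplifies by treating every object that no retained snapshot references as an orphan, whereas real table formats handle expired-snapshot files and never-committed files as separate steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int        # epoch seconds (toy values)
    data_files: frozenset    # object keys this snapshot references


def expire_snapshots(snapshots, older_than, keep_last=1):
    """Drop snapshots committed before `older_than`, always keeping the newest `keep_last`."""
    ordered = sorted(snapshots, key=lambda s: s.committed_at)
    protected = ordered[-keep_last:] if keep_last > 0 else []
    return [s for s in ordered if s.committed_at >= older_than or s in protected]


def find_orphans(objects_in_bucket, snapshots):
    """Objects under the table prefix that no retained snapshot references."""
    referenced = set()
    for s in snapshots:
        referenced |= s.data_files
    return sorted(set(objects_in_bucket) - referenced)


snapshots = [
    Snapshot(1, 100, frozenset({"a.parquet"})),
    Snapshot(2, 200, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(3, 300, frozenset({"c.parquet"})),  # compaction rewrote a + b into c
]
bucket = ["a.parquet", "b.parquet", "c.parquet", "tmp-failed-write.parquet"]

retained = expire_snapshots(snapshots, older_than=250)
orphans = find_orphans(bucket, retained)
print([s.snapshot_id for s in retained], orphans)
```

Until the old snapshots are expired, every file they reference must stay in the bucket and in the metadata that planners read, which is why skipping maintenance makes both storage and planning cost grow.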

Key Connections
  • depends_on S3 API, Apache Parquet — the storage interface and file format
  • solves Cold Scan Latency — metadata-driven query planning reduces unnecessary S3 scans
  • constrained_by Metadata Overhead at Scale, Lack of Atomic Rename
  • Apache Iceberg, Delta Lake, Apache Hudi implements Lakehouse Architecture
  • Trino, Apache Spark, StarRocks, Apache Flink used_by Lakehouse Architecture
  • scoped_to Lakehouse, Object Storage
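
As a rough illustration of the "metadata-driven query planning" connection above, the sketch below prunes files using per-file min/max statistics before any data is read. The `FileEntry` structure, the `plan_scan` function, and the S3 paths are hypothetical; real table formats keep comparable statistics in their manifest or log files, but with richer layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    min_ts: int   # smallest value of the filter column in this file
    max_ts: int   # largest value of the filter column in this file


def plan_scan(manifest, lo, hi):
    """Return only the files whose [min_ts, max_ts] range overlaps [lo, hi]."""
    return [f.path for f in manifest if f.max_ts >= lo and f.min_ts <= hi]


manifest = [
    FileEntry("s3://bucket/events/part-0.parquet", 0, 99),
    FileEntry("s3://bucket/events/part-1.parquet", 100, 199),
    FileEntry("s3://bucket/events/part-2.parquet", 200, 299),
]

# A query filtering on 150 <= ts <= 210 never has to open part-0.
files_to_read = plan_scan(manifest, lo=150, hi=210)
print(files_to_read)
```

The planner decides from metadata alone which objects to fetch, so the number of S3 requests scales with the files that can match rather than with the size of the table.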

Definition

What it is

A unified architecture that combines data lake storage (files on S3) with data warehouse capabilities (ACID transactions, schema enforcement, SQL access, governance) by using a table format as the bridge layer.
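
To make the "bridge layer" idea concrete, here is a minimal toy, assuming nothing about any real table format's API. `ToyTable` and `SchemaError` are hypothetical names. The sketch shows two of the warehouse capabilities named above: schema enforcement (bad rows are rejected) and all-or-nothing writes (a new snapshot is published only after every row validates).

```python
class SchemaError(ValueError):
    """Raised when a row does not match the table schema."""


class ToyTable:
    def __init__(self, schema):
        self.schema = dict(schema)   # column name -> Python type
        self.snapshots = [()]        # snapshot 0 is the empty table

    def append(self, rows):
        # Validate every row first; nothing is published if any row fails.
        for row in rows:
            if set(row) != set(self.schema):
                raise SchemaError(f"unexpected columns: {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise SchemaError(f"{col!r} must be {typ.__name__}")
        # Publish a new immutable snapshot that extends the previous one.
        self.snapshots.append(self.snapshots[-1] + tuple(dict(r) for r in rows))

    def scan(self, snapshot_id=-1):
        """Read a snapshot; the default is the latest committed one."""
        return list(self.snapshots[snapshot_id])


table = ToyTable({"id": int, "city": str})
table.append([{"id": 1, "city": "Oslo"}])

try:
    # The second row has a string id, so the whole batch is rejected.
    table.append([{"id": 2, "city": "Rome"}, {"id": "3", "city": "Lima"}])
    rejected = False
except SchemaError:
    rejected = True

print(rejected, table.scan())
```

Readers only ever see committed snapshots, so a failed write leaves the table exactly as it was. In a real lakehouse the snapshots are metadata files pointing at Parquet objects on S3 rather than Python tuples.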

Why it exists

Running both a data lake and a data warehouse creates data duplication, ETL complexity, and governance gaps. The lakehouse eliminates the warehouse copy by adding warehouse-grade reliability directly to the data lake.

Primary use cases

Unified analytics platform, eliminating ETL between lake and warehouse, multi-engine SQL access to a single copy of data on S3.

Recent developments

Latest signals
  • Real-time layer is no longer optional in the 2026 lakehouse shape. Per RisingWave's "Data Lakehouse Architecture in 2026" analysis, the canonical lakehouse architecture has shifted from "batch lake + warehouse-on-top" to "streaming substrate + lake + analytical engines". The real-time layer (Flink, Spark Structured Streaming Real-Time Mode, RisingWave) is now treated as a peer to the batch layer rather than an optional bolt-on. The lakehouse is increasingly the convergence point where streaming and batch jobs read and write the same Iceberg/Delta tables.
  • Catalog choice has become the load-bearing architectural decision. Per the Databricks data warehousing concepts guide, the "which table format" question is now downstream of "which catalog": Unity Catalog, Polaris, Iceberg REST Catalog, or Hive Metastore. The 2026 lakehouse pattern assumes catalog-managed tables as the default; teams still relying on filesystem-coordinated metadata are working in an architecture that the ecosystem treats as legacy.
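
One way to picture why the catalog matters when streaming and batch jobs write the same table is the sketch below: a hypothetical in-memory catalog that holds the pointer to each table's current state and only accepts a commit based on the latest version (optimistic concurrency). `ToyCatalog` and `append_files` are illustrative names, not the API of Unity Catalog, Polaris, or any other real catalog.

```python
import threading


class ToyCatalog:
    """Hypothetical catalog: table name -> (version, committed file paths)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tables = {}

    def load(self, table):
        with self._lock:
            return self._tables.get(table, (0, ()))

    def commit(self, table, expected_version, new_files):
        """Swap the pointer only if nobody has committed since `expected_version`."""
        with self._lock:
            version, _ = self._tables.get(table, (0, ()))
            if version != expected_version:
                return False  # conflict: the caller must reload and retry
            self._tables[table] = (version + 1, tuple(new_files))
            return True


def append_files(catalog, table, files, max_retries=5):
    """Append with retry: reload the latest state after each conflict."""
    for _ in range(max_retries):
        version, current = catalog.load(table)
        if catalog.commit(table, version, current + tuple(files)):
            return True
    return False


catalog = ToyCatalog()

# The batch job reads the table state, then a streaming job commits first.
stale_version, stale_files = catalog.load("events")
append_files(catalog, "events", ["stream-0.parquet"])

# The batch job's commit against the stale version is refused...
conflict_accepted = catalog.commit(
    "events", stale_version, stale_files + ("batch-0.parquet",)
)
# ...and succeeds once it retries on top of the streaming commit.
append_files(catalog, "events", ["batch-0.parquet"])

final_version, final_files = catalog.load("events")
print(conflict_accepted, final_version, final_files)
```

The check-and-swap has to be atomic somewhere. Object stores without atomic rename cannot easily provide that through the filesystem alone, so a catalog service is a natural place to put it.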
