Lakehouse Architecture
Summary
A unified architecture combining data lake storage (files on S3) with warehouse capabilities (ACID, schema enforcement, SQL access) by using a table format as the bridge layer.
Lakehouse Architecture is the dominant architectural pattern in the S3 ecosystem. It reduces the need for a separate data warehouse by adding reliability directly to S3-stored data through table formats.
- A lakehouse does not mean "no data warehouse." It means the warehouse capabilities are applied to data lake storage. Some workloads may still benefit from a dedicated OLAP engine.
- Lakehouse performance depends heavily on metadata management. Without catalog maintenance (snapshot expiration, orphan file cleanup), query planning degrades.
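The maintenance point above can be illustrated with a toy model. This is a minimal sketch under assumed names (`Snapshot`, `expire_snapshots`, `find_orphans` are hypothetical, not any real table format's API); production engines such as Iceberg expose equivalent maintenance procedures that do this durably against the catalog.

```python
from dataclasses import dataclass

# Hypothetical toy model of table-format metadata; names are illustrative,
# not any real engine's API.
@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp: int          # commit time
    data_files: frozenset   # data files referenced by this snapshot

def expire_snapshots(snapshots, keep_after):
    """Drop snapshots committed before `keep_after`, always retaining the latest."""
    latest = max(snapshots, key=lambda s: s.timestamp)
    return [s for s in snapshots if s.timestamp >= keep_after or s is latest]

def find_orphans(all_files_in_bucket, live_snapshots):
    """Files on S3 that no retained snapshot references are safe to delete."""
    referenced = set().union(*(s.data_files for s in live_snapshots))
    return set(all_files_in_bucket) - referenced

snaps = [
    Snapshot(1, 100, frozenset({"a.parquet"})),
    Snapshot(2, 200, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(3, 300, frozenset({"b.parquet", "c.parquet"})),
]
live = expire_snapshots(snaps, keep_after=250)
orphans = find_orphans({"a.parquet", "b.parquet", "c.parquet", "tmp.parquet"}, live)
print(sorted(orphans))  # only snapshot 3 survives, so a.parquet and tmp.parquet are orphaned
```

Without this kind of pruning, every commit leaves more metadata and dead files behind, and query planners must wade through them, which is the degradation the bullet above warns about.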
depends_on: S3 API, Apache Parquet — the storage interface and file format
solves: Cold Scan Latency — metadata-driven query planning reduces unnecessary S3 scans
constrained_by: Metadata Overhead at Scale, Lack of Atomic Rename
implements Lakehouse Architecture: Apache Iceberg, Delta Lake, Apache Hudi
used_by Lakehouse Architecture: Trino, Apache Spark, StarRocks, Apache Flink
scoped_to: Lakehouse, Object Storage
Definition
A unified architecture that combines data lake storage (files on S3) with data warehouse capabilities (ACID transactions, schema enforcement, SQL access, governance) by using a table format as the bridge layer.
Running both a data lake and a data warehouse creates data duplication, ETL complexity, and governance gaps. The lakehouse eliminates the warehouse copy by adding warehouse-grade reliability directly to the data lake.
The result is a unified analytics platform: ETL between lake and warehouse is eliminated, and multiple engines get SQL access to a single copy of data on S3.
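The single-copy, multi-engine claim rests on how table formats commit: writers add immutable data files plus new metadata, then atomically swap one "current metadata" pointer, so every reader sees either the old or the new snapshot, never a half-written state. Below is a minimal in-memory sketch of that pointer-swap idea; `ToyCatalog` and its methods are hypothetical names, standing in for a real catalog's durable compare-and-swap.

```python
import threading

# Toy catalog: one mutable pointer to the current table metadata, guarded by a
# lock to stand in for the catalog's atomic compare-and-swap. Illustrative
# only; real lakehouse catalogs perform this swap durably.
class ToyCatalog:
    def __init__(self):
        self._lock = threading.Lock()
        self._current = {"version": 0, "files": []}  # the metadata "file"

    def load(self):
        # Readers (any engine) grab an immutable snapshot of the pointer.
        return self._current

    def commit(self, base_version, new_metadata):
        # Optimistic concurrency: the swap succeeds only if no other writer
        # committed since we read `base_version`.
        with self._lock:
            if self._current["version"] != base_version:
                return False  # conflict: caller must retry on the new snapshot
            self._current = new_metadata
            return True

catalog = ToyCatalog()

# Writer: read the current snapshot, add an immutable data file, attempt the swap.
snap = catalog.load()
ok = catalog.commit(snap["version"],
                    {"version": snap["version"] + 1,
                     "files": snap["files"] + ["part-000.parquet"]})

# A stale writer based on the old version is rejected outright, never half-applied.
stale = catalog.commit(snap["version"], {"version": 99, "files": []})
print(ok, stale, catalog.load()["files"])
```

Because data files are immutable and only the pointer moves, any engine that can read the metadata and Parquet files shares the same consistent table, which is the bridge-layer role the definition above assigns to table formats.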
Connections: 33
Outbound (7): scoped_to 2, depends_on 2, solves 1, constrained_by 2
Inbound (26): enables 16, implements 3, depends_on 1, augments 1
Resources: 3
The foundational CIDR 2021 paper by Zaharia et al. that coined the Lakehouse concept, arguing that open data formats on object storage can unify warehousing and ML workloads.
Databricks' official product page explaining Lakehouse architecture, the commercial realization of the CIDR paper's vision.
Databricks' well-architected data lakehouse documentation covering architectural pillars for production implementations.