Lakehouse Architecture
A unified architecture combining data lake storage (files on S3) with warehouse capabilities (ACID, schema enforcement, SQL access) by using a table format as the bridge layer.
Summary
Lakehouse Architecture is the dominant architectural pattern in the S3 ecosystem. It removes the need to maintain a separate warehouse copy of the data by adding reliability (ACID transactions, schema enforcement) directly to S3-stored data through table formats.
- A lakehouse does not mean "no data warehouse." It means the warehouse capabilities are applied to data lake storage. Some workloads may still benefit from a dedicated OLAP engine.
- Lakehouse performance depends heavily on metadata management. Without catalog maintenance (snapshot expiration, orphan file cleanup), query planning degrades.
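The maintenance point above can be illustrated with a toy model. This is a minimal sketch, not the API of Iceberg, Delta Lake, or Hudi: the `Snapshot` class, the `expire_snapshots` and `find_orphans` helpers, and the file names are all hypothetical, and real table formats track files through manifests rather than a flat set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int      # commit time in epoch seconds (hypothetical field)
    data_files: frozenset  # paths of the data files this snapshot references


def expire_snapshots(snapshots, older_than, retain_last=1):
    """Keep snapshots at or after the cutoff, plus the newest `retain_last`."""
    ordered = sorted(snapshots, key=lambda s: s.committed_at)
    protected = set()
    if retain_last > 0:
        protected = {s.snapshot_id for s in ordered[-retain_last:]}
    return [
        s for s in ordered
        if s.committed_at >= older_than or s.snapshot_id in protected
    ]


def find_orphans(files_in_storage, snapshots):
    """Files present in storage that no retained snapshot references."""
    referenced = set()
    for s in snapshots:
        referenced |= s.data_files
    return set(files_in_storage) - referenced


history = [
    Snapshot(1, 100, frozenset({"a.parquet"})),
    Snapshot(2, 200, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(3, 300, frozenset({"c.parquet"})),  # compaction rewrote a + b into c
]
# Hypothetical bucket listing, including a leftover from a failed write.
storage = {"a.parquet", "b.parquet", "c.parquet", "tmp-failed-write.parquet"}

kept = expire_snapshots(history, older_than=250)
orphans = find_orphans(storage, kept)
```

In this toy model, files freed by expiring old snapshots and files left behind by failed writes land in the same `orphans` set. Production table formats usually handle the two cases with separate maintenance operations, but the effect on planning is the same: less metadata to walk and fewer dead files in storage.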
- depends_on: S3 API, Apache Parquet (the storage interface and file format)
- solves: Cold Scan Latency (metadata-driven query planning reduces unnecessary S3 scans)
- constrained_by: Metadata Overhead at Scale, Lack of Atomic Rename
- Apache Iceberg, Delta Lake, Apache Hudi: implements Lakehouse Architecture
- Trino, Apache Spark, StarRocks, Apache Flink: used_by Lakehouse Architecture
- scoped_to: Lakehouse, Object Storage
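The link between the lakehouse and Cold Scan Latency rests on metadata-driven planning: if the table metadata records value ranges per data file, the planner can skip files that cannot match a filter without opening them on S3. The sketch below is illustrative only. `FileStats`, `plan_scan`, the `day` column, and the bucket paths are assumptions, and real table formats store such statistics in their own manifest or log structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_day: str  # lowest value of the filter column in the file (ISO date)
    max_day: str  # highest value of the filter column in the file (ISO date)


def plan_scan(manifest, lo, hi):
    """Return only the files whose [min_day, max_day] range overlaps [lo, hi]."""
    return [f.path for f in manifest if f.max_day >= lo and f.min_day <= hi]


# Hypothetical manifest: one entry per Parquet file, no S3 reads needed to build the plan.
manifest = [
    FileStats("s3://bucket/events/f1.parquet", "2024-01-01", "2024-01-31"),
    FileStats("s3://bucket/events/f2.parquet", "2024-02-01", "2024-02-29"),
    FileStats("s3://bucket/events/f3.parquet", "2024-03-01", "2024-03-31"),
]

# Filter: day BETWEEN '2024-02-10' AND '2024-02-20'
to_read = plan_scan(manifest, "2024-02-10", "2024-02-20")
```

Only the second file survives planning here, so the engine issues reads for one object instead of three. ISO-formatted dates are used so that plain string comparison orders them correctly.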
Definition
A unified architecture that combines data lake storage (files on S3) with data warehouse capabilities (ACID transactions, schema enforcement, SQL access, governance) by using a table format as the bridge layer.
Running both a data lake and a data warehouse creates data duplication, ETL complexity, and governance gaps. The lakehouse eliminates the warehouse copy by adding warehouse-grade reliability directly to the data lake.
Typical use cases: a unified analytics platform, removing ETL between lake and warehouse, and multi-engine SQL access to a single copy of data on S3.
Recent developments
- The real-time layer is no longer optional in the 2026 lakehouse. Per RisingWave's Data Lakehouse Architecture in 2026 analysis, the canonical architecture has shifted from "batch lake + warehouse-on-top" to "streaming substrate + lake + analytical engines": the real-time layer (Flink, Spark Structured Streaming Real-Time Mode, RisingWave) is now treated as a peer to the batch layer rather than an optional bolt-on. The lakehouse is increasingly the convergence point where streaming and batch jobs read and write the same Iceberg/Delta tables.
- Catalog choice has become the load-bearing architectural decision. Per the Databricks data warehousing concepts guide, the "which table format" question is now downstream of "which catalog" (Unity Catalog vs Polaris vs Iceberg REST Catalog vs Hive Metastore). The 2026 lakehouse pattern assumes catalog-managed tables as the default; teams still on filesystem-coordinated metadata are working in an architecture that the ecosystem treats as legacy.
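One reason catalog-managed tables matter is commit atomicity. This page lists Lack of Atomic Rename as a constraint, so writers cannot rely on a filesystem rename to publish a new table version. A catalog can instead hold a single pointer to the current metadata file and move it with a compare-and-swap. The following is a minimal in-memory sketch of that idea. `ToyCatalog`, `swap`, and the metadata file names are hypothetical and do not reproduce the interface of Unity Catalog, Polaris, the Iceberg REST Catalog, or Hive Metastore.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Hypothetical in-memory catalog: table name -> current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def swap(self, table, expected, new):
        """Atomically move the pointer from `expected` to `new`, or fail."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: pointer moved, expected {expected!r}")
            self._pointers[table] = new


catalog = ToyCatalog()
catalog.swap("events", None, "metadata/v1.json")  # table creation

# Two writers both start from v1 and prepare new metadata files.
base_a = catalog.current("events")
base_b = catalog.current("events")

catalog.swap("events", base_a, "metadata/v2-a.json")  # writer A commits first

try:
    catalog.swap("events", base_b, "metadata/v2-b.json")
    b_committed = True
except CommitConflict:
    b_committed = False  # writer B must re-read the table state and retry
```

Writer B's data files may already be in object storage, but they stay invisible to readers because the pointer never moved to B's metadata. That is the sense in which the catalog, rather than the storage layer, decides what counts as committed.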
Resources
The foundational CIDR 2021 paper by Armbrust, Ghodsi, Xin, and Zaharia that formalized the Lakehouse concept, arguing that open data formats on object storage can unify warehousing and ML workloads.
Databricks' official product page explaining Lakehouse architecture, the commercial realization of the CIDR paper's vision.
Databricks' well-architected data lakehouse documentation covering architectural pillars for production implementations.