Lakehouse
Summary
What it is
The convergence of data lake storage (raw files on object storage) with data warehouse capabilities — ACID transactions, schema enforcement, SQL access, time-travel.
Where it fits
Lakehouse sits between raw object storage and business analytics. It is the architectural layer where table formats (Iceberg, Delta, Hudi) add structure to S3 data, enabling SQL engines to query it reliably.
Misconceptions / Traps
- A lakehouse is not just "a data lake with SQL." The key differentiator is transactional guarantees — ACID, schema evolution, snapshot isolation — provided by table format specs.
- Lakehouse does not eliminate ETL. It eliminates the second copy of data in a separate warehouse, but data still needs transformation.
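Schema evolution, one of the transactional guarantees mentioned above, is worth a concrete illustration. This is a hypothetical toy mirroring the stable-column-id idea used by table format specs: columns are tracked by numeric id rather than by name, so renaming or adding a column does not require rewriting old data files, which still reference the original ids. The `ToySchema` class and its methods are invented for this sketch.

```python
class ToySchema:
    """Toy sketch of id-based schema evolution. Stored rows key values
    by column id, not name, so old files stay readable as the schema
    changes. Not a real table-format implementation."""

    def __init__(self):
        self.columns = {}   # column id -> current column name
        self.next_id = 1

    def add_column(self, name: str) -> int:
        col_id = self.next_id
        self.columns[col_id] = name
        self.next_id += 1
        return col_id

    def rename_column(self, col_id: int, new_name: str) -> None:
        # A rename only touches metadata; no data file is rewritten.
        self.columns[col_id] = new_name

    def read_row(self, stored_row: dict) -> dict:
        # Resolve each stored value through the *current* schema.
        return {self.columns[cid]: value
                for cid, value in stored_row.items()
                if cid in self.columns}
```

A file written when the column was called `user` still resolves correctly after the column is renamed and new columns are added, because the file only ever recorded the id.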
Key Connections
- Object Storage — the lakehouse stores all data on object storage
- Lakehouse Architecture — the concrete architectural pattern
- Apache Iceberg, Delta Lake, Apache Hudi — table format technologies
- Medallion Architecture — a data quality pattern within lakehouses
- Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec — the specifications that define table semantics
Definition
Why it exists
Data lakes offered cheap, scalable storage but lacked reliability guarantees. Data warehouses offered guarantees but were expensive and siloed. The lakehouse concept unifies both on a single object storage layer.
Resources
"Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics" (Armbrust et al., CIDR 2021) is the canonical academic paper defining the lakehouse paradigm.
Databricks' glossary entry distills the lakehouse concept into an accessible overview with diagrams comparing it to data lakes and warehouses.
Databricks' well-architected data lakehouse documentation covers the architectural pillars for production implementations.