Architecture

Lakehouse Architecture


Summary

What it is

A unified architecture combining data lake storage (files on S3) with warehouse capabilities (ACID, schema enforcement, SQL access) by using a table format as the bridge layer.

Where it fits

Lakehouse Architecture is the dominant architectural pattern in the S3 ecosystem. It reduces, and for many workloads removes, the need for a separate data warehouse by adding warehouse-grade reliability directly to S3-stored data through table formats.

Misconceptions / Traps

A lakehouse does not mean "no data warehouse." It means the warehouse capabilities are applied to data lake storage. Some workloads may still benefit from a dedicated OLAP engine.
Lakehouse performance depends heavily on metadata management. Without regular table maintenance (snapshot expiration, orphan file cleanup), metadata accumulates and query planning degrades.
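The maintenance tasks named above can be illustrated with a toy model. This is a minimal sketch, not any table format's actual implementation: the `Snapshot` class and the `expire_snapshots` and `unreferenced_files` functions are hypothetical names invented here, and real formats track files through manifests rather than a flat set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Toy snapshot: an id, a commit time, and the data files it references."""
    snapshot_id: int
    committed_at: int       # toy timestamp, larger means newer
    data_files: frozenset   # file names referenced by this snapshot


def expire_snapshots(snapshots, older_than, keep_last=1):
    """Return the snapshots that survive expiration.

    A snapshot survives if it was committed at or after `older_than`,
    or if it is among the `keep_last` most recent snapshots.
    """
    ordered = sorted(snapshots, key=lambda s: s.committed_at)
    protected = ordered[-keep_last:] if keep_last > 0 else []
    return [s for s in ordered if s.committed_at >= older_than or s in protected]


def unreferenced_files(all_files, snapshots):
    """Return files in storage that no surviving snapshot references."""
    referenced = set()
    for snap in snapshots:
        referenced |= snap.data_files
    return sorted(set(all_files) - referenced)


# Hypothetical table history: snapshot 3 is a compaction that rewrote a and b into c.
snapshots = [
    Snapshot(1, 100, frozenset({"a.parquet"})),
    Snapshot(2, 200, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(3, 300, frozenset({"c.parquet"})),
]
kept = expire_snapshots(snapshots, older_than=250)

# tmp.parquet stands in for a file left behind by a failed write.
storage = ["a.parquet", "b.parquet", "c.parquet", "tmp.parquet"]
removable = unreferenced_files(storage, kept)
```

In this sketch, expiring the two older snapshots is what makes `a.parquet` and `b.parquet` removable, while `tmp.parquet` was never referenced at all. Skipping either task leaves both the metadata history and the object count growing.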

Key Connections

depends_on S3 API, Apache Parquet — the storage interface and file format
solves Cold Scan Latency — metadata-driven query planning reduces unnecessary S3 scans
constrained_by Metadata Overhead at Scale, Lack of Atomic Rename
Apache Iceberg, Delta Lake, Apache Hudi implements Lakehouse Architecture
Trino, Apache Spark, StarRocks, Apache Flink used_by Lakehouse Architecture
scoped_to Lakehouse, Object Storage
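The "metadata-driven query planning" connection above can be sketched as file pruning on column statistics. This is an illustrative assumption about how a planner might use per-file min/max values kept in table metadata; `FileStats`, `prune_files`, and the `s3://example-bucket/...` paths are hypothetical and do not come from any specific engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file statistics as a table format might record them in metadata."""
    path: str
    min_value: int   # smallest value of the filter column in this file
    max_value: int   # largest value of the filter column in this file


def prune_files(manifest, lower, upper):
    """Keep only files whose [min, max] range can overlap [lower, upper].

    Files outside the range are skipped without issuing any S3 request.
    """
    return [
        f.path
        for f in manifest
        if f.max_value >= lower and f.min_value <= upper
    ]


# Hypothetical manifest: one file per month, filter column is a yyyymmdd integer.
manifest = [
    FileStats("s3://example-bucket/sales/part-0.parquet", 20240101, 20240131),
    FileStats("s3://example-bucket/sales/part-1.parquet", 20240201, 20240229),
    FileStats("s3://example-bucket/sales/part-2.parquet", 20240301, 20240331),
]

# A query filtering on 2024-02-10 .. 2024-02-15 only needs the February file.
to_scan = prune_files(manifest, 20240210, 20240215)
```

The planning step touches only metadata, so the cost of deciding what to read stays small compared with listing and opening every object under the table prefix.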

Definition

What it is

A unified architecture that combines data lake storage (files on S3) with data warehouse capabilities (ACID transactions, schema enforcement, SQL access, governance) by using a table format as the bridge layer.

Why it exists

Running both a data lake and a data warehouse creates data duplication, ETL complexity, and governance gaps. The lakehouse eliminates the warehouse copy by adding warehouse-grade reliability directly to the data lake.

Primary use cases

Unified analytics platform, eliminating ETL between lake and warehouse, multi-engine SQL access to a single copy of data on S3.

Recent developments

Latest signals

The real-time layer is no longer optional in the 2026 lakehouse shape. Per RisingWave's "Data Lakehouse Architecture in 2026" analysis, the canonical lakehouse architecture has shifted from "batch lake + warehouse-on-top" to "streaming substrate + lake + analytical engines": the real-time layer (Flink, Spark Structured Streaming Real-Time Mode, RisingWave) is now treated as a peer to the batch layer rather than an optional bolt-on. The lakehouse is increasingly the convergence point where streaming and batch jobs read and write the same Iceberg/Delta tables.
Catalog choice has become the load-bearing architectural decision. Per the Databricks data warehousing concepts guide, the "which table format" question is now downstream of "which catalog": Unity Catalog vs Polaris vs Iceberg REST Catalog vs Hive Metastore. The 2026 lakehouse pattern assumes catalog-managed tables as the default; teams still on filesystem-coordinated metadata are working in an architecture that the ecosystem treats as legacy.
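One reason catalog-managed tables matter is that the catalog can act as the single point where a commit succeeds or fails. The following is a minimal sketch of that idea using optimistic concurrency (compare-and-swap on a metadata pointer). `InMemoryCatalog`, `CommitConflict`, `commit_with_retry`, and the `metadata/vN.json` names are hypothetical and stand in for whatever a real catalog service provides.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed since the writer read its base pointer."""


class InMemoryCatalog:
    """Toy catalog: maps a table name to its current metadata file pointer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pointers = {}

    def current(self, table):
        """Return the current metadata pointer, or None for a new table."""
        with self._lock:
            return self._pointers.get(table)

    def swap(self, table, expected, new):
        """Atomically move the pointer from `expected` to `new`, or fail."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: pointer is no longer {expected!r}")
            self._pointers[table] = new


def commit_with_retry(catalog, table, make_new_pointer, max_attempts=3):
    """Re-read the base pointer and retry the swap if another writer won."""
    for _ in range(max_attempts):
        base = catalog.current(table)
        new = make_new_pointer(base)
        try:
            catalog.swap(table, expected=base, new=new)
            return new
        except CommitConflict:
            continue
    raise CommitConflict(f"{table}: gave up after {max_attempts} attempts")


catalog = InMemoryCatalog()
stale_base = catalog.current("sales")                       # both writers read None
catalog.swap("sales", expected=stale_base, new="metadata/v1.json")  # writer A commits

try:
    # Writer B still believes the table is empty, so its swap must be rejected.
    catalog.swap("sales", expected=stale_base, new="metadata/v1-b.json")
    conflict_detected = False
except CommitConflict:
    conflict_detected = True

# Writer B retries on top of the latest pointer and succeeds.
committed = commit_with_retry(catalog, "sales", lambda base: "metadata/v2.json")
```

Plain object storage historically offered no atomic rename to play this role, which is one way to read the "Lack of Atomic Rename" constraint listed for this architecture: the check-and-set step has to live somewhere, and a catalog is a natural place for it.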

Connections 38

Outbound 7

scoped_to 2

Lakehouse, Object Storage

depends_on 2

S3 API, Apache Parquet

solves 1

Cold Scan Latency

constrained_by 2

Metadata Overhead at Scale, Lack of Atomic Rename

Inbound 31

enables 18

Technology 12

AWS S3, MinIO, Aliyun OSS, Hitachi Vantara, Apache Paimon, Estuary Flow, Bytewax, Apache Airflow, Alarik, RustFS, Marquez, Apache Ranger

Standard 5

S3 API, Apache Parquet, Iceberg Table Spec, Delta Lake Protocol, Iceberg REST Catalog Spec

Architecture 1

Partitioning

implements 5

Apache Iceberg, Delta Lake, Apache Hudi, Databricks, Real-Time AI Lakehouse

used_by 4

Trino, Apache Spark, StarRocks, Apache Flink

augments 2

ClickHouse, Semantic Search

is_a 1

Medallion Architecture

depends_on 1

Lakehouse for AI Workflows

Resources 3

Paper (High)

www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

The foundational CIDR 2021 paper by Zaharia et al. that coined the Lakehouse concept, arguing that open data formats on object storage can unify warehousing and ML workloads.

Docs (High)

www.databricks.com/product/data-lakehouse

Databricks' official product page explaining Lakehouse architecture, the commercial realization of the CIDR paper's vision.

Docs (High)

docs.databricks.com/aws/en/lakehouse-architecture/

Databricks' well-architected data lakehouse documentation covering architectural pillars for production implementations.
