DuckLake and the Future of Lakehouse Metadata
Problem Framing
Every open table format (Iceberg, Delta Lake, Hudi) stores its metadata as files on object storage such as S3. Iceberg writes Avro manifests and JSON table metadata; Delta Lake writes a sequential JSON transaction log; Hudi writes a timeline of action files. Every commit creates new metadata files (PUT requests), and every query plan must read them back (GET requests). As a table accumulates thousands of commits, this metadata I/O becomes the dominant bottleneck, not the data scan itself.
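To make the request amplification concrete, here is a back-of-the-envelope sketch. The numbers are illustrative assumptions, not measurements of any engine: it assumes each commit writes one new manifest (plus a new manifest list and metadata JSON), and that planning must read the metadata JSON, the manifest list, and every live manifest.

```python
# Toy model of Iceberg-style metadata I/O (illustrative assumptions only).
# Assumed: each commit writes one new manifest plus a new manifest list
# and table-metadata JSON; planning reads the metadata JSON, the manifest
# list, and one manifest file per live manifest.

def puts_per_commit(new_manifests: int = 1) -> int:
    # New table-metadata JSON + new manifest list + the new manifest(s).
    return 2 + new_manifests

def gets_per_plan(live_manifests: int) -> int:
    # Table-metadata JSON + manifest list + one GET per live manifest.
    return 2 + live_manifests

# Without compaction, 1,000 commits leave ~1,000 live manifests:
print(gets_per_plan(1_000))  # 1002 S3 GETs just to plan one query
# After compacting down to, say, 10 manifests:
print(gets_per_plan(10))     # 12 GETs
```

Under these assumptions, planning cost grows linearly with uncompacted commits, which is why aggressive manifest compaction is standard practice for file-based formats at scale.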
DuckLake, released by the DuckDB team in 2025, takes a fundamentally different approach: store all metadata in a SQL database (DuckDB, PostgreSQL, or MySQL) while keeping data files as Parquet on S3. This eliminates the file-listing overhead entirely. But it also introduces a database dependency and currently works only with DuckDB — a steep tradeoff against Iceberg's engine-agnostic ecosystem.
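For reference, using a DuckLake catalog from DuckDB looks roughly like this. This is a minimal sketch following the extension's published `ATTACH` syntax; the catalog filename, bucket path, and table name are placeholders:

```sql
INSTALL ducklake;   -- DuckLake ships as a DuckDB extension

-- Metadata lives in the attached catalog database; data files land
-- as Parquet under DATA_PATH.
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-bucket/lake/');
USE lake;

CREATE TABLE events (id INTEGER, payload VARCHAR);
INSERT INTO events VALUES (1, 'hello');
SELECT * FROM events;  -- catalog lookups hit the SQL database, not S3
```

The single-engine limitation follows directly from this design: any engine that wants to read the table must speak DuckLake's SQL metadata schema, which today only DuckDB's extension does.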
Relevant Nodes
- Topics: Table Formats, Lakehouse, Metadata Management
- Technologies: DuckLake, DuckDB, Apache Iceberg, Delta Lake, Apache Hudi, Apache Polaris
- Standards: Iceberg Table Spec, Delta Lake Protocol, Apache Parquet
- Architectures: Lakehouse Architecture, Separation of Storage and Compute
- Pain Points: Metadata Overhead at Scale, Request Amplification
Decision Path
How many engines query your lakehouse? If only DuckDB, DuckLake is a strong fit: instant metadata resolution and zero S3 round-trips for catalog operations. If Spark, Trino, Flink, or Snowflake also need access, Iceberg remains the safer choice, with the broadest multi-engine support among the open formats.
What's your table scale? Small tables with few commits see negligible metadata overhead in any format. DuckLake's advantage emerges at scale — hundreds of tables with thousands of commits where Iceberg's manifest listing becomes measurably slow without aggressive compaction.
Can you accept a stateful metadata dependency? DuckLake trades S3's stateless metadata (files you can copy and restore) for a database that must be backed up, migrated, and kept available. For single-node labs this is trivial; for production multi-tenant environments it is a meaningful operational concern.
What's your maturity tolerance? Iceberg is battle-tested across the industry with years of production deployments. DuckLake is experimental — suitable for prototyping, personal lakehouses, and single-engine analytical workflows, but not yet for mission-critical pipelines.
What Changed Over Time
- DuckDB released DuckLake (May 2025), demonstrating that SQL-based metadata can outperform file-based manifests for single-engine workloads by eliminating all S3 metadata round-trips.
- Iceberg v3 added deletion vectors and row lineage, improving write performance but not solving the fundamental metadata file listing problem.
- AWS launched S3 Tables with managed Iceberg compaction, but early users reported 2.5-3 hour compaction delays and 20-30x cost surprises — highlighting that even managed file-based metadata has limits.
- The "metadata as database" concept gained traction as DuckDB's embedded SQL model proved that a zero-infrastructure catalog is achievable without cloud services.
Sources
- duckdb.org/2025/05/07/ducklake.html
- github.com/duckdb/ducklake
- medium.com/@anigma.55/rethinking-the-lakehouse-6f92dba519dc
- www.dremio.com/blog/apache-iceberg-vs-delta-lake/
- www.onehouse.ai/blog/s3-managed-tables-unmanaged-costs-the-20x-surpris...
- datalakehousehub.com/blog/2025-09-ultimate-guide-to-open-table-formats...