Guide 34

DuckLake and the Future of Lakehouse Metadata

Problem Framing

Every major open table format (Iceberg, Delta Lake, Hudi) stores its metadata as files on object storage, typically S3. Iceberg writes Avro manifests and JSON table metadata; Delta writes a sequential JSON transaction log; Hudi writes a timeline of action files. Every commit creates new metadata files (PUT operations), and every query plan reads them back (GET operations). As a table accumulates thousands of commits, this metadata I/O can become the dominant bottleneck, outweighing the data scan itself.
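The amplification is easy to see with a back-of-the-envelope model. The sketch below is illustrative arithmetic only, not a measurement of any real engine; the per-operation counts are assumptions based on Iceberg's metadata layout (table metadata JSON, manifest list, manifest files).

```python
# Toy model of metadata request amplification in a file-based table format.
# The counts are illustrative assumptions, not measurements of a real engine.

def plan_requests(num_manifests: int) -> int:
    """Object-store GETs needed to plan one query over an Iceberg-style table:
    1 GET for the table metadata JSON, 1 GET for the manifest list,
    then 1 GET per live manifest file."""
    return 1 + 1 + num_manifests

def commit_requests() -> int:
    """PUTs per commit: a new table metadata JSON, a new manifest list,
    and at least one new manifest file."""
    return 3

# A table that has accumulated thousands of commits without compaction can
# carry thousands of live manifests, so planning costs thousands of GETs
# before a single byte of Parquet data is read.
print(plan_requests(num_manifests=2000))  # 2002
```

The point of the model is the shape of the curve: commit cost is constant, but planning cost grows with the number of live manifests, which is why compaction is mandatory maintenance in file-based formats.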

DuckLake, released by the DuckDB team in 2025, takes a fundamentally different approach: store all metadata in a SQL database (DuckDB, PostgreSQL, or MySQL) while keeping data files as Parquet on S3. Query planning becomes a single database query, eliminating per-query metadata reads against object storage entirely. But the design also introduces a stateful database dependency, and DuckLake currently works only with DuckDB, a steep tradeoff against Iceberg's engine-agnostic ecosystem.
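The core idea is compact enough to sketch. The following toy catalog uses SQLite from the Python standard library as a stand-in for the catalog database; the table names and columns here are invented for illustration and are not DuckLake's actual schema.

```python
import sqlite3

# Toy SQL-backed catalog: snapshots and data-file lists live in one database,
# so query planning is one indexed query instead of many object-store GETs.
# Schema is invented for illustration; it is NOT DuckLake's real catalog schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, table_name TEXT);
    CREATE TABLE data_files (
        snapshot_id INTEGER REFERENCES snapshots(snapshot_id),
        path TEXT,          -- Parquet file on object storage
        row_count INTEGER
    );
""")

# Two commits to the same table; each commit is one transaction, not a PUT storm.
con.execute("INSERT INTO snapshots VALUES (1, 'events'), (2, 'events')")
con.executemany("INSERT INTO data_files VALUES (?, ?, ?)", [
    (1, "s3://lake/events/a.parquet", 1000),
    (2, "s3://lake/events/a.parquet", 1000),
    (2, "s3://lake/events/b.parquet", 500),
])

# "Query planning": resolve the latest snapshot's file list with one SQL query.
files = con.execute("""
    SELECT path, row_count FROM data_files
    WHERE snapshot_id = (SELECT MAX(snapshot_id) FROM snapshots
                         WHERE table_name = 'events')
""").fetchall()
print(files)
```

Snapshots become rows instead of files, commits become transactions, and time travel reduces to a WHERE clause on a snapshot ID. That is the trade DuckLake makes: catalog operations inherit the database's transactional guarantees, and the database inherits the catalog's availability requirements.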

Relevant Nodes

  • Topics: Table Formats, Lakehouse, Metadata Management
  • Technologies: DuckLake, DuckDB, Apache Iceberg, Delta Lake, Apache Hudi, Apache Polaris
  • Standards: Iceberg Table Spec, Delta Lake Protocol, Apache Parquet
  • Architectures: Lakehouse Architecture, Separation of Storage and Compute
  • Pain Points: Metadata Overhead at Scale, Request Amplification

Decision Path

  1. How many engines query your lakehouse? If only DuckDB, DuckLake is a strong fit: metadata resolves instantly, with zero S3 round-trips for catalog operations. If Spark, Trino, Flink, or Snowflake also need access, Iceberg remains the safer choice, with by far the broadest multi-engine support.

  2. What's your table scale? Small tables with few commits see negligible metadata overhead in any format. DuckLake's advantage emerges at scale: hundreds of tables with thousands of commits, where Iceberg's manifest listing becomes measurably slow without aggressive compaction.

  3. Can you accept a stateful metadata dependency? DuckLake trades S3's stateless metadata (files you can copy and restore) for a database that must be backed up, migrated, and kept available. For single-node labs this is trivial; for production multi-tenant environments it is a meaningful operational concern.

  4. What's your maturity tolerance? Iceberg is battle-tested, with years of production deployments across the industry. DuckLake is experimental: suitable for prototyping, personal lakehouses, and single-engine analytical workflows, but not yet for mission-critical pipelines.
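The stateful-dependency concern in point 3 is concrete: metadata files on S3 can be restored by copying objects, but a SQL catalog must be backed up as a live database. A minimal sketch of what that obligation looks like, again using SQLite's online backup API from the Python standard library as a stand-in for whichever catalog database is configured:

```python
import sqlite3

# Point 3 in practice: a SQL catalog must be backed up as a live database,
# not copied like immutable metadata files on S3. SQLite stands in here for
# whichever catalog database (DuckDB, PostgreSQL, MySQL) is actually used.
catalog = sqlite3.connect(":memory:")
catalog.execute("CREATE TABLE snapshots (snapshot_id INTEGER, table_name TEXT)")
catalog.execute("INSERT INTO snapshots VALUES (1, 'events')")
catalog.commit()

# Online backup: a consistent copy even while the catalog is being written to.
backup = sqlite3.connect(":memory:")  # in production: a file or a replica
catalog.backup(backup)

rows = backup.execute("SELECT * FROM snapshots").fetchall()
print(rows)
```

For a single-node lab this is one extra cron job; for a multi-tenant production deployment it means replication, failover, and restore drills for a component that file-based formats get "for free" from object-store durability.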

What Changed Over Time

  • DuckDB released DuckLake (May 2025), demonstrating that SQL-based metadata can outperform file-based manifests for single-engine workloads by eliminating all S3 metadata round-trips.
  • Iceberg v3 added deletion vectors and row lineage, improving write performance but not solving the fundamental metadata file listing problem.
  • AWS launched S3 Tables with managed Iceberg compaction, but early users reported 2.5-3 hour compaction delays and 20-30x cost surprises, highlighting that even managed file-based metadata has limits.
  • The "metadata as database" concept gained traction as DuckDB's embedded SQL model proved that a zero-infrastructure catalog is achievable without cloud services.

Sources