Technology

DuckLake

Summary

What it is

A lakehouse metadata format that stores table metadata in a SQL database (DuckDB, PostgreSQL, or MySQL) instead of file-based manifests on S3. An emerging project from the DuckDB team.

Where it fits

DuckLake challenges the Iceberg/Delta approach of storing table metadata as files (JSON, Avro, Parquet) on S3. By placing metadata in a SQL database, it avoids the metadata file listing and parsing overhead that slows query planning on large Iceberg tables, while keeping data files in Parquet on S3. It extends DuckDB's "zero-infrastructure" philosophy to the lakehouse metadata layer.
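
To make the idea concrete, here is a minimal sketch of resolving a table's data files with one SQL query instead of walking manifest files. It uses Python's built-in sqlite3 as a stand-in catalog database. The data_file table, its columns, and the S3 paths are hypothetical simplifications for illustration, not DuckLake's actual catalog schema.

```python
import sqlite3


def build_catalog() -> sqlite3.Connection:
    """Create an in-memory stand-in catalog with a hypothetical schema."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE data_file (
            table_name     TEXT    NOT NULL,
            path           TEXT    NOT NULL,
            begin_snapshot INTEGER NOT NULL,  -- first snapshot that sees the file
            end_snapshot   INTEGER            -- first snapshot that no longer sees it
        );
        INSERT INTO data_file VALUES
            ('events', 's3://bucket/events/f1.parquet', 1, NULL),
            ('events', 's3://bucket/events/f2.parquet', 2, 3),
            ('events', 's3://bucket/events/f3.parquet', 3, NULL);
        """
    )
    return con


def files_for_snapshot(con: sqlite3.Connection, table_name: str, snapshot_id: int) -> list[str]:
    """Return the Parquet paths visible to a snapshot using a single query."""
    rows = con.execute(
        """
        SELECT path
        FROM data_file
        WHERE table_name = ?
          AND begin_snapshot <= ?
          AND (end_snapshot IS NULL OR end_snapshot > ?)
        ORDER BY path
        """,
        (table_name, snapshot_id, snapshot_id),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    catalog = build_catalog()
    for path in files_for_snapshot(catalog, "events", 3):
        print(path)
```

The data files themselves would still be read from S3. Only the step that decides which files to read moves into the database.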

Misconceptions / Traps

DuckLake is early-stage (2025). It is not a production-ready replacement for Iceberg or Delta Lake. Evaluate for experimentation and single-node workflows, not mission-critical multi-engine environments.
The metadata database (DuckDB, PostgreSQL, MySQL) becomes a stateful dependency. This partially trades the "no server needed" benefit of file-based table formats for a database dependency.
Multi-engine support is limited. DuckLake is tightly coupled to DuckDB today — unlike Iceberg, which works across Spark, Trino, Flink, and others.

Key Connections

depends_on DuckDB — uses DuckDB as the embedded metadata engine
alternative_to Apache Iceberg — SQL-based metadata vs file-based manifests
solves Metadata Overhead at Scale — eliminates file-based metadata listing overhead
solves Request Amplification — metadata queries replace S3 LIST and GET operations

Definition

What it is

An emerging open table format that stores lakehouse metadata in a SQL database (DuckDB, PostgreSQL, or MySQL) rather than the file-based manifests used by Iceberg, Delta, and Hudi. Commits are fast because each one is a database transaction rather than a series of S3 round-trips for metadata.
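
As an illustration of why a database-backed commit is cheap and atomic, the sketch below records a new snapshot and its data files in one transaction. It again uses sqlite3 as a stand-in catalog. The snapshot and data_file tables and the commit_append helper are hypothetical names, not part of DuckLake.

```python
import sqlite3


def open_catalog() -> sqlite3.Connection:
    """Create an in-memory stand-in catalog with a hypothetical schema."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE snapshot (
            snapshot_id INTEGER PRIMARY KEY,
            note        TEXT
        );
        CREATE TABLE data_file (
            path           TEXT PRIMARY KEY,
            table_name     TEXT    NOT NULL,
            begin_snapshot INTEGER NOT NULL REFERENCES snapshot(snapshot_id)
        );
        """
    )
    return con


def commit_append(con: sqlite3.Connection, table_name: str, paths: list[str]) -> int:
    """Register already-written Parquet files under a new snapshot.

    The `with con:` block commits on success and rolls back on any error,
    so readers never observe a snapshot with only some of its files.
    """
    with con:
        cur = con.execute(
            "INSERT INTO snapshot (note) VALUES (?)",
            (f"append to {table_name}",),
        )
        snapshot_id = cur.lastrowid
        con.executemany(
            "INSERT INTO data_file (path, table_name, begin_snapshot) VALUES (?, ?, ?)",
            [(path, table_name, snapshot_id) for path in paths],
        )
    return snapshot_id


if __name__ == "__main__":
    catalog = open_catalog()
    print(commit_append(catalog, "events", ["a.parquet", "b.parquet"]))
```

If any insert fails, for example because a path is already registered, the snapshot row is rolled back along with the file rows, so a failed commit leaves no partial state behind.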

Why it exists

File-based table formats store metadata as Parquet, JSON, or Avro files on S3. Every commit requires multiple PUT operations, and every query plan requires multiple GET operations against these files. DuckLake replaces this with queries against a SQL database, which removes the object-store round-trips from metadata operations.
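
The difference in request counts can be sketched with a deliberately simplified model. For the file-based case it assumes an Iceberg-style layout in which planning reads one table metadata file, one manifest list, and every manifest. It ignores caching, manifest pruning, and the catalog pointer lookup. The assumption that SQL-based planning takes one query by default is illustrative, not a measured figure. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanCost:
    """Metadata round-trips needed before any data file is read."""

    object_store_gets: int
    sql_queries: int


def file_based_plan_cost(num_manifests: int) -> PlanCost:
    """Simplified model: metadata file + manifest list + one GET per manifest."""
    if num_manifests < 0:
        raise ValueError("num_manifests must be non-negative")
    return PlanCost(object_store_gets=2 + num_manifests, sql_queries=0)


def sql_catalog_plan_cost(num_queries: int = 1) -> PlanCost:
    """Simplified model: planning is answered by queries against the catalog database."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return PlanCost(object_store_gets=0, sql_queries=num_queries)


if __name__ == "__main__":
    print(file_based_plan_cost(num_manifests=50))
    print(sql_catalog_plan_cost())
```

The model says nothing about latency per request. It only shows that the file-based count grows with the number of manifests while the SQL-based count does not.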

Primary use cases

Low-latency lakehouse metadata operations, interactive data exploration without metadata scan overhead, single-node lakehouse workflows where DuckDB is the primary engine.

Connections 5

Outbound 5

scoped_to 1

Table Formats

depends_on 1

DuckDB

alternative_to 1

Apache Iceberg

solves 2

Metadata Overhead at Scale
Request Amplification

Resources 2

Blog (High)

duckdb.org/2025/05/07/ducklake.html

Announcement post explaining DuckLake's SQL-first metadata approach as an alternative to the file-based metadata of formats like Iceberg.

GitHub (High)

github.com/duckdb/ducklake

Source code and specification for the DuckDB-native lakehouse format that stores catalog metadata in a database instead of manifest files on S3.