Technology

DuckLake

A lakehouse metadata format that stores table metadata in a SQL database (DuckDB, PostgreSQL, or another SQL engine) instead of file-based manifests on S3. An emerging project from the DuckDB team.

5 connections 2 resources

Summary

What it is

A lakehouse metadata format that stores table metadata in a SQL database (DuckDB, PostgreSQL, or another SQL engine) instead of file-based manifests on S3. An emerging project from the DuckDB team.

Where it fits

DuckLake challenges the Iceberg/Delta approach of storing metadata as JSON and Avro files on S3. By placing metadata in a SQL database, it eliminates the metadata file listing and parsing overhead that plagues large Iceberg tables — while keeping data files in Parquet on S3. It is the natural extension of DuckDB's "zero-infrastructure" philosophy to the lakehouse metadata layer.
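As a rough sketch of the idea — using an illustrative schema, not DuckLake's actual one — putting metadata in a SQL database means query planning becomes a single database query instead of a chain of S3 LIST and GET calls over manifest files. The table and column names below are hypothetical; only the data-file paths would live on S3.

```python
import sqlite3

# Hypothetical, simplified catalog schema; DuckLake's real schema differs.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE snapshots (
        snapshot_id INTEGER PRIMARY KEY,
        committed_at TEXT
    );
    CREATE TABLE data_files (
        path TEXT,            -- Parquet file on S3; data stays in object storage
        snapshot_id INTEGER,  -- snapshot that added this file
        row_count INTEGER
    );
""")
con.execute("INSERT INTO snapshots VALUES (1, '2025-06-01'), (2, '2025-06-02')")
con.executemany(
    "INSERT INTO data_files VALUES (?, ?, ?)",
    [("s3://bucket/t/a.parquet", 1, 1000),
     ("s3://bucket/t/b.parquet", 2, 500)],
)

# "Query planning": one SQL query replaces listing and parsing manifest files.
latest = con.execute("SELECT MAX(snapshot_id) FROM snapshots").fetchone()[0]
files = [r[0] for r in con.execute(
    "SELECT path FROM data_files WHERE snapshot_id <= ?", (latest,))]
print(files)
```

The planner ends up with the same list of Parquet files a manifest walk would produce, but via one indexed lookup rather than several object-storage round-trips.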

Misconceptions / Traps
  • DuckLake is early-stage (2025). It is not a production-ready replacement for Iceberg or Delta Lake. Evaluate for experimentation and single-node workflows, not mission-critical multi-engine environments.
  • The metadata database (DuckDB, PostgreSQL, MySQL) becomes a stateful dependency. This partially trades the "no server needed" benefit of file-based table formats for a database dependency.
  • Multi-engine support is limited. DuckLake is tightly coupled to DuckDB today — unlike Iceberg, which works across Spark, Trino, Flink, and others.
Key Connections
  • depends_on DuckDB — uses DuckDB as the primary query engine and default metadata store
  • alternative_to Apache Iceberg — SQL-based metadata vs file-based manifests
  • solves Metadata Overhead at Scale — eliminates file-based metadata listing overhead
  • solves Request Amplification — metadata queries replace S3 LIST and GET operations

Definition

What it is

An emerging open table format that stores lakehouse metadata in a SQL database (DuckDB by default) rather than the file-based manifests used by Iceberg, Delta, and Hudi. Commits are fast because metadata updates are single database transactions, with no S3 round-trips.

Why it exists

File-based table formats store metadata as Parquet/JSON/Avro manifests on S3. Every commit requires multiple PUT operations, and every query plan requires multiple GET operations against these manifests. DuckLake replaces this with transactions against a SQL database, removing object storage from the metadata path.
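To make the commit-path difference concrete, here is a hedged sketch: in a SQL catalog, a commit is one ACID transaction, where a file-based format would need several PUTs plus an atomic metadata swap. The schema and function are illustrative only, not DuckLake's actual design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY);
    CREATE TABLE data_files (path TEXT, snapshot_id INTEGER);
""")

def commit_snapshot(con, new_paths):
    """Register new data files and a new snapshot atomically: one local
    transaction, zero metadata round-trips to object storage."""
    with con:  # BEGIN ... COMMIT; rolls back automatically on error
        sid = con.execute(
            "SELECT COALESCE(MAX(snapshot_id), 0) + 1 FROM snapshots"
        ).fetchone()[0]
        con.execute("INSERT INTO snapshots VALUES (?)", (sid,))
        con.executemany("INSERT INTO data_files VALUES (?, ?)",
                        [(p, sid) for p in new_paths])
    return sid

sid = commit_snapshot(con, ["s3://bucket/t/c.parquet"])
print(sid)  # first commit -> snapshot 1
```

If two writers race, the database's transaction machinery resolves the conflict; a file-based format must instead rely on conditional object-store writes or an external lock.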

Primary use cases

Low-latency lakehouse metadata operations, interactive data exploration without metadata scan overhead, single-node lakehouse workflows where DuckDB is the primary engine.
