Technology

DuckLake

A lakehouse metadata format that stores table metadata in a SQL database rather than in file-based manifests on S3. An emerging project from the DuckDB team.


Summary

What it is

DuckLake keeps table metadata in a SQL database (an embedded DuckDB file, PostgreSQL, MySQL, or SQLite) instead of file-based manifests on S3. It is an emerging project from the DuckDB team.

Where it fits

DuckLake challenges the Iceberg/Delta approach of storing metadata as JSON and Avro files on S3. By placing metadata in a SQL database, it eliminates the metadata file listing and parsing overhead that plagues large Iceberg tables — while keeping data files in Parquet on S3. It is the natural extension of DuckDB's "zero-infrastructure" philosophy to the lakehouse metadata layer.

Misconceptions / Traps
  • DuckLake is early-stage (2025). It is not a production-ready replacement for Iceberg or Delta Lake. Evaluate for experimentation and single-node workflows, not mission-critical multi-engine environments.
  • The metadata database (DuckDB, PostgreSQL, MySQL) becomes a stateful dependency. This partially trades the "no server needed" benefit of file-based table formats for a database dependency.
  • Multi-engine support is limited. DuckLake is tightly coupled to DuckDB today — unlike Iceberg, which works across Spark, Trino, Flink, and others.

Key Connections
  • depends_on DuckDB — uses DuckDB as the embedded metadata engine
  • alternative_to Apache Iceberg — SQL-based metadata vs file-based manifests
  • solves Metadata Overhead at Scale — eliminates file-based metadata listing overhead
  • solves Request Amplification — metadata queries replace S3 LIST and GET operations

Definition

What it is

An open, MIT-licensed table format that stores all lakehouse metadata (schemas, transaction logs, file locations, version histories, snapshots) inside an ACID-compliant SQL database (PostgreSQL, MySQL, embedded DuckDB, or SQLite) rather than as immutable JSON/Avro files on object storage. Physical data still lives as standard Parquet files on S3-compatible blob storage; only the control plane moves into the RDBMS. Authored by the DuckDB team (Mark Raasveldt and Hannes Mühleisen of CWI Amsterdam) and published as the "DuckLake Manifesto: SQL as a Lakehouse Format."
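
The idea of keeping the control plane in an ACID database can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not DuckLake's actual catalog layout: the table names (snapshot, data_file), the column names, the helper functions, and the example bucket paths are all hypothetical. It shows two things from the definition above: new Parquet files become visible atomically through one database transaction, and a reader pinned to a snapshot gets its file list from one indexed query.

```python
import sqlite3

# Hypothetical, simplified catalog schema for illustration only.
# It is NOT DuckLake's real table layout.
SCHEMA = """
CREATE TABLE snapshot (
    snapshot_id  INTEGER PRIMARY KEY,
    committed_at TEXT NOT NULL
);
CREATE TABLE data_file (
    file_id        INTEGER PRIMARY KEY,
    table_name     TEXT NOT NULL,
    path           TEXT NOT NULL,     -- Parquet object on blob storage
    begin_snapshot INTEGER NOT NULL,  -- first snapshot that sees the file
    end_snapshot   INTEGER            -- NULL while the file is still live
);
CREATE INDEX idx_file_table ON data_file (table_name, begin_snapshot);
"""


def open_catalog():
    """Create an in-memory catalog database with the toy schema."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    return con


def commit_files(con, table_name, paths):
    """Register new Parquet files under a new snapshot in one transaction."""
    with con:  # commits on success, rolls back if an exception is raised
        cur = con.execute(
            "INSERT INTO snapshot (committed_at) VALUES (datetime('now'))"
        )
        snapshot_id = cur.lastrowid
        con.executemany(
            "INSERT INTO data_file (table_name, path, begin_snapshot) "
            "VALUES (?, ?, ?)",
            [(table_name, p, snapshot_id) for p in paths],
        )
    return snapshot_id


def files_at(con, table_name, snapshot_id):
    """Return the Parquet paths visible to a reader pinned at snapshot_id."""
    rows = con.execute(
        "SELECT path FROM data_file "
        "WHERE table_name = ? AND begin_snapshot <= ? "
        "AND (end_snapshot IS NULL OR end_snapshot > ?) "
        "ORDER BY file_id",
        (table_name, snapshot_id, snapshot_id),
    )
    return [r[0] for r in rows]


if __name__ == "__main__":
    con = open_catalog()
    s1 = commit_files(con, "events", ["s3://example-bucket/events/a.parquet"])
    s2 = commit_files(con, "events", ["s3://example-bucket/events/b.parquet"])
    print(files_at(con, "events", s1))  # file list as of the first commit
    print(files_at(con, "events", s2))  # file list as of the second commit
```

The snapshot row and its file rows are written inside a single transaction, so a reader either sees the whole commit or none of it. In a file-based format the same guarantee requires writing new manifest files and swapping a pointer.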

Why it exists

File-based table formats (Iceberg, Delta, Hudi) treat S3 as both the data store and the transaction log. That works for batch analytics but produces severe S3 request amplification under streaming or AI workloads. A single 10,000-row update against a 100M-row Iceberg table triggers a multi-step sequence: metadata.json read → manifest-list.avro download → manifest file traversal → new Parquet writes → position-delete files → new manifests → new manifest-list → atomic catalog swap. That sequence generates roughly 24 or more S3 API calls per micro-batch. DuckLake collapses metadata I/O into a single indexed SQL query against an RDBMS, eliminating the "Manifest Maze" entirely.
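
The read half of that sequence can be modeled with a toy object store that counts GET requests. This is a rough sketch, not Iceberg's actual reader: CountingStore, build_toy_table, and plan_scan_file_based are hypothetical names, the objects are plain Python values rather than JSON or Avro, and the manifest counts are made up for illustration. It only shows how the number of round trips grows with the number of manifests, whereas a database-backed catalog answers the same question in one query.

```python
class CountingStore:
    """In-memory stand-in for an object store that counts GET requests."""

    def __init__(self, objects):
        self.objects = objects
        self.gets = 0

    def get(self, key):
        self.gets += 1
        return self.objects[key]


def build_toy_table(num_manifests, files_per_manifest):
    """Lay out metadata -> manifest list -> manifests -> data file paths."""
    manifest_keys = [f"manifest-{i}.avro" for i in range(num_manifests)]
    objects = {
        "metadata.json": {"manifest_list": "manifest-list.avro"},
        "manifest-list.avro": manifest_keys,
    }
    for i, key in enumerate(manifest_keys):
        objects[key] = [
            f"data/part-{i}-{j}.parquet" for j in range(files_per_manifest)
        ]
    return CountingStore(objects)


def plan_scan_file_based(store, metadata_key):
    """Walk the metadata tree to collect every data file path."""
    metadata = store.get(metadata_key)                     # 1 GET
    manifest_list = store.get(metadata["manifest_list"])   # 1 GET
    data_files = []
    for manifest_key in manifest_list:                     # 1 GET per manifest
        data_files.extend(store.get(manifest_key))
    return data_files


if __name__ == "__main__":
    store = build_toy_table(num_manifests=8, files_per_manifest=3)
    files = plan_scan_file_based(store, "metadata.json")
    print(len(files), "data files found with", store.gets, "GET requests")
```

In this model, planning a scan costs 2 + num_manifests sequential reads before any data is touched. The write path described above adds further requests on top of that.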

Primary use cases

Low-latency lakehouse metadata operations, real-time CDC into the lakehouse, interactive embedded analytics where multi-second query planning is unacceptable, single-team to mid-market deployments where a managed PostgreSQL can handle the metadata throughput, AI agent retrieval pipelines that need sub-second context lookups.

Recent developments

Latest signals
  • The DuckLake Manifesto landed as the spec foundation. Per the DuckLake Manifesto, DuckLake is positioned as "SQL as a Lakehouse Format" — an explicit philosophical break from the file-based-metadata orthodoxy of Iceberg/Delta/Hudi. The format is open MIT, not proprietary to DuckDB; standard tooling at ducklake.select and endjin's hands-on tutorial show production setups against PostgreSQL or MySQL backends. The architectural framing has shifted from "another open table format" to "the third generation of lakehouse architecture — database-backed metadata."
  • DuckDB 1.4 LTS: production readiness signal (September 2025). Per the DuckDB business-case analysis, DuckDB 1.4 LTS shipped AES-256 encryption, native Iceberg writes, and a rewritten sort engine, signaling enterprise readiness for the underlying engine that powers DuckLake's embedded variant. Stack Overflow Developer Survey adoption rose from 1.4% to 3.3% in one year; the project surpassed 30,000 GitHub stars; ClickBench gives DuckDB the #1 position for in-memory analytical workloads.
  • The Duck Stack economic case: 70% TCO reduction. Definite, an analytics platform provider, migrated its entire production infrastructure from Snowflake to the "Duck Stack" (DuckDB + DuckLake) and reports a 70% reduction in underlying infrastructure costs (see DuckDB and DuckLake: Why We Bet the Company). Concrete numbers: standard object storage at ~$20/TB/month, plus a single dedicated 16-vCPU/64GB VM at ~$500/month with zero per-query execution charges, enabled sub-second query latency at a fraction of cloud-warehouse TCO. The shift lets Definite offer an all-in-one analytics platform from $250/month.
  • Operator's deployment matrix. Per Duck Lake vs Iceberg: An Operator's Verdict, the production sweet spot is workloads under 5 TB — micro-scale (≤100 GB) maps to local DuckDB/SQLite catalogs; mid-market (100 GB – 5 TB) maps to managed PostgreSQL catalogs; enterprise core (1–50 TB) is hybrid / engine-dependent; large-enterprise and hyperscale (>5 PB) still favor Iceberg's decentralized optimistic-concurrency model for multi-engine federation. Critically, migration between formats is reversible — both use standard Parquet for physical data, so DuckLake↔Iceberg conversions are metadata-translation operations, not physical-data rewrites. Architectural decisions made today are not permanent traps.
  • The AI-retrieval latency case (the strongest 2026 argument). Per the Designing an AI-Native Lakehouse on Iceberg engineering case study, AI agents querying an Iceberg-hosted vector embeddings table see median end-to-end retrieval latencies around 680ms with p95 at 1.8s, and 40–60% of that time is exclusively manifest/snapshot lookups on object storage — independent of the actual vector similarity search. DuckLake's single-SQL-call metadata path eliminates that entire latency component, returning the file URI list in milliseconds.
  • Tigris's hosted DuckLake offering. Per Tigris's DuckLake writeup, the geo-distributed S3-compatible object store now bundles DuckLake support as a managed pattern — single global namespace, edge replication of the catalog, zero-egress economics. Useful when the deployment requirement crosses regulatory boundaries that AWS S3's residency model can't satisfy.
  • MotherDuck: hybrid DuckLake-cloud. Per Quacks & Stacks, MotherDuck (a hosted DuckDB platform run by a company separate from DuckDB Labs) is positioning as the "glue" that lets a single SQL query join local high-velocity DuckLake metadata against historical petabyte-scale Iceberg data on S3, without the analyst or AI agent needing to know where the physical network boundary lies.
  • The scalability ceiling — honest about where it doesn't fit. Per the Rethinking the Lakehouse operator analysis: the centralized PostgreSQL catalog is a write-coordination bottleneck above ~50 TB / thousands of concurrent distributed writers. For hyperscale globally-distributed write fan-out, Iceberg's decentralized optimistic concurrency model remains the correct architectural choice despite its S3 amplification flaws. DuckLake is not a universal Iceberg replacement; it is the correct format for the workload tier where the centralized catalog's vertical simplicity beats Iceberg's horizontal complexity.
  • Where Iceberg is converging (long-term). Per the data-lakehouse hub 2025/2026 guide, the Iceberg V4 spec direction is moving toward a pluggable catalog model with RDBMS-backed metadata as an official option — effectively absorbing DuckLake's core thesis while maintaining Iceberg's massive ecosystem footprint. By the late 2020s the ideological divide between file-based and database-backed metadata is expected to narrow significantly. DuckLake's bet is that being early to the SQL-native pattern earns the format the developer-experience defaults for the embedded-and-edge tier even as Iceberg catches up at the hyperscale tier.
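
Two of the signals above reduce to simple arithmetic, sketched here. The constants come from the figures quoted in the Duck Stack and AI-retrieval items. The function names, the vm_count parameter, and the choice to apply the 40–60% metadata share to the 680 ms median are assumptions made for illustration.

```python
STORAGE_USD_PER_TB_MONTH = 20.0  # "~$20/TB/month" object storage figure
VM_USD_PER_MONTH = 500.0         # single 16-vCPU/64GB VM, "~$500/month"


def monthly_infra_cost(data_tb, vm_count=1):
    """Storage plus fixed compute; no per-query charges in this model."""
    return data_tb * STORAGE_USD_PER_TB_MONTH + vm_count * VM_USD_PER_MONTH


def metadata_latency_range_ms(end_to_end_ms, low_pct=40, high_pct=60):
    """Portion of retrieval latency attributed to manifest/snapshot lookups."""
    return (end_to_end_ms * low_pct / 100, end_to_end_ms * high_pct / 100)


if __name__ == "__main__":
    # Example: a 5 TB lake on one VM.
    print(monthly_infra_cost(5))
    # Metadata share of the quoted 680 ms median retrieval latency.
    print(metadata_latency_range_ms(680))
```

Under these assumptions a 5 TB deployment costs about $600 per month in storage and compute, and roughly 272 to 408 ms of the 680 ms median is metadata lookup time that a single catalog query would replace.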
