Iceberg V3 Spec

The 2025 evolution of the Apache Iceberg table specification, introducing Row Lineage for row-level provenance tracking, native CDC detection, enhanced deletion handling, and metadata designed to make the lakehouse "agent-ready" for AI systems.

Summary


Where it fits

As Iceberg becomes the dominant lakehouse format, V3 addresses the gaps that emerged at scale: Row Lineage exposes where each row originated and how it was transformed, native CDC detection eliminates external change tracking, and improved deletion vectors support streaming updates. V3 is the spec that makes Iceberg both batch/streaming-capable and AI-agent-readable.

Misconceptions / Traps
  • Engine support for V3 features is not immediate. Query engines need time to implement Row Lineage and native CDC; check engine compatibility before depending on V3-specific capabilities.
  • V3 is backwards-compatible with V2 data. Upgrading the spec version does not require rewriting existing tables.
  • "Agent-ready" refers to metadata granularity, not an AI integration layer. V3 exposes provenance metadata that AI systems can consume, but does not include built-in agent APIs.
Key Connections
  • extends Iceberg Table Spec — evolutionary improvement to the existing standard
  • enables Apache Iceberg — new capabilities for Iceberg implementations
  • scoped_to Table Formats, S3

Definition

What it is

The 2025–2026 evolution of the Apache Iceberg table specification. V3 introduces four substantive changes:
  • **Row Lineage** — every row carries a unique row ID and the sequence number of its last modification, enabling zero-scan incremental reads.
  • **Deletion Vectors** — Puffin-encoded Roaring bitmaps that mark logically deleted positions instead of rewriting whole Parquet files, for up to 10× faster MERGE/UPDATE.
  • **Native CDC detection** — change detection built into the table format, eliminating external change tracking.
  • **VARIANT data type** — shredded semi-structured payloads (nested JSON, IoT telemetry, application logs) stored alongside strict relational columns with columnar-equivalent scan performance.

V3 reached **Public Preview in Snowflake (March 2026)** and entered **bidirectional interop with Databricks Unity Catalog** the same quarter; AWS announced support for V3 deletion vectors and row lineage in November 2025.
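The merge-on-read idea behind deletion vectors can be sketched in a few lines. This is a conceptual illustration, not the Iceberg API: a plain Python set stands in for the Puffin-encoded Roaring bitmap, and `scan` stands in for an engine's read path.

```python
# Minimal sketch of merge-on-read with a deletion vector.
# A deletion vector marks row *positions* in a data file as logically
# deleted, so a DELETE/MERGE avoids rewriting the whole Parquet file.
# Iceberg V3 encodes these as Roaring bitmaps in Puffin files; here a
# plain Python set of positions stands in for the bitmap.

data_file = ["row-0", "row-1", "row-2", "row-3", "row-4"]  # one data file, conceptually
deletion_vector = {1, 3}  # positions deleted by later DELETE/MERGE commits

def scan(rows, dv):
    """Merge-on-read: emit only positions absent from the deletion vector."""
    return [row for pos, row in enumerate(rows) if pos not in dv]

live_rows = scan(data_file, deletion_vector)
# live_rows == ["row-0", "row-2", "row-4"]
```

The copy-on-write alternative would rewrite all five rows into a new file to delete two of them; the vector makes the delete a small metadata-side write instead.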

Why it exists

V2 revealed three structural limits at scale: copy-on-write for any update made CDC pipelines economically punishing, lack of row provenance forced full-table scans for incremental processing, and the strict relational schema required separate normalization ETL for any semi-structured ingest. V3 addresses all three at the spec layer so engines (Spark, Trino, Flink, Athena, Snowflake, Databricks) inherit the gains without bespoke patches.

Primary use cases

Row-level data lineage for compliance and AI provenance, native CDC detection in Iceberg tables, high-frequency MERGE/UPDATE workloads via deletion vectors, querying semi-structured payloads (JSON, telemetry) without normalization ETL, agent-ready metadata exposure.
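The incremental-read use case above rests on Row Lineage. The sketch below shows the idea with plain dictionaries; the field names `_row_id` and `_last_updated_sequence_number` follow the V3 metadata columns, but the table layout and `changed_since` helper are illustrative, not an engine API.

```python
# Sketch of incremental reads via V3 row lineage.
# Each row carries a stable row ID and the sequence number of the
# commit that last modified it, so a consumer can pull only what
# changed since its checkpoint instead of scanning the full table.

rows = [
    {"_row_id": 1, "_last_updated_sequence_number": 3, "val": "a"},
    {"_row_id": 2, "_last_updated_sequence_number": 7, "val": "b"},
    {"_row_id": 3, "_last_updated_sequence_number": 5, "val": "c"},
]

def changed_since(rows, checkpoint_seq):
    """Return only rows modified after the consumer's last checkpoint."""
    return [r for r in rows if r["_last_updated_sequence_number"] > checkpoint_seq]

delta = changed_since(rows, checkpoint_seq=4)
# rows with _row_id 2 and 3 changed after sequence number 4
```

In V2 this filter required external change tracking or a full-table diff; in V3 the provenance needed for CDC and audit trails travels with the row itself.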
