Apache Iceberg
An open table format for large analytic datasets. Manages metadata, snapshots, and schema evolution for collections of data files (typically Parquet) on object storage.
Summary
Iceberg is the central table format in the S3 ecosystem. It turns a pile of Parquet files on S3 into a reliable, evolvable, SQL-queryable table without requiring a database server, and it has become the de facto standard across engines (Spark, Trino, Flink, DuckDB).
- Iceberg is not a query engine. It is a table format specification plus libraries. You still need Spark, Trino, DuckDB, or another engine to query Iceberg tables.
- Hidden partitioning is powerful but not magic. Poor sort order or excessive partition granularity still produces small files and slow queries.
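To make the partition-granularity point concrete, here is a minimal Python sketch of how a partition transform derives a partition value from a timestamp column. It is not Iceberg's implementation: the function names `day_transform` and `hour_transform` and the sample workload are hypothetical, and the sketch only assumes that day and hour transforms count whole days or hours since the Unix epoch.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Whole days since the Unix epoch (the idea behind a day transform)."""
    return (ts - EPOCH).days


def hour_transform(ts: datetime) -> int:
    """Whole hours since the Unix epoch (the idea behind an hour transform)."""
    return int((ts - EPOCH).total_seconds() // 3600)


# Hypothetical workload: one event every 10 minutes for exactly two days.
start = datetime(2026, 1, 1, tzinfo=timezone.utc)
events = [start + timedelta(minutes=10 * i) for i in range(288)]

# Group the same rows under two different partition granularities.
by_day = Counter(day_transform(ts) for ts in events)
by_hour = Counter(hour_transform(ts) for ts in events)

# Every distinct partition value needs at least one data file per write,
# so finer granularity spreads the same rows over many more (smaller) files.
print(len(by_day), "daily partitions,", len(by_hour), "hourly partitions")
```

With this workload the hourly layout holds only six rows per partition, which is the small-files pattern the bullet above warns about. The query author never references the derived value directly; the engine applies the same transform to the filter on the source column.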
implements: Lakehouse Architecture (the primary table format for lakehouses)
depends_on: Apache Parquet (default data file format)
solves: Small Files Problem (compaction), Schema Evolution (column-ID-based evolution), Partition Pruning Complexity (hidden partitioning)
constrained_by: Metadata Overhead at Scale, Lack of Atomic Rename
scoped_to: Table Formats, Lakehouse
Definition
An open table format for large analytic datasets. Manages metadata, snapshots, and schema evolution for collections of data files (typically Parquet) stored on object storage.
Raw files on S3 have no concept of a "table." Iceberg adds transactional table semantics — schema enforcement, hidden partitioning, snapshot isolation, time-travel — on top of object storage without requiring a specialized database engine.
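Snapshot isolation and time travel both follow from one idea: every commit produces a new immutable snapshot that lists the data files visible at that point. The following is a toy Python model of that idea, not Iceberg's code. The `ToyTable` and `Snapshot` classes are hypothetical, and snapshot IDs are sequential here purely for readability.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # immutable set of file paths visible in this snapshot


@dataclass
class ToyTable:
    snapshots: list = field(default_factory=list)

    def current(self) -> Snapshot:
        """Latest committed snapshot (an empty one if nothing was committed)."""
        return self.snapshots[-1] if self.snapshots else Snapshot(0, ())

    def append(self, *new_files: str) -> Snapshot:
        """Commit: build a new snapshot; earlier snapshots are never mutated."""
        base = self.current()
        snap = Snapshot(base.snapshot_id + 1, base.data_files + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def as_of(self, snapshot_id: int) -> Snapshot:
        """Time travel: look up an earlier snapshot by ID."""
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap
        raise KeyError(snapshot_id)


table = ToyTable()
table.append("a.parquet")
reader_view = table.current()   # a reader pins snapshot 1
table.append("b.parquet")       # a concurrent writer commits snapshot 2

print(reader_view.data_files)       # the pinned reader still sees only a.parquet
print(table.current().data_files)   # new readers see both files
print(table.as_of(1).data_files)    # time travel back to snapshot 1
```

Because a reader holds a reference to one immutable snapshot, a later commit cannot change what that reader sees. Time travel is the same lookup performed deliberately against an older snapshot.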
Typical uses: lakehouse table management, schema evolution, partition management, and concurrent read/write isolation over S3 data.
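Column-ID-based schema evolution can be illustrated with a small Python sketch. It assumes, as a simplification, that a data file stores each value under a numeric field ID and that the table schema maps IDs to current column names. The `project` helper and the sample schemas are hypothetical.

```python
# A row from a file written under schema v1: values are keyed by field ID.
old_file_row = {1: 42, 2: "alice"}

schema_v1 = {1: "id", 2: "name"}
# v2: rename "name" to "full_name" (still ID 2) and add "email" as new ID 3.
schema_v2 = {1: "id", 2: "full_name", 3: "email"}
# v3: drop "full_name", then add a brand-new "name" column with fresh ID 4.
schema_v3 = {1: "id", 3: "email", 4: "name"}


def project(row_by_id: dict, schema: dict) -> dict:
    """Resolve a stored row against a schema by field ID.

    IDs missing from the file read as None; IDs missing from the schema
    are ignored.
    """
    return {name: row_by_id.get(field_id) for field_id, name in schema.items()}


print(project(old_file_row, schema_v2))
# {'id': 42, 'full_name': 'alice', 'email': None}
print(project(old_file_row, schema_v3))
# {'id': 42, 'email': None, 'name': None}
```

The rename needs no file rewrite because the ID is unchanged. The re-added `name` column does not resurrect old values because it carries a new ID, which is the failure mode that name-based resolution is prone to.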
Recent developments
- Ryft Study — Iceberg is now foundational in enterprise data architectures. Ryft published "The State of Apache Iceberg in the Enterprise (2026)," an independent survey of 252 senior data and IT leaders. The headline numbers: 58% of organizations now use Iceberg for business-critical analytics, 95% are using or planning to use it for AI/ML workloads, and 79% plan to move remaining data to Iceberg within 12 months. The framing has shifted from "a table format option" to "the table format" — operational maturity is the next focus area.
- Apache Polaris — graduated to Top-Level Project February 18, 2026. Polaris is the open-source, vendor-neutral Iceberg catalog service implementing the Iceberg REST Catalog API. Snowflake donated Polaris to the Apache Software Foundation in August 2024; it graduated to a self-governing Top-Level Project on February 18, 2026. Snowflake's Horizon Catalog is built on the same Polaris backbone as the open-source community version. Dremio's Enterprise Catalog is also Polaris-derived. Polaris is positioned to evolve from "an Iceberg catalog" into a true lakehouse control plane managing views, models, data products, and governance policies across multi-vendor, multi-format environments.
- REST Catalog became the default for new deployments in 2026. Per Conduktor's catalog management guide, the REST Catalog is now recommended over Hive, Glue, and Nessie for greenfield deployments — vendor-neutral API, broad engine support, modern architecture. Project Nessie still wins where Git-like versioning semantics matter; Polaris wins where enterprise access control matters; the article recommends a multi-catalog strategy using different implementations for different use cases. The REST Catalog spec is now an explicit implementation target for AWS, Snowflake, Databricks, and Nessie.
- Streaming integration matured in 2026. RisingWave's 2026 retrospective reports: Iceberg v2 row-level deletes are now consistent across Trino, Spark, and DuckDB. RisingWave added stable Iceberg sink support with exactly-once semantics. Query engines closed the performance gap — DuckDB and Trino now deliver warehouse-class performance for most analytical shapes, with sub-second response times on S3 for small-to-medium datasets. The implication: Iceberg is no longer "batch-only" — streaming write paths into Iceberg are production-ready.
- The platform war: Snowflake vs Databricks framed through Iceberg. Per the data-lakehouse comparison analysis, the Iceberg-vs-Delta-Lake competition is inseparable from the Snowflake ($4.84B ARR) vs Databricks ($5.4B ARR) platform war. Snowflake's embrace of Iceberg is strategic offense (Polaris open-sourced 2024, donated to Apache, graduated 2026). Databricks responded by open-sourcing Unity Catalog in June 2024 with Iceberg REST Catalog API support. With the Tabular acquisition, Databricks now influences both Iceberg and Delta — and is publicly arguing for format convergence rather than competition.
- Engine adoption — Spark dominant, DuckDB rising. Per the 2025 State of the Apache Iceberg Ecosystem survey (referenced via the managed-Iceberg guide), engine adoption stands at 96.4% Spark, 60.7% Trino, with growing DuckDB and Flink usage. The hottest 2026 role is Lakehouse Platform Engineer ($180k-260k) — owns table catalog, partitioning standards, compaction policies, and query latency SLAs. Hudi has lost mindshare outside Uber; Iceberg adoption eclipsed Hudi by 5-7×. Iceberg vs Paimon: Iceberg wins batch ETL + multi-branch versioning; Paimon wins streaming latency (100-500ms single-row updates vs Iceberg copy-on-write's 15s).
- Iceberg v3 GA on Snowflake (May 7, 2026): geography, geometry, nanosecond timestamps, and variant ship. Snowflake released v3 spec support to GA: four new data types (geography, geometry, nanosecond timestamp, variant), column-level default values, deletion vectors for faster updates and deletes, and row lineage for change-data-capture against Iceberg tables. Reading Snowflake-managed v3 tables from external engines through the Horizon Iceberg REST Catalog API is also GA, though writes from external engines into Snowflake-managed v3 tables remain unsupported. The vendor-neutral v3 spec is now in production on at least one major commercial engine; expect parity rollout across Trino, Databricks, and AWS through the rest of 2026.
- Iceberg Summit 2026 (April 8-9, San Francisco): third edition, expanded to two days. The Iceberg Summit 2026 ran two full in-person days at the Marriott Marquis under ASF sanction with Apache Iceberg PMC oversight, up from one day in 2025. Session formats: 30-minute breakouts, 15-minute lightning talks, 45-minute panels, 60-180-minute workshops/labs. The doubling signals genuine community-meets-enterprise mass; the format mix (heavy on hands-on workshops) signals operational maturity is the next focus.
- Iceberg Rust 0.9.0 release (March 2026): native Rust implementation gaining ground. The Apache Iceberg PMC shipped iceberg-rust 0.9.0, covering January through early March 2026: 109 PRs from 28 contributors (8 new). iceberg-rust is the native Rust implementation of the Iceberg spec, providing high-performance reading, writing, and management of Iceberg tables, with Python bindings via pyiceberg-core. The implication for the broader ecosystem: Iceberg implementations are no longer a JVM monoculture; Rust-language clients (DataFusion, Polars, GreptimeDB, RisingWave) get first-class Iceberg support without a Java runtime in the path.
- 99% data pruning before I/O: why Iceberg dominates high-concurrency AI writes. Per Onehouse's deep comparison and spec deep-dives, Iceberg's three-layer metadata tree (table metadata file → manifest list → manifest files, reached from the catalog pointer and ending at the data files) lets query engines prune >99% of data files before initiating I/O. For continuous embedding pipelines and Mixture-of-Experts checkpoint flushes, workloads where thousands of agents write concurrently, this is the determining factor: Iceberg consistently outperforms alternatives on highly concurrent INSERT/UPDATE/DELETE under MVCC, with sub-second query planning against tables holding millions of files.
- The Paimon-Iceberg bridge — real-time AI lakehouse interoperability. Per Alibaba Cloud's Paimon-Iceberg writeup, Apache Paimon now uses Iceberg V3 deletion vectors to automatically generate Iceberg-compatible snapshots from its LSM-tree streaming layer. ByteDance and Alibaba Group run individual Paimon tables at 40 million rows/second with Iceberg snapshots auto-published so Trino/StarRocks read the same data without separate ETL. This bridge is now a key reason enterprises pick Iceberg as the analytical layer for real-time AI ingestion — Paimon owns streaming, Iceberg owns the engine ecosystem.
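The metadata-driven pruning mentioned in the list above can be sketched in a few lines of Python. This is a simplified model rather than Iceberg's planner: the `DataFile` and `Manifest` classes, the `plan_scan` function, and the synthetic table layout are all hypothetical. A single integer column stands in for both the partition summaries kept at the manifest-list level and the per-file column bounds kept in manifests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: int  # minimum value of the filtered column in this file
    upper: int  # maximum value of the filtered column in this file


@dataclass(frozen=True)
class Manifest:
    files: tuple

    @property
    def lower(self) -> int:
        return min(f.lower for f in self.files)

    @property
    def upper(self) -> int:
        return max(f.upper for f in self.files)


def overlaps(lo: int, hi: int, q_lo: int, q_hi: int) -> bool:
    """True if the closed ranges [lo, hi] and [q_lo, q_hi] intersect."""
    return lo <= q_hi and q_lo <= hi


def plan_scan(manifests, q_lo: int, q_hi: int) -> list:
    """Return paths of files that may contain rows with q_lo <= value <= q_hi."""
    selected = []
    for manifest in manifests:
        # Level 1: skip a whole manifest using its summary bounds.
        if not overlaps(manifest.lower, manifest.upper, q_lo, q_hi):
            continue
        # Level 2: skip individual files using per-file bounds.
        selected.extend(
            f.path for f in manifest.files if overlaps(f.lower, f.upper, q_lo, q_hi)
        )
    return selected


# Synthetic table: 100 manifests x 100 files; file k covers values 10k..10k+9.
manifests = [
    Manifest(tuple(
        DataFile(f"m{m}/f{i}.parquet", (m * 100 + i) * 10, (m * 100 + i) * 10 + 9)
        for i in range(100)
    ))
    for m in range(100)
]

total_files = sum(len(m.files) for m in manifests)
files = plan_scan(manifests, 12_340, 12_389)
pruned = 1 - len(files) / total_files
print(f"{len(files)} of {total_files} files selected, {pruned:.2%} pruned")
```

The pruning ratio here depends entirely on the data being well clustered on the filtered column. With overlapping value ranges across files, the same planner would select far more files, which is why sort order and compaction matter as much as the metadata layout.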
Resources
Official Apache Iceberg documentation covering the table format specification, catalog integrations, and query engine compatibility.
The primary Iceberg repository containing the spec, Java/Python libraries, and the core table format implementation that operates on S3.
The formal Iceberg table format specification — the authoritative reference for how Iceberg organizes metadata and data files on object stores.
Iceberg's dedicated AWS integration page documenting S3 file I/O, S3 catalog support, and AWS SDK configuration for Iceberg tables.