Apache Iceberg
Summary
What it is
An open table format for large analytic datasets. Manages metadata, snapshots, and schema evolution for collections of data files (typically Parquet) on object storage.
Where it fits
Iceberg is the central table format in the S3 ecosystem. It turns a pile of Parquet files on S3 into a reliable, evolvable, SQL-queryable table, without requiring a database server. It has become the de facto standard across engines (Spark, Trino, Flink, DuckDB).
Misconceptions / Traps
- Iceberg is not a query engine. It is a table format specification plus libraries. You still need Spark, Trino, DuckDB, or another engine to query Iceberg tables.
- Hidden partitioning is powerful but not magic. Poor sort order or excessive partition granularity still produces small files and slow queries.
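The hidden-partitioning trap above can be sketched with a small in-memory model. This is an illustrative toy, not the Iceberg library API: `ToyTable`, `day_transform`, `hour_transform`, and the `event_ts` column are hypothetical names invented for the example. Writers supply only the timestamp column, the partition value is derived from it, and a filter on the timestamp prunes partitions without the query ever naming a partition column.

```python
from collections import defaultdict
from datetime import datetime


def day_transform(ts):
    """Toy stand-in for a day-granularity partition transform."""
    return ts.date()


def hour_transform(ts):
    """Toy stand-in for an hour-granularity partition transform."""
    return ts.replace(minute=0, second=0, microsecond=0)


class ToyTable:
    """Hypothetical in-memory model of a partitioned table (not Iceberg's API)."""

    def __init__(self, transform):
        self.transform = transform
        self.partitions = defaultdict(list)  # derived partition value -> rows

    def append(self, row):
        # The writer never states the partition; it is derived from event_ts.
        self.partitions[self.transform(row["event_ts"])].append(row)

    def scan(self, start, end):
        """Return (rows, partitions_read) for start <= event_ts < end."""
        lo, hi = self.transform(start), self.transform(end)
        rows, touched = [], 0
        for part, part_rows in self.partitions.items():
            if lo <= part <= hi:  # skip partitions outside the filter range
                touched += 1
                rows.extend(r for r in part_rows if start <= r["event_ts"] < end)
        return rows, touched


daily = ToyTable(day_transform)
hourly = ToyTable(hour_transform)
for day in (1, 2, 3):
    for hour in (0, 12):
        row = {"event_ts": datetime(2024, 1, day, hour), "value": day * 100 + hour}
        daily.append(row)
        hourly.append(row)

rows, touched = daily.scan(datetime(2024, 1, 2, 0), datetime(2024, 1, 2, 18))
print(len(rows), touched)                              # rows matched, partitions read
print(len(daily.partitions), len(hourly.partitions))   # partition counts per granularity
```

With the same six rows, the hourly layout ends up with twice as many partitions as the daily one. In a real table each partition holds at least one file per write, so finer granularity means more and smaller files, which is the effect the bullet warns about.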
Key Connections
- implements: Lakehouse Architecture (the primary table format for lakehouses)
- depends_on: Apache Parquet (default data file format)
- solves: Small Files Problem (compaction), Schema Evolution (column-ID-based evolution), Partition Pruning Complexity (hidden partitioning)
- constrained_by: Metadata Overhead at Scale, Lack of Atomic Rename
- scoped_to: Table Formats, Lakehouse
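Column-ID-based schema evolution, mentioned in the connections above, can be illustrated with a minimal sketch. The dictionaries and the `read` helper below are hypothetical and only model the idea; they are not how the Iceberg libraries represent schemas. The point is that stored values are keyed by a stable field ID, so a rename changes only the schema mapping, and a column that is dropped and later re-added under the same name gets a fresh ID and does not resurrect old data.

```python
# Toy data file: each row stores values keyed by field ID, never by column name.
data_file = [{1: "a-17", 2: 3}, {1: "b-42", 2: 5}]

# Schemas map field ID -> current column name (hypothetical representation).
schema_v1 = {1: "user", 2: "clicks"}
# v2 renames field 2 and adds field 3; the existing file is not rewritten.
schema_v2 = {1: "user", 2: "click_count", 3: "country"}
# v3 drops field 2, then adds a new column that reuses the name "clicks"
# but is assigned a new ID (4), so old values are not picked up by mistake.
schema_v3 = {1: "user", 4: "clicks"}


def read(rows, schema):
    """Project stored rows through a schema; IDs absent from the file read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


old_view = read(data_file, schema_v1)
renamed_view = read(data_file, schema_v2)
readded_view = read(data_file, schema_v3)

print(old_view[0])
print(renamed_view[0])
print(readded_view[0])
```

A name-based format would either lose the data on rename or silently expose the old `clicks` values under the re-added column; tracking by ID avoids both cases.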
Definition
What it is
An open table format for large analytic datasets. Manages metadata, snapshots, and schema evolution for collections of data files (typically Parquet) stored on object storage.
Why it exists
Raw files on S3 have no concept of a "table." Iceberg adds transactional table semantics (schema enforcement, hidden partitioning, snapshot isolation, and time travel) on top of object storage without requiring a specialized database engine.
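Snapshot isolation and time travel can be pictured with a simplified model. The sketch below is an assumption-laden toy, not Iceberg's implementation: `Snapshot`, `ToyTable`, `commit`, and `files_as_of` are hypothetical names, and a real commit is an atomic swap of the table's metadata pointer in a catalog, usually with retry logic rather than a plain failure. Each snapshot is an immutable list of data files; readers pin one snapshot, and writers commit a new snapshot only if the table has not changed since the snapshot they started from.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: Tuple[str, ...]  # immutable set of data files visible in this snapshot


@dataclass
class ToyTable:
    """Hypothetical model: a table is an append-only history of snapshots."""
    snapshots: List[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, base: Snapshot, new_files: Tuple[str, ...]) -> Snapshot:
        # Optimistic concurrency: succeed only if nobody committed since `base`.
        if base.snapshot_id != self.current().snapshot_id:
            raise RuntimeError("conflict: table changed since base snapshot")
        snap = Snapshot(base.snapshot_id + 1, base.files + new_files)
        self.snapshots.append(snap)
        return snap

    def files_as_of(self, snapshot_id: int) -> Tuple[str, ...]:
        # Time travel: look up the file list of an older snapshot.
        return next(s.files for s in self.snapshots if s.snapshot_id == snapshot_id)


table = ToyTable()
reader_view = table.current()  # a reader pins snapshot 0
s1 = table.commit(table.current(), ("a.parquet",))
s2 = table.commit(table.current(), ("b.parquet",))

try:
    table.commit(s1, ("c.parquet",))  # stale base: s2 was committed after s1
    conflict = False
except RuntimeError:
    conflict = True

print(reader_view.files, table.files_as_of(1), table.current().files, conflict)
```

The pinned reader keeps seeing an empty table while later commits land, and the older snapshot remains queryable by ID, which is the behaviour that snapshot isolation and time travel describe.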
Primary use cases
Lakehouse table management, schema evolution, partition management, concurrent read/write isolation over S3 data.
Relationships
Outbound Relationships
scoped_to, implements, depends_on, constrained_by
Inbound Relationships
Resources
Official Apache Iceberg documentation covering the table format specification, catalog integrations, and query engine compatibility.
The primary Iceberg repository containing the spec, Java/Python libraries, and the core table format implementation that operates on S3.
The formal Iceberg table format specification — the authoritative reference for how Iceberg organizes metadata and data files on object stores.
Iceberg's dedicated AWS integration page documenting S3 file I/O, S3 catalog support, and AWS SDK configuration for Iceberg tables.