Table Formats
The category of specifications (Iceberg, Delta, Hudi) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collections of files on object storage.
Summary
Table formats bridge the gap between raw files on S3 and the structured tables that SQL engines expect. They are the enabling layer for lakehouse architectures.
- Table formats are specifications, not databases. They define how metadata and data files are organized — the query engine is separate.
- Choosing a table format is increasingly a convergent decision. Iceberg has become the de-facto standard, but Delta and Hudi remain relevant in their ecosystems.
- scoped_to S3: all table formats operate on S3-stored files
- Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec (scoped_to Table Formats): the three major specifications
- Apache Parquet (scoped_to Table Formats): the dominant data file format under all three
- Schema Evolution (scoped_to Table Formats): the problem table formats exist to solve
- Metadata Overhead at Scale (scoped_to Table Formats): the problem table formats introduce
Definition
The category of specifications (Iceberg, Delta, Hudi, Paimon, DuckLake) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collections of files stored on object storage. As of 2026 the category is no longer a pure "data lake metadata layer" question; it is the load-bearing abstraction beneath both batch analytics and AI/ML retrieval pipelines, and the choice of format increasingly dictates the analytical engine ecosystem (Iceberg→Trino/Snowflake/Polaris, Delta→Databricks/Photon, Hudi→Onehouse/streaming-CDC).
Raw files on S3 have no transactional guarantees, no schema enforcement, and no efficient way to track which files belong to a logical table. Table format specifications solve this by adding a metadata layer on top of the files. The 2026 second-order reason matters more than the original: AI workloads need sub-second metadata lookups for agentic retrieval, streaming ingestion needs sub-minute commit cycles, and the file-based-metadata-on-S3 generation hits a hard ceiling on both. The category is fragmenting into two architectural ideologies: distributed file-based catalogs (Iceberg/Delta/Hudi) and SQL-native catalog-backed metadata (DuckLake/Paimon hybrids).
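The "metadata layer on top of the files" idea can be illustrated with a minimal sketch. This is a toy, not the layout of Iceberg, Delta, or Hudi: the `ToyTable` class, the `snap-*.json` naming, and the `current.json` pointer are all hypothetical, and a local directory stands in for object storage. It shows only the general shape: each commit writes a new immutable snapshot listing the table's data files, then swaps a single pointer, which is what makes atomic commits and time-travel reads possible.

```python
import json
import os
import tempfile


class ToyTable:
    """Toy table: immutable snapshot files plus a pointer to the current one."""

    def __init__(self, root):
        self.root = root
        os.makedirs(os.path.join(root, "metadata"), exist_ok=True)

    def _snapshot_path(self, snapshot_id):
        return os.path.join(self.root, "metadata", f"snap-{snapshot_id:05d}.json")

    def _pointer_path(self):
        return os.path.join(self.root, "metadata", "current.json")

    def current_snapshot_id(self):
        try:
            with open(self._pointer_path()) as f:
                return json.load(f)["snapshot_id"]
        except FileNotFoundError:
            return 0  # empty table, no commits yet

    def files(self, snapshot_id=None):
        """List data files in a snapshot (default: current). Older ids give time travel."""
        sid = self.current_snapshot_id() if snapshot_id is None else snapshot_id
        if sid == 0:
            return []
        with open(self._snapshot_path(sid)) as f:
            return json.load(f)["files"]

    def commit(self, added=(), removed=()):
        """Write a new immutable snapshot, then swap the pointer to it."""
        parent = self.current_snapshot_id()
        dropped = set(removed)
        new_files = [p for p in self.files(parent) if p not in dropped] + list(added)
        sid = parent + 1
        # Mode "x" fails if the snapshot file already exists, so a second
        # writer that started from the same parent loses the race.
        with open(self._snapshot_path(sid), "x") as f:
            json.dump({"parent": parent, "files": new_files}, f)
        tmp = self._pointer_path() + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"snapshot_id": sid}, f)
        os.replace(tmp, self._pointer_path())  # atomic on a local filesystem
        return sid


table = ToyTable(tempfile.mkdtemp())
s1 = table.commit(added=["part-0.parquet", "part-1.parquet"])
s2 = table.commit(added=["part-2.parquet"], removed=["part-0.parquet"])
print(table.files())    # ['part-1.parquet', 'part-2.parquet']
print(table.files(s1))  # ['part-0.parquet', 'part-1.parquet']
```

The pointer swap here leans on a local atomic rename, which is an assumption of the sketch; on object storage, a format has to get the same effect some other way, for example through a catalog or a conditional write.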
Recent developments
- Format positioning has stabilized, with Iceberg as the vendor-neutral default. Per RisingWave's 2026 comparison, the three Apache-grade open table formats now occupy distinct niches: Iceberg holds the vendor-neutral position (most diverse engine support, REST Catalog standard, broadest cloud-platform integration); Delta Lake, which originated at Databricks, gets first-class treatment in the Photon-Spark-Unity-Catalog stack; Hudi keeps its Uber-origin advantage in CDC and record-level upserts. The 2026 frame: format choice has matured from "which spec is technically best" to "which catalog ecosystem are you committing to."
- Streaming providers are converging on Iceberg as the persistence-tier default. Per Ventana Research's analysis of open table formats, streaming-data providers — Cloudera, Confluent, Redpanda, and StreamNative — are adding native support for converting streaming events into Apache Iceberg tables for long-term persistence and analytical query. This codifies a pattern where Iceberg is the shared analytical-side substrate even when the streaming layer itself remains Kafka/Pulsar/Redpanda. Parallel signal on the Chinese cloud side: Apache Paimon now generates Iceberg V3 deletion-vector snapshots automatically, letting Trino and StarRocks read what Flink writes without a separate ETL hop. Cross-vendor and cross-geography, Iceberg is becoming the analytical lingua franca.
- The format question is now a required interview topic. Per DataDriven's analysis of 1,042 verified data engineering interview rounds, lakehouse-architecture questions covering Delta vs Iceberg vs Hudi vs Paimon now appear in production-grade interviews at unprecedented frequency. The format choice is no longer a detail left to the architecture team; it is a candidate-screening signal. Practical implication: table-format decisions made in 2024-2025 are now reviewed in hiring discussions, accelerating organizational stickiness around the format that already runs in production.
- DuckLake is the 2026 architectural break. The first genuine alternative to the Iceberg/Delta/Hudi file-based-metadata paradigm has arrived — see DuckLake for the full enrichment. DuckLake stores metadata in an ACID RDBMS (PostgreSQL / MySQL / DuckDB / SQLite) rather than as immutable Avro/JSON files on S3, collapsing query planning from multi-second manifest tree traversal to single-digit-millisecond SQL lookups. Iceberg V4 spec direction is moving toward pluggable catalog support — effectively absorbing DuckLake's thesis at the hyperscale tier — but DuckLake owns the embedded-and-edge tier developer-experience defaults in the meantime.
- Real-Time AI Lakehouse — Paimon on OSS at hyperscale. Apache Paimon on Aliyun OSS now sustains 40 million rows per second of streaming writes at ByteDance, TikTok, and Alibaba Group — see the Real-Time AI Lakehouse pattern for the full architectural shape. The category-level implication: there are now two viable answers to "what table format do I pick for real-time AI ingestion": Hudi for record-level upsert throughput, or Paimon for LSM-tree-on-Parquet streaming with auto-published Iceberg snapshots for the read side.
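The contrast drawn in the DuckLake item above, metadata kept in an ACID RDBMS rather than in files on S3, can be sketched with Python's built-in `sqlite3` module. The two-table schema and the `commit` and `files_at` helpers below are hypothetical and are not DuckLake's actual catalog schema; the sketch only illustrates that, once metadata lives in SQL, a commit is one database transaction and "which files make up the table at snapshot N" is a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, committed_at TEXT);
CREATE TABLE data_files (
    path TEXT NOT NULL,
    added_snapshot INTEGER NOT NULL,
    removed_snapshot INTEGER          -- NULL while the file is still live
);
""")


def commit(conn, added=(), removed=()):
    """Record one snapshot and its file changes in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO snapshots (committed_at) VALUES (datetime('now'))"
        )
        sid = cur.lastrowid
        conn.executemany(
            "INSERT INTO data_files (path, added_snapshot) VALUES (?, ?)",
            [(p, sid) for p in added],
        )
        conn.executemany(
            "UPDATE data_files SET removed_snapshot = ? "
            "WHERE path = ? AND removed_snapshot IS NULL",
            [(sid, p) for p in removed],
        )
    return sid


def files_at(conn, snapshot_id):
    """Return the data files visible at a given snapshot (time travel by id)."""
    rows = conn.execute(
        "SELECT path FROM data_files "
        "WHERE added_snapshot <= ? "
        "AND (removed_snapshot IS NULL OR removed_snapshot > ?) "
        "ORDER BY path",
        (snapshot_id, snapshot_id),
    )
    return [r[0] for r in rows]


s1 = commit(conn, added=["a.parquet", "b.parquet"])
s2 = commit(conn, added=["c.parquet"], removed=["a.parquet"])
print(files_at(conn, s1))  # ['a.parquet', 'b.parquet']
print(files_at(conn, s2))  # ['b.parquet', 'c.parquet']
```

The data files themselves would still live on object storage; only the bookkeeping moves into the database, which is where the planning-latency difference described above would come from.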
Resources
The Apache Iceberg specification is the definitive reference for the most widely adopted open table format, defining snapshot isolation, schema evolution, and partition evolution.
The Delta Lake transaction log protocol specification defines the ACID transaction semantics and metadata structure at the wire level.
Apache Hudi's official documentation covers its copy-on-write and merge-on-read table types, incremental processing model, and timeline-based versioning.
Dremio's comprehensive comparison shows how each table format handles metadata, partitioning, and storage at petabyte scale.