Topic

Table Formats

The category of specifications (Iceberg, Delta, Hudi) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collections of files on object storage.

Summary

What it is

The category of specifications (Iceberg, Delta, Hudi) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collections of files on object storage.

Where it fits

Table formats bridge the gap between raw files on S3 and the structured tables that SQL engines expect. They are the enabling layer for lakehouse architectures.

Misconceptions / Traps
  • Table formats are specifications, not databases. They define how metadata and data files are organized — the query engine is separate.
  • The choice of table format is converging. Iceberg has become the de facto standard, but Delta and Hudi remain relevant within their own ecosystems.

Key Connections
  • Table Formats scoped_to S3 — all table formats operate on S3-stored files
  • Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec scoped_to Table Formats — the three major specifications
  • Apache Parquet scoped_to Table Formats — the dominant data file format under all three
  • Schema Evolution scoped_to Table Formats — the problem table formats exist to solve
  • Metadata Overhead at Scale scoped_to Table Formats — the problem table formats introduce

Definition

What it is

The category of specifications (Iceberg, Delta, Hudi, Paimon, DuckLake) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collections of files stored on object storage. As of 2026 the category is no longer a pure "data lake metadata layer" question; it is the load-bearing abstraction beneath both batch analytics and AI/ML retrieval pipelines, and the choice of format increasingly dictates the analytical engine ecosystem (Iceberg→Trino/Snowflake/Polaris, Delta→Databricks/Photon, Hudi→Onehouse/streaming-CDC).

Why it exists

Raw files on S3 have no transactional guarantees, no schema enforcement, and no efficient way to track which files belong to a logical table. Table format specifications solve this by adding a metadata layer on top of the files. By 2026 a second-order driver matters more than the original one: AI workloads need sub-second metadata lookups for agentic retrieval, streaming ingestion needs sub-minute commit cycles, and the generation of formats that keeps file-based metadata on S3 hits a hard ceiling on both. The category is splitting into two architectural camps: distributed file-based catalogs (Iceberg/Delta/Hudi) and SQL-native, catalog-backed metadata (DuckLake/Paimon hybrids).
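
The metadata-layer idea can be illustrated with a minimal sketch. It is a toy, not the layout of Iceberg, Delta, or Hudi: the class ToyTable, the metadata/ directory, and the numbered JSON snapshot files are all hypothetical names chosen for illustration, and a local directory stands in for an object store. Each commit writes a new immutable snapshot listing the live data files, which is enough to show atomic commits, conflict detection, and time travel.

```python
import json
import tempfile
from pathlib import Path


class ToyTable:
    """Hypothetical toy table: a snapshot log over data-file names (not a real spec)."""

    def __init__(self, root: Path):
        self.meta = root / "metadata"
        self.meta.mkdir(parents=True, exist_ok=True)

    def current_version(self) -> int:
        # Snapshots are stored as 1.json, 2.json, ...; 0 means "empty table".
        versions = sorted(int(p.stem) for p in self.meta.glob("*.json"))
        return versions[-1] if versions else 0

    def files(self, version=None):
        """List data files visible at a snapshot (pass a version to time travel)."""
        version = self.current_version() if version is None else version
        if version == 0:
            return []
        return json.loads((self.meta / f"{version}.json").read_text())["files"]

    def commit(self, base_version, add=(), remove=()):
        """Optimistic commit: succeeds only if nobody committed since base_version."""
        removed = set(remove)
        files = [f for f in self.files(base_version) if f not in removed] + list(add)
        target = self.meta / f"{base_version + 1}.json"
        try:
            # Mode "x" fails if the file already exists. It stands in for an
            # atomic put-if-absent or a catalog compare-and-swap.
            with open(target, "x") as fh:
                json.dump({"files": files}, fh)
        except FileExistsError:
            raise RuntimeError("commit conflict: retry from the latest snapshot") from None
        return base_version + 1


# Usage: file names are placeholders; no Parquet data is actually written.
table = ToyTable(Path(tempfile.mkdtemp()))
v1 = table.commit(0, add=["part-0.parquet"])
v2 = table.commit(v1, add=["part-1.parquet"], remove=["part-0.parquet"])
print(table.files())    # files in the latest snapshot
print(table.files(v1))  # time travel to the first snapshot
```

Readers never see a half-written table because a snapshot only becomes visible once its metadata file exists. A writer that started from a stale snapshot gets a conflict and must retry. Real specifications add schemas, partition information, and column statistics to the same basic shape.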

Recent developments

Latest signals
  • Format positioning has stabilized, with Iceberg as the vendor-neutral default. Per RisingWave's 2026 comparison, the three major open table formats now occupy distinct niches: Iceberg holds the vendor-neutral position (most diverse engine support, REST Catalog standard, broadest cloud-platform integration); Delta Lake, which originated at Databricks, keeps first-class treatment in the Photon-Spark-Unity-Catalog stack; Hudi keeps the CDC and record-level upsert advantage it developed at Uber. The 2026 frame: format choice has matured from "which spec is technically best" to "which catalog ecosystem are you committing to."
  • Streaming providers are converging on Iceberg as the persistence-tier default. Per Ventana Research's analysis of open table formats, streaming-data providers — Cloudera, Confluent, Redpanda, and StreamNative — are adding native support for converting streaming events into Apache Iceberg tables for long-term persistence and analytical query. This codifies a pattern where Iceberg is the shared analytical-side substrate even when the streaming layer itself remains Kafka/Pulsar/Redpanda. Parallel signal on the Chinese cloud side: Apache Paimon now generates Iceberg V3 deletion-vector snapshots automatically, letting Trino and StarRocks read what Flink writes without a separate ETL hop. Cross-vendor and cross-geography, Iceberg is becoming the analytical lingua franca.
  • The format question is now a standard interview topic. Per DataDriven's analysis of 1,042 verified data engineering interview rounds, lakehouse-architecture questions covering Delta vs Iceberg vs Hudi vs Paimon now appear in production-grade interviews at unprecedented frequency; the format choice is no longer an architecture-team detail but a candidate-screening signal. Practical implication: table-format decisions made in 2024-2025 are now discussed in hiring, which reinforces organizational stickiness around whichever format already runs in production.
  • DuckLake is the 2026 architectural break. The first genuine alternative to the Iceberg/Delta/Hudi file-based-metadata paradigm has arrived (see DuckLake for details). DuckLake stores metadata in an ACID RDBMS (PostgreSQL / MySQL / DuckDB / SQLite) rather than as immutable Avro/JSON files on S3, collapsing query planning from a multi-second manifest-tree traversal to single-digit-millisecond SQL lookups. The Iceberg V4 spec direction is moving toward pluggable catalog support, effectively absorbing DuckLake's thesis at the hyperscale tier, but in the meantime DuckLake sets the developer-experience defaults for the embedded and edge tier.
  • Real-Time AI Lakehouse — Paimon on OSS at hyperscale. Apache Paimon on Aliyun OSS now sustains 40 million rows per second of streaming writes at ByteDance, TikTok, and Alibaba Group — see the Real-Time AI Lakehouse pattern for the full architectural shape. The category-level implication: there are now two viable answers to "what table format do I pick for real-time AI ingestion": Hudi for record-level upsert throughput, or Paimon for LSM-tree-on-Parquet streaming with auto-published Iceberg snapshots for the read side.
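
The catalog-backed approach in the DuckLake item above, where metadata lives in an ACID RDBMS and scan planning is a SQL lookup, can be sketched roughly as follows. This is an illustration under stated assumptions, not DuckLake's actual catalog schema: the data_file table, its columns, and the plan_scan helper are hypothetical, and an in-memory SQLite database stands in for the catalog.

```python
import sqlite3

# Hypothetical catalog schema: one row per data file, with the snapshot range
# in which the file is live and min/max statistics for one timestamp column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_file (
    path           TEXT PRIMARY KEY,
    begin_snapshot INTEGER NOT NULL,  -- snapshot that added the file
    end_snapshot   INTEGER,           -- snapshot that removed it (NULL = still live)
    min_ts         INTEGER NOT NULL,  -- column statistics used for pruning
    max_ts         INTEGER NOT NULL
);
""")


def plan_scan(conn, snapshot, ts_lo, ts_hi):
    """Return the files a query over [ts_lo, ts_hi] must read at a given snapshot."""
    rows = conn.execute(
        """
        SELECT path FROM data_file
        WHERE begin_snapshot <= ?
          AND (end_snapshot IS NULL OR end_snapshot > ?)
          AND max_ts >= ? AND min_ts <= ?
        ORDER BY path
        """,
        (snapshot, snapshot, ts_lo, ts_hi),
    )
    return [r[0] for r in rows]


# Snapshot 1: two files are added in a single transaction.
with conn:
    conn.executemany(
        "INSERT INTO data_file VALUES (?, ?, ?, ?, ?)",
        [("a.parquet", 1, None, 0, 99), ("b.parquet", 1, None, 100, 199)],
    )

# Snapshot 2: b.parquet is rewritten into c.parquet, again atomically.
with conn:
    conn.execute("UPDATE data_file SET end_snapshot = 2 WHERE path = 'b.parquet'")
    conn.execute("INSERT INTO data_file VALUES ('c.parquet', 2, NULL, 100, 299)")

print(plan_scan(conn, 2, 150, 250))  # current snapshot
print(plan_scan(conn, 1, 150, 250))  # time travel to snapshot 1
```

The design point is that the database transaction provides the atomic commit, and file pruning becomes an indexed query instead of a walk over manifest files on object storage. Time travel falls out of the snapshot range columns.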