Puffin Format
Definition
Apache Iceberg's binary file format for storing **arbitrary statistics, indexes, and metadata blobs** that don't fit naturally in the Iceberg manifest itself. A Puffin file is a sequence of typed "blobs" followed by a footer that describes how to find and interpret each one. It was originally introduced as a sidecar format for things like Theta sketches (NDV estimates) and Bloom filters; in Iceberg V3 it became the canonical storage for **deletion vectors**: Roaring-bitmap deletes that table metadata locates through `content_offset` and `content_size_in_bytes`, which must match the blob's offset and length recorded in the Puffin footer.
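The "blobs followed by a footer" layout can be sketched in a few lines of Python. This is a minimal illustration based on a reading of the Puffin spec (a 4-byte `PFA1` magic, a JSON footer payload, a 4-byte little-endian payload size, and 4 flag bytes), not a reference implementation; the helper names `build_puffin`, `read_footer`, and `read_blob` are hypothetical, the blob payloads are placeholder bytes rather than real sketches or bitmaps, and compressed footers are deliberately not handled. Check puffin-spec.md before relying on any detail.

```python
import json
import struct

MAGIC = b"PFA1"  # 4-byte magic assumed from the Puffin spec


def build_puffin(blobs):
    """Build a toy Puffin file from (blob_type, payload_bytes) pairs."""
    body = bytearray(MAGIC)
    metas = []
    for blob_type, payload in blobs:
        metas.append({
            "type": blob_type,
            "fields": [],
            "snapshot-id": -1,       # placeholder value for this sketch
            "sequence-number": -1,   # placeholder value for this sketch
            "offset": len(body),     # absolute position of the blob in the file
            "length": len(payload),
        })
        body += payload
    footer_payload = json.dumps({"blobs": metas}).encode("utf-8")
    # Footer: magic, JSON payload, payload size (LE int32), flags, magic.
    body += MAGIC + footer_payload
    body += struct.pack("<i", len(footer_payload))
    body += bytes(4)  # all flags clear: footer payload is uncompressed
    body += MAGIC
    return bytes(body)


def read_footer(data):
    """Parse the footer of a Puffin file held in memory and return its JSON."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing Puffin magic at start or end of file")
    flags = data[-8:-4]
    if flags[0] & 0x01:
        raise NotImplementedError("compressed footer payloads are not handled here")
    (payload_size,) = struct.unpack("<i", data[-12:-8])
    start = len(data) - 12 - payload_size
    if data[start - 4:start] != MAGIC:
        raise ValueError("missing Puffin magic at start of footer")
    return json.loads(data[start:start + payload_size].decode("utf-8"))


def read_blob(data, meta):
    """Slice one blob out of the file using its footer entry."""
    return data[meta["offset"]:meta["offset"] + meta["length"]]


if __name__ == "__main__":
    demo = build_puffin([
        ("apache-datasketches-theta-v1", b"sketch"),
        ("deletion-vector-v1", b"bitmap-bytes"),
    ])
    for meta in read_footer(demo)["blobs"]:
        print(meta["type"], meta["offset"], meta["length"], read_blob(demo, meta))
```

Because every blob is addressed by an absolute offset and length, a reader only needs the tail of the file to discover what it contains, and can then fetch individual blobs with ranged reads.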
Iceberg manifests are tightly schema-controlled: adding a new statistical field means a spec revision and a coordinated reader/writer upgrade. That makes manifests the wrong place for evolving optimization aids (column-stats sketches, Bloom filters, secondary indexes), where new variants ship faster than the manifest spec evolves. Puffin is the "manifest's evolving sibling": blobs are typed, new blob types are easy to add, and the format is decoupled from the manifest revision cycle. Hosting deletion vectors in V3 reuses the existing Puffin statistics-file infrastructure instead of introducing yet another kind of file in the table.
Typical uses: storing Theta sketches for NDV estimates (per-column cardinality without scanning the data); Bloom filters for predicate pushdown on high-cardinality columns; deletion vectors (V3's new delete mechanism, stored as compact binary bitmaps per data file); arbitrary engine-specific secondary indexes that don't need spec standardization; and staging future Iceberg metadata extensions before they're promoted into the manifest spec proper.
Recent developments
- Deletion vectors land in Puffin in Iceberg V3. Each V3 deletion vector is stored as a compact Roaring bitmap blob inside a Puffin file, with at most one deletion vector per data file; the offset and size recorded for it in table metadata must exactly match the blob's entry in the Puffin footer. Per Apache Iceberg: Puffin Spec.
- PR #11240 — official addition of deletion vectors to the table spec. Pull request merged by Ryan Blue formalizing deletion-vector blob layout in Puffin for V3. Per GitHub — apache/iceberg PR #11240.
- Spec source-of-truth in iceberg/format/puffin-spec.md. The canonical Puffin format spec lives in the Apache Iceberg repository alongside the table spec. Per GitHub — puffin-spec.md.
- Independent technical writeup on deletion-vectors-in-Puffin merge. Vincent Daniel's Medium piece (Feb 2026) walks through how deletion vectors and Puffin merged into a single V3 mechanism — replacing both V2 positional delete files and the separate sidecar-statistics path. Per Medium — Deletion vectors and Puffin in V3.
- CDC pipeline implications. The Iceberg V3 + Puffin combination dramatically reduces delete-file accumulation under high-frequency CDC workloads, which had been Iceberg V2's weakest performance area against Hudi. Per Data Lakehouse Hub — V3 advances for CDC pipelines.
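The offset-and-size matching rule for deletion vectors mentioned in the list above can be illustrated with a small check. This is a hedged sketch, not Iceberg's own validation code: `find_deletion_vector` is a hypothetical helper, the footer is a hand-written dict standing in for a parsed Puffin footer, the file path is made up, and the blob type `deletion-vector-v1` and the `referenced-data-file` property are taken from a reading of the Puffin spec and should be verified there.

```python
def find_deletion_vector(footer, data_file, content_offset, content_size_in_bytes):
    """Return the footer entry for data_file's deletion vector.

    Raises ValueError if there is not exactly one deletion-vector blob for
    the data file, or if the offset/size recorded in table metadata do not
    match the blob's offset/length in the Puffin footer.
    """
    matches = [
        blob for blob in footer["blobs"]
        if blob["type"] == "deletion-vector-v1"
        and blob.get("properties", {}).get("referenced-data-file") == data_file
    ]
    if not matches:
        raise ValueError(f"no deletion vector found for {data_file}")
    if len(matches) > 1:
        raise ValueError(f"more than one deletion vector found for {data_file}")
    blob = matches[0]
    if (blob["offset"], blob["length"]) != (content_offset, content_size_in_bytes):
        raise ValueError(
            f"metadata points at ({content_offset}, {content_size_in_bytes}) but "
            f"the footer records ({blob['offset']}, {blob['length']})"
        )
    return blob


# Hand-written stand-in for a parsed footer; all values are illustrative.
EXAMPLE_DATA_FILE = "s3://example-bucket/tbl/data/file-a.parquet"
EXAMPLE_FOOTER = {
    "blobs": [
        {
            "type": "apache-datasketches-theta-v1",
            "fields": [1],
            "snapshot-id": 1,
            "sequence-number": 1,
            "offset": 4,
            "length": 32,
        },
        {
            "type": "deletion-vector-v1",
            "fields": [],
            "snapshot-id": -1,
            "sequence-number": -1,
            "offset": 36,
            "length": 50,
            "properties": {"referenced-data-file": EXAMPLE_DATA_FILE},
        },
    ]
}

if __name__ == "__main__":
    print(find_deletion_vector(EXAMPLE_FOOTER, EXAMPLE_DATA_FILE, 36, 50)["type"])
```

Failing loudly on a mismatch is a design choice for this sketch: a reference that points even one byte off would otherwise deserialize garbage as a bitmap and silently delete the wrong rows.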