Puffin File Format
A binary format defined inside the Apache Iceberg specification for storing table-level statistics, indexes, and (in V3) deletion vectors as auxiliary blobs alongside Parquet data files. A Puffin file is a sequence of typed blobs plus a footer cataloging blob offsets, sizes, and types.
Summary
Puffin is the load-bearing format that turns Iceberg V3 from "Iceberg V2 plus features" into "Iceberg V2 with order-of-magnitude faster MERGE/UPDATE." V3 stores Roaring-bitmap deletion vectors in Puffin blobs instead of rewriting full Parquet data files for every modification.
- Puffin is not a replacement for Parquet — it is auxiliary. Data files remain Parquet (or ORC); Puffin sits beside them holding indexes and deletion bitmaps.
- The Puffin spec is permissive about blob types — engines that don't recognize a blob type just skip it. New blob types can roll out without spec versioning friction.
- Backward compatibility is per-blob-type, not per-Puffin-file. An engine reading a Puffin file with both `bloom-filter-v1` and `deletion-vector-v1` blobs may understand only one.
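The skip-what-you-don't-recognize behavior described above can be sketched as a small dispatch table. This is only an illustration: the decoder functions and `load_known_blobs` are hypothetical names, the payloads are placeholder bytes, and the example assumes an engine that happens to know two blob types but not `bloom-filter-v1`.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical decoders: real ones would parse the blob payload.
def decode_theta_sketch(payload: bytes) -> str:
    return f"theta sketch ({len(payload)} bytes)"

def decode_deletion_vector(payload: bytes) -> str:
    return f"deletion vector ({len(payload)} bytes)"

# Blob types this (imaginary) engine understands.
HANDLERS: Dict[str, Callable[[bytes], str]] = {
    "apache-datasketches-theta-v1": decode_theta_sketch,
    "deletion-vector-v1": decode_deletion_vector,
}

def load_known_blobs(blobs: List[Tuple[str, bytes]]) -> Tuple[List[str], List[str]]:
    """Decode blobs with a registered handler; record the types that were skipped."""
    decoded: List[str] = []
    skipped: List[str] = []
    for blob_type, payload in blobs:
        handler = HANDLERS.get(blob_type)
        if handler is None:
            # Unknown type: ignore it rather than failing the whole file.
            skipped.append(blob_type)
            continue
        decoded.append(handler(payload))
    return decoded, skipped

decoded, skipped = load_known_blobs([
    ("bloom-filter-v1", b"\x00" * 16),
    ("deletion-vector-v1", b"\x01" * 8),
])
```

Because the unknown type is only recorded and skipped, adding a new blob type to a file does not break readers that predate it.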
- used_by: Iceberg V3 Spec (deletion vectors specifically)
- used_by: Apache Iceberg (table-level NDV and bloom blobs)
- solves: Read / Write Amplification (bitmap deletes replace file rewrites)
- solves: Metadata Overhead at Scale (index/sketch storage separated from data files)
Definition
A binary format defined inside the Apache Iceberg specification for storing **table-level statistics, indexes, and deletion vectors** as auxiliary blobs alongside Parquet data files. A Puffin file is a sequence of typed "blobs" (each tagged with a blob-type identifier such as `apache-datasketches-theta-v1`, `deletion-vector-v1`, `bloom-filter-v1`) plus a small footer that catalogs blob offsets, sizes, and types. Iceberg V3 elevates Puffin from a curiosity to a load-bearing format by storing **Roaring-bitmap deletion vectors** in Puffin blobs instead of rewriting full Parquet data files for every UPDATE/DELETE.
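To make the "blobs plus footer" layout concrete, here is a minimal Python sketch that writes and re-reads a Puffin-style file in memory. It assumes the layout as I understand it from the Puffin spec: a 4-byte `PFA1` magic, the blob payloads, then a footer of magic, UTF-8 JSON payload, 4-byte little-endian payload size, 4 flag bytes, and magic again. Treat those details, the helper names (`build_puffin`, `read_footer`, `read_blob`), and the placeholder metadata values as assumptions to check against the spec; footer and blob compression are not handled.

```python
import json
import struct

MAGIC = b"PFA1"  # assumed 4-byte magic; verify against the spec

def build_puffin(blobs):
    """Assemble an uncompressed Puffin-style file from (type, payload) pairs."""
    body = bytearray(MAGIC)
    entries = []
    for blob_type, payload in blobs:
        entries.append({
            "type": blob_type,
            "fields": [],
            "snapshot-id": -1,       # placeholder value for this sketch
            "sequence-number": -1,   # placeholder value for this sketch
            "offset": len(body),     # absolute byte offset of the blob
            "length": len(payload),
        })
        body += payload
    footer_payload = json.dumps({"blobs": entries}).encode("utf-8")
    # Footer: magic, JSON payload, payload size (int32 LE), 4 flag bytes, magic.
    body += MAGIC + footer_payload
    body += struct.pack("<i", len(footer_payload)) + b"\x00\x00\x00\x00" + MAGIC
    return bytes(body)

def read_footer(data):
    """Return the list of blob metadata dicts stored in the footer."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Puffin file")
    # Trailer = payload size (4) + flags (4) + magic (4) = 12 bytes.
    (payload_size,) = struct.unpack("<i", data[-12:-8])
    flags = data[-8:-4]
    if flags[0] & 0x01:
        raise NotImplementedError("compressed footer not handled in this sketch")
    payload_start = len(data) - 12 - payload_size
    if data[payload_start - 4:payload_start] != MAGIC:
        raise ValueError("footer magic missing")
    footer = json.loads(data[payload_start:len(data) - 12].decode("utf-8"))
    return footer["blobs"]

def read_blob(data, entry):
    """Slice one blob's bytes out of the file using its footer entry."""
    return data[entry["offset"]:entry["offset"] + entry["length"]]

data = build_puffin([
    ("apache-datasketches-theta-v1", b"sketch-bytes"),
    ("deletion-vector-v1", b"dv-bytes"),
])
entries = read_footer(data)
```

A reader only needs the last few bytes to locate the footer, and the footer alone to decide which blobs are worth fetching, which is what makes the format cheap to probe on object storage.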
Iceberg V2 handled row-level changes in two ways: copy-on-write, which rewrites every affected Parquet data file, and merge-on-read, which records deletes in separate positional or equality delete files that readers must merge at scan time and that accumulate until compaction. At CDC scale, copy-on-write turns every small change into a multi-gigabyte rewrite, and merge-on-read leaves a growing backlog of small delete files. Puffin encodes deletes as compact bitmaps over row positions, so the deletes for a million-row UPDATE fit in a kilobyte-scale Puffin blob instead of requiring regenerated data files, which is credited with **up to 10× faster MERGE/UPDATE** in Iceberg V3 implementations.
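The idea of a bitmap over row positions can be illustrated with a few lines of Python. This sketch uses a plain Python integer as the bitmap, which is a stand-in and not the Roaring encoding that Iceberg V3 stores; the function names are hypothetical.

```python
def make_deletion_vector(deleted_positions):
    """Build a bitmap (as a Python int) with one bit set per deleted row position."""
    bitmap = 0
    for pos in deleted_positions:
        bitmap |= 1 << pos
    return bitmap

def is_deleted(bitmap, pos):
    """True if the row at this position is marked deleted."""
    return (bitmap >> pos) & 1 == 1

def scan(rows, bitmap):
    """Return the rows of a data file that survive the deletion vector."""
    return [row for pos, row in enumerate(rows) if not is_deleted(bitmap, pos)]

rows = ["a", "b", "c", "d", "e"]          # contents of one immutable data file
dv = make_deletion_vector([1, 3])         # delete rows at positions 1 and 3
live = scan(rows, dv)                     # data file itself is untouched

# A later delete is folded in with a bitwise OR; still no data-file rewrite.
dv2 = dv | make_deletion_vector([0])
live2 = scan(rows, dv2)
```

The data file is never modified: each delete only changes the small bitmap, and readers filter rows against it at scan time.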
Typical uses include Iceberg V3 deletion vectors for high-frequency CDC and GDPR-style point deletes; table-level NDV (number-of-distinct-values) sketches for query planner optimizations; and bloom filters and other index sidecars that accelerate predicate pushdown without bloating the data files themselves.
Resources
The canonical Puffin format specification covering blob types, the footer layout, and the type-tagged extensibility model.
Iceberg V3 specification's deletion-vector chapter — defines how Roaring-bitmap deletes are stored as Puffin blobs and how engines must read them.
AWS launch blog confirming S3 Tables support for Iceberg V3 deletion vectors and Puffin-encoded sidecars in production.