Standard

Puffin File Format

A binary format defined inside the Apache Iceberg specification for storing table-level statistics, indexes, and (in V3) deletion vectors as auxiliary blobs alongside Parquet data files. A Puffin file is a sequence of typed blobs plus a footer cataloging blob offsets, sizes, and types.

Summary

Where it fits

Puffin is the load-bearing format behind Iceberg V3's main performance gain over V2: much faster MERGE/UPDATE. V3 stores Roaring-bitmap deletion vectors in Puffin blobs, so a row-level change records the deleted row positions in a bitmap instead of rewriting full Parquet data files.

Misconceptions / Traps
  • Puffin is not a replacement for Parquet — it is auxiliary. Data files remain Parquet (or ORC); Puffin sits beside them holding indexes and deletion bitmaps.
  • The Puffin spec is permissive about blob types — engines that don't recognize a blob type just skip it. New blob types can roll out without spec versioning friction.
  • Backward compatibility is per-blob-type, not per-Puffin-file. An engine reading a Puffin file with both apache-datasketches-theta-v1 and deletion-vector-v1 blobs may understand only one of them and still use the file.
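
The skip-what-you-don't-recognize behavior described in the bullets above can be sketched as a dispatch over the footer's blob entries. This is a minimal illustration, not engine code: `load_known_blobs`, the handler table, and the `some-future-index-v9` type are hypothetical names, and the entries are simplified to just `type`, `offset`, and `length`.

```python
def load_known_blobs(file_bytes, entries, handlers):
    """Decode blobs whose type has a handler; skip the rest unread.

    entries: footer blob metadata dicts with "type", "offset", "length".
    handlers: maps a blob-type string to a decode function (hypothetical).
    """
    loaded, skipped = [], []
    for entry in entries:
        handler = handlers.get(entry["type"])
        if handler is None:
            # Unknown blob type: record it and move on without touching its bytes.
            skipped.append(entry["type"])
            continue
        start = entry["offset"]
        payload = file_bytes[start:start + entry["length"]]
        loaded.append(handler(payload))
    return loaded, skipped


# Toy file body: 4 magic bytes, then three blob payloads back to back.
file_bytes = b"PFA1" + b"abc" + b"?????" + b"\x01\x02"
entries = [
    {"type": "apache-datasketches-theta-v1", "offset": 4, "length": 3},
    {"type": "some-future-index-v9", "offset": 7, "length": 5},
    {"type": "deletion-vector-v1", "offset": 12, "length": 2},
]
# This reader only understands the theta sketch blob.
handlers = {"apache-datasketches-theta-v1": lambda payload: ("theta", payload)}

loaded, skipped = load_known_blobs(file_bytes, entries, handlers)
```

Because the footer carries each blob's offset and length, a reader can jump straight to the blobs it understands, which is what makes compatibility a per-blob-type question.
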
Key Connections
  • used_by Iceberg V3 Spec — deletion vectors specifically
  • used_by Apache Iceberg — table-level NDV and bloom blobs
  • solves Read / Write Amplification — bitmap deletes replace file rewrites
  • solves Metadata Overhead at Scale — index/sketch storage separated from data files

Definition

What it is

A binary format defined inside the Apache Iceberg specification for storing **table-level statistics, indexes, and deletion vectors** as auxiliary blobs alongside Parquet data files. A Puffin file is a sequence of typed "blobs" (each tagged with a blob-type identifier such as `apache-datasketches-theta-v1` or `deletion-vector-v1`) plus a small footer that catalogs blob offsets, sizes, and types. Iceberg V3 makes Puffin a load-bearing format by storing **Roaring-bitmap deletion vectors** in Puffin blobs, so an UPDATE/DELETE records deleted row positions instead of rewriting full Parquet data files.
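
To make the "blobs plus footer" layout concrete, here is a minimal sketch of writing and reading such a file. It assumes the layout as I understand it from the Puffin spec (4-byte magic `PFA1`, blobs, then a footer made of magic, a JSON payload, a 4-byte little-endian payload size, 4 flag bytes, and a closing magic); check those details against the spec before relying on them. The function names are hypothetical, and compression and most metadata fields are omitted.

```python
import json
import struct

MAGIC = b"PFA1"  # assumed magic bytes: 0x50 0x46 0x41 0x31


def write_puffin(blobs):
    """blobs: list of (blob_type, payload_bytes). Returns the file bytes."""
    out = bytearray(MAGIC)
    entries = []
    for blob_type, payload in blobs:
        # The footer records where each blob lives and what type it is.
        entries.append({
            "type": blob_type,
            "fields": [],
            "snapshot-id": 0,       # placeholder values for this sketch
            "sequence-number": 0,
            "offset": len(out),
            "length": len(payload),
        })
        out += payload
    footer_payload = json.dumps({"blobs": entries}).encode("utf-8")
    out += MAGIC
    out += footer_payload
    out += struct.pack("<i", len(footer_payload))
    out += bytes(4)  # flags: all zero means an uncompressed footer payload
    out += MAGIC
    return bytes(out)


def read_footer(data):
    """Parse the footer by walking backwards from the end of the file."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing Puffin magic bytes")
    flags = data[-8:-4]
    if flags[0] & 1:
        raise NotImplementedError("compressed footer not handled in this sketch")
    (size,) = struct.unpack("<i", data[-12:-8])
    start = len(data) - 12 - size
    if data[start - 4:start] != MAGIC:
        raise ValueError("missing footer magic bytes")
    return json.loads(data[start:start + size].decode("utf-8"))


def read_blob(data, entry):
    """Slice one blob out of the file using its footer entry."""
    return data[entry["offset"]:entry["offset"] + entry["length"]]


data = write_puffin([
    ("apache-datasketches-theta-v1", b"sketch-bytes"),
    ("deletion-vector-v1", b"\x0a"),
])
footer = read_footer(data)
blob_types = [entry["type"] for entry in footer["blobs"]]
```

Reading starts from the tail of the file, so a reader can fetch the footer first and then request only the byte ranges of the blobs it needs.
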

Why it exists

Iceberg V2 handled row-level changes in one of two ways: copy-on-write, which rewrites every affected Parquet data file, or merge-on-read, which writes positional or equality delete files that readers must merge against the data at scan time. At CDC scale, copy-on-write turns every small change into a multi-gigabyte rewrite, and merge-on-read accumulates many small delete files. Puffin encodes deletes as compact bitmaps over row positions, so a million-row UPDATE marks the old rows with a kilobyte-scale Puffin blob instead of regenerating the data files that hold them, with **up to 10× faster MERGE/UPDATE** reported for Iceberg V3 implementations.
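
The idea of a bitmap over row positions can be sketched in a few lines. Note the assumption: this uses a plain uncompressed bitset held in a Python integer, not a Roaring bitmap, and the `DeletionVector` class and `scan` helper are hypothetical names for illustration only.

```python
class DeletionVector:
    """Plain bitset over row positions (a stand-in for a Roaring bitmap)."""

    def __init__(self):
        self._bits = 0

    def delete(self, pos):
        """Mark the row at position pos as deleted."""
        if pos < 0:
            raise ValueError("row position must be non-negative")
        self._bits |= 1 << pos

    def is_deleted(self, pos):
        return (self._bits >> pos) & 1 == 1

    def cardinality(self):
        """Number of rows marked deleted."""
        return bin(self._bits).count("1")

    def to_bytes(self):
        """Serialize the bitset; this is what a blob payload would hold here."""
        n = (self._bits.bit_length() + 7) // 8
        return self._bits.to_bytes(n, "little")


def scan(rows, dv):
    """Read a data file's rows, dropping positions marked in the vector."""
    return [row for pos, row in enumerate(rows) if not dv.is_deleted(pos)]


rows = ["a", "b", "c", "d"]   # the data file itself is never rewritten
dv = DeletionVector()
dv.delete(1)
dv.delete(3)
live = scan(rows, dv)         # ["a", "c"]
```

The data file stays immutable; only the small bitmap changes. A real Roaring bitmap additionally compresses runs and sparse regions, which is where the compact blob sizes come from.
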

Primary use cases

Iceberg V3 deletion vectors for high-frequency CDC and GDPR-style point deletes; table-level NDV (number-of-distinct-values) sketches for query planner optimizations; and bloom filters and other index sidecars that accelerate predicate pushdown without bloating the data files themselves.
