Architecture

Manifest Pruning

The optimization technique used by table formats (especially Iceberg) to skip reading irrelevant manifest files during query planning by using upper-level metadata (manifest lists) to eliminate manifests whose data files cannot match the query predicates.


Summary

What it is

Where it fits

Manifest pruning is a critical performance optimization for large Iceberg tables on S3. Without it, query planning requires reading every manifest file (one S3 GET per manifest), which at scale can mean thousands of requests before a single data file is read.

Misconceptions / Traps

Manifest pruning effectiveness depends on data organization. If data for a given predicate value is spread across all manifests (poor clustering), pruning eliminates nothing.
Manifest pruning operates on partition-level bounds stored in the manifest list. It does not use column-level min/max statistics — that happens at the data file level during file pruning.
Over-partitioning tends to multiply small data files and, with them, manifests, which leaves more metadata to evaluate even when pruning works. Partition design therefore directly affects manifest pruning efficiency.
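The first two points can be illustrated with a toy model. The sketch below is not Iceberg's API: ManifestSummary, prune, and the integer day values are hypothetical stand-ins for the partition bounds a manifest-list entry carries. It assumes a single integer partition field and an inclusive range predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestSummary:
    """Toy stand-in for one manifest-list entry (not the real Iceberg schema)."""
    path: str
    lower: int  # smallest partition value among the manifest's data files
    upper: int  # largest partition value among the manifest's data files


def prune(manifests, lo, hi):
    """Keep manifests whose [lower, upper] range overlaps the query range [lo, hi]."""
    return [m for m in manifests if m.upper >= lo and m.lower <= hi]


# Clustered layout: each manifest covers a narrow, disjoint range of day numbers.
clustered = [ManifestSummary(f"m{i}.avro", i * 10, i * 10 + 9) for i in range(10)]

# Scattered layout: every manifest holds files from across the whole range.
scattered = [ManifestSummary(f"s{i}.avro", 0, 99) for i in range(10)]

# Hypothetical predicate: day BETWEEN 42 AND 47.
kept_clustered = prune(clustered, 42, 47)
kept_scattered = prune(scattered, 42, 47)

print(len(kept_clustered), "of", len(clustered), "manifests read (clustered)")
print(len(kept_scattered), "of", len(scattered), "manifests read (scattered)")
```

With the clustered layout only the manifest covering days 40 to 49 survives the check; with the scattered layout every manifest's bounds overlap the predicate, so all ten must still be opened.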

Key Connections

solves Metadata Overhead at Scale — reduces the number of manifest files read during planning
solves Cold Scan Latency — fewer S3 GETs during query planning means faster time-to-first-row
depends_on Clustering / Sort Order — well-organized data produces more prunable manifests
scoped_to Apache Iceberg, S3 — Iceberg's metadata pruning mechanism

Definition

What it is

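A related maintenance practice: periodically cleaning up expired snapshots, orphaned manifests, and unreferenced data files from Iceberg, Delta, or Hudi tables on S3 to reclaim storage and reduce metadata scan overhead. This cleanup is separate from the query-time manifest skipping described at the top of this entry, but it keeps the manifest count low enough for that skipping to stay cheap.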

Why it exists

Table formats accumulate metadata over time: each commit creates new manifest files, and time-travel retention keeps old snapshots. Without this cleanup, metadata growth degrades query planning performance and inflates S3 storage costs.

Primary use cases

Iceberg snapshot expiration and orphan file cleanup, Delta VACUUM operations, metadata size management for high-frequency write tables.
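These operations map onto Spark SQL statements. The following is a minimal sketch, assuming a Spark session configured with Iceberg's SQL extensions (and Delta Lake for the VACUUM statement). The helper names, catalog, table, and timestamp are hypothetical placeholders, and the strings are built without escaping, so the helpers are only suitable for trusted input.

```python
def iceberg_maintenance_statements(catalog: str, table: str, older_than: str) -> list:
    """Build Spark SQL calls to Iceberg's stored procedures for one table.

    catalog, table, and older_than are caller-supplied placeholders.
    """
    return [
        # Drop snapshots older than the retention cutoff.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}')",
        # Delete files no longer referenced by any table metadata.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
        # Compact the manifest tier so planning reads fewer, larger manifests.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
    ]


def delta_vacuum_statement(table: str, retain_hours: int) -> str:
    """Build a Delta Lake VACUUM statement with an explicit retention window."""
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"


# Hypothetical names; with a live session each string would go to spark.sql(...).
for stmt in iceberg_maintenance_statements("my_catalog", "db.events", "2026-01-01 00:00:00"):
    print(stmt)
print(delta_vacuum_statement("db.events", 168))
```

Keeping statement construction separate from execution makes the schedule easy to inspect or dry-run before anything is deleted.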

Recent developments

Latest signals

Three-level Iceberg pruning: partition (99.7% eliminated) → manifest file stats (60-95%) → row group via Bloom filters (80-99% for point lookups). Each layer is more granular and more expensive than the one before it; engines descend only as far as needed. The cumulative effect can exceed four orders of magnitude of scan reduction. Per Cazpian, "Iceberg Query Performance Tuning: Partition Pruning, Bloom Filters, Spark Configs."
Manifest list as index over manifest files: plan without reading all manifests. The partition summary recorded for each manifest lets engines skip entire manifests, and the thousands of data files they reference, in a single check. This is the load-bearing optimization that makes Iceberg's scan planning O(matching-manifests) rather than O(all-files). Per Apache Iceberg, Performance docs.
Hidden partitioning eliminates accidental full-table scans. Iceberg's key insight: filters on source columns auto-prune partitions defined by partition transforms; users never reference partition columns directly. The dominant Hive-era footgun ("forgot to filter on the partition column") doesn't exist in Iceberg. Per DataLakehouseHub, "Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans."
2026 anti-pattern: query planning grew from 200 ms to 45 s on tables with too many manifests. Without rewrite_manifests maintenance, thousands of small manifests pile up and query planning latency degrades catastrophically. IOMETE's 2026 anti-patterns piece names this as the #1 Iceberg production failure mode. Per IOMETE, "Apache Iceberg Production Anti-Patterns 2026."
rewrite_manifests must run periodically (especially after a large rewrite_data_files run). rewrite_data_files only consolidates data files; the manifest tier needs its own periodic compaction via rewrite_manifests. Forgetting this is one of the most common production-degradation patterns. Per Iceberg Lakehouse, "Performance and Apache Iceberg's Metadata" (April 2026).
Bloom filters at row-group level eliminate 80-99% of row groups for point lookups. Column-level Bloom filters in Parquet row groups let engines skip row groups that don't contain a queried value. This is the third and finest pruning layer, below manifests. Per Medium, "How Apache Iceberg Prunes Files Beyond Partitions."
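The cumulative figure in the first signal can be checked with simple arithmetic. The sketch below assumes the three layers act independently, so that their surviving fractions multiply; the cited sources do not state this, and real workloads will deviate from it. The percentages are the ranges quoted above, and the names are illustrative only.

```python
import math

# Fraction of candidates eliminated at each layer: (low end, high end).
layers = {
    "partition": (0.997, 0.997),
    "manifest file stats": (0.60, 0.95),
    "row-group Bloom filter": (0.80, 0.99),
}


def surviving_fraction(eliminated):
    """Multiply the fraction that survives each layer (assumes independence)."""
    frac = 1.0
    for e in eliminated:
        frac *= 1.0 - e
    return frac


# Weakest pruning at every layer versus strongest pruning at every layer.
weakest = surviving_fraction(lo for lo, _ in layers.values())
strongest = surviving_fraction(hi for _, hi in layers.values())

print(f"weakest:   {weakest:.2e} of data scanned, "
      f"{-math.log10(weakest):.1f} orders of magnitude reduction")
print(f"strongest: {strongest:.2e} of data scanned, "
      f"{-math.log10(strongest):.1f} orders of magnitude reduction")
```

Under this assumption the reduction spans roughly 3.6 to 5.8 orders of magnitude, so the "4+" figure holds when the manifest and Bloom filter layers prune toward the upper end of their quoted ranges.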