Small Files Amplification

Summary

What it is

The compounding negative effect of large numbers of small files on object storage operations — not just query performance (the Small Files Problem), but also metadata operations, compaction jobs, object listing, and garbage collection.

Where it fits

Small files amplification extends the Small Files Problem beyond query performance into operational burden. Each small file incurs metadata overhead, lifecycle evaluation cost, listing time, and compaction work. At billions of small files, these operational costs dominate storage management.
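To make the per-file listing cost concrete, here is a rough sketch that compares object counts and paginated LIST requests for the same volume of data at two average file sizes. The 100 TiB dataset, the 1 MiB and 256 MiB file sizes, and the helper names are illustrative assumptions. The 1,000-key page limit is the documented maximum for one S3 ListObjectsV2 response.

```python
# Illustrative arithmetic only: dataset size and file sizes are assumptions.
LIST_PAGE_SIZE = 1000  # S3 ListObjectsV2 returns at most 1,000 keys per request

TIB = 1024 ** 4
MIB = 1024 ** 2


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def file_count(total_bytes: int, avg_file_bytes: int) -> int:
    """Number of objects needed to hold total_bytes at a given average size."""
    return ceil_div(total_bytes, avg_file_bytes)


def list_requests(num_files: int, page_size: int = LIST_PAGE_SIZE) -> int:
    """Paginated LIST calls needed to enumerate num_files keys once."""
    return ceil_div(num_files, page_size)


for label, avg_size in (("1 MiB files", 1 * MIB), ("256 MiB files", 256 * MIB)):
    n = file_count(100 * TIB, avg_size)
    print(f"{label}: {n:,} objects, {list_requests(n):,} LIST requests per full scan")
```

Under these assumptions the small-file layout needs 256 times as many objects, and every job that enumerates the dataset (compaction planning, orphan cleanup, lifecycle evaluation) repeats that listing work.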

Misconceptions / Traps

Compaction reduces the number of data files but generates new metadata (manifest files, commit logs). In extreme cases, compaction of billions of small files can itself become a bottleneck.
Small files often originate from streaming ingestion (Flink, Kafka Connect) where each micro-batch produces a separate file. Fixing the source is more effective than compacting after the fact.
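The effect of the ingestion source can be estimated with simple arithmetic. The sketch below assumes a deliberately simplified model in which every writer task closes exactly one file per commit; the model, the 16 writers, the 50 GiB/day ingest rate, and the function names are all hypothetical and do not describe how Flink or Kafka Connect behave in any particular configuration.

```python
# Simplified, assumed model: each writer task emits one file per commit.
SECONDS_PER_DAY = 86_400
MIB = 1024 ** 2
GIB = 1024 ** 3


def files_per_day(commit_interval_s: int, writers: int) -> int:
    """Files produced per day if every writer closes one file per commit."""
    return (SECONDS_PER_DAY // commit_interval_s) * writers


def avg_file_mib(bytes_per_day: int, commit_interval_s: int, writers: int) -> float:
    """Average file size in MiB for a given daily ingest volume."""
    return bytes_per_day / files_per_day(commit_interval_s, writers) / MIB


for interval in (60, 600):
    n = files_per_day(interval, writers=16)
    size = avg_file_mib(50 * GIB, interval, writers=16)
    print(f"{interval:>4}s commits: {n:>6} files/day, ~{size:.1f} MiB each")
```

In this model, lengthening the commit interval from 1 minute to 10 minutes cuts the daily file count tenfold without any compaction job, at the cost of data freshness.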

Key Connections

amplifies Small Files Problem — operational impact beyond query performance
constrains Metadata Overhead at Scale — each small file adds metadata entries
SeaweedFS solves Small Files Amplification — O(1) lookup architecture
scoped_to S3, Object Storage, Table Formats

Definition

What it is

The compounding effect where small files degrade not just query performance but also metadata operations, compaction efficiency, listing throughput, and garbage collection — each degraded operation amplifying the original problem.

Recent developments

Latest signals

Minor optimization cadence: every 5-15 minutes targeting files under 16 MB. The 2026 production pattern is to run lightweight bin-packing compaction every 5-15 minutes against fragment files under 16 MB, which catches accumulation before it cascades into the manifest tier. Per Microsoft Fabric Blog — Announcing Optimized Compaction in Fabric Spark.
"Automate table maintenance before small files accumulate." DataLakehouseHub's May 2026 framing: the discipline that distinguishes production-mature lakehouse deployments is automated table-maintenance scheduling that runs before visible problems — preventive rather than reactive. The reactive teams discover small-files amplification at scale; preventive teams never see it. Per DataLakehouseHub — Automating Table Maintenance Before Small Files Accumulate (May 2026).
Cascade: small files → manifest bloat → planning failures → maintenance skipped → more small files. The amplification mechanism documented in 2026 analyses: small files cause manifest bloat, which causes planning failures, which lead teams to skip maintenance, which produces more small files. This feedback loop is what makes it an "amplification" rather than just a "problem." Per Uplatz Blog — Compaction Strategies and the Small File Problem in Object Storage.
Delete file accumulation creates a parallel read-amplification cascade. Small data files are not the only source: accumulated delete files (V2-style position deletes) also create serious read amplification, so production maintenance now includes delete-file rewrites alongside data-file compaction. Iceberg V3 deletion vectors structurally avoid this branch. Per Medium — Efficient Data Compaction Strategies in Large Data Lakes (April 2026).
Three performance gain mechanisms from compaction: metadata pruning + task reduction + sequential I/O. Compaction delivers performance gains via three structural mechanisms: metadata pruning (fewer manifests + smaller stats footprint), task reduction (fewer files = fewer Spark tasks), and sequential I/O optimization (larger files read sequentially vs scattered small-file random I/O). Each layer compounds. Per Uplatz — Compaction Strategies and the Small File Problem.
Metadata compaction (USPTO 11467774) covers the metadata-tier equivalent. The patent covers metadata compaction patterns, the manifest-rewrite work that complements data-file bin-packing, and so closes the metadata-tier loop in the small-files-amplification cascade. Per USPTO 11467774 — Metadata Compaction.
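The lightweight bin-packing pass mentioned above (fragment files under 16 MB) can be sketched as a planning step. This is a minimal first-fit-decreasing illustration, not the algorithm used by Fabric Spark, Iceberg, or Delta Lake; the 128 MB target size, the `DataFile` type, and `plan_bin_packing` are assumptions introduced for the example.

```python
from typing import List, NamedTuple

MB = 1024 * 1024


class DataFile(NamedTuple):
    """Hypothetical file descriptor: object path and size in bytes."""
    path: str
    size_bytes: int


def plan_bin_packing(
    files: List[DataFile],
    small_threshold: int = 16 * MB,  # from the text: target files under 16 MB
    target_size: int = 128 * MB,     # assumed output file size
) -> List[List[DataFile]]:
    """Group small files into rewrite batches using first-fit decreasing."""
    candidates = sorted(
        (f for f in files if f.size_bytes < small_threshold),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    bins = []  # each entry: [remaining_capacity, [files]]
    for f in candidates:
        for b in bins:
            if b[0] >= f.size_bytes:  # first bin with enough room
                b[1].append(f)
                b[0] -= f.size_bytes
                break
        else:
            bins.append([target_size - f.size_bytes, [f]])
    # Rewriting a single file gains nothing, so keep only multi-file groups.
    return [group for _, group in bins if len(group) > 1]


files = [DataFile(f"part-{i:03d}.parquet", 10 * MB) for i in range(20)]
files.append(DataFile("part-big.parquet", 200 * MB))
plan = plan_bin_packing(files)
print([len(group) for group in plan])
```

With twenty 10 MB fragments and one 200 MB file, the plan contains two rewrite groups and leaves the large file untouched, so the pass stays cheap enough to run on a short cadence.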

Connections

Outbound

scoped_to: S3, Object Storage

Inbound

solves: Compaction

Resources

Blog (High)

delta.io/blog/2023-01-25-delta-lake-small-file-compaction-op...

Delta Lake blog on small file compaction with the OPTIMIZE command, directly addressing write amplification from many small files.

Docs (High)

iceberg.apache.org/docs/latest/maintenance/

Iceberg table maintenance documentation covering compaction, orphan file cleanup, and snapshot expiration to mitigate small file overhead.