Small Files Amplification
The compounding negative effect of large numbers of small files on object storage operations — not just query performance (the Small Files Problem), but also metadata operations, compaction jobs, object listing, and garbage collection.
Summary
Small files amplification extends the Small Files Problem beyond query performance into operational burden. Each small file incurs metadata overhead, lifecycle evaluation cost, listing time, and compaction work. At billions of small files, these operational costs dominate storage management.
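As a rough illustration of the listing cost alone: S3-style LIST calls return at most 1,000 keys per request, so the request count scales linearly with file count. A minimal back-of-envelope sketch (the 50 ms per-request latency is an illustrative assumption, not a benchmark):

```python
# Back-of-envelope cost of listing a bucket full of small files.
# Assumes S3-style pagination: each LIST request returns at most 1,000 keys.
# The 50 ms per-request latency is an illustrative assumption.

def list_cost(num_files: int, keys_per_request: int = 1000,
              latency_s: float = 0.050) -> tuple[int, float]:
    """Return (number of LIST requests, total sequential listing time in hours)."""
    requests = -(-num_files // keys_per_request)  # ceiling division
    return requests, requests * latency_s / 3600

for n in (1_000_000, 1_000_000_000):
    reqs, hours = list_cost(n)
    print(f"{n:>13,} files -> {reqs:>9,} LIST requests, ~{hours:.1f} h sequential")
```

At a million files the listing is trivial; at a billion, a sequential listing takes on the order of half a day, before any compaction or garbage-collection work even starts.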
- Compaction reduces the number of data files but generates new metadata (manifest files, commit logs). In extreme cases, compaction of billions of small files can itself become a bottleneck.
- Small files often originate from streaming ingestion (Flink, Kafka Connect) where each micro-batch produces a separate file. Fixing the source is more effective than compacting after the fact.
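To see why fixing the source matters, a minimal sketch of how file counts multiply under micro-batch ingestion, assuming one file per partition per commit (the partition counts and commit intervals below are hypothetical):

```python
# Files written per day by a micro-batch writer that commits one file
# per partition per checkpoint. All numbers are illustrative assumptions.

def files_per_day(partitions: int, commit_interval_s: int) -> int:
    commits_per_day = 86_400 // commit_interval_s
    return partitions * commits_per_day

# 200 partitions with 60 s checkpoints add 288,000 files per day.
print(files_per_day(partitions=200, commit_interval_s=60))
# Lengthening the interval to 10 minutes cuts that tenfold.
print(files_per_day(partitions=200, commit_interval_s=600))
```

The product form is the point: doubling partitions or halving the commit interval doubles daily file count, so a tuning change at the writer outperforms any amount of after-the-fact compaction.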
Connections
- amplifies → Small Files Problem — operational impact beyond query performance
- constrains → Metadata Overhead at Scale — each small file adds metadata entries
- solved by → SeaweedFS — O(1) lookup architecture
- scoped to → S3, Object Storage, Table Formats
Definition
The compounding effect where small files degrade not just query performance but also metadata operations, compaction efficiency, listing throughput, and garbage collection — each degraded operation amplifying the original problem.
Resources
- Delta Lake blog on small file compaction with the OPTIMIZE command, directly addressing write amplification from many small files.
- Iceberg table maintenance documentation covering compaction, orphan file cleanup, and snapshot expiration to mitigate small file overhead.