Compaction
The background maintenance operation that merges many small data files into fewer, larger files within a table format (Iceberg, Delta, Hudi) to improve query performance and reduce S3 request overhead.
Summary
Compaction is the primary remedy for the small files problem in S3-based lakehouses. Streaming ingestion, CDC pipelines, and frequent batch writes all produce small files that degrade scan performance. Compaction rewrites those files into optimally sized Parquet files while preserving table semantics.
- Compaction is not free. It reads existing files from S3, merges them, writes new files, and updates metadata. This consumes compute, S3 GET/PUT requests, and temporary storage.
- Running compaction too aggressively conflicts with active writers. In Iceberg, concurrent compaction and writes can cause commit conflicts requiring retry.
- Compaction does not reduce data volume. It reorganizes files for efficiency but does not delete or deduplicate data. Storage usage may temporarily increase during compaction before old files are garbage-collected.
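The merge step described above is usually planned as a greedy bin-pack: small files are grouped into rewrite tasks that each approach a target output size. A minimal sketch, with illustrative names and a 512 MiB target that is an assumption, not any engine's default:

```python
# Minimal sketch of bin-pack compaction planning: group small files into
# rewrite tasks that each approach a target output size. All names and the
# target size are illustrative, not any engine's API.

TARGET_FILE_SIZE = 512 * 1024 * 1024  # assumed 512 MiB Parquet target

def plan_compaction(file_sizes, target=TARGET_FILE_SIZE):
    """Greedily pack file sizes (bytes) into groups of roughly `target` bytes.

    Each returned group is one rewrite task: read those files from S3,
    merge them, and write one right-sized output file.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 1,000 one-MiB streaming files collapse into two rewrite tasks.
small_files = [1024 * 1024] * 1000
tasks = plan_compaction(small_files)
print(len(tasks))  # → 2
```

Real planners (Iceberg's bin-pack strategy, Delta's OPTIMIZE) also filter by partition and skip files already near the target, but the grouping idea is the same.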
- solves: Small Files Problem — the primary purpose of compaction
- solves: Small Files Amplification — reduces metadata and request overhead
- scoped_to: Table Formats, S3 — operates within table format maintenance
- used_by: Apache Iceberg, Delta Lake, Apache Hudi — all formats provide compaction mechanisms
Definition
The process of rewriting many small data files on S3 into fewer, larger files to improve query performance, reduce metadata overhead, and lower API call costs — without changing the logical content of the table.
Streaming ingestion, CDC, and frequent small writes produce many small files on S3. Without periodic compaction, query engines must open thousands of files per query, inflating latency and S3 GET costs. Compaction restores optimal file sizes.
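The GET-cost inflation is easy to quantify with back-of-envelope arithmetic, assuming a scan issues at least one GET per data file (real engines issue several range GETs per file). The per-request price below is an assumption for illustration:

```python
# Back-of-envelope S3 request cost per scan, before and after compaction.
# Assumes one GET per file; the price is illustrative, not a quoted S3 rate.

GET_PRICE = 0.0004 / 1000  # assumed dollars per GET request

def scan_request_cost(num_files, gets_per_file=1, price=GET_PRICE):
    """Request cost of one full-table scan over `num_files` data files."""
    return num_files * gets_per_file * price

before = scan_request_cost(100_000)  # 100k small streaming files
after = scan_request_cost(200)       # same data compacted to ~200 files
print(f"{before:.4f} -> {after:.6f} per scan")  # 500x fewer requests
```

The same 500x factor applies to open/seek latency and to metadata entries the planner must track, which is why file count, not just total bytes, drives scan cost.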
Typical use cases: post-ingestion file consolidation in Iceberg/Delta/Hudi tables, scheduled maintenance for streaming lakehouse pipelines, and metadata size reduction.
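Scheduled maintenance jobs typically decide per partition whether compaction is worthwhile. A hypothetical trigger heuristic, with threshold and names that are assumptions for illustration:

```python
# Hypothetical scheduling check: flag a partition for compaction when its
# files are numerous and, on average, well under the target size.
# Threshold, target, and function names are assumptions, not any engine's API.

TARGET = 512 * 1024 * 1024  # assumed 512 MiB target output size

def needs_compaction(file_sizes, target=TARGET, min_ratio=0.5):
    """True if the partition's average file size is below min_ratio * target."""
    if len(file_sizes) < 2:
        return False  # nothing to merge
    avg = sum(file_sizes) / len(file_sizes)
    return avg < min_ratio * target

print(needs_compaction([8 * 1024 * 1024] * 500))   # many 8 MiB files  → True
print(needs_compaction([480 * 1024 * 1024] * 4))   # near-target files → False
```

Gating on average size avoids rewriting partitions that are already well laid out, which matters because, per the caveats above, each run costs compute and S3 requests.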
Resources
- Iceberg's official compaction guide covering rewrite_data_files, bin-packing, and sort-based compaction strategies for S3-hosted tables.
- Delta Lake OPTIMIZE command documentation for compacting small files into right-sized Parquet files on S3.
- Hudi's compaction documentation covering inline and async compaction strategies for merge-on-read tables on object storage.