Architecture

Compaction

Summary

What it is

The background maintenance operation that merges many small data files into fewer, larger files within a table format (Iceberg, Delta, Hudi) to improve query performance and reduce S3 request overhead.

Where it fits

Compaction is the primary remedy for the small files problem in S3-based lakehouses. Streaming ingestion, CDC pipelines, and frequent batch writes all produce small files that degrade scan performance. Compaction rewrites those files into optimally sized Parquet files while preserving table semantics.
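The rewrite step can be seen in miniature with a toy stand-in, using JSON-lines files instead of Parquet and plain file deletion instead of a metadata commit. This is a minimal sketch of the idea only; the function and file names are illustrative, not any table format's API:

```python
import json
import os
import tempfile

def compact(table_dir: str, target_name: str = "compacted-00000.jsonl") -> str:
    """Merge every small data file in `table_dir` into one larger file.

    Toy stand-in for table-format compaction: the logical content of the
    table is unchanged, only the physical file layout is rewritten.
    """
    small_files = sorted(
        f for f in os.listdir(table_dir) if f.endswith(".jsonl")
    )
    target_path = os.path.join(table_dir, target_name)
    with open(target_path, "w") as out:
        for name in small_files:
            with open(os.path.join(table_dir, name)) as src:
                out.write(src.read())
    for name in small_files:  # "garbage-collect" the replaced files
        os.remove(os.path.join(table_dir, name))
    return target_path

# Simulate a streaming writer that left behind many tiny files.
table_dir = tempfile.mkdtemp()
for i in range(100):
    with open(os.path.join(table_dir, f"part-{i:05d}.jsonl"), "w") as f:
        f.write(json.dumps({"id": i}) + "\n")

compact(table_dir)
remaining = os.listdir(table_dir)
print(len(remaining))  # 1 file where there were 100
```

Real compaction differs in the commit: the old files are not deleted immediately but removed from table metadata atomically, then garbage-collected later.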

Misconceptions / Traps
  • Compaction is not free. It reads existing files from S3, merges them, writes new files, and updates metadata. This consumes compute, S3 GET/PUT requests, and temporary storage.
  • Running compaction too aggressively conflicts with active writers. In Iceberg, compaction and concurrent writes both commit optimistically against the current snapshot; overlapping commits conflict and one side must retry.
  • Compaction does not reduce data volume. It reorganizes files for efficiency but does not delete or deduplicate data. Storage usage may temporarily increase during compaction before old files are garbage-collected.
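The commit-conflict trap above can be modeled with a toy optimistic-concurrency catalog: a commit succeeds only if the snapshot it was planned against is still current, which is the style of coordination Iceberg uses. All class and function names here are hypothetical, not Iceberg's API:

```python
import threading

class TableCatalog:
    """Toy catalog: a commit lands only if the snapshot it was based on
    is still the current one (compare-and-swap on a snapshot id)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.snapshot_id = 0

    def try_commit(self, based_on: int) -> bool:
        with self._lock:
            if self.snapshot_id != based_on:
                return False  # another writer committed first
            self.snapshot_id += 1
            return True

def commit_with_retry(catalog: TableCatalog, max_retries: int = 5) -> int:
    """What a compaction job must do after a conflict: re-read the
    snapshot, re-plan against it, and try again within a retry budget."""
    for attempt in range(1, max_retries + 1):
        base = catalog.snapshot_id
        # ... re-plan the file rewrite against `base` here ...
        if catalog.try_commit(base):
            return attempt
    raise RuntimeError("compaction gave up after repeated conflicts")

catalog = TableCatalog()
base = catalog.snapshot_id          # compaction plans against snapshot 0
catalog.try_commit(base)            # a concurrent writer commits first
ok_first = catalog.try_commit(base) # compaction's stale commit is rejected
attempts = commit_with_retry(catalog)  # re-plan and retry succeeds
```

The practical takeaway: the more often compaction runs while writers are active, the more rewrite work gets thrown away and redone in retries.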
Key Connections
  • solves Small Files Problem — the primary purpose of compaction
  • solves Small Files Amplification — reduces metadata and request overhead
  • scoped_to Table Formats, S3 — operates within table format maintenance
  • used_by Apache Iceberg, Delta Lake, Apache Hudi — all formats provide compaction mechanisms

Definition

What it is

The process of rewriting many small data files on S3 into fewer, larger files to improve query performance, reduce metadata overhead, and lower API call costs — without changing the logical content of the table.

Why it exists

Streaming ingestion, CDC, and frequent small writes produce many small files on S3. Without periodic compaction, query engines must open thousands of files per query, inflating latency and S3 GET costs. Compaction restores optimal file sizes.
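The request-count inflation is easy to quantify with a rough model. The chunked-read size and the one-request-per-file footer read below are modeling assumptions, not any engine's exact behavior:

```python
import math

def scan_get_requests(num_files: int, avg_file_mb: float,
                      range_read_mb: float = 8.0) -> int:
    """Rough count of S3 GET requests for one full scan: engines read
    each file in ranged chunks, plus (assumed) one extra request per
    file for the footer/metadata."""
    per_file = 1 + math.ceil(avg_file_mb / range_read_mb)
    return num_files * per_file

# The same ~10 GB table, before and after compaction.
before = scan_get_requests(num_files=10_000, avg_file_mb=1.0)   # 1 MB files
after = scan_get_requests(num_files=20, avg_file_mb=512.0)      # 512 MB files
print(before, after)  # 20000 vs 1300 GET requests per scan
```

The bytes scanned are identical in both cases; only the per-file overhead changes, which is why compaction lowers both latency and S3 request cost.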

Primary use cases

  • Post-ingestion file consolidation in Iceberg, Delta, and Hudi tables
  • Scheduled maintenance for streaming lakehouse pipelines
  • Metadata size reduction
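Planning a compaction run usually means grouping small files into rewrite tasks near a target output size. The sketch below is a simple greedy grouper in the spirit of bin-pack rewrite strategies; the function name, thresholds, and defaults are illustrative assumptions:

```python
def plan_compaction_groups(file_sizes_mb, target_mb=512.0, min_input_mb=32.0):
    """Select files smaller than `min_input_mb` and group them into
    rewrite tasks that each total roughly `target_mb`
    (greedy accumulation, a simplified bin-packing strategy)."""
    small = [s for s in file_sizes_mb if s < min_input_mb]
    groups, current, total = [], [], 0.0
    for size in sorted(small):
        if current and total + size > target_mb:
            groups.append(current)       # close the full task
            current, total = [], 0.0
        current.append(size)
        total += size
    if current:
        groups.append(current)           # trailing partial task
    return groups

# 2,000 small 4 MB files become rewrite tasks of ~512 MB each.
groups = plan_compaction_groups([4.0] * 2000)
print(len(groups), sum(groups[0]))  # 16 tasks, first one totals 512.0 MB
```

Each group then becomes one rewrite task: read its input files, write one output file near the target size, and commit the swap in table metadata.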
