Architecture

File Sizing Strategy

The practice of deliberately targeting optimal data file sizes (typically 128 MB to 1 GB for Parquet on S3) to balance S3 request overhead, metadata volume, query parallelism, and write amplification.


Summary

Where it fits

File sizing is the tuning knob that connects ingestion throughput, query performance, and storage cost in S3-based lakehouses. Too-small files cause request amplification and metadata bloat; too-large files reduce parallelism and increase write amplification during compaction.

Misconceptions / Traps

There is no universal optimal file size. The right size depends on query patterns (point lookups favor smaller files; full scans favor larger), column count, and compression ratio.
File sizing interacts with partition design. A table with a 256 MB target file size but only 10 MB of data per partition produces under-sized files regardless of configuration.
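Spark's spark.sql.files.maxPartitionBytes and Iceberg's target-file-size-bytes control different things. The former sets the read-side split size (how many bytes Spark packs into one input partition when scanning files); the latter sets the write-side target file size (exposed as the table property write.target-file-size-bytes and as the target-file-size-bytes option of compaction). Tuning one does not change the other.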
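
The partition interaction described above can be illustrated with a back-of-the-envelope sketch. The function name and the idealized writer it models (files filled up to the target, no merging across partitions) are assumptions made for illustration, not the behavior of any specific engine.

```python
import math
from typing import Tuple

MB = 1024 * 1024

def estimate_partition_files(partition_bytes: int, target_file_bytes: int) -> Tuple[int, float]:
    """Return (file_count, average_file_bytes) for a single partition.

    Assumption: an idealized writer that fills each file up to the target
    size and never combines data from different partitions.
    """
    if partition_bytes <= 0:
        return 0, 0.0
    file_count = math.ceil(partition_bytes / target_file_bytes)
    return file_count, partition_bytes / file_count

# A 10 MB partition with a 256 MB target still yields one 10 MB file.
count, avg_bytes = estimate_partition_files(10 * MB, 256 * MB)
print(count, avg_bytes / MB)
```

Under these assumptions the target acts only as an upper bound: reaching it requires enough data per partition, which is a partition-design decision rather than a file-size setting.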

Key Connections

solves Small Files Problem — targets optimal file sizes to prevent small files
relates_to Compaction — compaction enforces target file sizes
constrains Read / Write Amplification — file size determines rewrite cost
scoped_to Table Formats, S3 — file sizing is a table-format-level configuration

Definition

What it is

The deliberate planning of target file sizes for data stored on S3, balancing files that are too small (high API overhead) against files that are too large (wasted reads for selective queries). Typical targets range from 128 MB to 1 GB depending on workload.

Why it exists

S3 charges per request, and every request adds latency. Files that are too small multiply these costs; files that are too large waste bandwidth when only a fraction of the data is needed. An explicit file sizing strategy optimizes the cost-performance tradeoff.
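
The request-multiplication effect can be made concrete with a rough estimate. This is a minimal sketch: the function names, the fixed number of requests per file, and the price passed in the example are all placeholder assumptions, not measured or published figures.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB

def estimate_get_requests(dataset_bytes: int, file_bytes: int, requests_per_file: int = 3) -> int:
    """Rough GET count for one full scan of a dataset.

    Assumption: every file costs a fixed number of requests (for example a
    footer read plus a couple of ranged reads); real readers vary.
    """
    file_count = math.ceil(dataset_bytes / file_bytes)
    return file_count * requests_per_file

def estimate_request_cost(requests: int, price_per_1000: float) -> float:
    """Convert a request count into cost, given a per-1000-requests price."""
    return requests / 1000 * price_per_1000

# Compare 1 MB files with 256 MB files for a 1 TB dataset.
small = estimate_get_requests(1024 * GB, 1 * MB)
large = estimate_get_requests(1024 * GB, 256 * MB)
print(small, large, small // large)

# The price here is a placeholder; substitute the current rate for your region.
print(estimate_request_cost(small, price_per_1000=0.0004))
```

With these assumptions the same scan issues 256 times as many requests against 1 MB files as against 256 MB files, and each extra request also pays its own latency.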

Primary use cases

Compaction target sizing, streaming ingestion write batching, partition file count management.

Recent developments

Latest signals

2026 reference targets for Parquet on S3, by workload. Read-heavy tables (>100 GB): 256-512 MB. Mixed workloads: 128-256 MB. Write-heavy tables: 64-128 MB. Small tables: 32-64 MB. As general guidance, files below 128 MB create excessive metadata overhead and files above 1 GB reduce pruning effectiveness. Per Medium, "Ultimate Guide to File Sizing + Compression for Apache Iceberg on S3."
Iceberg bin-pack uses 75% / 180% boundaries around the target. By default, the bin-pack planner treats a file as a compaction candidate when its size falls below 75% or above 180% of the target; for a 512 MB target, that means under 384 MB or over roughly 922 MB. Practitioners often tune these bounds, rather than the target itself, for fine-grained control. Per Dremio, "Compaction in Apache Iceberg: Fine-Tuning Your Iceberg Table's Data Files."
Bin-pack is "usually all you need" for append-heavy and CDC tables. Its only goal is to fix file sizes, which makes it the default and usually sufficient strategy for the most common workloads (append-heavy, CDC, and general-purpose). Sort and Z-order compaction become relevant only when query patterns call for a specific data layout. Per Cloudera, "Optimization Strategies for Iceberg Tables."
Compaction frequency: daily for read-heavy tables, every few hours for streaming. As a 2026 cadence reference, read-heavy tables run compaction once per day during off-peak hours, while streaming and CDC tables run it every few hours to keep small-file accumulation bounded. Per AWS Prescriptive Guidance, "Maintaining tables by using compaction."
Apache Amoro automates Iceberg compaction as a service. Amoro is the 2026 open-source pattern: instead of writing custom Spark DAGs for compaction, teams run Amoro alongside the catalog, where it watches tables and triggers compaction when configured thresholds are crossed. This closes the "you have to engineer your own compaction infrastructure" gap. Per Olake, "How to Compact Apache Iceberg Tables: Small Files + Automation with Apache Amoro."
11 documented compaction optimizations: sort, Z-order, bin-pack, delete-file compaction, partial-progress commits, EC partitioning, and more. Practitioners in 2026 have a documented catalog of optimizations beyond the default bin-pack and can pick the strategy that fits each workload shape. Per DEV, "11 Compaction Optimizations for Iceberg Data Lakes."
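
The reference bands and the 75% / 180% bin-pack bounds listed above can be sketched as a small lookup plus a candidate check. The workload labels and function names are hypothetical, chosen for illustration; only the numeric bands and ratios come from the items above.

```python
from typing import Tuple

MB = 1024 * 1024

# Reference bands quoted above, as (low, high) in MB.
# The dictionary keys are hypothetical labels, not an Iceberg concept.
TARGET_BANDS_MB = {
    "read_heavy": (256, 512),
    "mixed": (128, 256),
    "write_heavy": (64, 128),
    "small_table": (32, 64),
}

def binpack_bounds(target_bytes: int, min_ratio: float = 0.75, max_ratio: float = 1.80) -> Tuple[float, float]:
    """Return the (lower, upper) size bounds around a target file size."""
    return target_bytes * min_ratio, target_bytes * max_ratio

def is_compaction_candidate(file_bytes: int, target_bytes: int) -> bool:
    """A file qualifies when it falls outside the bounds around the target."""
    lower, upper = binpack_bounds(target_bytes)
    return file_bytes < lower or file_bytes > upper

# Use the top of the read-heavy band as the target.
target = TARGET_BANDS_MB["read_heavy"][1] * MB
lower, upper = binpack_bounds(target)
print(lower / MB, round(upper / MB))
print(is_compaction_candidate(300 * MB, target))
print(is_compaction_candidate(400 * MB, target))
```

Exposing the two ratios as parameters mirrors the practice of tuning the bounds rather than the target: widening them leaves more files untouched per compaction run, and narrowing them rewrites more.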