File Sizing Strategy
The practice of deliberately targeting optimal data file sizes (typically 128 MB to 1 GB for Parquet on S3) to balance S3 request overhead, metadata volume, query parallelism, and write amplification.
Summary
File sizing is the tuning knob that connects ingestion throughput, query performance, and storage cost in S3-based lakehouses. Too-small files cause request amplification and metadata bloat; too-large files reduce parallelism and increase write amplification during compaction.
- There is no universal optimal file size. The right size depends on query patterns (point lookups favor smaller files; full scans favor larger), column count, and compression ratio.
- File sizing interacts with partition design. A table with a 256 MB target file size but only 10 MB of data per partition produces under-sized files regardless of configuration.
- Spark's spark.sql.files.maxPartitionBytes and Iceberg's target-file-size-bytes (set as the table property write.target-file-size-bytes) control different things. The former controls read-side split size; the latter controls write-side file size.
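The interaction between partition volume and target size can be illustrated with a small calculation. The sketch below is not taken from Spark or Iceberg: estimate_file_sizes is a hypothetical helper, and it assumes a single writer task per partition that rolls over to a new file whenever the target size is reached.

```python
MB = 1024 * 1024


def estimate_file_sizes(partition_bytes: int, target_bytes: int) -> list[int]:
    """Estimate the sizes of the files written for one partition.

    Assumption (illustrative only): one writer task handles the whole
    partition and starts a new file each time the target is reached,
    so only the last file can be smaller than the target.
    """
    if partition_bytes <= 0:
        return []
    full_files, remainder = divmod(partition_bytes, target_bytes)
    sizes = [target_bytes] * full_files
    if remainder:
        sizes.append(remainder)
    return sizes


# A partition holding 10 MB can never reach a 256 MB target.
small_partition = estimate_file_sizes(10 * MB, 256 * MB)

# A partition holding 1000 MB yields three full files and one 232 MB tail.
large_partition = estimate_file_sizes(1000 * MB, 256 * MB)

for label, sizes in (("10 MB", small_partition), ("1000 MB", large_partition)):
    print(label, [size // MB for size in sizes])
```

With several writer tasks per partition the result is usually worse, because each task writes its own tail file. That is why the partition scheme has to be chosen with the target size in mind.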
- solves: Small Files Problem (targets optimal file sizes to prevent small files)
- relates_to: Compaction (compaction enforces target file sizes)
- constrains: Read / Write Amplification (file size determines rewrite cost)
- scoped_to: Table Formats, S3 (file sizing is a table-format-level configuration)
Definition
The deliberate planning of target file sizes for data stored on S3, weighing files that are too small (high API overhead) against files that are too large (wasted reads for selective queries). Typical targets range from 128 MB to 1 GB depending on workload.
S3 charges per-request and has per-request latency. Files that are too small multiply these costs; files that are too large waste bandwidth when only a fraction of the data is needed. An explicit file sizing strategy optimizes the cost-performance tradeoff.
Common applications: compaction target sizing, streaming ingestion write batching, and partition file count management.
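The request side of this tradeoff can be made concrete by counting objects. The sketch below is a rough lower bound, not an S3 cost model: min_get_requests is a hypothetical helper that assumes at least one GET per file for a full scan, whereas real Parquet readers typically issue several ranged reads per file.

```python
MB = 1024 * 1024
GB = 1024 * MB
TB = 1024 * GB


def min_get_requests(dataset_bytes: int, file_bytes: int) -> int:
    """Lower bound on GET requests needed to scan a dataset once.

    Assumption (illustrative only): every file costs at least one GET,
    and the dataset is split into files of roughly equal size.
    """
    if dataset_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding.
    return -(-dataset_bytes // file_bytes)


# Scanning 1 TB stored at three different file sizes.
for file_size in (1 * MB, 256 * MB, 1 * GB):
    requests = min_get_requests(1 * TB, file_size)
    print(f"{file_size // MB:>5} MB files -> at least {requests:,} GETs")
```

Moving from 1 MB to 256 MB files cuts the minimum request count by a factor of 256. Moving from 256 MB to 1 GB cuts it by only another factor of 4, which is one reason the useful range flattens out toward the top.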
Recent developments
- 2026 reference targets for Parquet, by workload. Read-heavy tables (>100 GB): 256-512 MB. Mixed workloads: 128-256 MB. Write-heavy tables: 64-128 MB. Small tables: 32-64 MB. The guide's general rule is that files below 128 MB create excessive metadata overhead and files above 1 GB reduce pruning effectiveness. Per Medium: Ultimate Guide to File Sizing + Compression for Apache Iceberg on S3.
- Iceberg bin-pack uses 75% / 180% boundaries around the target. For a 512 MB target, a file is a compaction candidate if it is smaller than 384 MB or larger than about 922 MB. These ratios are the default heuristic of Iceberg's bin-pack planner; practitioners can tune the bounds rather than the target for fine-grained control. Per Dremio: Compaction in Apache Iceberg: Fine-Tuning Your Iceberg Table's Data Files.
- Bin-pack is "usually all you need" for append-heavy and CDC tables. Its only goal is to fix file size, so for the most common workloads (append-heavy, CDC, and general-purpose) it is both the default and a sufficient strategy. Sort and Z-order become worthwhile only when query patterns call for reorganizing the data layout. Per Cloudera: Optimization Strategies for Iceberg Tables.
- Compaction frequency: daily for read-heavy tables, every few hours for streaming tables. As a 2026 reference cadence, read-heavy tables run compaction once per day during off-peak hours, while streaming and CDC tables run it every few hours to keep small-file accumulation bounded. Per AWS Prescriptive Guidance: Maintaining tables by using compaction.
- Apache Amoro automates Iceberg compaction as a service. Amoro is the 2026 open-source pattern: instead of writing custom Spark DAGs for compaction, teams run Amoro alongside the catalog, where it watches tables and triggers compaction when configured thresholds are crossed. This closes the "you have to engineer your own compaction infrastructure" gap. Per Olake: How to Compact Apache Iceberg Tables: Small Files + Automation with Apache Amoro.
- 11 documented compaction optimizations: sort, Z-order, bin-pack, delete-file compaction, partial-progress commits, EC partitioning, and more. Practitioners now have a documented catalog of optimizations beyond the default bin-pack and can pick the strategy that fits each workload shape. Per DEV: 11 Compaction Optimizations for Iceberg Data Lakes.
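The 75% / 180% rule quoted above reduces to a few lines of arithmetic. The sketch below only reproduces the bounds from the ratios given in the text; it is not Iceberg's planner code, and candidate_bounds and is_compaction_candidate are hypothetical names chosen for illustration.

```python
MB = 1024 * 1024

# Ratios as quoted in the text; treated here as plain constants.
MIN_RATIO = 0.75
MAX_RATIO = 1.80


def candidate_bounds(target_bytes: int) -> tuple[int, int]:
    """Return the (lower, upper) size bounds around a target file size."""
    return int(target_bytes * MIN_RATIO), int(target_bytes * MAX_RATIO)


def is_compaction_candidate(file_bytes: int, target_bytes: int) -> bool:
    """A file is a candidate if it falls outside the bounds."""
    lower, upper = candidate_bounds(target_bytes)
    return file_bytes < lower or file_bytes > upper


target = 512 * MB
lower, upper = candidate_bounds(target)
print(f"bounds: {lower / MB:.1f} MB to {upper / MB:.1f} MB")

# Hypothetical file sizes in one partition, in MB.
file_sizes_mb = [40, 100, 383, 384, 500, 900, 1000]
candidates = [
    size for size in file_sizes_mb
    if is_compaction_candidate(size * MB, target)
]
print("candidates (MB):", candidates)
```

Files inside the band are left alone, which is what keeps repeated compaction runs from rewriting data that is already close enough to the target.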
Resources
Databricks guide to tuning target file sizes for Delta Lake tables, balancing scan efficiency against S3 request costs.
Iceberg write configuration reference including target-file-size-bytes for controlling Parquet file sizes on object storage.
AWS S3 performance optimization guide covering the relationship between object size, request rates, and throughput.