Architecture

File Sizing Strategy

The practice of deliberately targeting optimal data file sizes (typically 128 MB to 1 GB for Parquet on S3) to balance S3 request overhead, metadata volume, query parallelism, and write amplification.


Summary


Where it fits

File sizing is the tuning knob that connects ingestion throughput, query performance, and storage cost in S3-based lakehouses. Too-small files cause request amplification and metadata bloat; too-large files reduce parallelism and increase write amplification during compaction.
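The tradeoff above can be made concrete with a little arithmetic: for a fixed dataset, the target file size directly determines how many files (and therefore how many request/split units) a full scan touches. A minimal sketch, with illustrative numbers rather than benchmarks:

```python
# Sketch: how target file size trades off file count (request overhead,
# metadata entries) against available scan parallelism.

def file_count(dataset_bytes: int, target_file_bytes: int) -> int:
    """Files needed to hold the dataset at the target size (ceiling division)."""
    return -(-dataset_bytes // target_file_bytes)

MB = 1024**2
TB = 1024**2 * MB

# A 1 TB table at three candidate target sizes. Each file is roughly one
# unit of scan parallelism and at least one S3 GET plus one metadata entry.
for target in (16 * MB, 256 * MB, 1024 * MB):
    n = file_count(TB, target)
    print(f"{target // MB:>5} MB target -> {n:>6} files")
```

At 16 MB the scan can be very parallel but pays 65,536 requests and metadata entries; at 1 GB it pays only 1,024 but caps parallelism at 1,024 tasks.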

Misconceptions / Traps
  • There is no universal optimal file size. The right size depends on query patterns (point lookups favor smaller files; full scans favor larger), column count, and compression ratio.
  • File sizing interacts with partition design. A table configured with a 256 MB target file size but holding only 10 MB of data per partition produces undersized files regardless of configuration.
  • Spark's spark.sql.files.maxPartitionBytes and Iceberg's write.target-file-size-bytes control different things. The former controls read-side split size; the latter controls write-side file size.
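The read-side/write-side distinction in the last bullet looks like this in practice. A sketch assuming a Spark session with an Iceberg catalog configured; the table name and byte values are illustrative:

```python
# Sketch: the two knobs the bullet above distinguishes. Assumes a Spark +
# Iceberg deployment; `db.events` is a hypothetical table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # READ side: maximum bytes packed into one input split when scanning.
    # Affects task count for reads; does not change the files on S3.
    .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)
    .getOrCreate()
)

# WRITE side: Iceberg's per-table target for newly written data files
# (here 512 MB). Affects the files on S3; does not change read splits.
spark.sql("""
    ALTER TABLE db.events
    SET TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')
""")
```

Tuning one knob while expecting the other's effect is a common source of confusion: raising maxPartitionBytes will never fix a small-files problem on disk.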
Key Connections
  • solves Small Files Problem — targets optimal file sizes to prevent small files
  • relates_to Compaction — compaction enforces target file sizes
  • constrains Read / Write Amplification — file size determines rewrite cost
  • scoped_to Table Formats, S3 — file sizing is a table-format-level configuration

Definition

What it is

The deliberate planning of target file sizes for data stored on S3, balancing between files that are too small (high API overhead) and files that are too large (wasted reads for selective queries). Typical targets range from 128 MB to 1 GB depending on workload.

Why it exists

S3 charges per request and adds per-request latency. Files that are too small multiply these costs; files that are too large waste bandwidth when only a fraction of the data is needed. An explicit file sizing strategy optimizes the cost-performance tradeoff.
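The per-request cost multiplication is easy to quantify. A sketch assuming one GET per file for a full scan and an assumed S3 standard-tier price of $0.0004 per 1,000 GET requests (check current pricing; real scans may also issue multiple range GETs per file):

```python
# Sketch: request-cost amplification from small files. The GET price is an
# assumption, not current S3 pricing.

GET_PRICE_PER_1000 = 0.0004
MB = 1024**2
TB = 1024**2 * MB

def scan_request_cost(dataset_bytes: int, file_bytes: int) -> float:
    """Cost of one GET per file for a full scan of the dataset."""
    files = -(-dataset_bytes // file_bytes)  # ceiling division
    return files / 1000 * GET_PRICE_PER_1000

# Scanning 1 TB stored as 1 MB files vs 256 MB files.
small = scan_request_cost(TB, 1 * MB)    # 1,048,576 GET requests
large = scan_request_cost(TB, 256 * MB)  #     4,096 GET requests
print(f"1 MB files: ${small:.4f}   256 MB files: ${large:.6f}")
```

The 256x difference in file count translates directly into a 256x difference in request cost, before counting the latency of each request.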

Primary use cases

Compaction target sizing, streaming ingestion write batching, partition file count management.
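The compaction use case reduces to a bin-packing problem: group existing small files into output files near the target size. A minimal first-fit-decreasing sketch; a real compactor (e.g. Iceberg's rewrite_data_files) also respects partition boundaries and delete files:

```python
# Sketch: first-fit-decreasing grouping of small files into compaction bins
# near a target size. Illustrative only; ignores partitions and deletes.

def plan_compaction(file_sizes, target):
    """Group input file sizes into bins whose totals stay under `target`."""
    bins = []
    for size in sorted(file_sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= target:  # first bin with room wins
                b.append(size)
                break
        else:
            bins.append([size])          # no bin fits; open a new one
    return bins

MB = 1024**2
sizes = [10 * MB] * 20 + [100 * MB] * 3  # 23 small files, 500 MB total
plan = plan_compaction(sizes, 256 * MB)
print(len(plan), "output files from", len(sizes), "inputs")
```

Here 23 undersized files collapse into 2 outputs near the 256 MB target, cutting per-file request and metadata overhead by an order of magnitude.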
