File Sizing Strategy
The practice of deliberately targeting optimal data file sizes (typically 128 MB to 1 GB for Parquet on S3) to balance S3 request overhead, metadata volume, query parallelism, and write amplification.
Summary
File sizing is the tuning knob that connects ingestion throughput, query performance, and storage cost in S3-based lakehouses. Too-small files cause request amplification and metadata bloat; too-large files reduce parallelism and increase write amplification during compaction.
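The parallelism side of this tradeoff can be made concrete with back-of-the-envelope arithmetic (the dataset size, cluster size, and candidate file sizes below are illustrative assumptions, not recommendations):

```python
import math

def file_count(dataset_bytes: int, file_bytes: int) -> int:
    """Number of data files needed to hold the dataset at a given target size."""
    return math.ceil(dataset_bytes / file_bytes)

TB = 1 << 40
MB = 1 << 20

dataset = 1 * TB   # hypothetical 1 TB table
cores = 400        # hypothetical cluster parallelism

for size in (8 * MB, 128 * MB, 1024 * MB):
    n = file_count(dataset, size)
    # One file is roughly one read task: far more files than cores means
    # scheduling and request overhead, far fewer means idle cores.
    print(f"{size // MB:>5} MB files -> {n:>8} files ({n / cores:.1f} tasks per core)")
```

At 8 MB the scan generates hundreds of tasks per core (request amplification); at 1 GB it barely covers the cluster, so a single straggler file can dominate runtime.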
- There is no universal optimal file size. The right size depends on query patterns (point lookups favor smaller files; full scans favor larger), column count, and compression ratio.
- File sizing interacts with partition design. A partition with a 256 MB target size but only 10 MB of data per partition produces under-sized files regardless of configuration.
- Spark's `spark.sql.files.maxPartitionBytes` and Iceberg's `target-file-size-bytes` control different things. The former controls read-side split size; the latter controls write-side file size.
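The read/write distinction in the last bullet can be sketched in PySpark (a minimal config sketch: it assumes a running Spark session with an Iceberg catalog named `demo` and a placeholder table; Iceberg's full property key is `write.target-file-size-bytes`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read side: Spark packs input files into splits of at most this many bytes.
# It does NOT change the size of the files sitting on S3.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))

# Write side: Iceberg rolls over to a new Parquet file once the current one
# reaches roughly this size.
spark.sql("""
    ALTER TABLE demo.db.events
    SET TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')
""")
```

Tuning one without the other is a common mistake: raising the read split size cannot repair a table already written as thousands of tiny files.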
- solves → Small Files Problem — targets optimal file sizes to prevent small files
- relates_to → Compaction — compaction enforces target file sizes
- constrains → Read / Write Amplification — file size determines rewrite cost
- scoped_to → Table Formats, S3 — file sizing is a table-format-level configuration
Definition
The deliberate planning of target file sizes for data stored on S3, balancing between files too small (high API overhead) and too large (wasted reads for selective queries). Typical targets range from 128 MB to 1 GB depending on workload.
S3 charges per-request and has per-request latency. Files that are too small multiply these costs; files that are too large waste bandwidth when only a fraction of the data is needed. An explicit file sizing strategy optimizes the cost-performance tradeoff.
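A rough illustration of the request-amplification point (the GET price below is an assumed placeholder; check current S3 pricing, and note that real Parquet scans issue several ranged GETs per file, so this understates the gap):

```python
GET_PRICE_PER_1000 = 0.0004  # assumed USD per 1,000 GET requests (placeholder)

def full_scan_get_cost(dataset_bytes: int, file_bytes: int) -> float:
    """Approximate GET cost of one full scan, counting one request per file."""
    files = -(-dataset_bytes // file_bytes)  # ceiling division
    return files / 1000 * GET_PRICE_PER_1000

TB = 1 << 40
MB = 1 << 20

small = full_scan_get_cost(1 * TB, 1 * MB)    # 1,048,576 files/requests
large = full_scan_get_cost(1 * TB, 256 * MB)  # 4,096 files/requests
print(f"1 MB files:   ${small:.4f} per scan")
print(f"256 MB files: ${large:.6f} per scan")
```

The per-scan dollar amounts look small, but they recur on every query, and each extra request also adds first-byte latency.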
Used in compaction target sizing, streaming ingestion write batching, and partition file count management.
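For the compaction use case, the number of output files per partition follows directly from the target size (a minimal sketch; the function name is illustrative):

```python
import math

def planned_output_files(partition_bytes: int, target_bytes: int) -> int:
    """How many files a compaction pass should produce for one partition."""
    return max(1, math.ceil(partition_bytes / target_bytes))

MB = 1 << 20

# A 10 MB partition yields one under-sized file even with a 256 MB target,
# echoing the partition-design caveat above.
assert planned_output_files(10 * MB, 256 * MB) == 1
assert planned_output_files(900 * MB, 256 * MB) == 4
```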
Resources
Databricks guide to tuning target file sizes for Delta Lake tables, balancing scan efficiency against S3 request costs.
Iceberg write configuration reference including target-file-size-bytes for controlling Parquet file sizes on object storage.
AWS S3 performance optimization guide covering the relationship between object size, request rates, and throughput.