Pain Point

Read / Write Amplification

The ratio between the logical data volume involved in an operation and the actual bytes read from or written to S3, arising from immutable file formats, copy-on-write semantics, and metadata overhead inherent in S3-based table formats.


Summary

What it is

The ratio of the bytes actually read from or written to S3 to the logical data volume an operation involves. It arises from the immutable file formats, copy-on-write semantics, and metadata overhead inherent in S3-based table formats.

Where it fits

Read/write amplification quantifies the hidden I/O cost of operations on S3-based lakehouses. A single row update in Iceberg's copy-on-write mode rewrites an entire data file (write amplification); a query that needs 100 rows may read entire Parquet row groups (read amplification). Both inflate S3 costs and latency.
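
As a rough illustration of the two cases above, the sketch below computes both ratios as physical bytes divided by logical bytes. The row size, data file size, and row group size are hypothetical values chosen for the arithmetic, and the helper name `amplification` is not from any library.

```python
def amplification(physical_bytes: int, logical_bytes: int) -> float:
    """Physical bytes moved to or from S3 per logical byte the operation needed."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return physical_bytes / logical_bytes


MIB = 1024 * 1024

# Hypothetical sizes, chosen only for illustration.
ROW_BYTES = 200
DATA_FILE_BYTES = 128 * MIB
ROW_GROUP_BYTES = 64 * MIB

# Copy-on-write update of one row: the whole data file is rewritten.
write_amp = amplification(DATA_FILE_BYTES, ROW_BYTES)

# Query that needs 100 rows, all in one row group that is read in full.
read_amp = amplification(ROW_GROUP_BYTES, 100 * ROW_BYTES)

print(f"write amplification: {write_amp:,.0f}x")
print(f"read amplification:  {read_amp:,.0f}x")
```

The write figure counts only the rewritten file; a real copy-on-write update also reads the old file and writes new metadata, so the true physical I/O would be somewhat higher.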

Misconceptions / Traps
  • Merge-on-read (Iceberg, Hudi MOR) reduces write amplification by deferring rewrites but increases read amplification, because delete files (Iceberg) or log files (Hudi) must be merged at query time. The tradeoff shifts cost from writers to readers.
  • Parquet's columnar format reduces read amplification for column-selective queries but not for row-selective queries. Without page-level indexes, reading one row still requires reading the containing row group in full for every requested column.
  • Compaction reduces read amplification (fewer files to scan) but temporarily increases write amplification (rewriting files). The net effect depends on the read/write ratio of the workload.
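
The tradeoffs in the list above can be sketched with a toy byte-count model. Everything here is an assumption made for illustration: the `Workload` fields, the one-delta-file-per-update behavior, and the sizes are hypothetical, and the model ignores request counts, metadata files, and the bytes a writer or compactor reads.

```python
from dataclasses import dataclass

KIB = 1024
MIB = 1024 * KIB


@dataclass(frozen=True)
class Workload:
    updates: int      # single-row updates against one data file
    queries: int      # full scans of that file, run after all updates
    file_bytes: int   # size of the base data file
    delta_bytes: int  # size of one delete/log file written per update


def copy_on_write(w: Workload) -> tuple[int, int]:
    """Return (bytes written, bytes read): every update rewrites the file."""
    written = w.updates * w.file_bytes
    read = w.queries * w.file_bytes
    return written, read


def merge_on_read(w: Workload, compact_before_queries: bool) -> tuple[int, int]:
    """Return (bytes written, bytes read): updates write small delta files."""
    pending = w.updates * w.delta_bytes
    written = pending
    if compact_before_queries:
        written += w.file_bytes  # one rewrite folds all deltas into the base file
        pending = 0
    # Each query reads the base file plus any deltas still waiting to be merged.
    read = w.queries * (w.file_bytes + pending)
    return written, read


w = Workload(updates=100, queries=10, file_bytes=128 * MIB, delta_bytes=4 * KIB)
cow = copy_on_write(w)
mor = merge_on_read(w, compact_before_queries=False)
mor_compacted = merge_on_read(w, compact_before_queries=True)

for name, (written, read) in [
    ("copy-on-write", cow),
    ("merge-on-read", mor),
    ("merge-on-read + compaction", mor_compacted),
]:
    print(f"{name:28s} written={written / MIB:10.1f} MiB  read={read / MIB:10.1f} MiB")
```

In this model, merge-on-read writes the least and reads the most, and compaction pays one extra file rewrite to bring reads back to the copy-on-write level. Changing `updates` and `queries` shows how the balance moves with the read/write ratio of the workload.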

Key Connections
  • scoped_to Table Formats, S3 — I/O amplification in S3-based tables
  • amplifies Request Pricing Models — amplified I/O means amplified request costs
  • constrains Cold Scan Latency — read amplification increases scan time
  • relates_to Compaction — compaction trades write amplification for reduced read amplification

Definition

What it is

The ratio of the actual bytes read from or written to S3 to the logical bytes the operation needs. Copy-on-write table formats and compaction strategies can amplify physical I/O well beyond the logical change size.
