Architecture

Deletion Vector

A metadata pattern that tracks which rows in a data file have been logically deleted or updated, using a compact bitmap instead of rewriting the entire file.

Summary

Where it fits

Deletion vectors are the key mechanism that makes merge-on-read (MoR) practical for lakehouse formats on S3. Instead of the expensive copy-on-write approach (rewriting a 128 MB Parquet file to delete one row), a tiny deletion-vector file marks the invalidated rows. Query engines skip those rows at read time, and periodic compaction reconciles the deletes.
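The read path and compaction step described above can be sketched in a few lines. This is a minimal illustration, not any engine's real API: a plain Python set stands in for the compressed bitmap, and all names here are invented.

```python
# Minimal merge-on-read sketch. Real table formats store deleted row
# positions in a compressed (roaring) bitmap; a plain set stands in here.

data_file = ["row0", "row1", "row2", "row3"]  # stands in for a Parquet file

# Deleting rows 1 and 3 writes only this tiny vector, not a new data file.
deletion_vector = {1, 3}  # positions of logically deleted rows

def scan(rows, dv):
    """Read path: the engine skips any row whose position is in the vector."""
    return [row for pos, row in enumerate(rows) if pos not in dv]

def compact(rows, dv):
    """Periodic compaction: rewrite the file once, then drop the vector."""
    return scan(rows, dv), set()

print(scan(data_file, deletion_vector))  # ['row0', 'row2']
new_file, new_dv = compact(data_file, deletion_vector)
print(new_file, new_dv)                  # ['row0', 'row2'] set()
```

Until `compact` runs, every scan pays the membership check against the vector, which is the read-time overhead noted under Misconceptions.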

Misconceptions / Traps
  • Deletion vectors improve write performance at the cost of read performance. Queries must check deletion vectors for every data file, adding overhead until compaction runs.
  • Not all engines support deletion vectors equally. Check your query engine's support before depending on this pattern for high-throughput reads.
Key Connections
  • enables Apache Iceberg, Delta Lake — efficient row-level operations
  • solves Small Files Problem — reduces write amplification
  • scoped_to Table Formats, S3

Definition

What it is

A metadata pattern used by lakehouse table formats to track which rows in a data file have been deleted or updated, without rewriting the entire data file. Instead of copy-on-write, a compact bitmap or vector records the positions of invalidated rows.
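To make the "compact bitmap" idea concrete, here is a naive one-bit-per-row-position encoding. Production formats (Iceberg, Delta Lake) use compressed roaring bitmaps rather than a raw byte array, and the helper names below are illustrative:

```python
# Positional bitmap sketch: one bit per row, where rows are addressed by
# their ordinal position in the data file. A set bit means "deleted".

def to_bitmap(deleted_positions, num_rows):
    """Set one bit for each deleted row position."""
    bits = bytearray((num_rows + 7) // 8)
    for pos in deleted_positions:
        bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)

def is_deleted(bitmap, pos):
    """Read path: test a single row's bit."""
    return bool(bitmap[pos // 8] & (1 << (pos % 8)))

bm = to_bitmap({1, 3}, num_rows=8)
print(is_deleted(bm, 3), is_deleted(bm, 0))  # True False
print(len(to_bitmap(set(), 1024)))           # 128 -- bytes covering 1024 rows
```

Even uncompressed, the vector is orders of magnitude smaller than the data file it annotates, which is what makes writing it cheap.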

Why it exists

Copy-on-write (CoW) updates in lakehouse formats require rewriting entire Parquet files to delete or update a single row, causing massive write amplification. Deletion vectors enable merge-on-read (MoR) by recording row-level deletions in lightweight metadata files, dramatically reducing write costs for high-frequency update workloads.
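The write-amplification gap can be put into back-of-envelope numbers. The 128 MB file size comes from the text; the 32-byte vector size is an illustrative assumption, and both function names are hypothetical:

```python
# Bytes written to delete a single row under each strategy.

def cow_delete(file_size_bytes):
    """Copy-on-write: the entire data file is rewritten minus one row."""
    return file_size_bytes  # bytes written

def mor_delete(dv_size_bytes=32):
    """Merge-on-read: only a small deletion-vector file is written."""
    return dv_size_bytes  # bytes written (size assumed for illustration)

amplification = cow_delete(128 * 1024**2) // mor_delete()
print(amplification)  # 4194304 -- CoW writes ~4 million x more bytes
```

The exact ratio depends on file and vector sizes, but the point stands for any realistic values: MoR turns a file-sized write into a vector-sized one.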

Primary use cases

Efficient row-level deletes and updates in Iceberg and Delta Lake, high-frequency CDC ingestion with low write amplification, streaming update workloads on S3.
