Deletion Vector
A metadata pattern that tracks which rows in a data file have been logically deleted or updated, using a compact bitmap instead of rewriting the entire file.
Summary
Deletion vectors are the key mechanism that makes merge-on-read (MoR) practical for lakehouse formats on S3. Instead of the expensive copy-on-write approach (for example, rewriting a 128 MB Parquet file to delete one row), the writer adds a small deletion vector file that marks the invalidated rows. Query engines skip those rows at read time, and periodic compaction rewrites the data files to physically remove them.
- Deletion vectors improve write performance at the cost of read performance. Queries must load and apply the deletion vector for each affected data file, which adds read overhead until compaction runs.
- Not all engines support deletion vectors equally. Check your query engine's support before depending on this pattern for high-throughput reads.
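The write, read, and compaction steps described in this summary can be sketched in a few lines of Python. This is a toy in-memory model, not how Iceberg or Delta Lake implement the pattern: the `DataFile` class, its `deleted` set, and the helper function names are hypothetical, and a real table format would store the vector as a separate file next to immutable Parquet data.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Toy stand-in for an immutable data file plus its deletion vector."""
    rows: list                                  # row payloads, never edited in place
    deleted: set = field(default_factory=set)   # positions marked as deleted


def delete_where(data_file, predicate):
    """Merge-on-read delete: record matching positions, leave rows untouched."""
    for pos, row in enumerate(data_file.rows):
        if predicate(row):
            data_file.deleted.add(pos)


def scan(data_file):
    """Read path: yield only rows whose position is not in the deletion vector."""
    for pos, row in enumerate(data_file.rows):
        if pos not in data_file.deleted:
            yield row


def compact(data_file):
    """Compaction: build a new file without the deleted rows and with an empty vector."""
    return DataFile(rows=list(scan(data_file)))


f = DataFile(rows=[{"id": 1}, {"id": 2}, {"id": 3}])
delete_where(f, lambda r: r["id"] == 2)   # cheap write: one position recorded
live = list(scan(f))                      # read pays the filtering cost
compacted = compact(f)                    # later rewrite removes the row physically
```

In this model the delete touches only the small `deleted` set, every scan pays a per-row membership check, and compaction is the only step that rewrites row data.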
- enables: Apache Iceberg, Delta Lake (efficient row-level operations)
- solves: Small Files Problem (reduces write amplification)
- scoped_to: Table Formats, S3
Definition
A metadata pattern used by lakehouse table formats to track which rows in a data file have been deleted or updated, without rewriting the entire data file. Instead of copy-on-write, a compact bitmap or vector records the positions of invalidated rows.
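To make "compact bitmap" concrete, here is a minimal sketch of a position bitmap that uses one bit per row. It is an illustration only: the `PositionBitmap` class is hypothetical, and production formats generally use compressed bitmaps (for example Roaring bitmaps) rather than a flat bit array.

```python
class PositionBitmap:
    """Flat bitmap: bit i is 1 when row position i has been deleted."""

    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.bits = bytearray((num_rows + 7) // 8)   # one bit per row, rounded up

    def mark_deleted(self, pos):
        if not 0 <= pos < self.num_rows:
            raise IndexError(pos)
        self.bits[pos // 8] |= 1 << (pos % 8)

    def is_deleted(self, pos):
        return bool(self.bits[pos // 8] & (1 << (pos % 8)))

    def deleted_count(self):
        return sum(bin(byte).count("1") for byte in self.bits)


bitmap = PositionBitmap(num_rows=1_000_000)
bitmap.mark_deleted(42)
bitmap.mark_deleted(999_999)
size_bytes = len(bitmap.bits)   # storage cost is independent of row width
```

Even this uncompressed form needs only 125,000 bytes to cover a million row positions, whatever the size of the rows themselves; compressed bitmaps shrink sparse vectors much further.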
Copy-on-write (CoW) updates in lakehouse formats require rewriting entire Parquet files to delete or update a single row, causing massive write amplification. Deletion vectors enable merge-on-read (MoR) by recording row-level deletions in lightweight metadata files, dramatically reducing write costs for high-frequency update workloads.
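A rough back-of-the-envelope calculation shows the scale of the difference. The numbers below are illustrative assumptions rather than measurements: a 128 MB data file, 100-byte rows, and a deletion vector file of about 50 bytes for a single deleted position.

```python
MB = 1024 * 1024


def write_amplification(bytes_written, bytes_logically_changed):
    """Ratio of physical bytes written to the bytes the change actually touched."""
    return bytes_written / bytes_logically_changed


file_size = 128 * MB   # assumed data file size
row_size = 100         # assumed average row size in bytes
dv_size = 50           # assumed size of a one-entry deletion vector file

# Copy-on-write: deleting one row rewrites the rest of the file.
cow_bytes = file_size - row_size
# Merge-on-read: only the small deletion vector is written.
mor_bytes = dv_size

cow_amp = write_amplification(cow_bytes, row_size)
mor_amp = write_amplification(mor_bytes, row_size)
```

Under these assumptions copy-on-write writes more than a million times the size of the row it removes, while the deletion vector write is smaller than the row itself. Real figures depend on file size, compression, and how the format encodes its vectors.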
Typical use cases: efficient row-level deletes and updates in Iceberg and Delta Lake, high-frequency CDC ingestion with low write amplification, and streaming update workloads on S3.
Resources
Databricks documentation explaining deletion vector mechanics, performance benefits, and configuration for Delta Lake.
Comparison of delete strategies across Iceberg, Delta, and Hudi including deletion vector approaches and their performance trade-offs.