Pain Point

Compression Economics

The tradeoffs between storage cost savings from data compression and the CPU/memory overhead required to compress and decompress data at read and write time in S3-based data systems.

Summary

Where it fits

Compression economics determine which codec (Snappy, ZSTD, LZ4, Gzip) to use for Parquet files on S3. Higher compression ratios reduce S3 storage and egress costs but increase compute costs during queries. The optimal choice depends on the ratio of storage cost to compute cost in a given environment, and on how often the data is scanned: read-heavy data pays the decompression tax on every query.
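The ratio-versus-speed tradeoff can be sketched with stdlib codecs. Snappy and ZSTD are not in the Python standard library, so this sketch uses zlib at level 1 as a stand-in for a fast codec and lzma as a stand-in for a high-ratio one; the log-line sample data is made up for illustration.

```python
import lzma
import time
import zlib

# Hypothetical sample: semi-repetitive log lines, the kind of data that
# often lands in Parquet files on S3.
raw = "\n".join(
    f"2024-01-{i % 30 + 1:02d} INFO request_id={i} user=user{i % 100} "
    f"status=200 latency_ms={i % 250}"
    for i in range(5000)
).encode()

for name, compress in [
    ("fast (zlib level 1)", lambda d: zlib.compress(d, 1)),
    ("high-ratio (lzma)", lzma.compress),
]:
    start = time.perf_counter()
    out = compress(raw)
    elapsed = time.perf_counter() - start
    # Smaller ratio = cheaper storage/egress; longer time = more query CPU.
    print(f"{name}: {len(out) / len(raw):.1%} of original, {elapsed * 1000:.1f} ms")
```

On data like this the high-ratio codec produces a noticeably smaller output but spends more CPU time, which is exactly the tradeoff the storage-to-compute cost ratio has to arbitrate.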

Misconceptions / Traps
  • Snappy is not always the best default. ZSTD provides 20-40% better compression than Snappy with comparable decompression speed. For read-heavy workloads, ZSTD often dominates.
  • Compression ratio varies dramatically by data type. High-cardinality string columns compress poorly; sorted numeric columns compress extremely well. Blanket compression settings miss optimization opportunities.
  • Compression interacts with file sizing. A 128 MB target file size with high compression may contain far more rows than expected, which affects query parallelism and memory requirements during decompression.
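The second trap above, that ratio depends heavily on data shape, is easy to demonstrate. A minimal sketch, using zlib as a stand-in for a Parquet codec; the column contents and sizes are invented for illustration:

```python
import random
import struct
import zlib

random.seed(0)

# A sorted numeric column (think timestamps or auto-increment IDs)
# versus a high-cardinality string column (random hex tokens).
sorted_ints = b"".join(struct.pack("<q", i) for i in range(20000))
random_strs = "\n".join(
    "".join(random.choices("0123456789abcdef", k=16)) for _ in range(10000)
).encode()

for name, col in [("sorted int64", sorted_ints), ("random hex strings", random_strs)]:
    ratio = len(zlib.compress(col)) / len(col)
    print(f"{name}: compresses to {ratio:.1%} of original")
```

The sorted numeric column compresses far better than the random strings, which is why per-column codec or encoding choices can beat a blanket setting.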
Key Connections
  • scoped_to S3, Apache Parquet — compression codec selection for S3-stored data
  • constrains Cold Scan Latency — decompression CPU time adds to query latency
  • relates_to File Sizing Strategy — compression ratio affects effective file size
  • constrains Egress Cost — better compression reduces bytes transferred

Definition

What it is

The cost-performance tradeoff of compressing data stored on S3 — balancing reduced storage costs and faster transfers against increased CPU usage for compression and decompression.
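This tradeoff can be reduced to a back-of-the-envelope break-even model: storage dollars saved versus decompression CPU dollars added. All prices and workload numbers below are illustrative assumptions, not quoted rates, and the model deliberately ignores egress, request costs, and write-time compression CPU.

```python
def monthly_net_savings(
    raw_gb: float,
    compression_ratio: float,   # compressed size / raw size
    storage_price: float,       # $ per GB-month (assumed)
    scans_per_month: float,     # how often the data set is fully read
    cpu_hours_per_scan: float,  # extra decompression CPU per full scan (assumed)
    cpu_price: float,           # $ per vCPU-hour (assumed)
) -> float:
    """Net $ saved per month by compressing: storage saved minus compute added."""
    storage_saved = raw_gb * (1 - compression_ratio) * storage_price
    compute_added = scans_per_month * cpu_hours_per_scan * cpu_price
    return storage_saved - compute_added

# Cold archive scanned once a month: compression wins easily.
print(monthly_net_savings(1000, 0.3, 0.023, 1, 2, 0.04))
# Hot data scanned hourly: decompression CPU can swamp the storage savings.
print(monthly_net_savings(1000, 0.3, 0.023, 720, 2, 0.04))
```

The same codec choice flips from clearly profitable to clearly unprofitable as scan frequency rises, which is the core of compression economics.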
