Pain Point

Compression Economics

The tradeoffs between storage cost savings from data compression and the CPU/memory overhead required to compress and decompress data at read and write time in S3-based data systems.

Summary

Where it fits

Compression economics determine which codec (Snappy, ZSTD, LZ4, Gzip) to use for Parquet files on S3. Higher compression ratios reduce S3 storage and egress costs but increase compute costs during queries. The optimal choice depends on the ratio of storage cost to compute cost in a given environment, and on how often the data is scanned: read-heavy data pays the decompression tax on every query.
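The ratio-versus-speed tradeoff can be sketched with stdlib codecs. Snappy and ZSTD are not in the Python standard library, so this sketch uses zlib at level 1 as a stand-in for a fast codec and lzma as a stand-in for a high-ratio one; the log-line sample data is made up for illustration.

```python
import lzma
import time
import zlib

# Hypothetical sample: semi-repetitive log lines, the kind of data that
# often lands in Parquet files on S3.
raw = "\n".join(
    f"2024-01-{i % 30 + 1:02d} INFO request_id={i} user=user{i % 100} "
    f"status=200 latency_ms={i % 250}"
    for i in range(5000)
).encode()

for name, compress in [
    ("fast (zlib level 1)", lambda d: zlib.compress(d, 1)),
    ("high-ratio (lzma)", lzma.compress),
]:
    start = time.perf_counter()
    out = compress(raw)
    elapsed = time.perf_counter() - start
    # Smaller ratio = cheaper storage/egress; longer time = more query CPU.
    print(f"{name}: {len(out) / len(raw):.1%} of original, {elapsed * 1000:.1f} ms")
```

On data like this the high-ratio codec produces a noticeably smaller output but spends more CPU time, which is exactly the tradeoff the storage-to-compute cost ratio has to arbitrate.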

Misconceptions / Traps
  • Snappy is not always the best default. ZSTD provides 20-40% better compression than Snappy with comparable decompression speed. For read-heavy workloads, ZSTD often dominates.
  • Compression ratio varies dramatically by data type. High-cardinality string columns compress poorly; sorted numeric columns compress extremely well. Blanket compression settings miss optimization opportunities.
  • Compression interacts with file sizing. A 128 MB target file size with high compression may contain far more rows than expected, which affects query parallelism and memory requirements during decompression.
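The second trap above, that ratio depends heavily on data shape, is easy to demonstrate. A minimal sketch, using zlib as a stand-in for a Parquet codec; the column contents and sizes are invented for illustration:

```python
import random
import struct
import zlib

random.seed(0)

# A sorted numeric column (think timestamps or auto-increment IDs)
# versus a high-cardinality string column (random hex tokens).
sorted_ints = b"".join(struct.pack("<q", i) for i in range(20000))
random_strs = "\n".join(
    "".join(random.choices("0123456789abcdef", k=16)) for _ in range(10000)
).encode()

for name, col in [("sorted int64", sorted_ints), ("random hex strings", random_strs)]:
    ratio = len(zlib.compress(col)) / len(col)
    print(f"{name}: compresses to {ratio:.1%} of original")
```

The sorted numeric column compresses far better than the random strings, which is why per-column codec or encoding choices can beat a blanket setting.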
Key Connections
  • scoped_to S3, Apache Parquet — compression codec selection for S3-stored data
  • constrains Cold Scan Latency — decompression CPU time adds to query latency
  • relates_to File Sizing Strategy — compression ratio affects effective file size
  • constrains Egress Cost — better compression reduces bytes transferred

Definition

What it is

The cost-performance tradeoff of compressing data stored on S3 — balancing reduced storage costs and faster transfers against increased CPU usage for compression and decompression.
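This tradeoff can be reduced to a back-of-the-envelope break-even model: storage dollars saved versus decompression CPU dollars added. All prices and workload numbers below are illustrative assumptions, not quoted rates, and the model deliberately ignores egress, request costs, and write-time compression CPU.

```python
def monthly_net_savings(
    raw_gb: float,
    compression_ratio: float,   # compressed size / raw size
    storage_price: float,       # $ per GB-month (assumed)
    scans_per_month: float,     # how often the data set is fully read
    cpu_hours_per_scan: float,  # extra decompression CPU per full scan (assumed)
    cpu_price: float,           # $ per vCPU-hour (assumed)
) -> float:
    """Net $ saved per month by compressing: storage saved minus compute added."""
    storage_saved = raw_gb * (1 - compression_ratio) * storage_price
    compute_added = scans_per_month * cpu_hours_per_scan * cpu_price
    return storage_saved - compute_added

# Cold archive scanned once a month: compression wins easily.
print(monthly_net_savings(1000, 0.3, 0.023, 1, 2, 0.04))
# Hot data scanned hourly: decompression CPU can swamp the storage savings.
print(monthly_net_savings(1000, 0.3, 0.023, 720, 2, 0.04))
```

The same codec choice flips from clearly profitable to clearly unprofitable as scan frequency rises, which is the core of compression economics.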
