Compression Economics
The tradeoffs between storage cost savings from data compression and the CPU/memory overhead required to compress and decompress data at read and write time in S3-based data systems.
Summary
Compression economics determine which codec (Snappy, ZSTD, LZ4, Gzip) to use for Parquet files on S3. Higher compression ratios reduce S3 storage and egress costs but increase compute costs during queries. The optimal choice depends on the ratio of storage cost to compute cost in a given environment.
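The storage-versus-compute balance described above can be sketched as a small cost model. This is a minimal illustration, not a pricing tool: `CodecProfile`, `monthly_cost`, and `cheapest` are hypothetical names, and every ratio, CPU figure, and price below is a placeholder rather than a benchmark or a published rate. The model also ignores write-time compression cost and egress.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecProfile:
    """Hypothetical codec description; values must come from your own measurements."""
    name: str
    ratio: float                    # uncompressed bytes / compressed bytes
    decompress_cpu_s_per_gb: float  # CPU seconds per GB of uncompressed output


def monthly_cost(codec, raw_gb, scans_per_month,
                 storage_usd_per_gb_month, cpu_usd_per_hour):
    """Storage cost of the compressed data plus CPU cost of decompressing it."""
    storage = (raw_gb / codec.ratio) * storage_usd_per_gb_month
    cpu_hours = raw_gb * scans_per_month * codec.decompress_cpu_s_per_gb / 3600
    return storage + cpu_hours * cpu_usd_per_hour


def cheapest(codecs, **env):
    """Return the codec profile with the lowest modeled monthly cost."""
    return min(codecs, key=lambda c: monthly_cost(c, **env))


if __name__ == "__main__":
    # Placeholder profiles, NOT benchmark results.
    codecs = [
        CodecProfile("fast-low-ratio", ratio=2.0, decompress_cpu_s_per_gb=1.0),
        CodecProfile("slow-high-ratio", ratio=3.0, decompress_cpu_s_per_gb=4.0),
    ]
    for scans in (1, 100):
        best = cheapest(
            codecs,
            raw_gb=10_000,
            scans_per_month=scans,
            storage_usd_per_gb_month=0.02,  # placeholder price
            cpu_usd_per_hour=0.05,          # placeholder price
        )
        print(f"{scans} full scans/month -> {best.name}")
```

With these placeholder inputs the higher-ratio codec wins when the data is scanned rarely, and the cheaper-to-decompress codec wins when it is scanned often. The crossover point moves with the storage-to-compute price ratio.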
- Snappy is not always the best default. ZSTD typically achieves a 20-40% better compression ratio than Snappy with comparable decompression speed, at the price of more CPU at write time. For read-heavy workloads, ZSTD often dominates.
- Compression ratio varies dramatically by data type. High-cardinality string columns compress poorly; sorted numeric columns compress extremely well. Blanket compression settings miss optimization opportunities.
- Compression interacts with file sizing. A 128 MB target file size with high compression may contain far more rows than expected, which affects query parallelism and memory requirements during decompression.
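The last two points can be checked with a rough experiment. The sketch below uses the standard library's `zlib` (DEFLATE, the algorithm behind Gzip) as a stand-in codec, since Snappy and ZSTD need third-party packages. It compresses raw bytes directly, whereas Parquet applies its own column encodings before the codec, so real ratios will differ. The helper names and the synthetic columns are assumptions made for illustration.

```python
import random
import struct
import zlib

TARGET_FILE_BYTES = 128 * 1024 * 1024  # the 128 MB compressed file size target


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Uncompressed size divided by zlib-compressed size."""
    return len(raw) / len(zlib.compress(raw, level))


def sample_columns(n_rows: int = 50_000, seed: int = 0) -> dict:
    """Build two synthetic columns as raw bytes (illustrative data only)."""
    rng = random.Random(seed)
    # Sorted numeric column: consecutive 64-bit integers.
    sorted_int64 = b"".join(struct.pack("<q", i) for i in range(n_rows))
    # High-cardinality string column: random 32-character hex identifiers.
    random_hex_ids = "".join(
        f"{rng.getrandbits(128):032x}" for _ in range(n_rows)
    ).encode("ascii")
    return {"sorted_int64": sorted_int64, "random_hex_ids": random_hex_ids}


def rows_per_file(target_bytes: int, raw_bytes_per_row: float, ratio: float) -> int:
    """Rows that fit in a compressed file of target_bytes at the given ratio."""
    return int(target_bytes * ratio / raw_bytes_per_row)


if __name__ == "__main__":
    n_rows = 50_000
    for name, raw in sample_columns(n_rows).items():
        r = compression_ratio(raw)
        rows = rows_per_file(TARGET_FILE_BYTES, len(raw) / n_rows, r)
        print(f"{name}: ratio {r:.2f}, ~{rows:,} rows per 128 MB file")
```

Because the row count per file scales linearly with the compression ratio, a reader that decompresses a whole file must hold `ratio` times the file size in memory. This is why a fixed compressed-size target can produce very different row counts and memory footprints across columns.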
Connections
- scoped_to: S3, Apache Parquet (compression codec selection for S3-stored data)
- constrains: Cold Scan Latency (decompression CPU time adds to query latency)
- relates_to: File Sizing Strategy (compression ratio affects effective file size)
- constrains: Egress Cost (better compression reduces bytes transferred)
Definition
The cost-performance tradeoff of compressing data stored on S3 — balancing reduced storage costs and faster transfers against increased CPU usage for compression and decompression.
Resources
Parquet compression documentation covering codec options (Snappy, Zstd, LZ4) and their impact on storage cost versus CPU overhead trade-offs.
Delta Lake best practices covering compression codec selection and its effect on S3 storage costs and query performance.
Zstandard documentation for the compression algorithm increasingly adopted in lakehouse formats for its superior compression ratio and decompression speed.