Pain Point

Cold Scan Latency

Summary

What it is

Slow first-query performance against S3-stored data, caused by object discovery, metadata fetching, and data transfer over HTTP.

Where it fits

Cold scan latency is the fundamental performance trade-off of the separation of storage and compute pattern. Every query against S3 starts with network overhead that does not exist when querying local disk.

Misconceptions / Traps

Cold scan latency does not mean S3 is slow. S3 sustains high aggregate throughput, but each request pays roughly 50-100 ms before the first byte arrives. A query that touches thousands of files pays that cost thousands of times unless the requests are issued in parallel.
Caching speeds up repeat queries but does nothing for the first one. Mitigating a true cold scan requires metadata-driven pruning (as provided by table formats) and intelligent prefetching.
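
To make the "this adds up" point concrete, here is a rough back-of-the-envelope sketch. It assumes requests are issued in waves of a fixed size and that each wave costs one first-byte latency; the function name `first_byte_overhead_s`, the 2,000-file count, and the concurrency of 64 are illustrative assumptions, and 75 ms is simply the midpoint of the range quoted above.

```python
import math


def first_byte_overhead_s(num_requests: int, latency_ms: float, concurrency: int = 1) -> float:
    """Estimate seconds spent waiting on first-byte latency alone.

    Simplifying assumption: requests go out in waves of `concurrency`,
    and every wave costs exactly one `latency_ms`. Transfer time is ignored.
    """
    if num_requests < 0 or latency_ms < 0 or concurrency < 1:
        raise ValueError("num_requests and latency_ms must be >= 0, concurrency >= 1")
    waves = math.ceil(num_requests / concurrency)
    return waves * latency_ms / 1000.0


# Hypothetical query touching 2,000 files at 75 ms per request.
sequential_s = first_byte_overhead_s(2_000, 75)                 # one request at a time
parallel_s = first_byte_overhead_s(2_000, 75, concurrency=64)   # 64 requests in flight

print(f"sequential: {sequential_s:.1f} s, parallel: {parallel_s:.1f} s")
```

Under these assumptions the sequential case spends minutes on latency alone, while the parallel case spends a few seconds, which is why the per-request cost matters far more than raw throughput for queries over many small files.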

Key Connections

Apache Parquet solves Cold Scan Latency — columnar layout enables predicate pushdown
Lakehouse Architecture, Hybrid S3 + Vector Index solves Cold Scan Latency — metadata-driven access
Separation of Storage and Compute constrained_by Cold Scan Latency — inherent trade-off
StarRocks constrained_by Cold Scan Latency — first-query limited by S3 access
scoped_to S3, Object Storage
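
The "metadata-driven access" connection can be sketched as a simple file-pruning step. The example assumes a manifest that records per-file minimum and maximum values for one column, which is the kind of statistic Parquet footers and table-format metadata keep; `FileStats`, `prune_files`, and the file names are hypothetical and do not correspond to any particular engine's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical manifest entry: min/max of one column for one data file."""
    key: str
    min_value: int
    max_value: int


def prune_files(files: List[FileStats], lower: int, upper: int) -> List[FileStats]:
    """Keep only files whose [min, max] range overlaps the filter [lower, upper].

    Files that cannot contain matching rows are skipped without any S3 request.
    """
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


manifest = [
    FileStats("part-000.parquet", 1, 100),
    FileStats("part-001.parquet", 101, 200),
    FileStats("part-002.parquet", 201, 300),
    FileStats("part-003.parquet", 301, 400),
]

# Filter equivalent to: WHERE col BETWEEN 150 AND 220
selected = prune_files(manifest, 150, 220)
print([f.key for f in selected])
```

Only two of the four files overlap the filter range, so the cold scan issues half as many object requests. Pruning reduces the number of requests rather than the latency of each one, which is why it helps even on the very first query.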

Definition

What it is

The delay experienced on the first query against S3-stored data, caused by object discovery (listing), metadata fetching, and data transfer over the network.
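
The three phases named in this definition can be laid out as a rough cost model. This is a sketch under stated assumptions, not a measurement: listing is treated as sequential pages of up to 1,000 keys (the ListObjectsV2 page limit), metadata fetching as one request per object issued in waves, and transfer as bytes divided by a flat throughput. The function `cold_scan_estimate_s` and all input numbers are hypothetical.

```python
import math

LIST_PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per response


def cold_scan_estimate_s(
    num_objects: int,
    bytes_to_read: int,
    request_latency_ms: float,
    throughput_mb_s: float,
    concurrency: int,
) -> dict:
    """Split an estimated cold scan into listing, metadata, and transfer seconds."""
    if concurrency < 1 or throughput_mb_s <= 0:
        raise ValueError("concurrency must be >= 1 and throughput_mb_s > 0")
    latency_s = request_latency_ms / 1000.0

    # Object discovery: pages are assumed sequential, because each page
    # needs the continuation token from the previous one.
    listing = math.ceil(num_objects / LIST_PAGE_SIZE) * latency_s

    # Metadata fetching: assumed one request per object, issued in waves.
    metadata = math.ceil(num_objects / concurrency) * latency_s

    # Data transfer: flat throughput, with 1 MB taken as 1,000,000 bytes.
    transfer = bytes_to_read / (throughput_mb_s * 1_000_000)

    return {
        "listing": listing,
        "metadata": metadata,
        "transfer": transfer,
        "total": listing + metadata + transfer,
    }


estimate = cold_scan_estimate_s(
    num_objects=5_000,
    bytes_to_read=2_000_000_000,
    request_latency_ms=75,
    throughput_mb_s=500,
    concurrency=50,
)
for phase, seconds in estimate.items():
    print(f"{phase:>8}: {seconds:.3f} s")
```

With these made-up inputs the metadata phase dominates, which illustrates why fewer, larger files and metadata kept outside the data files tend to shorten cold scans.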

Relationships

Outbound Relationships

scoped_to

S3, Object Storage

Inbound Relationships

constrained_by

StarRocks, Separation of Storage and Compute

solves

Apache Parquet, ORC, Lakehouse Architecture, Hybrid S3 + Vector Index

Resources

Docs | High

docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-per...

AWS's official S3 performance optimization guide covering request parallelization, prefix design, and throughput targets.

Docs | High

docs.aws.amazon.com/AmazonS3/latest/userguide/intelligent-ti...

AWS documentation on S3 Intelligent-Tiering, explaining how automatic tier transitions affect retrieval latency for infrequently accessed data.