Pain Point

Cold Scan Latency

Summary

What it is

Slow first-query performance against S3-stored data, caused by object discovery, metadata fetching, and data transfer over HTTP.

Where it fits

Cold scan latency is the fundamental performance trade-off of separating storage from compute. Every query against S3 pays network overhead that a query against local disk does not.

Misconceptions / Traps

  • Cold scan latency is not the same as S3 being slow. S3 throughput is high, but each request carries roughly 50-100 ms of initial latency before any data arrives. A query that touches many files issues many requests, so this per-request cost accumulates.
  • Caching helps with repeat queries but not with the first query. Mitigating a true cold scan requires metadata-driven pruning (as provided by table formats) and intelligent prefetching.
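
To make the "adds up" point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes requests are issued in waves of a fixed concurrency and that each wave costs one round of per-request latency. The function name, the 75 ms figure (the midpoint of the range above), and the object count are illustrative assumptions, not measurements.

```python
import math


def first_byte_overhead_s(num_requests: int, latency_ms: float, concurrency: int = 1) -> float:
    """Rough time spent waiting on first bytes, ignoring transfer time.

    Assumes requests run in waves of `concurrency` and that every wave
    costs one round of `latency_ms`.
    """
    if num_requests < 0 or concurrency < 1:
        raise ValueError("num_requests must be >= 0 and concurrency >= 1")
    waves = math.ceil(num_requests / concurrency)
    return waves * latency_ms / 1000.0


# Hypothetical query touching 2,000 objects at an assumed 75 ms per request.
serial = first_byte_overhead_s(2000, 75.0)                   # one request at a time
parallel = first_byte_overhead_s(2000, 75.0, concurrency=64)  # 32 waves of up to 64

print(f"serial: {serial:.1f} s, parallel: {parallel:.1f} s")
```

Under these assumptions the serial case waits about 150 seconds and the 64-way parallel case about 2.4 seconds, which suggests why engines combine request parallelism with reducing the number of objects touched.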

Key Connections

  • Apache Parquet solves Cold Scan Latency — columnar layout enables predicate pushdown
  • Lakehouse Architecture, Hybrid S3 + Vector Index solves Cold Scan Latency — metadata-driven access
  • Separation of Storage and Compute constrained_by Cold Scan Latency — inherent trade-off
  • StarRocks constrained_by Cold Scan Latency — first-query limited by S3 access
  • scoped_to S3, Object Storage
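
The metadata-driven access mentioned in the connections above can be sketched as min/max pruning: if per-file column statistics are known before any data object is opened, files that cannot match a filter are never requested. This is a minimal illustration, not the API of Parquet or of any table format; `FileStats`, `prune_files`, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file statistics for one column, as table metadata might record them."""
    key: str
    min_value: int
    max_value: int


def prune_files(files: list[FileStats], lower: int, upper: int) -> list[FileStats]:
    """Keep only files whose [min, max] range overlaps the query range [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


# Hypothetical manifest with non-overlapping value ranges per file.
manifest = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# A filter such as `value BETWEEN 120 AND 150` can only match rows in part-1.
survivors = prune_files(manifest, 120, 150)
print([f.key for f in survivors])  # prints ['part-1.parquet']
```

Pruning of this kind is only as effective as the data layout: if values are scattered so that every file's range spans the whole domain, no file can be skipped.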

Definition

What it is

The delay experienced on the first query against S3-stored data, caused by object discovery (listing), metadata fetching, and data transfer over the network.
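
The three contributors named in this definition can be laid out as a simple cost model. The sketch below is illustrative only: it assumes one metadata request per object, sequential listing pages, and a flat aggregate throughput, and all default values and the function name are hypothetical. The one external fact it relies on is that an S3 ListObjectsV2 response returns at most 1,000 keys.

```python
import math

LIST_PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per response


def cold_scan_estimate_s(
    num_objects: int,
    bytes_to_read: int,
    latency_ms: float = 75.0,        # assumed per-request latency
    throughput_mb_s: float = 100.0,  # assumed aggregate throughput, MB/s
    concurrency: int = 32,           # assumed parallel metadata requests
) -> dict[str, float]:
    """Split a rough cold-scan estimate into listing, metadata, and transfer time."""
    # Listing: pages under a single prefix are fetched one after another.
    list_pages = math.ceil(num_objects / LIST_PAGE_SIZE)
    listing = list_pages * latency_ms / 1000.0
    # Metadata: one request per object (for example a file footer), in waves.
    metadata = math.ceil(num_objects / concurrency) * latency_ms / 1000.0
    # Transfer: bytes divided by sustained throughput.
    transfer = bytes_to_read / (throughput_mb_s * 1_000_000)
    return {"listing": listing, "metadata": metadata, "transfer": transfer}


# Hypothetical table: 5,000 objects, 2 GB of column data actually read.
estimate = cold_scan_estimate_s(5000, 2_000_000_000)
for phase, seconds in estimate.items():
    print(f"{phase}: {seconds:.3f} s")
```

The split shows which phase dominates for a given table: many small objects inflate the listing and metadata terms, while a few large objects shift the cost toward transfer.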

Relationships

Outbound Relationships

Resources