Architecture

Clustering / Sort Order

Summary

What it is

The practice of physically organizing data files within a table by the values of one or more columns, so that queries filtering on those columns read fewer, more relevant files from S3.

Where it fits

Clustering (variously implemented as a linear sort order, Z-ordering, or spatial clustering) is a physical optimization within table formats on S3. By co-locating related data, it reduces the number of S3 GET requests needed to answer selective queries, directly addressing cold scan latency and request amplification.
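Z-ordering, one of the layouts named above, can be sketched as bit interleaving. This is a minimal illustration, not any engine's actual implementation: the function name `interleave_bits`, the 4x4 grid of `(day_index, region_id)` rows, and the four-rows-per-file split are all assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x land on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y land on odd positions
    return key

# Hypothetical rows: every (day_index, region_id) pair on a 4x4 grid.
rows = [(d, r) for d in range(4) for r in range(4)]

# Sort by the Z-order key, then cut the sorted rows into 4 "files" of 4 rows.
z_sorted = sorted(rows, key=lambda row: interleave_bits(row[0], row[1]))
files = [z_sorted[i:i + 4] for i in range(0, len(z_sorted), 4)]

# Per-file (min, max) for each column, as a table format would record them.
file_ranges = [
    ((min(d for d, _ in f), max(d for d, _ in f)),
     (min(r for _, r in f), max(r for _, r in f)))
    for f in files
]
```

In this toy layout each file covers half of the day range and half of the region range, so a filter on either column alone can skip two of the four files. A plain sort on `(day_index, region_id)` would instead give each file a single day but the full region range, pruning well on day and not at all on region.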

Misconceptions / Traps
  • Clustering is not partitioning. Partitioning splits data into separate directories by partition value; clustering orders rows within and across data files so that each file covers a narrow value range, which improves min/max metadata pruning. They are complementary, not interchangeable.
  • Re-clustering requires a full rewrite of affected data files. It is a resource-intensive maintenance operation similar to compaction and should be scheduled during low-usage windows.
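  • Clustering on a column whose values have no meaningful ordering (e.g., a random UUID) provides little benefit for range or grouped filters. High cardinality alone is not the problem: the column must have meaningful locality (date ranges, geographic regions, customer segments) that matches how queries filter.

Key Connections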
  • solves Cold Scan Latency — fewer files scanned means faster queries
  • relates_to Compaction — clustering is often combined with compaction
  • scoped_to Table Formats, S3 — physical data layout optimization
  • enables Manifest Pruning — sorted data produces tighter min/max bounds in manifests

Definition

What it is

The practice of physically ordering data within files on S3 by one or more columns (e.g., date, region, customer_id) so that queries with predicates on those columns can skip irrelevant file ranges via min/max statistics.

Why it exists

S3 is an object store and provides no index over the contents of data files, so query engines rely on file-level and row-group-level statistics to prune unnecessary reads. Clustering data by query-relevant columns maximizes the effectiveness of this pruning, directly reducing the number of S3 GET requests and bytes scanned.
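The effect of clustering on min/max pruning can be seen in a small simulation. This is a sketch under stated assumptions: the helper names `file_stats` and `files_to_read`, the 10,000 synthetic values, and the 100-rows-per-file split are all hypothetical, and real engines apply the same overlap test to statistics stored in manifests or Parquet footers rather than to in-memory lists.

```python
import random

def file_stats(values, rows_per_file):
    """Split values into consecutive 'files' and return each file's (min, max)."""
    chunks = (values[i:i + rows_per_file]
              for i in range(0, len(values), rows_per_file))
    return [(min(chunk), max(chunk)) for chunk in chunks]

def files_to_read(stats, lo, hi):
    """Count files whose [min, max] range overlaps the predicate lo <= v <= hi."""
    return sum(1 for fmin, fmax in stats if fmax >= lo and fmin <= hi)

random.seed(0)                 # fixed seed so the unclustered layout is reproducible
values = list(range(10_000))   # stand-in for e.g. an event sequence or day number
shuffled = values[:]
random.shuffle(shuffled)       # arrival order with no locality

unsorted_stats = file_stats(shuffled, rows_per_file=100)
sorted_stats = file_stats(sorted(shuffled), rows_per_file=100)

# Same predicate against both layouts: 2000 <= v <= 2099
unclustered = files_to_read(unsorted_stats, 2_000, 2_099)
clustered = files_to_read(sorted_stats, 2_000, 2_099)
print(f"unclustered: {unclustered} files, clustered: {clustered} files")
```

With the sorted layout the predicate overlaps exactly one file. With the shuffled layout nearly every file spans almost the whole value range, so the statistics rule out almost nothing and the engine has to fetch nearly all of them.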

Primary use cases

Optimizing Iceberg and Delta table layouts for common query patterns, reducing scan volume for time-series queries, and improving predicate pushdown effectiveness.
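For the Iceberg and Delta use case, the layout is typically declared through SQL. The sketch below only builds the statement strings. The helper names, table names, and column names are hypothetical, and it assumes a Spark environment with the Iceberg SQL extensions or Delta Lake enabled, where each string would be passed to `spark.sql(...)`.

```python
def iceberg_write_order(table: str, columns: list[str]) -> str:
    """Iceberg (Spark SQL extensions): declare the table's write sort order."""
    return f"ALTER TABLE {table} WRITE ORDERED BY {', '.join(columns)}"

def delta_zorder(table: str, columns: list[str]) -> str:
    """Delta Lake: compact files and co-locate rows by the given columns."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"

# Hypothetical table and column names, chosen only for illustration.
statements = [
    iceberg_write_order("catalog.db.events", ["event_date", "region"]),
    delta_zorder("events", ["event_date", "region"]),
]

for sql in statements:
    print(sql)  # in a real job: spark.sql(sql)
```

Note that declaring a write order in Iceberg only affects how data is written from then on. Files already in the table keep their existing layout until a rewrite is run, which is the re-clustering cost described under Misconceptions / Traps.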
