Clustering / Sort Order
The practice of physically organizing data files within a table by the values of one or more columns, so that queries filtering on those columns read fewer, more relevant files from S3.
Summary
Clustering (implemented as a table sort order, Z-ordering, or another space-filling-curve layout) is a physical layout optimization within table formats on S3. By co-locating related rows in the same files, it reduces the number of S3 GET requests needed to answer selective queries, which directly addresses cold scan latency and request amplification.
- Clustering is not partitioning. Partitioning splits data into separate directories by exact values; clustering orders rows so that each file covers a narrow range of values, which improves min/max metadata pruning. They are complementary, not interchangeable.
- Re-clustering requires a full rewrite of affected data files. It is a resource-intensive maintenance operation similar to compaction and should be scheduled during low-usage windows.
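- Clustering on a column whose values have no locality that queries can exploit (e.g., a random UUID used only as a row identifier) provides little benefit. The column should appear in query filters and have meaningful locality, such as date ranges, geographic regions, or customer segments, to be effective.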
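Z-ordering, one of the clustering techniques named above, can be illustrated with a small sketch. A plain sort on two columns keeps only the first column tightly grouped; a Z-order key interleaves the bits of both values so that rows close in either column tend to land near each other. The function name, the 16-bit width, and the (day_index, region_id) rows are assumptions made for illustration; real engines first map column values to integer-like encodings and use their own implementations.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return a Z-order (Morton) key for two non-negative ints below 2**bits."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return key


# Hypothetical rows of (day_index, region_id), sorted by their Z-order key.
rows = [(d, r) for d in range(4) for r in range(4)]
rows_z = sorted(rows, key=lambda row: interleave_bits(row[0], row[1]))
print(rows_z)
```

Writing the rows to files in `rows_z` order means each file covers a small range of both columns, so filters on either one can prune files, at the cost of less tight bounds than a single-column sort would give on that one column.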
solves: Cold Scan Latency (fewer files scanned means faster queries)
relates_to: Compaction (clustering is often combined with compaction)
scoped_to: Table Formats, S3 (physical data layout optimization)
enables: Manifest Pruning (sorted data produces tighter min/max bounds in manifests)
Definition
The practice of physically ordering data within files on S3 by one or more columns (e.g., date, region, customer_id) so that queries with predicates on those columns can skip irrelevant file ranges via min/max statistics.
S3 offers no index over the contents of objects, so query engines rely on file-level and row-group-level statistics to skip unnecessary reads. Clustering data by query-relevant columns narrows each file's min/max range, which maximizes the effectiveness of this pruning and directly reduces the number of S3 GET requests and bytes scanned.
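The effect of clustering on min/max pruning can be seen in a small simulation. The sketch below splits a column into fixed-size "files", records each file's min and max, and counts how many files a range predicate must read with and without sorting. The column (a day-of-year value), the file size, and the helper names are assumptions for illustration, not any engine's actual planner logic.

```python
import random


def file_stats(values, rows_per_file):
    """Split values into fixed-size 'files' and return (min, max) per file."""
    stats = []
    for i in range(0, len(values), rows_per_file):
        chunk = values[i:i + rows_per_file]
        stats.append((min(chunk), max(chunk)))
    return stats


def files_to_read(stats, lo, hi):
    """Count files whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return sum(1 for fmin, fmax in stats if fmax >= lo and fmin <= hi)


random.seed(0)
days = [random.randrange(365) for _ in range(100_000)]  # hypothetical column

unclustered = file_stats(days, rows_per_file=1_000)
clustered = file_stats(sorted(days), rows_per_file=1_000)

# Predicate: one week of data, days 100 through 106.
print("unclustered:", files_to_read(unclustered, 100, 106), "of", len(unclustered))
print("clustered:  ", files_to_read(clustered, 100, 106), "of", len(clustered))
```

In the unclustered layout nearly every file spans almost the whole value range, so the statistics rule out almost nothing; after sorting, only the few files whose ranges cover the requested week need to be fetched.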
Typical uses include optimizing Iceberg and Delta Lake table layouts for common query patterns, reducing scan volume for time-series queries, and improving predicate pushdown effectiveness.
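As a rough sketch of how such a layout optimization is typically triggered from Spark, the helpers below build the SQL for Iceberg's `rewrite_data_files` procedure with a sort strategy and for Delta Lake's `OPTIMIZE ... ZORDER BY` command. The helper names, catalog, table, and column names are hypothetical; the statements assume a Spark session with the relevant Iceberg or Delta extensions configured, and the exact options available depend on the engine version.

```python
def iceberg_sort_rewrite(catalog: str, table: str, sort_order: str) -> str:
    """Build a CALL statement for Iceberg's rewrite_data_files with a sort strategy."""
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', strategy => 'sort', sort_order => '{sort_order}')"
    )


def delta_zorder(table: str, columns: list[str]) -> str:
    """Build a Delta Lake OPTIMIZE statement that Z-orders by the given columns."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


# Hypothetical catalog, table, and column names.
iceberg_stmt = iceberg_sort_rewrite("my_catalog", "db.events", "zorder(event_date, region)")
delta_stmt = delta_zorder("events", ["event_date", "region"])
print(iceberg_stmt)
print(delta_stmt)

# With a configured SparkSession named `spark`, each statement would be run as:
#   spark.sql(iceberg_stmt)
```

Both statements rewrite data files, so they are usually scheduled alongside compaction rather than run on every write.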
Resources
Iceberg's rewrite_data_files procedure documentation covering sort-order and Z-order clustering for optimizing scan performance on S3.
Delta Lake data skipping documentation explaining how Z-ordering and liquid clustering reduce I/O for analytical queries on S3.
Liquid clustering documentation for Delta Lake's incremental, adaptive clustering that replaces static Z-ordering on S3-backed tables.