Architecture

Clustering / Sort Order

Summary

What it is

The practice of physically organizing data files within a table by the values of one or more columns, so that queries filtering on those columns read fewer, more relevant files from S3.

Where it fits

Clustering is a physical layout optimization within table formats on S3. It is implemented as a single- or multi-column sort order, or as a space-filling-curve ordering such as Z-order. By co-locating related data, it reduces the number of S3 GET requests needed to answer selective queries, which directly addresses cold scan latency and request amplification.

Misconceptions / Traps

Clustering is not partitioning. Partitioning splits data into separate directories by exact values; clustering sorts rows within and across data files so that each file covers a narrow value range, which tightens the min/max statistics used for pruning. They are complementary, not interchangeable.
Re-clustering requires a full rewrite of affected data files. It is a resource-intensive maintenance operation similar to compaction and should be scheduled during low-usage windows.
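Clustering on a column with no meaningful locality, such as a random UUID that queries rarely filter on, provides little benefit. High cardinality by itself is not the problem (timestamps are high-cardinality and cluster well); the column must appear in query predicates and group related rows together, as date ranges, geographic regions, or customer segments do.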
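The difference between partitioning and clustering can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular table format writes files: the Hive-style `column=value/` path, the `part-NNNNN.parquet` naming, and the `layout` helper are all hypothetical, and rows are plain dictionaries.

```python
from collections import defaultdict

def layout(rows, partition_col, cluster_col, rows_per_file):
    """Partition by exact value, then sort and chunk each partition into files."""
    partitions = defaultdict(list)
    for row in rows:
        # Partitioning: one bucket (directory) per exact value.
        partitions[row[partition_col]].append(row)

    files = []
    for value, part_rows in sorted(partitions.items()):
        # Clustering: order rows inside the partition before cutting files.
        part_rows.sort(key=lambda r: r[cluster_col])
        for i in range(0, len(part_rows), rows_per_file):
            chunk = part_rows[i:i + rows_per_file]
            files.append({
                # Hypothetical Hive-style path, for illustration only.
                "path": f"{partition_col}={value}/part-{i // rows_per_file:05d}.parquet",
                "min": chunk[0][cluster_col],
                "max": chunk[-1][cluster_col],
            })
    return files

# Two regions, six unordered day values each.
rows = [{"region": r, "day": d} for r in ("eu", "us") for d in (5, 1, 4, 2, 3, 6)]
files = layout(rows, "region", "day", rows_per_file=3)
for f in files:
    print(f["path"], f["min"], f["max"])
```

The partition column decides which directory a row lands in; the cluster column decides which file inside that directory, and therefore how narrow each file's min/max range is.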

Key Connections

solves Cold Scan Latency — fewer files scanned means faster queries
relates_to Compaction — clustering is often combined with compaction
scoped_to Table Formats, S3 — physical data layout optimization
enables Manifest Pruning — sorted data produces tighter min/max bounds in manifests
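As a rough illustration of how clustering combines with compaction, the sketch below contrasts a plain bin-pack (merge small files up to a target size) with a sort-based rewrite (merge, sort, then cut into target-sized files). It assumes row counts stand in for byte sizes and that a file is just a list of values; `bin_pack`, `sort_compact`, and `ranges` are hypothetical helpers, not any engine's API.

```python
def bin_pack(small_files, target_rows):
    """Greedily merge small files, in arrival order, up to target_rows each."""
    packed, current = [], []
    for f in small_files:
        if current and len(current) + len(f) > target_rows:
            packed.append(current)
            current = []
        current = current + f
    if current:
        packed.append(current)
    return packed

def sort_compact(small_files, target_rows):
    """Merge all rows, sort them globally, and cut into target-sized files."""
    rows = sorted(r for f in small_files for r in f)
    return [rows[i:i + target_rows] for i in range(0, len(rows), target_rows)]

def ranges(files):
    """Min/max statistics per file, as a pruning engine would see them."""
    return [(min(f), max(f)) for f in files]

small_files = [[9, 2], [7, 1], [8, 3], [6, 0], [5, 4]]

print(ranges(bin_pack(small_files, target_rows=4)))
# [(1, 9), (0, 8), (4, 5)]
print(ranges(sort_compact(small_files, target_rows=4)))
# [(0, 3), (4, 7), (8, 9)]
```

Both rewrites reduce five files to three, but only the sorted one yields non-overlapping min/max ranges, which is what lets a selective query skip files.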

Definition

What it is

The practice of physically ordering data within files on S3 by one or more columns (e.g., date, region, customer_id) so that queries with predicates on those columns can skip irrelevant file ranges via min/max statistics.

Why it exists

S3 is an object store and offers no index over the contents of the objects it holds, so query engines rely on file-level and row-group-level statistics to prune unnecessary reads. Clustering data by query-relevant columns maximizes the effectiveness of this pruning, directly reducing the number of S3 GET requests and bytes scanned.
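A minimal simulation of min/max pruning shows why layout matters. It assumes each file exposes only a min and a max for one integer column, and it uses a deterministic scramble (multiplying by 37 modulo 1000) as a stand-in for unordered ingestion; `write_files` and `files_to_read` are hypothetical names.

```python
def write_files(values, rows_per_file):
    """Split values into fixed-size 'files' and record min/max stats per file."""
    files = []
    for i in range(0, len(values), rows_per_file):
        chunk = values[i:i + rows_per_file]
        files.append({"min": min(chunk), "max": max(chunk)})
    return files

def files_to_read(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return [f for f in files if f["max"] >= lo and f["min"] <= hi]

# 1000 distinct day numbers in a scrambled arrival order (37 is coprime to 1000,
# so this is a permutation of 0..999).
arrival = [(i * 37) % 1000 for i in range(1000)]

unclustered = write_files(arrival, rows_per_file=100)
clustered = write_files(sorted(arrival), rows_per_file=100)

print(len(files_to_read(unclustered, 200, 249)))  # 10: every file overlaps
print(len(files_to_read(clustered, 200, 249)))    # 1: only the 200..299 file
```

In the unclustered layout every file spans nearly the whole value range, so the statistics cannot rule anything out; after sorting, the same predicate touches a single file.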

Primary use cases

Optimizing Iceberg/Delta table layouts for common query patterns, reducing scan volume for time-series queries, improving predicate pushdown effectiveness.
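For Iceberg, one common way to apply such a layout is the `rewrite_data_files` Spark procedure with `strategy => 'sort'`. The helper below is a small sketch that only builds the `CALL` statement; the catalog, table, and column names are hypothetical, the helper itself is not part of Iceberg, and it does no SQL escaping.

```python
from typing import Sequence

def rewrite_data_files_sql(catalog: str, table: str,
                           columns: Sequence[str], zorder: bool = False) -> str:
    """Build a CALL statement for Iceberg's rewrite_data_files Spark procedure."""
    if not columns:
        raise ValueError("at least one sort column is required")
    if zorder and len(columns) < 2:
        # Design choice of this sketch: one column is just a plain sort.
        raise ValueError("Z-order needs two or more columns; use a plain sort for one")
    cols = ",".join(columns)
    sort_order = f"zorder({cols})" if zorder else cols
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', strategy => 'sort', sort_order => '{sort_order}')"
    )

# Hypothetical catalog, table, and column names.
stmt = rewrite_data_files_sql("my_catalog", "db.events",
                              ["event_date", "customer_id"], zorder=True)
print(stmt)
# With a SparkSession configured for Iceberg, this would be run as: spark.sql(stmt)
```

Passing `zorder=False` with a single column such as `event_date` produces a plain sorted rewrite, which is usually enough for time-series queries that filter on one column.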

Recent developments

Latest signals

AWS shipped sort + Z-order compaction for Iceberg on S3. AWS now offers managed sort and Z-order compaction inside S3 Tables / Glue: enable it on a table and the catalog runs the compaction job on your behalf. The common case no longer needs a bespoke Spark rewrite_data_files DAG. Per AWS Blog, "Improve Apache Iceberg Query Performance in S3 with Sort and Z-Order Compaction".
Format-level capability matrix (May 2026): Hudi supports both Z-order and Hilbert; Iceberg supports Z-order only; Delta supports Z-order and Hilbert (Liquid Clustering). Hilbert curves give better multi-dimensional locality than Z-order but cost more to compute. Adding Hilbert curve support to Apache Iceberg is under active discussion in 2026. Per Iceberg Lakehouse, "Z-Order Clustering in Apache Iceberg", and Medium, "Apache Iceberg Is Missing a Critical Feature" (Feb 2026).
Z-order interleaves the bits of multiple column values into one sort key. Sorting by that key preserves locality across all of the clustered dimensions at once, so a query that filters on any of the clustered columns benefits. This is why Z-order beats single-column sort when query patterns span multiple predicates. Per Dremio, "How Z-Ordering in Apache Iceberg Helps Improve Performance".
Iceberg clustering is manual (rewrite_data_files Spark procedure), not a continuous service. Until AWS S3 Tables shipped managed compaction, Iceberg required users to trigger clustering explicitly, unlike Delta Lake's Liquid Clustering, which clusters incrementally and can be run automatically by managed platforms. The gap between "automatic" and "manual" clustering is now closing via cloud-vendor managed services. Per Onehouse, "What is Clustering in an Open Data Lakehouse".
Multi-dimensional clustering wins on queries with multiple WHERE-clause predicates. The 2026 framing: single-column sort works if all queries filter on the same column; Z-order or Hilbert fits when query patterns span multiple columns; and the clustering decision should follow analysis of the actual query workload. Per Medium, "Data Locality and Multi-Dimensional Clustering in Data Lakehouse".
Sort and Z-order compaction are usually paired with bin-pack file sizing. The pattern has two steps: bin-pack (combine small files into target-sized files), then sort or Z-order (lay out the rows within and across those files). Most engines now run both as a single combined operation. Per DEV, "Apache Iceberg Table Optimization #4: Smarter Data Layout".
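The bit-interleaving behind Z-order, and the trade-off between single-column sort and multi-dimensional clustering noted in the signals above, can be sketched with a toy table. The assumptions are loud ones: two integer columns `x` and `y` in the range 0..63, 64 rows per file, and files described only by per-column min/max. Real engines first have to map arbitrary column types to ordered integers, which this sketch skips; `zkey`, `build_files`, and `files_scanned` are hypothetical names.

```python
def zkey(x, y, bits=6):
    """Z-order (Morton) key: interleave the low `bits` bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i + 1)  # x bits go to odd positions
        key |= ((y >> i) & 1) << (2 * i)      # y bits go to even positions
    return key

def build_files(points, sort_key, rows_per_file=64):
    """Sort rows by sort_key, cut into files, and keep min/max per column."""
    ordered = sorted(points, key=sort_key)
    files = []
    for i in range(0, len(ordered), rows_per_file):
        chunk = ordered[i:i + rows_per_file]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        files.append({"x": (min(xs), max(xs)), "y": (min(ys), max(ys))})
    return files

def files_scanned(files, col, value):
    """Count files whose min/max range for `col` could contain `value`."""
    return sum(1 for f in files if f[col][0] <= value <= f[col][1])

points = [(x, y) for x in range(64) for y in range(64)]
linear = build_files(points, sort_key=lambda p: (p[0], p[1]))
zorder = build_files(points, sort_key=lambda p: zkey(p[0], p[1]))

for name, files in (("sort by x", linear), ("z-order(x, y)", zorder)):
    print(name, files_scanned(files, "x", 20), files_scanned(files, "y", 20))
# sort by x 1 64
# z-order(x, y) 8 8
```

Sorting by `x` alone is ideal for filters on `x` (1 of 64 files) and useless for filters on `y` (all 64 files). Z-order gives up some of the best case to serve both columns, reading 8 of 64 files either way, which is why the choice should follow the actual mix of predicates in the workload.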