Clustering / Sort Order
The practice of physically organizing data files within a table by the values of one or more columns, so that queries filtering on those columns read fewer, more relevant files from S3.
Summary
Clustering (implemented as a sort order, Z-ordering, or another space-filling-curve layout) is a physical layout optimization within table formats on S3. By co-locating related rows, it reduces the number of S3 GET requests and bytes read for selective queries, directly addressing cold scan latency and request amplification.
- Clustering is not partitioning. Partitioning splits data into separate directories by exact values; clustering orders rows within and across data files so that each file covers a narrow value range, which improves min/max metadata pruning. They are complementary, not interchangeable.
- Re-clustering requires a full rewrite of affected data files. It is a resource-intensive maintenance operation similar to compaction and should be scheduled during low-usage windows.
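- Clustering on columns whose values are effectively random (e.g., a random UUID) rarely pays off. It can narrow exact-match lookups on that column, but it gives range filters nothing to prune and scatters every other column. Cardinality alone is not the problem (timestamps are high-cardinality and cluster well); the column must have meaningful locality, such as date ranges, geographic regions, or customer segments, to be effective.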
solves: Cold Scan Latency (fewer files scanned means faster queries)
relates_to: Compaction (clustering is often combined with compaction)
scoped_to: Table Formats, S3 (physical data layout optimization)
enables: Manifest Pruning (sorted data produces tighter min/max bounds in manifests)
Definition
The practice of physically ordering data within files on S3 by one or more columns (e.g., date, region, customer_id) so that queries with predicates on those columns can skip irrelevant file ranges via min/max statistics.
S3 has no indexing capability — query engines rely on file-level and row-group-level statistics to prune unnecessary reads. Clustering data by query-relevant columns maximizes the effectiveness of this pruning, directly reducing the number of S3 GET requests and bytes scanned.
Typical uses: optimizing Iceberg/Delta table layouts for common query patterns, reducing scan volume for time-series queries, and improving predicate pushdown effectiveness.
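The min/max pruning described above can be sketched with a small simulation. This is a minimal illustration under stated assumptions, not how any particular engine stores statistics: `FileStats`, `write_files`, and `files_to_read` are hypothetical names, each "file" is just a fixed-size chunk of rows, and the data is synthetic day-of-year values.

```python
import random
from dataclasses import dataclass


@dataclass
class FileStats:
    """Min/max of the filter column for one data file (hypothetical stand-in for manifest stats)."""
    lo: int
    hi: int


def write_files(values, rows_per_file):
    """Split rows into fixed-size 'files' in the given order and record min/max per file."""
    files = []
    for i in range(0, len(values), rows_per_file):
        chunk = values[i:i + rows_per_file]
        files.append(FileStats(min(chunk), max(chunk)))
    return files


def files_to_read(files, lo, hi):
    """Count files whose [min, max] range overlaps the predicate lo <= value <= hi."""
    return sum(1 for f in files if f.hi >= lo and f.lo <= hi)


random.seed(0)
# Synthetic day-of-year column; rows arrive in random order.
days = [random.randrange(365) for _ in range(100_000)]

unclustered = write_files(days, rows_per_file=1_000)        # arrival order
clustered = write_files(sorted(days), rows_per_file=1_000)  # sorted by the filter column

# Predicate: one week of data (days 100..106 inclusive).
unclustered_reads = files_to_read(unclustered, 100, 106)
clustered_reads = files_to_read(clustered, 100, 106)
print(f"unclustered: {unclustered_reads} of {len(unclustered)} files")
print(f"clustered:   {clustered_reads} of {len(clustered)} files")
```

In this synthetic setup every unclustered file spans nearly the whole value range, so none can be skipped, while the sorted layout confines the week to a handful of adjacent files. Each skipped file is at least one S3 GET request that is never issued.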
Recent developments
- AWS shipped sort + Z-order compaction for Iceberg on S3. AWS now offers managed sort and Z-order compaction inside S3 Tables / Glue — flip a switch and the catalog runs the compaction job on your behalf. No more bespoke Spark rewrite-data-files DAGs for the common case. Per AWS Blog — Improve Apache Iceberg Query Performance in S3 with Sort and Z-Order Compaction.
- Format-level capability matrix (May 2026): Hudi has both Z-order + Hilbert; Iceberg has Z-order only; Delta has Z-order + Hilbert (Liquid clustering). Hilbert curves give better multi-dimensional locality than Z-order but cost more to compute. Active discussion in 2026 to add Hilbert curve support to Apache Iceberg. Per Iceberg Lakehouse — Z-Order Clustering in Apache Iceberg and Medium — Apache Iceberg Is Missing a Critical Feature (Feb 2026).
- Z-order is bit-interleaving of multiple column values into one sort key. It preserves locality across multiple dimensions simultaneously, so query filters on any of the clustered columns benefit. This is why Z-order beats single-column sort when query patterns span multiple predicates. Per Dremio — How Z-Ordering in Apache Iceberg Helps Improve Performance.
- Iceberg clustering is manual (the rewrite_data_files Spark procedure), not a continuous service. Until AWS S3 Tables shipped managed compaction, Iceberg users had to trigger clustering explicitly, unlike Delta Lake's Liquid Clustering, which clusters incrementally and is described as running as a continuous background service. The "automatic" vs "manual" capability gap is now closing via cloud-vendor managed services. Per Onehouse — What is Clustering in an Open Data Lakehouse.
- Multi-dimensional clustering wins on queries with multiple WHERE-clause predicates. The 2026 framing: single-column sort works if all queries filter on the same column; Z-order or Hilbert is the better fit when query patterns span multiple columns; and the clustering decision should follow analysis of the actual query workload. Per Medium — Data Locality and Multi-Dimensional Clustering in Data Lakehouse.
- Sort + Z-order compaction is usually paired with bin-pack file sizing. The two-step compaction pattern is: bin-pack (combine small files into target-sized files), then sort/Z-order (lay out the rows within those files). Most engines now run these as a single combined operation. Per DEV — Apache Iceberg Table Optimization #4: Smarter Data Layout.
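The bit-interleaving idea behind Z-order, and the single-column versus multi-column trade-off noted in the list above, can be sketched as follows. This is an illustrative sketch only: `interleave_bits`, `write_files`, and `files_to_read` are hypothetical helpers, the two columns are assumed to be small non-negative integers, and files are fixed-size chunks of rows. Real engines handle arbitrary column types and widths.

```python
import random


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Z-order (Morton) key: bit i of x lands at position 2i, bit i of y at 2i + 1."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def write_files(rows, rows_per_file):
    """Chunk rows in order; keep per-file ((min_x, max_x), (min_y, max_y))."""
    stats = []
    for i in range(0, len(rows), rows_per_file):
        chunk = rows[i:i + rows_per_file]
        xs = [r[0] for r in chunk]
        ys = [r[1] for r in chunk]
        stats.append(((min(xs), max(xs)), (min(ys), max(ys))))
    return stats


def files_to_read(stats, col, lo, hi):
    """Count files whose min/max on column `col` (0 = x, 1 = y) overlaps [lo, hi]."""
    return sum(1 for s in stats if s[col][1] >= lo and s[col][0] <= hi)


random.seed(0)
rows = [(random.randrange(1024), random.randrange(1024)) for _ in range(65_536)]

by_x = write_files(sorted(rows), 1_024)  # single-column style sort (x, then y)
by_z = write_files(sorted(rows, key=lambda r: interleave_bits(*r)), 1_024)

# The same narrow range filter, applied to each column on its own.
x_sorted_x = files_to_read(by_x, 0, 100, 115)
x_sorted_y = files_to_read(by_x, 1, 100, 115)
z_x = files_to_read(by_z, 0, 100, 115)
z_y = files_to_read(by_z, 1, 100, 115)
print(f"sorted by x: filter on x reads {x_sorted_x}, filter on y reads {x_sorted_y} of {len(by_x)}")
print(f"z-ordered:   filter on x reads {z_x}, filter on y reads {z_y} of {len(by_z)}")
```

On this synthetic data the x-sorted layout prunes very well for filters on x but has to read every file for filters on y, while the Z-ordered layout reads only a fraction of the files for either column. The cost is that Z-order typically reads more files than a plain sort would for the sort's leading column, which is why the choice should follow the observed query workload.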
Resources
Iceberg's rewrite_data_files procedure documentation covering sort-order and Z-order clustering for optimizing scan performance on S3.
Delta Lake data skipping documentation explaining how Z-ordering and liquid clustering reduce I/O for analytical queries on S3.
Liquid clustering documentation for Delta Lake's incremental, adaptive clustering that replaces static Z-ordering on S3-backed tables.