Architecture

Partitioning

The strategy of physically organizing table data files by column values so query engines can skip irrelevant files. On S3-backed lakehouses, partitioning is the primary mechanism for reducing both I/O and API costs at scale.


Summary


Where it fits

Foundational to all table formats (Iceberg, Delta, Hudi, Paimon). Iceberg hidden partitioning decouples the partition scheme from user-facing SQL, while Delta and Hudi typically use Hive-style directory layouts keyed on explicit partition columns. Partition evolution, meaning changing the scheme without rewriting existing data, is a distinguishing feature of Iceberg and matters most for long-lived tables whose query patterns change over time.
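The difference between the two layouts can be sketched in a few lines of Python. This is an illustration only, not any engine's implementation: the bucket name, the `dt` and `region` columns, and both helper functions are hypothetical.

```python
from datetime import datetime, timezone


def hive_style_path(table_root: str, partition_values: dict[str, str]) -> str:
    """Build a Hive-style directory prefix: root/col1=val1/col2=val2/."""
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([table_root.rstrip("/"), *parts]) + "/"


def day_transform(ts: datetime) -> str:
    """Illustrative 'day' transform: derive a partition value from a timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


event_ts = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)

# Hive-style: the writer materializes `dt` as an explicit partition column,
# and readers must filter on `dt` (not on `event_ts`) to get pruning.
hive_prefix = hive_style_path(
    "s3://example-bucket/events",  # hypothetical bucket and table
    {"dt": day_transform(event_ts), "region": "eu"},
)

# Hidden-partitioning idea: only `event_ts` is user-facing. Table metadata
# records that the partition value is day(event_ts), so an engine can map a
# filter on `event_ts` to partition values without a separate `dt` column.
hidden_partition_value = day_transform(event_ts)

print(hive_prefix)
print(hidden_partition_value)
```

In the Hive-style case the partition scheme leaks into every query and every path. In the hidden case it stays in metadata, which is what makes it possible to change the scheme later.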

Misconceptions / Traps
  • Over-partitioning creates the Small Files Problem: too many partitions with too few rows each, so every query pays per-file overhead on many tiny objects.
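  • Hive-style partitioning usually needs explicit derived partition columns (e.g., a date column computed from a timestamp). These duplicate information, must be filtered on explicitly for pruning to work, and tie the table to a layout that cannot change without a rewrite. Iceberg hidden partitioning avoids these problems by deriving partition values from transforms on existing columns.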
  • Partitioning alone doesn't help if the query predicate doesn't match the partition key. Combine with Clustering / Sort Order for intra-partition pruning.
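The over-partitioning trap above comes down to simple arithmetic. The sketch below assumes a hypothetical 100 GB table and made-up partition counts, with one file per partition; real writers often produce several files per partition, which makes the effect worse.

```python
def average_file_size_mb(
    table_size_gb: float, partitions: int, files_per_partition: int = 1
) -> float:
    """Average data file size if the table is spread evenly over all partitions."""
    total_files = partitions * files_per_partition
    return table_size_gb * 1024 / total_files


# Hypothetical 100 GB table holding one year of data.
by_day = average_file_size_mb(100, partitions=365)

# Same table partitioned by day AND by a 10,000-value user ID column.
by_day_and_user = average_file_size_mb(100, partitions=365 * 10_000)

print(f"by day:          {by_day:.1f} MB per file")
print(f"by day and user: {by_day_and_user * 1024:.1f} KB per file")
```

Partitioning by day alone yields files of a few hundred megabytes. Adding the high-cardinality column shrinks them to tens of kilobytes, at which point per-request overhead dominates actual data transfer.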

Key Connections
  • Directly impacts Cold Scan Latency, Object Listing Performance, and Request Pricing Models.
  • Partition evolution in Iceberg lets the partition scheme change without rewriting existing data, which removes a major source of Schema Evolution pain.
  • Works alongside Manifest Pruning and Clustering / Sort Order in the query planning pipeline.

Definition

What it is

The strategy of grouping a table's data files by column values (e.g., date, region), traditionally as a directory hierarchy, so that query engines can skip irrelevant files entirely and read only the partitions that match the query predicate.
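A minimal sketch of partition pruning, assuming equality predicates only. The `DataFile` class, the file paths, and the `prune_files` helper are hypothetical and do not reflect any particular engine's planner.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    partition: dict  # partition column name -> value for this file


FILES = [
    DataFile("events/dt=2024-01-14/region=eu/part-0.parquet",
             {"dt": "2024-01-14", "region": "eu"}),
    DataFile("events/dt=2024-01-15/region=eu/part-0.parquet",
             {"dt": "2024-01-15", "region": "eu"}),
    DataFile("events/dt=2024-01-15/region=us/part-0.parquet",
             {"dt": "2024-01-15", "region": "us"}),
]


def prune_files(files: list, equality_filters: dict) -> list:
    """Keep only files whose partition values satisfy the equality filters.

    Filters on columns that are not partition columns are ignored here,
    because partition metadata says nothing about them: they cannot prune.
    """
    selected = []
    for data_file in files:
        matches = all(
            data_file.partition[column] == value
            for column, value in equality_filters.items()
            if column in data_file.partition
        )
        if matches:
            selected.append(data_file)
    return selected


# Predicate on partition columns: only one of three files needs to be read.
matching = prune_files(FILES, {"dt": "2024-01-15", "region": "eu"})

# Predicate on a non-partition column: nothing can be skipped.
unpruned = prune_files(FILES, {"user_id": 42})

print([f.path for f in matching])
print(len(unpruned))
```

The second call shows why the partition key has to match the common query predicates: a filter on a column outside the partition scheme leaves every file in the scan set.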

Why it exists

Without partitioning or other file-level metadata, a query may have to scan every file in a table. On S3, where list operations are slow and billed per request and data volumes reach petabytes, partitioning is the primary mechanism for reducing I/O. Modern table formats (Iceberg hidden partitioning, Hudi's partition-level indexing) improve on Hive-style partitioning by decoupling the physical layout from the SQL schema.
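For engines that plan queries from directory listings, the saving in LIST requests can be estimated roughly. The sketch relies on one known limit, that S3 `ListObjectsV2` returns at most 1,000 keys per request. The object count and partition count are hypothetical, and objects are assumed to be spread evenly across partitions.

```python
import math

MAX_KEYS_PER_LIST = 1000  # S3 ListObjectsV2 returns at most 1,000 keys per call


def list_requests(object_count: int) -> int:
    """Number of paginated LIST calls needed to enumerate object_count keys."""
    return max(1, math.ceil(object_count / MAX_KEYS_PER_LIST))


total_objects = 2_000_000   # hypothetical table size in objects
daily_partitions = 365      # hypothetical: one prefix per day

# Listing the whole table prefix versus listing a single day's prefix.
full_table_calls = list_requests(total_objects)
one_day_calls = list_requests(total_objects // daily_partitions)

print(full_table_calls, one_day_calls)
```

Table formats that track files in manifests avoid most of this listing altogether, which is one reason partition information moved from directory names into table metadata.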

Primary use cases

Time-series data organized by date, multi-tenant data organized by customer ID, geographic data organized by region, event data organized by event type.
