Architecture

Partitioning

The strategy of physically organizing table data files by column values so query engines can skip irrelevant files. On S3-backed lakehouses, partitioning is the primary mechanism for reducing both I/O and API costs at scale.

5 connections 3 resources

Summary

What it is

The strategy of physically organizing table data files by column values so query engines can skip irrelevant files. On S3-backed lakehouses, partitioning is the primary mechanism for reducing both I/O and API costs at scale.

Where it fits

Foundational to all table formats (Iceberg, Delta, Hudi, Paimon). Iceberg hidden partitioning decouples the partition scheme from user-facing SQL, while Delta and Hudi use Hive-style directory layouts. Partition evolution — changing the scheme without rewriting data — is unique to Iceberg and directly impacts long-lived tables.

Misconceptions / Traps
  • Over-partitioning creates the Small Files Problem — too many partitions with too few rows each.
  • Hive-style partition columns waste storage and break schema evolution. Iceberg hidden partitioning avoids both.
  • Partitioning alone doesn't help if the query predicate doesn't match the partition key. Combine with Clustering / Sort Order for intra-partition pruning.
Key Connections
  • Directly impacts Cold Scan Latency, Object Listing Performance, and Request Pricing Models.
  • Modern partition evolution in Iceberg removes a major source of Schema Evolution pain.
  • Works alongside Manifest Pruning and Clustering / Sort Order in the query planning pipeline.

Definition

What it is

The strategy of organizing data files within a table into a directory hierarchy based on column values (e.g., date, region), enabling query engines to skip irrelevant files entirely by reading only the partitions that match the query predicate.

Why it exists

Without partitioning, every query must scan every file in a table. On S3, where list operations are expensive and data volumes reach petabytes, partitioning is the primary mechanism for reducing I/O. Modern table formats (Iceberg hidden partitioning, Hudi's partition-level indexing) improve on Hive-style partitioning by decoupling the physical layout from the SQL schema.

Primary use cases

Time-series data organized by date, multi-tenant data organized by customer ID, geographic data organized by region, event data organized by event type.

Recent developments

Latest signals
  • Iceberg hidden partitioning eliminates the Hive-era "forgot to filter on partition column" footgun. Hidden partitioning applies a transform to an existing column (days(timestamp), bucket(16, user_id)) — analysts query natural business columns; the engine auto-prunes partitions. The class of accidental full-table scans that haunted Hive doesn't exist in Iceberg. Per DataLakehouseHub — Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans and RisingWave — Iceberg Hidden Partitioning Explained: Why It Matters for Streaming.
  • bucket(16, user_id) is the canonical pattern for high-cardinality MERGE/CDC. Hash partitioning distributes data evenly across N buckets regardless of key distribution — for CDC upserts across millions of users, bucket(16, user_id) means each MERGE scans only 1/16 of the table. The reference pattern for CDC into Iceberg. Per Cazpian — Iceberg Table Design: Properties, Partitioning, Commit Best Practices.
  • Partition size target: 128 MB – 1 GB after compression. 2026 sizing guidance: each partition should hold 128 MB to 1 GB of compressed data — enough for well-sized files (256 MB each) without bloating partition count. Anything outside that band wastes either I/O (too small) or pruning (too large). Per Starburst — 3 Iceberg Partitioning Best Practices to Improve Performance.
  • "Start coarser than you think; optimize later" is the dominant 2026 anti-pattern correction. Teams over-partition early (DAY or HOUR on high-volume streams) and end up paying metadata + planning cost long before pruning benefits show up. The corrective: start at month granularity + descend to day only when the workload demands it. Per Cazpian — Iceberg Table Design Best Practices.
  • Partition evolution lets you change strategy without breaking existing data. Iceberg's partition-evolution capability is the load-bearing operational difference vs Hive — change the partition spec going forward; old data keeps its old partition layout; queries auto-handle both. Per Apache Iceberg — Partitioning docs.
  • Separation of concerns: engineers define partition spec once; analysts write natural SQL. Hidden partitioning + partition transforms decouple the physical layout from the SQL schema. Data engineers tune the partition strategy; analysts write business-column SQL; queries auto-optimize. The clean separation of operational + analytical concerns that Hive never delivered. Per Substack — Partitioning with Apache Iceberg: A Deep Dive.

Connections 5

Outbound 5

Resources 3