Architecture

Capacity Planning

The practice of forecasting and provisioning storage, compute, and network resources for S3-based data systems based on projected data volumes, query patterns, ingestion rates, and growth trajectories.

Summary

Where it fits

Capacity planning is the operational discipline that keeps S3-based lakehouses from being either over-provisioned (wasting money) or under-provisioned (hitting throttling limits, running out of catalog capacity, or degrading query performance under load).

Misconceptions / Traps
  • S3 storage is effectively unlimited, but S3 request rates are not. Capacity planning must account for per-prefix request limits (at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per partitioned prefix), not just storage volume.
  • Catalog capacity is often the binding constraint. The Hive Metastore's backing database, Glue API rate limits, and Nessie commit throughput are all finite and must be planned for.
  • Data growth rate is not the same as metadata growth rate. A single streaming ingestion job can produce millions of small files (and millions of metadata entries) per day even if total data volume is modest.
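
The first and third traps above can be turned into quick arithmetic. The following is a minimal sketch, not a definitive sizing method: it uses the per-prefix limits quoted in the list, and the function names, the `headroom` safety factor, and the one-object-per-commit default are assumptions made for illustration. It also assumes load is spread evenly across prefixes, which real workloads may not achieve.

```python
import math

# Per-prefix request-rate limits quoted in the list above.
PUT_RPS_PER_PREFIX = 3_500
GET_RPS_PER_PREFIX = 5_500


def min_prefixes(peak_put_rps: float, peak_get_rps: float,
                 headroom: float = 0.7) -> int:
    """Smallest prefix count that keeps every prefix under its limits.

    headroom is a hypothetical safety factor: plan to use only this
    fraction of each limit at peak. Assumes evenly spread load.
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    for_puts = math.ceil(peak_put_rps / (PUT_RPS_PER_PREFIX * headroom))
    for_gets = math.ceil(peak_get_rps / (GET_RPS_PER_PREFIX * headroom))
    return max(1, for_puts, for_gets)


def files_per_day(writers: int, commits_per_hour: float,
                  files_per_commit: int = 1) -> int:
    """Objects created per day by a streaming job.

    Each object also adds at least one metadata entry, so this is a
    lower bound on daily metadata growth, independent of byte volume.
    """
    return math.ceil(writers * commits_per_hour * 24 * files_per_commit)
```

For example, `min_prefixes(20_000, 40_000)` yields 11 with the default headroom, and `files_per_day(1_000, 60)` yields 1,440,000 objects per day for a job with 1,000 writer tasks committing once a minute, regardless of how small each file is.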

Key Connections
  • scoped_to S3, Lakehouse — resource planning for S3-based data systems
  • constrains Request Amplification — capacity limits determine acceptable request patterns
  • constrains Metadata Overhead at Scale — catalog sizing must be planned
  • relates_to Benchmarking Methodology — benchmarks provide the data for capacity models

Definition

What it is

The discipline of forecasting storage, throughput, and API call requirements for S3-based data systems based on growth trends, ingestion rates, query patterns, and retention policies.

Why it exists

S3 scales elastically, but costs grow roughly in proportion to usage. Without capacity planning, organizations face surprise bills from unchecked data growth, unplanned request charges from large numbers of small files, and throttling when request rates exceed per-prefix limits.
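
To see why small files drive request charges, it can help to separate storage cost from request cost. The sketch below is illustrative only: the `Prices` type, the function names, and the assumption of one PUT per file (no multipart uploads) are hypothetical, and unit prices must be supplied by the caller from the current price list rather than taken from this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Unit prices supplied by the caller (no defaults on purpose)."""
    storage_per_gb_month: float
    per_1000_puts: float
    per_1000_gets: float


def puts_for_ingest(ingested_gb: float, avg_file_mb: float) -> int:
    """PUT requests needed to write ingested_gb as files of avg_file_mb.

    Assumes exactly one PUT per file, i.e. no multipart uploads.
    """
    if avg_file_mb <= 0:
        raise ValueError("avg_file_mb must be positive")
    return math.ceil(ingested_gb * 1024 / avg_file_mb)


def monthly_cost(stored_gb: float, puts: int, gets: int,
                 prices: Prices) -> dict:
    """Break a month's bill into storage, PUT, and GET components."""
    storage = stored_gb * prices.storage_per_gb_month
    put_cost = puts / 1000 * prices.per_1000_puts
    get_cost = gets / 1000 * prices.per_1000_gets
    return {
        "storage": storage,
        "puts": put_cost,
        "gets": get_cost,
        "total": storage + put_cost + get_cost,
    }
```

Writing 1,024 GB as 1 MB files takes 1,048,576 PUTs under these assumptions, while writing the same data as 128 MB files takes 8,192. The stored volume is identical, so the storage component does not change, but the request component differs by a factor of 128.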

Primary use cases

S3 cost forecasting, request rate planning for high-throughput ingestion, storage growth modeling for data lakes.
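
Storage growth modeling can start from a simple projection. The sketch below assumes ingestion grows by a fixed percentage each month and that ingested data expires after a fixed retention window. The function name and parameters are hypothetical, and treating the existing baseline as never expiring is a simplification.

```python
from typing import List, Optional


def project_storage(base_tb: float, first_month_ingest_tb: float,
                    monthly_growth: float, months: int,
                    retention_months: Optional[int] = None) -> List[float]:
    """Projected stored volume (TB) at the end of each month.

    base_tb: data already stored; assumed never to expire.
    monthly_growth: fractional month-over-month growth of ingestion,
        e.g. 0.05 for 5 percent.
    retention_months: if set, only the most recent N months of newly
        ingested data are kept; None means keep everything.
    """
    if retention_months is not None and retention_months < 1:
        raise ValueError("retention_months must be at least 1")
    ingests: List[float] = []
    totals: List[float] = []
    for month in range(months):
        ingests.append(first_month_ingest_tb * (1 + monthly_growth) ** month)
        # Without retention, all ingested data counts; with it, only
        # the trailing window of monthly ingests does.
        window = ingests if retention_months is None else ingests[-retention_months:]
        totals.append(base_tb + sum(window))
    return totals
```

Comparing a run with `retention_months=None` against one with a finite window shows how much of the forecast is driven by retention policy rather than by ingestion growth.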
