Pain Point

Small File I/O Storm

Definition

What it is

The dominant performance pathology in S3-based data systems: a workload in which millions of small objects (typically <1 MB each) produce an I/O profile dominated by per-request latency, defeating S3's throughput-oriented design. Three costs compound. Each S3 LIST call returns at most 1,000 objects, so listing 1M objects takes 1,000 round-trips. Each GET pays the same per-request HTTP, TLS, and S3-routing overhead regardless of object size. And analytical query engines must open every data file individually before they can read it. Net effect: an Athena query over 582K small files took **40 seconds**; the same data compacted into 336 files of 247 MB each ran in **9.7 seconds**, a reduction of about 75% at the engine level alone.
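The LIST arithmetic above can be sketched in a few lines. This is a minimal illustration, not a measurement: the 1,000-key page limit comes from the text, while `per_call_latency_s` is a hypothetical placeholder for whatever per-call latency is observed in a given environment.

```python
LIST_PAGE_SIZE = 1_000  # maximum keys returned by one S3 LIST call (from the text)


def list_round_trips(object_count: int, page_size: int = LIST_PAGE_SIZE) -> int:
    """Number of LIST calls needed to enumerate object_count keys."""
    if object_count < 0:
        raise ValueError("object_count must be non-negative")
    # Integer ceiling division: a partially filled last page still costs one call.
    return (object_count + page_size - 1) // page_size


def listing_seconds(object_count: int, per_call_latency_s: float) -> float:
    """Time to list all keys if calls are issued one after another.

    per_call_latency_s is an assumed value supplied by the caller.
    """
    return list_round_trips(object_count) * per_call_latency_s


if __name__ == "__main__":
    print(list_round_trips(1_000_000))  # 1000
    print(list_round_trips(582_000))    # 582
    print(list_round_trips(336))        # 1
```

The compacted layout from the benchmark (336 files) fits in a single LIST page, while the uncompacted one (582K files) needs 582 pages before a single byte of data is read.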

Why it exists

S3 is built to optimize aggregate throughput on large objects (multi-megabyte reads), not single-object latency. Every object operation incurs a fixed cost (HTTP request, signature verification, S3 internal lookup, network round-trip) that is effectively constant whether the payload is 1 KB or 100 MB. Small objects therefore pay the full overhead while delivering tiny payloads, which destroys the throughput-per-request ratio. The pattern emerges naturally from streaming-ingestion pipelines (Kafka → S3 sink, IoT telemetry, CDC events) that emit many small objects per minute with no compaction step.
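A simple cost model makes the fixed-overhead argument concrete. The sketch below assumes each request costs a constant overhead plus a transfer time proportional to object size. Both `OVERHEAD_S` and `BANDWIDTH_MB_S` are illustrative assumptions, not measured S3 figures.

```python
OVERHEAD_S = 0.030      # assumed fixed per-request cost in seconds (hypothetical)
BANDWIDTH_MB_S = 100.0  # assumed per-connection streaming rate in MB/s (hypothetical)


def effective_throughput_mb_s(
    object_mb: float,
    overhead_s: float = OVERHEAD_S,
    bandwidth_mb_s: float = BANDWIDTH_MB_S,
) -> float:
    """Payload delivered per second of wall time for a single GET.

    Modeled request time = fixed overhead + object size / bandwidth.
    """
    transfer_s = object_mb / bandwidth_mb_s
    return object_mb / (overhead_s + transfer_s)


if __name__ == "__main__":
    # Sizes chosen to span the range discussed in the text (1 KB up to 247 MB).
    for size_mb in (0.001, 0.14, 1.0, 100.0, 247.0):
        rate = effective_throughput_mb_s(size_mb)
        share = rate / BANDWIDTH_MB_S
        print(f"{size_mb:>8.3f} MB -> {rate:8.2f} MB/s ({share:.1%} of link rate)")
```

Under these assumed numbers, small objects reach only a small fraction of the link rate because the fixed overhead dominates, while objects in the hundreds of megabytes approach it.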

Primary use cases

*(As a pain point, this lists what triggers it:)* streaming ingestion into S3 lakehouse tables without compaction; log, event, and telemetry pipelines that micro-batch into S3 every few seconds; CDC pipelines emitting per-row delete files; AI training pipelines writing per-sample checkpoint metadata; and any workload whose natural object emission rate creates millions of small files before compaction can run.
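A back-of-the-envelope calculation shows how quickly micro-batching accumulates objects. The sketch assumes every writer emits exactly one object per flush. The writer count and flush interval are hypothetical inputs, not figures from any cited pipeline.

```python
SECONDS_PER_DAY = 86_400


def objects_per_day(writers: int, flush_interval_s: int) -> int:
    """Objects emitted per day if each writer flushes one object per interval."""
    if writers < 0 or flush_interval_s <= 0:
        raise ValueError("writers must be >= 0 and flush_interval_s must be > 0")
    return writers * (SECONDS_PER_DAY // flush_interval_s)


def days_to_reach(target_objects: int, writers: int, flush_interval_s: int) -> float:
    """Days of uncompacted ingestion needed to accumulate target_objects."""
    return target_objects / objects_per_day(writers, flush_interval_s)


if __name__ == "__main__":
    # Hypothetical pipeline: 50 parallel writers flushing every 5 seconds.
    print(objects_per_day(50, 5))  # 864000
    print(round(days_to_reach(1_000_000, 50, 5), 2))
```

With these assumed inputs the pipeline crosses one million objects in a little over a day, which is why compaction has to be scheduled as part of ingestion rather than as an occasional cleanup.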

Recent developments

Latest signals
  • The canonical Athena benchmark: 582K small files → 336 files = 75% query-time reduction. AWS-published benchmark of an Athena query on 582K small CloudTrail files (0.14 MB each) at 40 seconds vs 336 files (247 MB each) at 9.7 seconds. Per Upsolver — Small File Problem on S3.
  • S3 Tables compaction: 3.2× query acceleration + 8.5× fewer read requests vs uncompacted 1 MB files. Amazon S3 Tables (re:Invent 2024 GA) bundles automatic Iceberg compaction; their published benchmarks show 3.2× query acceleration and 8.5× read-request reduction compared to uncompacted 1 MB-file baselines. Per AWS — Iceberg optimization small files in EMR.
  • Compaction is the canonical fix. Run a maintenance job asynchronously to ingestion: read N small files, rewrite as fewer larger files (target 100-512 MB). Per Uplatz — Compaction Strategies for Object Storage.
  • 20× cost surprise with managed S3 Tables compaction. Onehouse's 2026 analysis documented operators hitting up to 20× higher costs than expected on managed S3 Tables when compaction processing fees are factored in — managed compaction isn't free. Per Onehouse — S3 Managed Tables 20x Surprise.
  • Existing site guide. The llms3.com Small Files Problem guide already covers the structural causes + mitigation menu (compaction, batching, tiered ingestion). Per LLMS3 guide — Small Files Problem.
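The compaction step described in the list above (read N small files, rewrite as fewer larger files) starts with a planning pass that decides which inputs go into which output file. The following is a minimal sketch of that planning pass only, using greedy packing toward a target size. The function name, the `(key, size)` input shape, and the 256 MB default (chosen from within the 100-512 MB range quoted above) are all assumptions, and real table formats add constraints such as partition boundaries and delete files that are ignored here.

```python
from typing import Iterable, List, Tuple

MB = 1024 * 1024


def plan_compaction(
    files: Iterable[Tuple[str, int]],
    target_bytes: int = 256 * MB,
) -> List[List[str]]:
    """Greedily group (key, size_bytes) pairs into batches of at most target_bytes.

    Each returned batch is a list of keys meant to be rewritten as one output
    file. A single input larger than target_bytes gets a batch of its own.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for key, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(key)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Hypothetical input: 10,000 objects of 140,000 bytes each.
    small_files = [(f"part-{i:05d}.json", 140_000) for i in range(10_000)]
    plan = plan_compaction(small_files)
    print(len(plan))     # 6
    print(len(plan[0]))  # 1917
```

A planner like this would typically run per partition and asynchronously to ingestion, so that writers are never blocked while the rewrite happens.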
