Architecture

Benchmarking Methodology

The discipline of designing, executing, and reporting reproducible performance tests for S3-based data systems, covering throughput, latency, concurrency, cost efficiency, and scalability across storage, query, and ingestion layers.

Summary

What it is

Where it fits

Benchmarking methodology provides the measurement framework for evaluating design decisions across the S3 ecosystem — comparing table formats, query engines, file sizes, storage tiers, and compaction strategies with controlled, reproducible experiments.

Misconceptions / Traps

S3 performance is not deterministic. Request latency varies by prefix partition, time of day, and region load. Benchmarks must account for variance with multiple runs and percentile reporting (p50, p99), not just averages.
Running table formats through the same benchmark (e.g., TPC-DS) makes them comparable on that workload, but it does not capture real-world differences in maintenance cost, metadata overhead, or concurrent-writer performance.
Benchmark results from one S3 implementation (e.g., AWS S3) do not transfer to another (e.g., MinIO, R2). S3 API compatibility does not imply performance equivalence.
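
The variance point above can be made concrete with a small sketch. It assumes per-request latencies (in milliseconds) have already been collected for several independent runs; the function names `percentile` and `summarize_runs` are hypothetical, and the nearest-rank percentile used here is one common convention among several, not a prescribed method.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_runs(runs):
    """Summarize latencies (ms) from several independent benchmark runs.

    `runs` is a list of lists, one inner list of request latencies per run.
    Reports per-run and pooled percentiles, plus the run-to-run spread of
    p99 as a rough indicator of how stable the tail is.
    """
    per_run = [
        {
            "p50": percentile(r, 50),
            "p99": percentile(r, 99),
            "mean": statistics.fmean(r),
        }
        for r in runs
    ]
    pooled = [x for r in runs for x in r]
    p99s = [s["p99"] for s in per_run]
    return {
        "per_run": per_run,
        "pooled_p50": percentile(pooled, 50),
        "pooled_p99": percentile(pooled, 99),
        "pooled_mean": statistics.fmean(pooled),
        "p99_spread": max(p99s) - min(p99s),
    }


if __name__ == "__main__":
    # Synthetic data for illustration only, not measured S3 latencies.
    steady = list(range(1, 101))
    spiky = [10] * 98 + [400, 500]
    print(summarize_runs([steady, spiky]))
```

In the synthetic `spiky` run, the mean (about 18.8 ms) sits well above the median (10 ms) and far below the p99 (400 ms), which is the kind of gap an averages-only report would hide.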

Key Connections

scoped_to S3, Lakehouse — performance measurement for S3-based systems
constrains Performance-per-Dollar — benchmarks quantify the cost/performance tradeoff
relates_to Capacity Planning — benchmark results feed capacity planning models
relates_to Cold Scan Latency — benchmarks measure scan latency under different configurations
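
As an illustration of how a benchmark run can be turned into a performance-per-dollar figure, here is a minimal sketch. The `RunCost` fields and both functions are hypothetical, the cost model covers only compute time and GET requests, and every price in the example is a placeholder, not a published rate; a real model would also need storage, egress, and other request classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCost:
    """Inputs recorded for one benchmark run (all names are illustrative)."""
    wall_seconds: float          # elapsed time of the run
    compute_usd_per_hour: float  # assumed price of the compute used
    get_requests: int            # GET requests issued during the run
    usd_per_1k_get: float        # assumed price per 1,000 GET requests
    queries: int                 # queries completed in the run


def cost_per_query(run: RunCost) -> float:
    """Compute cost plus request cost, divided by completed queries."""
    if run.queries <= 0:
        raise ValueError("run completed no queries")
    compute = run.wall_seconds / 3600 * run.compute_usd_per_hour
    requests = run.get_requests / 1000 * run.usd_per_1k_get
    return (compute + requests) / run.queries


def queries_per_dollar(run: RunCost) -> float:
    """Inverse view of the same number: higher is better."""
    return 1 / cost_per_query(run)


if __name__ == "__main__":
    # Placeholder numbers chosen for easy arithmetic, not real prices.
    run = RunCost(
        wall_seconds=1800,
        compute_usd_per_hour=4.0,
        get_requests=500_000,
        usd_per_1k_get=0.0004,
        queries=100,
    )
    print(f"cost per query: ${cost_per_query(run):.4f}")
    print(f"queries per dollar: {queries_per_dollar(run):.1f}")
```

Reporting both the raw inputs and the derived figure keeps the comparison reproducible when prices change.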

Definition

What it is

A standardized approach to measuring and comparing performance of S3-compatible storage systems, query engines, and table formats — covering throughput, latency, cost-per-query, and scalability under controlled conditions.

Why it exists

Vendor benchmarks are often misleading due to cherry-picked configurations and non-representative workloads. A rigorous benchmarking methodology enables apples-to-apples comparison of S3-based systems, informing architecture decisions with reproducible evidence.

Primary use cases

Storage system selection, query engine comparison, validating performance claims, capacity planning validation.

Recent developments

Latest signals

MLPerf Storage v2.0 (August 2025): 200+ results from 26 submitting organizations. Submitters include Alluxio, DDN, Hammerspace, HPE, Huawei, Juicedata, KIOXIA, Micron, Oracle, Samsung, IBM, WDC, YanRong, and others. As a vendor-neutral benchmark, it closes the "every vendor uses their own benchmark" gap for AI storage. Per MLCommons, MLPerf Storage v2.0 Results Announcement.
Tested systems serve ~2× the accelerators of v1.0. By that measure, storage capability roughly doubled in one benchmark cycle, a measured rate of improvement that is useful for long-horizon capacity planning. Per MLCommons, MLPerf Storage v2.0 Results.
v2.0 adds checkpointing benchmarks that replicate real-world AI training systems. Pure throughput numbers don't reflect production reality; the checkpointing tests measure the actual "save + restore" pattern that bounds end-to-end training time. Per MLCommons, MLPerf Storage v2.0 Results.
Architecture diversity: 6 local, 2 in-storage-accelerator, 13 software-defined, 12 block, 16 on-prem shared, and 2 object-store solutions, 51 in total in one benchmark round. The breadth lets practitioners compare across very different deployment shapes (cloud block, on-prem SAN, software-defined, object). Per MLCommons, MLPerf Storage v2.0 Results.
JuiceFS leads bandwidth utilization and scalability for AI training in v2.0, according to JuiceFS's own write-up of its results. This is a useful reference point when evaluating S3-backed file systems for AI training. Per JuiceFS Blog, "MLPerf Storage v2.0: JuiceFS Leads in Bandwidth Utilization and Scalability".
Architecture-neutral, representative, reproducible: MLPerf Storage's design discipline. MLCommons frames the methodology around three principles: vendor-neutral architecture coverage, representative AI workloads (training + checkpointing), and a reproducible result format. Together these set the 2026 bar for a "credible storage benchmark." Per MLCommons Working Group, MLPerf Storage.
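
The "save + restore" pattern mentioned in the checkpointing item can be illustrated with a small timing sketch. This is not the MLPerf Storage harness: `time_checkpoint_cycle` is a hypothetical helper that writes a payload to a local directory and reads it back, whereas a real test would target the storage system under evaluation and use realistic checkpoint sizes.

```python
import os
import tempfile
import time


def time_checkpoint_cycle(payload: bytes, directory: str) -> dict:
    """Time one checkpoint save followed by one restore.

    The save is timed through fsync so the write is handed to the storage
    device, not just buffered. The restore may still be served from the OS
    page cache, so a real benchmark would drop caches or read from a
    different host before timing it.
    """
    path = os.path.join(directory, "checkpoint.bin")

    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    save_s = time.perf_counter() - start

    start = time.perf_counter()
    with open(path, "rb") as f:
        restored = f.read()
    restore_s = time.perf_counter() - start

    if restored != payload:
        raise RuntimeError("restored checkpoint does not match saved data")

    return {
        "bytes": len(payload),
        "save_s": save_s,
        "restore_s": restore_s,
        "total_s": save_s + restore_s,
    }


if __name__ == "__main__":
    # 8 MiB of random bytes stands in for a model checkpoint.
    with tempfile.TemporaryDirectory() as tmp:
        print(time_checkpoint_cycle(os.urandom(8 * 1024 * 1024), tmp))
```

Reporting save and restore separately, as well as their sum, shows which half of the cycle dominates on a given storage system.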