Benchmarking Methodology
The discipline of designing, executing, and reporting reproducible performance tests for S3-based data systems, covering throughput, latency, concurrency, cost efficiency, and scalability across storage, query, and ingestion layers.
Summary
Benchmarking methodology provides the measurement framework for evaluating design decisions across the S3 ecosystem — comparing table formats, query engines, file sizes, storage tiers, and compaction strategies with controlled, reproducible experiments.
- S3 performance is not deterministic. Request latency varies by prefix partition, time of day, and region load. Benchmarks must account for variance with multiple runs and percentile reporting (p50, p99), not just averages.
- Comparing table formats on the same benchmark (e.g., TPC-DS) does not capture real-world differences in maintenance cost, metadata overhead, or concurrent writer performance.
- Benchmark results from one S3 platform (e.g., AWS S3) do not transfer to another (e.g., MinIO, Cloudflare R2). S3 API compatibility does not imply performance equivalence.
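The first point above (multiple runs, percentile reporting) can be sketched as a small harness. This is a minimal illustration, not a prescribed tool: the names `run_benchmark`, `summarize`, and `percentile` are hypothetical, the nearest-rank percentile is only one of several common definitions, and `operation` stands in for whatever S3 request or query is being measured.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def percentile(samples: List[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def run_benchmark(operation: Callable[[], object], runs: int = 30, warmup: int = 3) -> List[float]:
    """Call `operation` repeatedly and return per-run wall-clock latencies in seconds."""
    for _ in range(warmup):  # warm-up runs are executed but not recorded
        operation()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Report the mean next to p50/p99 so tail latency is visible."""
    return {
        "runs": len(latencies),
        "mean": statistics.fmean(latencies),
        "p50": percentile(latencies, 50),
        "p99": percentile(latencies, 99),
        "max": max(latencies),
    }


# Synthetic example: 98 fast requests and 2 slow ones (values in seconds).
example = [0.020] * 98 + [0.900, 1.200]
report = summarize(example)
print(report)  # p50 is 0.020 and p99 is 0.9; the mean alone hides both facts
```

In the synthetic example the mean is roughly 0.04 s, which describes neither the typical request (0.020 s) nor the tail (0.9 s). That gap is why the percentile pair is reported alongside the average.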
scoped_to: S3, Lakehouse (performance measurement for S3-based systems)
constrains: Performance-per-Dollar (benchmarks quantify the cost/performance tradeoff)
relates_to: Capacity Planning (benchmark results feed capacity planning models)
relates_to: Cold Scan Latency (benchmarks measure scan latency under different configurations)
Definition
A standardized approach to measuring and comparing performance of S3-compatible storage systems, query engines, and table formats — covering throughput, latency, cost-per-query, and scalability under controlled conditions.
Vendor benchmarks are often misleading due to cherry-picked configurations and non-representative workloads. A rigorous benchmarking methodology enables apples-to-apples comparison of S3-based systems, informing architecture decisions with reproducible evidence.
Typical uses: storage system selection, query engine comparison, validation of performance claims, and capacity planning validation.
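The cost-per-query metric named in the definition can be illustrated with simple arithmetic. The sketch below is an assumption-laden example, not a pricing model from the source: the three-part cost breakdown (requests, data transfer, compute time), the class names `PriceSheet` and `QueryRun`, and every unit price are hypothetical placeholders that a real benchmark would replace with the provider's published rates and its own measured counters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    """Unit prices supplied by the caller (hypothetical placeholders, not real tariffs)."""
    per_1000_get_requests: float
    per_gb_transferred: float
    per_compute_second: float


@dataclass(frozen=True)
class QueryRun:
    """Counters measured for one benchmark query."""
    get_requests: int
    bytes_transferred: int
    compute_seconds: float


def cost_per_query(run: QueryRun, prices: PriceSheet) -> float:
    """Sum request, transfer, and compute cost for a single query."""
    request_cost = run.get_requests / 1000 * prices.per_1000_get_requests
    transfer_cost = run.bytes_transferred / 1e9 * prices.per_gb_transferred
    compute_cost = run.compute_seconds * prices.per_compute_second
    return request_cost + transfer_cost + compute_cost


def queries_per_dollar(run: QueryRun, prices: PriceSheet) -> float:
    """Performance-per-dollar view of the same measurement."""
    cost = cost_per_query(run, prices)
    if cost <= 0:
        raise ValueError("cost must be positive")
    return 1 / cost


# Placeholder prices chosen only to make the arithmetic easy to follow.
prices = PriceSheet(per_1000_get_requests=1.0, per_gb_transferred=2.0, per_compute_second=0.5)
run = QueryRun(get_requests=2000, bytes_transferred=3_000_000_000, compute_seconds=4.0)
print(cost_per_query(run, prices))      # 2.0 + 6.0 + 2.0
print(queries_per_dollar(run, prices))
```

Reporting the counters and the price sheet next to the final figure lets a reader recompute the cost under a different tariff, which keeps the comparison reproducible.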
Recent developments
- MLPerf Storage v2.0 (August 2025): 200+ results from 26 submitting organizations, including Alluxio, DDN, Hammerspace, HPE, Huawei, Juicedata, KIOXIA, Micron, Oracle, Samsung, IBM, WDC, YanRong, and others. As a vendor-neutral benchmark, it closes the "every vendor uses their own benchmark" gap for AI storage. Per MLCommons, MLPerf Storage v2.0 Results Announcement.
- Tested systems serve roughly 2× the accelerators of v1.0. Storage performance roughly doubled in one benchmark cycle, a measured rate of AI-storage capability improvement that is useful for long-horizon capacity planning. Per MLCommons, MLPerf Storage v2.0 Results.
- v2.0 adds checkpointing benchmarks that replicate real-world AI training systems. Pure throughput numbers do not reflect production reality; the checkpointing tests measure the actual save-and-restore pattern that bounds end-to-end training time. Per MLCommons, MLPerf Storage v2.0 Results.
- Architecture diversity: 6 local, 2 in-storage-accelerator, 13 software-defined, 12 block, 16 on-prem shared, and 2 object stores, for 51 distinct storage architectures in one benchmark round. This breadth lets practitioners compare across very different deployment shapes (cloud block, on-prem SAN, software-defined, object). Per MLCommons, MLPerf Storage v2.0 Results.
- JuiceFS topped the bandwidth utilization and scalability tables for AI training in v2.0, a useful reference point when evaluating S3-backed file systems for AI training. Per JuiceFS Blog, MLPerf Storage v2.0: JuiceFS Leads in Bandwidth Utilization and Scalability.
- Architecture-neutral, representative, reproducible: MLCommons frames the MLPerf Storage methodology around three principles: vendor-neutral architecture coverage, representative AI workloads (training and checkpointing), and a reproducible result format. Together these set the 2026 standard for a "credible storage benchmark." Per MLCommons Working Group, MLPerf Storage.
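The save-and-restore pattern mentioned in the checkpointing item can be timed with a very small sketch. This is not the MLPerf Storage harness and makes no claim about how MLCommons implements its tests: the function name `time_checkpoint_cycle` is hypothetical, a local temporary file stands in for the storage system under test, and random bytes stand in for model state.

```python
import os
import tempfile
import time
from typing import Dict


def time_checkpoint_cycle(payload: bytes, directory: str) -> Dict[str, float]:
    """Write a checkpoint, read it back, and time each phase separately."""
    path = os.path.join(directory, "checkpoint.bin")

    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to persist the data before stopping the clock
    save_seconds = time.perf_counter() - start

    start = time.perf_counter()
    with open(path, "rb") as f:
        restored = f.read()  # may be served from the OS page cache on a local disk
    restore_seconds = time.perf_counter() - start

    if restored != payload:
        raise RuntimeError("restored checkpoint differs from saved payload")

    return {
        "megabytes": len(payload) / 1e6,
        "save_seconds": save_seconds,
        "restore_seconds": restore_seconds,
        "cycle_seconds": save_seconds + restore_seconds,
    }


# Example: one 8 MiB checkpoint cycle in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    result = time_checkpoint_cycle(os.urandom(8 * 1024 * 1024), tmp)
    print(result)
```

Reporting save and restore separately matters because the two phases stress storage differently; a single throughput number would hide which phase dominates the cycle. On a local disk the restore is likely to hit the page cache, so a real test would need to control for caching.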
Resources
TPC benchmarks (TPC-H, TPC-DS) are the standard analytical workload benchmarks used to evaluate query engine performance on S3-backed lakehouses.
DataFusion benchmark suite providing TPC-H and micro-benchmark implementations for evaluating query performance against object storage.
ClickBench provides standardized analytical benchmarks comparing query engines including those that run against S3-hosted Parquet data.