Pain Point

Performance-per-Dollar

Summary

What it is

The composite metric that evaluates S3-based data system efficiency by normalizing query throughput, scan latency, or ingestion rate against total cost (storage, requests, compute, egress, and caching), enabling apples-to-apples comparison of architectural choices.

Where it fits

Performance-per-dollar is the ultimate evaluation criterion for S3-based architecture decisions. Choosing between Parquet and ORC, Iceberg and Delta, Trino and Spark, or AWS S3 and MinIO should be grounded in measured performance-per-dollar, not raw performance alone.

Misconceptions / Traps
  • Raw performance benchmarks (queries per second, scan throughput) say little without cost context. A system that is 2x faster but 5x more expensive delivers only 0.4x the performance per dollar, so it is not the better choice on this metric.
  • Cost in S3-based systems has many components: storage per GB, request pricing, compute (spot vs on-demand), egress, and metadata API calls. Benchmarks that omit any component are misleading.
  • Performance-per-dollar changes with scale. A system that is cost-efficient at 1 TB may be uneconomical at 1 PB due to metadata overhead, request amplification, or catalog limits.
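
The first two traps can be made concrete with a little arithmetic. The sketch below is illustrative only: the `CostBreakdown` class, the `perf_per_dollar` helper, and every dollar and query figure are hypothetical placeholders, not real prices or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostBreakdown:
    """Hypothetical monthly cost components for an S3-based system, in USD."""
    storage: float
    requests: float
    compute: float
    egress: float
    metadata_api: float

    def total(self) -> float:
        # Omitting any component understates the denominator of the metric.
        return (self.storage + self.requests + self.compute
                + self.egress + self.metadata_api)


def perf_per_dollar(queries_per_month: float, cost: CostBreakdown) -> float:
    """Queries completed per dollar of total monthly cost."""
    return queries_per_month / cost.total()


# Placeholder figures: system B is 2x faster but 5x more expensive overall.
system_a = CostBreakdown(storage=200.0, requests=50.0, compute=700.0,
                         egress=40.0, metadata_api=10.0)
system_b = CostBreakdown(storage=1000.0, requests=250.0, compute=3500.0,
                         egress=200.0, metadata_api=50.0)

ppd_a = perf_per_dollar(100_000, system_a)
ppd_b = perf_per_dollar(200_000, system_b)
print(f"A: {ppd_a:.1f} queries/$, B: {ppd_b:.1f} queries/$, "
      f"ratio B/A = {ppd_b / ppd_a:.2f}")
```

With these placeholder numbers the faster system ends up at 0.4 times the queries per dollar of the slower one, which is the 2x-faster, 5x-costlier case from the first bullet.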

Key Connections
  • scoped_to S3, Lakehouse — cost efficiency across S3-based systems
  • depends_on Benchmarking Methodology — measured by controlled benchmarks
  • constrains Request Pricing Models — request costs are a key component
  • constrains Egress Cost — egress is a significant cost factor in multi-region designs

Definition

What it is

The metric of query throughput, latency, or processing speed normalized to total cost (storage + compute + API calls + egress) for S3-based data systems, used to compare architectures, engines, and storage configurations.
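
One way to write this definition down, using symbols that are introduced here purely for illustration and do not appear in the source:

```latex
\[
\text{PPD} \;=\; \frac{P}{C_{\text{storage}} + C_{\text{compute}} + C_{\text{API}} + C_{\text{egress}}}
\]
```

Here $P$ is a higher-is-better performance figure (for example queries completed, or bytes scanned) measured over the same period in which the four cost terms were incurred. When the raw measurement is a latency, its reciprocal is the natural choice for $P$, so that a larger PPD always means a better result.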

Recent developments

Latest signals
  • TPC-DS 10TB 2025-2026 results: Trino ~17s/query avg vs Spark ~38s/query avg. Recent TPC-DS benchmarks on 10TB datasets place Trino at ~17.46s average query latency vs Spark at ~38.24s, which makes Trino about 2.2× faster. The "Trino is faster" argument is now data-backed for analytical workloads. Per Hive on MR3, "TPC-DS Benchmark: Trino 476, Spark 4.0.0, Hive 4 on MR3 2.1".
  • Starburst Enterprise (Trino) 2.5×-7.1× faster than EMR alternatives in a cloud-deployment benchmark. Starburst Enterprise vs AWS EMR: 2.5× faster than EMR Presto, 3.9× faster than EMR Spark, and 7.1× faster than EMR Hive. Performance-per-dollar improves further once you factor in that slower-completing jobs are billed at similar EC2 rates for longer. Per Concurrency Labs, "Querying 6.35B Records: TPC-DS Performance + Cost Comparison Starburst Enterprise vs EMR".
  • StarRocks publishes TPC-DS benchmarks against the field. Open-source StarRocks (an analytics-focused MPP engine) now publishes its own TPC-DS results, which extends the engine-comparison field beyond the Trino/Spark/Hive trio. Per StarRocks Docs, "TPC-DS Benchmarking".
  • Databricks SQL ships TPC-DS evaluation tooling natively. Databricks on AWS lets users run TPC-DS against their own deployments to validate cost-performance against the spec, which closes the gap between published benchmarks and customer-environment numbers. Per Databricks Docs, "Use TPC-DS Sample Dataset to Evaluate System Performance".
  • Cost-efficiency calculation requires combining benchmark results with current cloud pricing. Critical 2026 framing: raw benchmark numbers (sec/query) do not equal performance-per-dollar; they must be combined with hourly EC2/GCE pricing, query concurrency, and idle-cluster cost. Many published benchmarks omit the cost-side math, so practitioners must do it themselves. Per Hive on MR3, "TPC-DS Benchmark 476 + Spark 4.0.0".
  • IBM ships open-source spark-tpc-ds-performance-test for reproducible Spark benchmarking. The open-source repo IBM/spark-tpc-ds-performance-test lets teams reproducibly run TPC-DS against their own Spark deployment configs, which gives them a trust-but-verify check on vendor-published benchmark numbers. Per GitHub, IBM/spark-tpc-ds-performance-test.
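
The cost-side math that published benchmarks tend to leave out can be sketched roughly as follows. Only the two average latencies (17.46 s and 38.24 s) come from the signals listed here; the hourly cluster rate, the concurrency level, and the utilization fraction are made-up assumptions, and `cost_per_query` is a hypothetical helper rather than part of any benchmark toolkit.

```python
def cost_per_query(avg_latency_s: float,
                   cluster_usd_per_hour: float,
                   concurrency: int = 1,
                   utilization: float = 1.0) -> float:
    """Estimate USD per query for a cluster billed by the hour.

    concurrency: queries assumed to run at the same time at the stated
        latency (an idealization; real latency usually degrades under load).
    utilization: fraction of billed hours spent running queries; the rest is
        idle-cluster cost that still has to be spread over the queries.
    """
    if concurrency < 1 or not 0.0 < utilization <= 1.0:
        raise ValueError("concurrency must be >= 1 and utilization in (0, 1]")
    queries_per_busy_hour = concurrency * 3600.0 / avg_latency_s
    queries_per_billed_hour = queries_per_busy_hour * utilization
    return cluster_usd_per_hour / queries_per_billed_hour


HOURLY_RATE = 20.0  # assumed USD/hour, taken to be identical for both clusters

trino = cost_per_query(17.46, HOURLY_RATE, concurrency=4, utilization=0.5)
spark = cost_per_query(38.24, HOURLY_RATE, concurrency=4, utilization=0.5)
print(f"Trino: ${trino:.4f}/query, Spark: ${spark:.4f}/query")
print(f"Queries per dollar: Trino {1 / trino:.1f}, Spark {1 / spark:.1f}")
```

When the hourly rate, concurrency, and utilization are the same for both engines, the cost-per-query ratio simply equals the latency ratio. The comparison only shifts once those inputs differ between deployments, which is why they need to be measured rather than assumed.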
