Guide 27

SIMD and the C++ Query Engine Revolution

Problem Framing

Java-based query engines (Trino, Spark) dominate the lakehouse ecosystem but impose structural performance ceilings: JVM garbage collection pauses during large S3 fetches, row-at-a-time or small-batch execution models that underutilize modern CPU pipelines, and limited ability to exploit SIMD (Single Instruction, Multiple Data) instructions for vectorized data processing. C++ MPP engines (StarRocks, ClickHouse) and embedded engines (DuckDB, DataFusion) use SIMD-vectorized execution, columnar memory layouts, and local NVMe caching to deliver substantially faster analytics directly against open table formats on S3. Engineers need to understand the architectural reasons for this performance gap and how to evaluate these engines for their workload.

Relevant Nodes

  • Topics: S3, Lakehouse
  • Technologies: Trino, StarRocks, ClickHouse, Apache Spark, DuckDB, DataFusion, Velox, Apache Iceberg
  • Architectures: Separation of Storage and Compute
  • Pain Points: Cold Scan Latency, Performance-per-Dollar, Cache ROI

Decision Path

  1. Profile your current query latency bottleneck. Determine whether your queries are bottlenecked on S3 I/O (network-bound), CPU processing (compute-bound), or memory pressure (GC-bound). JVM-based engines are most disadvantaged when queries are compute-bound on large in-memory datasets where GC pauses and lack of SIMD dominate.

    • Use query profiling tools (Trino query plan, Spark UI) to identify the bottleneck phase.
    • If your queries are purely I/O-bound on cold S3 data, switching engines may not help — the bottleneck is network, not CPU.
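To make the triage concrete, here is a minimal sketch that classifies a query from its per-phase time breakdown. The profile keys and function name are hypothetical; real profilers (Trino's JSON query plan, the Spark UI) report phases under different names.

```python
def classify_bottleneck(profile):
    """Name the dominant phase of a query so you know whether an
    engine swap can help. `profile` maps phase names to seconds spent
    (hypothetical keys, for illustration only)."""
    phases = {
        "network-bound": profile["s3_io_s"],     # waiting on S3 GETs
        "compute-bound": profile["cpu_s"],       # scan/filter/agg CPU time
        "GC-bound": profile["gc_pause_s"],       # JVM stop-the-world pauses
    }
    # The phase consuming the most wall-clock time is the bottleneck.
    return max(phases, key=phases.get)
```

If `classify_bottleneck` returns "network-bound", a C++ engine alone will not close the gap; pair it with the caching step below.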
  2. Understand SIMD and vectorized execution. SIMD instructions process 4, 8, or 16 values per instruction instead of one. C++ engines compile query operators to SIMD instructions (256-bit AVX2, 512-bit AVX-512) that operate on columnar batches of 1,024–4,096 rows. This eliminates per-row function call overhead and exploits CPU cache locality.

    • Velox (Meta's C++ execution library) provides SIMD-vectorized operators that can be integrated into multiple engines.
    • DuckDB implements its own vectorized engine optimized for single-machine execution.
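The batch-at-a-time execution model can be sketched in plain Python. This is conceptual only: the function names are invented for illustration, and a real C++ engine runs the inner comparison and addition as SIMD lanes over contiguous memory, not as interpreted code.

```python
BATCH_SIZE = 1024  # a typical vector size in engines like DuckDB and Velox

def scalar_filter_sum(rows):
    """Row-at-a-time model: one value per iteration, per-row overhead."""
    total = 0
    for r in rows:
        if r > 0:
            total += r
    return total

def vectorized_filter_sum(column):
    """Batch-at-a-time model: operate on 1,024-value chunks of a column.
    A C++ engine would evaluate the predicate across AVX2/AVX-512 lanes
    and keep the batch hot in L1/L2 cache between operators."""
    total = 0
    for start in range(0, len(column), BATCH_SIZE):
        batch = column[start:start + BATCH_SIZE]
        # Selection vector: positions in the batch that pass the predicate.
        sel = [i for i, v in enumerate(batch) if v > 0]
        total += sum(batch[i] for i in sel)
    return total
```

Both functions compute the same answer; the second organizes the work the way a vectorized engine does, so each operator touches a cache-resident batch instead of one row at a time.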
  3. Compare JVM overhead vs. C++ for your dataset scale. At small scales (under 100 GB), the difference is negligible. At medium scales (100 GB–10 TB), C++ engines typically deliver 2–5x lower latency. At large scales (10+ TB), the gap widens further because JVM GC pauses become more frequent as heap sizes grow.

    • StarRocks and ClickHouse can process Iceberg tables on S3 with latencies comparable to querying local databases.
  4. Evaluate StarRocks and ClickHouse Iceberg support. Both engines now support reading Iceberg tables directly from S3:

    • StarRocks: native Iceberg catalog integration, supports partition pruning, predicate pushdown, and manifest caching.
    • ClickHouse: Iceberg table function and engine, supports S3-backed tables with local caching.
    • Both are read-optimized — write/maintenance operations still require Spark or Flink.
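The min/max-based file skipping behind partition pruning and predicate pushdown can be sketched as follows. The manifest shape here is a simplification invented for illustration; Iceberg's real metadata stores per-column bounds in Avro manifest files.

```python
def prune_files(manifest, column, op, literal):
    """Keep only data files whose min/max range for `column` could
    contain rows matching the predicate; everything else is skipped
    without issuing an S3 GET. (Simplified illustration.)"""
    kept = []
    for f in manifest:
        lo, hi = f["min"][column], f["max"][column]
        if op == ">" and hi > literal:
            kept.append(f["path"])
        elif op == "<" and lo < literal:
            kept.append(f["path"])
        elif op == "=" and lo <= literal <= hi:
            kept.append(f["path"])
    return kept
```

On a well-partitioned table, this is where most of the win comes from: a selective predicate lets the engine skip the bulk of the data files before any bytes leave S3.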
  5. Configure local NVMe caching for hot data. C++ engines benefit from local SSD caching that reduces repeated S3 GETs. Configure a cache tier on NVMe storage:

    • StarRocks: built-in cache manager with LRU eviction and cache warming APIs.
    • ClickHouse: filesystem cache on local SSD, configurable per-table.
    • Cache hit rates above 80% can reduce query latency by 5–10x compared to cold S3 reads.
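A toy model of the LRU behavior such a cache tier implements. The class and its interface are invented for illustration; real engines manage block files on NVMe, not in-memory dicts.

```python
from collections import OrderedDict

class NVMeCacheSim:
    """LRU cache over S3 object blocks, tracking hit rate (toy model)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # insertion order doubles as recency order
        self.hits = 0
        self.misses = 0

    def get(self, key, fetch_from_s3):
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.blocks[key]
        self.misses += 1
        value = fetch_from_s3(key)        # cold path: pay the S3 round trip
        self.blocks[key] = value
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Monitoring the equivalent of `hit_rate()` in your engine's metrics is how you verify the cache is sized correctly for your working set before trusting latency numbers.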
  6. Benchmark with your actual workload before committing. Synthetic benchmarks (TPC-H, TPC-DS) favor C++ engines heavily, but real workloads differ. Test with your actual queries, data distribution, concurrency level, and S3 region latency.

    • Pay attention to concurrent query performance: some C++ engines optimize single-query speed at the cost of throughput under high concurrency.
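A minimal timing harness for such a test, assuming `run_query` wraps a call to your engine's client (the names here are illustrative). It reports p50/p95 rather than the mean, since tail latency under load is usually what differentiates engines.

```python
import statistics
import time

def benchmark(run_query, iterations=20, warmup=3):
    """Time a query callable and return p50/p95 wall-clock latency
    in seconds. Warmup runs populate caches so cold-S3 effects don't
    contaminate steady-state numbers; benchmark cold and warm separately."""
    for _ in range(warmup):
        run_query()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50": statistics.median(samples),
        "p95": samples[int(0.95 * (len(samples) - 1))],
    }
```

Run the same harness from several threads or processes to approximate your production concurrency level, and compare p95 across engines rather than single-stream p50.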

What Changed Over Time

  • Spark (2014) and Trino/Presto (2013) established the JVM-based distributed SQL pattern for data lakes, accepting GC overhead as the price of developer productivity and ecosystem breadth.
  • ClickHouse (open-sourced 2016) proved that C++ vectorized execution could deliver orders-of-magnitude performance gains for analytical workloads, initially on local storage.
  • StarRocks (open-sourced 2021, commercially backed by CelerData) brought C++ vectorized execution to the lakehouse pattern with native Iceberg and Hudi support.
  • Meta's Velox library (2022) extracted vectorized execution into a reusable C++ library, enabling any engine to adopt SIMD processing.
  • DuckDB (2019, production adoption 2023–2025) demonstrated that embedded C++ engines could handle multi-GB analytical workloads on a single machine, often faster than distributed JVM clusters.

Sources