Technology

DataFusion

An extensible query execution framework written in Rust, built on Apache Arrow, that provides a SQL query planner and execution engine for building custom analytics applications over S3-stored data.


Summary

What it is

An extensible query execution framework written in Rust, built on Apache Arrow, that provides a SQL query planner and execution engine for building custom analytics applications over S3-stored data.

Where it fits

DataFusion is the embedded query engine layer used by projects like Ballista, InfluxDB IOx, and Delta-rs. Rather than being a standalone analytics product, it is the foundation that other S3-native tools build upon for SQL query planning and columnar execution.

Misconceptions / Traps

DataFusion is a library, not a database. It provides query planning and execution but requires integration work to become a deployable analytics system.
DataFusion's Rust implementation offers memory safety and performance, but extensions must be written in Rust or in a language that reaches Rust through FFI bindings (for example, Python via PyO3 or C via extern "C" functions).
Distributed execution requires Ballista or a custom scheduler. DataFusion alone runs single-node.

Key Connections

scoped_to S3, Lakehouse — query execution over S3-stored data
depends_on Apache Arrow — Arrow columnar format is the in-memory representation
depends_on Apache Parquet — reads Parquet files from S3
enables Apache Iceberg — used by the iceberg-rust implementation

Definition

What it is

An extensible, embeddable query engine written in Rust, built on Apache Arrow. Provides SQL and DataFrame APIs for querying data on S3, used as the query core in tools like Ballista, InfluxDB IOx, and Delta-rs.

Why it exists

Many projects need a fast, embeddable SQL engine that can read from S3 without deploying a full distributed query cluster. DataFusion provides a modular, Arrow-native query engine that can be embedded into Rust, Python, or other applications.

Primary use cases

Embedded SQL analytics over S3, building custom query engines on object storage, serverless query execution against Parquet/Iceberg on S3.

Recent developments

Latest signals

Latest release: v54.0.0 (current as of June 2026), generally available since June 12, 2026. It adds LATERAL joins, SQL lambda functions, and a new Avro reader, plus join, scan, and planning performance improvements (about 50% faster on some repartition-heavy queries). The release comprises 740 commits from 139 contributors, per the DataFusion 54.0.0 release.
DataFusion 53.0.0 (April 2, 2026): a performance-driven release with 114 contributors and three structural performance wins. First, LIMIT-aware Parquet row group pruning: when DataFusion can prove that a row group fully matches the predicate and the matched group satisfies the LIMIT, partially matching groups are skipped entirely. Second, expanded filter pushdown through more join types and through UnionExec, plus dynamic-filter pushdown. Third, faster query planning via immutable plan pieces that are cheaper to clone. ClickBench normalized execution time continues its multi-release downward trend.
DataFusion Comet 0.15.0: Spark accelerator built on DataFusion. Per the DataFusion blog, the Comet subproject (an accelerator that translates Spark physical plans into DataFusion physical plans without code changes) reached 0.15.0 in April 2026, on a roughly 4-week release cadence. Comet is the productization path for "use DataFusion's Rust-native execution while keeping your Spark code." For shops running Spark today, it offers a transparent way to gain roughly 1.5-2x performance on supported operators without rewriting jobs.
Ecosystem footprint: DataFusion is the de facto Rust query engine. Per a GreptimeDB / Apache DataFusion PMC retrospective, approximately 3,000 GitHub repositories now use DataFusion as a library (Spice.ai, GreptimeDB, InfluxDB IOx, Delta-rs, Vortex, and many others). The community has grown several-fold since 2017. The post also documents fundamental Rust-side optimization patterns (strategic HashMap use, ownership minimization, allocation reduction) that apply equally to other Rust query engines.
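
The LIMIT-aware row group pruning noted for DataFusion 53.0.0 can be illustrated with a small pure-Python sketch. This is not DataFusion's implementation. The RowGroup class and the classify and plan_scan functions are hypothetical names introduced for illustration, the predicate is fixed to col >= threshold, and the query is assumed to have no ORDER BY, so any matching rows are acceptable. The sketch also assumes that several fully matching groups may be combined to cover the LIMIT.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical stand-in for Parquet row-group statistics on one column."""
    index: int
    min_value: int
    max_value: int
    num_rows: int


def classify(rg: RowGroup, threshold: int) -> str:
    """Classify a row group against the predicate `col >= threshold`."""
    if rg.min_value >= threshold:
        return "full"      # statistics prove every row matches
    if rg.max_value < threshold:
        return "none"      # statistics prove no row can match
    return "partial"       # some rows may match; row-level filtering needed


def plan_scan(row_groups: List[RowGroup], threshold: int, limit: int) -> List[int]:
    """Pick row groups to read for `WHERE col >= threshold LIMIT limit`.

    Assumes no ORDER BY, so any `limit` matching rows are a valid answer.
    """
    full = [rg for rg in row_groups if classify(rg, threshold) == "full"]
    partial = [rg for rg in row_groups if classify(rg, threshold) == "partial"]

    # LIMIT-aware path: take fully matching groups until the LIMIT is covered.
    selected: List[int] = []
    covered = 0
    for rg in full:
        if covered >= limit:
            break
        selected.append(rg.index)
        covered += rg.num_rows
    if covered >= limit:
        return selected    # partially matching groups are skipped entirely

    # Fallback: ordinary statistics pruning, which only drops "none" groups.
    return sorted(rg.index for rg in full + partial)
```

For example, with four 100-row groups whose (min, max) statistics are (0, 50), (40, 90), (60, 120), and (100, 200), a threshold of 60 and a LIMIT of 150 reads only groups 2 and 3. A LIMIT of 500 cannot be covered by fully matching groups alone, so the sketch falls back to reading groups 1, 2, and 3.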