DataFusion
An extensible query execution framework written in Rust, built on Apache Arrow, that provides a SQL query planner and execution engine for building custom analytics applications over S3-stored data.
Summary
DataFusion is the embedded query engine layer used by projects like Ballista, InfluxDB IOx, and Delta-rs. Rather than being a standalone analytics product, it is the foundation that other S3-native tools build upon for SQL query planning and columnar execution.
- DataFusion is a library, not a database. It provides query planning and execution but requires integration work to become a deployable analytics system.
- DataFusion's Rust implementation offers memory safety and performance, but it limits extensibility to Rust or to languages with Rust FFI bindings (Python via PyO3, C via extern "C").
- Distributed execution requires Ballista or a custom scheduler. DataFusion alone runs single-node.
- scoped_to: S3, Lakehouse (query execution over S3-stored data)
- depends_on: Apache Arrow (Arrow columnar format is the in-memory representation)
- depends_on: Apache Parquet (reads Parquet files from S3)
- enables: Apache Iceberg (used by the iceberg-rust implementation)
Definition
An extensible, embeddable query engine written in Rust, built on Apache Arrow. Provides SQL and DataFrame APIs for querying data on S3, used as the query core in tools like Ballista, InfluxDB IOx, and Delta-rs.
Many projects need a fast, embeddable SQL engine that can read from S3 without deploying a full distributed query cluster. DataFusion provides a modular, Arrow-native query engine that can be embedded into Rust, Python, or other applications.
Typical uses include embedded SQL analytics over S3, building custom query engines on object storage, and serverless query execution against Parquet or Iceberg tables on S3.
Recent developments
- Latest release: v54.0.0 (current as of June 2026). Generally available since June 12, 2026, it adds LATERAL joins, SQL lambda functions, and a new Avro reader, plus join, scan, and planning performance wins (about 50% faster on some repartition-heavy queries). The release comprises 740 commits from 139 contributors, per the DataFusion 54.0.0 release.
- DataFusion 53.0.0 (April 2, 2026): a performance-driven release. It shipped with 114 contributors and three structural performance wins: LIMIT-aware Parquet row group pruning (when DataFusion can prove that a row group fully matches the predicate and the fully matching groups satisfy the LIMIT, partially matching groups are skipped entirely), expanded filter pushdown through more join types and through UnionExec plus dynamic-filter pushdown, and faster query planning via cheaper-to-clone immutable plan pieces. ClickBench normalized execution time continues its multi-release downtrend.
- DataFusion Comet 0.15.0: a Spark accelerator built on DataFusion. Per the DataFusion blog, the Comet subproject (an accelerator that translates Spark physical plans to DataFusion physical plans without code changes) reached 0.15.0 in April 2026, on a roughly 4-week release cadence. Comet is the productization path for "use DataFusion's Rust-native execution while keeping your Spark code." For shops on Spark today, it is a transparent way to gain a 1.5-2x performance improvement on supported operators without rewriting jobs.
- Ecosystem footprint: DataFusion is the de facto Rust query engine. Per a GreptimeDB / Apache DataFusion PMC retrospective, DataFusion now powers approximately 3,000 GitHub repositories as a library (Spice.ai, GreptimeDB, InfluxDB IOx, Delta-rs, Vortex, and many others). The community has grown several-fold since 2017. The post also documents fundamental Rust-side optimization patterns (strategic HashMap use, ownership minimization, allocation reduction) that are equally applicable to other Rust query engines.
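The LIMIT-aware row group pruning described for the 53.0.0 release can be illustrated with a toy model. The sketch below is not DataFusion's implementation: RowGroup, classify, and plan_scan are hypothetical names, the predicate is fixed to `value >= threshold`, the statistics are plain integer min/max values, and the query is assumed to have no ORDER BY (so any matching rows may be returned). It only shows the decision rule: if the fully matching row groups alone can supply LIMIT rows, the partially matching groups never need to be read.

```python
from dataclasses import dataclass
from enum import Enum


class Match(Enum):
    NONE = "none"        # no row in the group can satisfy the predicate
    PARTIAL = "partial"  # some rows might satisfy it
    FULL = "full"        # every row provably satisfies it


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical stand-in for Parquet row group metadata."""
    name: str
    num_rows: int
    min_value: int
    max_value: int


def classify(group: RowGroup, threshold: int) -> Match:
    """Classify a row group against the predicate `value >= threshold`."""
    if group.max_value < threshold:
        return Match.NONE
    if group.min_value >= threshold:
        return Match.FULL
    return Match.PARTIAL


def plan_scan(groups, threshold, limit=None):
    """Return the names of the row groups that still need to be read."""
    # Ordinary statistics pruning: drop groups that cannot match at all.
    candidates = [(g, classify(g, threshold)) for g in groups]
    candidates = [(g, m) for g, m in candidates if m is not Match.NONE]

    if limit is not None:
        full = [g for g, m in candidates if m is Match.FULL]
        # LIMIT-aware step: only applies when fully matching groups
        # are guaranteed to supply at least `limit` rows.
        if sum(g.num_rows for g in full) >= limit:
            chosen, rows = [], 0
            for g in full:
                if rows >= limit:
                    break
                chosen.append(g.name)
                rows += g.num_rows
            return chosen

    # Fallback: read every group that might contain a match.
    return [g.name for g, _ in candidates]


# Illustrative metadata (made-up values) for the predicate `value >= 50`.
EXAMPLE_GROUPS = [
    RowGroup("rg0", 1000, 0, 49),     # cannot match
    RowGroup("rg1", 1000, 40, 120),   # partially matches
    RowGroup("rg2", 1000, 100, 200),  # fully matches
    RowGroup("rg3", 1000, 150, 300),  # fully matches
]

if __name__ == "__main__":
    print(plan_scan(EXAMPLE_GROUPS, 50))              # no LIMIT
    print(plan_scan(EXAMPLE_GROUPS, 50, limit=10))    # small LIMIT
    print(plan_scan(EXAMPLE_GROUPS, 50, limit=5000))  # LIMIT too large to prune
```

In this model, a small LIMIT lets the planner read a single fully matching group and skip the partially matching one, while a LIMIT larger than the guaranteed row count falls back to ordinary pruning.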
Resources
Official Apache DataFusion site for the extensible query engine built on Arrow, designed for building custom analytics systems on object storage.
DataFusion source repository with the Rust-based query engine, object store integration, and Parquet/Iceberg readers.
DataFusion SQL reference documenting the query capabilities available for S3-backed analytical workloads.