Technology

DataFusion

Summary

What it is

An extensible query execution framework written in Rust, built on Apache Arrow, that provides a SQL query planner and execution engine for building custom analytics applications over S3-stored data.

Where it fits

DataFusion is the embedded query engine layer used by projects like Ballista, InfluxDB IOx, and delta-rs. It is not a standalone analytics product; it is the foundation that other S3-native tools build on for SQL query planning and columnar execution.
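As a rough illustration of what "query planning and columnar execution" means, the sketch below runs a tiny scan, filter, project plan over a batch stored as columns rather than rows. It is plain Python written for this page, not DataFusion's API: `Scan`, `Filter`, `Project`, and the sample `readings` table are invented names, and a real engine would use Arrow arrays and an optimizer instead of Python lists.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A batch maps each column name to a list of values (column-oriented layout).
Batch = Dict[str, List]


@dataclass
class Scan:
    """Leaf node: returns the stored batch unchanged."""
    table: Batch

    def execute(self) -> Batch:
        return self.table


@dataclass
class Filter:
    """Keeps the rows for which predicate(value) is true in one column."""
    child: object
    column: str
    predicate: Callable[[object], bool]

    def execute(self) -> Batch:
        batch = self.child.execute()
        # Evaluate the predicate over a single column to build a row mask...
        mask = [self.predicate(v) for v in batch[self.column]]
        # ...then apply that mask to every column in the batch.
        return {
            name: [v for v, keep in zip(col, mask) if keep]
            for name, col in batch.items()
        }


@dataclass
class Project:
    """Keeps only the named columns."""
    child: object
    columns: List[str]

    def execute(self) -> Batch:
        batch = self.child.execute()
        return {name: batch[name] for name in self.columns}


# Roughly: SELECT city FROM readings WHERE temp_c > 20
readings = {"city": ["Oslo", "Cairo", "Lima"], "temp_c": [4, 31, 22]}
plan = Project(Filter(Scan(readings), "temp_c", lambda t: t > 20), ["city"])
result = plan.execute()
print(result)  # {'city': ['Cairo', 'Lima']}
```

The plan is a tree of operators built before any data is read, and the filter inspects only the `temp_c` column to decide which rows survive. Those two ideas, a plan separate from execution and work done column by column, are what the toy is meant to convey.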

Misconceptions / Traps
  • DataFusion is a library, not a database. It provides query planning and execution but requires integration work to become a deployable analytics system.
  • DataFusion's Rust implementation offers memory safety and performance, but extensions must be written in Rust or in a language that reaches Rust through FFI bindings (Python via PyO3, C via extern "C" functions).
  • DataFusion alone runs on a single node (multi-threaded, but in one process). Distributed execution requires Ballista or a custom scheduler.

Key Connections
  • scoped_to S3, Lakehouse — query execution over S3-stored data
  • depends_on Apache Arrow — Arrow columnar format is the in-memory representation
  • depends_on Apache Parquet — reads Parquet files from S3
  • enables Apache Iceberg — used by the iceberg-rust implementation

Definition

What it is

An extensible, embeddable query engine written in Rust and built on Apache Arrow. It provides SQL and DataFrame APIs for querying data on S3 and serves as the query core in tools like Ballista, InfluxDB IOx, and delta-rs.

Why it exists

Many projects need a fast, embeddable SQL engine that can read from S3 without deploying a full distributed query cluster. DataFusion provides a modular, Arrow-native query engine that can be embedded into Rust, Python, or other applications.

Primary use cases

Embedded SQL analytics over S3, building custom query engines on object storage, serverless query execution against Parquet/Iceberg on S3.
