Standard

Apache Arrow

Summary

What it is

A cross-language in-memory columnar data format specification with libraries for zero-copy reads, IPC, and efficient analytics.

Where it fits

Arrow sits between S3 storage (Parquet on disk) and compute (query execution in memory). It defines how columnar data is laid out in memory, so Parquet data pulled from S3 can be decoded once into Arrow and then handed between engines without further serialization.

Misconceptions / Traps

  • Arrow is an in-memory format, not a storage format. You do not "store Arrow files on S3" (Arrow IPC files do exist, but they are not the primary use case).
  • Arrow and Parquet are complementary, not competing. Parquet is the on-disk format; Arrow is the in-memory format. Most engines read Parquet into Arrow for processing.

Key Connections

  • used_by DuckDB, Apache Spark — in-memory processing format
  • scoped_to S3, Table Formats

Definition

What it is

A cross-language in-memory columnar data format specification with libraries for zero-copy reads, IPC, and efficient analytics. Defines how columnar data is laid out in memory.

Why it exists

Historically, every analytics engine had its own in-memory representation, so moving data between systems required costly serialization and deserialization at each boundary. Arrow provides a single universal in-memory representation that removes this overhead, which matters especially when processing large volumes of S3-stored Parquet data across multiple tools.

Primary use cases

Zero-copy data sharing between processing engines, efficient Parquet deserialization, in-memory analytics acceleration.

Relationships

Outbound Relationships

Inbound Relationships

Resources