Standard

Apache Arrow

Summary

What it is

A cross-language in-memory columnar data format specification with libraries for zero-copy reads, IPC, and efficient analytics.

Where it fits

Arrow sits between S3 storage (Parquet on disk) and compute (query execution in memory). It defines how columnar data is laid out in memory, so Parquet data pulled from S3 can be decoded once into Arrow and then handed between engines without further serialization.

Misconceptions / Traps

  • Arrow is an in-memory format, not a storage format. You do not "store Arrow files on S3" (Arrow IPC files do exist, but they are not the primary use case).
  • Arrow and Parquet are complementary, not competing. Parquet is the on-disk format; Arrow is the in-memory format. Most engines read Parquet into Arrow for processing.

Key Connections

  • used_by DuckDB, Apache Spark — in-memory processing format
  • scoped_to S3, Table Formats

Definition

What it is

A cross-language in-memory columnar data format specification with libraries for zero-copy reads, IPC, and efficient analytics. Defines how columnar data is laid out in memory.

Why it exists

Historically, every analytics engine had its own in-memory representation, so moving data between systems required costly serialization and deserialization at each boundary. Arrow provides a single universal in-memory representation that removes this overhead, which matters especially when processing large volumes of S3-stored Parquet data across multiple tools.

Primary use cases

Zero-copy data sharing between processing engines, efficient Parquet deserialization, in-memory analytics acceleration.

Relationships

Outbound Relationships

Inbound Relationships

Resources