Standard

Apache Parquet

Summary

What it is

A columnar file format specification designed for efficient analytical queries. Stores data by column, enabling predicate pushdown, projection pruning, and compression.

Where it fits

Parquet is the lingua franca of the S3 data ecosystem. Every table format (Iceberg, Delta, Hudi) defaults to Parquet as the data file format, and every query engine (Spark, DuckDB, Trino, ClickHouse) reads it natively.
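
Because engines scan Parquet directly, a query only touches the columns and row groups it needs. A minimal sketch with DuckDB's Python client (the bucket path and column names are illustrative, and the s3:// access assumes the httpfs extension plus credentials available in the environment):

    import duckdb

    con = duckdb.connect()
    # httpfs provides s3:// support; credentials are picked up from the environment.
    con.execute("INSTALL httpfs; LOAD httpfs;")

    # Only the referenced columns are decoded, and row groups whose statistics
    # rule out the predicate are skipped entirely.
    df = con.execute(
        """
        SELECT order_id, amount
        FROM read_parquet('s3://my-bucket/orders/*.parquet')
        WHERE order_date >= DATE '2024-01-01'
        """
    ).fetch_df()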

Misconceptions / Traps

  • Parquet is a file format, not a table format. A single Parquet file has no concept of schema evolution, transactions, or partitioning — those come from the table format layer.
  • Parquet row group size matters for S3 performance. Row groups that are too small increase S3 request overhead; row groups that are too large waste I/O for selective queries. 128-256 MB is a common target (see the write sketch after this list).
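
A minimal write sketch with pyarrow, which expresses row group size in rows rather than bytes, so the byte target has to be translated through an estimated row width (the table contents and the in-memory size estimate are illustrative):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical dataset; in practice this is your Arrow table.
    table = pa.table({
        "user_id": list(range(1_000_000)),
        "score": [0.5] * 1_000_000,
    })

    # Aim for ~128 MB row groups. nbytes is the uncompressed Arrow size,
    # which only approximates the encoded Parquet size, but it yields a
    # workable rows-per-group estimate.
    target_bytes = 128 * 1024 * 1024
    approx_row_bytes = max(1, table.nbytes // table.num_rows)
    rows_per_group = target_bytes // approx_row_bytes

    pq.write_table(table, "events.parquet", row_group_size=rows_per_group)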

Key Connections

  • used_by DuckDB, Trino, Apache Spark, ClickHouse — the universal analytics file format
  • enables Lakehouse Architecture — provides efficient columnar storage on S3
  • solves Cold Scan Latency — columnar layout enables predicate pushdown, reducing I/O
  • scoped_to S3, Table Formats

Definition

What it is

A columnar file format specification designed for efficient analytical queries. Stores data by column rather than by row, enabling predicate pushdown, projection pruning, and compression.

Why it exists

Row-oriented formats (CSV, JSON) are inefficient for analytical queries that read a subset of columns from large datasets. Parquet's columnar layout dramatically reduces I/O when querying S3-stored data, where every byte transferred costs time and money.
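
A minimal read sketch with pyarrow showing both effects: the columns argument prunes the projection and the filters argument is pushed down against row group statistics, so only a fraction of the file is fetched from S3 (the path and column names are illustrative, and s3:// access assumes credentials available to pyarrow's S3 filesystem):

    import pyarrow.parquet as pq

    # Only two columns are decoded, and row groups whose min/max statistics
    # cannot satisfy the filter are skipped rather than downloaded.
    table = pq.read_table(
        "s3://my-bucket/events/part-0.parquet",
        columns=["user_id", "event_type"],
        filters=[("country", "=", "DE")],
    )
    print(table.num_rows)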

Primary use cases

Analytical data storage on S3, data lake file format, table format data files (Iceberg, Delta, Hudi all default to Parquet).

Relationships

Resources