Standard

Apache Parquet

Summary

What it is

A columnar file format specification designed for efficient analytical queries. Stores data by column, enabling predicate pushdown, projection pruning, and compression.

Where it fits

Parquet is the lingua franca of the S3 data ecosystem. Every table format (Iceberg, Delta, Hudi) defaults to Parquet as the data file format, and every query engine (Spark, DuckDB, Trino, ClickHouse) reads it natively.
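
Because engines scan Parquet directly, a query only touches the columns and row groups it needs. A minimal sketch with DuckDB's Python client (the bucket path and column names are illustrative, and the s3:// access assumes the httpfs extension plus credentials available in the environment):

    import duckdb

    con = duckdb.connect()
    # httpfs provides s3:// support; credentials are picked up from the environment.
    con.execute("INSTALL httpfs; LOAD httpfs;")

    # Only the referenced columns are decoded, and row groups whose statistics
    # rule out the predicate are skipped entirely.
    df = con.execute(
        """
        SELECT order_id, amount
        FROM read_parquet('s3://my-bucket/orders/*.parquet')
        WHERE order_date >= DATE '2024-01-01'
        """
    ).fetch_df()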

Misconceptions / Traps

  • Parquet is a file format, not a table format. A single Parquet file has no concept of schema evolution, transactions, or partitioning — those come from the table format layer.
  • Parquet row group size matters for S3 performance. Row groups that are too small increase S3 request overhead; row groups that are too large waste I/O for selective queries. 128-256 MB is a common target (see the write sketch after this list).
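
A minimal write sketch with pyarrow, which expresses row group size in rows rather than bytes, so the byte target has to be translated through an estimated row width (the table contents and the in-memory size estimate are illustrative):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical dataset; in practice this is your Arrow table.
    table = pa.table({
        "user_id": list(range(1_000_000)),
        "score": [0.5] * 1_000_000,
    })

    # Aim for ~128 MB row groups. nbytes is the uncompressed Arrow size,
    # which only approximates the encoded Parquet size, but it yields a
    # workable rows-per-group estimate.
    target_bytes = 128 * 1024 * 1024
    approx_row_bytes = max(1, table.nbytes // table.num_rows)
    rows_per_group = target_bytes // approx_row_bytes

    pq.write_table(table, "events.parquet", row_group_size=rows_per_group)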

Key Connections

  • used_by DuckDB, Trino, Apache Spark, ClickHouse — the universal analytics file format
  • enables Lakehouse Architecture — provides efficient columnar storage on S3
  • solves Cold Scan Latency — columnar layout enables predicate pushdown, reducing I/O
  • scoped_to S3, Table Formats

Definition

What it is

A columnar file format specification designed for efficient analytical queries. Stores data by column rather than by row, enabling predicate pushdown, projection pruning, and compression.

Why it exists

Row-oriented formats (CSV, JSON) are inefficient for analytical queries that read a subset of columns from large datasets. Parquet's columnar layout dramatically reduces I/O when querying S3-stored data, where every byte transferred costs time and money.
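
A minimal read sketch with pyarrow showing both effects: the columns argument prunes the projection and the filters argument is pushed down against row group statistics, so only a fraction of the file is fetched from S3 (the path and column names are illustrative, and s3:// access assumes credentials available to pyarrow's S3 filesystem):

    import pyarrow.parquet as pq

    # Only two columns are decoded, and row groups whose min/max statistics
    # cannot satisfy the filter are skipped rather than downloaded.
    table = pq.read_table(
        "s3://my-bucket/events/part-0.parquet",
        columns=["user_id", "event_type"],
        filters=[("country", "=", "DE")],
    )
    print(table.num_rows)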

Primary use cases

Analytical data storage on S3, data lake file format, table format data files (Iceberg, Delta, Hudi all default to Parquet).

Relationships

Resources