Apache Parquet
Summary
What it is
A columnar file format specification designed for efficient analytical queries. Stores data by column, enabling predicate pushdown, projection pruning, and compression.
Where it fits
Parquet is the lingua franca of the S3 data ecosystem. The major table formats (Iceberg, Delta, Hudi) default to Parquet as the data file format, and the major query engines (Spark, DuckDB, Trino, ClickHouse) read it natively.
Misconceptions / Traps
- Parquet is a file format, not a table format. A single Parquet file carries its own schema, but it has no concept of schema evolution, transactions, or partitioning; those come from the table format layer.
- Parquet row group size matters for S3 performance. Row groups that are too small increase S3 request overhead; row groups that are too large waste I/O on selective queries. 128 MB to 256 MB is a common target.
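The row group trade-off can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: `estimate_scan` is a hypothetical helper, and it assumes equally sized columns, one range request per column chunk, and perfect row-group skipping, none of which a real reader guarantees.

```python
import math

MIB = 1024 ** 2


def estimate_scan(file_bytes, row_group_bytes, total_columns, columns_read,
                  matching_groups=None):
    """Return (range_requests, bytes_read) under a deliberately simple model.

    Assumptions (not Parquet guarantees): all columns are the same size,
    the reader issues one range request per column chunk, and row groups
    whose statistics rule out a match are skipped entirely.
    """
    n_groups = math.ceil(file_bytes / row_group_bytes)
    # With no selectivity hint, every row group is read (a full scan).
    groups_read = n_groups if matching_groups is None else min(matching_groups, n_groups)
    requests = groups_read * columns_read
    scanned = min(groups_read * row_group_bytes, file_bytes)
    bytes_read = scanned * columns_read // total_columns
    return requests, bytes_read


file_size = 1024 * MIB  # a 1 GiB file with 20 columns, 4 of them queried
for rg_mib in (8, 128, 1024):
    full = estimate_scan(file_size, rg_mib * MIB, 20, 4)
    selective = estimate_scan(file_size, rg_mib * MIB, 20, 4, matching_groups=1)
    print(f"{rg_mib:>5} MiB row groups: full scan {full[0]:>4} requests, "
          f"selective query {selective[1] / MIB:.1f} MiB read")
```

Under these assumptions, small row groups multiply the request count of a full scan, while a single huge row group forces a selective query to read every value of the requested columns.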
Key Connections
- used_by: DuckDB, Trino, Apache Spark, ClickHouse (the universal analytics file format)
- enables: Lakehouse Architecture (provides efficient columnar storage on S3)
- solves: Cold Scan Latency (columnar layout enables predicate pushdown, reducing I/O)
- scoped_to: S3, Table Formats
Definition
What it is
A columnar file format specification designed for efficient analytical queries. Stores data by column rather than by row, enabling predicate pushdown, projection pruning, and compression.
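To illustrate how predicate pushdown and projection pruning work, here is a toy model in plain Python. It is not Parquet's actual encoding or API: `RowGroup`, `make_row_groups`, and `scan` are hypothetical names, and the only idea borrowed from Parquet is that each row group stores per-column min/max statistics that a reader can check before touching the data.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    columns: dict  # column name -> list of values (one "column chunk" each)
    stats: dict    # column name -> (min, max), computed once at write time


def make_row_groups(rows, rows_per_group):
    """Pivot row-oriented records into column-oriented row groups."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append(RowGroup(columns, stats))
    return groups


def scan(groups, select, column, lower_bound):
    """Return the `select` columns for rows where `column` >= lower_bound.

    Also returns how many row groups had to be read at all.
    """
    result, groups_read = [], 0
    for group in groups:
        _, highest = group.stats[column]
        if highest < lower_bound:
            continue  # predicate pushdown: statistics rule this group out
        groups_read += 1
        keep = [i for i, v in enumerate(group.columns[column]) if v >= lower_bound]
        for i in keep:
            # projection pruning: only the requested columns are touched
            result.append({name: group.columns[name][i] for name in select})
    return result, groups_read


rows = [{"id": i, "country": "DE" if i % 2 else "US", "amount": i * 10}
        for i in range(12)]
groups = make_row_groups(rows, rows_per_group=4)
matches, groups_read = scan(groups, select=["id"], column="amount", lower_bound=90)
print(matches, "row groups read:", groups_read)
```

In this example two of the three row groups are skipped from their statistics alone, and the `country` column is never read.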
Why it exists
Row-oriented formats (CSV, JSON) are inefficient for analytical queries that read a subset of columns from large datasets. Parquet's columnar layout dramatically reduces I/O when querying S3-stored data, where every byte transferred costs time and money.
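A rough sense of the I/O difference can be had with the standard library alone. The sketch below stores the same 1,000 made-up records once as CSV text and once as one binary blob per column, then counts the bytes a query summing a single column would have to fetch. The per-column blobs are a stand-in for columnar storage in general, not Parquet's real page encoding, and no compression is applied.

```python
import csv
import io
from array import array

# Hypothetical records: (id, name, amount)
rows = [(i, f"user{i}", i * 1.5) for i in range(1000)]

# Row-oriented: one CSV blob. Summing `amount` means reading all of it.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_bytes = len(buf.getvalue().encode("utf-8"))

# Column-oriented: one blob per column, each readable on its own.
columns = {
    "id": array("q", (r[0] for r in rows)).tobytes(),
    "name": "\n".join(r[1] for r in rows).encode("utf-8"),
    "amount": array("d", (r[2] for r in rows)).tobytes(),
}
amount_bytes = len(columns["amount"])

# The query only needs to fetch and decode the `amount` blob.
amounts = array("d")
amounts.frombytes(columns["amount"])
total = sum(amounts)

print(f"CSV bytes read: {csv_bytes}, columnar bytes read: {amount_bytes}, sum: {total}")
```

The gap widens as the table gains columns the query never touches, which is the common case for analytical workloads.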
Primary use cases
Analytical data storage on S3, data lake file format, table format data files (Iceberg, Delta, Hudi all default to Parquet).
Relationships
Outbound Relationships
- scoped_to: S3, Table Formats
- enables: Lakehouse Architecture
- solves: Cold Scan Latency
Inbound Relationships
Resources
- Official Apache Parquet format specification defining the file layout, encoding schemes, page structure, and metadata format.
- Canonical repository for the Parquet format specification, including the Thrift IDL definitions that formally describe the binary file structure.
- The reference Java implementation of the Parquet format, the basis for Spark/Hive/Hadoop Parquet support.
- Official Apache Parquet project homepage with links to documentation, community resources, and all sub-projects.