Standard

Apache Parquet

A columnar file format specification designed for efficient analytical queries. Stores data by column, enabling predicate pushdown, projection pruning, and compression.

Summary

Where it fits

Parquet is the lingua franca of the S3 data ecosystem. The major table formats (Iceberg, Delta Lake, Hudi) use Parquet as their default data file format, and the major query engines (Spark, DuckDB, Trino, ClickHouse) read it natively.

Misconceptions / Traps
  • Parquet is a file format, not a table format. A single Parquet file carries its own schema, but it has no concept of schema evolution, transactions, or partitioning; those come from the table format layer.
  • Parquet row group size matters for S3 performance. Row groups that are too small increase S3 request overhead; row groups that are too large waste I/O on selective queries. 128 MB to 256 MB is a common target.

Key Connections
  • used_by DuckDB, Trino, Apache Spark, ClickHouse: the de facto common file format for analytics
  • enables Lakehouse Architecture: provides efficient columnar storage on S3
  • solves Cold Scan Latency: column projection and row-group statistics let readers skip data, reducing I/O
  • scoped_to S3, Table Formats

Definition

What it is

A columnar file format specification designed for efficient analytical queries. Stores data by column rather than by row, enabling predicate pushdown, projection pruning, and compression.

Why it exists

Row-oriented formats (CSV, JSON) are inefficient for analytical queries that read a subset of columns from large datasets, because a reader must scan every row in full. Parquet's columnar layout lets a reader fetch only the columns a query references, which sharply reduces I/O on S3-stored data, where every byte transferred costs time and money.

Primary use cases

Analytical data storage on S3, data lake file format, table format data files (Iceberg, Delta, Hudi all default to Parquet).

Recent developments

Latest signals
  • Variant and Native Geospatial types land in the spec. Per the Apache Parquet blog, the project announced the Variant type on February 27, 2026 for semi-structured payloads (the same logical shape Delta Lake 4.0 and Iceberg V3 are converging on), and Native Geospatial Types on February 13, 2026. Both expand what Parquet treats as first-class. The Variant work matters most for the catalog-managed-tables story: the format now has a sanctioned way to carry schema-on-read JSON with shredded-column statistics, without round-tripping through string columns.
  • Eight-engine implementation matrix is the new "supported" floor. Per the implementation status page (updated February 11, 2026), eight engines now hold first-class Parquet read implementations: Arrow C++, parquet-java, Arrow Go, Arrow Rust, cuDF (NVIDIA), Hyparquet, DuckDB, and Polars. The minimum-version tables now stretch through 2025, a real signal that the format-feature gating window across engines has compressed from years to quarters. For lakehouse architects this means "is feature X supported across my engine mix" is finally a tractable question rather than a guess.
  • Java 11 baseline; Arrow-rs ships quarterly. Per the January 6–14, 2026 dev-list digest, Parquet has raised its minimum to Java 11 (with Iceberg considering Java 17) as part of a broader lakehouse-stack JVM modernization wave. Arrow-rs parquet releases now ship minor versions roughly quarterly (57.3.0 in January, 58.x tracked for April, 59.2.0 scheduled for July, plus 56.x maintenance patches), so the Rust track is now a reliable consumer of new Parquet features rather than a perpetual catch-up game.
