Standard

ORC

Summary

What it is

Optimized Row Columnar file format specification — a columnar format with built-in indexing, compression, and predicate pushdown support, originally developed for the Hive ecosystem.

Where it fits

ORC is the legacy columnar format in the Hadoop/Hive ecosystem. On S3, it serves the same role as Parquet — efficient columnar storage for analytical queries — but is primarily used in organizations with existing Hive investments.

Misconceptions / Traps

  • ORC and Parquet are functionally similar for most workloads. The choice is usually driven by ecosystem (Hive → ORC, everything else → Parquet) rather than technical superiority.
  • Hive's ACID transaction support is implemented on top of ORC (base and delta files) and operates differently from table-format ACID (Iceberg, Delta). They are not the same concept.

Key Connections

  • used_by Apache Spark, Trino — supported as a data file format
  • solves Cold Scan Latency — columnar format enables predicate pushdown
  • scoped_to S3, Table Formats

Definition

What it is

Optimized Row Columnar file format specification. A columnar format with built-in indexing, compression, and predicate pushdown support, originally developed for the Hive ecosystem.

Why it exists

ORC originated in Hive (as the successor to the RCFile format) and remains in use in organizations with significant Hive and Spark-on-YARN investments. It provides similar benefits to Parquet (columnar storage, efficient analytics) with different performance trade-offs.

Primary use cases

Analytical data storage in Hive-centric S3 environments, legacy Hadoop data lake compatibility.
