Hive Metastore

The original metadata catalog service from the Apache Hive project that stores table schemas, partition mappings, and storage locations for data on S3 and HDFS. Commonly abbreviated as HMS.

Summary

Where it fits

Hive Metastore is the legacy but still widely deployed catalog underpinning Spark, Trino, Presto, and Flink workloads against S3 data. It predates dedicated Iceberg catalogs and remains the default metastore for many on-premise and hybrid lakehouse deployments.
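Query engines typically reach HMS over its Thrift endpoint (port 9083 by default). As a sketch of how an engine attaches to it, a minimal Trino Hive connector catalog file might look like the following, with the hostname being a placeholder:

```properties
# etc/catalog/hive.properties — hypothetical catalog file
connector.name=hive
hive.metastore.uri=thrift://metastore-host:9083
```

With this in place, tables registered in HMS appear to Trino under the `hive` catalog without any per-table configuration.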

Misconceptions / Traps
  • HMS was designed for Hive partition-based tables. Its data model is a poor fit for Iceberg's snapshot-based metadata, which is why dedicated Iceberg catalogs (REST, Nessie, Glue) are preferred for new deployments.
  • Running HMS requires a backing relational database (typically MySQL or PostgreSQL). That database becomes a single point of failure and a scaling bottleneck for metadata operations.
  • HMS is not a governance tool. It stores structural metadata but has no built-in access control, lineage tracking, or data quality features.
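The backing-database dependency shows up directly in HMS configuration. A hive-site.xml fragment pointing the metastore at MySQL might look like this (host, database name, and driver are illustrative placeholders):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://db-host:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
```

Every schema lookup and partition listing ultimately resolves to queries against this database, which is why it is the component to watch for availability and scale.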
Key Connections
  • scoped_to Metadata Management — the original Hadoop-era catalog
  • enables Apache Spark, Trino, Apache Flink — query engines that read HMS metadata
  • alternative_to AWS Glue Catalog, Apache Polaris — older alternative to managed catalogs
  • constrained_by Metadata Overhead at Scale — HMS database becomes a bottleneck at large scale

Definition

What it is

An open-source metadata service originally built for Apache Hive that stores table schemas, partition locations, and statistics for data stored on HDFS or S3. The longest-standing catalog in the Hadoop ecosystem.
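To make the data model concrete, here is a minimal in-memory sketch of the kind of metadata HMS tracks: table schemas, partition mappings, and storage locations. The class and field names are illustrative, not the actual HMS Thrift schema.

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    # Structural metadata of the kind HMS records for one table.
    name: str
    columns: dict[str, str]   # column name -> type
    location: str             # base path on S3 or HDFS
    partitions: dict[str, str] = field(default_factory=dict)  # partition spec -> path

class Metastore:
    """Toy stand-in for the HMS catalog."""
    def __init__(self):
        self.tables = {}

    def create_table(self, table):
        self.tables[table.name] = table

    def add_partition(self, table_name, spec, path):
        self.tables[table_name].partitions[spec] = path

    def partition_location(self, table_name, spec):
        # A query engine asks the catalog: where do this partition's files live?
        return self.tables[table_name].partitions[spec]

hms = Metastore()
hms.create_table(Table(
    name="events",
    columns={"user_id": "bigint", "ts": "timestamp"},
    location="s3://bucket/warehouse/events",
))
hms.add_partition("events", "dt=2024-01-01", "s3://bucket/warehouse/events/dt=2024-01-01")
print(hms.partition_location("events", "dt=2024-01-01"))
```

Note that the catalog maps each partition value to a physical path; this directory-per-partition model is exactly what fits Hive-style tables well and Iceberg's snapshot-based metadata poorly.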

Why it exists

Before open table formats such as Iceberg, Delta Lake, and Hudi, Hive Metastore was the primary way to impose table structure on files stored in distributed storage. It remains widely deployed as the default catalog for Spark, Trino, and Flink workloads reading from S3.

Primary use cases

Legacy catalog for Spark and Trino workloads; Iceberg catalog backend (HiveCatalog); schema registry for Hive-style partitioned tables on S3.
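The Iceberg catalog backend role means Iceberg's HiveCatalog stores a pointer to each table's current metadata file as an HMS table property. A spark-defaults.conf sketch wiring Spark's Iceberg support to an HMS endpoint might look like this (the catalog name `hms_cat` and hostname are placeholders):

```properties
spark.sql.catalog.hms_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hms_cat.type=hive
spark.sql.catalog.hms_cat.uri=thrift://metastore-host:9083
```

This lets an existing HMS deployment serve Iceberg tables without introducing a new catalog service, at the cost of the scaling limits noted above.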
