Guide 15

The Great Catalog Migration

Problem Framing

The "Metastore Era", dominated by the Hive Metastore (HMS), is ending. As Iceberg overtakes Hive-style tables, organizations are moving from HMS to a modern catalog, usually one built around the Iceberg REST protocol: Apache Polaris, Unity Catalog, Apache Gravitino, or the managed AWS Glue Data Catalog. Each choice has different implications for multi-engine access, vendor neutrality, and operational complexity. This guide maps the migration from HMS to the catalog that fits your architecture.

Relevant Nodes

  • Topics: Table Formats, Lakehouse
  • Technologies: Apache Polaris, Apache Gravitino, Unity Catalog, Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Trino
  • Standards: Iceberg REST Catalog Spec, Iceberg V3 Spec
  • Architectures: Lakehouse Architecture, Separation of Storage and Compute
  • Pain Points: Vendor Lock-In, Metadata Overhead at Scale

Decision Path

  1. Why migrate from HMS:

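    • HMS was designed for Hive. It assumes Hive-style partitioning, lacks multi-table transactions, and has no built-in fine-grained RBAC; access control is usually delegated to an external layer such as Apache Ranger.
    • Iceberg's REST Catalog Spec provides a standard HTTP API that any engine can use. HMS exposes a Thrift API, so each engine needs its own Hive client integration.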
    • Performance degrades at scale. HMS uses a relational database (typically MySQL/PostgreSQL) with a schema designed for Hive semantics, not Iceberg's snapshot-based metadata.
  2. Choose your target catalog:

    • Apache Polaris — Pure Iceberg focus. Best vendor-neutral option for organizations standardizing on Iceberg.
      • Pros: Open-source, implements Iceberg REST Catalog Spec, RBAC built-in.
      • Cons: Iceberg-only (no Delta/Hudi), requires PostgreSQL backend, Apache incubating.
    • Unity Catalog — Multi-format. Best for mixed Delta/Iceberg environments.
      • Pros: Supports Iceberg + Delta + Hudi, lineage built-in, Linux Foundation governance.
      • Cons: OSS version may lag Databricks-managed version in features.
    • Apache Gravitino — Federation. Best for organizations with multiple existing catalogs.
      • Pros: Federate HMS, Glue, Polaris, and others into a single view.
      • Cons: Adds another layer of indirection; lineage features still maturing.
    • AWS Glue Data Catalog — Managed. Best for AWS-only environments that accept lock-in.
      • Pros: Zero-ops, tight integration with Athena/EMR/Redshift.
      • Cons: AWS-only, limited RBAC, not portable.
  3. Plan the migration:

    • Inventory all HMS databases and tables. Identify which are actively queried vs. dormant.
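    • Run dual-registration during transition: register tables in both HMS and the new catalog. For Iceberg tables, both catalogs then point at the same metadata files, so route all writes for a given table through one catalog at a time; a commit made through one catalog does not update the other catalog's pointer, and concurrent writers can lose updates.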
    • Migrate engine configurations one at a time. Start with read-heavy engines (Trino), then move write engines (Spark).
    • Test RBAC policies before production cutover. Catalog migration is also a security boundary change.
  4. Consider Gravitino as a bridge:

    • If you cannot migrate all catalogs at once, Gravitino can federate HMS alongside the new catalog during transition.
    • This avoids a "big bang" migration and allows gradual rollover.
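
The plan in steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a migration tool: the function names, the 90-day dormancy threshold, the catalog names (legacy, lake), the host names, and the sample inventory are all hypothetical. The spark.sql.catalog.* property keys follow Iceberg's documented Spark catalog configuration; the REST endpoint path depends on the catalog you choose.

```python
from datetime import datetime, timedelta


def split_active_dormant(last_queried, now, dormant_after=timedelta(days=90)):
    """Partition tables into actively queried and dormant lists.

    last_queried maps table name -> datetime of last query (None if unknown).
    The 90-day default is an arbitrary, hypothetical threshold.
    """
    active, dormant = [], []
    for table, seen in sorted(last_queried.items()):
        if seen is not None and now - seen <= dormant_after:
            active.append(table)
        else:
            dormant.append(table)
    return active, dormant


def cutover_order(engines):
    """Order engines so read-only engines migrate before write engines."""
    # False sorts before True, so engines that do not write come first.
    return sorted(engines, key=lambda e: (e["writes"], e["name"]))


def spark_iceberg_catalog_conf(name, kind, uri):
    """Build Spark properties for one Iceberg catalog ('hive' or 'rest')."""
    if kind not in ("hive", "rest"):
        raise ValueError(f"unsupported catalog type: {kind}")
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": kind,
        f"{prefix}.uri": uri,
    }


if __name__ == "__main__":
    now = datetime(2026, 1, 1)
    # Made-up inventory for illustration only.
    inventory = {
        "sales.orders": datetime(2025, 12, 20),
        "sales.orders_2019": datetime(2024, 3, 1),
        "tmp.scratch": None,
    }
    active, dormant = split_active_dormant(inventory, now)
    print("active:", active)
    print("dormant:", dormant)

    engines = [{"name": "spark", "writes": True}, {"name": "trino", "writes": False}]
    print("cutover order:", [e["name"] for e in cutover_order(engines)])

    # During transition one Spark session can see both catalogs by name.
    conf = {}
    conf.update(spark_iceberg_catalog_conf(
        "legacy", "hive", "thrift://hms.example.internal:9083"))
    conf.update(spark_iceberg_catalog_conf(
        "lake", "rest", "https://catalog.example.internal/api/catalog"))
    for key in sorted(conf):
        print(f"{key}={conf[key]}")
```

Keeping the old and new catalogs under separate names makes the cutover a per-engine configuration change: queries move from legacy.db.table to lake.db.table one engine at a time, and the legacy entry is removed only after the last engine has switched.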

What Changed Over Time

  • In the Hadoop era, HMS was the only practical option for Hive-style data lakes, and every tool assumed it.
  • Iceberg's REST Catalog Spec (2022-2023) established a vendor-neutral API, enabling the first generation of alternatives.
  • Snowflake donated Polaris to Apache (2024), giving the community a production-grade REST catalog.
  • Databricks open-sourced Unity Catalog (2024), adding multi-format support to the catalog landscape.
  • Gravitino reached 1.0 (2025), providing federation for organizations that cannot commit to a single catalog.
  • The "Catalog Wars" of 2025-2026 reflect the broader shift: metadata governance is the new competitive battleground.

Sources