Guide 15
The Great Catalog Migration
Problem Framing
The "Metastore Era" — dominated by the Hive Metastore (HMS) — is ending. As Iceberg overtakes Hive-style tables, organizations must migrate from HMS to a modern REST-based catalog: Apache Polaris, Unity Catalog, Apache Gravitino, or AWS Glue. Each choice has different implications for multi-engine access, vendor neutrality, and operational complexity. This guide maps the migration from HMS to the catalog that fits your architecture.
Relevant Nodes
- Topics: Table Formats, Lakehouse
- Technologies: Apache Polaris, Apache Gravitino, Unity Catalog, Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Trino
- Standards: Iceberg REST Catalog Spec, Iceberg V3 Spec
- Architectures: Lakehouse Architecture, Separation of Storage and Compute
- Pain Points: Vendor Lock-In, Metadata Overhead at Scale
Decision Path
Why migrate from HMS:
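- HMS was designed for Hive. It assumes Hive-style partitioning, lacks multi-table transactions, and does not itself enforce role-based access control (RBAC), so authorization is usually delegated to the engine or to an external tool such as Apache Ranger.
- Iceberg's REST Catalog Spec provides a standard HTTP API that any engine can use. HMS exposes a Thrift API, so each engine needs its own client integration.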
- Performance degrades at scale. HMS uses a relational database (typically MySQL/PostgreSQL) with a schema designed for Hive semantics, not Iceberg's snapshot-based metadata.
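To make the difference concrete, the following sketch builds the Spark properties for an Iceberg catalog backed by HMS and for one backed by a REST catalog. The property keys follow Iceberg's Spark catalog configuration; the catalog names, hostnames, URIs, and warehouse value are placeholders assumed for illustration, not values from any real deployment.

```python
from typing import Dict, Optional


def spark_catalog_conf(
    name: str, kind: str, uri: str, warehouse: Optional[str] = None
) -> Dict[str, str]:
    """Build Spark properties for an Iceberg catalog.

    kind is "hive" (HMS over Thrift) or "rest" (Iceberg REST Catalog Spec).
    """
    if kind not in {"hive", "rest"}:
        raise ValueError(f"unsupported catalog type: {kind!r}")
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": kind,
        f"{prefix}.uri": uri,
    }
    if warehouse is not None:
        conf[f"{prefix}.warehouse"] = warehouse
    return conf


# Placeholder endpoints (assumptions): replace with your own.
legacy = spark_catalog_conf("legacy", "hive", "thrift://hms.example.internal:9083")
modern = spark_catalog_conf(
    "lake", "rest", "https://catalog.example.internal/api/catalog", warehouse="analytics"
)

for key, value in {**legacy, **modern}.items():
    print(f"{key}={value}")
```

Both catalogs can be configured in the same Spark session, which is what makes a side-by-side transition possible. Other engines have their own equivalents; in Trino's Iceberg connector, for example, the corresponding settings are `iceberg.catalog.type=rest` and `iceberg.rest-catalog.uri`.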
Choose your target catalog:
- Apache Polaris — Pure Iceberg focus. Best vendor-neutral option for organizations standardizing on Iceberg.
  - Pros: Open source, implements the Iceberg REST Catalog Spec, built-in RBAC.
  - Cons: Iceberg-only (no Delta/Hudi), requires a PostgreSQL backend, still in Apache incubation.
- Unity Catalog — Multi-format. Best for mixed Delta/Iceberg environments.
  - Pros: Supports Iceberg, Delta, and Hudi; built-in lineage; Linux Foundation governance.
  - Cons: The OSS version may lag the Databricks-managed version in features.
- Apache Gravitino — Federation. Best for organizations with multiple existing catalogs.
  - Pros: Federates HMS, Glue, Polaris, and others into a single view.
  - Cons: Adds another layer of indirection; lineage features are still maturing.
- AWS Glue Data Catalog — Managed. Best for AWS-only environments that accept lock-in.
  - Pros: Zero-ops, tight integration with Athena/EMR/Redshift.
  - Cons: AWS-only, limited RBAC, not portable.
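The decision path above can be sketched as a small function. This is only one possible reading of the list: the function name, its parameters, and the order in which the rules are checked are assumptions made for illustration, and a real decision would weigh more factors than these four.

```python
from typing import Iterable


def recommend_catalog(
    formats: Iterable[str],
    aws_only: bool = False,
    accepts_lock_in: bool = False,
    existing_catalogs: int = 1,
) -> str:
    """Map a few architecture facts to one of the four catalogs in this guide."""
    wanted = {f.lower() for f in formats}
    if not wanted:
        raise ValueError("at least one table format is required")
    # Several catalogs already in use: federate rather than consolidate.
    if existing_catalogs > 1:
        return "Apache Gravitino"
    # Managed option, only when the AWS dependency is an accepted trade-off.
    if aws_only and accepts_lock_in:
        return "AWS Glue Data Catalog"
    # Standardizing on Iceberg alone.
    if wanted <= {"iceberg"}:
        return "Apache Polaris"
    # Mixed formats (for example Delta alongside Iceberg).
    return "Unity Catalog"


print(recommend_catalog(["iceberg"]))
print(recommend_catalog(["iceberg", "delta"]))
```

Checking federation first is a deliberate choice in this sketch: an organization with several catalogs usually has to address that before format or cloud preferences become the deciding factor.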
Plan the migration:
- Inventory all HMS databases and tables. Identify which are actively queried vs. dormant.
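- Run dual registration during the transition: register each table in both HMS and the new catalog. Iceberg allows an existing metadata file to be registered in a second catalog, but only one catalog should accept commits at a time; otherwise the two metadata pointers diverge.
- Migrate engine configurations one at a time. Start with read-heavy engines (Trino), then move write-heavy engines (Spark).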
- Test RBAC policies before production cutover. Catalog migration is also a security boundary change.
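The inventory and dual-registration steps can be sketched as follows. The `TableRecord` fields, the 90-day dormancy threshold, the catalog name `lake`, and the sample paths are all assumptions for illustration. The generated statement uses Iceberg's `register_table` Spark procedure, which applies only to tables that are already Iceberg tables; Hive-format tables would first need converting (Iceberg ships `migrate` and `snapshot` procedures for that).

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class TableRecord:
    database: str
    name: str
    metadata_file: str  # location of the table's current Iceberg metadata JSON
    last_queried: date  # assumed to come from your own query logs


def split_active_dormant(
    tables: Iterable[TableRecord], today: date, dormant_after_days: int = 90
) -> Tuple[List[TableRecord], List[TableRecord]]:
    """Separate tables queried within the window from those that were not."""
    cutoff = today - timedelta(days=dormant_after_days)
    active: List[TableRecord] = []
    dormant: List[TableRecord] = []
    for table in tables:
        (active if table.last_queried >= cutoff else dormant).append(table)
    return active, dormant


def register_statement(catalog: str, table: TableRecord) -> str:
    """Spark SQL that registers an existing Iceberg table in another catalog.

    Inputs are assumed to be trusted; no quoting or escaping is applied.
    """
    return (
        f"CALL {catalog}.system.register_table("
        f"table => '{table.database}.{table.name}', "
        f"metadata_file => '{table.metadata_file}')"
    )


inventory = [
    TableRecord("sales", "orders",
                "s3://bucket/sales/orders/metadata/00003-abc.metadata.json",
                date(2025, 5, 20)),
    TableRecord("archive", "events_2019",
                "s3://bucket/archive/events_2019/metadata/00001-def.metadata.json",
                date(2024, 1, 5)),
]
active, dormant = split_active_dormant(inventory, today=date(2025, 6, 1))
for table in active:
    print(register_statement("lake", table))
```

Because a registered table in the new catalog points at the metadata file that was current at registration time, later commits made through HMS are not reflected there automatically. Keeping HMS as the only writer until cutover, and re-registering if needed, avoids that drift.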
Consider Gravitino as a bridge:
- If you cannot migrate all catalogs at once, Gravitino can federate HMS alongside the new catalog during transition.
- This avoids a "big bang" migration and allows gradual rollover.
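The gradual rollover idea can be illustrated with a small routing sketch: each namespace resolves to the legacy catalog until it is explicitly marked as migrated. This is plain Python and does not call Gravitino or any other catalog API; the class, its methods, and the catalog names `hms` and `new_catalog` are hypothetical.

```python
class TransitionRouter:
    """Resolve tables to the legacy or target catalog, one namespace at a time."""

    def __init__(self, legacy: str = "hms", target: str = "new_catalog") -> None:
        self.legacy = legacy
        self.target = target
        self._migrated = set()

    def mark_migrated(self, namespace: str) -> None:
        self._migrated.add(namespace)

    def roll_back(self, namespace: str) -> None:
        # discard() is a no-op if the namespace was never migrated.
        self._migrated.discard(namespace)

    def catalog_for(self, table: str) -> str:
        namespace, _, name = table.partition(".")
        if not namespace or not name:
            raise ValueError(f"expected 'namespace.table', got {table!r}")
        return self.target if namespace in self._migrated else self.legacy

    def qualified_name(self, table: str) -> str:
        return f"{self.catalog_for(table)}.{table}"


router = TransitionRouter()
print(router.qualified_name("sales.orders"))   # hms.sales.orders
router.mark_migrated("sales")
print(router.qualified_name("sales.orders"))   # new_catalog.sales.orders
```

Moving one namespace at a time keeps rollback cheap: reverting a namespace is a single state change rather than a second migration.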
What Changed Over Time
- In the Hadoop era, HMS was the only option for Hive-style data lakes, and every tool assumed it.
- Iceberg's REST Catalog Spec (2022-2023) established a vendor-neutral API, enabling the first generation of alternatives.
- Snowflake donated Polaris to Apache (2024), giving the community a production-grade REST catalog.
- Databricks open-sourced Unity Catalog (2024), adding multi-format support to the catalog landscape.
- Gravitino reached 1.0 (2025), providing federation for organizations that cannot commit to a single catalog.
- The "Catalog Wars" of 2025-2026 reflect the broader shift: metadata governance is the new competitive battleground.