Technology

AWS Glue Catalog

AWS's fully managed metadata catalog service that stores table definitions, partition information, and schema metadata for data stored in S3, serving as the default metastore for AWS analytics services.

Summary

Where it fits

Glue Catalog is the AWS-native metadata layer that connects S3-stored data to query engines like Athena, Redshift Spectrum, and EMR Spark. It replaces the need for a self-managed Hive Metastore in AWS-centric lakehouse deployments.

Misconceptions / Traps

Glue Catalog is not a query engine. It stores metadata only; actual query execution is handled by Athena, Spark, Trino, or other engines.
Glue Catalog's Iceberg support requires the Glue-specific catalog implementation. Not all Iceberg features (e.g., branching, tagging) are available through Glue's catalog API.
API call pricing can surprise at scale. GetTable, GetPartitions, and UpdateTable calls all count toward metered catalog requests, so high-frequency metadata access patterns amplify cost.
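
One way to reduce repeated metadata requests is a short-lived client-side cache in front of GetTable. The following is a minimal sketch, not an AWS-provided feature: the class name `CachedTableLookup`, the `ttl_seconds` default, and the `api_calls` counter are hypothetical names chosen for illustration. It assumes a client object that exposes `get_table(DatabaseName=..., Name=...)` and returns a dict with a `"Table"` key, as the boto3 Glue client does.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedTableLookup:
    """Caches GetTable responses for a short time-to-live (hypothetical helper)."""

    def __init__(self, glue_client: Any, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._glue = glue_client
        self._ttl = ttl_seconds
        self._clock = clock
        # (database, table) -> (time cached, table definition)
        self._cache: Dict[Tuple[str, str], Tuple[float, dict]] = {}
        self.api_calls = 0  # requests actually sent to the catalog

    def get_table(self, database: str, table: str) -> dict:
        key = (database, table)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # served locally, no catalog request
        # Cache miss or expired entry: ask the catalog once and remember the result.
        response = self._glue.get_table(DatabaseName=database, Name=table)
        self.api_calls += 1
        self._cache[key] = (now, response["Table"])
        return response["Table"]
```

With boto3 installed and credentials configured, this could be constructed as `CachedTableLookup(boto3.client("glue"))`. The trade-off is staleness: schema changes made within the TTL window are not seen until the entry expires.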

Key Connections

scoped_to Metadata Management — a managed metadata catalog
enables Athena, Apache Spark — provides table metadata for query execution
alternative_to Hive Metastore — AWS-managed alternative to self-hosted HMS
implements Iceberg REST Catalog Spec — supports Iceberg table registration

Definition

What it is

A fully managed metadata catalog service from AWS that stores table definitions, partition information, and schema metadata for data stored in S3. It serves as the default metastore for AWS analytics services.

Why it exists

S3 has no built-in concept of tables, schemas, or partitions. AWS Glue Catalog provides a centralized metadata registry so that Athena, Redshift Spectrum, EMR, and other engines can discover and query S3 data as structured tables without each engine maintaining its own metadata.
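
To make "table metadata over S3" concrete, the sketch below builds the kind of table definition the catalog stores for a Parquet dataset. It assumes the boto3 Glue client's `create_table(DatabaseName=..., TableInput=...)` call. The helper names `build_table_input` and `register_table`, and the database, table, and bucket names in the usage note, are hypothetical.

```python
from typing import Dict, List, Tuple


def build_table_input(name: str, s3_location: str,
                      columns: List[Tuple[str, str]],
                      partition_keys: List[Tuple[str, str]]) -> Dict:
    """Builds a TableInput dict describing an external Parquet table on S3."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        # Partition columns are declared separately from data columns.
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def register_table(glue_client, database: str, table_input: Dict) -> None:
    """Registers the table; glue_client is expected to be boto3.client('glue')."""
    glue_client.create_table(DatabaseName=database, TableInput=table_input)
```

For example, `build_table_input("orders", "s3://example-bucket/orders/", [("order_id", "bigint"), ("amount", "double")], [("dt", "string")])` describes a table partitioned by `dt`. Once registered, engines that read from the catalog can resolve the table name to its S3 location and schema.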

Primary use cases

Centralized table metadata for S3-based data lakes, Iceberg catalog backend on AWS, schema registry for ETL pipelines.
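
For the Iceberg catalog backend use case, a Spark session is typically pointed at Glue through Iceberg's catalog properties. The sketch below only assembles those properties as a dict. It assumes the Iceberg Spark runtime and AWS integration jars are on the classpath. The function name `glue_iceberg_spark_conf`, the catalog name, and the warehouse path are hypothetical.

```python
from typing import Dict


def glue_iceberg_spark_conf(catalog_name: str, warehouse: str) -> Dict[str, str]:
    """Returns Spark properties that back an Iceberg catalog with Glue."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions in Spark.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers an Iceberg catalog under the chosen name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Uses the Glue-specific catalog implementation for table metadata.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Reads and writes data files through Iceberg's S3 FileIO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }
```

Each key and value can then be passed to `SparkSession.builder.config(key, value)`, after which tables are addressed as `<catalog_name>.<database>.<table>`.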

Recent developments

Latest signals

Iceberg auto-optimization now supports delete-file compaction, nested types, and partition evolution. AWS Glue Data Catalog's automatic optimization for Iceberg tables added support for compaction of delete files, nested data types, partial progress commits, and partition evolution. The catalog now actively maintains tables rather than only storing their metadata. Per AWS: Glue Data Catalog Advanced Automatic Optimization.
Iceberg materialized views in Glue Data Catalog. Materialized views are now first-class catalog objects: query engines can rewrite queries to hit precomputed materializations, with Glue managing the refresh logic. This closes a feature gap with Databricks Unity Catalog. Per AWS: Apache Iceberg materialized views in Glue Data Catalog.
VPC-only Iceberg tables get auto-optimization too. Glue Data Catalog now optimizes Iceberg tables that are only accessible from a specific Amazon VPC. The catalog runs the compaction job inside the customer's VPC boundary, so sensitive tables stay private while still benefiting from managed optimization. Per AWS: Glue Data Catalog automatic optimization through Amazon VPC.
Snapshot retention and storage optimization are automated. Snapshot retention runs daily by default, removing snapshots older than the configured retention period while keeping the most recent N. Storage optimization is handled separately from compaction. The catalog is converging toward "self-managing Iceberg." Per AWS: Glue Data Catalog storage optimization for Iceberg.
Available in 14 regions, including major US, EU, and APAC regions. Auto-optimization has rolled out across N. Virginia, Ohio, Oregon, Ireland, London, Frankfurt, Stockholm, Tokyo, Seoul, Mumbai, Singapore, Sydney, São Paulo, and Canada Central. The geographic breadth signals that AWS treats this as a core platform capability, not a niche feature. Per AWS: Glue Data Catalog Iceberg automatic optimization VPC.
Auto-compaction monitors each partition and starts compaction when thresholds are crossed. The compaction optimizer continuously watches each partition; when file count or file-size distribution crosses configured thresholds, compaction runs. Customers no longer need to run manual compaction jobs such as Airflow DAGs. Per AWS: Accelerate queries through Glue auto compaction.
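
The snapshot retention rule described above (expire snapshots older than the retention period, but always keep the most recent N) can be sketched in a few lines. This illustrates the rule as stated here, not AWS's implementation. The `Snapshot` type and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    snapshot_id: int
    committed_at: datetime


def snapshots_to_expire(snapshots: List[Snapshot], now: datetime,
                        max_age: timedelta, keep_last: int) -> List[Snapshot]:
    """Selects snapshots older than max_age, excluding the keep_last newest."""
    newest_first = sorted(snapshots, key=lambda s: s.committed_at, reverse=True)
    # The most recent keep_last snapshots are never expired, whatever their age.
    protected = {s.snapshot_id for s in newest_first[:keep_last]}
    cutoff = now - max_age
    return [
        s for s in newest_first
        if s.committed_at < cutoff and s.snapshot_id not in protected
    ]
```

Both conditions must hold for a snapshot to be expired, so a rarely written table keeps its last N snapshots even when all of them are older than the retention period.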