Technology

OpenMetadata

An open-source metadata platform providing a centralized catalog for data discovery, quality, lineage, and governance across S3-based data lakes and lakehouses.

Summary

Where it fits

OpenMetadata sits in the governance and discovery layer above S3 storage and query engines. It ingests metadata from Iceberg tables, Spark jobs, Airflow DAGs, and other tools to provide a unified view of what data exists, who owns it, and how it flows through the organization.

Misconceptions / Traps
  • OpenMetadata is a metadata platform, not a query engine or a table catalog (metastore). It discovers and displays metadata from external systems (AWS Glue, Hive Metastore (HMS), Iceberg catalogs) but does not replace them.
  • Data quality checks in OpenMetadata require configuring profiler workflows. The platform does not automatically validate data without explicit setup.
  • Deploying OpenMetadata requires running its own backend services (API server, a MySQL or PostgreSQL database, an Elasticsearch or OpenSearch index, and typically Airflow for ingestion). It is not a lightweight tool.
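
The point about explicit setup for data quality can be illustrated with a minimal sketch. This is a toy model, not OpenMetadata's API: the function `run_quality_checks`, the check name `amount_not_null`, and the sample rows are all hypothetical. It only shows that a check produces a result when someone has configured it, and that nothing is validated otherwise.

```python
def run_quality_checks(rows, checks):
    """Run only the checks that were explicitly configured.

    `checks` maps a check name to a function that takes the rows
    and returns True (pass) or False (fail).
    """
    return {name: check(rows) for name, check in checks.items()}


# Hypothetical sample data with one null value in `amount`.
rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
]

# No configuration: nothing is validated, so the null amount goes unnoticed.
unconfigured = run_quality_checks(rows, checks={})

# Explicit configuration: a hypothetical not-null check on `amount`.
configured = run_quality_checks(
    rows,
    checks={"amount_not_null": lambda rs: all(r["amount"] is not None for r in rs)},
)

print(unconfigured)
print(configured)
```

In a real deployment the equivalent step would be defining and scheduling a profiler or test workflow against the table; the sketch only mirrors the "no configuration, no validation" behavior described above.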

Key Connections
  • scoped_to Metadata Management — centralized metadata discovery and governance
  • enables Audit Trails — tracks metadata change history
  • alternative_to DataHub, Apache Atlas — open-source metadata platform alternatives
  • depends_on AWS Glue Catalog, Hive Metastore — ingests metadata from catalogs

Definition

What it is

An open-source metadata platform that provides data discovery, lineage, quality, and governance for S3-based data lakes and lakehouses. It ingests metadata from catalogs, query engines, and pipelines to build a unified metadata graph.
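
To make the idea of a unified metadata graph concrete, here is a minimal sketch under stated assumptions. It is not OpenMetadata's data model or API: the `MetadataGraph` class, its methods, and the entity names are hypothetical. It shows entities annotated with owners, lineage edges between them, and a traversal that answers "where did this table come from?".

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MetadataGraph:
    """Toy metadata graph: entities with owners plus directed lineage edges."""

    owners: dict = field(default_factory=dict)  # entity name -> owner
    upstream: dict = field(default_factory=lambda: defaultdict(set))  # entity -> direct sources

    def add_entity(self, name, owner):
        self.owners[name] = owner

    def add_lineage(self, source, target):
        # Record that `target` is derived from `source`.
        self.upstream[target].add(source)

    def all_upstream(self, name):
        """Return every entity that `name` transitively depends on."""
        seen, stack = set(), [name]
        while stack:
            for parent in self.upstream.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical entities, as if ingested from storage, a table catalog, and a pipeline.
graph = MetadataGraph()
graph.add_entity("s3://lake/raw/orders", owner="ingest-team")
graph.add_entity("iceberg.sales.orders", owner="sales-eng")
graph.add_entity("iceberg.sales.daily_revenue", owner="analytics")
graph.add_lineage("s3://lake/raw/orders", "iceberg.sales.orders")
graph.add_lineage("iceberg.sales.orders", "iceberg.sales.daily_revenue")

print(sorted(graph.all_upstream("iceberg.sales.daily_revenue")))
```

Keeping ownership and lineage in one structure is what lets a single lookup answer both "who owns this?" and "what feeds it?", which is the practical benefit of centralizing metadata from several systems.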

Why it exists

As data lakes grow, teams lose track of what data exists, where it came from, who owns it, and whether it is trustworthy. OpenMetadata centralizes this information with automated metadata ingestion from S3-based sources.

Primary use cases

Data discovery and cataloging for S3 lakehouses, automated lineage tracking, data quality monitoring, and governance and ownership management.
