DataHub
An open-source metadata platform originally developed at LinkedIn that provides data discovery, lineage tracking, governance, and observability across data lake and lakehouse environments.
Summary
DataHub occupies the same governance layer as OpenMetadata, providing search-driven data discovery over S3-based assets. It differs in two ways: a stream-based metadata architecture built on Kafka, and a GraphQL API for programmatic metadata access.
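As a rough illustration of programmatic access through the GraphQL API, the sketch below builds a dataset search request using only the Python standard library. The host, port, token, and the `build_search_request` helper are assumptions made for this example. The query shape (`search` with a `SearchInput` of `type`, `query`, `start`, `count`) reflects the public GraphQL schema as commonly documented and should be checked against the deployed version.

```python
import json
import urllib.request

# Assumed values for illustration; replace with your deployment's details.
GRAPHQL_URL = "http://localhost:8080/api/graphql"  # hypothetical host and port
ACCESS_TOKEN = "<personal-access-token>"           # placeholder, not a real token

# Search query against the metadata graph; returns matching entity URNs.
SEARCH_QUERY = """
query searchDatasets($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_request(text, count=10):
    """Build (but do not send) a POST request that searches datasets by text."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": 0, "count": count}
        },
    }
    return urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {ACCESS_TOKEN}",
        },
        method="POST",
    )


request = build_search_request("orders")

# To run it against a live instance, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The request is only constructed here, not sent, so the sketch can be inspected without a running DataHub instance.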
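- Most DataHub ingestion connectors run as scheduled pull jobs rather than real-time push, so there is a delay between a change in a source system and its appearance in DataHub's catalog. Push-based emission over REST or Kafka is available, but the source system has to be instrumented to use it.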
- DataHub's Kafka-based metadata architecture adds operational complexity. A self-hosted deployment runs Kafka, Elasticsearch, a relational database such as MySQL, and a graph backend (Neo4j, or Elasticsearch used as the graph index).
- Lineage in DataHub depends on source system instrumentation. If a Spark job does not emit OpenLineage events, DataHub will not automatically detect the lineage.
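When a job emits no lineage events, the lineage has to be declared by hand. The sketch below shows, under simplifying assumptions, what such a declaration carries: a downstream dataset URN and the upstream URNs it was derived from. The `dataset_urn` and `upstream_lineage` helpers and the bucket paths are hypothetical. The record is modeled loosely on DataHub's `upstreamLineage` aspect with fields trimmed for readability, so it is not a complete payload for any DataHub endpoint.

```python
import json


def dataset_urn(platform, name, env="PROD"):
    """Format a dataset URN in the urn:li:dataset:(platform,name,env) style."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def upstream_lineage(downstream, upstreams):
    """Describe which upstream datasets a downstream dataset was built from.

    Simplified, illustrative record; a real aspect carries more fields.
    """
    return {
        "entityUrn": downstream,
        "aspectName": "upstreamLineage",
        "aspect": {
            "upstreams": [
                {"dataset": urn, "type": "TRANSFORMED"} for urn in upstreams
            ]
        },
    }


# Hypothetical S3 locations written by an uninstrumented Spark job.
raw = dataset_urn("s3", "example-bucket/raw/orders")
clean = dataset_urn("s3", "example-bucket/clean/orders")

record = upstream_lineage(clean, [raw])
print(json.dumps(record, indent=2))
```

In practice a record like this would be sent through one of DataHub's emitters rather than printed. The point of the sketch is that without instrumentation, someone has to supply the upstream and downstream identifiers explicitly.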
- scoped_to: Metadata Management (metadata discovery and governance platform)
- depends_on: Kafka Tiered Storage (uses Kafka for metadata event streaming)
- alternative_to: OpenMetadata, Apache Atlas (competing metadata platforms)
- enables: Audit Trails (lineage and change tracking for compliance)
Definition
An open-source metadata platform originally developed at LinkedIn that provides data discovery, lineage, governance, and observability for S3-based data lakes. Supports automated metadata ingestion from Iceberg, Delta, Hive, and other S3-centric sources.
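Automated ingestion is driven by a recipe that pairs a source with a sink. The sketch below assembles a minimal S3 recipe as a Python dictionary. The bucket path, region, server URL, and the `s3_recipe` helper are assumptions for illustration, and the field names (`path_specs`, `include`, `aws_config`, `datahub-rest`) follow a common reading of the S3 source documentation and should be verified against the installed version.

```python
import json


def s3_recipe(include_path, region, server):
    """Assemble an ingestion recipe: an S3 source paired with a REST sink."""
    return {
        "source": {
            "type": "s3",
            "config": {
                # Glob describing which objects to treat as datasets.
                "path_specs": [{"include": include_path}],
                "aws_config": {"aws_region": region},
                "env": "PROD",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": server},
        },
    }


recipe = s3_recipe(
    include_path="s3://example-bucket/clean/orders/*.parquet",  # hypothetical path
    region="us-east-1",                                         # hypothetical region
    server="http://localhost:8080",                             # hypothetical server
)

# Recipes are normally written as YAML; JSON is shown to avoid extra dependencies.
print(json.dumps(recipe, indent=2))
```

Keeping the recipe as plain data makes it easy to generate one per bucket or prefix and hand each to a scheduler, which matches the scheduled, pull-style ingestion most connectors use.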
Large organizations with S3-based data lakes need to know what data exists, how it flows between systems, and who is responsible for it. DataHub addresses this with a metadata graph that supports scheduled and event-driven ingestion, search, and governance workflows.
Typical use cases include enterprise data cataloging over S3 lakehouses, automated lineage from Spark and Airflow pipelines, and data governance and compliance workflows.
Resources
Official DataHub documentation for LinkedIn's open-source metadata platform providing discovery, governance, and lineage for data lake ecosystems.
DataHub source repository with the metadata graph, ingestion framework, and S3/Glue/Iceberg integration sources.
DataHub S3 source connector documentation for automated profiling and metadata extraction from S3-hosted datasets.