Technology

DataHub

Summary

What it is

An open-source metadata platform originally developed at LinkedIn that provides data discovery, lineage tracking, governance, and observability across data lake and lakehouse environments.

Where it fits

DataHub occupies the same governance layer as OpenMetadata, providing search-driven data discovery over S3-based assets. It differentiates itself with a stream-based metadata architecture (built on Kafka) and a GraphQL API for programmatic metadata access.
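As a rough illustration of programmatic access through the GraphQL API, the sketch below builds a dataset search request using only the Python standard library. The endpoint URL (a local quickstart address), the token value, and the helper names `build_search_request` and `run_search` are assumptions made for this example; the query shape follows the `search` query in DataHub's GraphQL schema, but field availability should be checked against the deployed version.

```python
import json
import urllib.request

# Assumed endpoint: a local quickstart frontend. Adjust for a real deployment.
GRAPHQL_URL = "http://localhost:9002/api/graphql"

SEARCH_QUERY = """
query searchDatasets($text: String!, $count: Int!) {
  search(input: {type: DATASET, query: $text, start: 0, count: $count}) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_request(text, token, count=10, url=GRAPHQL_URL):
    """Build (but do not send) a POST request for a dataset search."""
    payload = {"query": SEARCH_QUERY, "variables": {"text": text, "count": count}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Personal access token; the value is supplied by the caller.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_search(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `run_search(build_search_request("orders", token))` against a running instance would return the matching dataset URNs. Building the request separately from sending it keeps the payload easy to inspect without a live server.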

Misconceptions / Traps
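  • DataHub's connector-based ingestion is mostly pull-based and runs on a schedule, so it is not real-time by default. Changes in a source system appear in DataHub's catalog only after the next ingestion run. Push-based emission over REST or Kafka is supported, but producers must be instrumented to use it.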
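  • DataHub's multi-component architecture adds operational complexity. Kafka carries metadata change events rather than storing the metadata itself. A self-hosted deployment requires Kafka, Elasticsearch (search index), a relational database such as MySQL (primary metadata store), and a graph index backed by either Neo4j or Elasticsearch.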
  • Lineage in DataHub depends on source system instrumentation. If a Spark job does not emit OpenLineage events, DataHub will not automatically detect the lineage.
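The lineage point above can be made concrete with a small sketch of what a job has to declare when lineage is not captured automatically: the URN of each input dataset and the URN of the output. The URN layout matches DataHub's dataset URN format; `LineageEdge`, `declare_lineage`, and the bucket paths are hypothetical names used only for illustration. The official Python SDK ships comparable helpers in `datahub.emitter.mce_builder`, which would normally be used instead of hand-rolled code like this.

```python
from dataclasses import dataclass


def make_dataset_urn(platform, name, env="PROD"):
    """Format a dataset URN in the urn:li:dataset:(...) layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass(frozen=True)
class LineageEdge:
    """Hypothetical container for one upstream -> downstream relationship."""
    upstream: str
    downstream: str


def declare_lineage(inputs, output):
    """Return one edge per input dataset feeding the output dataset."""
    return [LineageEdge(upstream=src, downstream=output) for src in inputs]


# Placeholder S3 paths for a job that cleans raw order data.
raw = make_dataset_urn("s3", "my-bucket/raw/orders")
clean = make_dataset_urn("s3", "my-bucket/clean/orders")
edges = declare_lineage([raw], clean)
```

If no component of the pipeline produces edges like these and sends them to DataHub, the catalog shows both datasets but no relationship between them.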

Key Connections
  • scoped_to Metadata Management — metadata discovery and governance platform
  • depends_on Kafka Tiered Storage — uses Kafka for metadata event streaming
  • alternative_to OpenMetadata, Apache Atlas — competing metadata platforms
  • enables Audit Trails — lineage and change tracking for compliance

Definition

What it is

An open-source metadata platform originally developed at LinkedIn that provides data discovery, lineage, governance, and observability for S3-based data lakes. It supports automated metadata ingestion from Iceberg, Delta Lake, Hive, and other S3-centric sources.
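Automated ingestion is typically described by a recipe that names a source and a sink. The sketch below expresses a minimal recipe as a Python dictionary, assuming a Hive source and a REST sink on local placeholder addresses; `check_recipe` is a hypothetical sanity check written for this example, not part of DataHub.

```python
# Placeholder hosts throughout; a real deployment supplies its own values.
recipe = {
    "source": {
        "type": "hive",
        "config": {"host_port": "localhost:10000"},
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}


def check_recipe(recipe):
    """Hypothetical check: both blocks must exist and name a type."""
    missing = [
        key for key in ("source", "sink")
        if "type" not in recipe.get(key, {})
    ]
    if missing:
        raise ValueError(f"recipe is missing a type for: {', '.join(missing)}")
    return True
```

With the `acryl-datahub` package installed, a dictionary of this shape can be handed to `Pipeline.create(...)` from `datahub.ingestion.run.pipeline` and executed with `run()`. Each run pulls a snapshot, so the catalog is only as fresh as the most recent run.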

Why it exists

Large organizations with S3-based data lakes need to know what data exists, how it flows between systems, and who is responsible for it. DataHub provides a metadata graph that supports both scheduled and event-driven ingestion, along with search and governance workflows.

Primary use cases

Enterprise data cataloging over S3 lakehouses, automated lineage from Spark/Airflow pipelines, data governance and compliance workflows.
