Technology

DataHub

Summary

What it is

An open-source metadata platform originally developed at LinkedIn that provides data discovery, lineage tracking, governance, and observability across data lake and lakehouse environments.

Where it fits

DataHub occupies the same governance layer as OpenMetadata, providing search-driven data discovery over S3-based assets. It differentiates itself with a stream-based metadata architecture (built on Kafka) and a GraphQL API for programmatic metadata access.
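As a rough illustration of programmatic access through the GraphQL API, the sketch below builds a dataset search request using only the Python standard library. The endpoint URL, the bearer token, and the helper names (`build_search_payload`, `build_request`, `search_datasets`) are assumptions made for this example; the query shape should be checked against the GraphQL schema of the DataHub version in use.

```python
import json
import urllib.request

# Assumption: a local deployment exposing GraphQL at this path.
# Adjust host, port, and path for your environment.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

# A minimal search query; the fields requested here are kept deliberately small.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_payload(text, count=10):
    """Build the JSON body for a dataset search (hypothetical helper)."""
    return {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": 0, "count": count}
        },
    }


def build_request(payload, token=None, url=GRAPHQL_URL):
    """Wrap the payload in an HTTP POST request, optionally with a bearer token."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def search_datasets(text, token=None):
    """Send the search; this needs a reachable DataHub instance."""
    request = build_request(build_search_payload(text), token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `search_datasets("orders")` against a running instance would return the raw GraphQL response as a dictionary; building the payload and the request separately keeps the request logic inspectable without a live server.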

Misconceptions / Traps

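DataHub's connector-based ingestion is mostly source-pull, run on a schedule, rather than real-time push. Unless a source pushes events itself (through the Kafka or REST emitters), there is a delay between changes in source systems and their appearance in DataHub's catalog.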
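DataHub's Kafka-based metadata event stream adds operational complexity. A self-hosted deployment typically runs Kafka, a relational database (MySQL by default), Elasticsearch or OpenSearch, and optionally Neo4j as the graph backend; Elasticsearch can serve the graph index when Neo4j is not deployed.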
Lineage in DataHub depends on source system instrumentation. If a Spark job does not emit OpenLineage events, DataHub will not automatically detect the lineage.
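To make the lineage point concrete: when a pipeline is not instrumented, the edge between datasets has to be stated explicitly by whoever emits the metadata. The sketch below shows roughly what that statement contains. The helper names (`dataset_urn`, `upstream_lineage`) and the bucket paths are hypothetical, and the dictionary only approximates the shape of DataHub's upstream lineage aspect; it is not the exact wire format.

```python
def dataset_urn(platform, name, env="PROD"):
    """Format a dataset URN in DataHub's documented layout (local stand-in helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def upstream_lineage(upstream_urns):
    """Approximate an upstream lineage aspect: one entry per upstream dataset.

    Assumption: field names mirror the upstreamLineage aspect; verify against
    the metadata model before relying on them.
    """
    return {
        "upstreams": [
            {"dataset": urn, "type": "TRANSFORMED"} for urn in upstream_urns
        ]
    }


# Hypothetical example: a curated table derived from one raw S3 dataset.
raw = dataset_urn("s3", "example-bucket/raw/orders")
curated = dataset_urn("s3", "example-bucket/curated/orders_daily")

lineage_for_curated = {
    "entityUrn": curated,
    "aspectName": "upstreamLineage",
    "aspect": upstream_lineage([raw]),
}
```

Something has to produce a record like `lineage_for_curated` for every edge, whether that is a Spark or Airflow integration or a hand-written emitter; without it the two datasets appear in the catalog as unrelated assets.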

Key Connections

scoped_to Metadata Management — metadata discovery and governance platform
depends_on Kafka Tiered Storage — uses Kafka for metadata event streaming
alternative_to OpenMetadata, Apache Atlas — competing metadata platforms
enables Audit Trails — lineage and change tracking for compliance

Definition

What it is

An open-source metadata platform originally developed at LinkedIn that provides data discovery, lineage, governance, and observability for S3-based data lakes. Supports automated metadata ingestion from Iceberg, Delta, Hive, and other S3-centric sources.
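Automated ingestion is usually described by a recipe that pairs a source with a sink. The sketch below expresses such a recipe as a Python dictionary. The bucket path, region, server URL, and the `build_recipe` helper are hypothetical, and the configuration keys reflect the S3 source as commonly documented; they should be confirmed against the connector reference for the installed version.

```python
import json


def build_recipe(include_path, server, region="us-east-1"):
    """Assemble an ingestion recipe: where to read metadata from, where to send it."""
    return {
        "source": {
            "type": "s3",
            "config": {
                # Assumption: path_specs/include and aws_config follow the S3 source docs.
                "path_specs": [{"include": include_path}],
                "aws_config": {"aws_region": region},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": server},
        },
    }


# Hypothetical values for illustration only.
recipe = build_recipe(
    "s3://example-bucket/warehouse/{table}/*.parquet",
    "http://localhost:8080",
)
print(json.dumps(recipe, indent=2))

# With the acryl-datahub package installed, a dict of this shape can be passed to
# datahub.ingestion.run.pipeline.Pipeline.create(recipe) and then run.
```

Swapping the source type (for example to a Hive, Iceberg, or Delta Lake connector) changes the `source` block while the sink stays the same, which is what makes scheduled ingestion across many sources manageable.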

Why it exists

Large organizations with S3-based data lakes need to know what data exists, how it flows between systems, and who is responsible for it. DataHub provides a metadata graph with real-time ingestion, search, and governance workflows.

Primary use cases

Enterprise data cataloging over S3 lakehouses, automated lineage from Spark/Airflow pipelines, data governance and compliance workflows.

Recent developments

Latest signals

Latest release: v1.6.0 (May 21, 2026). This is the current OSS stable line; v1.6.0rc1 and v1.6.0rc2 were its pre-releases. Per datahub-project/datahub releases.
DataHub Cloud v1 launched as a "context platform" for analytics agents (May 28, 2026). The release repositions DataHub as a context layer between analytics agents (Databricks Genie, Snowflake Intelligence) and enterprise data. It feeds agents unified metadata, semantic definitions (dbt, Power BI), and institutional knowledge (Notion, Confluence) so that they generate correct SQL; DataHub reports that this pushes agent accuracy beyond 90%. This is the same agent-context pattern as the broader MCP wave, applied to the metadata catalog. Per DataHub Launches Breakthrough Release for Analytics Agents.
Open-source DataHub v1.6.0 (May 21, 2026) went V2-UI-only. The OSS release removed the legacy V1 UI entirely and moved datahub-frontend onto Play 3 + Apache Pekko for improved security and maintainability. Per DataHub Releases (docs.datahub.com).
#1 open-source AI data catalog framing — 80+ production-grade connectors, MCP support. Per the DataHub Project GitHub organization, DataHub positions itself as "The Context Platform for your Data and AI Stack" with 80+ production-grade connectors, real-time streaming metadata updates, AI-readiness with MCP support, and LLM-friendly metadata exposure. The MCP integration is the same architectural pattern that Snowflake, Databricks, and dlt are converging on: catalog metadata becomes a primary surface for AI-assisted analytics rather than a backend governance layer.
Active ecosystem positioning vs OpenMetadata. Per the 16 Best Data Catalog Tools in 2026 buyer's guide, DataHub is recommended for organizations that want "most active open-source community, API-first metadata ingestion, Python/Java engineering resources" — 11,600+ GitHub stars and a three-year head start over OpenMetadata on community size. The 2026 framing positions DataHub as the choice when engineering resources are available; managed-product alternatives (Atlan, Collibra) cover the no-engineering-team segment.