Standard

OpenLineage

An open standard that defines a common JSON schema for capturing data lineage events — what datasets were consumed, what was produced, and how transformations connected them.


Summary

Where it fits

OpenLineage fills the lineage gap in S3 lakehouse observability. When pipelines span Spark, Airflow, Flink, and dbt across multiple S3-backed tables, each tool can emit lineage in the same standard event format, so the events can be stitched into a single lineage graph regardless of which tool runs a given job.
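As a rough illustration of that stitching step, the sketch below merges events from two different tools into one dataset-to-dataset graph. It assumes each event is already parsed into a dict with `inputs` and `outputs` lists whose entries carry `namespace` and `name`. The bucket, job names, and the helpers `dataset_id` and `build_lineage_graph` are hypothetical and not part of any OpenLineage library.

```python
from collections import defaultdict


def dataset_id(ds):
    """Join namespace and name into one identifier (illustrative convention)."""
    return f'{ds["namespace"]}/{ds["name"]}'


def build_lineage_graph(events):
    """Map each input dataset to the set of datasets produced from it."""
    downstream = defaultdict(set)
    for ev in events:
        for src in ev.get("inputs", []):
            for dst in ev.get("outputs", []):
                downstream[dataset_id(src)].add(dataset_id(dst))
    return dict(downstream)


# Two hypothetical events, one emitted by a Spark job and one by a dbt model.
events = [
    {
        "job": {"namespace": "spark", "name": "ingest_orders"},
        "inputs": [{"namespace": "s3://example-bucket", "name": "raw/orders"}],
        "outputs": [{"namespace": "s3://example-bucket", "name": "staged/orders"}],
    },
    {
        "job": {"namespace": "dbt", "name": "orders_daily"},
        "inputs": [{"namespace": "s3://example-bucket", "name": "staged/orders"}],
        "outputs": [{"namespace": "s3://example-bucket", "name": "curated/orders_daily"}],
    },
]

graph = build_lineage_graph(events)
for src, dsts in sorted(graph.items()):
    print(src, "->", sorted(dsts))
```

Because both tools name the shared table the same way, the Spark output and the dbt input collapse into one node, which is what connects the two jobs in the graph.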

Misconceptions / Traps
  • OpenLineage is a standard, not a product. It has no UI — you need a backend like Marquez or Datakin to store and visualize the lineage events.
  • Integration quality varies by tool. Some integrations (Spark) are mature; others (Flink) are still developing.

Key Connections
  • enables Marquez — the reference implementation that stores and visualizes OpenLineage events
  • scoped_to Lakehouse, S3 — lineage tracking for S3 lakehouse pipelines

Definition

What it is

An open standard for data lineage collection that defines a common JSON schema for capturing metadata about data pipeline runs — what datasets were consumed, what was produced, and what transformations occurred.
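To make the schema concrete, here is a minimal sketch of a run event built as a plain Python dict and serialized to JSON. The top-level field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`) follow the OpenLineage run event structure. The namespace, job name, bucket, producer URL, and the spec version inside `schemaURL` are illustrative assumptions, and `make_run_event` is a hypothetical helper rather than a library function.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a minimal OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": "example-pipelines", "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Assumed values: replace with the real producer and spec version.
        "producer": "https://example.com/hypothetical-producer",
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = make_run_event(
    "COMPLETE",
    "daily_orders_rollup",
    inputs=[("s3://example-bucket", "raw/orders")],
    outputs=[("s3://example-bucket", "curated/orders_daily")],
)
print(json.dumps(event, indent=2))
```

In practice an integration or client library would emit these events; the hand-built dict only shows which pieces of information an event carries.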

Why it exists

Data lineage information was historically locked inside individual pipeline tools (Airflow, Spark, dbt), each with its own format. OpenLineage provides a vendor-neutral open standard so that lineage events from any tool can be collected, correlated, and queried in a consistent format.

Primary use cases

Cross-tool data lineage tracking for S3 lakehouse pipelines, regulatory compliance auditing, and pipeline impact analysis and debugging.
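Impact analysis, for instance, reduces to a graph traversal once lineage has been collected. The sketch below assumes the lineage is already available as a dict mapping each dataset to the datasets derived directly from it; the dataset names and the `downstream_of` helper are hypothetical.

```python
from collections import deque


def downstream_of(graph, start):
    """Return every dataset reachable from `start`, excluding `start` itself."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical lineage: dataset -> datasets derived directly from it.
lineage = {
    "raw/orders": {"curated/orders_daily"},
    "curated/orders_daily": {"marts/revenue", "marts/churn"},
    "raw/users": {"marts/churn"},
}

affected = downstream_of(lineage, "raw/orders")
print(sorted(affected))
```

A breadth-first search is used here only for simplicity; any traversal that tracks visited nodes would answer the same question, namely which tables need attention if `raw/orders` changes or breaks.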
