Standard

OpenLineage

An open standard that defines a common JSON schema for capturing data lineage events — what datasets were consumed, what was produced, and how transformations connected them.

Summary

What it is

An open standard that defines a common JSON schema for capturing data lineage events — what datasets were consumed, what was produced, and how transformations connected them.

Where it fits

OpenLineage fills the lineage-observability gap in S3 lakehouses. As pipelines span Spark, Airflow, Flink, and dbt across multiple S3-backed tables, OpenLineage provides the standard event format for stitching their lineage into one complete graph, regardless of which tool runs the job.
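As a rough illustration of that stitching, the sketch below merges trimmed-down events from two different tools into one dataset-to-job-to-dataset graph. The events, bucket name, job names, and helper functions are all hypothetical; real events carry more fields, and a backend would normally do this work.

```python
from collections import defaultdict

# Hypothetical, trimmed-down lineage events emitted by two different tools.
events = [
    {
        "producer": "spark-job",
        "job": {"namespace": "etl", "name": "clean_orders"},
        "inputs": [{"namespace": "s3://example-bucket", "name": "raw/orders"}],
        "outputs": [{"namespace": "s3://example-bucket", "name": "clean/orders"}],
    },
    {
        "producer": "dbt-model",
        "job": {"namespace": "analytics", "name": "orders_daily"},
        "inputs": [{"namespace": "s3://example-bucket", "name": "clean/orders"}],
        "outputs": [{"namespace": "s3://example-bucket", "name": "marts/orders_daily"}],
    },
]


def dataset_id(dataset):
    """Datasets are identified by namespace plus name, whichever tool reported them."""
    return "dataset:" + dataset["namespace"] + "/" + dataset["name"]


def job_id(job):
    return "job:" + job["namespace"] + "." + job["name"]


def build_graph(lineage_events):
    """Return adjacency sets: input dataset -> job -> output dataset."""
    graph = defaultdict(set)
    for event in lineage_events:
        job = job_id(event["job"])
        for dataset in event.get("inputs", []):
            graph[dataset_id(dataset)].add(job)
        for dataset in event.get("outputs", []):
            graph[job].add(dataset_id(dataset))
    return graph


graph = build_graph(events)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The two jobs connect only because both events name the shared dataset identically, which is the point of agreeing on one format.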

Misconceptions / Traps

OpenLineage is a standard, not a product. It has no UI — you need a backend like Marquez or Datakin to store and visualize the lineage events.
Integration quality varies by tool. Some integrations (Spark) are mature; others (Flink) are still developing.
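Because the standard itself ships no storage or UI, events have to be sent to a backend. A minimal sketch, assuming a Marquez instance reachable at localhost:5000 that accepts events on /api/v1/lineage (adjust both for your deployment). The event contents and helper names are hypothetical, and the example only builds the HTTP request rather than sending it.

```python
import json
import urllib.request

# Assumption: a local Marquez instance exposing its lineage endpoint here.
MARQUEZ_LINEAGE_URL = "http://localhost:5000/api/v1/lineage"

# Hypothetical, trimmed-down event; a spec-valid event carries more fields.
sample_event = {
    "eventType": "COMPLETE",
    "eventTime": "2025-01-01T00:00:00+00:00",
    "run": {"runId": "3f2b8c1e-0d4a-4c57-9a70-1f2e3d4c5b6a"},
    "job": {"namespace": "etl", "name": "clean_orders"},
    "inputs": [{"namespace": "s3://example-bucket", "name": "raw/orders"}],
    "outputs": [{"namespace": "s3://example-bucket", "name": "clean/orders"}],
    "producer": "https://example.com/hypothetical-pipeline",
}


def build_lineage_request(event, url=MARQUEZ_LINEAGE_URL):
    """Wrap one lineage event in an HTTP POST request with a JSON body."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(event, url=MARQUEZ_LINEAGE_URL):
    """Send the event and return the HTTP status; needs a running backend."""
    with urllib.request.urlopen(build_lineage_request(event, url), timeout=10) as response:
        return response.status


request = build_lineage_request(sample_event)
print(request.get_method(), request.full_url)
```

Calling send_event(sample_event) would perform the actual POST. Keeping request construction separate makes it easy to point the same events at a different backend.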

Key Connections

enables Marquez — the reference implementation that stores and visualizes OpenLineage events
scoped_to Lakehouse, S3 — lineage tracking for S3 lakehouse pipelines

Definition

What it is

An open standard for data lineage collection that defines a common JSON schema for capturing metadata about data pipeline runs — what datasets were consumed, what was produced, and what transformations occurred.
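To make the shape of such an event concrete, here is a minimal sketch that builds a run event as a plain Python dict and serializes it to JSON. The field names follow the core of the published event model as far as I know it; the namespace, job name, dataset names, producer URL, and the make_run_event helper are hypothetical. The published spec defines further fields (for example a schemaURL and optional facets) that are left out here.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_name, inputs, outputs, namespace="example-namespace"):
    """Build a dict shaped like a lineage run event (illustrative, not spec-complete)."""
    return {
        "eventType": event_type,  # e.g. "START", "COMPLETE", "FAIL"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},  # one ID per pipeline run
        "job": {"namespace": namespace, "name": job_name},
        # What was consumed and what was produced, as (namespace, name) pairs.
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Identifies the emitting code; this URL is a placeholder.
        "producer": "https://example.com/hypothetical-pipeline",
    }


event = make_run_event(
    "COMPLETE",
    "daily_orders_rollup",
    inputs=[("s3://example-bucket", "raw/orders")],
    outputs=[("s3://example-bucket", "curated/orders_daily")],
)
print(json.dumps(event, indent=2))
```

In practice the integrations for Spark, Airflow, and similar tools emit these events automatically; hand-building one is mainly useful for understanding the schema or instrumenting a custom job.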

Why it exists

Data lineage information was historically locked inside individual pipeline tools (Airflow, Spark, dbt). OpenLineage provides a vendor-neutral open standard so that lineage events from any tool can be collected, correlated, and queried in a consistent format.

Primary use cases

Cross-tool data lineage tracking for S3 lakehouse pipelines, regulatory compliance auditing, and pipeline impact analysis and debugging.
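Impact analysis, for instance, reduces to a graph traversal once lineage is collected. A minimal sketch, assuming the lineage has already been flattened into a simple adjacency map; the dataset paths, job names, and the downstream helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each node maps to the nodes directly downstream of it.
lineage = {
    "s3://example-bucket/raw/orders": ["job:clean_orders"],
    "job:clean_orders": ["s3://example-bucket/clean/orders"],
    "s3://example-bucket/clean/orders": ["job:orders_daily", "job:fraud_features"],
    "job:orders_daily": ["s3://example-bucket/marts/orders_daily"],
    "job:fraud_features": ["s3://example-bucket/features/fraud"],
}


def downstream(graph, start):
    """Breadth-first walk returning every job and dataset affected by a change to start."""
    seen = {start}
    queue = deque([start])
    affected = []
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                affected.append(neighbor)
                queue.append(neighbor)
    return affected


impacted = downstream(lineage, "s3://example-bucket/clean/orders")
for node in impacted:
    print(node)
```

Reversing the edges and running the same traversal gives the upstream view, which is the usual starting point when debugging a bad output table.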

Recent developments

Latest signals

LF AI & Data Foundation Graduate project. OpenLineage achieved Graduate-tier status in the LF AI & Data Foundation — the highest open-source maturity signal in the foundation's tiering. Per GitHub — OpenLineage/OpenLineage.
Positioned as "OpenTelemetry for data pipelines." OpenLineage explicitly frames itself as the data-pipeline analog of OpenTelemetry: an API to collect lineage events, agnostic to the backend, aimed at being embedded in every data-processing engine. Per OpenLineage Blog — How OpenLineage takes inspiration from OpenTelemetry.
Trino added native OpenLineage integration alongside OpenTelemetry. Trino's adoption is a bellwether: when a leading lakehouse query engine ships OpenLineage and OpenTelemetry side by side, it strengthens the "two standards, complementary models" framing over the "one standard for everything" camp. Per Improving — Effective Data Lineage Strategies for Real-Time Systems.
First-class integrations: Airflow, Spark, dbt, Flink. Four of the most widely deployed data-pipeline tools all ship native OpenLineage emitters, so collection coverage is no longer the main gap. The 2026 work is on consumption and visualization (Marquez, DataHub, Atlan, etc.). Per GitHub — OpenLineage/OpenLineage.
USENIX SREcon EMEA 2025: OpenLineage as foundational layer for data reliability. Recognized at the SRE conference circuit — Obuchowski's talk frames OpenLineage as the load-bearing instrumentation layer for "data SRE" the same way OTel grounded service SRE. Per USENIX SREcon EMEA 25 — Cross-Platform Data Lineage with OpenLineage.
OTel spec issue #3447 explores modeling lineage in OTel directly. There is active discussion in the OpenTelemetry specification repo on whether to model data lineage as native OTel signals. The outcome could shape whether OpenLineage stays a sister project or eventually folds into OpenTelemetry. Per open-telemetry/opentelemetry-specification Issue #3447.