OpenLineage
An open standard that defines a common JSON schema for capturing data lineage events — what datasets were consumed, what was produced, and how transformations connected them.
Summary
OpenLineage fills an observability gap in S3 lakehouses. As pipelines span Spark, Airflow, Flink, and dbt across multiple S3-backed tables, OpenLineage provides a standard event format for stitching lineage into a single graph, regardless of which engine or orchestrator runs the job.
- OpenLineage is a standard, not a product. It has no UI — you need a backend like Marquez or Datakin to store and visualize the lineage events.
- Integration quality varies by tool. Some integrations (Spark) are mature; others (Flink) are still developing.
enables: Marquez, the reference implementation that stores and visualizes OpenLineage events
scoped_to: Lakehouse, S3 (lineage tracking for S3 lakehouse pipelines)
Definition
An open standard for data lineage collection that defines a common JSON schema for capturing metadata about data pipeline runs — what datasets were consumed, what was produced, and what transformations occurred.
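To make the shape of such an event concrete, here is a minimal sketch that builds one run event as a plain Python dict and serializes it to JSON. The field names follow the published OpenLineage run event layout (eventType, eventTime, run, job, inputs, outputs, producer, schemaURL). The helper name `make_run_event`, the job and dataset names, the producer URL, and the spec version inside `schemaURL` are all assumptions chosen for illustration; check the specification for the current schema version.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_namespace, job_name, inputs, outputs, run_id=None):
    """Build a dict shaped like an OpenLineage run event.

    `inputs` and `outputs` are lists of (namespace, name) tuples.
    This is a hypothetical helper, not part of any OpenLineage client.
    """
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Assumption: placeholder producer URI identifying the emitting code.
        "producer": "https://example.com/hypothetical-producer",
        # Assumption: the spec version in this URL may not be the latest.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


# Illustrative job that reads one S3 dataset and writes another.
event = make_run_event(
    "COMPLETE",
    "demo",
    "clean_orders",
    inputs=[("s3://raw-bucket", "orders/2024")],
    outputs=[("s3://curated-bucket", "orders_clean")],
)
payload = json.dumps(event, indent=2)
print(payload)
```

The event carries only identifiers and timing here; real integrations typically attach additional metadata (facets) to the run, job, and datasets.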
Data lineage information was historically locked inside individual orchestration tools (Airflow, Spark, dbt). OpenLineage provides a vendor-neutral, open standard so that lineage events from any tool can be collected, correlated, and queried in a consistent format.
Typical uses: cross-tool data lineage tracking for S3 lakehouse pipelines, regulatory compliance auditing, and pipeline impact analysis and debugging.
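Cross-tool tracking and impact analysis both reduce to the same idea: events from different tools that name the same dataset can be joined into one graph, which can then be walked downstream. The sketch below assumes events are dicts with `inputs` and `outputs` lists of `{namespace, name}` objects; the function names and the sample Spark and dbt jobs are hypothetical, and a real backend such as Marquez would do this in its own storage layer.

```python
from collections import defaultdict, deque


def dataset_key(dataset):
    """Identify a dataset by its (namespace, name) pair."""
    return (dataset["namespace"], dataset["name"])


def build_edges(events):
    """Map each input dataset to the datasets produced from it."""
    edges = defaultdict(set)
    for event in events:
        for src in event.get("inputs", []):
            for dst in event.get("outputs", []):
                edges[dataset_key(src)].add(dataset_key(dst))
    return edges


def downstream(edges, start):
    """Breadth-first walk: every dataset reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Two hypothetical events emitted by different tools.
spark_event = {
    "job": {"namespace": "spark", "name": "clean_orders"},
    "inputs": [{"namespace": "s3://raw", "name": "orders"}],
    "outputs": [{"namespace": "s3://lake", "name": "orders_clean"}],
}
dbt_event = {
    "job": {"namespace": "dbt", "name": "daily_revenue"},
    "inputs": [{"namespace": "s3://lake", "name": "orders_clean"}],
    "outputs": [{"namespace": "s3://lake", "name": "daily_revenue"}],
}

edges = build_edges([spark_event, dbt_event])
impacted = downstream(edges, ("s3://raw", "orders"))
print(sorted(impacted))
```

Because both tools describe `orders_clean` with the same namespace and name, a change to the raw `orders` data is traced through the Spark job into the dbt model without either tool knowing about the other.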
Recent developments
- LF AI & Data Foundation Graduate project. OpenLineage achieved Graduate-tier status in the LF AI & Data Foundation — the highest open-source maturity signal in the foundation's tiering. Per GitHub — OpenLineage/OpenLineage.
- Positioned as "OpenTelemetry for data pipelines." OpenLineage explicitly frames itself as the data-pipeline analog of OpenTelemetry: an API to collect lineage events, agnostic to the backend, aimed at being embedded in every data-processing engine. Per OpenLineage Blog — How OpenLineage takes inspiration from OpenTelemetry.
- Trino added native OpenLineage integration alongside OpenTelemetry. Trino's adoption is a bellwether: when a leading lakehouse query engine ships OpenLineage and OpenTelemetry side by side, it strengthens the case for two complementary standards over a single standard for everything. Per Improving — Effective Data Lineage Strategies for Real-Time Systems.
- First-class integrations: Airflow, Spark, dbt, Flink. The four most-deployed data-pipeline tools all ship native OpenLineage emitters — collection coverage is no longer the gap. The 2026 work is on consumption + visualization (Marquez, DataHub, Atlan, etc.). Per GitHub — OpenLineage/OpenLineage.
- USENIX SREcon EMEA 2025: OpenLineage as foundational layer for data reliability. Recognized at the SRE conference circuit — Obuchowski's talk frames OpenLineage as the load-bearing instrumentation layer for "data SRE" the same way OTel grounded service SRE. Per USENIX SREcon EMEA 25 — Cross-Platform Data Lineage with OpenLineage.
- OTel spec issue #3447 explores modeling lineage in OTel directly. An active discussion in the OpenTelemetry specification repo considers whether to model data lineage as native OTel signals. The outcome will shape whether OpenLineage stays a sister project or eventually folds into OTel. Per open-telemetry/opentelemetry-specification Issue #3447.
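The "collect once, stay agnostic to the backend" idea borrowed from OpenTelemetry can be sketched as an emitter that depends only on a small transport interface. Everything here (`Transport`, `InMemoryTransport`, `ConsoleTransport`, `LineageEmitter`) is a hypothetical illustration of the pattern, not the API of the OpenLineage client libraries.

```python
import json
from typing import Protocol


class Transport(Protocol):
    """Anything that can deliver a serialized event to some backend."""

    def send(self, payload: str) -> None: ...


class InMemoryTransport:
    """Collects payloads in a list; useful for tests."""

    def __init__(self):
        self.sent = []

    def send(self, payload: str) -> None:
        self.sent.append(payload)


class ConsoleTransport:
    """Prints payloads; stands in for an HTTP or message-queue backend."""

    def send(self, payload: str) -> None:
        print(payload)


class LineageEmitter:
    """Serializes events and hands them to whichever transport is configured."""

    def __init__(self, transport: Transport):
        self.transport = transport

    def emit(self, event: dict) -> None:
        self.transport.send(json.dumps(event, sort_keys=True))


# The pipeline code calls emit() the same way regardless of the backend.
memory = InMemoryTransport()
emitter = LineageEmitter(memory)
emitter.emit({"eventType": "START", "job": {"namespace": "demo", "name": "etl"}})
```

Swapping `InMemoryTransport` for `ConsoleTransport` (or any other object with a `send` method) changes where events go without touching the code that produces them.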
Resources
Official OpenLineage specification site with the JSON schema, integration guides, and ecosystem documentation.
Source repository for the OpenLineage spec with integration libraries for Python, Java, and popular orchestrators.
Overview of the data lineage ecosystem in 2025 covering OpenLineage's role as the emerging standard.