Apache Airflow

A platform for programmatically authoring, scheduling, and monitoring workflows as directed acyclic graphs (DAGs) written in Python. The industry standard for batch data pipeline orchestration.

Summary

Where it fits

Airflow is the scheduler and coordinator for batch ETL/ELT pipelines that move, transform, and maintain data on S3. It orchestrates Spark jobs, dbt runs, Iceberg compaction, and embedding generation workflows — not executing the work itself, but ensuring it runs in the right order on the right schedule.

Misconceptions / Traps
  • Airflow is an orchestrator, not an execution engine. It schedules and monitors tasks but should not process data directly. Heavy workloads belong on Spark, DuckDB, or dedicated compute — not inside Airflow workers.
  • DAG complexity grows quickly. Without disciplined modularization, Airflow deployments become tangled webs of interdependent DAGs that are difficult to debug and test.
  • Airflow runs a single scheduler process by default. High-concurrency deployments require running multiple schedulers (Airflow 2.0+), choosing an appropriate executor (Celery/Kubernetes), and tuning the metadata database backend.
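The orchestration-not-execution point above can be sketched in plain Python using the standard library's graphlib; no Airflow install is needed, and the task names and dependencies are hypothetical. A real Airflow DAG declares the same structure with operators and the `>>` dependency syntax.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks
# that must finish before it may start.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_join": {"extract_orders", "extract_customers"},
    "load_to_s3": {"transform_join"},
    "compact_iceberg": {"load_to_s3"},
}

# The scheduler's core job: release tasks to an executor in dependency
# order. The heavy lifting (Spark, dbt) runs elsewhere; here we only
# compute a valid execution order.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

In a real deployment each name would be an operator (e.g. a SparkSubmitOperator triggering work on a Spark cluster), and the executor would dispatch ready tasks to workers, possibly in parallel.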

Key Connections
  • enables Lakehouse Architecture — orchestrates ETL/ELT into S3-based lakehouses
  • scoped_to Object Storage for AI Data Pipelines — coordinates pipelines over S3 data
  • solves Legacy Ingestion Bottlenecks — programmable orchestration for modern S3 architectures

Definition

What it is

A workflow orchestration platform that defines, schedules, and monitors data pipelines as Python-defined directed acyclic graphs (DAGs). The industry standard for batch ETL scheduling in S3-centric data platforms.

Why it exists

Data lakes on S3 require coordinated multi-step pipelines — extraction, transformation, loading, compaction, quality checks. Airflow provides the scheduling, dependency management, retry logic, and observability needed to operate these pipelines reliably.
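The per-task retry logic mentioned above (which Airflow exposes through the BaseOperator parameters `retries`, `retry_delay`, and `retry_exponential_backoff`) can be illustrated with a minimal plain-Python sketch; the helper and the flaky task here are hypothetical stand-ins, not Airflow code.

```python
import time

def run_with_retries(task, retries=3, delay=0.01, backoff=2.0):
    """Re-run a failing task, doubling the wait after each attempt --
    a simplified version of the retry policy Airflow lets you declare
    per task."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            attempt += 1
            if attempt > retries:
                raise
            time.sleep(delay)
            delay *= backoff

# Hypothetical flaky extraction step: fails twice, then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient S3 error")
    return "ok"

result = run_with_retries(flaky_extract)
print(result)
```

Declaring this behavior instead of hand-rolling it in every pipeline is a large part of why Airflow exists: retries, dependency gating, and backfill all come from configuration rather than custom code.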

Primary use cases

Batch ETL orchestration for S3 data lakes, scheduled Iceberg compaction and maintenance, data quality validation pipelines, coordinated multi-system data workflows.
