Apache Airflow
A platform for programmatically authoring, scheduling, and monitoring workflows as directed acyclic graphs (DAGs) written in Python. The industry standard for batch data pipeline orchestration.
Summary
Airflow is the scheduler and coordinator for batch ETL/ELT pipelines that move, transform, and maintain data on S3. It orchestrates Spark jobs, dbt runs, Iceberg compaction, and embedding generation workflows — not executing the work itself, but ensuring it runs in the right order on the right schedule.
- Airflow is an orchestrator, not an execution engine. It schedules and monitors tasks but should not process data directly. Heavy workloads belong on Spark, DuckDB, or dedicated compute — not inside Airflow workers.
- DAG complexity grows quickly. Without disciplined modularization, Airflow deployments become tangled webs of interdependent DAGs that are difficult to debug and test.
- Airflow runs a single scheduler instance by default. High-concurrency deployments require tuning the scheduler (or running multiple scheduler replicas), the executor (Celery or Kubernetes), and the metadata database backend.
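The orchestrator-not-executor pattern above can be sketched as a minimal DAG in which every task merely triggers external compute. This is a hedged illustration, not a canonical pipeline: the DAG id, commands, and file names are hypothetical, and it assumes Airflow 2.x (where `schedule` replaced `schedule_interval` from 2.4 onward).

```python
# A minimal orchestration sketch, assuming Airflow 2.x is installed.
# DAG id, commands, and script names are illustrative, not real jobs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_lakehouse_etl",      # hypothetical DAG name
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Each task only *triggers* work on external compute (Spark, dbt);
    # no data is processed inside the Airflow worker itself.
    extract = BashOperator(
        task_id="extract",
        bash_command="spark-submit extract_job.py",      # hypothetical job
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="dbt run --select staging",         # hypothetical selector
    )
    compact = BashOperator(
        task_id="compact_iceberg",
        bash_command="spark-submit compact_tables.py",   # hypothetical job
    )

    # Airflow expresses dependency order; execution happens elsewhere.
    extract >> transform >> compact
```

The `>>` operator declares ordering only; if any task fails, downstream tasks are held, which is the core of Airflow's coordination role.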
enables: Lakehouse Architecture — orchestrates ETL/ELT into S3-based lakehouses
scoped_to: Object Storage for AI Data Pipelines — coordinates pipelines over S3 data
solves: Legacy Ingestion Bottlenecks — programmable orchestration for modern S3 architectures
Definition
A workflow orchestration platform that defines, schedules, and monitors data pipelines as Python-defined directed acyclic graphs (DAGs). The industry standard for batch ETL scheduling in S3-centric data platforms.
Data lakes on S3 require coordinated multi-step pipelines — extraction, transformation, loading, compaction, quality checks. Airflow provides the scheduling, dependency management, retry logic, and observability needed to operate these pipelines reliably.
Batch ETL orchestration for S3 data lakes, scheduled Iceberg compaction and maintenance, data quality validation pipelines, coordinated multi-system data workflows.
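The retry logic and dependency management mentioned above can be sketched with a small DAG that waits for data to land on S3 before validating it. This is a hedged example assuming Airflow 2.x plus the Amazon provider package (which supplies `S3KeySensor`, as noted in the resources below); the DAG id, bucket path, and validation command are hypothetical.

```python
# A sketch of retry logic + an S3 sensor, assuming Airflow 2.x and the
# apache-airflow-providers-amazon package. Names and paths are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

default_args = {
    "retries": 3,                        # per-task retry logic
    "retry_delay": timedelta(minutes=5), # back off between attempts
}

with DAG(
    dag_id="s3_quality_check",           # hypothetical DAG name
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    # Block until the expected object appears on S3 (the {{ ds }} template
    # is Airflow's built-in logical-date macro).
    wait_for_data = S3KeySensor(
        task_id="wait_for_landing_file",
        bucket_key="s3://example-bucket/landing/{{ ds }}/data.parquet",  # hypothetical path
        aws_conn_id="aws_default",
    )
    # Validation itself runs as an external script, not inside Airflow.
    validate = BashOperator(
        task_id="validate",
        bash_command="python run_quality_checks.py",  # hypothetical script
    )

    wait_for_data >> validate
```

Putting `retries` in `default_args` applies them to every task in the DAG, which is the usual way to get uniform retry behavior without repeating the setting per operator.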
Connections 3
Outbound 3
scoped_to 1 · enables 1 · solves 1
Resources 2
DAG authoring, operator reference, and deployment guides for the workflow orchestrator that schedules most S3-based data pipelines.
Source code and provider packages including the Amazon provider with S3 hooks, sensors, and transfer operators.