Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.
The philosophy is "Workflows as Code", which serves several purposes:
- dynamic: pipelines are defined in Python, enabling dynamic pipeline generation
- extensible: can connect with numerous technologies (other packages, databases)
- flexible: parameterization leverages the Jinja templating engine
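To illustrate that last point, here is a minimal sketch of Jinja templating in a `BashOperator` (the DAG and task ids are placeholders; `{{ ds }}` is a built-in Airflow macro that renders to the run's logical date):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="templating_demo", start_date=datetime(2022, 1, 1), schedule="@daily") as dag:
    # Jinja renders {{ ds }} at runtime to the logical date, e.g. "2022-01-01"
    print_date = BashOperator(
        task_id="print_date",
        bash_command="echo 'processing data for {{ ds }}'",
    )
```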
Here's a simple example:
```python
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator

# A DAG represents a workflow, a collection of tasks
with DAG(dag_id="demo", start_date=datetime(2022, 1, 1), schedule="0 0 * * *") as dag:
    # Tasks are represented as operators
    marco = BashOperator(task_id="marco", bash_command="echo marco")

    @task()
    def polo():
        print("polo")

    # Set dependencies between tasks
    marco >> polo()
```
There are three main concepts to understand.
- DAGs: describe the work (tasks) and the order in which to carry out the workflow
- Operators: classes that act as templates for carrying out some work
- Tasks: a unit of work (a node in the DAG), created by instantiating an operator
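To make the operator/task distinction concrete, here is a minimal sketch of a custom operator (class, DAG, and task names are illustrative, assuming Airflow 2.x): subclassing `BaseOperator` is what makes it a reusable template, and each task is one instantiation of it.

```python
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import BaseOperator


class GreetOperator(BaseOperator):
    """A template for 'greeting' work; each task fills in its own name."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # Called when the task instance actually runs
        print(f"hello, {self.name}")


with DAG(dag_id="greet_demo", start_date=datetime(2022, 1, 1), schedule=None) as dag:
    # Two tasks instantiated from the same operator template
    greet_a = GreetOperator(task_id="greet_a", name="marco")
    greet_b = GreetOperator(task_id="greet_b", name="polo")
    greet_a >> greet_b
```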
Here we have a DAG (Directed Acyclic Graph) named "demo" that starts on January 1st, 2022 and runs daily (the cron expression `0 0 * * *` means midnight every day).
We have two tasks defined:
- A `BashOperator` running a bash command
- A Python function defined with the `@task` decorator

`>>` is the bitshift operator, which Airflow overloads to define the dependency and ordering of tasks. Here it means `marco` runs first, then `polo()`.
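The same dependency can be expressed in several equivalent ways; a short sketch assuming a recent Airflow 2.x (task ids are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="deps_demo", start_date=datetime(2022, 1, 1), schedule=None) as dag:
    a = EmptyOperator(task_id="a")
    b = EmptyOperator(task_id="b")
    c = EmptyOperator(task_id="c")

    # Two equivalent ways to say "a runs before b"
    a >> b
    b.set_upstream(a)

    # chain() is handy for longer linear sequences: a -> b -> c
    chain(a, b, c)
```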
Why Airflow?
- Coding > clicking
- version control: you can roll back to previous versions of a workflow
- workflows can be developed by multiple people simultaneously
- you can write tests to validate functionality (see the sketch after this list)
- components are extensible
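On the testing point, a common smoke test is to load every DAG file and assert nothing failed to import; a minimal pytest-style sketch:

```python
from airflow.models import DagBag


def test_dags_load_without_errors():
    # Parses all DAG files in the configured DAGs folder
    dag_bag = DagBag(include_examples=False)
    # import_errors maps file path -> traceback for any file that failed to parse
    assert dag_bag.import_errors == {}
```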
Why not Airflow?
- not suited to infinitely-running, event-based workflows; that's the domain of streaming tools like Kafka
- if you like clicking
Resources
- Official Tutorial
- DE Project by startdataengineering
- Best Practices
- YouTube Tutorial
- jghoman/awesome-apache-airflow: Curated list of resources about Apache Airflow
- Apache Airflow - YouTube
Alternatives
Blogs
- Lessons Learned From Running Apache Airflow at Scale (2023)
- Running Apache Airflow at Lyft (Lyft Engineering)
- Migrating Airflow from EC2 to Kubernetes (Snowflake Blog)
- How AppsFlyer Uses Apache Airflow to Run More Than 3.5k Daily Jobs (AppsFlyer Engineering)
- The Journey of Deploying Apache Airflow at Grab
- How Sift Trains Thousands of Models Using Apache Airflow (Sift Engineering Blog)