Airflow is a platform to programmatically author, schedule and monitor data pipelines. It covers almost every
stage of the workflow management lifecycle. The system was built by Airbnb on the following four
principles:
• Dynamic: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This
allows for writing code that instantiates pipelines dynamically.
• Extensible: Easily define your own operators, executors and extend the library so that it fits the level of
abstraction that suits your environment.
• Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using
the powerful Jinja templating engine.
• Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of
workers. Airflow is ready to scale to infinity.
• Operators: An operator describes a single task in a workflow. While DAGs describe how to run a workflow,
Operators determine what gets done.
Task: Once an operator is instantiated with specific parameters, it is referred to as a “task”.
Task Instance: A specific run of a task for a given DAG run (a point in time) is called a task instance.
• Scheduling the DAGs/Tasks: DAGs and their tasks can be scheduled to run at a certain frequency using the
parameter below.
Schedule interval: Determines when the DAG should be triggered. This can be a cron expression (or a
preset such as "@daily") or a Python datetime.timedelta object.
• Executors: Once the DAGs, tasks and scheduling definitions are in place, something needs to execute the
tasks. This is where executors come into the picture. Airflow provides three types of executors out of the
box.
Sequential: The Sequential executor is meant for test drives. It executes tasks one by one (sequentially);
tasks cannot be parallelized.
Local: The Local executor is like the Sequential executor, but it can run task instances in parallel on a
single machine.
Celery: The Celery executor is built on Celery, an open source distributed task execution engine based on
message queues, which makes it more scalable and fault tolerant. Message queues like RabbitMQ or Redis
can be used along with Celery. This executor is typically used in production.
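To make the schedule interval item in the list above more concrete, here is a minimal sketch in plain Python (no Airflow import) of how an interval-based schedule is generally understood to behave: the run that covers a period is triggered only once that period has ended. The function name `due_runs` and the dates are illustrative assumptions, not part of Airflow's API.

```python
from datetime import datetime, timedelta


def due_runs(start_date, interval, now):
    """Return (execution_date, triggered_at) pairs for every run due by `now`.

    Assumption: a run labelled with `execution_date` covers the period
    [execution_date, execution_date + interval) and is triggered at the
    end of that period.
    """
    runs = []
    execution_date = start_date
    while execution_date + interval <= now:
        runs.append((execution_date, execution_date + interval))
        execution_date += interval
    return runs


# A daily schedule that starts on 1 January, inspected at noon on 3 January.
runs = due_runs(
    start_date=datetime(2020, 1, 1),
    interval=timedelta(days=1),
    now=datetime(2020, 1, 3, 12, 0),
)
for execution_date, triggered_at in runs:
    print("run for", execution_date.date(), "triggered at", triggered_at)
```

Under these assumptions two runs are due: the one for 1 January (triggered on 2 January) and the one for 2 January (triggered on 3 January). With catchup=False, as in the example DAG further down, Airflow schedules only the most recent interval instead of backfilling every missed one.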
ARCHITECTURE OF AIRFLOW
Airflow typically consists of the components below:
• Configuration file: Holds all the configuration points, such as which port to run the web server on, which
executor to use, the RabbitMQ/Redis settings, the workers, the DAGs location, the repository, etc.
• Metadata database (MySQL or PostgreSQL): The database where all the metadata related to the DAGs, DAG runs,
tasks and variables is stored.
• DAGs (Directed Acyclic Graphs): The workflow definitions (logical units) that contain the task
definitions along with their dependency information. These are the actual jobs that the user would like to execute.
• Scheduler: The component responsible for triggering the DAG runs and the task instances of each DAG. The
scheduler is also responsible for invoking the executor (be it Local, Celery or Sequential).
• Broker (Redis or RabbitMQ): With the Celery executor, a broker is required to hold the messages and act as a
communicator between the executor and the workers.
• Worker nodes: The actual workers that execute the tasks and return the result of each task.
• Web server: Renders the Airflow UI, through which one can view the DAGs and their status, rerun them, and
create variables, connections, etc.
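As a rough illustration of the configuration file item above, the sketch below parses a small airflow.cfg-style fragment with Python's standard configparser. The section and option names are assumed to follow the Airflow 1.10-era layout, and every value (paths, hosts, credentials) is a placeholder; the airflow.cfg generated by your own installation is the authoritative reference.

```python
import configparser

# Hypothetical airflow.cfg fragment. Option names assume the Airflow 1.10-era
# layout; all values are placeholders chosen for illustration only.
SAMPLE_CFG = """
[core]
dags_folder = /usr/local/airflow/dags
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

[webserver]
web_server_port = 8080

[celery]
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
"""

# Interpolation is disabled so that a literal "%" in a value would not be
# treated as a placeholder by configparser.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE_CFG)

executor = parser.get("core", "executor")              # which executor to use
port = parser.getint("webserver", "web_server_port")   # web server port
broker = parser.get("celery", "broker_url")            # Redis/RabbitMQ broker
metadata_db = parser.get("core", "sql_alchemy_conn")   # metadata database

print(executor, port, broker)
```

Each option maps onto one of the components listed above: the executor choice, the web server port, the broker used by Celery, and the connection string of the metadata database.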
Airflow architecture
How a DAG runs?
Operators
Example DAG
import airflow
from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.sensors.http_sensor import HttpSensor
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.hive_operator import HiveOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.operators.email_operator import EmailOperator
from airflow.operators.slack_operator import SlackAPIPostOperator
from datetime import datetime, timedelta
import csv
import requests
import json
default_args = {
    "owner": "Airflow",
    "start_date": airflow.utils.dates.days_ago(1),
    "depends_on_past": False,
    "email_on_failure": False,
    "email_on_retry": False,
    "email": "youremail@host.com",
    "retries": 1,
    "retry_delay": timedelta(minutes=5)
}
# Download forex rates according to the currencies we want to watch
# described in the file forex_currencies.csv
def download_rates():
    with open('/usr/local/airflow/dags/files/forex_currencies.csv') as forex_currencies:
        reader = csv.DictReader(forex_currencies, delimiter=';')
        for row in reader:
            base = row['base']
            with_pairs = row['with_pairs'].split(' ')
            indata = requests.get('https://api.exchangeratesapi.io/latest?base=' + base).json()
            outdata = {'base': base, 'rates': {}, 'last_update': indata['date']}
            for pair in with_pairs:
                outdata['rates'][pair] = indata['rates'][pair]
            with open('/usr/local/airflow/dags/files/forex_rates.json', 'a') as outfile:
                json.dump(outdata, outfile)
                outfile.write('\n')
with DAG(dag_id="forex_data_pipeline_final", schedule_interval="@daily", default_args=default_args, catchup=False) as dag:
    # Checking if forex rates are available
    # TODO: Check SSL
    is_forex_rates_available = HttpSensor(
        task_id="is_forex_rates_available",
        method="GET",
        http_conn_id="forex_api",
        endpoint="latest",
        response_check=lambda response: "rates" in response.text,
        poke_interval=5,
        timeout=20
    )
    # Checking if the file containing the forex pairs we want to observe has arrived
    # TODO: Speak about the fact that the path in connection forex_path must be specified
    # in the extra parameter as JSON
    is_forex_currencies_file_available = FileSensor(
        task_id="is_forex_currencies_file_available",
        fs_conn_id="forex_path",
        filepath="forex_currencies.csv",
        poke_interval=5,
        timeout=20
    )
    # Parsing forex_currencies.csv and downloading the rates
    downloading_rates = PythonOperator(
        task_id="downloading_rates",
        python_callable=download_rates
    )
    # Saving forex_rates.json in HDFS
    saving_rates = BashOperator(
        task_id="saving_rates",
        bash_command="""
        hdfs dfs -mkdir -p /forex && \
        hdfs dfs -put -f $AIRFLOW_HOME/dags/files/forex_rates.json /forex
        """
    )
    # Creating a hive table named forex_rates
    creating_forex_rates_table = HiveOperator(
        task_id="creating_forex_rates_table",
        hive_cli_conn_id="hive_conn",
        hql="""
        CREATE EXTERNAL TABLE IF NOT EXISTS forex_rates(
            base STRING,
            last_update DATE,
            eur DOUBLE,
            usd DOUBLE,
            nzd DOUBLE,
            gbp DOUBLE,
            jpy DOUBLE,
            cad DOUBLE
        )
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        """
    )
    # Running Spark Job to process the data
    forex_processing = SparkSubmitOperator(
        task_id="forex_processing",
        conn_id="spark_conn",
        application="/usr/local/airflow/dags/scripts/forex_processing.py",
        verbose=False
    )
    # Sending a notification by email
    # https://stackoverflow.com/questions/51829200/how-to-set-up-airflow-send-email
    sending_email_notification = EmailOperator(
        task_id="sending_email",
        to="airflow_course@yopmail.com",
        subject="forex_data_pipeline",
        html_content="""
        <h3>forex_data_pipeline succeeded</h3>
        """
    )
    # Sending a notification by Slack message
    # TODO: Improvements - add on_failure for tasks
    # https://medium.com/datareply/integrating-slack-alerts-in-airflow-c9dcd155105
    # Note: the token is hard-coded here as in the original listing; in practice
    # it should be kept out of the DAG file (e.g. in an Airflow connection or variable).
    sending_slack_notification = SlackAPIPostOperator(
        task_id="sending_slack",
        token="xoxp-753801195270-740121926339-751642514144-8391b800988bed43247926b03742459e",
        username="airflow",
        text="DAG forex_data_pipeline: DONE",
        channel="#airflow-exploit"
    )
    is_forex_rates_available >> is_forex_currencies_file_available >> downloading_rates >> saving_rates
    saving_rates >> creating_forex_rates_table >> forex_processing
    forex_processing >> sending_email_notification >> sending_slack_notification
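
The three dependency lines that close the DAG can be checked without Airflow. The sketch below is a stand-in that uses Python's standard graphlib module (Python 3.9+), not Airflow's scheduler. It rebuilds the same chains as a graph and prints one valid execution order; the helper name `build_graph` is a hypothetical introduced for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each inner list mirrors one ">>" chain from the example DAG.
chains = [
    ["is_forex_rates_available", "is_forex_currencies_file_available",
     "downloading_rates", "saving_rates"],
    ["saving_rates", "creating_forex_rates_table", "forex_processing"],
    ["forex_processing", "sending_email_notification",
     "sending_slack_notification"],
]


def build_graph(chains):
    """Map each task to the set of tasks that must finish before it."""
    upstream = {}
    for chain in chains:
        for task in chain:
            upstream.setdefault(task, set())
        for before, after in zip(chain, chain[1:]):
            upstream[after].add(before)
    return upstream


# static_order() yields every task only after all of its upstream tasks.
order = list(TopologicalSorter(build_graph(chains)).static_order())
for position, task in enumerate(order, start=1):
    print(position, task)
```

Because the chains form a single line, there is only one valid order: it starts with is_forex_rates_available and ends with sending_slack_notification. A cycle in the chains would make static_order() raise graphlib.CycleError, which is the "acyclic" requirement of a DAG in miniature.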