
INTRODUCTION TO AIRFLOW

Airflow is a platform to programmatically author, schedule and monitor data pipelines, and it covers almost every stage of the workflow-management lifecycle. The system was originally built by Airbnb around the following four principles:
• Dynamic: Airflow pipelines are configuration as code (Python), which allows pipelines to be generated dynamically by the code that defines them.
• Extensible: Easily define your own operators, executors and extend the library so that it fits the level of
abstraction that suits your environment.
• Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine (see the sketch after this list).
• Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of
workers. Airflow is ready to scale to infinity.
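
As an illustration of the Jinja templating mentioned under the Elegant principle, the sketch below parameterizes a BashOperator command with the built-in {{ ds }} macro (the execution date). The DAG id, dates and command are made up for this example.

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

# Hypothetical DAG used only to demonstrate templating.
dag = DAG(dag_id="templating_demo",
          start_date=datetime(2020, 1, 1),
          schedule_interval="@daily")

# {{ ds }} is rendered by Airflow at runtime into the execution date (YYYY-MM-DD).
print_date = BashOperator(
    task_id="print_execution_date",
    bash_command="echo 'Processing data for {{ ds }}'",
    dag=dag
)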

Basic concepts of Airflow


• DAGs: A Directed Acyclic Graph (DAG) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
 ◦ DAGs are defined as Python scripts and are placed in the DAGs folder (this can be any location, but it needs to be configured in the Airflow config file).
 ◦ Once a new DAG file is placed in the DAGs folder, it is picked up by Airflow automatically within about a minute.

• Operators: An operator describes a single task in a workflow. While DAGs describe how to run a workflow, operators determine what actually gets done.
 ◦ Task: Once an operator is instantiated with its parameters, it is referred to as a "task".
 ◦ Task instance: A specific run of a task at a point in time is called a task instance.
• Scheduling the DAGs/Tasks: DAGs and tasks can be scheduled to run at a certain frequency using the parameters below (a minimal example follows this list).
 ◦ Schedule interval: Determines how often the DAG should be triggered. This can be a cron expression, a preset string such as "@daily", or a Python datetime.timedelta object.
• Executors: Once the DAGs, tasks and scheduling definitions are in place, something needs to actually execute the jobs/tasks. This is where executors come into the picture. Airflow provides three types of executors out of the box.
 ◦ Sequential: The Sequential executor is meant for test drives; it executes tasks one by one (sequentially), so tasks cannot be parallelized.
 ◦ Local: The Local executor is similar to the Sequential executor, but it can parallelize task instances on the local machine.
 ◦ Celery: The Celery executor is built on Celery, an open-source distributed task execution engine based on message queues, which makes it more scalable and fault tolerant. Message queues like RabbitMQ or Redis can be used along with Celery. This is the executor typically used in production.
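
To make these concepts concrete, here is a minimal sketch of a DAG file with hypothetical ids and an arbitrary cron schedule. It shows how instantiating operators produces tasks, how the schedule interval is set, and how dependencies are declared with the >> operator; each scheduled run of a task then becomes a task instance.

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def say_hello():
    print("hello from Airflow")

# Trigger the DAG every day at 6 AM (cron expression "0 6 * * *").
with DAG(dag_id="concepts_demo",
         start_date=datetime(2020, 1, 1),
         schedule_interval="0 6 * * *",
         catchup=False) as dag:

    # Instantiating an operator with parameters creates a task.
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting...'")
    transform = PythonOperator(task_id="transform", python_callable=say_hello)

    # "extract" must complete before "transform" runs.
    extract >> transform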

ARCHITECTURE OF AIRFLOW
Airflow typically consists of the components below:
• Configuration file: Where all the configuration points are set, such as which port to run the web server on, which executor to use, the RabbitMQ/Redis connection details, the number of workers, the DAGs folder location, repositories, etc. (an illustrative excerpt follows this list).
• Metadata database (MySQL or Postgres): The database where all the metadata related to DAGs, DAG runs, tasks and variables is stored.
• DAGs (Directed Acyclic Graphs): The workflow definitions (logical units) that contain the task definitions along with their dependency information. These are the actual jobs the user would like to execute.
• Scheduler: The component responsible for triggering the DAG runs and the task instances of each DAG. The scheduler is also responsible for invoking the executor (be it Sequential, Local or Celery).
• Broker (Redis or RabbitMQ): With the Celery executor, the broker is required to hold the messages and act as the communicator between the executor and the workers.
• Worker nodes: The actual workers that execute the tasks and return the result of the task.
• Web server: A web server that renders the Airflow UI, through which one can view the DAGs and their status, rerun them, create variables, connections, etc.
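
For reference, the excerpt below sketches the kind of settings found in airflow.cfg for a Celery-based setup; the values are placeholders, not a recommended configuration.

[core]
dags_folder = /usr/local/airflow/dags
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres:5432/airflow

[webserver]
web_server_port = 8080

[celery]
broker_url = redis://redis:6379/0
result_backend = db+postgresql://airflow:airflow@postgres:5432/airflow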

[Figures: Airflow architecture; how a DAG runs; operators]

Example DAG
import airflow
from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.sensors.http_sensor import HttpSensor
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.hive_operator import HiveOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.operators.email_operator import EmailOperator
from airflow.operators.slack_operator import SlackAPIPostOperator
from datetime import datetime, timedelta

import csv
import requests
import json

default_args = {
            "owner": "Airflow",
            "start_date": airflow.utils.dates.days_ago(1),
            "depends_on_past": False,
            "email_on_failure": False,
            "email_on_retry": False,
            "email": "youremail@host.com",
            "retries": 1,
            "retry_delay": timedelta(minutes=5)
        }

# Download forex rates according to the currencies we want to watch
# described in the file forex_currencies.csv
def download_rates():
    with open('/usr/local/airflow/dags/files/forex_currencies.csv') as forex_currencies:
        reader = csv.DictReader(forex_currencies, delimiter=';')
        for row in reader:
            base = row['base']
            with_pairs = row['with_pairs'].split(' ')
            indata = requests.get('https://api.exchangeratesapi.io/latest?base=' + base).json()
            outdata = {'base': base, 'rates': {}, 'last_update': indata['date']}
            for pair in with_pairs:
                outdata['rates'][pair] = indata['rates'][pair]
            with open('/usr/local/airflow/dags/files/forex_rates.json', 'a') as outfile:
                json.dump(outdata, outfile)
                outfile.write('\n')

with DAG(dag_id="forex_data_pipeline_final", schedule_interval="@daily", default_args=default_args, catchup=False) as dag:

    # Checking if forex rates are available
    # TODO: Check SSL
    is_forex_rates_available = HttpSensor(
            task_id="is_forex_rates_available",
            method="GET",
            http_conn_id="forex_api",
            endpoint="latest",
            response_check=lambda response: "rates" in response.text,
            poke_interval=5,
            timeout=20
    )

    # Checking if the file containing the forex pairs we want to observe has arrived
    # TODO: Speak about the fact that the path in connection forex_path must be specified
    # in the extra parameter as JSON
    is_forex_currencies_file_available = FileSensor(
            task_id="is_forex_currencies_file_available",
            fs_conn_id="forex_path",
            filepath="forex_currencies.csv",
            poke_interval=5,
            timeout=20
    )

    # Parsing forex_pairs.csv and downloading the files
    downloading_rates = PythonOperator(
            task_id="downloading_rates",
            python_callable=download_rates
    )

    # Saving forex_rates.json in HDFS
    saving_rates = BashOperator(
        task_id="saving_rates",
        bash_command="""
            hdfs dfs -mkdir -p /forex && \
            hdfs dfs -put -f $AIRFLOW_HOME/dags/files/forex_rates.json /forex
            """
    )

    # Creating a hive table named forex_rates
    creating_forex_rates_table = HiveOperator(
        task_id="creating_forex_rates_table",
        hive_cli_conn_id="hive_conn",
        hql="""
            CREATE EXTERNAL TABLE IF NOT EXISTS forex_rates(
                base STRING,
                last_update DATE,
                eur DOUBLE,
                usd DOUBLE,
                nzd DOUBLE,
                gbp DOUBLE,
                jpy DOUBLE,
                cad DOUBLE
                )
            ROW FORMAT DELIMITED
            FIELDS TERMINATED BY ','
            STORED AS TEXTFILE
        """
    )

    # Running Spark Job to process the data
    forex_processing = SparkSubmitOperator(
        task_id="forex_processing",
        conn_id="spark_conn",
        application="/usr/local/airflow/dags/scripts/forex_processing.py",
        verbose=False
    )

    # Sending a notification by email
    # https://stackoverflow.com/questions/51829200/how-to-set-up-airflow-send-email
    sending_email_notification = EmailOperator(
            task_id="sending_email",
            to="airflow_course@yopmail.com",
            subject="forex_data_pipeline",
            html_content="""
                <h3>forex_data_pipeline succeeded</h3>
            """
            )

    # Sending a notification by Slack message
    # TODO: Improvements - add on_failure for tasks
    # https://medium.com/datareply/integrating-slack-alerts-in-airflow-c9dcd155105
    sending_slack_notification = SlackAPIPostOperator(
        task_id="sending_slack",
        token="xoxp-753801195270-740121926339-751642514144-
8391b800988bed43247926b03742459e",
        username="airflow",
        text="DAG forex_data_pipeline: DONE",
        channel="#airflow-exploit"
    )

    is_forex_rates_available >> is_forex_currencies_file_available >> downloading_rates >> saving_rates
    saving_rates >> creating_forex_rates_table >> forex_processing 
    forex_processing >> sending_email_notification >> sending_slack_notification
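
With Airflow 1.x (which the contrib-style imports above assume), each task of this example can be exercised in isolation from the command line before the whole pipeline is scheduled, for instance (the execution date is arbitrary):

airflow test forex_data_pipeline_final downloading_rates 2020-01-01

The test command runs a single task for the given execution date without recording its state in the metadata database, which is convenient while developing the DAG.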
