
INTRODUCTION

Background

The healthcare analytics project applies data engineering and analytics techniques to information
from the US Health Insurance Marketplace. The aim is to understand and analyze the health and
dental plans available to consumers. The data comes from the Health Insurance Marketplace Public
Use Files, which offer a rich source of insight into the plans offered through the Marketplace.

Objectives

 Ingest and transform the processed dataset using a chosen framework.
 Implement data quality checks to ensure accuracy.
 Conduct data analysis to understand market dynamics, plan rates, benefits, and variations
across different factors.

DATA SOURCE
Dataset Information
The Health Insurance Marketplace Public Use Files, published by the Centers for Medicare &
Medicaid Services (CMS), contain information on the health and dental plans offered to individuals
and small businesses through the US Marketplace. They are a rich source of data for understanding
the insurance options available.

Processed Data Components

The processed version of the data includes six CSV files with the following components:

 BenefitsCostSharing.csv
 BusinessRules.csv
 Network.csv
 PlanAttributes.csv
 Rate.csv
 ServiceArea.csv

INGESTION AND ETL

Framework Selection: Apache Airflow

Apache Airflow Overview:


Apache Airflow is an open-source platform for programmatically authoring, scheduling, and
monitoring workflows as Directed Acyclic Graphs (DAGs). It lets you organize and execute tasks in a
defined order, which makes it well suited to managing complex data workflows.
Installation Process:
1. Install Python:
 Ensure that Python is installed on your system.
 Apache Airflow 2.x requires Python 3; Python 2 is no longer supported, so install a recent
Python 3 release.
2. Install Apache Airflow:
 You can install Apache Airflow using pip, a package installer for Python. Run
the following command:
Bash command
pip install apache-airflow
3. Initialize Airflow Database:
 Initialize the metadata database used by Airflow to store its configuration settings and
job metadata.
Bash command
airflow db init
4. Start the Airflow Web Server:
 Start the web server, which provides the Airflow user interface.
Bash command
airflow webserver --port 8080
5. Start the Scheduler:
 Start the scheduler, which orchestrates the execution of tasks defined in your DAGs.
Bash command
airflow scheduler
6. Access the Airflow UI:
 Open a web browser and navigate to http://localhost:8080 to access the Airflow UI.

Tools and OS Requirements:

Operating System:
 Apache Airflow runs on a range of operating systems; Linux and macOS are supported
natively, while Windows users typically run it through WSL2 or Docker.
Python Virtual Environment (Optional but Recommended):
 It's good practice to create a virtual environment to isolate the dependencies of your project.
Bash command
python3 -m venv myenv
source myenv/bin/activate  # On Unix
Additional Tools (if needed for specific tasks):
 Database:
 If you plan to use a specific database backend (e.g., PostgreSQL, MySQL, SQLite), you
may need to install the corresponding database software and Python drivers.
 Additional Python Libraries:
 Depending on the specific data processing tasks in your ETL process, you may need
to install additional Python libraries. For example:
Bash command
pip install pandas numpy sqlalchemy
 Other Dependencies:
 Some tasks might require additional tools or libraries. Ensure you install them
based on your project requirements.
Docker Integration:

1. Install Docker:
 Ensure Docker is installed on your system. You can download and install Docker from
the official website.
2. Create Dockerfile:
 Create a Dockerfile in your project directory to define the Docker image for your Apache
Airflow environment. An example Dockerfile might look like this:

FROM apache/airflow:2.1.2
USER root
RUN pip install pandas numpy sqlalchemy
USER airflow
3. Build Docker Image:
 In the same directory as your Dockerfile, run the following command to build the Docker
image:
Bash command
docker build -t my_airflow_image .
4. Run Apache Airflow in Docker Container:
 Start Apache Airflow within a Docker container using the built image:
Bash command
docker run -p 8080:8080 my_airflow_image
 Access the Airflow UI at http://localhost:8080.

Visual Studio Code Integration:

1. Install Visual Studio Code:


 Download and install Visual Studio Code from the official website.
2. Install Docker Extension:
 Install the "Docker" extension for Visual Studio Code to manage Docker containers and
images directly from the VS Code interface.
3. Install Python Extension:
 Install the "Python" extension for Visual Studio Code to enhance Python development within
the editor.
4. Connect to Docker from VS Code:
 Open the Docker extension in VS Code, and it will automatically detect running Docker
containers and images.
5. Develop and Debug in VS Code:
 Write your Python scripts, Apache Airflow DAGs, and ETL logic within VS Code.
 Utilize VS Code's debugging features for Python development.
6. Docker Compose:
 If your project involves multiple services or components, consider using Docker Compose to
define and run multi-container Docker applications.
 Create a docker-compose.yml file to specify your services, volumes, and networks.
 Use the docker-compose CLI to manage the lifecycle of your application.
TRANSFORMATION LOGIC
The transformation process involves tasks like data cleaning, normalization, and feature engineering.
To carry out these tasks, Python scripts will be integrated into the Apache Airflow workflow.
Data Cleaning:
Handling Missing Values:
 Find and address any missing or null values in the dataset. Depending on the situation and the
columns affected, you can either fill in the missing values using statistical methods or decide to
remove the corresponding rows or columns.
Outlier Detection and Removal:
 Detect and manage outliers that could distort the analysis. This may require statistical
techniques or domain expertise to identify outliers and remove them from the dataset; a brief
pandas sketch of these cleaning steps follows this list.
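As an illustration of these cleaning steps, the sketch below imputes missing values and drops rows
flagged as outliers by a simple z-score rule. The column name IndividualRate and the threshold of 3
are illustrative assumptions, not requirements of the dataset.

import pandas as pd
import numpy as np

def clean_rates(df, rate_col="IndividualRate"):
    """Minimal cleaning sketch: impute missing values, then drop z-score outliers."""
    cleaned = df.copy()

    # Fill missing numeric values with the column median (one common, simple choice)
    cleaned[rate_col] = cleaned[rate_col].fillna(cleaned[rate_col].median())

    # Flag outliers with a z-score rule and drop them (the threshold of 3 is arbitrary)
    z_scores = (cleaned[rate_col] - cleaned[rate_col].mean()) / cleaned[rate_col].std()
    return cleaned[np.abs(z_scores) <= 3]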
Normalization:
Scaling Numerical Features:
 Standardize numerical features to ensure a consistent scale. Methods like Min-Max scaling or Z-
score normalization can be employed to prevent any particular feature from overshadowing
others during analysis.
Categorical Variable Encoding:
 Encode categorical variables using methods such as one-hot encoding or label encoding. This step
is vital for machine learning models, ensuring that categorical variables are represented in a format
suitable for analysis; a short sketch of scaling and encoding is shown below.
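The sketch below shows one way to carry out Min-Max scaling and one-hot encoding with pandas
alone; the column names in the usage comment are hypothetical placeholders rather than fields
guaranteed by the dataset.

import pandas as pd

def normalize_and_encode(df, numeric_cols, categorical_cols):
    """Sketch: Min-Max scale numeric columns and one-hot encode categorical ones."""
    out = df.copy()

    # Min-Max scaling: rescale each numeric column to the [0, 1] range
    for col in numeric_cols:
        col_min, col_max = out[col].min(), out[col].max()
        if col_max != col_min:  # guard against division by zero on constant columns
            out[col] = (out[col] - col_min) / (col_max - col_min)

    # One-hot encode the categorical columns with pandas
    return pd.get_dummies(out, columns=categorical_cols)

# Hypothetical usage:
# prepared = normalize_and_encode(df, ["IndividualRate"], ["MetalLevel"])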
Python Scripts within Apache Airflow Workflow:

Define Python Operators:


 Leverage Apache Airflow to create Python Operators, each linked to a distinct
transformation task. These operators will run Python scripts containing the
transformation logic.
Order of Execution:
 Define the order in which these Python Operators should execute within the workflow.
Ensure dependencies are established so that transformations are performed in a logical
sequence.
Parameterization:
 Take advantage of Apache Airflow's parameterization features for a flexible workflow.
This involves using parameters such as file paths, column names, or other configurations
needed by the Python scripts.
Error Handling:
 Incorporate error-handling mechanisms in the Python scripts to manage unexpected issues
during transformations gracefully. This allows the workflow to recover from failures and to
surface informative error messages for debugging; a sketch of a parameterized, error-handled
task follows below.
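To illustrate parameterization and error handling together, the sketch below defines a callable that
receives its input path through op_kwargs and raises a descriptive error on failure. The task id, file
path, and surrounding DAG object are hypothetical placeholders; only the standard PythonOperator
arguments (task_id, python_callable, op_kwargs) come from Airflow itself.

import pandas as pd
from airflow.operators.python import PythonOperator

def transform_file(input_path, **kwargs):
    """Parameterized transformation callable with basic error handling."""
    try:
        df = pd.read_csv(input_path)
        # ... transformation logic would go here ...
        return len(df)  # the return value is pushed to XCom automatically
    except FileNotFoundError as exc:
        # Fail the task with an informative message instead of a bare traceback
        raise RuntimeError(f"Input file not found: {input_path}") from exc

# Hypothetical operator definition (assumes a DAG object named dag exists):
# transform_rates = PythonOperator(
#     task_id="transform_rates",
#     python_callable=transform_file,
#     op_kwargs={"input_path": "/opt/airflow/data/Rate.csv"},
#     dag=dag,
# )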
Logging and Monitoring:

Logging:
 Integrate logging into the Python scripts to record important events, errors, and other relevant
information. This facilitates troubleshooting and monitoring of the ETL process; a minimal logging
sketch appears after this list.
Monitoring:
 Set up monitoring tools within Apache Airflow to track the progress of the workflow. This
includes checking for successful task completion, identifying failures, and triggering alerts
if needed.
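A minimal sketch of task-level logging using Python's standard logging module; Airflow captures
these records in each task's log, viewable in the web UI. The function name and messages are
illustrative only.

import logging

logger = logging.getLogger(__name__)

def transform_with_logging(**kwargs):
    """Transformation callable that logs key events for later troubleshooting."""
    logger.info("Starting transformation task")
    try:
        # ... transformation logic would go here ...
        logger.info("Transformation completed successfully")
    except Exception:
        # Record the full traceback before letting Airflow mark the task as failed
        logger.exception("Transformation failed")
        raise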

APACHE AIRFLOW
Apache Airflow is an open-source platform that orchestrates complex workflows represented as
Directed Acyclic Graphs (DAGs). Primarily employed for managing and scheduling ETL (Extract,
Transform, Load) processes, it streamlines the flow of data operations.
Key Concepts:

1. DAG (Directed Acyclic Graph):


 In Apache Airflow, a DAG is a set of tasks with specified dependencies, outlining the
orchestrated workflow.
 It is created in a Python script, encompassing tasks, operators, and their relationships.
2. Operators:
 Operators in Apache Airflow define individual steps in a workflow, each carrying out a
specific action like executing SQL queries or running Python scripts.
 Examples of common operators are PythonOperator, BashOperator, SqlSensor, etc.
3. Tasks:
 In Apache Airflow, a task is an instance of an operator, representing a distinct unit of
work in a DAG.
 Tasks are part of a DAG, and the connections between tasks determine the sequence of
the workflow.
4. Task Dependencies:
 Dependencies between tasks are specified in the DAG definition. A task can depend
on the success or failure of one or more tasks before it can be executed.
 Dependencies are established using the set_upstream() and set_downstream() methods, or
more commonly with the bit-shift operators >> and <<, as in the snippet below.
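For example, the two lines below express the same dependency (the task names are placeholders):

task_a >> task_b             # bit-shift style: task_b runs after task_a
task_b.set_upstream(task_a)  # method style: identical dependency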
Python Code for creating DAG file in Apache Airflow
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator


def load_csv():
    # List of file names to be loaded
    file_names = [
        r"C:\Users\HP\Downloads\dataset\BenefitsCostSharing.csv",
        r"C:\Users\HP\Downloads\dataset\BusinessRules.csv",
        r"C:\Users\HP\Downloads\dataset\Network.csv",
        r"C:\Users\HP\Downloads\dataset\PlanAttributes.csv",
        r"C:\Users\HP\Downloads\dataset\Rate.csv",
        r"C:\Users\HP\Downloads\dataset\ServiceArea.csv",
    ]

    # Load all CSV files into a list of DataFrames
    dfs = [pd.read_csv(file) for file in file_names]

    # Concatenate the DataFrames into a single DataFrame
    combined_df = pd.concat(dfs, ignore_index=True)

    # The return value is pushed to XCom for downstream tasks.
    # Note: passing full DataFrames through XCom requires enable_xcom_pickling = True
    # in airflow.cfg; for large datasets, prefer writing to intermediate storage.
    return combined_df


def perform_transformation(**kwargs):
    ti = kwargs['ti']
    raw_data = ti.xcom_pull(task_ids='load_task')

    # Perform your transformations
    transformed_data = raw_data.copy()

    # Add a new column with a derived calculation
    transformed_data['new_calculated_column'] = (
        transformed_data['existing_column'] * 3 + transformed_data['another_column']
    )

    # Handle missing values (replace NaN with a default value)
    transformed_data['existing_column'] = transformed_data['existing_column'].fillna(0)

    # Apply a custom function to a column
    def custom_function(value):
        # Example: add 10 to each value in the column
        return value + 10

    transformed_data['another_column'] = transformed_data['another_column'].apply(custom_function)

    # Push the transformed data to XCom for later use
    ti.xcom_push(key='transformed_data', value=transformed_data)


def store_to_mysql(**kwargs):
    ti = kwargs['ti']
    transformed_data = ti.xcom_pull(task_ids='transform_task', key='transformed_data')

    # Replace username, password, and your_database with your actual MySQL connection details
    engine = create_engine('mysql+mysqlconnector://username:password@localhost:3306/your_database')

    # Upload the transformed data to a MySQL table
    transformed_data.to_sql('Health Insurance Marketplace', con=engine,
                            index=False, if_exists='replace')


default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 12, 14),
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'transform_dag',
    default_args=default_args,
    description='A DAG for data transformation and storage',
    schedule_interval=timedelta(days=1),
)

load_task = PythonOperator(
    task_id='load_task',
    python_callable=load_csv,
    dag=dag,
)

transform_task = PythonOperator(
    task_id='transform_task',
    python_callable=perform_transformation,
    dag=dag,
)

store_task = PythonOperator(
    task_id='store_task',
    python_callable=store_to_mysql,
    dag=dag,
)

# Set task dependencies
load_task >> transform_task >> store_task
Workflow:

1. Loading Task (load_task):
 Extracts the source data by reading the six CSV files and combining them into a single
DataFrame.
2. Transformation Task (transform_task):
 Performs data transformations on the extracted data. This could involve cleaning,
deriving new columns, or reshaping the data.
3. Storage Task (store_task):
 Loads the transformed data into the target destination, in this case a MySQL database.
4. Dependencies:
 load_task must complete successfully before transform_task can run, and similarly,
transform_task must complete before store_task.

Running the DAG:

 Save the script and place it in the DAGs directory configured in your Airflow installation.
 Airflow will automatically detect and schedule the DAG based on the defined
schedule_interval.
 Monitor the progress and logs through the Airflow web UI (http://localhost:8080 by
default).

LEARNING OUTCOMES
1. ETL Processes Understanding:
 Gain in-depth knowledge of Extract, Transform, Load (ETL) processes and the crucial role
of orchestrating workflows in data engineering.
2. Apache Airflow Mastery:
 Develop proficiency in utilizing Apache Airflow as a powerful tool for workflow
orchestration.
 Learn the art of defining Directed Acyclic Graphs (DAGs) for intricate workflows.
3. Python Scripting for Data Processing:
 Hone skills in crafting Python scripts tailored for data processing tasks within the Apache
Airflow framework.
 Explore the use of Python operators for diverse data manipulation operations.
4. Data Quality Assurance:
 Implement robust data quality checks to uphold the accuracy and reliability of processed
data.
5. Framework Integration Expertise:
 Gain hands-on experience in seamlessly integrating Apache Airflow with other frameworks,
tools, or services, enhancing the overall ETL pipeline.
6. Docker Integration Proficiency:
 Learn the ins and outs of using Docker for containerization, ensuring consistent and
reproducible environments.
7. Version Control Competence:
 Utilize version control tools like Git for efficient code management and collaborative
teamwork.
8. Documentation Skills:
 Practice creating thorough project documentation, including README files and reports, to
effectively communicate project details.
9. Project Management Aptitude:
 Acquire project management skills by effectively organizing and overseeing the
development, testing, and deployment phases of ETL projects.
10. Troubleshooting and Debugging Skills:
 Develop expertise in identifying and resolving issues within Apache Airflow workflows,
Python scripts, and during the ETL process.
11. Collaboration and Communication Excellence:
 Enhance collaborative and communication skills through interactions with team members,
stakeholders, and the broader data engineering community.
12. Practical Experience Focus:
 Obtain hands-on, practical experience in constructing end-to-end ETL workflows, providing
a realistic grasp of data engineering challenges and effective solutions.
