
GS Mandal’s

Maharashtra Institute of Technology, Aurangabad


Laboratory Manual - Practical Experiment Instruction Sheet

Department of Emerging Sciences and Technology


Class: TY-AI&DS Subject Code: AID321 Subject: Data Engineering
Div: Roll No: Name of Student:

EXPERIMENT NO: 1

AIM: To build the data engineering infrastructure.

OBJECTIVE: To understand Data Engineering Tools and their installation.

PRE-REQUISITES: Basics of Cloud

THEORY:

To build the data engineering infrastructure, we will:

1. install and configure two different databases – PostgreSQL and Elasticsearch,
2. install two tools to assist in building workflows – Apache Airflow and Apache NiFi, and
3. install two administrative tools – pgAdmin for PostgreSQL and Kibana for Elasticsearch.

Thus, building the data engineering infrastructure requires:


• Installing and configuring Apache NiFi
• Installing and configuring Apache Airflow
• Installing and configuring Elasticsearch
• Installing and configuring Kibana
• Installing and configuring PostgreSQL
• Installing pgAdmin 4

1. Installing and configuring Apache NiFi:

To install Apache NiFi, you will need to download it from https://nifi.apache.org/download.html:
1. By using curl, you can download NiFi with the following command (the --output flag saves the archive as nifi.tar.gz, the name used in the next step):
curl https://mirrors.estointernet.in/apache/nifi/1.12.1/nifi-1.12.1-bin.tar.gz --output nifi.tar.gz

2. Extract the NiFi files from the .tar.gz file using the following command:
tar xvzf nifi.tar.gz

3. You will now have a folder named nifi-1.12.1. You can run NiFi by executing
the following from inside the folder:
bin/nifi.sh start

4. If you already have Java installed and configured, when you run the status tool as
shown in the following snippet, you will see a path set for JAVA_HOME:
sudo bin/nifi.sh status
5. If you do not see JAVA_HOME set, you may need to install Java using the following
command:
sudo apt install openjdk-11-jre-headless

6. Then, you should edit .bash_profile to include the following line so that NiFi can find the JAVA_HOME variable:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

7. Lastly, reload .bash_profile:
source .bash_profile
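As a minimal sketch, steps 6 and 7 can be done from the command line (assuming the default Ubuntu path for the OpenJDK 11 package installed above):
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bash_profile
source ~/.bash_profile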

8. When you run the status command on NiFi again, you should now see a path for JAVA_HOME:
Figure 2.1 – NiFi is running

9. When NiFi is ready, which may take a minute, open your web browser and go to http://localhost:8080/nifi/. You should see the blank NiFi canvas.

10. Because Apache Airflow, installed later in this sheet, also runs on port 8080, change the port NiFi runs on. In conf/nifi.properties, change nifi.web.http.port=8080 under the web properties heading to 9300, as shown:
# web properties #
nifi.web.http.port=9300

If your firewall is on, you may need to open the port:
sudo ufw allow 9300/tcp
Now, you can relaunch NiFi and view the GUI at http://localhost:9300/nifi/.
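To confirm the GUI is being served on the new port, you can check from the command line (curl -I fetches only the response headers, so a successful response means NiFi is up):
curl -I http://localhost:9300/nifi/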

11. The most commonly used tool on the NiFi components toolbar is the Processor. The other tools, from left to right, are as follows:
• Input Port
• Output Port
• Processor Group
• Remote Processor Group
• Funnel
• Template
• Label

The following steps should be followed to configure a processor:

1. You must have a value set for any parameters that are bold. Each parameter has
a question mark icon to help you.
2. You can also right-click on the processor and select the configuration option.
3. For GenerateFlowFile, all the required parameters are already filled out.
4. Here, we have added a value to the Custom Text parameter. To add custom properties, you can click the plus sign at the upper right of the window. You will be prompted for a name and value. We added a property named filename and set the value to This is a file from nifi.
5. Once configured, the yellow warning icon on the box will turn into a red square (stop button).

To create a connection, hover over the processor box and a circle and arrow will appear:

1. Drag the circle to the processor underneath it (PutFile). It will snap into place, and you will then be prompted to specify which relationship this connection is for. The only choice will be Success, and it will already be checked.
2. Select OK. Lastly, right-click on the GenerateFlowFile processor and select
Run.
The red square icon will change to a green play button. You should now have a running data flow.

Installing PostgreSQL driver


Later in this experiment, you will install PostgreSQL. In order to connect to a PostgreSQL
database using a NiFi ExecuteSQL processor, you need a connection pool, and that
requires a Java Database Connectivity (JDBC) driver for the database you will be
connecting to. This section shows you how to download that driver for use later. To
download it, go to https://jdbc.postgresql.org/download.html and
download the PostgreSQL JDBC 4.2 driver, 42.2.10.
Make a new folder in your NiFi installation directory named drivers. Move the
postgresql-42.2.10.jar file into the folder.
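As a sketch, assuming NiFi was extracted to nifi-1.12.1 and the driver was downloaded to the current directory:
mkdir nifi-1.12.1/drivers
mv postgresql-42.2.10.jar nifi-1.12.1/drivers/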

Installing and configuring Apache Airflow

Installing Apache Airflow can be accomplished using pip. But, before installing Apache Airflow, you can change the location of the Airflow install by exporting AIRFLOW_HOME. If you want Airflow to install to /opt/airflow, export the AIRFLOW_HOME variable, as shown:

export AIRFLOW_HOME=/opt/airflow

If you only need Airflow to work with PostgreSQL, you can install just that sub-package by running the following:

pip install 'apache-airflow[postgres]'

To install Apache Airflow with the options for PostgreSQL, Slack, and Celery, use the following command:

pip install 'apache-airflow[postgres,slack,celery]'


To run Airflow, you need to initialize the database using the following:
airflow initdb

The default database for Airflow is SQLite. This is acceptable for testing and running on
a single machine, but to run in production and in clusters, you will need to change the
database to something else, such as PostgreSQL.
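For example, pointing Airflow at the dataengineering database created later in this sheet would mean editing the sql_alchemy_conn line in $AIRFLOW_HOME/airflow.cfg and rerunning airflow initdb. The user, password, host, and port below are assumptions based on the PostgreSQL defaults used in this sheet:
sql_alchemy_conn = postgresql+psycopg2://postgres:postgres@localhost:5432/dataengineering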
Note: If the airflow command cannot be found, you may need to add it to your path:
export PATH=$PATH:/home/<username>/.local/bin

The Airflow web server runs on port 8080, the same port as Apache NiFi. You already
changed the NiFi port to 9300 in the nifi.properties file, so you can start the
Airflow web server using the following command:

airflow webserver
If you did not change the NiFi port, or have any other processes running on port 8080,
you can specify the port for Airflow using the -p flag, as shown:
airflow webserver -p 8081

Next, start the Airflow scheduler so that you can run your data flows at set intervals. Run
this command in a different terminal so that you do not kill the web server:
airflow scheduler
Airflow will run without the scheduler, but you will receive a warning when you launch
the web server if the scheduler is not running.

Airflow installs several example data flows (Directed Acyclic Graphs, or DAGs) during install. You should see them on the main screen of the web GUI.
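You can also list the example DAGs from the command line (assuming the same 1.10-series CLI as the initdb command above):
airflow list_dags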

Installing and configuring Elasticsearch


Elasticsearch is a search engine. Here, you will use it as a NoSQL database, moving data both to and from Elasticsearch and other locations. To download Elasticsearch, take the following steps:
1. Use curl to download the files, as shown (the Linux build, since this sheet targets Ubuntu):
curl https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-linux-x86_64.tar.gz --output elasticsearch.tar.gz
2. Extract the files using the following command:
tar xvzf elasticsearch.tar.gz
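Once extracted, you can start Elasticsearch and confirm it is answering on its default port, 9200 (a quick sanity check; the node may take a few seconds to start):
cd elasticsearch-7.6.0
bin/elasticsearch &
curl http://localhost:9200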

Installing and configuring Kibana


Elasticsearch does not ship with a GUI, but rather an API. To add a GUI to Elasticsearch,
you can use Kibana. By using Kibana, you can better manage and interact with
Elasticsearch. Kibana will allow you to access the Elasticsearch API in a GUI, but more
importantly, you can use it to build visualizations and dashboards of your data held in
Elasticsearch. To install Kibana, take the following steps:
1. Using wget, add the GPG key:
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
2. Then, add the repository:
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
3. Lastly, update apt and install Kibana:
sudo apt-get update
sudo apt-get install kibana
4. The configuration files for Kibana are located in /etc/kibana and the application is in /usr/share/kibana/bin. To launch Kibana, run the following from /usr/share/kibana:
bin/kibana
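Kibana serves its interface on port 5601 by default, so once it has started you can verify it from the command line:
curl -I http://localhost:5601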

Installing and configuring PostgreSQL


PostgreSQL is an open source relational database. It compares to Oracle or Microsoft SQL Server. PostgreSQL also has a plugin – PostGIS – which adds spatial capabilities to PostgreSQL. Here, it will be the relational database of choice. PostgreSQL can be installed on Linux as a package:
1. For a Debian-based system, use apt-get, as shown:
sudo apt-get install postgresql-11
2. Once the packages have finished installing, you can start the database with the
following:
sudo pg_ctlcluster 11 main start
3. The default user, postgres, does not have a password. To add one, connect to the
default database:
sudo -u postgres psql
4. Once connected, you can alter the user and assign a password:
ALTER USER postgres PASSWORD 'postgres';
5. Exit psql by typing \q. Then, to create a database, enter the following command from the shell:
sudo -u postgres createdb dataengineering
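To confirm the database was created, you can list all databases with the \l meta-command; dataengineering should appear in the output:
sudo -u postgres psql -c '\l'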

Installing pgAdmin 4
pgAdmin 4 will make managing PostgreSQL much easier if you are new to relational
databases. The web-based GUI will allow you to view your data and allow you to visually
create tables. To install pgAdmin 4, take the following steps:
1. You need to add the pgAdmin repository to Ubuntu. Run the following commands to add the key and repository, then install pgAdmin 4:
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list'
sudo apt update
sudo apt install pgadmin4 pgadmin4-apache2 -y
2. You will be prompted to enter an email address for a username and then for a password. Once both are set, the installation is complete.
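With the pgadmin4-apache2 package, the GUI is typically served through Apache; assuming the default configuration, you should be able to reach it in your browser at http://localhost/pgadmin4, or check it from the command line:
curl -I http://localhost/pgadmin4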

Conclusion: Here we studied how to install and configure many of the tools used by data engineers. You now have two working databases – Elasticsearch and PostgreSQL – as well as two tools for building data pipelines – Apache NiFi and Apache Airflow – which together comprise the data engineering infrastructure.
