
Build a Data Pipeline in AWS using NiFi, Spark, and ELK Stack

Agenda
In this project, we build a data pipeline that combines several AWS services and Apache
projects: Apache NiFi, Apache Spark, AWS S3, Amazon EMR, Amazon OpenSearch, Logstash, and
Kibana. We fetch data from an API using Apache NiFi, transform it, and load it into an AWS S3
bucket. Logstash then ingests the data from the S3 bucket into Amazon OpenSearch, and Kibana
reads from Amazon OpenSearch to visualize it. Alongside this, we analyze the data using
PySpark.

Tech stack:
➔Language: Python
➔Package: PySpark
➔Services: Apache NiFi, AWS EC2, Apache Spark, AWS S3, Amazon EMR cluster, Amazon
OpenSearch, Logstash, Kibana

Apache NiFi:
Apache NiFi is a data logistics tool that automates the movement of data between diverse
systems. It provides real-time control over data flows, making it simple to manage the transfer
of data between any source and any destination. It also buffers all queued data.
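
To make the fetch-and-convert step concrete, here is a minimal Python sketch of what the NiFi flow does conceptually: read JSON records from an API and write them out as CSV. This is not how NiFi is configured (NiFi does this with processors such as InvokeHTTP and ConvertRecord); the function names `fetch_records`, `flatten`, and `records_to_csv` are hypothetical, and the sketch assumes the API returns a JSON object or a list of JSON objects.

```python
import csv
import io
import json
from urllib.request import urlopen


def fetch_records(url, timeout=10):
    """Fetch JSON from an API endpoint (assumed to return an object or a list)."""
    with urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    # Normalise a single object into a one-element list.
    return payload if isinstance(payload, list) else [payload]


def flatten(record, parent=""):
    """Flatten nested dicts into dotted column names, e.g. name.first."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def records_to_csv(records):
    """Convert a list of JSON-like dicts to CSV text with a header row."""
    rows = [flatten(r) for r in records]
    # Collect column names in first-seen order so the header is stable.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing fields are written as empty strings
    return buf.getvalue()
```

In the pipeline itself, the resulting CSV is what gets written to the S3 bucket.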

Amazon EMR cluster:

Amazon EMR is a managed cluster platform that makes it easy to run big data frameworks such
as Apache Hadoop and Apache Spark on AWS to process and analyze large volumes of data.
Using these frameworks and related open-source projects, you can process data for analytics
and business intelligence workloads.
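
As a rough illustration of the kind of PySpark analysis that could run on the EMR cluster, the sketch below reads the CSV files from S3 and computes a grouped count and average. The bucket name, key prefix, and column names are placeholders, not values from this project, and the helper names `s3_uri` and `summarize` are hypothetical. It assumes PySpark is available (as it is on an EMR cluster with Spark installed) and that the CSV files have a header row.

```python
def s3_uri(bucket, key_prefix):
    """Build an s3:// URI for a folder of CSV files (names are placeholders)."""
    return f"s3://{bucket}/{key_prefix.strip('/')}/"


def summarize(spark, path, group_col, value_col):
    """Group the CSV data by one column and aggregate another.

    `spark` is an existing SparkSession; PySpark is imported lazily so the
    helper above can be used without Spark installed.
    """
    from pyspark.sql import functions as F

    df = spark.read.csv(path, header=True, inferSchema=True)
    return (
        df.groupBy(group_col)
        .agg(
            F.count("*").alias("n"),
            F.avg(value_col).alias("avg_value"),
        )
        .orderBy(F.desc("n"))
    )


# Example usage on the cluster (all names below are assumptions):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("pipeline-analysis").getOrCreate()
#   summarize(spark, s3_uri("my-bucket", "nifi/output"), "country", "age").show()
```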

Amazon OpenSearch:

OpenSearch is a distributed, open-source search and analytics suite used for a variety of
use cases, including website search, log analytics, and real-time application monitoring.
OpenSearch offers a highly scalable system for fast access and response to large volumes of
data, together with an integrated visualisation tool, OpenSearch Dashboards, that makes it
simple for users to explore their data.

Logstash:

Logstash is a server-side, open-source, lightweight data processing pipeline that enables you to
gather data from many sources, alter it as you go, and deliver it where you want.
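
To show the shape of the data Logstash delivers to OpenSearch, here is a small Python sketch that turns CSV text into a request body for the OpenSearch `_bulk` API (one action line followed by one document line per row). This only illustrates the transformation; in the project, Logstash performs the ingestion through its own input and output plugins. The function name `csv_to_bulk_body` and the index name in the usage are hypothetical.

```python
import csv
import io
import json


def csv_to_bulk_body(csv_text, index_name):
    """Convert CSV text into newline-delimited JSON for the _bulk API."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        # Action line: index this document into `index_name`.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the CSV row as a JSON document (values stay strings).
        lines.append(json.dumps(row))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

Each CSV row becomes one indexed document; type conversion (for example, turning numeric strings into numbers) is left out of this sketch.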

Kibana:

Kibana is a tool for data visualisation and exploration that is used for operational intelligence
use cases, log and time-series analytics, and application monitoring. The popular analytics and
search engine Elasticsearch is tightly integrated with Kibana, making Kibana the go-to tool for
viewing Elasticsearch data.

Key Takeaways:
● Understanding the project overview
● Understanding the Data Pipeline
● Understanding the flow of the Data Pipeline
● Create an AWS EC2 instance
● Install Apache NiFi on the EC2 instance
● Fetch data from an API
● Understanding Apache NiFi tool
● Transform data in Apache NiFi
● Convert data from JSON to CSV using Apache NiFi
● Create an AWS S3 bucket
● Transfer data from Apache NiFi to the AWS S3 bucket
● Understand ELK stack
● Understand the use of OpenSearch, Logstash and Kibana
● Install Logstash
● Ingest data from AWS S3 into Amazon OpenSearch
● Visualize data in Kibana
● Perform data analysis using PySpark
