
DATA ANALYSIS USING BIG DATA TOOLS

A PROJECT REPORT

Submitted by

Dheeraj Singh Dhami (21BCS3113)
Manasvi Rajeev Sharma (21BCS3092)

in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

Chandigarh University

May 2023
BONAFIDE CERTIFICATE

Certified that this project report "DATA ANALYSIS USING BIG DATA TOOLS" is the
bonafide work of DHEERAJ SINGH DHAMI and MANASVI RAJEEV SHARMA, who carried out
the project work under my supervision.

Dr. Puneet Kumar                                  Er. Hari Mohan Dixit

HEAD OF THE DEPARTMENT                            SUPERVISOR
CSE                                               AP, CSE

Submitted for the project viva-voce examination held on

INTERNAL EXAMINER                                 EXTERNAL EXAMINER
CHAPTER 1.

INTRODUCTION

1.1. Identification of Client /Need / Relevant Contemporary issue

We have a T-Series music video dataset, and let us assume that the client wants
to see an analysis of the overall data. The dataset is very large (it may run to
billions of rows), so using a traditional DBMS alone is not feasible. We will
therefore use a Big Data tool, Apache Spark, to transform the data, generate the
necessary aggregated output tables, and store them in a MySQL database. With this
architecture, the UI can fetch reports and charts from MySQL much faster than it
could by querying the raw data directly. Finally, the batch job we use to analyze
the data can be automated to run on a daily basis within a fixed time window.

1.2. Identification of Problem/Tasks


1. Transform the raw data into multiple tables as per the requirement.
2. Load the tables into MySQL.
3. Automate the flow so it can be scheduled to run on a regular basis.
4. Set up the environment and install all the tools required for the project.
5. Read data from the CSV file and store the data in HDFS (Hadoop Distributed
File System) in a compressed format (see the sketch after this list).
6. Transform the raw data and build multiple tables by performing the required
aggregations.
7. Load the final tables into MySQL tables.
8. Automate the full flow using a shell script.
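
As a minimal sketch of task 5, assuming hypothetical file paths and an HDFS URI of
hdfs://localhost:9000, the snippet below reads the raw CSV with PySpark and stores
it in HDFS in a compressed format:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for the ingestion batch
spark = SparkSession.builder.appName("tseries-ingest").getOrCreate()

# Read the raw CSV file; the local path is a placeholder for illustration
raw_df = spark.read.csv(
    "file:///data/raw/tseries_videos.csv",
    header=True,
    inferSchema=True,
)

# Store the data in HDFS in a compressed format.
# Parquet with Snappy compression is one common choice; gzip-compressed
# CSV would also satisfy the "compressed format" requirement.
raw_df.write.mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("hdfs://localhost:9000/user/data/tseries_videos_raw")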

Timeline

PROBLEM STUDY:                      16 February 2023 - 19 February 2023
PLANNING:                           20 February 2023 - 26 February 2023
REQUIREMENT ANALYSIS & GATHERING:   27 February 2023 - 03 March 2023
DESIGNING:                          04 March 2023 - 15 March 2023
DEVELOPMENT:                        16 March 2023 - 10 May 2023
DEPLOYMENT:                         11 May 2023
1.3. Requirements
• It is expected that you are using a Linux distribution. (A cloud system can be
a substitute.)

• We have to install all the tools and set up the environment (if you have
already installed the required tools you can skip this task). Make sure you
install all the required software in one location for simplicity.

• Install Hadoop on your system using this tutorial.

• Once Hadoop is set up, start the services using the start-all.sh command and
run jps to check whether the services are up or not. The screenshot below shows
the services expected to be running after a successful installation.

[Screenshot: hadoop-services — output of jps listing the running Hadoop daemons]

• Now, you can install Apache Spark using this link.

• Once Spark is installed, we will install Anaconda. Download the Anaconda bash
installer file from the Anaconda website, then install and initialize it.

• Install MySQL as well.

• By default, PySpark starts in the terminal. To use Jupyter Notebook for
development, we have to set some properties in the ~/.bashrc file:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

• Finally, you can run the pyspark command in the terminal, which should start
Spark in a Jupyter Notebook.

1.4. Organization of the Report


The report for this project is organized into 5 chapters, stated below:

• CHAPTER 1. INTRODUCTION:
It will include the Identification of Client/Need/Relevant Contemporary Issue,
Identification of Problem, Identification of Tasks, Timeline, and Organization
of the Report.

• CHAPTER 2. LITERATURE REVIEW/BACKGROUND STUDY:
It will include the Timeline of the reported problem, Existing solutions,
Bibliometric analysis, Review Summary, Problem Definition, and Goals/Objectives.

• CHAPTER 3. DESIGN FLOW/PROCESS:
It will include the Evaluation & Selection of Specifications/Features, Design
Constraints, Analysis of Features and finalization subject to constraints,
Design Flow, Design Selection, and Implementation Plan/Methodology.

• CHAPTER 4. RESULTS ANALYSIS AND VALIDATION:
It will include the Implementation of the solution.

• CHAPTER 5. CONCLUSION AND FUTURE WORK:
It will include the Conclusion and Future Work.

CHAPTER 2.

LITERATURE REVIEW/BACKGROUND STUDY

2.1 Abstract:
The exponential growth of data has led to an increase in the volume, velocity, and
variety of data generated. Traditional data analysis tools are no longer sufficient to
handle such large data sets. Big data tools provide a solution to this challenge by
enabling analysts to process, analyze, and derive insights from massive data sets.
This research paper provides an overview of big data analytics, explores the
various big data tools available, identifies challenges faced in big data analytics,
and provides best practices for overcoming these challenges.

2.2 Introduction:
The advent of big data has created a new era of data analysis, where traditional
data analysis tools are no longer capable of handling the scale of data being
generated. Big data analytics refers to the use of advanced techniques and tools to
analyze and extract insights from large data sets. The goal of this research paper
is to examine the use of big data tools for data analysis: to explain the importance
of big data analytics, explore the various big data tools available, identify the
challenges faced in big data analytics, and present best practices for overcoming
these challenges.
Importance of Big Data Analytics:
Big data analytics plays a significant role in enabling organizations to make
informed decisions based on insights derived from their data. It provides a
powerful tool for analyzing data, identifying patterns, trends, and insights that
would otherwise be difficult to discern. For instance, big data analytics can be used
to analyze customer behavior, identify fraud, optimize business processes, and
improve customer satisfaction. By using big data analytics, businesses can gain a
competitive edge by making informed decisions based on insights derived from
their data.

Tools for Big Data Analysis:


There are several big data tools available for analyzing large data sets. One of the
most popular tools is Apache Hadoop, an open-source framework that enables
distributed processing of large data sets. Hadoop provides a distributed file system
called HDFS, which facilitates efficient storage and retrieval of large data sets.
Apache Spark is another popular tool for big data analytics. Spark is a fast and
efficient engine for large-scale data processing that provides an easy-to-use
interface for data analysis and can be used with a variety of programming
languages, including Python, Java, and Scala. Other big data tools include Apache
Cassandra, Apache Flink, and Apache Storm.

Challenges in Big Data Analytics:


Despite the benefits of big data analytics, there are several challenges that must be
addressed. One of the most significant challenges is data quality. Large data sets
often contain errors and inconsistencies, which can lead to inaccurate results if not
properly addressed. Data integration is another challenge, which involves
combining data from different sources into a single data set. This can be a complex
and time-consuming process, particularly when dealing with data from disparate
sources. Another challenge is data privacy and security, which requires
organizations to ensure that their data is secure and protected from unauthorized
access.
Best Practices for Big Data Analytics:
To overcome the challenges in big data analytics, it is important to follow best
practices. These include data quality checks, data normalization, and the use of
machine learning algorithms for data cleansing. It is also important to have a clear
understanding of the business problem being addressed and to choose the
appropriate tool for the task. Another best practice is to ensure that data is stored in
a format that is easily accessible and usable by big data tools. Finally, it is
important to have a skilled team of data analysts who are proficient in big data
tools and techniques.

Case Study:
A case study on the use of big data analytics in the healthcare industry can provide
an insight into how big data tools can be used to extract insights from large data
sets. In the healthcare industry, big data analytics can be used to improve patient
outcomes, identify disease patterns, and optimize resource utilization. For example,
the use of big data analytics can enable healthcare providers to identify high-risk
patients, develop personalized treatment plans, and

CHAPTER 3
DESIGN FLOW/PROCESS

Reading data from CSV files and transforming it to generate final output tables to
be stored in traditional DBMS has several key features:

1. CSV files are a widely used format for storing data, and can be easily
created and edited using spreadsheet software such as Microsoft Excel or
Google Sheets.
2. The process of reading data from CSV files is relatively simple and can
be done using a variety of programming languages, such as Python or
Java.
3. Data transformation is an essential part of this process, as CSV files often
contain unstructured or inconsistent data that needs to be cleaned and
standardized before it can be stored in a database.
4. Traditional DBMS such as MySQL, PostgreSQL, or Oracle are designed
to handle large volumes of structured data and provide advanced features
for data querying, analysis, and reporting.

However, there are some potential drawbacks and limitations to this approach, such
as:

1. CSV files may not be the best choice for storing large volumes of data, as
they can become unwieldy and difficult to manage over time.
2. The process of data transformation can be time-consuming and complex,
especially if the CSV files contain large amounts of unstructured or
inconsistent data.
3. The use of traditional DBMS can also be limiting, as these systems are
often designed for specific use cases and may not be flexible enough to
handle changing data requirements or data models.

To address these limitations and ensure an effective solution, the following features
are ideally required:

1. Scalability: The solution should be able to handle large volumes of data,
with the ability to scale up or down as needed.
2. Flexibility: The solution should be able to handle a variety of data formats
and types, and be flexible enough to accommodate changes in data
requirements or models.
3. Automation: The solution should automate as much of the data
transformation process as possible, to reduce the risk of errors and save
time.
4. Data quality: The solution should include features to ensure data quality,
such as data validation and data profiling, to identify and address any
issues with the data.
5. Security: The solution should include security features to protect sensitive
data, such as encryption and access controls.
6. Integration: The solution should be able to integrate with other systems
and tools, such as data visualization or business intelligence software, to
provide a complete end-to-end solution for data processing and analysis.
To implement reading data from CSV files, transforming the data using PySpark,
and storing the final output tables in a traditional DBMS, you can follow these
steps:

1. Install PySpark, Hadoop (with HDFS), and any necessary JDBC drivers for your
DBMS on your Linux machine.

2. Use PySpark to read the CSV files from HDFS. PySpark's DataFrameReader can load
CSV files directly as DataFrames, for example through spark.read.csv. Here's an
example:
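
The sketch below is a minimal version of the read step; the HDFS URI, file path,
and options are assumptions for illustration and should be adjusted to your
environment.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tseries-analysis").getOrCreate()

# Read the CSV data stored in HDFS as a DataFrame.
# header=True treats the first row as column names;
# inferSchema=True lets Spark infer the column types.
videos_df = spark.read.csv(
    "hdfs://localhost:9000/user/data/tseries_videos.csv",
    header=True,
    inferSchema=True,
)

videos_df.printSchema()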

3. Transform the data using PySpark's DataFrame API. PySpark provides a rich set
of APIs to manipulate DataFrames. You can perform operations like filtering,
aggregation, joining, and more on DataFrames. Here's an example:
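
A sketch of one such transformation is shown below: it builds a per-channel
summary table. The column names (channel_title, views, likes) are hypothetical
placeholders and should be replaced with the actual columns in the dataset.

from pyspark.sql import functions as F

# Example aggregation: video count, total views, and average likes per channel.
channel_summary_df = (
    videos_df
    .filter(F.col("views").isNotNull())
    .groupBy("channel_title")
    .agg(
        F.count("*").alias("video_count"),
        F.sum("views").alias("total_views"),
        F.avg("likes").alias("avg_likes"),
    )
    .orderBy(F.col("total_views").desc())
)

channel_summary_df.show(10)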

4. Store the final output tables in your traditional DBMS. PySpark can write to
many popular DBMSs, such as MySQL, PostgreSQL, and Oracle, through its JDBC data
source. You can use the appropriate JDBC driver to write the DataFrames to your
DBMS. Here's an example:
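
A sketch of the load step using the JDBC data source is shown below. The
connection URL, database, table name, and credentials are placeholders, and the
MySQL Connector/J driver is assumed to be on Spark's classpath (for example via
the spark.jars configuration).

# Write the aggregated table to MySQL through the JDBC data source.
(
    channel_summary_df.write
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/tseries_reports")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "channel_summary")
    .option("user", "report_user")
    .option("password", "report_password")
    .mode("overwrite")
    .save()
)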
With these steps, you can implement reading data from CSV files, transforming the
data using PySpark, and storing the final output tables in a traditional DBMS.

The implementation process can be broken down into several steps:

1. Reading data from CSV files
2. Transforming the data to generate the final output tables
3. Storing the output tables in a traditional DBMS

Here's an example implementation using PySpark, Linux basics, and the Hadoop File
System:

Step 1: Reading data from CSV files


Step 2: Transforming the data to generate the final output tables

Step 3: Storing the output tables in a traditional DBMS
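
The three steps can be combined into one small job. The sketch below is a minimal
end-to-end version under assumed paths, column names (category, views), and
PostgreSQL connection details; the PostgreSQL JDBC driver must be available to
Spark.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("csv-to-dbms").getOrCreate()

# Step 1: read the CSV file from HDFS (the path is a placeholder)
df = spark.read.csv(
    "hdfs://localhost:9000/user/data/input.csv",
    header=True,
    inferSchema=True,
)

# Step 2: transform the data into the final output table
# (hypothetical columns: category, views)
output_df = (
    df.groupBy("category")
      .agg(F.sum("views").alias("total_views"))
)

# Step 3: write the output table to PostgreSQL via JDBC
# (the URL, table name, and credentials are placeholders)
output_df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/reports") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "category_summary") \
    .option("user", "report_user") \
    .option("password", "report_password") \
    .mode("overwrite") \
    .save()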

In this example, we're reading a CSV file into a PySpark DataFrame, applying some
transformations to generate the final output table, and then writing the output
table to a traditional DBMS (in this case, PostgreSQL).
To execute this code, you'll need to have PySpark installed and configured, as well
as access to a Hadoop File System and a traditional DBMS.

