
CP7UA65O_14-FEB-22_10-JUN-22

Assignment 2

Title: Big Data Analytics Assessment 2 – Develop Python code to implement assignment tasks

Name: Rakesh Kumar

Student ID: 21554244

Subject: 1. Big Data (CP7UA65O_14-FEB-22_10-JUN-22)

Date: 17 June 2022

Submitted by – Rakesh Kumar (Student ID: 21554244)



Table of Contents

1. Overview
2. Part 1 of assignment
   2.1. Question 1
   2.2. Question 2
   2.3. Question 3
   2.4. Question 4
   2.5. Question 5
   2.6. Question 6
3. Part 2 of assignment
   3.1. Question 1
   3.2. Question 2
   3.3. Question 3
   3.4. Question 4
   3.5. Question 5


1. Overview

We have received a data file from the university, on which we need to implement certain tasks in code.
We have used PySpark to implement the tasks below, along with explanations of the code used.
Following each task, the output of the code execution is shown as evidence of a working program.


2. Part 1 of assignment

2.1. Question 1

Load the data file into a Spark DataFrame (1st DataFrame). Describe the structure of the DataFrame. (3 marks)

The first step is to import the PySpark library into the Jupyter notebook.

Load the ‘Medical_info.csv’ file into a Spark data frame using the following set of instructions. After executing the instructions, we will be able to load the CSV file into a Spark data frame and display the first 20 records of the data frame using the show() instruction.


To get the structure of the data frame we need to use the following set of instructions:

Get the number of rows and columns of the data frame.

Print the schema of the Spark data frame.


2.2. Question 2

Create a new DataFrame (2nd DataFrame) by removing all the rows with null/missing values in the 1st DataFrame and calculate the number of rows removed. (3 marks)

Use the instructions below to copy the existing data frame sdfData into a new data frame sdfData_Null and remove the records with null values using the dropna() instruction. To display the total number of records deleted, subtract sdfData_Null.count() from the original sdfData.count().


2.3. Question 3

Calculate summary statistics of the ‘age’ feature in the 2nd DataFrame, including its min value, max value, mean value, median value, variance and standard deviation. Generate a histogram for the ‘age’ feature and describe the distribution of the feature. (3 marks)

For this we need to select the ‘age’ column and use the summary().show() instruction.

We will be using the pandas library to plot a histogram for the ‘age’ feature.


2.4. Question 4

Display the quartile info of the ‘BMI’ feature in the 2nd DataFrame.
Generate a boxplot for the ‘BMI’ feature and discuss the
distribution of the feature based on the boxplot. (3 marks)

Use the following set of instructions to display the quartile information and generate a boxplot using the pandas library.


2.5. Question 5

Use Spark DataFrame API (i.e., expression methods) to count the number of rows where ‘age’ is greater than 50 and ‘BP_1’ equals 1. (3 marks)

We need to use PySpark SQL functions to run the query and display the number of records satisfying the said criteria in the query.


2.6. Question 6

Use the ‘BP_1’ feature in the 2nd DataFrame as the target label, to
build two classification models based on all other columns as
predictors. Conduct performance evaluation for the two models and
make conclusions. (15 marks)

Create a ‘Label’ column in the data frame, filling it with integer 1 where BP_1 is greater than 1 and 0 otherwise.
We need to consolidate all of the predictor columns into a single column. By following the steps mentioned in the code we can get the consolidated data into a single column.

Following is the new data frame ‘medical_assembled’, which is created from medData_Null using the vector assembler.
Now we need to split the data into two components: training data (80% of records) and test data (20% of records).


Check that the training data is around 80% of the total records by using the following commands.


Classification using Decision Tree

For classification with a decision tree we use the DecisionTreeClassifier class from PySpark’s machine learning classification module: create a classifier object, fit it to the training data, then create predictions for the testing data and take a look at the predictions.

Evaluate the decision tree using a confusion matrix


A confusion matrix gives a useful breakdown of predictions versus
known values.
It has four cells which represent the counts of:
True Negatives (TN): model predicts negative outcome & known
outcome is negative
True Positives (TP): model predicts positive outcome & known
outcome is positive
False Negatives (FN): model predicts negative outcome but known
outcome is positive
False Positives (FP): model predicts positive outcome but known
outcome is negative.


By using the logistic regression model, we get a precision of 1.00 and a recall of 1.00.
Precision is the proportion of positive predictions which are correct, and recall is the proportion of positive outcomes which are correctly predicted.


3. Part 2 of assignment

3.1. Question 1

Load the data file into a Spark DataFrame (1st DataFrame). Describe
the structure of the created data frame. (3 marks)

Load the ‘Region_info.csv’ file into a Spark data frame using the following set of instructions. After executing the instructions, we will be able to load the CSV file into a Spark data frame and display the first 20 records of the data frame using the show() instruction.


Use the following instruction to get the structure of the data frame.


3.2. Question 2

Create a new DataFrame (2nd DataFrame) by removing the ‘region’ column. (3 marks)

Use the drop(‘region’) instruction to drop the region column from the data frame and display the schema of the data frame. Here we can see the ‘region’ column has been dropped and is no longer displayed in the schema.


3.3. Question 3

Use a graph to explore and describe the relationship between the ‘fertility’ feature and the ‘life’ feature in the 2nd DataFrame. (3 marks)

For this we will be using the pandas library to plot the relationship between the ‘fertility’ and ‘life’ features.


3.4. Question 4

Use Spark SQL query to display the ‘fertility’ and ‘life’ columns in the 2nd DataFrame where ‘fertility’ is greater than 1.0 and ‘life’ is greater than 70. (3 marks)

We need to use PySpark SQL functions to run the query and display the records satisfying the said criteria in the query.


3.5. Question 5

Build a linear regression model to predict life expectancy (the ‘life’ column) in the 2nd DataFrame using the ‘fertility’ column as the predictor. Conduct performance evaluation for the model and make conclusions. (9 marks)

We need to consolidate all of the predictor columns into a single column. By following the steps mentioned in the code we can get the consolidated data into a single column.
Following this, we can see the assembled data frame, and then we will split the data into two components using the randomSplit() instruction.
Training data: used to train the model, 80% of the data from the data frame.
Testing data: used to test the model, 20% of the data from the data frame.


Check and verify that the training data is around 80% of the total records in the ‘region_assembled’ data frame.


Now create a regression object and train it on the training data using the ‘fit’ function, then create predictions for the testing data by calling the ‘transform’ function and check the predictions. Finally, examine the fitted coefficient, i.e. the average change in ‘life’ per unit of ‘fertility’, and evaluate the model.
