
CP7UA65O_14-FEB-22_10-JUN-22

Assignment 2

Title: Big Data Analytics Assessment 2 – Develop Python code to implement assignment tasks

Name: Rakesh Kumar

Student ID: 21554244

Subject: 1. Big Data (CP7UA65O_14-FEB-22_10-JUN-22)

Date: 17 June 2022

Submitted by – Rakesh Kumar (Student ID: 21554244)



Table of Contents

1. Overview
2. Part 1 of assignment
   2.1. Question 1
   2.2. Question 2
   2.3. Question 3
   2.4. Question 4
   2.5. Question 5
   2.6. Question 6
3. Part 2 of assignment
   3.1. Question 1
   3.2. Question 2
   3.3. Question 3
   3.4. Question 4
   3.5. Question 5


1. Overview

We have received a data file from the university, on which we need to implement certain tasks in code.
We have used PySpark to implement the tasks below, along with explanations of the code used.
Following each task, the output of the code execution is shown as evidence of a working program.


2. Part 1 of assignment

2.1. Question 1

Load the data file into a Spark DataFrame (1st DataFrame). Describe the structure of the DataFrame. (3 marks)

The first step is to import the PySpark library into the Jupyter notebook.

Load the ‘Medical_info.csv’ file into a Spark data frame using the following set of instructions. After executing the instructions, we will be able to load the CSV file into a Spark data frame and display the first 20 records of the data frame using the show() instruction.


To get the structure of the data frame we need to use the following set of instructions:

Get the number of rows and columns of the data frame.

Print the schema of the Spark data frame.


2.2. Question 2

Create a new DataFrame (2nd DataFrame) by removing all the rows with null/missing values in the 1st DataFrame and calculate the number of rows removed. (3 marks)

Use the instructions below to copy the existing data frame sdfData into a new data frame sdfData_Null and remove the records with null values using the dropna() instruction. To display the total number of records deleted, subtract sdfData_Null.count() from the original sdfData.count().


2.3. Question 3

Calculate summary statistics of the ‘age’ feature in the 2nd DataFrame, including its min value, max value, mean value, median value, variance and standard deviation. Generate a histogram for the ‘age’ feature and describe the distribution of the feature. (3 marks)

For this we need to select the ‘age’ column and use the summary().show() instruction.

We will be using the pandas library to plot a histogram for the ‘age’ feature.


2.4. Question 4

Display the quartile info of the ‘BMI’ feature in the 2nd DataFrame.
Generate a boxplot for the ‘BMI’ feature and discuss the
distribution of the feature based on the boxplot. (3 marks)

Use the following set of instructions to display the quartile information and generate a boxplot using the pandas library.


2.5. Question 5

Use Spark DataFrame API (i.e., expression methods) to count the number of rows where ‘age’ is greater than 50 and ‘BP_1’ equals 1. (3 marks)

We need to use PySpark SQL functions to run the query and display the number of records satisfying the said criteria in the query.


2.6. Question 6

Use the ‘BP_1’ feature in the 2nd DataFrame as the target label, to
build two classification models based on all other columns as
predictors. Conduct performance evaluation for the two models and
make conclusions. (15 marks)

Create a ‘Label’ column in the data frame, filling it with integer 1 where BP_1 is greater than 1 and 0 otherwise.
We need to consolidate all of the predictor columns into a single column. By following the steps mentioned in the code we can get the consolidated data into a single column.

Following is the new data frame ‘medical_assembled’, which is created from medData_Null using the vector assembler.
Now we need to split the data into two components: training data (80% of records) and test data (20% of records).


Check that the training data is around 80% of the total records by using the following commands.


Classification using Decision Tree

For classification with a decision tree we use the DecisionTreeClassifier class from PySpark’s machine learning classification module: create a classifier object, fit it to the training data, then create predictions for the testing data and take a look at the predictions.

Evaluate the decision tree using a confusion matrix


A confusion matrix gives a useful breakdown of predictions versus
known values.
It has four cells which represent the counts of:
True Negatives (TN): model predicts negative outcome & known
outcome is negative
True Positives (TP): model predicts positive outcome & known
outcome is positive
False Negatives (FN): model predicts negative outcome but known
outcome is positive
False Positives (FP): model predicts positive outcome but known
outcome is negative.


By using the logistic regression model, we get a precision of 1.00 and a recall of 1.00.
Precision is the proportion of positive predictions which are correct, and recall is the proportion of positive outcomes which are correctly predicted.


3. Part 2 of assignment

3.1. Question 1

Load the data file into a Spark DataFrame (1st DataFrame). Describe
the structure of the created data frame. (3 marks)

Load the ‘Region_info.csv’ file into a Spark data frame using the following set of instructions. After executing the instructions, we will be able to load the CSV file into a Spark data frame and display the first 20 records of the data frame using the show() instruction.


Use the following instruction to get the structure of the data frame.


3.2. Question 2

Create a new DataFrame (2nd DataFrame) by removing the ‘region’ column. (3 marks)

Use the drop(‘region’) instruction to drop the region column from the data frame and display the schema of the data frame. Here we can see the ‘region’ column has been dropped and is no longer displayed in the schema.


3.3. Question 3

Use a graph to explore and describe the relationship between the ‘fertility’ feature and the ‘life’ feature in the 2nd DataFrame. (3 marks)

For this we will be using the pandas library to plot the relationship between the ‘fertility’ and ‘life’ features.


3.4. Question 4

Use Spark SQL query to display the ‘fertility’ and ‘life’ columns in the 2nd DataFrame where ‘fertility’ is greater than 1.0 and ‘life’ is greater than 70. (3 marks)

We need to use PySpark SQL functions to run the query and display the records satisfying the said criteria in the query.


3.5. Question 5

Build a linear regression model to predict life expectancy (the ‘life’ column) in the 2nd DataFrame using the ‘fertility’ column as the predictor. Conduct performance evaluation for the model and make conclusions. (9 marks)

We need to consolidate all of the predictor columns into a single column. By following the steps mentioned in the code we can get the consolidated data into a single column.
Following this, we can see the assembled data frame, and then we will split the data into two components using the randomSplit() instruction.
Training data: used to train the model, 80% of the data from the data frame.
Testing data: used to test the model, 20% of the data from the data frame.


Check and verify that the training data is around 80% of the total records in the ‘region_assembled’ data frame.


Now create a regression object and train it on the training data using the ‘fit’ function, then create predictions for the testing data by calling the ‘transform’ function and check the predictions. Finally, examine the fitted coefficient, i.e. the average change in ‘life’ per unit of ‘fertility’, and evaluate the model.
