Machine Learning Project Report 1
DSBA
Kaarthikeyan Senthilmaran
PGP-DSBA Online Feb 22
Date: 25/09/2022
Contents
Table of Figures
Table of Tables
Problem 1:
Executive Summary:
Introduction:
Problem Questions:
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?
Problem 2:
Executive Summary:
Introduction:
Problem Questions:
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords) – 3 Marks [ refer to the End-to-End Case Study done in the Mentored Learning Session ]
Table of Figures
Figure 1 Univariate Analysis – Age.............................................................................................................11
Figure 2 Univariate Analysis – economic.cond.national..................................................................12
Figure 3 Univariate Analysis – economic.cond.household..............................................................13
Figure 4 Univariate Analysis – Blair............................................................................................................14
Figure 5 Univariate Analysis – Hague.........................................................................................................15
Figure 6 Univariate Analysis – Europe........................................................................................................16
Figure 7 Univariate Analysis – political.knowledge........................................................................17
Figure 8 Univariate Analysis – vote............................................................................................................18
Figure 9 Univariate Analysis – gender........................................................................................................18
Figure 10 Correlation Heatmap.................................................................................................................20
Figure 11 Pairplot......................................................................................................................................21
Figure 12 Box plots after outlier Treatment...............................................................................................22
Figure 13 Linear Discriminant Analysis......................................................................................................30
Figure 14 KNN model.................................................................................................................................31
Figure 15 Custom Cut-Off selection - Accuracy, F1 score and Confusion Matrix........................................34
Figure 16 Misclassification Error across No. of Neighbors k......................................................................35
Figure 17 Confusion Matrix.......................................................................................................................40
Figure 18 AUC and ROC for the training data- Initial model.......................................................................41
Figure 19 AUC and ROC for testing data- Initial model..............................................................................41
Figure 20 Confusion Matrix - Training data- Initial model..........................................................................41
Figure 21 Confusion matrix - Test data- Initial model................................................................................42
Figure 22 AUC and ROC for the training data- Best model.........................................................................43
Figure 23 AUC and ROC for testing data- Best model................................................................................43
Figure 24 Confusion Matrix - Training data- Best model............................................................................44
Figure 25 Confusion matrix - Test data- Best model..................................................................................44
Figure 26 AUC and ROC for training and testing data- Initial model..........................................................45
Figure 27 Confusion Matrix - Training and Testing data- Initial model.......................................................46
Figure 28 AUC and ROC for the training data- Initial model.......................................................................48
Figure 29 AUC and ROC for testing data- Initial model..............................................................................48
Figure 30 Confusion Matrix - Training data- Initial model..........................................................................48
Figure 31 Confusion matrix - Test data- Initial model................................................................................49
Figure 32 AUC and ROC for the training data- best model.........................................................................50
Figure 33 AUC and ROC for testing data- best model................................................................................50
Figure 34 Confusion Matrix - Training data- best model............................................................................51
Figure 35 Confusion matrix - Test data- best model..................................................................................51
Figure 36 AUC and ROC for the training data- Initial model.......................................................................52
Figure 37 AUC and ROC for testing data- Initial model..............................................................................52
Figure 38 Confusion Matrix - Training data- Initial model..........................................................................53
Figure 39 Confusion matrix - Test data- Initial model................................................................................53
Figure 40 AUC and ROC for the training data- best model.........................................................................54
Figure 41 AUC and ROC for testing data- best model................................................................................55
Figure 42 Confusion Matrix - Training data- best model............................................................................55
Figure 43 Confusion matrix - Test data- best model..................................................................................55
Figure 44 AUC and ROC for the training data- Bagging..............................................................................57
Figure 45 AUC and ROC for testing data- Bagging......................................................................................57
Figure 46 Confusion Matrix - Training data- Bagging.................................................................................57
Figure 47 Confusion matrix - Test data- Bagging........................................................................................58
Figure 48 AUC and ROC for the training data- AdaBoost...........................................................................59
Figure 49 AUC and ROC for testing data- AdaBoost...................................................................................59
Figure 50 Confusion Matrix - Training data- AdaBoost..............................................................................60
Figure 51 Confusion matrix - Test data- AdaBoost.....................................................................................60
Figure 52 AUC and ROC for the training data- Gradient Boosting..............................................................61
Figure 53 AUC and ROC for testing data- Gradient Boosting......................................................................61
Figure 54 Confusion Matrix - Training data- Gradient Boosting.................................................................62
Figure 55 Confusion matrix - Test data- Gradient Boosting........................................................................62
Figure 56 Result set...................................................................................................................................65
Table of Tables
Table 1 Dataset Sample...............................................................................................................................6
Table 2 Data Dictionary................................................................................................................................7
Table 3 Problem 1: Data Information...........................................................................................................7
Table 4 Duplicated Records..........................................................................................................................8
Table 5 Missing Values.................................................................................................................................8
Table 6 Problem 1: Summary Stats..............................................................................................................8
Table 7 Problem 1: Skewness.......................................................................................................................9
Table 8 Problem 1: Missing/Null values.....................................................................................................10
Table 9 Problem 1: Shape and Data types..................................................................................................10
Table 10 Outlier Proportions......................................................................................................................22
Table 11 Ordinal encoding - vote – Codes..................................................................................................23
Table 12 Ordinal encoding - gender – Codes.............................................................................................23
Table 13 Sample of the dataset after z-score scaling.................................................................................24
Table 14 Sample of X - Dataset of predictor variables...............................................................................24
Table 15 Predictor variables dataset - X - info............................................................................................24
Table 16 Sample of y - Dataset of target variables.....................................................................................25
Table 17 Independent variables - Training dataset - Sample......................................................................25
Table 18 Independant variables - Training dataset - Info...........................................................................26
Table 19 Independent variables - Testing dataset - Sample.......................................................................26
Table 20 Independent variables - Testing dataset - Info.............................................................................26
Table 21 Target variable - Training dataset - Sample..................................................................................27
Table 22 Target variable - Training dataset - Info.......................................................................................27
Table 23 Target variable - Testing dataset - Sample...................................................................................27
Table 24 Target variable - Testing dataset – Info........................................................................................27
Table 25 Predicted Class - Logistic Regression...........................................................................................28
Table 26 Predicted Class - Logistic Regression - Best parameters..............................................................29
Table 27 Accuracy - Best model.................................................................................................................29
Table 28 Predicted Class – LDA..................................................................................................................30
Table 29 Accuracy – Initial LDA model.......................................................................................................30
Table 30 Predicted Class – KNN.................................................................................................................31
Table 31 Accuracy – Initial KNN model.......................................................................................32
Table 32 Predicted Class – Naïve Bayes.....................................................................................................33
Table 33 Accuracy – Initial Naïve Bayes model..........................................................................................33
Table 34 Accuracy - Tuned Model - Logistic Regression.............................................................................34
Table 35 Accuracy - Tuned Model - Logistic Regression.............................................................................35
Table 36 Accuracy - Tuned Model – KNN...................................................................................................36
Table 37 Accuracy - Tuned Model – KNN...................................................................................................36
Table 38 Predicted Class – Bagging over Random Forest...........................................................................37
Table 39 Accuracy – Bagging over Random Forest.....................................................................................37
Table 40 Predicted Class – AdaBoost.........................................................................................................38
Table 41 Accuracy – AdaBoost...................................................................................................................38
Table 42 Predicted Class – Gradient Boosting............................................................................................39
Table 43 Accuracy – Gradient Boosting......................................................................................................39
Table 44 Accuracy - Initial model...............................................................................................................40
Table 45 Classification Report - Training data- Initial model......................................................................42
Table 46 Classification Report - Testing data- Initial model........................................................................42
Table 47 Accuracy - Best model.................................................................................................................43
Table 48 Classification Report - Training data- Best model........................................................................44
Table 49 Classification Report - Testing data - Best Model.........................................................................45
Table 50 Accuracy - Initial model...............................................................................................................45
Table 51 Classification Report - Training and Testing data- Initial model...................................................46
Table 52 Accuracy - Custom Cut-Off...........................................................................................................47
Table 53 Classification report - Custom cutoff test data – LDA..................................................................47
Table 54 Accuracy - Initial model...............................................................................................................47
Table 55 Classification Report - Training data- Initial model......................................................................49
Table 56 Classification Report - Testing data- Initial model........................................................................49
Table 57 Accuracy - Best model.................................................................................................................49
Table 58 Classification Report - Training data- best model........................................................................51
Table 59 Classification Report - Testing data - Best Model.........................................................................52
Table 60 Accuracy - Initial model...............................................................................................................52
Table 61 Classification Report - Training data- Initial model......................................................................53
Table 62 Classification Report - Testing data- Initial model........................................................................54
Table 63 Accuracy - Best model.................................................................................................................54
Table 64 Classification Report - Training data- best model........................................................................56
Table 65 Classification Report - Testing data - Best Model.........................................................................56
Table 66 Accuracy - Bagging.......................................................................................................................56
Table 67 Classification Report - Training data- Bagging..............................................................................58
Table 68 Classification Report - Testing data - Bagging..............................................................................58
Table 69 Accuracy - AdaBoost....................................................................................................................58
Table 70 Classification Report - Training data- AdaBoost...........................................................................60
Table 71 Classification Report - Testing data - AdaBoost............................................................................61
Table 72 Accuracy - Gradient Boosting......................................................................................................61
Table 73 Classification Report - Training data- Gradient Boosting..............................................................62
Table 74 Classification Report - Testing data – Gradient Boosting.............................................................63
Table 75 Accuracy of each algorithm and calculated stability....................................................................63
Table 76 Classification Report - Training data- best Naïve Bayes model....................................64
Table 77 Classification Report - Testing data - best Naïve Bayes Model.....................................64
Problem 1:
Executive Summary:
One of the leading news channels, CNBE, wants to analyze the recent elections. A survey was conducted on 1,525 voters across 9 variables. We have to build a model that predicts which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help in predicting the overall winner and the seats covered by a particular party.
Introduction:
The purpose of this exercise is to utilize various models to predict the vote of each voter and to compare the results so as to find which model delivers the best results. The chosen model, or the ensembled result of all the models, will later be used by the channel to predict the overall winner of the election and the seats covered by a particular party.
Problem Questions:
1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it.
Sample of the dataset:
The dataset contains information such as the party choice of the voter, the voter’s age, an assessment of the current national economic condition, the household economic condition, an assessment of the leader of each party, attitude towards European integration, the voter’s political knowledge, and gender. The variable ‘vote’ is the target variable based on the problem statement.
Data Dictionary:
Below is the data dictionary to understand the terminologies used across the dataset and the analysis.
Data Description
The given data provides a good amount of information regarding the voters. It has the age and gender of the voters along with the vote choice of each – whether they have voted for the ‘Labour’ or the ‘Conservative’ party. It also provides some ordinal information regarding the current national economic conditions and the current household economic conditions of the voters – rated 1 to 5, where 5 is the best and 1 is the worst.
It also provides their ratings of the Labour and Conservative party leaders – rated the same way as the above variables, with 5 being the best and 1 the worst. In addition to these, we have an 11-point scale that measures voters' attitudes toward European integration, where high scores represent ‘Eurosceptic’ sentiment. We also have a rating of the voter’s political knowledge – specifically of each party's stance on European integration – rated 0 to 3, where 0 represents no knowledge of the subject and 3 represents very good knowledge of it.
Data checks:
Additional data checks have been made to confirm the following aspects.
Duplicate records
There are 8 duplicate records found in the data. Below is the list of duplicate records.
Table 4 Duplicated Records
The duplicates are not deleted in this case, on the assumption that two voters of the same age can genuinely have a similar economic background and similar political standpoints.
Summary Stats:
Based on the summary statistics table shown below, we can see that the values of the fields are on different scales. For example, the field 'Europe' has values ranging between 1 and 11, while the values under the field ‘political.knowledge’ range between 0 and 3.
It is also evident that, as already mentioned in the data description, the columns economic.cond.national, economic.cond.household, Blair and Hague are rated on the same 1 to 5 scale.
The rule of thumb seems to be: If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
If the skewness is between -1 and – 0.5 or between 0.5 and 1, the data are moderately skewed. If the
skewness is less than -1 or greater than 1, the data are highly skewed.
For the given dataset we have the skewness for each variable as follows:
Column  Skewness
age  0.144621
economic.cond.national  -0.240453
economic.cond.household  -0.149552
Blair  -0.535419
Hague  0.1521
Europe  -0.135947
political.knowledge  -0.426838
Table 7 Problem 1: Skewness
The skewness for the variable ‘age’ is positive but lies in the range of -0.5 and 0.5. So, the
distribution is fairly symmetrical.
The skewness for the variable ‘economic.cond.national’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
The skewness for the variable ‘economic.cond.household’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
The skewness for the variable ‘Blair’ is negative but lies in the range of -1 and – 0.5. So, the
distribution is moderately left skewed.
The skewness for the variable ‘Hague’ is positive but lies in the range of -0.5 and 0.5. So, the
distribution is fairly symmetrical.
The skewness for the variable ‘Europe’ is negative but lies in the range of -0.5 and 0.5. So, the
distribution is fairly symmetrical.
The skewness for the variable ‘political.knowledge’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
As mentioned earlier as part of the data description and data checks, there are no missing/null values in any of the columns. Below is the result-set table of the missing-values check for each column of the dataset.
Based on the initial analysis, we can see that this is a sample dataset with just 1525 rows across
9 fields.
All of the fields are of integer datatype except vote and gender which are of object type. All of
the columns are non-null.
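The checks above can be reproduced with a few pandas calls. Below is a minimal sketch; the file name Election_Data.xlsx and the DataFrame name df are assumptions, as the report does not state them.

import pandas as pd

# Assumed source file; the actual file name may differ.
df = pd.read_excel("Election_Data.xlsx")

print(df.shape)                      # number of rows and columns
df.info()                            # data types and non-null counts
print(df.describe().T)               # summary statistics for the numeric fields
print(df.isnull().sum())             # missing/null values per column
print(df.duplicated().sum())         # count of duplicated records
print(df.skew(numeric_only=True))    # skewness of the numeric columns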
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Univariate Analysis:
For the 7 continuous variables as part of the dataset, univariate analysis is done which includes the
description of each variable, distribution plot, Skewness measure and Box plot of the same. We also have
2 categorical variables for which the analysis includes the value counts and respective bar plots.
Age
Inference:
The skewness for the variable ‘age’ is positive but lies in the range of -0.5 and 0.5. So, the distribution is
fairly symmetrical.
Based on the boxplot, there are no outliers in the ‘age’ variable. We can see that most of the sample voters are between the ages of 41 and 67, which are the 25th and 75th percentiles as shown in the variable description.
economic.cond.national
Inference:
The skewness for the variable ‘economic.cond.national’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
Based on the boxplot, there are a few outliers for the variable towards the lower end of the distribution.
economic.cond.household
Inference:
The skewness for the variable ‘economic.cond.household’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
Based on the boxplot, there are a few outliers for the variable towards the lower end of the distribution.
Blair
Inference:
The skewness for the variable ‘Blair’ is negative but lies in the range of -1 and – 0.5. So, the distribution is
moderately left skewed.
Hague
Inference:
The skewness for the variable ‘Hague’ is positive but lies in the range of -0.5 and 0.5. So, the distribution
is fairly symmetrical.
Europe
Inference:
The skewness for the variable ‘Europe’ is negative but lies in the range of -0.5 and 0.5. So, the
distribution is fairly symmetrical.
political.knowledge
Inference:
The skewness for the variable ‘political.knowledge’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
Vote
Inference:
The majority of the voters support the ‘Labour’ party, which accounts for 1,063 out of the 1,525 total voters.
Gender
Bivariate Analysis:
The same way for the 7 continuous variables as part of the dataset, Bivariate analysis is done which
includes the Correlation Heatmap and Pair plot for the entire set.
Correlation Heatmap:
A correlation heatmap is a graphical representation of a correlation matrix representing the correlation
between different variables. The value of correlation can take any value from -1 to 1. Correlation
between two random variables or bivariate data does not necessarily imply a causal relationship.
Here we can see how the various factors of the dataset are correlated with each other. Some of the main inferences are as follows:
The variables have very little correlation among each other. The only variables that show some correlation are economic.cond.national with economic.cond.household and Blair.
It is also noticeable that the ‘Hague’ field shows a similar level of correlation with the variable ‘Europe’.
There are a considerable number of negatively correlated variable pairs in the dataset, based on the correlation heatmap below.
Figure 10 Correlation Heatmap
Pairplot
Figure 11 Pairplot
A pair plot is used to understand the best set of features to explain the relationship between two variables or to form the most separated clusters. It also helps in building simple classification models by drawing simple lines or making linear separations in our dataset.
Based on the pairplot above, we can draw the following inference:
There is no considerable linear correlation among any of the variables in the given dataset.
Outlier Proportions
Based on the univariate analysis, we see that some of the variables contain a considerable proportion of outliers. The proportion of outliers relative to the amount of data available is shown below.
We can see that there is a good number of outliers in the ‘economic.cond.national’ and ‘economic.cond.household’ variables, while the other variables have no outliers.
Outliers Treatment:
We have treated the outliers in the data by capping the outlier values at the upper and lower bound values, since the columns hold continuous data. The series of boxplots below shows the fields after the outliers have been treated.
We can now see in the boxplots that there are no outliers remaining in the above-mentioned variables.
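A minimal sketch of the capping approach described above, continuing with the DataFrame df from the earlier sketch; the column names economic.cond.national and economic.cond.household are the ones identified as outlier-bearing in the univariate analysis.

def cap_outliers(series):
    # Clip values outside the IQR-based whiskers to the whisker limits.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series.clip(lower=lower, upper=upper)

for col in ["economic.cond.national", "economic.cond.household"]:
    df[col] = cap_outliers(df[col])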
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
Data Encoding:
Most models require all of the input and output variables to be numeric, and hence it becomes very important to encode the categorical data into numbers before a model can be fitted and evaluated.
Here we are encoding the categorical variables ‘vote’ and ‘gender’ using label/ordinal encoding, since each of them has only 2 categories. This type of encoding is normally used when the variables are ordinal; ordinal encoding converts each label into an integer value, and the encoded data represents the sequence of labels.
Vote:
The data under ‘vote’ are ‘Labour’ and ‘Conservative’ which are labelled to be the integer values 0 and 1
respectively. We encode the string values into integer values for further modelling requirements.
Vote Codes
Labour 0
Conservative 1
Table 11 Ordinal encoding - vote – Codes
Gender:
The data under ‘gender’ are ‘female’ and ‘male’ which are labelled to be the integer values 0 and 1
respectively. We encode the string values into integer values for further modelling requirements.
Gender Codes
female 0
male 1
Table 12 Ordinal encoding - gender – Codes
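A minimal sketch of the label encoding described above, using simple value maps on the same DataFrame df; the integer codes follow the two tables above.

# Map the two-category string columns to the integer codes listed above.
df["vote"] = df["vote"].map({"Labour": 0, "Conservative": 1})
df["gender"] = df["gender"].map({"female": 0, "male": 1})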
Scaling:
Feature scaling (also known as data normalization) is the method used to standardize the range of
features of data. Since, the range of values of data may vary widely, it becomes a necessary step in data
preprocessing while using machine learning algorithms.
Zscore – Scaling:
Z-score normalization refers to the process of normalizing every value in a dataset such that the mean of
all of the values is 0 and the standard deviation is 1. We use the following formula to perform a z-score
normalization on every value in a dataset:
New value = (x – μ) / σ
Sample of dataset after Scaling:
The sample of the dataset after applying Z-score scaling is shown below. We can see that the values of all
the variables are of the same scale now. Note that we have scaled only the set of independent variables
here since there are only 2 values in the target variable.
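A minimal sketch of the z-score scaling, applied only to the independent variables as noted above; StandardScaler implements the (x − μ) / σ transformation. The variable names are assumptions for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler

predictors = df.drop(columns=["vote"])        # independent variables only
scaler = StandardScaler()
scaled = scaler.fit_transform(predictors)     # z-score every column
df_scaled = pd.DataFrame(scaled, columns=predictors.columns, index=df.index)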
We are creating the dataset X to have all the predictor variables and the dataset y to have the
dependent/target variable.
Below is the sample of how the dataset with predictor variables look after the split.
Below is the information on the predictor variables dataset ‘X’ which shows that we have 8 independent
variables as part of it.
Table 15 Predictor variables dataset - X - info
Below is a sample of how the dataset with the target variable looks after the split.
The data is split 70:30, so the majority of the rows in both the X and y datasets go to the training sets, while a considerable amount is held out as test sets. The data in the training sets will be used for training the machine learning algorithms, while the test sets will be used to check the performance of these models.
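A minimal sketch of the 70:30 split described above; random_state and stratify are assumptions made for reproducibility and class balance, since the report does not state them.

from sklearn.model_selection import train_test_split

X = df_scaled          # predictor variables (scaled)
y = df["vote"]         # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)   # expected: (1067, 8) and (458, 8)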
Sample data:
Based on the sample, we can see that a random set of records from the independent variables’ dataset
‘X’ are added here.
Table 17 Independent variables - Training dataset - Sample
Data information:
Below is the information regarding the above dataset. It shows that we have 1067 rows of data loaded
into this dataset and all of them are independent variables.
Sample:
Based on the sample, we can see that a random set of records from the independent variables’ dataset
‘X’ are added here.
Data information:
Below is the information regarding the above dataset. It shows that we have 458 rows of data loaded
into this dataset and all of them are independent variables.
Table 20 Independent variables - Testing dataset - Info
Sample:
Based on the sample, we can see that a random set of records from the target variable’s dataset ‘y’ are
added here.
Data information:
Below is the information regarding the above dataset. It shows that we have 1067 rows of data loaded
into this dataset and it is just the target variable.
Sample:
Based on the sample, we can see that a random set of records from the target variable’s dataset ‘y’ are
added here.
Table 23 Target variable - Testing dataset - Sample
Data information:
Below is the information regarding the above dataset. It shows that we have 458 rows of data loaded
into this dataset and it is just the target variable.
1.4 Apply Logistic Regression and LDA (linear discriminant analysis)
Logistic Regression:
Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on
a given dataset of independent variables. Since the outcome is a probability, the dependent variable is
bounded between 0 and 1.
After fitting the data to the Logistic Regression Model, we will be retrieving the predictions on training
and test datasets.
Predicted Classes:
Below are the predicted classes rendered by the model.
Parameters:
Here we are running with the following parameters for logistic regression.
Penalty – ‘l2’
o Penalized logistic regression imposes a penalty to the logistic model for having too many
variables. This results in shrinking the coefficients of the less contributive variables
toward zero. This is also known as regularization.
o Out of the l1 and l2 which are suitable penalties for liblinear solver, we have chosen l2
randomly here
Solver – ‘liblinear’
o Algorithm to use in the optimization problem. Default is ‘lbfgs’. To choose a solver, you
might want to consider the following aspects:
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster
for large ones;
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle
multinomial loss;
‘liblinear’ is limited to one-versus-rest schemes.
Tol – 0.0001
o Tolerance for stopping criteria.
Verbose – 1
o For the liblinear and lbfgs solvers it is good to set verbose to any positive number for
verbosity.
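A minimal sketch of fitting the logistic regression with the parameters listed above, using the train/test split from the earlier sketch.

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(penalty="l2", solver="liblinear",
                             tol=0.0001, verbose=1)
log_reg.fit(X_train, y_train)

train_pred = log_reg.predict(X_train)
test_pred = log_reg.predict(X_test)
print(log_reg.score(X_train, y_train), log_reg.score(X_test, y_test))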
Based on these parameters, the model is built, and the predicted classes rendered by the model are as
follows:
Accuracy:
Accuracy of the best model on the training and test data:
Logistic Regression Model – Training: 0.83, Test: 0.86
Table 27 Accuracy - Best model
Validity:
Based on the training and test accuracy results, the model does not appear to be over- or under-fitted.
Linear Discriminant Analysis:
Linear Discriminant Analysis is used as a dimensionality reduction technique in machine learning, using which we can easily project a 2-D or 3-D feature space onto a 1-dimensional axis.
Let's consider an example where we have two classes in a 2-D plane with an X-Y axis, and we need to classify them efficiently. LDA enables us to draw a straight line that can completely separate the two classes of data points. Here, LDA uses the X-Y axes to create a new axis, separating the classes with a straight line and projecting the data onto that new axis. Hence, we can maximize the separation between these classes and reduce the 2-D plane to 1-D.
To create the new axis, Linear Discriminant Analysis uses the following criteria:
1. Maximize the distance between the means of the two classes.
2. Minimize the variation (scatter) within each class.
Using the above two conditions, LDA generates a new axis in such a way that it can maximize the
distance between the means of the two classes and minimizes the variation within each class. In other
words, we can say that the new axis will increase the separation between the data points of the two
classes and plot them onto the new axis.
Predicted Classes:
Below are the predicted classes rendered by the model.
Table 28 Predicted Class – LDA
Parameters:
We have implemented the above LDA model with the default cut-off which is set to 0.5.
Accuracy:
Accuracy of the model on the train and test data:
Default cutoff LDA Model – Train: 0.82, Test: 0.86
Table 29 Accuracy – Initial LDA model
Validity:
Based on the accuracy on the train and test sets, we can confirm that there is no under- or over-fitting in the model.
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
Fitting the Models:
As part of this phase, we will be fitting the data in KNN and Naïve Bayes models.
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase; it keeps all of the training data and uses it at classification time.
Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn’t
assume anything about the underlying data.
Figure 14 KNN model
Predicted Classes:
Below are the predicted classes rendered by the model.
Parameters:
We have implemented the above KNN model with the default k value which is set to 5.
Accuracy:
Accuracy of the model on the train and test data:
Initial KNN Model – Train: 0.87, Test: 0.81
Table 31 Accuracy – Initial KNN model
Validity:
Based on the accuracy on the train and test sets, the model shows signs of mild overfitting, since the training accuracy is noticeably higher than the test accuracy.
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis of
color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each
feature individually contributes to identifying it as an apple without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) × P(A) / P(B)
where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability of the hypothesis, and P(B) is the marginal probability of the evidence.
Predicted Classes:
Below are the predicted classes rendered by the model.
Parameters:
We have implemented the above Naïve Bayes model with the default parameter var_smoothing which is
set to 1e-09.
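A minimal sketch of the Gaussian Naïve Bayes model with the default var_smoothing, continuing with the same train/test split.

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB(var_smoothing=1e-09)   # the default value, stated explicitly
nb.fit(X_train, y_train)
print(nb.score(X_train, y_train), nb.score(X_test, y_test))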
Accuracy:
Accuracy of the model on the train and test data:
Default Naïve Bayes Model – Train: 0.83, Test: 0.83
Table 33 Accuracy – Initial Naïve Bayes model
Validity:
Based on the accuracy on the train and test sets, we can confirm that there is no under-fitting or over-fitting in the model.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting
GridSearchCV:
GridSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”,
“predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in
the estimator used. The parameters of the estimator used to apply these methods are optimized by
cross-validated grid-search over a parameter grid.
Here we are running the grid search with the following parameters for logistic regression.
Penalty – 'l1', 'l2' and 'none'
Solver – 'liblinear','sag' and 'lbfgs'
Tol – 0.0001 and 0.00001
The best parameters given by the grid search are as follows:
{'penalty': 'none', 'solver': 'sag', 'tol': 0.0001}
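A minimal sketch of the grid search over the parameters listed above; cv=5, the scoring metric and max_iter are assumptions, since the report does not state them. Some penalty/solver combinations are incompatible and are simply recorded as failed fits by GridSearchCV, and newer scikit-learn versions spell the 'none' penalty as None.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    "penalty": ["l1", "l2", "none"],
    "solver": ["liblinear", "sag", "lbfgs"],
    "tol": [0.0001, 0.00001],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_)     # e.g. {'penalty': 'none', 'solver': 'sag', 'tol': 0.0001}
best_lr = grid.best_estimator_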
Based on these parameters, the best model is built, and the accuracy of the model is as follows:
Based on the custom cut-offs selected, the best model is built, and the accuracy of the model is as follows:
Misclassification error (MCE) = 1 − test accuracy score. The MCE is calculated for each KNN model with the number of neighbours k = 1, 3, 5, ..., 19, and the model with the lowest MCE is selected, as sketched below.
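A minimal sketch of the MCE search over odd values of k described above, reusing the same train/test split.

from sklearn.neighbors import KNeighborsClassifier

mce = {}
for k in range(1, 20, 2):                       # k = 1, 3, 5, ..., 19
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    mce[k] = 1 - model.score(X_test, y_test)    # misclassification error

best_k = min(mce, key=mce.get)                  # k with the lowest MCE
print(best_k, mce[best_k])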
We can see that for the k value 13, the MCE seems to be the minimum and so we take the k-value to be
13 for the tuned model.
Based on the k value selected, the best model is built, and the accuracy of the model is as follows:
GridSearchCV:
GridSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”,
“predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in
the estimator used. The parameters of the estimator used to apply these methods are optimized by
cross-validated grid-search over a parameter grid.
Here we are running the grid search with the following parameters for Naïve Bayes (Gaussian) model.
{'var_smoothing': np.logspace(0, -9, num=100)} – var_smoothing is the portion of the largest variance of all features that is added to the variances for calculation stability.
The best parameters given by the grid search are as follows:
{'var_smoothing': 0.01}
Based on the parameters above, the best model is built, and the accuracy of the model is as follows:
GridSearchCV:
GridSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”,
“predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in
the estimator used. The parameters of the estimator used to apply these methods are optimized by
cross-validated grid-search over a parameter grid.
Here we are running the grid search with the following parameters for Random Forest.
Parameter grid:
Below are the parameters added to the parameter grid for the grid search CV.
'n_estimators': [300,400,500]
Best Parameters:
As a result of the grid search, below are the best parameters rendered.
(max_depth=5, max_features=6, min_samples_leaf=2,
min_samples_split=25, n_estimators=300, random_state=1)
Bagging:
Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly used
to reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is
selected with replacement—meaning that the individual data points can be chosen more than once.
Here a bagging model is created with Random Forest model as the base estimator.
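A minimal sketch of bagging with the tuned Random Forest as the base estimator; the Random Forest parameters are the best parameters reported above, while the number of bagging estimators is an assumption. Older scikit-learn versions use the keyword base_estimator instead of estimator.

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

rf = RandomForestClassifier(max_depth=5, max_features=6, min_samples_leaf=2,
                            min_samples_split=25, n_estimators=300,
                            random_state=1)
bagging = BaggingClassifier(estimator=rf, n_estimators=10, random_state=1)
bagging.fit(X_train, y_train)
print(bagging.score(X_train, y_train), bagging.score(X_test, y_test))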
Predicted Classes:
Below are the predicted classes rendered by the model.
Accuracy:
Below is the accuracy for training and testing dataset rendered by the model.
Boosting:
Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to
minimize training errors. In boosting, a random sample of data is selected, fitted with a model and then
trained sequentially—that is, each model tries to compensate for the weaknesses of its predecessor.
With each iteration, the weak rules from each individual classifier are combined to form one, strong
prediction rule.
In bagging, weak learners are trained in parallel, but in boosting, they learn sequentially. This means that
a series of models are constructed and with each new model iteration, the weights of the misclassified
data in the previous model are increased. This redistribution of weights helps the algorithm identify the
parameters that it needs to focus on to improve its performance. AdaBoost, which stands for “adaptive
boosting,” is one of the most popular boosting algorithms as it was one of the first of its kind.
Other types of boosting algorithms include XGBoost, GradientBoost, and BrownBoost.
AdaBoost
This method operates iteratively, identifying misclassified data points and adjusting their weights to
minimize the training error. The model continues to optimize in a sequential fashion until it yields the
strongest predictor.
Parameters:
Here, we are running the algorithm with the below parameters:
(n_estimators=100, random_state=1)
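A minimal sketch of the AdaBoost model with the parameters listed above.

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X_train, y_train)
print(ada.score(X_train, y_train), ada.score(X_test, y_test))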
Predicted Classes:
Below are the predicted classes rendered by the model.
Accuracy:
Below is the accuracy for training and testing dataset rendered by the model.
Gradient boosting:
The gradient boosting trains on the residual errors of the previous predictor. The name, gradient
boosting, is used since it combines the gradient descent algorithm and boosting method.
Parameters:
Here, we are running the algorithm with the below parameters:
(n_estimators=100, random_state=1)
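A minimal sketch of the gradient boosting model with the same parameters.

from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=100, random_state=1)
gbm.fit(X_train, y_train)
print(gbm.score(X_train, y_train), gbm.score(X_test, y_test))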
Predicted Classes:
Below are the predicted classes rendered by the model.
Accuracy:
Below is the accuracy for training and testing dataset rendered by the model.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
Performance Metrics:
Performance metrics are a part of every machine learning pipeline. They tell you if you're making
progress and put a number on it. All machine learning models, whether it's linear regression, or a SOTA
technique like BERT, need a metric to judge performance. Some of the performance metrics that we are
going to use here are
Accuracy
Confusion Matrix
ROC curve
ROC AUC score
Accuracy:
Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations. One may think that if we have high accuracy then our model is the best. Accuracy is a great measure, but only when you have symmetric datasets where the counts of false positives and false negatives are almost the same. Therefore, you have to look at other parameters to evaluate the performance of your model. For our model, we have obtained 0.803, which means the model is approximately 80% accurate.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Confusion Matrix:
A confusion matrix is a table that is used to define the performance of a classification algorithm. A
confusion matrix visualizes and summarizes the performance of a classification algorithm.
ROC Curve:
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds. This curve plots two parameters:
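A minimal sketch of computing these metrics for any fitted classifier; it is shown here for the logistic regression from the earlier sketch, and the same calls apply to every model in this section.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_curve, roc_auc_score)
import matplotlib.pyplot as plt

probs = log_reg.predict_proba(X_test)[:, 1]   # probability of class 1
preds = log_reg.predict(X_test)

print(accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
print(roc_auc_score(y_test, probs))

fpr, tpr, _ = roc_curve(y_test, probs)        # points for the ROC curve
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()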
Now that we have an idea of what these performance metrics are, let us check how our models have
performed.
Initial Logistic Regression Model – Training: 0.83, Test: 0.86
Table 44 Accuracy - Initial model
Figure 18 AUC and ROC for the training data- Initial model
Best Logistic Regression Model – Training: 0.83, Test: 0.86
Table 47 Accuracy - Best model
Initial LDA Model – Training: 0.82, Test: 0.86
Table 50 Accuracy - Initial model
Figure 26 AUC and ROC for training and testing data- Initial model
Custom cutoff LDA Model – Train: 0.81, Test: 0.85
Table 52 Accuracy - Custom Cut-Off
Classification Reports of the custom cut-off train and test data are as follows:
o In the classification report of the training data, we can see that the precision values for 0
and 1 are 0.90 and 0.64 respectively and recall values are 0.81 and 0.80 respectively.
o In the classification report of the testing data, we can see that the precision values for 0
and 1 are 0.93 and 0.71 respectively and recall values are 0.85 and 0.85 respectively.
Table 53 Classification report - Custom cutoff test data – LDA
Initial KNN Model – Training: 0.87, Test: 0.81
Table 54 Accuracy - Initial model
Figure 28 AUC and ROC for the training data- Initial model
Best KNN Model – Training: 0.85, Test: 0.83
Table 57 Accuracy - Best model
Initial Naïve Bayes Model – Training: 0.83, Test: 0.83
Table 60 Accuracy - Initial model
Figure 36 AUC and ROC for the training data- Initial model
Best Naïve Bayes Model – Training: 0.83, Test: 0.82
Table 63 Accuracy - Best model
Figure 40 AUC and ROC for the training data- best model
Bagging with Random Forest Model – Training: 0.86, Test: 0.83
Table 66 Accuracy - Bagging
Boosting(AdaBoost):
Accuracy of the best model on the training and test data:
Adaptive Boosting Model – Training: 0.85, Test: 0.82
Table 69 Accuracy - AdaBoost
AUC and ROC for the training data:
The AUC value is calculated to be 0.913, and the ROC curve for the training data is as follows:
Boosting(Gradient Boosting):
Accuracy of the best model on the training and test data:
Gradient Boosting Model – Training: 0.88, Test: 0.83
Table 72 Accuracy - Gradient Boosting
Figure 52 AUC and ROC for the training data- Gradient Boosting
Models Comparison:
Let us compare the models based on the performance metrics rendered by each model for the training
and test datasets.
Based on the comparison above, we can conclude that the most stable algorithm is the Naïve Bayes model, since it has the smallest difference in prediction accuracy between the training and test datasets. The most unstable ones appear to be the KNN model before fine-tuning and the Gradient Boosting model.
Final Model:
The best or most optimized model based on the accuracy of the training and test datasets is found to be
the Naïve Bayes Model. The model has delivered the same kind of accuracy in both the sets.
We can confirm the same based on the classification report for training and testing data for the fine-
tuned Naïve Bayes models below.
In the classification report for the training data, the precision values for 0 and 1 are 0.88 and 0.73 respectively, and the recall values are 0.88 and 0.73 respectively.
In the classification report for the testing data, the precision values for 0 and 1 are 0.89 and 0.67 respectively, and the recall values are 0.86 and 0.74 respectively.
1.8 Based on these predictions, what are the insights?
The CNBE channel can use the Naïve Bayes model to make their poll predictions, since it provides the most stable prediction across the training and test sets.
The data already has largely uncorrelated variables, which makes the Naïve Bayes algorithm’s assumption that the variables are independent reasonable.
It is also possible to use ensemble learning over multiple models and take the most common prediction across the models as the result. This could be an option if the organization cannot settle on a single best model to predict the poll results.
Problem 2:
Executive Summary:
In this project, we are going to work on the inaugural corpora from nltk in Python. We will be looking at the following inaugural speeches of Presidents of the United States of America: President Franklin D. Roosevelt in 1941, President John F. Kennedy in 1961, and President Richard Nixon in 1973.
Introduction:
The purpose of this exercise is to utilize various Text mining techniques to understand the information
and emotions involved in the speeches given. We will be using the Natural Language ToolKit to mine the
raw data and derive sensible information from the same.
Problem Questions:
2.1 Find the number of characters, words, and sentences for the mentioned
documents.
Below are the total numbers of characters, words, and sentences in each document.
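A minimal sketch of computing these counts with NLTK; the three file identifiers are assumptions for the 1941, 1961 and 1973 inaugural addresses discussed in this problem.

import nltk
from nltk.corpus import inaugural
nltk.download("inaugural")

# Assumed file identifiers for the three speeches.
speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]

for fid in speeches:
    n_chars = len(inaugural.raw(fid))    # number of characters
    n_words = len(inaugural.words(fid))  # number of word tokens
    n_sents = len(inaugural.sents(fid))  # number of sentences
    print(fid, n_chars, n_words, n_sents)

2.2 Remove all the stopwords from all three speeches.
After removing the standard English stopwords (and applying stemming), the speeches read as shown below.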
vice presid johnson mr speaker mr chief justic presid eisenhow vice presid
nixon presid truman reverend clergi fellow citizen observ today victori
parti celebr freedom symbol end well begin signifi renew well chang sworn
almighti god solemn oath forebear l prescrib nearli centuri three quarter
ago world differ man hold mortal hand power abolish form human poverti
form human life yet revolutionari belief forebear fought still issu around
globe belief right man come generos state hand god dare forget today heir
first revolut word go forth time place friend foe alik torch pass new
gener american born centuri temper war disciplin hard bitter peac proud
ancient heritag unwil wit permit slow undo human right nation alway commit
commit today home around world everi nation whether wish well ill pay
price bear burden meet hardship support friend oppos foe order assur
surviv success liberti much pledg old alli whose cultur spiritu origin
share pledg loyalti faith friend unit littl cannot host cooper ventur
divid littl dare meet power challeng odd split asund new state welcom rank
free pledg word one form coloni control pass away mere replac far iron
tyranni alway expect find support view alway hope find strongli support
freedom rememb past foolishli sought power ride back tiger end insid peopl
hut villag across globe struggl break bond mass miseri pledg best effort
help help whatev period requir communist may seek vote right free societi
cannot help mani poor cannot save rich sister republ south border offer
special pledg convert good word good deed new allianc progress assist free
men free govern cast chain poverti peac revolut hope cannot becom prey
hostil power neighbor join oppos aggress subvers anywher america everi
power hemispher intend remain master hous world assembl sovereign state
unit nation last best hope age instrument war far outpac instrument peac
renew pledg support prevent becom mere forum invect strengthen shield new
weak enlarg area writ may run final nation would make adversari offer
pledg request side begin anew quest peac dark power destruct unleash
scienc engulf human plan accident self destruct dare tempt weak arm
suffici beyond doubt certain beyond doubt never employ neither two great
power group nation take comfort present cours side overburden cost modern
weapon rightli alarm steadi spread deadli atom yet race alter uncertain
balanc terror stay hand mankind final war begin anew rememb side civil
sign weak sincer alway subject proof never negoti fear never fear negoti
side explor problem unit instead belabor problem divid side first time
formul seriou precis propos inspect control arm bring absolut power
destroy nation absolut control nation side seek invok wonder scienc
instead terror togeth explor star conquer desert erad diseas tap ocean
depth encourag art commerc side unit heed corner earth command isaiah undo
heavi burden ... oppress go free ." beachhead cooper may push back jungl
suspicion side join creat new endeavor new balanc power new world law
strong weak secur peac preserv finish first 100 day finish first 1 000 day
life administr even perhap lifetim planet begin hand fellow citizen mine
rest final success failur cours sinc countri found gener american summon
give testimoni nation loyalti grave young american answer call servic
surround globe trumpet summon call bear arm though arm need call battl
though embattl call bear burden long twilight struggl year year rejoic
hope patient tribul struggl common enemi man tyranni poverti diseas war
forg enemi grand global allianc north south east west assur fruit life
mankind join histor effort long histori world gener grant role defend
freedom hour maximum danger shrink respons welcom believ would exchang
place peopl gener energi faith devot bring endeavor light countri serv
glow fire truli light world fellow american ask countri ask countri fellow
citizen world ask america togeth freedom man final whether citizen america
citizen world ask high standard strength sacrific ask good conscienc sure
reward histori final judg deed go forth lead land love ask bless help know
earth god work must truli
mr vice presid mr speaker mr chief justic senat cook mr eisenhow fellow
citizen great good countri share togeth met four year ago america bleak
spirit depress prospect seemingli endless war abroad destruct conflict
home meet today stand threshold new era peac world central question use
peac resolv era enter postwar period often time retreat isol lead stagnat
home invit new danger abroad resolv becom time great respons greatli born
renew spirit promis america enter third centuri nation past year saw far
reach result new polici peac continu revit tradit friendship mission peke
moscow abl establish base new durabl pattern relationship among nation
world america bold initi 1972 long rememb year greatest progress sinc end
world war ii toward last peac world peac seek world flimsi peac mere
interlud war peac endur gener come import understand necess limit america
role maintain peac unless america work preserv peac peac unless america
work preserv freedom freedom clearli understand new natur america role
result new polici adopt past four year respect treati commit support vigor
principl countri right impos rule anoth forc continu era negoti work limit
nuclear arm reduc danger confront great power share defend peac freedom
world expect other share time pass america make everi nation conflict make
everi nation futur respons presum tell peopl nation manag affair respect
right nation determin futur also recogn respons nation secur futur america
role indispens preserv world peac nation role indispens preserv peac
togeth rest world resolv move forward begin made continu bring wall hostil
divid world long build place bridg understand despit profound differ
system govern peopl world friend build structur peac world weak safe
strong respect right live differ system would influenc other strength idea
forc arm accept high respons burden gladli gladli chanc build peac noblest
endeavor nation engag gladli also act greatli meet respons abroad remain
great nation remain great nation act greatli meet challeng home chanc
today ever histori make life better america ensur better educ better
health better hous better transport cleaner environ restor respect law
make commun livabl insur god given right everi american full equal
opportun rang need great reach opportun great bold determin meet need new
way build structur peac abroad requir turn away old polici fail build new
era progress home requir turn away old polici fail abroad shift old polici
new retreat respons better way peac home shift old polici new retreat
respons better way progress abroad home key new respons lie place divis
respons live long consequ attempt gather power respons washington abroad
home time come turn away condescend polici patern washington know best ."
person expect act respons respons human natur encourag individu home
nation abroad decid locat respons place measur other today offer promis
pure government solut everi problem live long fals promis trust much
govern ask deliv lead inflat expect reduc individu effort disappoint
frustrat erod confid govern peopl govern must learn take less peopl peopl
rememb america built govern peopl welfar work shirk respons seek respons
live ask govern challeng face togeth ask govern help help nation govern
great vital role play pledg govern act act boldli lead boldli import role
everi one must play individu member commun day forward make solemn commit
heart bear respons part live ideal togeth see dawn new age progress
america togeth celebr 200th anniversari nation proud fulfil promis world
america longest difficult war come end learn debat differ civil decenc
reach one preciou qualiti govern cannot provid new level respect right
feel one anoth new level respect individu human digniti cherish birthright
everi american els time come renew faith america recent year faith
challeng children taught asham countri asham parent asham america record
home role world everi turn beset find everyth wrong america littl right
confid judgment histori remark time privileg live america record centuri
unparallel world histori respons generos creativ progress proud system
produc provid freedom abund wide share system histori world proud four war
engag centuri includ one bring end fought selfish advantag help other
resist aggress proud bold new initi steadfast peac honor made break toward
creat world world known structur peac last mere time gener come embark
today era present challeng great nation gener ever face answer god histori
conscienc way use year stand place hallow histori think other stood think
dream america think recogn need help far beyond order make dream come true
today ask prayer year ahead may god help make decis right america pray
help togeth may worthi challeng pledg togeth make next four year best four
year america histori 200th birthday america young vital began bright
beacon hope world go forward confid hope strong faith one anoth sustain
faith god creat strive alway serv purpos
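The stopword removal and stemming shown above can be reproduced with a short sketch; the report does not state which stemmer was used, so the Porter stemmer is an assumption consistent with the stemmed forms above. This continues from the earlier sketch where speeches is defined.

import string
import nltk
from nltk.corpus import inaugural, stopwords
from nltk.stem import PorterStemmer
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_speech(fid):
    # Lower-case, drop stopwords and punctuation tokens, then stem.
    tokens = [w.lower() for w in inaugural.words(fid)
              if w.lower() not in stop_words and w not in string.punctuation]
    return [stemmer.stem(w) for w in tokens]

cleaned = {fid: clean_speech(fid) for fid in speeches}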
2.3 Which word occurs the most number of times in his inaugural address for
each president? Mention the top three words. (after removing the stopwords)
Top 3 frequent words for [Link] are 'america', 'peac' and 'world'
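A minimal sketch of the frequency count used to find the top three words in each cleaned speech, continuing from the sketch above.

from collections import Counter

for fid, tokens in cleaned.items():
    top3 = Counter(tokens).most_common(3)   # three most frequent stemmed words
    print(fid, top3)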
2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords) – 3 Marks [ refer to the End-to-End Case Study done
in the Mentored Learning Session ]
[Word clouds for each of the three speeches are shown here.]
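A minimal sketch of generating the word clouds, assuming the third-party wordcloud package is installed and continuing from the cleaned tokens above.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

for fid, tokens in cleaned.items():
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate(" ".join(tokens))            # build the cloud from the cleaned text
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(fid)
plt.show()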