Machine Learning Project Report 1
DSBA
Kaarthikeyan Senthilmaran
PGP-DSBA Online Feb 22
Date: 25/09/2022
Contents
Table of Figures
Table of Tables
Problem 1:
Executive Summary:
Introduction:
Problem Questions:
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?
Problem 2:
Executive Summary:
Introduction:
Problem Questions:
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords) – 3 Marks [ refer to the End-to-End Case Study done in the Mentored Learning Session ]
Table of Figures
Figure 1 Univariate Analysis – Age.............................................................................................................11
Figure 2 Univariate Analysis – economic.cond.national..................................................................12
Figure 3 Univariate Analysis – economic.cond.household..............................................................13
Figure 4 Univariate Analysis – Blair............................................................................................................14
Figure 5 Univariate Analysis – Hague.........................................................................................................15
Figure 6 Univariate Analysis – Europe........................................................................................................16
Figure 7 Univariate Analysis – political.knowledge........................................................................17
Figure 8 Univariate Analysis – vote............................................................................................................18
Figure 9 Univariate Analysis – gender........................................................................................................18
Figure 10 Correlation Heatmap.................................................................................................................20
Figure 11 Pairplot......................................................................................................................................21
Figure 12 Box plots after outlier Treatment...............................................................................................22
Figure 13 Linear Discriminant Analysis......................................................................................................30
Figure 14 KNN model.................................................................................................................................31
Figure 15 Custom Cut-Off selection - Accuracy, F1 score and Confusion Matrix........................................34
Figure 16 Misclassification Error across No. of Neighbors k......................................................................35
Figure 17 Confusion Matrix.......................................................................................................................40
Figure 18 AUC and ROC for the training data- Initial model.......................................................................41
Figure 19 AUC and ROC for testing data- Initial model..............................................................................41
Figure 20 Confusion Matrix - Training data- Initial model..........................................................................41
Figure 21 Confusion matrix - Test data- Initial model................................................................................42
Figure 22 AUC and ROC for the training data- Best model.........................................................................43
Figure 23 AUC and ROC for testing data- Best model................................................................................43
Figure 24 Confusion Matrix - Training data- Best model............................................................................44
Figure 25 Confusion matrix - Test data- Best model..................................................................................44
Figure 26 AUC and ROC for training and testing data- Initial model..........................................................45
Figure 27 Confusion Matrix - Training and Testing data- Initial model.......................................................46
Figure 28 AUC and ROC for the training data- Initial model.......................................................................48
Figure 29 AUC and ROC for testing data- Initial model..............................................................................48
Figure 30 Confusion Matrix - Training data- Initial model..........................................................................48
Figure 31 Confusion matrix - Test data- Initial model................................................................................49
Figure 32 AUC and ROC for the training data- best model.........................................................................50
Figure 33 AUC and ROC for testing data- best model................................................................................50
Figure 34 Confusion Matrix - Training data- best model............................................................................51
Figure 35 Confusion matrix - Test data- best model..................................................................................51
Figure 36 AUC and ROC for the training data- Initial model.......................................................................52
Figure 37 AUC and ROC for testing data- Initial model..............................................................................52
Figure 38 Confusion Matrix - Training data- Initial model..........................................................................53
Figure 39 Confusion matrix - Test data- Initial model................................................................................53
Figure 40 AUC and ROC for the training data- best model.........................................................................54
Figure 41 AUC and ROC for testing data- best model................................................................................55
Figure 42 Confusion Matrix - Training data- best model............................................................................55
Figure 43 Confusion matrix - Test data- best model..................................................................................55
Figure 44 AUC and ROC for the training data- Bagging..............................................................................57
Figure 45 AUC and ROC for testing data- Bagging......................................................................................57
Figure 46 Confusion Matrix - Training data- Bagging.................................................................................57
Figure 47 Confusion matrix - Test data- Bagging........................................................................................58
Figure 48 AUC and ROC for the training data- AdaBoost...........................................................................59
Figure 49 AUC and ROC for testing data- AdaBoost...................................................................................59
Figure 50 Confusion Matrix - Training data- AdaBoost..............................................................................60
Figure 51 Confusion matrix - Test data- AdaBoost.....................................................................................60
Figure 52 AUC and ROC for the training data- Gradient Boosting..............................................................61
Figure 53 AUC and ROC for testing data- Gradient Boosting......................................................................61
Figure 54 Confusion Matrix - Training data- Gradient Boosting.................................................................62
Figure 55 Confusion matrix - Test data- Gradient Boosting........................................................................62
Figure 56 Result set...................................................................................................................................65
Table of Tables
Table 1 Dataset Sample...............................................................................................................................6
Table 2 Data Dictionary................................................................................................................................7
Table 3 Problem 1: Data Information...........................................................................................................7
Table 4 Duplicated Records..........................................................................................................................8
Table 5 Missing Values.................................................................................................................................8
Table 6 Problem 1: Summary Stats..............................................................................................................8
Table 7 Problem 1: Skewness.......................................................................................................................9
Table 8 Problem 1: Missing/Null values.....................................................................................................10
Table 9 Problem 1: Shape and Data types..................................................................................................10
Table 10 Outlier Proportions......................................................................................................................22
Table 11 Ordinal encoding - vote – Codes..................................................................................................23
Table 12 Ordinal encoding - gender – Codes.............................................................................................23
Table 13 Sample of the dataset after z-score scaling.................................................................................24
Table 14 Sample of X - Dataset of predictor variables...............................................................................24
Table 15 Predictor variables dataset - X - info............................................................................................24
Table 16 Sample of y - Dataset of target variables.....................................................................................25
Table 17 Independent variables - Training dataset - Sample......................................................................25
Table 18 Independant variables - Training dataset - Info...........................................................................26
Table 19 Independent variables - Testing dataset - Sample.......................................................................26
Table 20 Independent variables - Testing dataset - Info.............................................................................26
Table 21 Target variable - Training dataset - Sample..................................................................................27
Table 22 Target variable - Training dataset - Info.......................................................................................27
Table 23 Target variable - Testing dataset - Sample...................................................................................27
Table 24 Target variable - Testing dataset – Info........................................................................................27
Table 25 Predicted Class - Logistic Regression...........................................................................................28
Table 26 Predicted Class - Logistic Regression - Best parameters..............................................................29
Table 27 Accuracy - Best model.................................................................................................................29
Table 28 Predicted Class – LDA..................................................................................................................30
Table 29 Accuracy – Initial LDA model.......................................................................................................30
Table 30 Predicted Class – KNN.................................................................................................................31
Table 31 Accuracy – Initial KNN model.......................................................................................32
Table 32 Predicted Class – Naïve Bayes.....................................................................................................33
Table 33 Accuracy – Initial Naïve Bayes model..........................................................................................33
Table 34 Accuracy - Tuned Model - Logistic Regression.............................................................................34
Table 35 Accuracy - Tuned Model - Logistic Regression.............................................................................35
Table 36 Accuracy - Tuned Model – KNN...................................................................................................36
Table 37 Accuracy - Tuned Model – KNN...................................................................................................36
Table 38 Predicted Class – Bagging over Random Forest...........................................................................37
Table 39 Accuracy – Bagging over Random Forest.....................................................................................37
Table 40 Predicted Class – AdaBoost.........................................................................................................38
Table 41 Accuracy – AdaBoost...................................................................................................................38
Table 42 Predicted Class – Gradient Boosting............................................................................................39
Table 43 Accuracy – Gradient Boosting......................................................................................................39
Table 44 Accuracy - Initial model...............................................................................................................40
Table 45 Classification Report - Training data- Initial model......................................................................42
Table 46 Classification Report - Testing data- Initial model........................................................................42
Table 47 Accuracy - Best model.................................................................................................................43
Table 48 Classification Report - Training data- Best model........................................................................44
Table 49 Classification Report - Testing data - Best Model.........................................................................45
Table 50 Accuracy - Initial model...............................................................................................................45
Table 51 Classification Report - Training and Testing data- Initial model...................................................46
Table 52 Accuracy - Custom Cut-Off...........................................................................................................47
Table 53 Classification report - Custom cutoff test data – LDA..................................................................47
Table 54 Accuracy - Initial model...............................................................................................................47
Table 55 Classification Report - Training data- Initial model......................................................................49
Table 56 Classification Report - Testing data- Initial model........................................................................49
Table 57 Accuracy - Best model.................................................................................................................49
Table 58 Classification Report - Training data- best model........................................................................51
Table 59 Classification Report - Testing data - Best Model.........................................................................52
Table 60 Accuracy - Initial model...............................................................................................................52
Table 61 Classification Report - Training data- Initial model......................................................................53
Table 62 Classification Report - Testing data- Initial model........................................................................54
Table 63 Accuracy - Best model.................................................................................................................54
Table 64 Classification Report - Training data- best model........................................................................56
Table 65 Classification Report - Testing data - Best Model.........................................................................56
Table 66 Accuracy - Bagging.......................................................................................................................56
Table 67 Classification Report - Training data- Bagging..............................................................................58
Table 68 Classification Report - Testing data - Bagging..............................................................................58
Table 69 Accuracy - AdaBoost....................................................................................................................58
Table 70 Classification Report - Training data- AdaBoost...........................................................................60
Table 71 Classification Report - Testing data - AdaBoost............................................................................61
Table 72 Accuracy - Gradient Boosting......................................................................................................61
Table 73 Classification Report - Training data- Gradient Boosting..............................................................62
Table 74 Classification Report - Testing data – Gradient Boosting.............................................................63
Table 75 Accuracy of each algorithm and calculated stability....................................................................63
Table 76 Classification Report - Training data- best Naïve Bayes model....................................64
Table 77 Classification Report - Testing data - best Naïve Bayes Model.....................................64
Problem 1:
Executive Summary:
One of the leading news channels, CNBE, wants to analyze the recent elections. A survey was conducted on 1,525 voters across 9 variables. We have to build a model that predicts which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help in predicting the overall winner and the seats covered by a particular party.
Introduction:
The purpose of this exercise is to utilize various models to predict the vote of each voter and to compare the results so as to find which model delivers the best results. The chosen model, or the ensembled result of all the models, will later be used by the channel to predict the overall winner of the election and the seats covered by a particular party.
Problem Questions:
1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it.
Sample of the dataset:
The dataset contains information such as the party choice of the voter, the voter’s age, an assessment of the current national economic condition, the household economic condition, an assessment of the leader of each party, attitude towards European integration, the voter’s political knowledge, and gender. The variable ‘vote’ is the target variable based on the problem statement.
Data Dictionary:
Below is the data dictionary to understand the terminologies used across the dataset and the analysis.
Data Description
The given data provides a good amount of information regarding the voters. It has the age and gender of the voters along with the vote choice of each – whether they have voted for the ‘Labour’ or the ‘Conservative’ party. It also provides some ordinal information regarding the current national economic conditions and the current household economic conditions of the voters – rated 1 to 5, where 5 is the best and 1 is the worst.
It also provides their ratings of the Labour and Conservative party leaders – rated the same way as the above variables, with 5 being the best and 1 the worst. In addition to these, we have an 11-point scale that measures voters' attitudes toward European integration, where high scores represent ‘Eurosceptic’ sentiment. We also have a rating of the voter’s political knowledge – specifically of each party's stance on European integration – rated 0 to 3, where 0 represents no knowledge of the subject and 3 represents very good knowledge of it.
Data checks:
Additional data checks have been made to confirm the following aspects.
Duplicate records
There are 8 duplicate records found in the data. Below is the list of duplicate records.
Table 4 Duplicated Records
The duplicates are not deleted in this case, on the assumption that two voters of the same age can genuinely have a similar economic background and similar political standpoints.
Summary Stats:
Based on the summary statistics table shown below, we can see that the values of the fields are on different scales. For example, the field 'Europe' has values ranging between 1 and 11, while the values under the field ‘political.knowledge’ range between 0 and 3.
It is also evident that, as already mentioned in the data description, the columns economic.cond.national, economic.cond.household, Blair and Hague are rated on the same 1 to 5 scale.
The rule of thumb seems to be: If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
If the skewness is between -1 and – 0.5 or between 0.5 and 1, the data are moderately skewed. If the
skewness is less than -1 or greater than 1, the data are highly skewed.
For the given dataset we have the skewness for each variable as follows:
Column  Skewness
age  0.144621
economic.cond.national  -0.240453
economic.cond.household  -0.149552
Blair  -0.535419
Hague  0.1521
Europe  -0.135947
political.knowledge  -0.426838
Table 7 Problem 1: Skewness
The skewness for the variable ‘age’ is positive but lies in the range of -0.5 and 0.5. So, the
distribution is fairly symmetrical.
The skewness for the variable ‘economic.cond.national’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
The skewness for the variable ‘economic.cond.household’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
The skewness for the variable ‘Blair’ is negative but lies in the range of -1 and – 0.5. So, the
distribution is moderately left skewed.
The skewness for the variable ‘Hague’ is positive but lies in the range of -0.5 and 0.5. So, the
distribution is fairly symmetrical.
The skewness for the variable ‘Europe’ is negative but lies in the range of -0.5 and 0.5. So, the
distribution is fairly symmetrical.
The skewness for the variable ‘political.knowledge’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
As mentioned earlier as part of the data description and data checks, there are no missing/null values in any of the columns. Below is the result-set table of the missing-values check for each column of the dataset.
Based on the initial analysis, we can see that this is a sample dataset with just 1525 rows across
9 fields.
All of the fields are of integer datatype except vote and gender which are of object type. All of
the columns are non-null.
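The checks above can be reproduced with a few pandas calls. Below is a minimal sketch; the file name Election_Data.xlsx and the DataFrame name df are assumptions, as the report does not state them.

import pandas as pd

# Assumed source file; the actual file name may differ.
df = pd.read_excel("Election_Data.xlsx")

print(df.shape)                      # number of rows and columns
df.info()                            # data types and non-null counts
print(df.describe().T)               # summary statistics for the numeric fields
print(df.isnull().sum())             # missing/null values per column
print(df.duplicated().sum())         # count of duplicated records
print(df.skew(numeric_only=True))    # skewness of the numeric columns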
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Univariate Analysis:
For the 7 continuous variables as part of the dataset, univariate analysis is done which includes the
description of each variable, distribution plot, Skewness measure and Box plot of the same. We also have
2 categorical variables for which the analysis includes the value counts and respective bar plots.
Age
Inference:
The skewness for the variable ‘age’ is positive but lies in the range of -0.5 and 0.5. So, the distribution is
fairly symmetrical.
Based on the boxplot, there are no outliers in the ‘age’ variable. We can see that most of the sample voters are between the ages of 41 and 67, which are the 25th and 75th percentiles as shown in the variable description.
economic.cond.national
Inference:
The skewness for the variable ‘economic.cond.national’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
Based on the boxplot, there are a few outliers for the variable towards the lower end of the distribution.
economic.cond.household
Inference:
The skewness for the variable ‘economic.cond.household’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
Based on the boxplot, there are a few outliers for the variable towards the lower end of the distribution.
Blair
Inference:
The skewness for the variable ‘Blair’ is negative but lies in the range of -1 and – 0.5. So, the distribution is
moderately left skewed.
Hague
Inference:
The skewness for the variable ‘Hague’ is positive but lies in the range of -0.5 and 0.5. So, the distribution
is fairly symmetrical.
Europe
Inference:
The skewness for the variable ‘Europe’ is negative but lies in the range of -0.5 and 0.5. So, the
distribution is fairly symmetrical.
political.knowledge
Inference:
The skewness for the variable ‘political.knowledge’ is negative but lies in the range of -0.5 and 0.5. So, the distribution is fairly symmetrical.
Vote
Inference:
The majority of the voters support the ‘Labour’ party, which accounts for 1,063 out of the 1,525 total voters.
Gender
Bivariate Analysis:
The same way for the 7 continuous variables as part of the dataset, Bivariate analysis is done which
includes the Correlation Heatmap and Pair plot for the entire set.
Correlation Heatmap:
A correlation heatmap is a graphical representation of a correlation matrix representing the correlation
between different variables. The value of correlation can take any value from -1 to 1. Correlation
between two random variables or bivariate data does not necessarily imply a causal relationship.
Here we can see how the various factors of the dataset are correlated with each other. Some of the main inferences are as follows:
The variables have very little correlation among each other. The only variables that show some correlation are economic.cond.national with economic.cond.household and Blair.
It is also noticeable that the ‘Hague’ field shows a similar level of correlation with the variable ‘Europe’.
There are a considerable number of negatively correlated variable pairs in the dataset, based on the correlation heatmap below.
Figure 10 Correlation Heatmap
Pairplot
Figure 11 Pairplot
A pair plot is used to understand the best set of features to explain the relationship between two variables or to form the most separated clusters. It also helps in building simple classification models by drawing simple lines or making linear separations in our dataset.
Based on the pairplot above, we can draw the following inference:
There is no considerable linear correlation among any of the variables in the given dataset.
Outlier Proportions
Based on the univariate analysis, we see that some of the variables contain a considerable proportion of outliers. The proportion of outliers relative to the amount of data available is shown below.
We can see that there is a good number of outliers in the ‘economic.cond.national’ and ‘economic.cond.household’ variables, while the other variables have no outliers.
Outliers Treatment:
We have treated the outliers in the data by capping the outlier values at the upper and lower bound values, since the columns hold continuous data. The series of boxplots below shows the fields after the outliers have been treated.
We can now see in the boxplots that there are no outliers remaining in the above-mentioned variables.
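A minimal sketch of the capping approach described above, continuing with the DataFrame df from the earlier sketch; the column names economic.cond.national and economic.cond.household are the ones identified as outlier-bearing in the univariate analysis.

def cap_outliers(series):
    # Clip values outside the IQR-based whiskers to the whisker limits.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series.clip(lower=lower, upper=upper)

for col in ["economic.cond.national", "economic.cond.household"]:
    df[col] = cap_outliers(df[col])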
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
Data Encoding:
Most models require all of the input and output variables to be numeric, and hence it becomes very important to encode the categorical data into numbers before a model can be fitted and evaluated.
Here we are encoding the categorical variables ‘vote’ and ‘gender’ using label/ordinal encoding, since each of them has only 2 categories. This type of encoding is normally used when the variables are ordinal; ordinal encoding converts each label into an integer value, and the encoded data represents the sequence of labels.
Vote:
The data under ‘vote’ are ‘Labour’ and ‘Conservative’ which are labelled to be the integer values 0 and 1
respectively. We encode the string values into integer values for further modelling requirements.
Vote Codes
Labour 0
Conservative 1
Table 11 Ordinal encoding - vote – Codes
Gender:
The data under ‘gender’ are ‘female’ and ‘male’ which are labelled to be the integer values 0 and 1
respectively. We encode the string values into integer values for further modelling requirements.
Gender Codes
female 0
male 1
Table 12 Ordinal encoding - gender – Codes
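A minimal sketch of the label encoding described above, using simple value maps on the same DataFrame df; the integer codes follow the two tables above.

# Map the two-category string columns to the integer codes listed above.
df["vote"] = df["vote"].map({"Labour": 0, "Conservative": 1})
df["gender"] = df["gender"].map({"female": 0, "male": 1})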
Scaling:
Feature scaling (also known as data normalization) is the method used to standardize the range of
features of data. Since, the range of values of data may vary widely, it becomes a necessary step in data
preprocessing while using machine learning algorithms.
Zscore – Scaling:
Z-score normalization refers to the process of normalizing every value in a dataset such that the mean of
all of the values is 0 and the standard deviation is 1. We use the following formula to perform a z-score
normalization on every value in a dataset:
New value = (x – μ) / σ
Sample of dataset after Scaling:
The sample of the dataset after applying Z-score scaling is shown below. We can see that the values of all
the variables are of the same scale now. Note that we have scaled only the set of independent variables
here since there are only 2 values in the target variable.
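A minimal sketch of the z-score scaling, applied only to the independent variables as noted above; StandardScaler implements the (x − μ) / σ transformation. The variable names are assumptions for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler

predictors = df.drop(columns=["vote"])        # independent variables only
scaler = StandardScaler()
scaled = scaler.fit_transform(predictors)     # z-score every column
df_scaled = pd.DataFrame(scaled, columns=predictors.columns, index=df.index)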
We are creating the dataset X to have all the predictor variables and the dataset y to have the
dependent/target variable.
Below is the sample of how the dataset with predictor variables look after the split.
Below is the information on the predictor variables dataset ‘X’ which shows that we have 8 independent
variables as part of it.
Table 15 Predictor variables dataset - X - info
Below is a sample of how the dataset with the target variable looks after the split.
The data is split 70:30, so the majority of the rows in both the X and y datasets go to the training sets, while a considerable amount is held out as test sets. The data in the training sets will be used for training the machine learning algorithms, while the test sets will be used to check the performance of these models.
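A minimal sketch of the 70:30 split described above; random_state and stratify are assumptions made for reproducibility and class balance, since the report does not state them.

from sklearn.model_selection import train_test_split

X = df_scaled          # predictor variables (scaled)
y = df["vote"]         # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)   # expected: (1067, 8) and (458, 8)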
Sample data:
Based on the sample, we can see that a random set of records from the independent variables’ dataset
‘X’ are added here.
Table 17 Independent variables - Training dataset - Sample
Data information:
Below is the information regarding the above dataset. It shows that we have 1067 rows of data loaded
into this dataset and all of them are independent variables.
Sample:
Based on the sample, we can see that a random set of records from the independent variables’ dataset
‘X’ are added here.
Data information:
Below is the information regarding the above dataset. It shows that we have 458 rows of data loaded
into this dataset and all of them are independent variables.
Table 20 Independent variables - Testing dataset - Info
Sample:
Based on the sample, we can see that a random set of records from the target variable’s dataset ‘y’ are
added here.
Data information:
Below is the information regarding the above dataset. It shows that we have 1067 rows of data loaded
into this dataset and it is just the target variable.
Sample:
Based on the sample, we can see that a random set of records from the target variable’s dataset ‘y’ are
added here.
Table 23 Target variable - Testing dataset - Sample
Data information:
Below is the information regarding the above dataset. It shows that we have 458 rows of data loaded
into this dataset and it is just the target variable.
1.4 Apply Logistic Regression and LDA (linear discriminant analysis)
Logistic Regression:
Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on
a given dataset of independent variables. Since the outcome is a probability, the dependent variable is
bounded between 0 and 1.
After fitting the data to the Logistic Regression Model, we will be retrieving the predictions on training
and test datasets.
Predicted Classes:
Below are the predicted classes rendered by the model.
Parameters:
Here we are running with the following parameters for logistic regression.
Penalty – ‘l2’
o Penalized logistic regression imposes a penalty to the logistic model for having too many
variables. This results in shrinking the coefficients of the less contributive variables
toward zero. This is also known as regularization.
o Out of the l1 and l2 which are suitable penalties for liblinear solver, we have chosen l2
randomly here
Solver – ‘liblinear’
o Algorithm to use in the optimization problem. Default is ‘lbfgs’. To choose a solver, you
might want to consider the following aspects:
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster
for large ones;
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle
multinomial loss;
‘liblinear’ is limited to one-versus-rest schemes.
Tol – 0.0001
o Tolerance for stopping criteria.
Verbose – 1
o For the liblinear and lbfgs solvers it is good to set verbose to any positive number for
verbosity.
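A minimal sketch of fitting the logistic regression with the parameters listed above, using the train/test split from the earlier sketch.

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(penalty="l2", solver="liblinear",
                             tol=0.0001, verbose=1)
log_reg.fit(X_train, y_train)

train_pred = log_reg.predict(X_train)
test_pred = log_reg.predict(X_test)
print(log_reg.score(X_train, y_train), log_reg.score(X_test, y_test))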
Based on these parameters, the model is built, and the predicted classes rendered by the model are as
follows:
Accuracy:
Accuracy of the best model on the training and test data:
Logistic Regression Model – Training: 0.83, Test: 0.86
Table 27 Accuracy - Best model
Validity:
Based on the training and test accuracy results, the model does not appear to be over- or under-fitted.
Linear Discriminant Analysis:
Linear Discriminant Analysis is used as a dimensionality reduction technique in machine learning, using which we can easily project a 2-D or 3-D feature space onto a 1-dimensional axis.
Let's consider an example where we have two classes in a 2-D plane with an X-Y axis, and we need to classify them efficiently. LDA enables us to draw a straight line that can completely separate the two classes of data points. Here, LDA uses the X-Y axes to create a new axis, separating the classes with a straight line and projecting the data onto that new axis. Hence, we can maximize the separation between these classes and reduce the 2-D plane to 1-D.
To create the new axis, Linear Discriminant Analysis uses the following criteria:
1. Maximize the distance between the means of the two classes.
2. Minimize the variation (scatter) within each class.
Using the above two conditions, LDA generates a new axis in such a way that it can maximize the
distance between the means of the two classes and minimizes the variation within each class. In other
words, we can say that the new axis will increase the separation between the data points of the two
classes and plot them onto the new axis.
Predicted Classes:
Below are the predicted classes rendered by the model.
Table 28 Predicted Class – LDA
Parameters:
We have implemented the above LDA model with the default cut-off which is set to 0.5.
Accuracy:
Accuracy of the model on the train and test data:
Default cutoff LDA Model – Train: 0.82, Test: 0.86
Table 29 Accuracy – Initial LDA model
Validity:
Based on the accuracy on the train and test sets, we can confirm that there is no under- or over-fitting in the model.
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
Fitting the Models:
As part of this phase, we will be fitting the data in KNN and Naïve Bayes models.
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase; it keeps all of the training data and uses it at classification time.
Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn’t
assume anything about the underlying data.
Figure 14 KNN model
Predicted Classes:
Below are the predicted classes rendered by the model.
Parameters:
We have implemented the above KNN model with the default k value which is set to 5.
Accuracy:
Accuracy of the model on the train and test data:
Initial KNN Model – Train: 0.87, Test: 0.81
Table 31 Accuracy – Initial KNN model
Validity:
Based on the accuracy on the train and test sets, the model shows signs of mild overfitting, since the training accuracy is noticeably higher than the test accuracy.
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis of
color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each
feature individually contributes to identifying it as an apple without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) × P(A) / P(B)
where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability of the hypothesis, and P(B) is the marginal probability of the evidence.
Predicted Classes:
Below are the predicted classes rendered by the model.
Parameters:
We have implemented the above Naïve Bayes model with the default parameter var_smoothing which is
set to 1e-09.
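A minimal sketch of the Gaussian Naïve Bayes model with the default var_smoothing, continuing with the same train/test split.

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB(var_smoothing=1e-09)   # the default value, stated explicitly
nb.fit(X_train, y_train)
print(nb.score(X_train, y_train), nb.score(X_test, y_test))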
Accuracy:
Accuracy of the model on the train and test data:
Default Naïve Bayes Model – Train: 0.83, Test: 0.83
Table 33 Accuracy – Initial Naïve Bayes model
Validity:
Based on the accuracy on the train and test sets, we can confirm that there is no under-fitting or over-fitting in the model.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting
GridSearchCV:
GridSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”,
“predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in
the estimator used. The parameters of the estimator used to apply these methods are optimized by
cross-validated grid-search over a parameter grid.
Here we are running the grid search with the following parameters for logistic regression.
Penalty – 'l1', 'l2' and 'none'
Solver – 'liblinear','sag' and 'lbfgs'
Tol – 0.0001 and 0.00001
The best parameters given by the grid search are as follows:
{'penalty': 'none', 'solver': 'sag', 'tol': 0.0001}
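A minimal sketch of the grid search over the parameters listed above; cv=5, the scoring metric and max_iter are assumptions, since the report does not state them. Some penalty/solver combinations are incompatible and are simply recorded as failed fits by GridSearchCV, and newer scikit-learn versions spell the 'none' penalty as None.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    "penalty": ["l1", "l2", "none"],
    "solver": ["liblinear", "sag", "lbfgs"],
    "tol": [0.0001, 0.00001],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_)     # e.g. {'penalty': 'none', 'solver': 'sag', 'tol': 0.0001}
best_lr = grid.best_estimator_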
Based on these parameters, the best model is built, and the accuracy of the model is as follows:
Based on the custom cut-offs selected, the best model is built, and the accuracy of the model is as follows:
Misclassification error (MCE) = 1 − test accuracy score. The MCE is calculated for each KNN model with the number of neighbours k = 1, 3, 5, ..., 19, and the model with the lowest MCE is selected, as sketched below.
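A minimal sketch of the MCE search over odd values of k described above, reusing the same train/test split.

from sklearn.neighbors import KNeighborsClassifier

mce = {}
for k in range(1, 20, 2):                       # k = 1, 3, 5, ..., 19
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    mce[k] = 1 - model.score(X_test, y_test)    # misclassification error

best_k = min(mce, key=mce.get)                  # k with the lowest MCE
print(best_k, mce[best_k])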
We can see that for the k value 13, the MCE seems to be the minimum and so we take the k-value to be
13 for the tuned model.
Based on the k value selected, the best model is built, and the accuracy of the model is as follows:
GridSearchCV:
GridSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”,
“predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in
the estimator used. The parameters of the estimator used to apply these methods are optimized by
cross-validated grid-search over a parameter grid.
Here we are running the grid search with the following parameters for Naïve Bayes (Gaussian) model.
{'var_smoothing': np.logspace(0, -9, num=100)} – var_smoothing is the portion of the largest variance of all features that is added to the variances for calculation stability.
The best parameters given by the grid search are as follows:
{'var_smoothing': 0.01}
Based on the parameters above, the best model is built, and the accuracy of the model is as follows:
GridSearchCV:
GridSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”,
“predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in
the estimator used. The parameters of the estimator used to apply these methods are optimized by
cross-validated grid-search over a parameter grid.
Here we are running the grid search with the following parameters for Random Forest.
Parameter grid:
Below are the parameters added to the parameter grid for the grid search CV.
'n_estimators': [300,400,500]
Best Parameters:
As a result of the grid search, below are the best parameters rendered.
(max_depth=5, max_features=6, min_samples_leaf=2,
min_samples_split=25, n_estimators=300, random_state=1)
Bagging:
Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly used
to reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is
selected with replacement—meaning that the individual data points can be chosen more than once.
Here a bagging model is created with Random Forest model as the base estimator.
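A minimal sketch of bagging with the tuned Random Forest as the base estimator; the Random Forest parameters are the best parameters reported above, while the number of bagging estimators is an assumption. Older scikit-learn versions use the keyword base_estimator instead of estimator.

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

rf = RandomForestClassifier(max_depth=5, max_features=6, min_samples_leaf=2,
                            min_samples_split=25, n_estimators=300,
                            random_state=1)
bagging = BaggingClassifier(estimator=rf, n_estimators=10, random_state=1)
bagging.fit(X_train, y_train)
print(bagging.score(X_train, y_train), bagging.score(X_test, y_test))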
Predicted Classes:
Below are the predicted classes rendered by the model.
Accuracy:
Below is the accuracy for training and testing dataset rendered by the model.
Boosting:
Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to
minimize training errors. In boosting, a random sample of data is selected, fitted with a model and then
trained sequentially—that is, each model tries to compensate for the weaknesses of its predecessor.
With each iteration, the weak rules from each individual classifier are combined to form one, strong
prediction rule.
In bagging, weak learners are trained in parallel, but in boosting, they learn sequentially. This means that
a series of models are constructed and with each new model iteration, the weights of the misclassified
data in the previous model are increased. This redistribution of weights helps the algorithm identify the
parameters that it needs to focus on to improve its performance. AdaBoost, which stands for “adaptive
boosting,” is one of the most popular boosting algorithms as it was one of the first of its kind.
Other types of boosting algorithms include XGBoost, GradientBoost, and BrownBoost.
AdaBoost
This method operates iteratively, identifying misclassified data points and adjusting their weights to
minimize the training error. The model continues to optimize in a sequential fashion until it yields the
strongest predictor.
Parameters:
Here, we are running the algorithm with the below parameters:
(n_estimators=100, random_state=1)
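A minimal sketch of the AdaBoost model with the parameters listed above.

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X_train, y_train)
print(ada.score(X_train, y_train), ada.score(X_test, y_test))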
Predicted Classes:
Below are the predicted classes rendered by the model.
Accuracy:
Below is the accuracy for training and testing dataset rendered by the model.
Gradient boosting:
The gradient boosting trains on the residual errors of the previous predictor. The name, gradient
boosting, is used since it combines the gradient descent algorithm and boosting method.
Parameters:
Here, we are running the algorithm with the below parameters:
(n_estimators=100, random_state=1)
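A minimal sketch of the gradient boosting model with the same parameters.

from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=100, random_state=1)
gbm.fit(X_train, y_train)
print(gbm.score(X_train, y_train), gbm.score(X_test, y_test))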
Predicted Classes:
Below are the predicted classes rendered by the model.
Accuracy:
Below is the accuracy for training and testing dataset rendered by the model.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
Performance Metrics:
Performance metrics are a part of every machine learning pipeline. They tell you if you're making
progress and put a number on it. All machine learning models, whether it's linear regression, or a SOTA
technique like BERT, need a metric to judge performance. Some of the performance metrics that we are
going to use here are
Accuracy
Confusion Matrix
ROC curve
ROC AUC score
Accuracy:
Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations. One may think that if we have high accuracy then our model is the best. Accuracy is a great measure, but only when you have symmetric datasets where the counts of false positives and false negatives are almost the same. Therefore, you have to look at other parameters to evaluate the performance of your model. For our model, we have obtained 0.803, which means the model is approximately 80% accurate.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Confusion Matrix:
A confusion matrix is a table that is used to define the performance of a classification algorithm. A
confusion matrix visualizes and summarizes the performance of a classification algorithm.
ROC Curve:
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds. This curve plots two parameters:
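A minimal sketch of computing these metrics for any fitted classifier; it is shown here for the logistic regression from the earlier sketch, and the same calls apply to every model in this section.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_curve, roc_auc_score)
import matplotlib.pyplot as plt

probs = log_reg.predict_proba(X_test)[:, 1]   # probability of class 1
preds = log_reg.predict(X_test)

print(accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
print(roc_auc_score(y_test, probs))

fpr, tpr, _ = roc_curve(y_test, probs)        # points for the ROC curve
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()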
Now that we have an idea of what these performance metrics are, let us check how our models have
performed.
Initial Logistic Regression Model – Training: 0.83, Test: 0.86
Table 44 Accuracy - Initial model
Figure 18 AUC and ROC for the training data- Initial model
Best Logistic Regression Model – Training: 0.83, Test: 0.86
Table 47 Accuracy - Best model
Initial LDA Model – Training: 0.82, Test: 0.86
Table 50 Accuracy - Initial model
Figure 26 AUC and ROC for training and testing data- Initial model
Custom cutoff LDA Model – Train: 0.81, Test: 0.85
Table 52 Accuracy - Custom Cut-Off
Classification Reports of the custom cut-off train and test data are as follows:
o In the classification report of the training data, we can see that the precision values for 0
and 1 are 0.90 and 0.64 respectively and recall values are 0.81 and 0.80 respectively.
o In the classification report of the testing data, we can see that the precision values for 0
and 1 are 0.93 and 0.71 respectively and recall values are 0.85 and 0.85 respectively.
Table 53 Classification report - Custom cutoff test data – LDA
Initial KNN Model – Training: 0.87, Test: 0.81
Table 54 Accuracy - Initial model
Figure 28 AUC and ROC for the training data- Initial model
Best KNN Model – Training: 0.85, Test: 0.83
Table 57 Accuracy - Best model
Initial Naïve Bayes Model – Training: 0.83, Test: 0.83
Table 60 Accuracy - Initial model
Figure 36 AUC and ROC for the training data- Initial model
Best Naïve Bayes Model – Training: 0.83, Test: 0.82
Table 63 Accuracy - Best model
Figure 40 AUC and ROC for the training data- best model
Bagging with Random Forest Model – Training: 0.86, Test: 0.83
Table 66 Accuracy - Bagging
Boosting(AdaBoost):
Accuracy of the best model on the training and test data:
Adaptive Boosting Model – Training: 0.85, Test: 0.82
Table 69 Accuracy - AdaBoost
AUC and ROC for the training data:
The AUC value is calculated to be 0.913, and the ROC curve for the training data is as follows:
Boosting(Gradient Boosting):
Accuracy of the best model on the training and test data:
Gradient Boosting Model – Training: 0.88, Test: 0.83
Table 72 Accuracy - Gradient Boosting
Figure 52 AUC and ROC for the training data- Gradient Boosting
Models Comparison:
Let us compare the models based on the performance metrics rendered by each model for the training
and test datasets.
Based on the comparison above, we can conclude that the most stable algorithm is the Naïve Bayes model, since it has the smallest difference in prediction accuracy between the training and test datasets. The most unstable ones appear to be the KNN model before fine-tuning and the Gradient Boosting model.
Final Model:
The best or most optimized model based on the accuracy of the training and test datasets is found to be
the Naïve Bayes Model. The model has delivered the same kind of accuracy in both the sets.
We can confirm the same based on the classification report for training and testing data for the fine-
tuned Naïve Bayes models below.
In the classification report for the training data, the precision values for 0 and 1 are 0.88 and 0.73 respectively, and the recall values are 0.88 and 0.73 respectively.
In the classification report for the testing data, the precision values for 0 and 1 are 0.89 and 0.67 respectively, and the recall values are 0.86 and 0.74 respectively.
1.8 Based on these predictions, what are the insights?
The CNBE channel can use the Naïve Bayes model to make their poll predictions, since it provides the most stable prediction across the training and test sets.
The data already has largely uncorrelated variables, which makes the Naïve Bayes algorithm’s assumption that the variables are independent reasonable.
It is also possible to use ensemble learning over multiple models and take the most common prediction across the models as the result. This could be an option if the organization cannot settle on a single best model to predict the poll results.
Problem 2:
Executive Summary:
In this project, we are going to work on the inaugural corpora from nltk in Python. We will be looking at the following inaugural speeches of Presidents of the United States of America: President Franklin D. Roosevelt in 1941, President John F. Kennedy in 1961, and President Richard Nixon in 1973.
Introduction:
The purpose of this exercise is to utilize various Text mining techniques to understand the information
and emotions involved in the speeches given. We will be using the Natural Language ToolKit to mine the
raw data and derive sensible information from the same.
Problem Questions:
2.1 Find the number of characters, words, and sentences for the mentioned
documents.
Below are the total numbers of characters, words, and sentences in each document.
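A minimal sketch of computing these counts with NLTK; the three file identifiers are assumptions for the 1941, 1961 and 1973 inaugural addresses discussed in this problem.

import nltk
from nltk.corpus import inaugural
nltk.download("inaugural")

# Assumed file identifiers for the three speeches.
speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]

for fid in speeches:
    n_chars = len(inaugural.raw(fid))    # number of characters
    n_words = len(inaugural.words(fid))  # number of word tokens
    n_sents = len(inaugural.sents(fid))  # number of sentences
    print(fid, n_chars, n_words, n_sents)

2.2 Remove all the stopwords from all three speeches.
After removing the standard English stopwords (and applying stemming), the speeches read as shown below.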
vice presid johnson mr speaker mr chief justic presid eisenhow vice presid
nixon presid truman reverend clergi fellow citizen observ today victori
parti celebr freedom symbol end well begin signifi renew well chang sworn
almighti god solemn oath forebear l prescrib nearli centuri three quarter
ago world differ man hold mortal hand power abolish form human poverti
form human life yet revolutionari belief forebear fought still issu around
globe belief right man come generos state hand god dare forget today heir
first revolut word go forth time place friend foe alik torch pass new
gener american born centuri temper war disciplin hard bitter peac proud
ancient heritag unwil wit permit slow undo human right nation alway commit
commit today home around world everi nation whether wish well ill pay
price bear burden meet hardship support friend oppos foe order assur
surviv success liberti much pledg old alli whose cultur spiritu origin
share pledg loyalti faith friend unit littl cannot host cooper ventur
divid littl dare meet power challeng odd split asund new state welcom rank
free pledg word one form coloni control pass away mere replac far iron
tyranni alway expect find support view alway hope find strongli support
freedom rememb past foolishli sought power ride back tiger end insid peopl
hut villag across globe struggl break bond mass miseri pledg best effort
help help whatev period requir communist may seek vote right free societi
cannot help mani poor cannot save rich sister republ south border offer
special pledg convert good word good deed new allianc progress assist free
men free govern cast chain poverti peac revolut hope cannot becom prey
hostil power neighbor join oppos aggress subvers anywher america everi
power hemispher intend remain master hous world assembl sovereign state
unit nation last best hope age instrument war far outpac instrument peac
renew pledg support prevent becom mere forum invect strengthen shield new
weak enlarg area writ may run final nation would make adversari offer
pledg request side begin anew quest peac dark power destruct unleash
scienc engulf human plan accident self destruct dare tempt weak arm
suffici beyond doubt certain beyond doubt never employ neither two great
power group nation take comfort present cours side overburden cost modern
weapon rightli alarm steadi spread deadli atom yet race alter uncertain
balanc terror stay hand mankind final war begin anew rememb side civil
sign weak sincer alway subject proof never negoti fear never fear negoti
side explor problem unit instead belabor problem divid side first time
formul seriou precis propos inspect control arm bring absolut power
destroy nation absolut control nation side seek invok wonder scienc
instead terror togeth explor star conquer desert erad diseas tap ocean
depth encourag art commerc side unit heed corner earth command isaiah undo
heavi burden ... oppress go free ." beachhead cooper may push back jungl
suspicion side join creat new endeavor new balanc power new world law
strong weak secur peac preserv finish first 100 day finish first 1 000 day
life administr even perhap lifetim planet begin hand fellow citizen mine
rest final success failur cours sinc countri found gener american summon
give testimoni nation loyalti grave young american answer call servic
surround globe trumpet summon call bear arm though arm need call battl
though embattl call bear burden long twilight struggl year year rejoic
hope patient tribul struggl common enemi man tyranni poverti diseas war
forg enemi grand global allianc north south east west assur fruit life
mankind join histor effort long histori world gener grant role defend
freedom hour maximum danger shrink respons welcom believ would exchang
place peopl gener energi faith devot bring endeavor light countri serv
glow fire truli light world fellow american ask countri ask countri fellow
citizen world ask america togeth freedom man final whether citizen america
citizen world ask high standard strength sacrific ask good conscienc sure
reward histori final judg deed go forth lead land love ask bless help know
earth god work must truli
mr vice presid mr speaker mr chief justic senat cook mr eisenhow fellow
citizen great good countri share togeth met four year ago america bleak
spirit depress prospect seemingli endless war abroad destruct conflict
home meet today stand threshold new era peac world central question use
peac resolv era enter postwar period often time retreat isol lead stagnat
home invit new danger abroad resolv becom time great respons greatli born
renew spirit promis america enter third centuri nation past year saw far
reach result new polici peac continu revit tradit friendship mission peke
moscow abl establish base new durabl pattern relationship among nation
world america bold initi 1972 long rememb year greatest progress sinc end
world war ii toward last peac world peac seek world flimsi peac mere
interlud war peac endur gener come import understand necess limit america
role maintain peac unless america work preserv peac peac unless america
work preserv freedom freedom clearli understand new natur america role
result new polici adopt past four year respect treati commit support vigor
principl countri right impos rule anoth forc continu era negoti work limit
nuclear arm reduc danger confront great power share defend peac freedom
world expect other share time pass america make everi nation conflict make
everi nation futur respons presum tell peopl nation manag affair respect
right nation determin futur also recogn respons nation secur futur america
role indispens preserv world peac nation role indispens preserv peac
togeth rest world resolv move forward begin made continu bring wall hostil
divid world long build place bridg understand despit profound differ
system govern peopl world friend build structur peac world weak safe
strong respect right live differ system would influenc other strength idea
forc arm accept high respons burden gladli gladli chanc build peac noblest
endeavor nation engag gladli also act greatli meet respons abroad remain
great nation remain great nation act greatli meet challeng home chanc
today ever histori make life better america ensur better educ better
health better hous better transport cleaner environ restor respect law
make commun livabl insur god given right everi american full equal
opportun rang need great reach opportun great bold determin meet need new
way build structur peac abroad requir turn away old polici fail build new
era progress home requir turn away old polici fail abroad shift old polici
new retreat respons better way peac home shift old polici new retreat
respons better way progress abroad home key new respons lie place divis
respons live long consequ attempt gather power respons washington abroad
home time come turn away condescend polici patern washington know best ."
person expect act respons respons human natur encourag individu home
nation abroad decid locat respons place measur other today offer promis
pure government solut everi problem live long fals promis trust much
govern ask deliv lead inflat expect reduc individu effort disappoint
frustrat erod confid govern peopl govern must learn take less peopl peopl
rememb america built govern peopl welfar work shirk respons seek respons
live ask govern challeng face togeth ask govern help help nation govern
great vital role play pledg govern act act boldli lead boldli import role
everi one must play individu member commun day forward make solemn commit
heart bear respons part live ideal togeth see dawn new age progress
america togeth celebr 200th anniversari nation proud fulfil promis world
america longest difficult war come end learn debat differ civil decenc
reach one preciou qualiti govern cannot provid new level respect right
feel one anoth new level respect individu human digniti cherish birthright
everi american els time come renew faith america recent year faith
challeng children taught asham countri asham parent asham america record
home role world everi turn beset find everyth wrong america littl right
confid judgment histori remark time privileg live america record centuri
unparallel world histori respons generos creativ progress proud system
produc provid freedom abund wide share system histori world proud four war
engag centuri includ one bring end fought selfish advantag help other
resist aggress proud bold new initi steadfast peac honor made break toward
creat world world known structur peac last mere time gener come embark
today era present challeng great nation gener ever face answer god histori
conscienc way use year stand place hallow histori think other stood think
dream america think recogn need help far beyond order make dream come true
today ask prayer year ahead may god help make decis right america pray
help togeth may worthi challeng pledg togeth make next four year best four
year america histori 200th birthday america young vital began bright
beacon hope world go forward confid hope strong faith one anoth sustain
faith god creat strive alway serv purpos
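The stopword removal and stemming shown above can be reproduced with a short sketch; the report does not state which stemmer was used, so the Porter stemmer is an assumption consistent with the stemmed forms above. This continues from the earlier sketch where speeches is defined.

import string
import nltk
from nltk.corpus import inaugural, stopwords
from nltk.stem import PorterStemmer
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_speech(fid):
    # Lower-case, drop stopwords and punctuation tokens, then stem.
    tokens = [w.lower() for w in inaugural.words(fid)
              if w.lower() not in stop_words and w not in string.punctuation]
    return [stemmer.stem(w) for w in tokens]

cleaned = {fid: clean_speech(fid) for fid in speeches}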
2.3 Which word occurs the most number of times in his inaugural address for
each president? Mention the top three words. (after removing the stopwords)
Top 3 frequent words for [Link] are 'america', 'peac' and 'world'
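A minimal sketch of the frequency count used to find the top three words in each cleaned speech, continuing from the sketch above.

from collections import Counter

for fid, tokens in cleaned.items():
    top3 = Counter(tokens).most_common(3)   # three most frequent stemmed words
    print(fid, top3)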
2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords) – 3 Marks [ refer to the End-to-End Case Study done
in the Mentored Learning Session ]
[Word clouds for each of the three speeches are shown here.]
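A minimal sketch of generating the word clouds, assuming the third-party wordcloud package is installed and continuing from the cleaned tokens above.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

for fid, tokens in cleaned.items():
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate(" ".join(tokens))            # build the cloud from the cleaned text
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(fid)
plt.show()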