
Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of
data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

MACHINE
LEARNING
PROJECT

Created by Pranjal Singh


PGP-DSBA Online
05/03/2023

Table of Contents
Problem 1 Executive Summary
Introduction
Data Description
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check
Sample
Shape
Data Types
Null Value Check
Summary Stats of Numerical Columns
Summary Stats of Categorical Columns
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers
Checking for Duplicates and its treatment
Univariate Analysis
Bivariate Analysis
Outlier Check
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test
1.4 Apply Logistic Regression and LDA (linear discriminant analysis)
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging) and Boosting
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model.
Final Model: Compare the models and write inference which model is best/optimized
1.8 Based on these predictions, what are the insights?

Problem 2 Introduction
2.1 Find the number of characters, words, and sentences for the mentioned documents
2.2 Remove all the stopwords from all three speeches
2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words
2.4 Plot the word cloud of each of the speeches of the variable

List of Figures

Problem 1
Fig. 1 - age Histplot & Boxplot
Fig. 2 - economic.cond.national Histplot & Boxplot
Fig. 3 - economic.cond.household Histplot & Boxplot
Fig. 4 - Blair Histplot & Boxplot
Fig. 5 - Hague Histplot & Boxplot
Fig. 6 - Europe Histplot & Boxplot
Fig. 7 - political.knowledge Histplot & Boxplot
Fig. 8 - vote Countplot
Fig. 9 - gender Countplot
Fig. 10 - vote v/s age Stripplot
Fig. 11 - Correlation Heatmap
Fig. 12 - Numerical Columns with Outliers
Fig. 13 - MCE v/s K-Neighbours Plot
Fig. 14 - Confusion Matrix of Train and Test sets-Logistic Regression
Fig. 15 - ROC_AUC score and ROC curve of Train and Test Sets-Logistic Regression
Fig. 16 - Confusion Matrix of Train and Test sets-LDA
Fig. 17 - ROC_AUC score and ROC curve of Train and Test Sets-LDA
Fig. 18 - Confusion Matrix of Train and Test sets-KNN
Fig. 19 - ROC_AUC score and ROC curve of Train and Test Sets-KNN
Fig. 20 - Confusion Matrix of Train and Test sets-Naïve Bayes
Fig. 21 - ROC_AUC score and ROC curve of Train and Test Sets-Naïve Bayes
Fig. 22 - Confusion Matrix of Train and Test sets-Bagging(RF)
Fig. 23 - ROC_AUC score and ROC curve of Train and Test Sets-Bagging(RF)
Fig. 24 - Confusion Matrix of Train and Test sets-Ada Boost
Fig. 25 - ROC_AUC score and ROC curve of Train and Test Sets-Ada Boost
Fig. 26 - Confusion Matrix of Train and Test sets-Gradient Boosting
Fig. 27 - ROC_AUC score and ROC curve of Train and Test Sets-Gradient Boosting
Fig. 28 - Confusion Matrix of Train and Test sets-Final Model
Fig. 29 - ROC_AUC score and ROC curve of Train and Test Sets-Final Model

Problem 2
Fig. 30 - WordCloud of President Franklin D. Roosevelt’s Speech 1941
Fig. 31 - WordCloud of President John F. Kennedy’s Speech 1961
Fig. 32 - WordCloud of President Richard Nixon’s Speech 1973

List of Tables

Problem 1
Table 1 : Dataset Sample
Table 2 : Data type table
Table 3 : Null value check table
Table 4 : Summary of Numerical Columns
Table 5 : Summary of Categorical Columns
Table 6 : Duplicates table
Table 7 : Sample dataset after Encoding
Table 8 : Five-Point Summary Before Scaling
Table 9 : Five-Point Summary After Scaling
Table 10 : Classification Report of Train Data-Logistic Regression
Table 11 : Classification Report of Test Data-Logistic Regression
Table 12 : Classification Report of Train Data-LDA
Table 13 : Classification Report of Test Data-LDA
Table 14 : Classification Report of Train Data-KNN
Table 15 : Classification Report of Test Data-KNN
Table 16 : Classification Report of Train Data-Naïve Bayes
Table 17 : Classification Report of Test Data-Naïve Bayes
Table 18 : Classification Report of Train Data-Bagging(RF)
Table 19 : Classification Report of Test Data-Bagging(RF)
Table 20 : Classification Report of Train Data-Ada Boost
Table 21 : Classification Report of Test Data-Ada Boost
Table 22 : Classification Report of Train Data-Gradient Boosting
Table 23 : Classification Report of Test Data-Gradient Boosting
Table 24 : Comparison Summary of Logistic Regression, LDA & KNN models
Table 25 : Comparison Summary of Naïve Bayes, Bagging, Ada Boost & Gradient Boosting models
Table 26 : Comparison Summary of LDA & Naïve Bayes after SMOTE
Table 27 : Comparing performances before and after SMOTE
Table 28 : Classification Report of Train Data-Final Model
Table 29 : Classification Report of Test Data-Final Model

Datasets Used

Dataset for Problem 1: Election_Data.xlsx


Dataset for Problem 2: Inaugural corpus from nltk

Problem 1 Data Modelling

Executive Summary
You are hired by CNBE, one of the leading news channels, which wants to analyse the recent elections. This
election dataset contains a survey that was conducted on 1525 voters with 9 variables.

Introduction 
The purpose of this exercise is to build a model that predicts which party a voter will vote for on the
basis of the given information, and to use it to create an exit poll that helps in predicting the overall win
and the seats covered by a particular party.

Data Description
System measures used:
Vote: Party choice: Conservative or Labour
age: in years
economic.cond.national: Assessment of current national economic conditions, 1 to 5.
economic.cond.household: Assessment of current household economic conditions, 1 to 5.
Blair: Assessment of the Labour leader, 1 to 5.
Hague: Assessment of the Conservative leader, 1 to 5.
Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores
represent ‘Eurosceptic’ sentiment.
political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.
gender: female or male.

1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it. 

Sample of the dataset:



Table 1 Dataset Sample

Shape of the dataset:

The data has 1525 rows and 10 columns.

Data Types:

Let us check the types of variables in the data frame.

Table 2 Data type table

Of the 10 columns, 8 are of integer type and the remaining 2 are of object data type.

Null Value Check:



Table 3 Null value check table

As seen in the above table, there are no null values in the dataset.

We will also drop the Unnamed column, as it carries no useful information for the analysis.

Let us also test whether any categorical attribute contains a "?" or a blank " ", since these sometimes
appear in place of missing values. As seen from the code output in the Jupyter notebook, there are no "?" or
" " values present in the dataset.
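The null and placeholder checks described above can be sketched as follows. This is only an illustration: the small data frame and its values are hypothetical stand-ins for Election_Data.xlsx, and the list of categorical columns is an assumption based on the data description.

```python
import pandas as pd

# Hypothetical miniature of the survey data. The real data would be loaded
# from Election_Data.xlsx (for example with pd.read_excel).
df = pd.DataFrame(
    {
        "vote": ["Labour", "Conservative", "Labour", "Labour"],
        "age": [43, 36, 35, 24],
        "Blair": [4, 4, 5, 2],
        "gender": ["female", "male", "male", "female"],
    }
)

# True nulls per column.
null_counts = df.isnull().sum()

# Placeholder strings ("?" or blank) in the categorical columns (assumed list).
categorical_cols = ["vote", "gender"]
placeholder_counts = {
    col: int(df[col].str.strip().isin(["?", ""]).sum()) for col in categorical_cols
}

print(df.shape)
print(null_counts)
print(placeholder_counts)
```

Stripping whitespace before the comparison means that a cell holding only spaces is also counted as a placeholder.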

Summary stats of Numerical Columns:

Table 4 Summary of Numerical Columns

From the above summary, we can infer that the average age of the voters is around 54 years, with a minimum
age of 24 years and a maximum age of 93 years. There is also a large difference between the 75th percentile
and the maximum value of the age column, which suggests that the age feature is slightly skewed to the
right and does not follow a normal distribution.

For all the other features, the difference between the 75th percentile and the maximum value is small, so
they show no comparable skew.

Summary stats of Categorical Columns:



Table 5 Summary of Categorical Columns

From the above summary, we can infer that the Labour party has received the larger number of votes and
that there are slightly more female voters than male voters.

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data
analysis. Check for Outliers.

Checking for duplicates :

Table 6 Duplicates table

There are 8 duplicates in the dataset, so let us remove them first and then do the analysis.

Univariate Analysis

Fig 1 age Histplot & Boxplot

The age feature is slightly right-skewed, so the data is not normally distributed, and most of the voters
lie within the 40 to 80 age group.

Fig 2 economic.cond.national Histplot & Boxplot

This feature has one outlier in the lower values. It does not show a continuous distribution because it is a
categorical feature coded as ordinal numbers from 1 to 5, as already mentioned in the data dictionary.

Fig 3 economic.cond.household Histplot & Boxplot

This feature also has one outlier in the lower values. It does not show a continuous distribution because it
is a categorical feature coded as ordinal numbers from 1 to 5, as already mentioned in the data dictionary.

Fig 4 Blair Histplot & Boxplot

This feature has no outliers. It does not show a continuous distribution because it is a categorical feature
coded as ordinal numbers from 1 to 5, as already mentioned in the data dictionary. Most of the assessment
grades for the Labour party leader lie between 2 and 4.

Fig 5 Hague Histplot & Boxplot

This feature also has no outliers. It does not show a continuous distribution because it is a categorical
feature coded as ordinal numbers from 1 to 5, as already mentioned in the data dictionary.

Fig 6 Europe Histplot & Boxplot

This feature represents an 11-point scale that measures respondents' attitudes toward European
integration. High scores represent ‘Eurosceptic’ sentiment. As seen in the above plots, more of the data
falls in the 6 to 10 score range, which indicates Eurosceptic sentiment.

Fig 7 political.knowledge Histplot & Boxplot

From the above plots, it can be inferred that most of the respondents’ knowledge of parties' positions on
European integration is quite low.

Fig 8 vote Countplot

Fig 9 gender Countplot

Bivariate Analysis

Fig 10 vote v/s age Stripplot

From the above plot we can infer that voters aged 85 and above have voted for the Conservative party.

Correlation Heatmap

Fig 11 Correlation Heatmap

There is hardly any correlation between any of the columns in this dataset.

As the columns are essentially uncorrelated, a multivariate analysis using scatterplots would add little
for this dataset.

Outlier Check

Fig 12 Numerical Columns with Outliers



Only 2 outliers can be seen, in the economic.cond.household and economic.cond.national columns. We
decide not to treat them because these columns hold ordinal values, whereas outlier treatment applies
only to continuous columns, and because there are very few outliers present in the dataset.

1.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and test
(70:30).
We will only encode the “gender” column using get_dummies, as all other categorical columns have ordinal
values and are of integer data type, so there is no need to encode them again.

The target column “vote” has two equally important classes but has object data type, so it needs to be
converted into integer data type before the models can be fitted. Hence we will use LabelEncoder, which
converts the target variable into numeric codes. LabelEncoder assigns the codes in alphabetical order of
the class names, so the Conservative party becomes 0 and the Labour party becomes 1. This conveniently
tags the Labour party, for which the larger number of votes were cast, as 1.
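The encoding step can be sketched as below. The rows are hypothetical; only the column names come from the data description, and the use of drop_first=True is an assumption made so that a single gender_male column remains.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; only the column names follow the data description.
df = pd.DataFrame(
    {
        "vote": ["Labour", "Conservative", "Labour", "Conservative"],
        "age": [43, 36, 35, 61],
        "gender": ["female", "male", "male", "female"],
    }
)

# One-hot encode gender; drop_first=True (an assumption) keeps only gender_male.
df = pd.get_dummies(df, columns=["gender"], drop_first=True)
df["gender_male"] = df["gender_male"].astype(int)

# LabelEncoder sorts class names alphabetically: Conservative -> 0, Labour -> 1.
encoder = LabelEncoder()
df["vote"] = encoder.fit_transform(df["vote"])

print(list(encoder.classes_))
print(df)
```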

Table 7 Sample dataset after Encoding

Let us check whether scaling is necessary for this dataset by analysing the mean, standard deviation
and variance of all numerical features:

Table 8 Five-Point Summary Before Scaling



We can observe that only the age and Europe features require scaling, as their mean, standard deviation and
variance are not on the same scale as those of the other features. Although the Europe column is ordinal
(an 11-point scale), it needs to be scaled because its range is different from that of the other columns.
Scaling is a necessity when using distance-based models such as KNN. It also helps stabilise the accuracy of
a model and makes training faster. Without scaling, the algorithm may be biased toward the features with
values higher in magnitude.

So we will use MinMaxScaler from the sklearn library to scale the age and Europe columns and re-check
the five-point summary after scaling, as below:

Table 9 Five-Point Summary After Scaling

Now the data is scaled. Let us proceed to split the data into train and test sets in a 70:30 ratio. Holding
out a test set lets us measure how the model performs on data it has not seen during training, which is how
we check for overfitting.

To split the data into train and test sets, we first need to separate the target and predictor variables into
two different objects, X and y, where X contains all the predictor variables and y contains the target
variable, which is “vote” in this dataset.

We will split the data 70:30 using train_test_split from the sklearn library.
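A minimal sketch of the scaling and splitting steps described above is given below. The data is randomly generated and purely illustrative; random_state=1 and stratify=y are assumptions, since the report does not state the values used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Randomly generated stand-in for the encoded election data.
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame(
    {
        "age": rng.integers(24, 94, size=n).astype(float),
        "Europe": rng.integers(1, 12, size=n).astype(float),
        "Blair": rng.integers(1, 6, size=n),
        "vote": rng.integers(0, 2, size=n),
    }
)

# Rescale only the two wide-ranging columns to the 0-1 interval.
scaler = MinMaxScaler()
df[["age", "Europe"]] = scaler.fit_transform(df[["age", "Europe"]])

# Separate predictors and target, then split 70:30.
X = df.drop(columns=["vote"])
y = df["vote"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y  # assumed settings
)

print(X_train.shape, X_test.shape)
```

The sketch follows the order used in this report (scale, then split). Fitting the scaler on the training portion only is a common alternative that keeps test-set information out of the preprocessing.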

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).

Logistic Regression

Now we will apply Logistic Regression on the train set and perform the predictions on the test set. Here I
will be using GridSearchCV to find the best hyperparameters for building the Logistic Regression model on
the train data set. The best parameters found are solver='newton-cg' (instead of the default 'lbfgs'),
max_iter=10000 (instead of the default 100), penalty='l2', C=1.0, class_weight='dict' and n_jobs=2, where
n_jobs is the number of CPU cores used when parallelising over classes. These settings are used to achieve
better accuracy when predicting on the test set.
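The grid search described above can be sketched as follows. The data comes from make_classification and is only a stand-in for the election data; the parameter grid, cv=5 and the scoring metric are assumptions, as the report lists only the selected values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded and scaled election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Assumed search grid; the report states only the chosen values.
param_grid = {
    "solver": ["lbfgs", "newton-cg"],
    "C": [0.1, 1.0, 10.0],
}
grid = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
print(grid.best_params_)
print(best_model.score(X_train, y_train), best_model.score(X_test, y_test))
```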

• Accuracy Score on Training Data: 0.84
• Accuracy Score on Testing Data: 0.83

Table 10 Classification Report of Train Data

Table 11 Classification Report of Test Data

Inferences:

For predicting votes for the Conservative party (Label 0):

• Precision (76%): of all the voters predicted to vote for the Conservative party, 76% actually voted for it.
• Recall (72%): of all the voters who actually voted for the Conservative party, 72% have been predicted
correctly.

For predicting votes for the Labour party (Label 1):

• Precision (86%): of all the voters predicted to vote for the Labour party, 86% actually voted for it.
• Recall (88%): of all the voters who actually voted for the Labour party, 88% have been predicted
correctly.

Overall accuracy of the model: 83% of total predictions are correct.
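As a worked illustration of how precision, recall and accuracy are computed, the sketch below uses hypothetical confusion-matrix counts. They are not the counts from this report's confusion matrix; the helper name precision_recall is also illustrative.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return precision, recall


# Hypothetical counts for Label 1 (not taken from the report's tables).
tp, fp, fn, tn = 88, 14, 12, 36

precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```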

Accuracy score and precision for the test data are almost in line with the training data. This suggests that
no overfitting or underfitting has happened. However, recall has reduced for Class 1 of the test data, which
is likely due to class imbalance in the data.

LDA (Linear Discriminant Analysis)

Now we will apply LDA on the train set and perform the predictions on the test set. Here also I will be using
GridSearchCV to find the best hyperparameters for building the LDA model on the train data set. The best
parameters found are solver='lsqr' (instead of the default 'svd') and shrinkage='auto' (instead of the
default None), which are used to achieve better accuracy when predicting on the test set.
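A minimal sketch of fitting LDA with the reported settings is shown below. The data is synthetic (make_classification) and stands in for the election data; the grid search itself is omitted and the reported parameters are passed directly.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded and scaled election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Parameters as reported: least-squares solver with automatic shrinkage.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda.fit(X_train, y_train)

# The intercept and coefficients define the linear discriminant function.
print(lda.intercept_)
print(lda.coef_)
print(lda.score(X_train, y_train), lda.score(X_test, y_test))
```

For a two-class problem, intercept_ holds one value and coef_ holds one coefficient per predictor, which is how a discriminant function of the form intercept + sum(coefficient * feature) can be written out.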

• Accuracy Score on Training Data: 0.83
• Accuracy Score on Testing Data: 0.84

Table 12 Classification Report of Train Data

Table 13 Classification Report of Test Data

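Inferences:

Linear Discriminant Function = 1.77 + (-1.34 * age) + (0.63 * economic.cond.national)
+ (0.08 * economic.cond.household) + (0.77 * Blair) + (-0.94 * Hague) + (-2.39 * Europe)
+ (-0.43 * political.knowledge) + (0.12 * gender_male)

From the coefficients of the above equation:

• The predictor ‘Europe’ has the coefficient with the largest magnitude (2.39), so it contributes the most to
separating the classes.
• The predictor ‘economic.cond.household’ has the coefficient with the smallest magnitude (0.08), so it
contributes the least.

Because age and Europe were rescaled while the other predictors were not, these magnitudes are only a
rough guide to the importance of each predictor.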

For predicting votes for the Conservative party (Label 0):

• Precision (76%): of all the voters predicted to vote for the Conservative party, 76% actually voted for it.
• Recall (74%): of all the voters who actually voted for the Conservative party, 74% have been predicted
correctly.

For predicting votes for the Labour party (Label 1):

• Precision (87%): of all the voters predicted to vote for the Labour party, 87% actually voted for it.
• Recall (88%): of all the voters who actually voted for the Labour party, 88% have been predicted
correctly.

Overall accuracy of the model: 84% of total predictions are correct.

Accuracy score and precision for the test data are almost in line with the training data. This suggests that
no overfitting or underfitting has happened. However, recall has reduced for Class 1 of the test data, which
is likely due to class imbalance in the data.

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. 

KNN Model

We will apply KNeighborsClassifier on the training set and evaluate the model performance on the test set.
First we will use the default value, n_neighbors=5, which gives the scores below:

• Accuracy Score on Training Data: 0.85
• Accuracy Score on Testing Data: 0.81

Now let us run KNN with the number of neighbours set to K = 1, 3, 5, ..., 19 and find the optimal number of
neighbours using the misclassification error.
Note: Misclassification error (MCE) = 1 - test accuracy score. We calculate the MCE for each model with
K = 1, 3, 5, ..., 19 and pick the model with the lowest MCE.
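The search over K described in the note above can be sketched as below. The data is synthetic and only illustrative, so the best K it finds has no bearing on the K=9 result reported for the election data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded and scaled election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Misclassification error (1 - test accuracy) for K = 1, 3, 5, ..., 19.
mce_by_k = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    mce_by_k[k] = 1 - knn.score(X_test, y_test)

# K with the lowest MCE (the smallest such K if several tie).
best_k = min(mce_by_k, key=mce_by_k.get)
print(best_k, mce_by_k[best_k])
```

Only odd values of K are tried, which avoids tied votes between the two classes.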

Plotting the misclassification error against K (with K on the X-axis) gives the plot below:


Fig 13 MCE v/s K-Neighbours Plot

As seen in the above graph, K=9 gives the lowest MCE of approximately 0.16, so we will build the model with
K=9 and check its performance.

• Accuracy Score on Training Data: 0.85
• Accuracy Score on Testing Data: 0.83

Table 14 Classification Report of Train Data

Table 15 Classification Report of Test Data

As the difference between train and test accuracies is less than 10% (1.4%), it is a valid model.

Overall accuracy of the model: 83% of total predictions are correct.

Accuracy score and precision for the test data are almost in line with the training data. This suggests that
no overfitting or underfitting has happened. However, recall has reduced for Class 1 of the test data, which
is likely due to class imbalance in the data. Because of this imbalance, the accuracy score alone will not be
used as the measure for evaluating the model. Our main goal here is to reduce the Type 2 error, i.e. false
negatives.

Naïve Bayes Model

When calculating the likelihoods of numerical features, the Gaussian Naïve Bayes algorithm assumes each
feature to be normally distributed and computes the probability using only the mean and variance of that
feature. It also assumes that all the predictors are independent of each other. Hence, there are no
hyperparameters as such that need to be tuned to optimise this model.

We will apply the GaussianNB classifier on the train set and check the predictions on the test set:

• Accuracy Score on Training Data: 0.84
• Accuracy Score on Testing Data: 0.82

Let us check the model's validity by analysing the cross-validation scores on the train and test sets.

After 10-fold cross-validation, the scores on the train and test data sets are almost the same across all
10 folds. Hence our model is valid.
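The 10-fold cross-validation check can be sketched as follows, again on synthetic stand-in data. Running cross_val_score separately on the train and test sets mirrors the description above; it is a sketch of that check, not the notebook's exact code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the encoded and scaled election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

nb = GaussianNB()
nb.fit(X_train, y_train)

# 10-fold cross-validation accuracy on each set.
train_cv = cross_val_score(nb, X_train, y_train, cv=10)
test_cv = cross_val_score(nb, X_test, y_test, cv=10)

print(train_cv.mean(), test_cv.mean())
```

Fold scores that stay close to each other and to the overall accuracy are what the validity check looks for.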

• Train Score: 0.83
• Test Score: 0.83

Table 16 Classification Report of Train Data

Table 17 Classification Report of Test Data

Overall accuracy of the model: 82% of total predictions are correct.

Accuracy score and precision for the test data are almost in line with the training data. This suggests that
no overfitting or underfitting has happened. However, recall has reduced for Class 1 of the test data, which
is likely due to class imbalance in the data.

1.6 Model Tuning, Bagging (Random Forest should be applied for
Bagging), and Boosting.

Model tuning (hyperparameters) has already been covered above for the Logistic Regression, LDA, KNN and
Naïve Bayes models.

Bagging (Using Random Forest as classifier)

We create a Bagging model on the train data using a Random Forest classifier as the base estimator, with
n_estimators=100 and random_state=1 as the hyperparameters, and check its performance on the test dataset.
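A sketch of the Bagging model described above is given below on synthetic stand-in data. The inner Random Forest is given only 10 trees here, an assumption made purely to keep the sketch quick to run; the report does not state the forest's own settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded and scaled election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Random Forest as the base estimator (passed positionally), bagged 100 times.
# The 10-tree forest is an assumption for speed, not the report's setting.
bagging = BaggingClassifier(
    RandomForestClassifier(n_estimators=10, random_state=1),
    n_estimators=100,
    random_state=1,
)
bagging.fit(X_train, y_train)

print(bagging.score(X_train, y_train), bagging.score(X_test, y_test))
```

A large gap between the two printed scores is the usual sign that such an ensemble has fitted the training data too closely.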

• Accuracy Score on Training Data: 0.97
• Accuracy Score on Testing Data: 0.83

Table 18 Classification Report of Train Data

Table 19 Classification Report of Test Data

Overall accuracy of the model: 83% of total predictions are correct.

Accuracy score and precision for the test data are not in line with the training data: the training accuracy
is 0.97 against 0.83 on the test data. This indicates that the model has overfitted the training data. In
addition, recall has reduced for Class 1 of the test data, which is likely due to class imbalance in the data.

Ada Boost

We create an Ada Boost model using AdaBoostClassifier from the sklearn ensemble module, with the parameters
suggested by GridSearchCV: learning_rate=1 and n_estimators=10. We get the below accuracy scores:

• Accuracy Score on Training Data: 0.84
• Accuracy Score on Testing Data: 0.82

Table 20 Classification Report of Train Data

Table 21 Classification Report of Test Data

Overall accuracy of the model: 82% of total predictions are correct.

Accuracy score and precision for the test data are almost in line with the training data. This suggests that
no overfitting or underfitting has happened. However, recall has reduced for Class 1 of the test data, which
is likely due to class imbalance in the data.

Gradient Boosting

We create a Gradient Boosting model using GradientBoostingClassifier from the sklearn ensemble module, with
the parameters suggested by GridSearchCV: learning_rate=0.5 and n_estimators=12. We get the below
accuracy scores:
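The two boosting models can be sketched as below, using the reported learning rates and numbers of estimators directly. The data is synthetic, the grid search is omitted, and random_state=1 is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded and scaled election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Reported values: learning rate 1 with 10 estimators for Ada Boost,
# learning rate 0.5 with 12 estimators for Gradient Boosting.
ada = AdaBoostClassifier(n_estimators=10, learning_rate=1.0, random_state=1)
gb = GradientBoostingClassifier(n_estimators=12, learning_rate=0.5, random_state=1)

for name, model in [("Ada Boost", ada), ("Gradient Boosting", gb)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```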

• Accuracy Score on Training Data: 0.89
• Accuracy Score on Testing Data: 0.84

Table 22 Classification Report of Train Data

Table 23 Classification Report of Test Data

Overall accuracy of the model: 84% of total predictions are correct.

Accuracy score and precision for the test data are not fully in line with the training data (0.89 on train
against 0.84 on test), which indicates that some overfitting has happened. However, the difference between
training and test scores is within the industry standards, so it can be accepted as a valid model. Recall
has also reduced for Class 1 of the test data, which is likely due to class imbalance in the data.

1.7 Performance Metrics: Check the performance of Predictions on Train
and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model. Final Model: Compare the models and
write inference which model is best/optimized.
Let us check and compare the performance of predictions on the train and test sets using Accuracy,
Confusion Matrix and ROC_AUC scores of all the models, and find out the best model among all.
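The metrics used in this comparison can be computed as in the sketch below. The helper name evaluate, the Logistic Regression model and the synthetic data are illustrative assumptions; the same helper would apply to any fitted classifier that supports predict_proba. Plotting of the ROC curve is left out.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded and scaled election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def evaluate(fitted_model, features, labels):
    """Collect accuracy, confusion matrix, ROC_AUC and ROC points for one data set."""
    predictions = fitted_model.predict(features)
    probabilities = fitted_model.predict_proba(features)[:, 1]  # P(class 1)
    fpr, tpr, _ = roc_curve(labels, probabilities)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "confusion_matrix": confusion_matrix(labels, predictions),
        "roc_auc": roc_auc_score(labels, probabilities),
        "roc_points": (fpr, tpr),
    }


train_report = evaluate(model, X_train, y_train)
test_report = evaluate(model, X_test, y_test)
print(train_report["accuracy"], test_report["accuracy"])
print(test_report["confusion_matrix"])
print(train_report["roc_auc"], test_report["roc_auc"])
```

The fpr and tpr arrays in roc_points are the coordinates that a ROC curve plot would draw.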

1. Logistic Regression

• Accuracy Score on Training Data: 0.84
• Accuracy Score on Testing Data: 0.83

Fig 14 Confusion Matrix of Train and Test sets

Fig 15 ROC_AUC score and ROC curve of Train and Test Sets

2. LDA (Linear Discriminant Analysis)

• Accuracy Score on Training Data: 0.83
• Accuracy Score on Testing Data: 0.84

Fig 16 Confusion Matrix of Train and Test sets

Fig 17 ROC_AUC score and ROC curve of Train and Test Sets

3. KNN Model

• Accuracy Score on Training Data: 0.85
• Accuracy Score on Testing Data: 0.83

Fig 18 Confusion Matrix of Train and Test sets

Fig 19 ROC_AUC score and ROC curve of Train and Test Sets

4. Naïve Bayes Model

• Accuracy Score on Training Data: 0.84
• Accuracy Score on Testing Data: 0.82

Fig 20 Confusion Matrix of Train and Test sets

Fig 21 ROC_AUC score and ROC curve of Train and Test Sets

5. Bagging (Using Random Forest as classifier)

• Accuracy Score on Training Data: 0.97
• Accuracy Score on Testing Data: 0.83

Fig 22 Confusion Matrix of Train and Test sets

Fig 23 ROC_AUC score and ROC curve of Train and Test Sets

6. Ada Boost

• Accuracy Score on Training Data: 0.84
• Accuracy Score on Testing Data: 0.82

Fig 24 Confusion Matrix of Train and Test sets

Fig 25 ROC_AUC score and ROC curve of Train and Test Sets

7. Gradient Boosting

• Accuracy Score on Training Data: 0.89
• Accuracy Score on Testing Data: 0.84

Fig 26 Confusion Matrix of Train and Test sets

Fig 27 ROC_AUC score and ROC curve of Train and Test Sets

Let us quickly compare all the performance metrics of the above seven models and find out the best model
among all:

Table 24 Comparison Summary of Logistic Regression, LDA & KNN models



Table 25 Comparison Summary of Naïve Bayes, Bagging , Ada Boost & Gradient Boosting models

From the above summary, we can draw the below inferences:

• Train and test accuracy and Class 1 precision are almost in line for the Logistic Regression, LDA, KNN,
Naïve Bayes and Ada Boost models, which indicates that no overfitting or underfitting has happened.
• Train and test ROC_AUC scores are almost in line for the Logistic Regression, LDA, Naïve Bayes and
Ada Boost models.
• Train and test recall scores are almost in line for the LDA, KNN and Naïve Bayes models.

So, overall we can infer that LDA and Naïve Bayes are the most optimised of all the above models. But as
there was class imbalance in the data, we will apply SMOTE to these two models, i.e. LDA and Naïve Bayes,
to check whether their performance improves.
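To illustrate the idea behind SMOTE, the sketch below generates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified NumPy illustration, not the implementation used in this report (a library such as imbalanced-learn would normally be used); the function name smote_like_oversample and all of its parameters are hypothetical.

```python
import numpy as np


def smote_like_oversample(minority, n_new, k=5, seed=1):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of every sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its neighbours for each synthetic row.
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic point somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[picked] - minority[base])


# Hypothetical minority-class feature matrix (20 rows, 3 features).
X_minority = np.random.default_rng(0).normal(size=(20, 3))
synthetic = smote_like_oversample(X_minority, n_new=30)
print(synthetic.shape)
```

Oversampling of this kind is normally applied to the training set only, so that the test set keeps the original class balance.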

Table 26 Comparison Summary of LDA & Naïve Bayes after SMOTE

Table 27 Comparing performances before and after SMOTE

We can conclude that the Naïve Bayes model performs slightly better than LDA after SMOTE. For Naïve Bayes,
accuracy has remained constant and the ROC_AUC score has reduced, while recall and precision have improved
only for Class 1.

So we can infer that there is not much improvement in the models after applying SMOTE. Hence the Naïve
Bayes model before applying SMOTE is the best and most optimised model among all.

The final model is Naïve Bayes, with the below performance metrics:

• Accuracy Score on Training Data: 0.84
• Accuracy Score on Testing Data: 0.82

Fig 28 Confusion Matrix of Train and Test sets

Fig 29 ROC_AUC score and ROC curve of Train and Test Sets

Table 28 Classification Report of Train Data

Table 29 Classification Report of Test Data

For predicting votes for the Conservative party (Label 0):

• Precision (74%): of all the voters predicted to vote for the Conservative party, 74% actually voted for it.
• Recall (73%): of all the voters who actually voted for the Conservative party, 73% have been predicted
correctly.

For predicting votes for the Labour party (Label 1):

• Precision (87%): of all the voters predicted to vote for the Labour party, 87% actually voted for it.
• Recall (87%): of all the voters who actually voted for the Labour party, 87% have been predicted
correctly.

Overall accuracy of the model: 82% of total predictions are correct, and the AUC score is also quite good,
which means the model is able to distinguish well between the two classes.

Accuracy score and precision for the test data are almost in line with the training data. This suggests that
no overfitting or underfitting has happened. So overall it is a good and optimised model.

1.8 Based on these predictions, what are the insights?

The steps above are summarised below:

 Analysed the dataset thoroughly through EDA to examine the variables and their relationships with each
other, pre-processed the data to treat the duplicates found, and checked for outliers.
 Encoded the categorical variable, split the data into train and test sets (70:30) and scaled the columns
which were on different scales.
 Created different models and tuned their hyperparameters using GridSearchCV, cross-validation and
misclassification error.
 Analysed and compared performance metrics such as accuracy, precision, recall, ROC_AUC score
and confusion matrix for all the models to find the best, most optimised model.
 Applied SMOTE to the two best models to correct the class imbalance, generating synthetic samples
for the minority class, and compared the models' performance metrics before and after applying
SMOTE.
 Chose the final model based on the above analysis and commented on that model’s performance metrics.
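
The idea behind the SMOTE step can be sketched as follows. This is a simplified illustration and not the implementation used in the report: it interpolates between a minority-class point and its single nearest minority neighbour, whereas full SMOTE picks at random among several nearest neighbours. The function name `smote_like_samples` and the data points are hypothetical.

```python
import random


def smote_like_samples(minority, n_new, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and its nearest minority neighbour (simplified SMOTE idea, k=1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest other minority point by squared Euclidean distance.
        neighbour = min(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority-class points.
minority_points = [(1.0, 2.0), (1.5, 2.5), (3.0, 1.0)]
new_points = smote_like_samples(minority_points, n_new=4)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority points, the minority class grows without exact duplicates being added.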

Based on the predictions of the final Naïve Bayes model, the following business insights can be drawn:
 The model predicts that most voters will vote for the Labour party, so its chances of winning the election
are considerably higher than those of the Conservative party.
 The exit poll therefore indicates that the Labour party will get more votes, and about 82% of the model's
predictions are correct.

END OF PROBLEM 1

Problem 2 Text Mining

Introduction 
In this project, we work with the inaugural corpus from NLTK in Python. We will look at the following
inaugural speeches of Presidents of the United States of America:

1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the
mentioned documents.

After importing the three speeches from the NLTK library, let's check the number of characters, words and
sentences in each of them:

2.2 Remove all the stopwords from all three speeches.

Before removing stopwords, let's pre-process (clean) the text of each of the three speeches using the
steps below:

 First, convert all the speech text to lowercase.
 Then remove all special characters using the re.sub() function, which substitutes strings matched by a
regular expression.
 Next, tokenize the text, i.e. split it into individual words.
 Remove the stopwords, i.e. common words that carry little meaning on their own.
 Finally, stem each word to its root form using PorterStemmer.
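
The steps above can be sketched as a single function. This is a minimal standard-library sketch, not the report's code: the stopword set is a small hypothetical sample rather than NLTK's English list, the regular expression is an assumption, and stemming is left as an optional callable so that the example runs without NLTK installed.

```python
import re

# Small hypothetical stopword set; the report uses NLTK's English stopword list.
SAMPLE_STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "we", "our"}


def preprocess(text, stopwords=SAMPLE_STOPWORDS, stem=None):
    text = text.lower()                                   # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                 # 2. replace special characters with spaces
    tokens = text.split()                                 # 3. tokenize on whitespace
    tokens = [t for t in tokens if t not in stopwords]    # 4. drop stopwords
    if stem is not None:                                  # 5. optional stemming
        tokens = [stem(t) for t in tokens]
    return tokens


sample = "We renew our faith in the Nation, and in the people!"
cleaned_tokens = preprocess(sample)
print(cleaned_tokens)  # → ['renew', 'faith', 'nation', 'people']
```

With NLTK available, passing `stem=nltk.stem.PorterStemmer().stem` applies Porter stemming as the last step.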

1. Speech of President Franklin D. Roosevelt in 1941: Checking the word count before and after
removal of stopwords in this speech text and displaying a sample sentence after removal of stopwords.

Word count before removal of stopwords : 1348


Word count after removal of stopwords : 625

2. Speech of President John F. Kennedy in 1961: Checking the word count before and after removal of
stopwords in this speech text and displaying a sample sentence after removal of stopwords.

Word count before removal of stopwords : 1371


Word count after removal of stopwords : 688

3. Speech of President Richard Nixon in 1973: Checking the word count before and after removal of
stopwords in this speech text and displaying a sample sentence after removal of stopwords.

Word count before removal of stopwords : 1819


Word count after removal of stopwords : 833

Note: The word count in question 2.1 differs from the word count before removal of stopwords in question 2.2
because .words() also counts punctuation marks as tokens, whereas the cleaned text contains only words.
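
One way a raw token count can exceed the cleaned word count is that a tokenizer may treat each punctuation mark as a separate token. A small illustration on a hypothetical sentence is given below; the regular expressions stand in for the corpus tokenizer and the cleaning step and are assumptions, not the report's code.

```python
import re

sample = "Let us begin. Let us go forward, together!"

# Tokenizer that keeps punctuation marks as separate tokens.
with_punct = re.findall(r"\w+|[^\w\s]", sample)

# Cleaned text: lowercase, letters only, then split on whitespace.
cleaned = re.sub(r"[^a-z\s]", " ", sample.lower()).split()

print(len(with_punct), len(cleaned))  # → 11 8
```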

2.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)

We can find the words that occur most often using nltk.FreqDist().

1. Speech of President Franklin D. Roosevelt in 1941: In this speech, the following three words occur
most often:
Nation : 17 times
Know : 10 times
Peopl : 9 times

2. Speech of President John F. Kennedy in 1961: In this speech, the following three words occur
most often:
Let : 16 times
Us : 12 times
Power : 9 times

3. Speech of President Richard Nixon in 1973: In this speech, the following three words occur most
often:
Us : 26 times
Let : 22 times
America : 21 times
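
The top-three lookup can be sketched with the standard library. nltk.FreqDist builds on collections.Counter, so `most_common(3)` works the same way on both; the token list below is hypothetical and much shorter than a real speech.

```python
from collections import Counter

# Hypothetical cleaned tokens; in the report these come from the pre-processed speech text.
tokens = ["let", "us", "let", "power", "us", "let"]

freq = Counter(tokens)
top_three = freq.most_common(3)
print(top_three)  # → [('let', 3), ('us', 2), ('power', 1)]
```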

2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords)

Now we will plot a word cloud of the most used words in each of the three speeches using WordCloud from
the wordcloud library (rendered with matplotlib).

Fig 30 WordCloud of President Franklin D. Roosevelt’s Speech 1941



Fig 31 WordCloud of President John F. Kennedy’s Speech 1961

Fig 32 WordCloud of President Richard Nixon’s Speech 1973

END OF PROBLEM 2
