
Machine Learning

Sakchi Saraf
PGP-DSBA Online, May '22
Date: 16/11/22

Table of Contents
Problem 1
    Executive Summary
    Introduction
    Data Description
    Sample of the Dataset
    Exploratory Data Analysis
        Data Type
        Describe
        Boxplot
        Histogram
        Strip Plot
        Pairplot
        Correlation
    Model Analysis
        Train and Test Data
        Logistic Regression Model
        Linear Discriminant Analysis
        Gaussian Naïve Bayes
        KNN Model
        AdaBoost Classifier
        Decision Tree
        Random Forest
    Insights
Problem 2
    Introduction
    Data Description
    Model Analysis
THE END!

List of Tables
1    Sample of the dataset
2    Data type
3    Describe
4    Sample of the dataset after dummy variables
5.1  Probability table
5.2  Confusion matrix
6.1  Probability table
6.2  Confusion matrix
7    Confusion matrix
8.1  Confusion matrix with k = 5
8.2  Confusion matrix with k = 17
9    Confusion matrix
10   Confusion matrix
11   Confusion matrix
12   Comparison table

List of Graphs and Pictures
1    Boxplot
2    Histogram
3    Strip plot
4    Pairplot
5    Correlation
6.1  Confusion matrix
6.2  Classification report
6.3  ROC & AUC
7.1  Confusion matrix
7.2  Classification report
7.3  ROC & AUC
7.4  Confusion matrix & classification report
8.1  Confusion matrix
8.2  Classification report
8.3  ROC & AUC
9.1  Confusion matrix
9.2  Classification report
9.3  Misclassification report
9.4  Confusion matrix
9.5  Classification report
9.6  ROC & AUC
10.1 Confusion matrix & classification report
10.2 ROC & AUC
11.1 Confusion matrix & classification report
11.2 ROC & AUC
12.1 Confusion matrix & classification report
12.2 ROC & AUC

Problem 1
You are hired by CNBE, one of the leading news channels, which wants to analyse the recent elections. A survey
was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will
vote for on the basis of the given information, in order to create an exit poll that will help predict the overall
win and the seats covered by a particular party.

Executive Summary
An exit poll is a poll of voters taken immediately after they have exited the polling stations. Exit polls are
used as a check against, and a rough indicator of, the actual result. From this dataset we will likewise predict
which party received each voter's vote, based on the responses the voters gave.

Introduction
The purpose of this exercise is to draw inferences from the dataset. We will build several different models
on the data and compare them. The data consists of the votes of 1525 voters and 8 different attributes
that may determine the vote.

Data Description
1. Vote - Party choice: Conservative or Labour
2. Age - in years
3. economic.cond.national - Assessment of current national economic conditions, 1 to 5.
4. economic.cond.household - Assessment of current household economic conditions, 1 to 5.
5. Blair - Assessment of the Labour leader, 1 to 5.
6. Hague - Assessment of the Conservative leader, 1 to 5.
7. Europe - an 11-point scale that measures respondents' attitudes toward European integration. High
scores represent ‘Eurosceptic’ sentiment.
8. political.knowledge - Knowledge of parties' positions on European integration, 0 to 3.
9. Gender - female or male.

Sample of the dataset

The dataset has 1 dependent variable, "Vote", and 8 independent variables, each with its own set of values.
Based on these variables, each person's vote will be predicted.

Exploratory Data Analysis
Data type in the data frame:

 There are 1525 rows and 9 columns in the dataset.
 The dataset has 7 numerical variables; the remaining 2 are categorical.
 Machine learning models require numerical input, so the object data types were converted to numerical ones.
 There are no missing values in the dataset.
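A minimal sketch of these checks, assuming the survey data sits in a CSV file (the file name is an assumption):

    import pandas as pd

    # Load the survey data; the file name is assumed for illustration
    df = pd.read_csv("Election_Data.csv")

    print(df.shape)           # expected: (1525, 9)
    print(df.dtypes)          # 7 numerical columns, 2 object columns
    print(df.isnull().sum())  # expected: no missing values in any column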

Describe

From the above table we can see that there are no missing values and no bad data in any variable. The
ranges of the numerical variables are not very different from each other, except for Age. For the categorical
variables, "Labour" and "female" are the most frequent values of Vote and Gender respectively.

Boxplot

The left boxplot shows the spread of each variable, and the right boxplot shows the spread of the same
variable split by the levels of the dependent variable, Vote. Observations are as follows:
1. Out of 7 numerical variables, only 2 have outliers. All of them are plausible values, but our models are
sensitive to outliers, so we will treat them.
2. Attributes like Blair and Hague have no outliers as a whole, but when we split by Labour and
Conservative, outliers appear.
3. Conservative voters have stronger 'Eurosceptic' sentiment than Labour voters.
4. Labour voters rate the national and household economic conditions better than Conservative voters do.

Histogram

The first 9 graphs show the distribution of each attribute, and the next 9 show the same distributions split
by the two groups of voters. From the histograms we can infer the following:
1. Vote - Labour is preferred by most of the voters.
2. Age - Approximately normally distributed, with 24 the minimum and 93 the maximum. Labour voters
are somewhat younger than Conservative voters.
3. Economic condition of the nation - Mostly rated at the average value of 3; Labour voters respond more
positively than Conservative voters.
4. Economic condition of the household - Mostly rated at the average value of 3; Labour voters respond
more positively than Conservative voters.
5. Conservative voters have stronger 'Eurosceptic' sentiment than Labour voters.

Strip Plot

The strip plot shows the distribution of the variables in the form of a scatterplot. The blue dots represent
"Labour" and the orange dots represent "Conservative". The strip plot does not reveal any clear separation
between the two classes.

Pairplot

The pairplot shows the pairwise relationships between the variables as scatterplots and each variable's
distribution as a histogram. The orange dots represent "Labour" and the blue dots represent "Conservative".
From the scatterplots, there is no clear linear relationship between any pair of variables.

Correlation Plot

The correlation plot shows the pairwise correlations between the attributes. Values close to +1 or -1
indicate strong positive or negative correlation respectively, while values close to 0 indicate little or no
correlation.

From the plot we can see that the variables are only weakly correlated with each other. The strongest
correlation, 0.35, is between the economic condition of the nation and the economic condition of the
household.
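A sketch of how such a correlation heatmap can be drawn with seaborn, reusing the df from the loading sketch above:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Heatmap of pairwise correlations between the numerical attributes
    plt.figure(figsize=(8, 6))
    sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Correlation plot")
    plt.show()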

Model Analysis
Train and Test data

The dataset has been divided into two parts, X and Y, X being the set of independent variables and Y the
dependent variable, and then split into train and test sets in a 70:30 ratio. Before the split, the categorical
string variables were converted to numerical dummy variables, dropping the first dummy column created
for each variable. Scaling is not required, as it is not mandatory for most of the models we are going to build.

Below is a sample of the dataset after creating the dummy variables:

 For Vote - the dummy variable encodes Labour as 1 and Conservative as 0
 For Gender - the dummy variable encodes female as 1 and male as 0
The table below shows the class breakdown of Vote:

Vote                 Original dataset   Train data   Test data
Conservative (0)     30.32%             30.35%       30.26%
Labour (1)           69.68%             69.65%       69.74%

The split looks fine, as the class proportions in the train and test data are very close to those of the
original dataset.
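A sketch of the encoding and split described above; the lowercase column names and the random_state value are assumptions:

    from sklearn.model_selection import train_test_split

    # Explicit 0/1 coding matching the report: Labour -> 1, Conservative -> 0,
    # female -> 1, male -> 0 (equivalent to dummy variables with drop-first)
    df["vote"] = df["vote"].map({"Labour": 1, "Conservative": 0})
    df["gender"] = df["gender"].map({"female": 1, "male": 0})

    X = df.drop("vote", axis=1)  # 8 independent variables
    y = df["vote"]               # dependent variable

    # 70:30 split; the random_state value is assumed for illustration
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)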

Logistic Regression Model

For the Logistic Regression model we tuned the following parameters:

1. penalty (l2, none, l1): the type of regularization, which shrinks the coefficients towards zero
2. solver (newton-cg, sag, saga): the algorithm used for the optimization problem; 'sag' and 'saga' are
faster for larger datasets
3. tol (0.0001, 0.00001): the tolerance for the stopping criterion of the solver
4. multi_class (multinomial, auto, ovr): how classification tasks with more than two class labels are handled

The grid search returned the following best parameters:

1. penalty: none
2. solver: saga
3. tol: 0.00001
4. multi_class: auto
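A sketch of the grid search over these parameters; cv, scoring and max_iter are assumptions, and error_score=0 simply gives the penalty/solver combinations scikit-learn does not support a score of 0 instead of raising an error:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "penalty": ["l2", "none", "l1"],  # "none" follows older scikit-learn; newer versions use penalty=None
        "solver": ["newton-cg", "sag", "saga"],
        "tol": [0.0001, 0.00001],
        "multi_class": ["multinomial", "auto", "ovr"],
    }

    # error_score=0 skips incompatible penalty/solver combinations instead of raising
    grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid,
                        cv=3, scoring="accuracy", error_score=0)
    grid.fit(X_train, y_train)
    print(grid.best_params_)  # reported best: penalty=none, solver=saga, tol=1e-05, multi_class=auto

    best_lr = grid.best_estimator_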

The table shows the predicted probability of each voter's vote, where '0' is
"Conservative" and '1' is "Labour". The default cut-off is 50%.

We can see that 4 out of the 5 sampled voters are predicted to have voted
"Labour", as their probability for class 1 is above 50%, while the second voter
is predicted to have voted "Conservative".

Confusion matrix:

Train Data Test Data

Confusion Matrix   Train Data   Test Data   Meaning
True Negative      215          94          Voted Conservative; model predicted Conservative
False Negative     75           21          Voted Labour; model predicted Conservative
False Positive     107          44          Voted Conservative; model predicted Labour
True Positive      663          297         Voted Labour; model predicted Labour

Classification report:

Train Data Test Data

Inference from the model:
1. In this case both False Negatives and False Positives are harmful, as every wrong prediction of a vote
distorts the exit poll.
2. Accuracy is good on both the train and the test data, at 83% and 86% respectively; the model
performed better on the test data.
3. Recall for the Labour votes is better than the recall for the Conservative votes; here too the model
performed better on the test data.

ROC_AUC score and ROC curve:

Train Data Test Data

The ROC AUC scores for the train and test data are 0.877 and 0.914 respectively, which shows that the
model performed well, as the two scores are within about 0.05 of each other. The areas under the two
curves also look similar.
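The same evaluation pattern is repeated for every model in this report; a sketch, using the fitted best_lr from the grid-search sketch above:

    from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

    # Confusion matrix, classification report and ROC AUC on train and test data
    for name, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
        pred = best_lr.predict(X_)
        proba = best_lr.predict_proba(X_)[:, 1]  # probability of class 1 (Labour)
        print(name)
        print(confusion_matrix(y_, pred))
        print(classification_report(y_, pred))
        print("ROC AUC:", roc_auc_score(y_, proba))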

Linear Discriminant Analysis

The first step was to fit the Linear Discriminant Analysis model to the train set, followed by predicting on
the test data.
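A minimal sketch of this step, assuming default LDA parameters:

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Fit LDA on the train data, then predict classes and probabilities on the test data
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train, y_train)
    y_pred = lda.predict(X_test)
    proba = lda.predict_proba(X_test)[:, 1]  # probability of class 1 (Labour)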

The table shows the predicted probability of each voter's vote, where '0' is
"Conservative" and '1' is "Labour". The default cut-off is 50%.

We can see that all 5 sampled voters are predicted to have voted "Labour", as
their probability for class 1 is above 50%.

Confusion matrix:

Train Data Test Data

Confusion Matrix   Train Data   Test Data   Meaning
True Negative      219          97          Voted Conservative; model predicted Conservative
False Negative     85           23          Voted Labour; model predicted Conservative
False Positive     103          41          Voted Conservative; model predicted Labour
True Positive      654          295         Voted Labour; model predicted Labour

Classification report:

Train Data Test Data

Inference from the model:

1. In this case both False Negatives and False Positives are harmful, as every wrong prediction of a vote
distorts the exit poll.
2. Accuracy is good on both the train and the test data, at 82% and 86% respectively; the model
performed better on the test data.
3. Recall for the Labour votes is better than the recall for the Conservative votes; here too the model
performed better on the test data.

ROC_AUC score and ROC curve:

Train Data Test Data

The ROC AUC scores for the train and test data, 0.877 and 0.915 respectively, show that the model
performed well, as the two scores are within about 0.05 of each other. The areas under the two curves also
look similar.

Changing the cut-off manually:

The default cut-off for the predicted probability is 50%, where '0' is Conservative and '1' is Labour. We will
now vary the cut-off to look for a value at which the model's predictions improve, as in the sketch below.
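A sketch of such a cut-off sweep, reusing the fitted lda from the previous sketch:

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score

    # Apply cut-offs from 10% to 90% to the predicted probabilities of class 1
    proba_test = lda.predict_proba(X_test)[:, 1]
    for cutoff in np.arange(0.1, 1.0, 0.1):
        pred = (proba_test > cutoff).astype(int)
        print(f"cut-off {cutoff:.1f}: accuracy={accuracy_score(y_test, pred):.3f}, "
              f"F1={f1_score(y_test, pred):.3f}")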

We tried cut-offs from 10% to 90% and got the best accuracy and F1 score at a cut-off of 40%. Below
are the confusion matrix and classification report for the test data at the 40% cut-off:

Confusion matrix: Classification Report:

Gaussian Naïve Bayes

Confusion matrix:

Train Data Test Data

Confusion Matrix   Train Data   Test Data   Meaning
True Negative      223          100         Voted Conservative; model predicted Conservative
False Negative     92           28          Voted Labour; model predicted Conservative
False Positive     99           38          Voted Conservative; model predicted Labour
True Positive      647          290         Voted Labour; model predicted Labour

Classification report:

Train Data Test Data

Inference from the model:

1. In this case both False Negatives and False Positives are harmful, as every wrong prediction of a vote
distorts the exit poll.
2. Accuracy is good on both the train and the test data, at 82% and 86% respectively; the model
performed better on the test data.
3. Recall for the Labour votes is better than the recall for the Conservative votes; here too the model
performed better on the test data.

ROC_AUC score and ROC curve:

AUC: 0.874 AUC: 0.913


Train Data Test Data

The ROC AUC scores for the train and test data are 0.874 and 0.913 respectively, which shows that the
model performed well, as the two scores are within about 0.05 of each other. The areas under the two
curves also look similar.

KNN Model

KNN is a distance-based model, so we scaled the data before building it. By default the model uses a
k-neighbours value of 5. The results are as follows:
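A sketch of the scaling and the default-k model, assuming a standard (z-score) scaler:

    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    # KNN is distance based, so standardize the features first;
    # the scaler is fitted on the train data only and reused on the test data
    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)

    knn = KNeighborsClassifier()  # default n_neighbors=5
    knn.fit(X_train_s, y_train)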

Confusion matrix:

Train Data Test Data

Confusion Matrix   Train Data   Test Data   Meaning
True Negative      217          109         Voted Conservative; model predicted Conservative
False Negative     62           35          Voted Labour; model predicted Conservative
False Positive     90           44          Voted Conservative; model predicted Labour
True Positive      692          268         Voted Labour; model predicted Labour

Classification report:

Train Data Test Data

Inference from the model:

1. In this case both False Negatives and False Positives are harmful, as every wrong prediction of a vote
distorts the exit poll.
2. Accuracy is good on both the train and the test data, at 86% and 83% respectively; here the model
performed slightly better on the train data.
3. Recall for the Labour votes is better than the recall for the Conservative votes.

Since the built-in default for k-neighbours is 5, we plotted a misclassification error graph to choose a better
k value. The values on the left show the misclassification error for each k, and the graph on the right plots
them; the lower the error, the more effective the k value.

From the graph we can see that the error is minimal at k = 17, so we rebuilt the KNN model with the
k-neighbours value set to 17.
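A sketch of the misclassification-error search; the range of k values tried is an assumption:

    from sklearn.neighbors import KNeighborsClassifier

    # Misclassification error (1 - accuracy on the test data) for odd k values
    errors = {}
    for k in range(1, 30, 2):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X_train_s, y_train)
        errors[k] = 1 - model.score(X_test_s, y_test)

    best_k = min(errors, key=errors.get)
    print(best_k)  # in this report the error was minimal at k = 17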

Confusion matrix:

Train Data Test Data

Confusion Matrix   Train Data   Test Data   Meaning
True Negative      205          102         Voted Conservative; model predicted Conservative
False Negative     67           25          Voted Labour; model predicted Conservative
False Positive     102          51          Voted Conservative; model predicted Labour
True Positive      687          278         Voted Labour; model predicted Labour

Classification report:

Train Data Test Data

Inference from the model:

1. In this case both False Negatives and False Positives are harmful, as every wrong prediction of a vote
distorts the exit poll.
2. Accuracy is good on both the train and the test data, at 84% and 83% respectively (as in the comparison
table below); the two are very close.
3. Recall for the Labour votes is better than the recall for the Conservative votes.

ROC_AUC score and ROC curve:

AUC: 0.905 AUC: 0.887


Train Data Test Data

The ROC AUC scores for the train and test data are 0.905 and 0.887 respectively, which shows that the
model performed well, as the two scores are within about 0.05 of each other. The areas under the two
curves also look similar.

AdaBoost Classifier
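The report does not list the AdaBoost parameters, so this sketch assumes scikit-learn defaults; the Decision Tree and Random Forest models below follow the same fit-and-evaluate pattern with DecisionTreeClassifier and RandomForestClassifier:

    from sklearn.ensemble import AdaBoostClassifier

    # AdaBoost with default parameters (assumed); evaluated like the earlier models
    ada = AdaBoostClassifier(random_state=1)  # random_state assumed
    ada.fit(X_train, y_train)
    print(ada.score(X_train, y_train), ada.score(X_test, y_test))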

Classification report & Confusion matrix:

Train Data Test Data

Confusion Matrix   Train Data   Test Data   Meaning
True Negative      186          98          Voted Conservative; model predicted Conservative
False Negative     52           32          Voted Labour; model predicted Conservative
False Positive     121          55          Voted Conservative; model predicted Labour
True Positive      702          271         Voted Labour; model predicted Labour

Inference from the model:

1. In this case both False Negatives and False Positives are harmful, as every wrong prediction of a vote
distorts the exit poll.
2. Accuracy is good on both the train and the test data, at 84% and 81% respectively (as in the comparison
table below); here the model performed better on the train data.
3. Recall for the Labour votes is better than the recall for the Conservative votes.

ROC_AUC score and ROC curve:

AUC: 0.902 AUC: 0.884


Train Data Test Data

The ROC AUC scores for the train and test data are 0.902 and 0.884 respectively, which shows that the
model performed well, as the two scores are within about 0.05 of each other. The areas under the two
curves also look similar.

Decision Tree

Classification report & Confusion matrix:

Train Data Test Data

Confusion Matrix   Train Data   Test Data   Meaning
True Negative      234          113         Voted Conservative; model predicted Conservative
False Negative     97           53          Voted Labour; model predicted Conservative
False Positive     73           40          Voted Conservative; model predicted Labour
True Positive      657          250         Voted Labour; model predicted Labour

Inference from the model:

1. In this case both False Negatives and False Positives are harmful, as every wrong prediction of a vote
distorts the exit poll.
2. Accuracy is good on both the train and the test data, at 84% and 80% respectively (as in the comparison
table below); here the model performed better on the train data.
3. Recall for the Labour votes is better than the recall for the Conservative votes.

ROC_AUC score and ROC curve:

AUC: 0.902 AUC: 0.853


Train Data Test Data

The ROC AUC scores for the train and test data are 0.902 and 0.853 respectively, a noticeable drop from
train to test, though the two scores are still within about 0.05 of each other. The areas under the two
curves look broadly similar.

Random Forest

Classification report & Confusion matrix:

Train Data Test Data

Confusion Matrix   Train Data   Test Data   Meaning
True Negative      207          101         Voted Conservative; model predicted Conservative
False Negative     56           27          Voted Labour; model predicted Conservative
False Positive     100          52          Voted Conservative; model predicted Labour
True Positive      698          276         Voted Labour; model predicted Labour

Inference from the model:

1. In this case both False Negatives and False Positives are harmful, as every wrong prediction of a vote
distorts the exit poll.
2. Accuracy is good on both the train and the test data, at 85% and 83% respectively (as in the comparison
table below); here the model performed better on the train data.
3. Recall for the Labour votes is better than the recall for the Conservative votes.

ROC_AUC score and ROC curve:

AUC: 0.914 AUC: 0.892


Train Data Test Data

The ROC AUC scores for the train and test data are 0.914 and 0.892 respectively, which shows that the
model performed well, as the two scores are within about 0.05 of each other. The areas under the two
curves also look similar.

Comparison of the models and Inferences

Model                          Data    Accuracy   Precision   Recall   F1    AUC
Logistic Regression Model      Train   83%        86%         90%      88%   0.877
                               Test    86%        87%         93%      90%   0.914
Linear Discriminant Analysis   Train   82%        86%         88%      87%   0.877
                               Test    86%        88%         93%      90%   0.915
Gaussian Naïve Bayes           Train   82%        87%         88%      87%   0.874
                               Test    86%        88%         91%      90%   0.913
KNN Model                      Train   84%        87%         91%      89%   0.905
                               Test    83%        84%         92%      88%   0.887
AdaBoost Classifier            Train   84%        85%         93%      89%   0.902
                               Test    81%        83%         89%      86%   0.884
Decision Tree                  Train   84%        90%         87%      89%   0.902
                               Test    80%        86%         83%      84%   0.853
Random Forest                  Train   85%        87%         93%      90%   0.914
                               Test    83%        84%         91%      87%   0.892

Conclusion:

1. All the models perform similarly well, but the best model is the Logistic Regression Model.
2. The accuracy of the Logistic Regression Model is 83% on the train data and 86% on the test data; the
model performed better on the test data.
3. The recall and precision of the Logistic Regression Model are also good, and the test data performed
better than the train data.
4. The AUC score of the test data is better than that of the train data for the Logistic Regression Model.

Inferences and insights from the data:

1. Labour voters perceive the national and household economic conditions more positively, with averages
of 3.43 and 3.26 respectively.
2. Conservative voters have a higher 'Eurosceptic' (Europe) score than Labour voters, 8.66 versus 5.9
on average.
3. Conservative voters have higher political knowledge than Labour voters, 1.72 versus 1.46 on average.
4. Conservative voters are more politically knowledgeable and more Eurosceptic, so they may frame their
opinions around European issues more than around the rest of the world.
5. Labour voters may be swayed more by recent events than by past performance.

Problem 2
In this project we work on the inaugural corpus from nltk in Python. We will be looking at the following
inaugural speeches of Presidents of the United States of America:

1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

Introduction
The purpose of this exercise is to draw inferences from the dataset. We will perform text analysis on the
three inaugural speeches taken from the nltk inaugural corpus.

Data Description
We will be looking at the following speeches of the Presidents of the United States of America:

1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

Model Analysis
2.1 Find the number of characters, words, and sentences for the mentioned documents.
The total and individual numbers of characters in the speeches are as follows:

The total and individual numbers of words in the speeches are as follows:

The total and individual numbers of sentences in the speeches are as follows:
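A sketch of these counts using the nltk inaugural corpus; the raw, words and sents readers give characters, word tokens and sentences respectively:

    import nltk
    from nltk.corpus import inaugural

    nltk.download("inaugural")
    nltk.download("punkt")  # needed for sentence tokenization

    speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]

    for fid in speeches:
        print(fid,
              len(inaugural.raw(fid)),    # number of characters
              len(inaugural.words(fid)),  # number of words
              len(inaugural.sents(fid)))  # number of sentences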

2.2 Remove all the stopwords from all three speeches.
All the stopwords and the punctuations have been removed from the speeches.

Below are samples of the 10 most common words from each speech, before and after removal of stopwords:

1. Before the removal of stopwords:

2. After the removal of stopwords:
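A sketch of the stopword and punctuation removal, reusing speeches and inaugural from the counting sketch above; lower-casing the tokens before the comparison is an assumption:

    import string
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")
    stop = set(stopwords.words("english")) | set(string.punctuation)

    # Cleaned token lists per speech, with stopwords and punctuation removed
    cleaned = {fid: [w.lower() for w in inaugural.words(fid) if w.lower() not in stop]
               for fid in speeches}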

2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (After removing the stopwords)
The 3 most common words and their counts for each of the 3 speeches are as follows:
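A sketch of the frequency count on the cleaned tokens from the previous sketch:

    from collections import Counter

    # Three most common words per speech after stopword removal
    for fid, tokens in cleaned.items():
        print(fid, Counter(tokens).most_common(3))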

2.4 Plot the word cloud of each of the speeches of the variable. (After removing the stopwords)
Word cloud from the inaugural speech of President Franklin D. Roosevelt in 1941:

As the data comes from the president's inaugural speech, this is reflected in prominent words such as
'Nation', 'Spirit', 'People', 'know' and 'Democracy'.

Word cloud from the inaugural speech of President John F. Kennedy in 1961:

As the data comes from the president's inaugural speech, this is reflected in prominent words such as
'Let', 'World', 'Power', 'sides' and 'Nation'.

Word cloud from the inaugural speech of President Richard Nixon in 1973:

As the data comes from the president's inaugural speech, this is reflected in prominent words such as
'Let', 'Us', 'America', 'Nation' and 'Peace'.

THE END!
