Sakchi Saraf
PGP-DSBA Online
May’22
Date: 16/11/22
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Table of Contents
Problem 1
  Executive summary
  Introduction
  Data description
    Sample of the dataset
  Exploratory data analysis
    Data type
    Describe
    Boxplot
    Histogram
    Strip plot
    Pairplot
    Correlation
  Model analysis
    Train and Test data
    Logistic Regression Model
    Linear Discriminant Analysis
    Gaussian Naïve Bayes
    KNN Model
    AdaBoost Classifier
    Decision Tree
    Random Forest
  Insights
Problem 2
  Introduction
  Data description
  Model analysis
THE END!
List of tables
1     Sample of the dataset
2     Data type
3     Describe
4     Sample of the dataset after dummy variable
5.1   Probability table
5.2   Confusion matrix
6.1   Probability table
6.2   Confusion matrix
7     Confusion matrix
8.1   Confusion matrix with k = 5
8.2   Confusion matrix with k = 17
9     Confusion matrix
10    Confusion matrix
11    Confusion matrix
12    Comparison table

List of graphs and pictures
1     Boxplot
2     Histogram
3     Strip plot
4     Pairplot
5     Correlation
6.1   Confusion matrix
6.2   Classification report
6.3   ROC & AUC
7.1   Confusion matrix
7.2   Classification report
7.3   ROC & AUC
7.4   Confusion matrix & Classification report
8.1   Confusion matrix
8.2   Classification report
8.3   ROC & AUC
9.1   Confusion matrix
9.2   Classification report
9.3   Misclassification report
9.4   Confusion matrix
9.5   Classification report
9.6   ROC & AUC
10.1  Confusion matrix & Classification report
10.2  ROC & AUC
11.1  Confusion matrix & Classification report
11.2  ROC & AUC
12.1  Confusion matrix & Classification report
12.2  ROC & AUC
Problem 1
You are hired by one of the leading news channels, CNBE, which wants to analyse the recent elections. A survey was conducted on 1,525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help predict the overall win and the seats covered by a particular party.
Executive Summary
An exit poll is a poll of voters taken immediately after they have exited the polling stations. Exit polls are used as a rough indicator of, and a check on, who is winning. From this dataset we will likewise predict which party received each vote based on the responses of the voters.
Introduction
The purpose of this exercise is to draw inferences from the dataset by building different models on it. The data consist of the votes of 1,525 voters and 8 attributes that may determine the vote.
Data Description
1. Vote - Party choice: Conservative or Labour
2. Age - in years
3. economic.cond.national - Assessment of current national economic conditions, 1 to 5.
4. economic.cond.household - Assessment of current household economic conditions, 1 to 5.
5. Blair - Assessment of the Labour leader, 1 to 5.
6. Hague - Assessment of the Conservative leader, 1 to 5.
7. Europe - an 11-point scale that measures respondents' attitudes toward European integration. High
scores represent ‘Eurosceptic’ sentiment.
8. political.knowledge - Knowledge of parties' positions on European integration, 0 to 3.
9. Gender - female or male.
The dataset has 1 dependent variable, “Vote”, and 8 independent variables, each with its own set of values. Based on these variables, each voter's choice will be predicted.
Exploratory Data Analysis
Data type in the data frame:
Describe
From the above table we learn that there are no missing values in any variable and no bad data. The ranges of the numerical variables are not very different from each other, except for age. For the categorical variables, “Labour” and “Female” are the most frequent observations in the vote and gender categories respectively.
Boxplot
The left boxplot shows the spread of the data, and the right boxplot shows the spread of the same variable split by the attributes of the dependent variable, Vote. Observations are as follows:
1. Out of the 7 numerical variables, only 2 have outliers. All of them are plausible values, but some of our models are sensitive to outliers, so we will treat them.
2. Attributes like Blair and Hague show no outliers overall, but when we separate Labour and Conservative voters, outliers appear.
3. Conservative voters have stronger ‘Eurosceptic’ sentiment than Labour voters.
4. Labour voters rate the national and household economic conditions better than Conservative voters do.
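The outlier treatment mentioned above can be sketched as capping values at the 1.5×IQR whiskers. This is a minimal illustration rather than the report's exact code, and the sample values are made up:

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Clip values outside the 1.5*IQR whiskers back to the whisker limits."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Illustrative series with one extreme value
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
capped = cap_outliers(s)   # the 100 is pulled down to the upper whisker
```

Applied column by column, this keeps every row while limiting the influence of extreme values on distance- and boundary-based models.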
Histogram
The first 9 graphs show the distribution of the attributes, and the next 9 show the same distributions split by voter group. From the histograms we can infer the following:
1. Vote - Labour is preferred by most of the voters.
2. Age - It is almost normally distributed, with 24 the minimum and 93 the maximum. Labour voters are comparatively younger than Conservative voters.
3. Economic condition of nation - The most common response is the average rating of 3; Labour voters respond more positively than Conservative voters.
4. Economic condition of household - The most common response is the average rating of 3; Labour voters respond more positively than Conservative voters.
5. Conservative voters have stronger ‘Eurosceptic’ sentiment than Labour voters.
Strip Plot
The strip plot shows the distribution of the variables in the form of a scatterplot. The blue dots represent “Labour” and the orange dots represent “Conservative”. The strip plot does not reveal any clear separation between the two groups.
Pairplot
The pairplot shows the relationship between the variables as scatterplots and the distribution of each variable as a histogram. The orange dots represent “Labour” and the blue dots represent “Conservative”. From the scatterplots, we can infer that there is no clear linear relationship between any pair of variables.
Correlation Plot
The correlation plot shows how strongly the attributes are related to each other: values near 1 or -1 indicate strong positive or negative correlation respectively, while values near 0 indicate little correlation. From the table we can see that there is very little correlation between the variables. The most correlated pair is the national economic condition and the household economic condition, with a correlation of 0.35.
Model Analysis
Train and Test data
The dataset has been divided into two parts, X and Y: X is the set of independent variables and Y the dependent variable. The data were then split in a 70%:30% train:test ratio. Before the split, the categorical string variables were converted to dummy (numerical) variables, and the first dummy column created for each variable was dropped. Scaling is not required, as it is not mandatory for the models we are going to build.
We can see that the split is fine, as both parts have a similar class distribution.
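The encoding and split described above can be sketched with pandas' `get_dummies` and scikit-learn's `train_test_split`; the frame and column names below are illustrative stand-ins, not the real survey data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative frame mimicking the survey structure (names are assumptions)
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 50,
    "age": range(24, 124),
    "gender": ["male", "female"] * 50,
})

# One-hot encode the string categoricals, dropping the first level of each
df = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True)

X = df.drop(columns="vote_Labour")
y = df["vote_Labour"]

# 70:30 split, stratified so both parts keep a similar class mix
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
```

Stratifying on y is what keeps the split "on a similar bracket": both parts preserve the Labour/Conservative proportion.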
Logistic Regression Model
The first step was to fit the Logistic Regression model to the train set and then predict on the test data.
The table shows the predicted probability of each voter's choice: ‘0’ is for “Conservative” whereas ‘1’ is for “Labour”. Here the cut-off is 50%.
We can see that 4 of the 5 sample voters are predicted to have voted “Labour”, as their probability is above 50%, while the second voter is predicted to have voted “Conservative”.
Confusion matrix:
Classification report:
Inference from the model:
1. For this case, both False Negatives and False Positives are harmful, as every wrong prediction of a vote affects the exit poll.
2. The accuracy on the train and test data is good, at 83% and 86% respectively; the model performed better on the test data.
3. Recall for the Labour votes is better than recall for the Conservative votes.
The ROC AUC scores for the train and test data are 0.877 and 0.914 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
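The ROC AUC comparison above can be reproduced in spirit with scikit-learn's `roc_auc_score` on the predicted probabilities of the positive class. The data below are synthetic stand-ins, so the numbers will differ from the report's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression().fit(X_tr, y_tr)

# ROC AUC needs the predicted probability of the positive class,
# not the hard 0/1 labels
auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Comparing `auc_train` against `auc_test` is exactly the train/test gap check applied throughout this report.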
Linear Discriminant Analysis
The first step was to fit the Linear Discriminant Analysis model to the train set and then predict on the test data.
The table shows the predicted probability of each voter's choice: ‘0’ is for “Conservative” whereas ‘1’ is for “Labour”. Here the cut-off is 50%.
We can see that all 5 sample voters are predicted to have voted for Labour, as their probability for ‘1’ is above 50%.
Confusion matrix:
Classification report:
The model performed better on the test data.
The ROC AUC scores for the train and test data are 0.877 and 0.915 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
The default cut-off for the predicted probability of a voter's choice is 50%; here, ‘0’ is for Conservative whereas ‘1’ is for Labour. We now vary the cut-off to look for a value at which the model's predictions improve.
We tried cut-offs from 10% to 90% and got the best accuracy and F1 score at a cut-off of 40%. Below are the confusion matrix and classification report for the test data at the 40% cut-off:
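The 10%-90% cut-off sweep can be sketched as follows. This is an illustration on synthetic stand-in data, not the report's code, so the best cut-off found here need not be 40%:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

lda = LinearDiscriminantAnalysis().fit(X, y)
proba = lda.predict_proba(X)[:, 1]   # P(class == 1)

# Try cut-offs from 10% to 90% and record accuracy and F1 at each
results = {}
for cutoff in np.arange(0.1, 1.0, 0.1):
    pred = (proba >= cutoff).astype(int)
    results[round(cutoff, 1)] = (accuracy_score(y, pred), f1_score(y, pred))

# Keep the cut-off with the best F1 score
best_cutoff = max(results, key=lambda c: results[c][1])
```

The same loop works for any classifier exposing `predict_proba`, which is how a 40% cut-off could be selected for the LDA model.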
Gaussian Naïve Bayes
Confusion matrix:
Classification report:
ROC_AUC score and ROC curve:
The ROC AUC scores for the train and test data are 0.874 and 0.913 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
KNN Model
KNN is a distance-based model, so we scaled the data before building it. By default, the model takes k = 5 neighbours. The results are as follows:
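The scale-then-fit step can be sketched as below, on synthetic stand-in data. The scaler is fit on the train split only, which is standard practice, and k = 5 is scikit-learn's default:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with very different feature scales (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4)) * [1, 10, 100, 1000]
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# KNN is distance-based, so scale first (fit the scaler on train only)
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_tr), y_tr)

test_acc = knn.score(scaler.transform(X_te), y_te)
```

Without scaling, the largest-scale feature would dominate every distance computation, which is why scaling matters for KNN but not for tree-based models.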
Confusion matrix:
Confusion Matrix   Train Data   Test Data
True Negative         217          109    The voter voted Conservative and the model predicted the same
False Negative         62           35    The voter voted Labour and the model predicted Conservative
False Positive         90           44    The voter voted Conservative and the model predicted Labour
True Positive         692          268    The voter voted Labour and the model predicted the same
Classification report:
Since the default is k = 5, we computed a misclassification error curve to confirm the best k value. The values on the left show the misclassification error for each k, and the graph on the right plots them; the lower the error, the more effective the k value.
From the graph we can see that the error is minimal at k = 17, so we rebuild the KNN model with k = 17.
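The search over k can be sketched as below. On the synthetic stand-in data the best k will differ from the report's 17:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Misclassification error = 1 - accuracy, for odd k from 1 to 19
errors = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr_s, y_tr)
    errors[k] = 1 - knn.score(X_te_s, y_te)

best_k = min(errors, key=errors.get)
```

Odd k values are conventional for binary classification because they avoid tied neighbour votes.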
Confusion matrix:
Classification report:
ROC_AUC score and ROC curve:
The ROC AUC scores for the train and test data are 0.905 and 0.887 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
AdaBoost Classifier
Confusion Matrix   Train Data   Test Data
True Negative         186           98    The voter voted Conservative and the model predicted the same
False Negative         52           32    The voter voted Labour and the model predicted Conservative
False Positive        121           55    The voter voted Conservative and the model predicted Labour
True Positive         702          271    The voter voted Labour and the model predicted the same
The ROC AUC scores for the train and test data are 0.902 and 0.884 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
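A minimal AdaBoost sketch producing the two metrics reported above (confusion matrix and ROC AUC), again on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Boosted shallow trees (scikit-learn's default base estimator)
ada = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

cm = confusion_matrix(y_te, ada.predict(X_te))   # rows: actual, cols: predicted
auc = roc_auc_score(y_te, ada.predict_proba(X_te)[:, 1])
```

In scikit-learn's `confusion_matrix`, position [0, 0] is the true negatives and [1, 1] the true positives, matching the table layout above.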
Decision Tree
ROC_AUC score and ROC curve:
The ROC AUC scores for the train and test data are 0.902 and 0.853 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
Random Forest
Confusion Matrix   Train Data   Test Data
True Negative         207          101    The voter voted Conservative and the model predicted the same
False Negative         56           27    The voter voted Labour and the model predicted Conservative
False Positive        100           52    The voter voted Conservative and the model predicted Labour
True Positive         698          276    The voter voted Labour and the model predicted the same
The ROC AUC scores for the train and test data are 0.914 and 0.892 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
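A Random Forest sketch in the same pattern; note how the train AUC runs well above the test AUC, which is the gap the ±5% check above is monitoring (synthetic stand-in data, so the exact numbers differ from the report's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

# Unconstrained forests typically rank the training data near-perfectly
auc_train = roc_auc_score(y_tr, rf.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
```

Constraining `max_depth` or `min_samples_leaf` is the usual way to shrink the train/test gap when it exceeds a threshold like the ±5% used here.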
Comparison of the models and Inferences
Conclusion:
1. All the models perform similarly well, but the best model is the Logistic Regression model.
2. The accuracy of the Logistic Regression model is 83% on the train data and 86% on the test data; the model performed better on the test data.
3. The recall and precision of the Logistic Regression model are also good, and the test data performed better than the train data.
4. The AUC score on the test data is better than on the train data for the Logistic Regression model.
Insights:
1. Labour voters have a better perception of the national and household economic conditions, with averages of 3.43 and 3.26 respectively.
2. Conservative voters have stronger ‘Eurosceptic’ sentiment than Labour voters, with averages of 8.66 and 5.9 respectively.
3. Conservative voters have higher political knowledge than Labour voters, with averages of 1.72 and 1.46 respectively.
4. Conservative voters are both more politically informed and more ‘Eurosceptic’, so they may hold more restrictive opinions on European integration compared to the rest of the world.
5. Labour voters may be swayed more by recent events than by knowledge of past positions.
Problem 2
In this project we work on the inaugural corpora from nltk in Python, looking at three inaugural speeches of Presidents of the United States of America.
Introduction
The purpose of this exercise is to draw inferences from the dataset by performing text analysis on it. We have used 3 inaugural speeches from the inaugural package for the analysis.
Data Description
We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt's inaugural speech of 1941
2. President John F. Kennedy's inaugural speech of 1961
3. President Richard Nixon's inaugural speech of 1973
Model Analysis
2.1 Find the number of characters, words, and sentences for the mentioned documents.
The character counts for the individual speeches, and in total, are as follows:
The word counts for the individual speeches, and in total, are as follows:
The sentence counts for the individual speeches, and in total, are as follows:
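The counts can be computed roughly as below. The report reads the speeches from nltk's inaugural corpus (e.g. `inaugural.raw('1941-Roosevelt.txt')`); here a short sample string stands in so the sketch runs without corpus downloads, and the simple regex sentence split is a rough stand-in for nltk's `sent_tokenize`:

```python
import re

# Toy stand-in for a speech loaded from nltk's inaugural corpus
speech = ("We meet in a moment of challenge. We meet in a moment of hope. "
          "Let us go forward together.")

n_chars = len(speech)                 # character count, spaces included
n_words = len(speech.split())         # whitespace-delimited word count
# Rough sentence split on ., ! or ?  (nltk's sent_tokenize is more robust)
n_sentences = len([s for s in re.split(r"[.!?]+", speech) if s.strip()])
```

Summing the three per-speech counts gives the totals reported above.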
2.2 Remove all the stopwords from all three speeches.
All the stopwords and punctuation have been removed from the speeches.
Below are samples of the 10 most common words from each speech, before and after stopword removal:
2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (After removing the stopwords)
The 3 most common words and their counts for each of the 3 speeches are as follows:
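Finding the top words after stopword removal can be sketched with `collections.Counter`. The report uses `nltk.corpus.stopwords`; a small hand-written stopword set and a toy sentence stand in here so the sketch runs without nltk data:

```python
import re
from collections import Counter

# Hand-written stand-in for nltk.corpus.stopwords.words('english')
stopwords = {"the", "a", "an", "of", "in", "to", "and", "we", "us", "is", "let"}

# Toy stand-in for a full inaugural speech
speech = ("Let us ask of the nation what the nation asks of us. "
          "The nation and the people move forward, and the people answer.")

words = re.findall(r"[a-z']+", speech.lower())    # tokenize, strip punctuation
words = [w for w in words if w not in stopwords]  # drop stopwords

top3 = Counter(words).most_common(3)
```

Running the same pipeline on each of the three speeches yields the per-president top-3 tables.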
2.4 Plot the word cloud of each of the speeches of the variable. (After removing the stopwords)
Word cloud from the inaugural speech of President Franklin D. Roosevelt in 1941:
As expected for a presidential inaugural address, the most prominent words are ‘Nation’, ‘Spirit’, ‘People’, ‘Know’ and ‘Democracy’.
Word cloud from the inaugural speech of President John F. Kennedy in 1961:
As expected for a presidential inaugural address, the most prominent words are ‘Let’, ‘World’, ‘Power’, ‘Sides’ and ‘Nation’.
Word cloud from the inaugural speech of President Richard Nixon in 1973:
As expected for a presidential inaugural address, the most prominent words are ‘Let’, ‘Us’, ‘America’, ‘Nation’ and ‘Peace’.
THE END!