Sakchi Saraf
PGP-DSBA Online
May’22
Date: 16/11/22
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Table of Contents
Problem 1
  Executive summary
  Introduction
  Data description
    Sample of the dataset
  Exploratory data analysis
    Data type
    Describe
    Boxplot
    Histogram
    Strip plot
    Pairplot
    Correlation
  Model analysis
    Train and Test data
    Logistic Regression Model
    Linear Discriminant Analysis
    Gaussian Naïve Bayes
    KNN Model
    AdaBoost Classifier
    Decision Tree
    Random Forest
  Insights
Problem 2
  Introduction
  Data description
  Model analysis
THE END!
List of tables
1     Sample of the dataset
2     Data type
3     Describe
4     Sample of the dataset after dummy variable
5.1   Probability table
5.2   Confusion matrix
6.1   Probability table
6.2   Confusion matrix
7     Confusion matrix
8.1   Confusion matrix with k = 5
8.2   Confusion matrix with k = 17
9     Confusion matrix
10    Confusion matrix
11    Confusion matrix
12    Comparison table

List of graphs and pictures
1     Boxplot
2     Histogram
3     Strip plot
4     Pairplot
5     Correlation
6.1   Confusion matrix
6.2   Classification report
6.3   ROC & AUC
7.1   Confusion matrix
7.2   Classification report
7.3   ROC & AUC
7.4   Confusion matrix & Classification report
8.1   Confusion matrix
8.2   Classification report
8.3   ROC & AUC
9.1   Confusion matrix
9.2   Classification report
9.3   Misclassification report
9.4   Confusion matrix
9.5   Classification report
9.6   ROC & AUC
10.1  Confusion matrix & Classification report
10.2  ROC & AUC
11.1  Confusion matrix & Classification report
11.2  ROC & AUC
12.1  Confusion matrix & Classification report
12.2  ROC & AUC
Problem 1
You are hired by one of the leading news channels, CNBE, which wants to analyse the recent elections. A survey was conducted on 1,525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help predict the overall win and the seats covered by a particular party.
Executive Summary
An exit poll is a poll of voters taken immediately after they have exited the polling stations. Exit polls are used as a rough indicator of, and a check on, who is winning. From this dataset we will likewise predict which party received each vote based on the responses of the voters.
Introduction
The purpose of this exercise is to draw inferences from the dataset by building different models on it. The data consist of the votes of 1,525 voters and 8 attributes that may determine the vote.
Data Description
1. Vote - Party choice: Conservative or Labour
2. Age - in years
3. economic.cond.national - Assessment of current national economic conditions, 1 to 5.
4. economic.cond.household - Assessment of current household economic conditions, 1 to 5.
5. Blair - Assessment of the Labour leader, 1 to 5.
6. Hague - Assessment of the Conservative leader, 1 to 5.
7. Europe - an 11-point scale that measures respondents' attitudes toward European integration. High
scores represent ‘Eurosceptic’ sentiment.
8. political.knowledge - Knowledge of parties' positions on European integration, 0 to 3.
9. Gender - female or male.
The dataset has 1 dependent variable, “Vote”, and 8 independent variables, each with its own set of values. Based on these variables, each voter's choice will be predicted.
Exploratory Data Analysis
Data type in the data frame:
Describe
From the above table we learn that there are no missing values in any variable and no bad data. The ranges of the numerical variables are not very different from each other, except for age. For the categorical variables, “Labour” and “Female” are the most frequent observations in the vote and gender categories respectively.
Boxplot
The left boxplot shows the spread of the data, and the right boxplot shows the spread of the same variable split by the attributes of the dependent variable, Vote. Observations are as follows:
1. Out of the 7 numerical variables, only 2 have outliers. All of them are plausible values, but some of our models are sensitive to outliers, so we will treat them.
2. Attributes like Blair and Hague show no outliers overall, but when we separate Labour and Conservative voters, outliers appear.
3. Conservative voters have stronger ‘Eurosceptic’ sentiment than Labour voters.
4. Labour voters rate the national and household economic conditions better than Conservative voters do.
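The outlier treatment mentioned above can be sketched as capping values at the 1.5×IQR whiskers. This is a minimal illustration rather than the report's exact code, and the sample values are made up:

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Clip values outside the 1.5*IQR whiskers back to the whisker limits."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Illustrative series with one extreme value
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
capped = cap_outliers(s)   # the 100 is pulled down to the upper whisker
```

Applied column by column, this keeps every row while limiting the influence of extreme values on distance- and boundary-based models.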
Histogram
The first 9 graphs show the distribution of the attributes, and the next 9 show the same distributions split by voter group. From the histograms we can infer the following:
1. Vote - Labour is preferred by most of the voters.
2. Age - It is almost normally distributed, with 24 the minimum and 93 the maximum. Labour voters are comparatively younger than Conservative voters.
3. Economic condition of nation - The most common response is the average rating of 3; Labour voters respond more positively than Conservative voters.
4. Economic condition of household - The most common response is the average rating of 3; Labour voters respond more positively than Conservative voters.
5. Conservative voters have stronger ‘Eurosceptic’ sentiment than Labour voters.
Strip Plot
The strip plot shows the distribution of the variables in the form of a scatterplot. The blue dots represent “Labour” and the orange dots represent “Conservative”. The strip plot does not reveal any clear separation between the two groups.
Pairplot
The pairplot shows the relationship between the variables as scatterplots and the distribution of each variable as a histogram. The orange dots represent “Labour” and the blue dots represent “Conservative”. From the scatterplots, we can infer that there is no clear linear relationship between any pair of variables.
Correlation Plot
The correlation plot shows how strongly the attributes are related to each other: values near 1 or -1 indicate strong positive or negative correlation respectively, while values near 0 indicate little correlation. From the table we can see that there is very little correlation between the variables. The most correlated pair is the national economic condition and the household economic condition, with a correlation of 0.35.
Model Analysis
Train and Test data
The dataset has been divided into two parts, X and Y: X is the set of independent variables and Y the dependent variable. The data were then split in a 70%:30% train:test ratio. Before the split, the categorical string variables were converted to dummy (numerical) variables, and the first dummy column created for each variable was dropped. Scaling is not required, as it is not mandatory for the models we are going to build.
We can see that the split is fine, as both parts have a similar class distribution.
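The encoding and split described above can be sketched with pandas' `get_dummies` and scikit-learn's `train_test_split`; the frame and column names below are illustrative stand-ins, not the real survey data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative frame mimicking the survey structure (names are assumptions)
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 50,
    "age": range(24, 124),
    "gender": ["male", "female"] * 50,
})

# One-hot encode the string categoricals, dropping the first level of each
df = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True)

X = df.drop(columns="vote_Labour")
y = df["vote_Labour"]

# 70:30 split, stratified so both parts keep a similar class mix
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
```

Stratifying on y is what keeps the split "on a similar bracket": both parts preserve the Labour/Conservative proportion.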
Logistic Regression Model
The first step was to fit the Logistic Regression model to the train set and then predict on the test data.
The table shows the predicted probability of each voter's choice: ‘0’ is for “Conservative” whereas ‘1’ is for “Labour”. Here the cut-off is 50%.
We can see that 4 of the 5 sample voters are predicted to have voted “Labour”, as their probability is above 50%, while the second voter is predicted to have voted “Conservative”.
Confusion matrix:
Classification report:
Inference from the model:
1. For this case, both False Negatives and False Positives are harmful, as every wrong prediction of a vote affects the exit poll.
2. The accuracy on the train and test data is good, at 83% and 86% respectively; the model performed better on the test data.
3. Recall for the Labour votes is better than recall for the Conservative votes.
The ROC AUC scores for the train and test data are 0.877 and 0.914 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
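The ROC AUC comparison above can be reproduced in spirit with scikit-learn's `roc_auc_score` on the predicted probabilities of the positive class. The data below are synthetic stand-ins, so the numbers will differ from the report's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression().fit(X_tr, y_tr)

# ROC AUC needs the predicted probability of the positive class,
# not the hard 0/1 labels
auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Comparing `auc_train` against `auc_test` is exactly the train/test gap check applied throughout this report.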
Linear Discriminant Analysis
The first step was to fit the Linear Discriminant Analysis model to the train set and then predict on the test data.
The table shows the predicted probability of each voter's choice: ‘0’ is for “Conservative” whereas ‘1’ is for “Labour”. Here the cut-off is 50%.
We can see that all 5 sample voters are predicted to have voted for Labour, as their probability for ‘1’ is above 50%.
Confusion matrix:
Classification report:
The model performed better on the test data.
The ROC AUC scores for the train and test data are 0.877 and 0.915 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
The default cut-off for the predicted probability of a voter's choice is 50%; here, ‘0’ is for Conservative whereas ‘1’ is for Labour. We now vary the cut-off to look for a value at which the model's predictions improve.
We tried cut-offs from 10% to 90% and got the best accuracy and F1 score at a cut-off of 40%. Below are the confusion matrix and classification report for the test data at the 40% cut-off:
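The 10%-90% cut-off sweep can be sketched as follows. This is an illustration on synthetic stand-in data, not the report's code, so the best cut-off found here need not be 40%:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

lda = LinearDiscriminantAnalysis().fit(X, y)
proba = lda.predict_proba(X)[:, 1]   # P(class == 1)

# Try cut-offs from 10% to 90% and record accuracy and F1 at each
results = {}
for cutoff in np.arange(0.1, 1.0, 0.1):
    pred = (proba >= cutoff).astype(int)
    results[round(cutoff, 1)] = (accuracy_score(y, pred), f1_score(y, pred))

# Keep the cut-off with the best F1 score
best_cutoff = max(results, key=lambda c: results[c][1])
```

The same loop works for any classifier exposing `predict_proba`, which is how a 40% cut-off could be selected for the LDA model.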
Gaussian Naïve Bayes
Confusion matrix:
Classification report:
ROC_AUC score and ROC curve:
The ROC AUC scores for the train and test data are 0.874 and 0.913 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
KNN Model
KNN is a distance-based model, so we scaled the data before building it. By default, the model takes k = 5 neighbours. The results are as follows:
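The scale-then-fit step can be sketched as below, on synthetic stand-in data. The scaler is fit on the train split only, which is standard practice, and k = 5 is scikit-learn's default:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with very different feature scales (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4)) * [1, 10, 100, 1000]
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# KNN is distance-based, so scale first (fit the scaler on train only)
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_tr), y_tr)

test_acc = knn.score(scaler.transform(X_te), y_te)
```

Without scaling, the largest-scale feature would dominate every distance computation, which is why scaling matters for KNN but not for tree-based models.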
Confusion matrix:
Confusion Matrix   Train Data   Test Data
True Negative         217          109    The voter voted Conservative and the model predicted the same
False Negative         62           35    The voter voted Labour and the model predicted Conservative
False Positive         90           44    The voter voted Conservative and the model predicted Labour
True Positive         692          268    The voter voted Labour and the model predicted the same
Classification report:
Since the default is k = 5, we computed a misclassification error curve to confirm the best k value. The values on the left show the misclassification error for each k, and the graph on the right plots them; the lower the error, the more effective the k value.
From the graph we can see that the error is minimal at k = 17, so we rebuild the KNN model with k = 17.
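The search over k can be sketched as below. On the synthetic stand-in data the best k will differ from the report's 17:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Misclassification error = 1 - accuracy, for odd k from 1 to 19
errors = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr_s, y_tr)
    errors[k] = 1 - knn.score(X_te_s, y_te)

best_k = min(errors, key=errors.get)
```

Odd k values are conventional for binary classification because they avoid tied neighbour votes.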
Confusion matrix:
Classification report:
ROC_AUC score and ROC curve:
The ROC AUC scores for the train and test data are 0.905 and 0.887 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
AdaBoost Classifier
Confusion Matrix   Train Data   Test Data
True Negative         186           98    The voter voted Conservative and the model predicted the same
False Negative         52           32    The voter voted Labour and the model predicted Conservative
False Positive        121           55    The voter voted Conservative and the model predicted Labour
True Positive         702          271    The voter voted Labour and the model predicted the same
The ROC AUC scores for the train and test data are 0.902 and 0.884 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
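A minimal AdaBoost sketch producing the two metrics reported above (confusion matrix and ROC AUC), again on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Boosted shallow trees (scikit-learn's default base estimator)
ada = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

cm = confusion_matrix(y_te, ada.predict(X_te))   # rows: actual, cols: predicted
auc = roc_auc_score(y_te, ada.predict_proba(X_te)[:, 1])
```

In scikit-learn's `confusion_matrix`, position [0, 0] is the true negatives and [1, 1] the true positives, matching the table layout above.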
Decision Tree
ROC_AUC score and ROC curve:
The ROC AUC scores for the train and test data are 0.902 and 0.853 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
Random Forest
Confusion Matrix   Train Data   Test Data
True Negative         207          101    The voter voted Conservative and the model predicted the same
False Negative         56           27    The voter voted Labour and the model predicted Conservative
False Positive        100           52    The voter voted Conservative and the model predicted Labour
True Positive         698          276    The voter voted Labour and the model predicted the same
The ROC AUC scores for the train and test data are 0.914 and 0.892 respectively, which shows that the model has performed well, as the two are within ±5% of each other. The areas under the two curves also look similar.
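A Random Forest sketch in the same pattern; note how the train AUC runs well above the test AUC, which is the gap the ±5% check above is monitoring (synthetic stand-in data, so the exact numbers differ from the report's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

# Unconstrained forests typically rank the training data near-perfectly
auc_train = roc_auc_score(y_tr, rf.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
```

Constraining `max_depth` or `min_samples_leaf` is the usual way to shrink the train/test gap when it exceeds a threshold like the ±5% used here.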
Comparison of the models and Inferences
Conclusion:
1. All the models perform similarly well, but the best model is the Logistic Regression model.
2. The accuracy of the Logistic Regression model is 83% on the train data and 86% on the test data; the model performed better on the test data.
3. The recall and precision of the Logistic Regression model are also good, and the test data performed better than the train data.
4. The AUC score on the test data is better than on the train data for the Logistic Regression model.
Insights:
1. Labour voters have a better perception of the national and household economic conditions, with averages of 3.43 and 3.26 respectively.
2. Conservative voters have stronger ‘Eurosceptic’ sentiment than Labour voters, with averages of 8.66 and 5.9 respectively.
3. Conservative voters have higher political knowledge than Labour voters, with averages of 1.72 and 1.46 respectively.
4. Conservative voters are both more politically informed and more ‘Eurosceptic’, so they may hold more restrictive opinions on European integration compared to the rest of the world.
5. Labour voters may be swayed more by recent events than by knowledge of past positions.
Problem 2
In this project we work on the inaugural corpora from nltk in Python, looking at three inaugural speeches of Presidents of the United States of America.
Introduction
The purpose of this exercise is to draw inferences from the dataset by performing text analysis on it. We have used 3 inaugural speeches from the inaugural package for the analysis.
Data Description
We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt's inaugural speech of 1941
2. President John F. Kennedy's inaugural speech of 1961
3. President Richard Nixon's inaugural speech of 1973
Model Analysis
2.1 Find the number of characters, words, and sentences for the mentioned documents.
The character counts for the individual speeches, and in total, are as follows:
The word counts for the individual speeches, and in total, are as follows:
The sentence counts for the individual speeches, and in total, are as follows:
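The counts can be computed roughly as below. The report reads the speeches from nltk's inaugural corpus (e.g. `inaugural.raw('1941-Roosevelt.txt')`); here a short sample string stands in so the sketch runs without corpus downloads, and the simple regex sentence split is a rough stand-in for nltk's `sent_tokenize`:

```python
import re

# Toy stand-in for a speech loaded from nltk's inaugural corpus
speech = ("We meet in a moment of challenge. We meet in a moment of hope. "
          "Let us go forward together.")

n_chars = len(speech)                 # character count, spaces included
n_words = len(speech.split())         # whitespace-delimited word count
# Rough sentence split on ., ! or ?  (nltk's sent_tokenize is more robust)
n_sentences = len([s for s in re.split(r"[.!?]+", speech) if s.strip()])
```

Summing the three per-speech counts gives the totals reported above.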
2.2 Remove all the stopwords from all three speeches.
All the stopwords and punctuation have been removed from the speeches.
Below are samples of the 10 most common words from each speech, before and after stopword removal:
2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (After removing the stopwords)
The 3 most common words and their counts for each of the 3 speeches are as follows:
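Finding the top words after stopword removal can be sketched with `collections.Counter`. The report uses `nltk.corpus.stopwords`; a small hand-written stopword set and a toy sentence stand in here so the sketch runs without nltk data:

```python
import re
from collections import Counter

# Hand-written stand-in for nltk.corpus.stopwords.words('english')
stopwords = {"the", "a", "an", "of", "in", "to", "and", "we", "us", "is", "let"}

# Toy stand-in for a full inaugural speech
speech = ("Let us ask of the nation what the nation asks of us. "
          "The nation and the people move forward, and the people answer.")

words = re.findall(r"[a-z']+", speech.lower())    # tokenize, strip punctuation
words = [w for w in words if w not in stopwords]  # drop stopwords

top3 = Counter(words).most_common(3)
```

Running the same pipeline on each of the three speeches yields the per-president top-3 tables.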
2.4 Plot the word cloud of each of the speeches of the variable. (After removing the stopwords)
Word cloud from the inaugural speech of President Franklin D. Roosevelt in 1941:
As expected for a presidential inaugural address, the most prominent words are ‘Nation’, ‘Spirit’, ‘People’, ‘Know’ and ‘Democracy’.
Word cloud from the inaugural speech of President John F. Kennedy in 1961:
As expected for a presidential inaugural address, the most prominent words are ‘Let’, ‘World’, ‘Power’, ‘Sides’ and ‘Nation’.
Word cloud from the inaugural speech of President Richard Nixon in 1973:
As expected for a presidential inaugural address, the most prominent words are ‘Let’, ‘Us’, ‘America’, ‘Nation’ and ‘Peace’.
THE END!