
Business Report

Machine Learning

P L Lohitha | DSBA | 22-01-23


TABLE OF CONTENTS

S.NO  TOPIC

1   Problem 1
2   Sample of the dataset
3   Outlier treatment
4   Univariate analysis
5   Bivariate analysis
6   Multivariate analysis
7   Data encoding, scaling and splitting
8   Logistic Regression
9   Linear Discriminant Analysis
10  K-Nearest Neighbors
11  Naïve Bayes model
12  Random Forest
13  Bagging
14  Boosting
15  Model comparison
16  Business insights
17  Problem 2
18  Solution


LIST OF TABLES

Table 1  Sample of the data (Problem 1)
Table 2  Summary of the variables
Table 3  Dummy variables
Problem 1
Problem Statement:

You have been hired by CNBE, one of the leading news channels, to analyze the recent elections.
A survey was conducted on 1525 voters with 9 variables. You have to build a model to predict
which party a voter will vote for on the basis of the given information, in order to create an exit
poll that will help predict the overall win and the seats covered by a particular party.

Data Ingestion:
1.1 Read the dataset. Perform descriptive statistics and check for null values. Write an
inference on it.
1.2 Perform univariate and bivariate analysis. Do exploratory data analysis. Check for outliers.

Data Preparation:

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test (70:30).

Modelling:
1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model.

Final Model:

Compare the models and write inference which model is best/optimized.

Sample of the dataset:

Table 1

• The dataset has been loaded after importing the necessary libraries.
• The dataset has 1525 rows and 10 columns.
(1525, 10)

• The dataset has 8 integer variables and 2 object variables.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1525 non-null int64
1 vote 1525 non-null object
2 age 1525 non-null int64
3 economic.cond.national 1525 non-null int64
4 economic.cond.household 1525 non-null int64
5 Blair 1525 non-null int64
6 Hague 1525 non-null int64
7 Europe 1525 non-null int64
8 political.knowledge 1525 non-null int64
9 gender 1525 non-null object
dtypes: int64(8), object(2)
memory usage: 119.3+ KB

• The dataset has no null or NA values.


Unnamed: 0 0
vote 0
age 0
economic.cond.national 0
economic.cond.household 0
Blair 0
Hague 0
Europe 0
political.knowledge 0
gender 0
dtype: int64

• The five-point summary of the dataset helped us understand the mean, median and standard
deviation of the data, and it also helps identify anomalies. The column 'Unnamed: 0' is just a
serial number for each row, so we can drop it as it does not help in model building. The
minimum age of the voters is 24, the maximum is 93, and the average age is 53.
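The ingestion checks described above can be sketched as follows; the frame here is a small hypothetical stand-in (the report's actual file name is not shown, so `read_csv` appears only in a comment):

```python
import pandas as pd

# Small hypothetical frame mirroring the dataset's layout; in the report
# this would come from something like pd.read_csv("<survey file>").
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 36, 35],
    "gender": ["female", "male", "male"],
})

print(df.shape)                        # (rows, columns)
print(df.isnull().sum())               # per-column null counts
df = df.drop(columns=["Unnamed: 0"])   # serial-number column adds nothing
print(df.duplicated().sum())           # number of duplicate rows
```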

Table 2

• We dropped the column 'Unnamed: 0' and checked for duplicates. There are no duplicate
values in the data.
0
Outlier Treatment:

• There are no outliers in the data except for one each in the columns 'economic.cond.national'
and 'economic.cond.household'. We chose not to treat these outliers because the columns
represent a rating scale of the voters' economic conditions, and treating them might alter the
model's predictions.
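A minimal IQR-based check of the kind the boxplot above visualises, on illustrative values (not the survey data): a rating of 1 falls below the lower whisker and is flagged.

```python
import pandas as pd

# Illustrative ordinal ratings; one value (1) sits below the lower fence
s = pd.Series([3, 4, 3, 4, 3, 4, 3, 4, 1])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Standard boxplot fences: 1.5 * IQR beyond the quartiles
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(len(outliers))
```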
Figure 1

Univariate analysis:

• The age of the voters is evenly distributed; it is neither left-skewed nor right-skewed.

Figure 2

• The column economic.cond.national is right-skewed and there is an outlier in the data.

Figure 3

• The column economic.cond.household is right-skewed and there is an outlier in the data.

Figure 4

• The column Blair is evenly distributed and ranges on a scale of 1 to 5.

Figure 5

• The data in the column Europe is slightly right-skewed.

Figure 6

• The majority of the voters' choice of party is Labour; only 30.30% prefer Conservative.

Figure 7

• The gender ratio of the voters is almost equally distributed between male and female.

Figure 8

Bivariate analysis:

• We can infer from the graph below that there is very low or no correlation between the
variables.
• Since the correlation is low, the data is well suited for model building and we can make
reliable predictions from it.
• There are no patterns in the data to infer from the graph.

Figure 9

• From the heatmap we can infer that all the variables have very low correlation.
• The pair with the highest positive correlation (0.35) is economic.cond.national and
economic.cond.household.
• The pair with the highest negative correlation (-0.30) is Blair and Europe.

Figure 10

Multivariate analysis:

• The national economic condition of voters whose party of choice is Labour is higher than
that of voters whose party of choice is Conservative.

Figure 11

• The majority of male voters have higher political knowledge than female voters, and
irrespective of gender, voters whose party of choice is Conservative have higher political
knowledge than the others.

Figure 12

• Younger voters report better national economic conditions, and the age of voters whose
party of choice is Conservative is higher than that of the others.

Figure 13

• The voters with high political knowledge have high Eurosceptic sentiment.

Figure 14

Data Encoding, Scaling and Splitting:


• After the multivariate analysis we encode the two categorical variables using one-hot
encoding and convert them into integer variables.

Table 3

• We then check the standard deviation and variance of the variables to decide whether to
scale the data.

Standard deviation:
vote                        0.459534
age                        15.706057
economic.cond.national      0.880680
economic.cond.household     0.929646
Blair                       1.174439
Hague                       1.230300
Europe                      3.296457
political.knowledge         1.082960
gender                      0.498945
dtype: float64

Variance:
vote                        0.211172
age                       246.680211
economic.cond.national      0.775598
economic.cond.household     0.864243
Blair                       1.379307
Hague                       1.513638
Europe                     10.866629
political.knowledge         1.172801
gender                      0.248946
dtype: float64

• Since the standard deviation and variance of a few variables are high and the magnitudes
of age and the other factors vary, we decided to scale the data; standardising the features
helps yield better predictions.
• We then split the data into train and test sets in a 70:30 ratio. We split the data to avoid
overfitting: we do not want the model to memorise the data, we want it to learn a pattern
from it.
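One possible encode → split → scale pipeline for data of this shape; the column names mirror the report but the values are made up, and fitting the scaler on the train split only is a standard precaution rather than something the report states:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data with the report's two object columns plus a numeric one
df = pd.DataFrame({
    "vote":   ["Labour", "Conservative"] * 10,
    "gender": ["male", "female"] * 10,
    "age":    range(30, 50),
})

# One-hot encode the object columns; drop_first keeps one dummy per binary column
df = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True)

X = df.drop(columns=["vote_Labour"])
y = df["vote_Labour"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Fit the scaler on train only, then apply the same transform to test
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.shape, X_test_s.shape)
```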

Logistic Regression:
• We built a logistic regression model after splitting the data.
0.8397375820056232

              precision    recall  f1-score   support

           0       0.87      0.91      0.89       735
           1       0.77      0.69      0.73       332

    accuracy                           0.84      1067
   macro avg       0.82      0.80      0.81      1067
weighted avg       0.84      0.84      0.84      1067

• From the above classification report, the accuracy of the model on the train set is 84% and
recall is 0.91 and 0.69 for the two classes. Since both classes are equally important, we also
check the f1-score, which combines precision and recall; it is 0.89 and 0.73.
0.8231441048034934

              precision    recall  f1-score   support

           0       0.87      0.89      0.88       328
           1       0.70      0.65      0.68       130

    accuracy                           0.82       458
   macro avg       0.78      0.77      0.78       458
weighted avg       0.82      0.82      0.82       458

• We verified the performance of the model on the test data: the accuracy is 82%, which has
not dropped much from the train set, and neither has the f1-score, so there is no overfitting
here.

Figure 15
• The confusion matrix is plotted above; the numbers of true positives and true negatives are
292 and 85 on the test set, and the area under the ROC curve is 0.882387.

Figure 16

• The above plot depicts the ROC curve; the model with the highest area under the ROC
curve performs best.
• To optimise the performance of the model we perform hyperparameter tuning using grid
search.
• We tried multiple parameters and identified the best combination to improve model
performance.
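A hedged sketch of such a grid search on synthetic data; the parameter grid used in the report is not shown, so the `C`/`penalty` values below are plausible assumptions, not the actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the scaled survey features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]},  # assumed grid
    scoring="f1",
    cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```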
0.8209606986899564

              precision    recall  f1-score   support

           0       0.86      0.89      0.88       328
           1       0.70      0.65      0.67       130

    accuracy                           0.82       458
   macro avg       0.78      0.77      0.77       458
weighted avg       0.82      0.82      0.82       458

• Using those parameters we checked the performance of the model on the test data: there is
not much difference in the accuracy, and only a marginal change in the f1-score of target
class 1.

Figure 17

• The confusion matrix is plotted above; true positives and true negatives are 292 and 84 on
the test set, and the area under the ROC curve is 0.882387. There is no major difference in
the confusion matrix or the AUC.

Figure 18

Linear Discriminant Analysis (LDA):


• We built an LDA model and the classification report is given below.
0.8369259606373008

              precision    recall  f1-score   support

           0       0.87      0.90      0.88       735
           1       0.76      0.70      0.73       332

    accuracy                           0.84      1067
   macro avg       0.81      0.80      0.81      1067
weighted avg       0.83      0.84      0.84      1067

• From the above classification report, the accuracy of the model on the train set is 84% and
recall is 0.90 and 0.70 for the classes. Since both classes are equally important, we also check
the f1-scores, which are 0.88 and 0.73.
0.8187772925764192

              precision    recall  f1-score   support

           0       0.87      0.88      0.87       328
           1       0.69      0.66      0.67       130

    accuracy                           0.82       458
   macro avg       0.78      0.77      0.77       458
weighted avg       0.82      0.82      0.82       458

• We verified the performance of the model on the test data: the accuracy is 82%, which has
not dropped much from the train set, so there is no overfitting here.

Figure 19

• The confusion matrix is plotted above; the numbers of true positives and true negatives are
289 and 86 on the test set, and the area under the ROC curve is 0.883771.

Figure 20

• The above plot depicts the ROC curve; the model with the highest area under the ROC
curve performs best.
• To optimise the performance of the model we perform hyperparameter tuning using grid
search.
• We tried multiple parameters and identified the best combination to improve model
performance.
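An LDA grid search could look like the sketch below; the report does not list its grid, so the `solver`/`tol` values are illustrative assumptions (note that `shrinkage` is only valid with the `lsqr`/`eigen` solvers, which is why it is omitted here):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the survey features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

grid = GridSearchCV(
    LinearDiscriminantAnalysis(),
    param_grid={"solver": ["svd", "lsqr"], "tol": [1e-4, 1e-3]},  # assumed grid
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```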
0.8231441048034934

              precision    recall  f1-score   support

           0       0.88      0.88      0.88       328
           1       0.69      0.68      0.69       130

    accuracy                           0.82       458
   macro avg       0.78      0.78      0.78       458
weighted avg       0.82      0.82      0.82       458

• Using those parameters we checked the performance on the test data: there is not much
difference in the accuracy, but the f1-scores improved slightly from 0.87 and 0.67 to 0.88
and 0.69.

Figure 21

• The confusion matrix is plotted above; true positives and true negatives are 288 and 89 on
the test set, and the area under the ROC curve is 0.885272. There is a slight difference from
the earlier run: the TP and TN changed from 289 and 86 to 288 and 89.


Figure 22

KNN Model:
• We built a KNN model and the classification report is given below.
0.8631677600749765

              precision    recall  f1-score   support

           0       0.89      0.92      0.90       735
           1       0.80      0.75      0.77       332

    accuracy                           0.86      1067
   macro avg       0.84      0.83      0.84      1067
weighted avg       0.86      0.86      0.86      1067

• From the above classification report, the accuracy of the model on the train set is 86% and
recall is 0.92 and 0.75 for the classes. Since both classes are equally important, we also check
the f1-scores, which are 0.90 and 0.77.
0.8253275109170306

              precision    recall  f1-score   support

           0       0.88      0.87      0.88       328
           1       0.69      0.71      0.70       130

    accuracy                           0.83       458
   macro avg       0.78      0.79      0.79       458
weighted avg       0.83      0.83      0.83       458

• We verified the performance of the model on the test data: the accuracy is 83%, which has
not dropped much from the train set, so there is no overfitting here.

Figure 23

• The confusion matrix is plotted above; the numbers of true positives and true negatives are
286 and 92 on the test set, and the area under the ROC curve is 0.870556.

Figure 24

• The above plot depicts the ROC curve; the model with the highest area under the ROC
curve performs best.
• To optimise the performance of the model we perform hyperparameter tuning using grid
search.
• We tried multiple parameters and identified the best combination to improve model
performance.
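For KNN the usual levers are the neighbour count and the weighting; the grid below is an assumed example, not the report's actual one, and the features are scaled first because KNN is distance-based:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; KNN is distance-based, so scale before fitting
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X = StandardScaler().fit_transform(X)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9],
                "weights": ["uniform", "distance"]},  # assumed grid
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```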
0.8253275109170306

              precision    recall  f1-score   support

           0       0.88      0.87      0.88       328
           1       0.69      0.71      0.70       130

    accuracy                           0.83       458
   macro avg       0.78      0.79      0.79       458
weighted avg       0.83      0.83      0.83       458

• Using those parameters we checked the performance of the model on the test data; there is
no difference in the accuracy or the f1-score.

Figure 25

• The confusion matrix is plotted above; true positives and true negatives are 286 and 92 on
the test set, and the area under the ROC curve is 0.870556. There is no difference in the
confusion matrix or the AUC even after hyperparameter tuning.

Figure 26

Naïve Bayes Model:


• We built a Naïve Bayes model and the classification report is given below.
0.8331771321462043

              precision    recall  f1-score   support

           0       0.88      0.88      0.88       735
           1       0.74      0.72      0.73       332

    accuracy                           0.83      1067
   macro avg       0.81      0.80      0.80      1067
weighted avg       0.83      0.83      0.83      1067

• From the above classification report, the accuracy of the model on the train set is 83% and
recall is 0.88 and 0.72 for the classes. Since both classes are equally important, we also check
the f1-scores, which are 0.88 and 0.73.
0.8253275109170306

              precision    recall  f1-score   support

           0       0.89      0.87      0.88       328
           1       0.68      0.72      0.70       130

    accuracy                           0.83       458
   macro avg       0.78      0.79      0.79       458
weighted avg       0.83      0.83      0.83       458

• We verified the performance of the model on the test data: the accuracy is 83%, which has
not dropped much from the train set, so there is no overfitting here.
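The Naïve Bayes step above can be sketched as a Gaussian NB fit-and-report on synthetic data (the report does not name the NB variant; `GaussianNB` is the common choice for numeric features and is an assumption here):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the survey features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

nb = GaussianNB().fit(X_tr, y_tr)
print(classification_report(y_te, nb.predict(X_te)))
acc = nb.score(X_te, y_te)
```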

Figure 27

• The confusion matrix is plotted above; the numbers of true positives and true negatives are
284 and 94 on the test set, and the area under the ROC curve is 0.884545.

Figure 28

• The above plot depicts the ROC curve; the model with the highest area under the ROC
curve performs best.
• To optimise the performance of the model we used the SMOTE technique to address the
class imbalance.
• We oversampled class 1 to match the size of class 0 so that the algorithm is not biased,
rebuilt the model, and verified its performance on the test data.
0.7903930131004366

              precision    recall  f1-score   support

           0       0.91      0.79      0.84       328
           1       0.60      0.80      0.68       130

    accuracy                           0.79       458
   macro avg       0.75      0.79      0.76       458
weighted avg       0.82      0.79      0.80       458

• We then checked the performance on the test data: the accuracy dropped from 83% to 79%,
and the f1-scores also dropped from 0.88 and 0.70 to 0.84 and 0.68.

Figure 29

• The confusion matrix is plotted above; true positives and true negatives are 258 and 104 on
the test set, and the area under the ROC curve is 0.884545. There is a drop in the TPs and
TNs after applying SMOTE, so it is not ideal for this model and data. The ROC curve
remained the same.

Figure 30

Model Tuning , Bagging and Boosting:
Random Forest:
• We built a Random Forest model to perform bagging and the classification report is given
below.
0.9990627928772259

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       735
           1       1.00      1.00      1.00       332

    accuracy                           1.00      1067
   macro avg       1.00      1.00      1.00      1067
weighted avg       1.00      1.00      1.00      1067

• From the above classification report, the accuracy of the model on the train set is 100%,
and recall and f1-score are 1.00 for both classes.
0.8209606986899564

              precision    recall  f1-score   support

           0       0.88      0.87      0.87       328
           1       0.68      0.70      0.69       130

    accuracy                           0.82       458
   macro avg       0.78      0.78      0.78       458
weighted avg       0.82      0.82      0.82       458

• We verified the performance of the model on the test data: the accuracy is 82%. The major
drop from 100% to 82% is a clear case of overfitting.

Figure 31

• The confusion matrix is plotted above; the numbers of true positives and true negatives are
285 and 91 on the test set, and the area under the ROC curve is 0.889962.

Figure 32

• The above plot depicts the ROC curve; the model with the highest area under the ROC
curve performs best.
• To optimise the performance of the model and address the overfitting, we perform
hyperparameter tuning using grid search.
• We tried multiple parameters and identified the best combination to improve model
performance.
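Constraining tree depth and leaf size is the usual grid-search lever against random-forest overfitting; the grid below is an assumed illustration on synthetic data, not the report's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the survey features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    # Shallower trees and larger leaves regularise the forest (assumed grid)
    param_grid={"max_depth": [4, 6],
                "min_samples_leaf": [5, 10],
                "n_estimators": [100]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```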
0.8209606986899564

              precision    recall  f1-score   support

           0       0.87      0.88      0.88       328
           1       0.69      0.66      0.68       130

    accuracy                           0.82       458
   macro avg       0.78      0.77      0.78       458
weighted avg       0.82      0.82      0.82       458

• Using those parameters we checked the performance on the test data: there is not much
difference in the accuracy, but a slight change in the f1-scores, class 0 from 0.87 to 0.88 and
class 1 from 0.69 to 0.68.

Figure 33

• The confusion matrix is plotted above; true positives and true negatives are 290 and 86 on
the test set, and the area under the ROC curve is 0.889962. There is a slight difference in the
confusion matrix but no difference in the AUC even after hyperparameter tuning.

Figure 34

Bagging:
• We built a bagging model using the Random Forest classifier and the classification report
is given below.
0.9662605435801312

              precision    recall  f1-score   support

           0       0.96      0.99      0.98       735
           1       0.97      0.92      0.94       332

    accuracy                           0.97      1067
   macro avg       0.97      0.95      0.96      1067
weighted avg       0.97      0.97      0.97      1067

• From the above classification report, the accuracy of the model on the train set is 97% and
recall is 0.99 and 0.92 for the classes. Since both classes are equally important, we also check
the f1-scores, which are 0.98 and 0.94.
0.8362445414847162

              precision    recall  f1-score   support

           0       0.88      0.89      0.89       328
           1       0.71      0.71      0.71       130

    accuracy                           0.84       458
   macro avg       0.80      0.80      0.80       458
weighted avg       0.84      0.84      0.84       458

• We verified the performance of the model on the test data: the accuracy is 84%. The drop
from 97% to 84% indicates overfitting.
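A bagging ensemble with a random forest as the base learner can be sketched as below on synthetic data (the base learner is passed positionally so the sketch works across scikit-learn versions, where the keyword changed from `base_estimator` to `estimator`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Random forest as the base learner inside a bagging ensemble
bag = BaggingClassifier(
    RandomForestClassifier(n_estimators=50, random_state=1),
    n_estimators=10,
    random_state=1,
)
bag.fit(X_tr, y_tr)
acc = bag.score(X_te, y_te)
print(round(acc, 3))
```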

Figure 35

• The confusion matrix is plotted above; the numbers of true positives and true negatives are
291 and 92 on the test set, and the area under the ROC curve is 0.897291.

Figure 36

• The above plot depicts the ROC curve; the model with the highest area under the ROC
curve performs best.
• To optimise the performance of the model and address the overfitting, we perform
hyperparameter tuning using grid search.
• We tried multiple parameters and identified the best combination to improve model
performance.
0.8318777292576419

              precision    recall  f1-score   support

           0       0.88      0.89      0.88       328
           1       0.71      0.69      0.70       130

    accuracy                           0.83       458
   macro avg       0.79      0.79      0.79       458
weighted avg       0.83      0.83      0.83       458

• Using those parameters we checked the performance on the test data: there is a slight drop
in the accuracy from 84% to 83%, and the f1-scores dropped from 0.89 and 0.71 to 0.88 and
0.70.

Figure 37

• The confusion matrix is plotted above; true positives and true negatives are 291 and 90 on
the test set, and the area under the ROC curve is 0.897291. There is a slight difference in the
confusion matrix but no difference in the AUC even after hyperparameter tuning.

Figure 38

Boosting:
• We built a Boosting model and the classification report is given below.
0.8865979381443299

              precision    recall  f1-score   support

           0       0.91      0.93      0.92       735
           1       0.84      0.79      0.81       332

    accuracy                           0.89      1067
   macro avg       0.87      0.86      0.87      1067
weighted avg       0.89      0.89      0.89      1067

• From the above classification report, the accuracy of the model on the train set is 89% and
recall is 0.93 and 0.79 for the classes. Since both classes are equally important, we also check
the f1-scores, which are 0.92 and 0.81.

0.8318777292576419

              precision    recall  f1-score   support

           0       0.89      0.87      0.88       328
           1       0.69      0.74      0.71       130

    accuracy                           0.83       458
   macro avg       0.79      0.80      0.80       458
weighted avg       0.84      0.83      0.83       458

• We verified the performance of the model on the test data: the accuracy is 83%. The drop
from 89% to 83% is moderate, so there is no serious overfitting in the model.
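The report does not name its boosting algorithm; assuming gradient boosting, a minimal sketch on synthetic data that compares train and test accuracy (the gap is what the overfitting discussion above is about) looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

gb = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
train_acc = gb.score(X_tr, y_tr)
test_acc = gb.score(X_te, y_te)
print(round(train_acc, 3), round(test_acc, 3))
```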

Figure 39

• The confusion matrix is plotted above; the numbers of true positives and true negatives are
285 and 96 on the test set, and the area under the ROC curve is 0.897291.

Figure 40

• The above plot depicts the ROC curve; the model with the highest area under the ROC
curve performs best.
• To optimise the performance of the model we perform hyperparameter tuning using grid
search.
• We tried multiple parameters and identified the best combination to improve model
performance.
0.8318777292576419

              precision    recall  f1-score   support

           0       0.89      0.88      0.88       328
           1       0.70      0.72      0.71       130

    accuracy                           0.83       458
   macro avg       0.79      0.80      0.80       458
weighted avg       0.83      0.83      0.83       458

• Using those parameters we checked the performance on the test data: there is no change in
the accuracy (83%), and the f1-scores also remained the same.

Figure 41

• The confusion matrix is plotted above; true positives and true negatives are 287 and 94 on
the test set, and the area under the ROC curve improved from 0.897291 to 0.904245 after
hyperparameter tuning. There is a slight difference in the confusion matrix as well.

Figure 42

Model comparison:
• The visualisations below depict the performance of the models on multiple metrics.

Figure 43

• From the above graph we can infer that the Bagging and Boosting models cover a larger
area under the ROC curve than the other models.

Figure 44  Figure 45

Figure 46

• The above graphs display the test accuracy, f1-score and precision of all the models;
Bagging clearly performs well compared with the other models.
• Considering all the metrics (accuracy, confusion matrix, ROC curve, f1-score, precision
and ROC AUC score), the best performing model is Bagging, which is the best suited model
for the given dataset.
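The comparison above can be tabulated directly from the test-set numbers quoted earlier in this report (rounded test accuracy and ROC AUC); sorting by accuracy and then AUC, the same conclusion falls out:

```python
import pandas as pd

# Test-set metrics quoted in this report, rounded
scores = pd.DataFrame({
    "model": ["Logistic Regression", "LDA", "KNN", "Naive Bayes",
              "Random Forest", "Bagging", "Boosting"],
    "test_accuracy": [0.82, 0.82, 0.83, 0.83, 0.82, 0.84, 0.83],
    "roc_auc":       [0.882, 0.885, 0.871, 0.885, 0.890, 0.897, 0.904],
})

# Rank by test accuracy first, ROC AUC as the tie-breaker
best = scores.sort_values(["test_accuracy", "roc_auc"], ascending=False).iloc[0]
print(best["model"])
```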

Business Insights:
• The model's prediction of which party a voter will choose depends strongly on the voter's
attitude toward European integration, the voter's age, and the assessments of the
Conservative and Labour party leaders.
• A voter is likely to vote for the Conservative party if he/she has high political knowledge
and low Eurosceptic sentiment.

• A voter is likely to vote for the Labour party if he/she has low political knowledge and
high Eurosceptic sentiment.
• Older people with lower Eurosceptic sentiment tend to vote for the Labour party, while
younger people with lower Eurosceptic sentiment are voting for the Conservative party.

Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk
in Python. We will be looking at the following speeches of the Presidents of the
United States of America:

1. President Franklin D. Roosevelt in 1941


2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned
documents.

2.2 Remove all the stopwords from all three speeches.


2.3 Which word occurs the most in each president's inaugural address? Mention the top three
words (after removing the stopwords).
2.4 Plot the word cloud for each of the speeches (after removing the stopwords).

Solution:
• We imported the necessary libraries and loaded the speeches of the three presidents.
• The number of characters across the three documents is 25180.
• The number of words is 5110.
• The number of sentences is 189.

The total number of characters are 25180
The total number of words are 5110
The total number of sentences are 189
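The report computes these counts on the nltk inaugural corpus; as a self-contained stand-in (the corpus needs a download step), the same character/word/sentence counts can be sketched on any raw string, using whitespace tokenisation and punctuation-run sentence splitting as simplifying assumptions:

```python
import re

def text_stats(text):
    """Return (characters, words, sentences) for a raw string."""
    chars = len(text)
    words = len(text.split())                    # whitespace tokenisation
    sentences = len(re.findall(r"[.!?]+", text)) # runs of end punctuation
    return chars, words, sentences

# Illustrative sample, not a quote from the speeches
sample = "We renew our dedication. The nation endures. Shall we falter?"
print(text_stats(sample))
```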

• We then removed all the stop words from the documents.

Number of words after removing stop words: 2165

The sample sentence after removing stop words:
national day inauguration since 1789 people renewed sense dedication United States

• The top three words after removing stop words in the speech of President Franklin D.
Roosevelt are 'Know', 'Spirit' and 'Life'.
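The top-word step can be sketched with a `Counter` after stopword filtering; a tiny built-in stopword set and a made-up sentence stand in for `nltk.corpus.stopwords` and the actual speeches:

```python
from collections import Counter

# Tiny stand-in for nltk's English stopword list
stopwords = {"the", "of", "a", "to", "and", "in", "we", "our", "is", "its"}

# Illustrative text, not a quote from the speeches
text = ("the spirit of the nation lives in the spirit of people "
        "and the life we know is the life we choose")

tokens = [w for w in text.split() if w not in stopwords]
top3 = Counter(tokens).most_common(3)  # (word, count) pairs, most frequent first
print(top3)
```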

Figure 47

• The top three words after removing stop words in the speech of President John F. Kennedy
are 'US', 'World' and 'Let'.

Figure 48

• The top words after removing stop words in the speech of President Richard Nixon are
'Let', 'US', 'America' and 'Peace'.

Figure 49

• The top three words for all three speeches combined are 'US', 'America' and 'World'.

Figure 50
• The word clouds for the three speeches are shown below.

Figure 51 Figure 52

Figure 53

