
Machine Learning Business Report

Prepared by:
Mitesh Kumar Agrawal
DSBA June’21



Table of Contents:
List of Tables
List of Figures

Problem 1
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test.
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?

Problem 2
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words (after removing the stopwords).



List of Tables
Table1: Sample Data
Table2: Data Info
Table3: Null Value Check
Table4: Data Summary
Table5: Skewness
Table6: Sample data after encoding
Table7: Data sample after scaling
Table8: Logistic Regression Classification Report on Train
Table9: Logistic Regression Classification Report on Test
Table10: LDA Classification report on Train and Test
Table11: NB classification report on Train
Table12: NB classification report on Test
Table13: KNN Classification report on Train
Table14: KNN Classification report on Test
Table15: Bagging Classification report on Train
Table16: Bagging Classification report on Test
Table17: Ada Boosting Classification report on Train
Table18: Ada Boosting Classification report on Test
Table19: Gradient Boosting Classification report on Train
Table20: Gradient Boosting Classification report on Test



List of Figures
Fig1: Distribution of numerical variables
Fig2: Vote and gender
Fig3: Vote and age
Fig4: Vote and economic.cond.national
Fig5: Vote and economic.cond.household
Fig6: Vote and Blair
Fig7: Vote and Hague
Fig8: Vote and Europe
Fig9: Vote and political.knowledge
Fig10: Pairplot
Fig11: Outlier check
Fig12: Bagging
Fig13: Boosting
Fig14: Logistic Regression Confusion Matrix on Train
Fig15: Logistic Regression Confusion Matrix on Test
Fig16: Logistic Regression ROC curve and ROC_AUC score on Train
Fig17: Logistic Regression ROC curve and ROC_AUC score on Test
Fig18: LDA Confusion Matrix on Train and Test
Fig19: LDA ROC curve and ROC_AUC score on Train and Test
Fig20: Naive Bayes Confusion matrix on Train
Fig21: Naive Bayes Confusion matrix on Test
Fig22: NB ROC curve and ROC_AUC score on Train
Fig23: NB ROC curve and ROC_AUC score on Test
Fig24: KNN Confusion matrix on Train
Fig25: KNN Confusion matrix on Test
Fig26: KNN ROC curve and ROC_AUC score on Train
Fig27: KNN ROC curve and ROC_AUC score on Test
Fig28: Bagging Confusion matrix on Train
Fig29: Bagging Confusion matrix on Test
Fig30: Bagging ROC curve and ROC_AUC score on Train
Fig31: Bagging ROC curve and ROC_AUC score on Test
Fig32: Ada Boosting Confusion matrix on Train
Fig33: Ada Boosting Confusion matrix on Test
Fig34: Ada Boosting ROC curve and ROC_AUC score on Train
Fig35: Ada Boosting ROC curve and ROC_AUC score on Test
Fig36: Gradient Boosting Confusion matrix on Train
Fig37: Gradient Boosting Confusion matrix on Test
Fig38: Gradient Boosting ROC curve and ROC_AUC score on Train
Fig39: Gradient Boosting ROC curve and ROC_AUC score on Test



Problem 1:
You are hired by CNBE, one of the leading news channels, which wants to analyze the recent elections. A survey was
conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the
basis of the given information, in order to create an exit poll that will help predict the overall win and the seats covered
by a particular party.

Data Ingestion
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference
on it
Sample Data:

Table1: Sample Data

Data Info:

Table2: Data Info

Inferences:

 The dataset consists of 1525 voters and 9 columns, of which 2 (vote and gender) are of object datatype and the
remaining 7 are of integer datatype



Null Value Check:

Table3: Null Value Check

As we can see from the above, there are no null values anywhere in the dataset.
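A minimal sketch of how these checks can be reproduced in pandas (the file name Election_Data.xlsx is an assumption; adjust it to the actual source):

```python
import pandas as pd

# Load the survey data (file name is assumed)
df = pd.read_excel("Election_Data.xlsx")

print(df.head())                    # sample data (Table1)
df.info()                           # datatypes and non-null counts (Table2)
print(df.isnull().sum())            # null value check (Table3)
print(df.describe(include="all"))   # descriptive statistics (Table4)
```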

Data Summary:

Table4: Data Summary

 The Vote column consists of 2 unique values, of which the 'Labour' party is the most frequent.
 The minimum and maximum ages of voters are 24 and 93 respectively, while the average age is 54.
 Female voters outnumber male voters.

Skewness:

Table5: Skewness



Skewness assesses the extent to which a variable's distribution is asymmetric. Its value is unbounded, but values
between -1 and +1 are generally taken to indicate an acceptably symmetric distribution. Here two variables, age and
Hague, are positively (right) skewed, and the rest are negatively (left) skewed.
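A one-line sketch for producing the skewness table above:

```python
# Skewness of the numeric columns (Table5); values within roughly ±1 indicate near-symmetry
print(df.skew(numeric_only=True))
```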

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers
Univariate Analysis
Categorical Variable:

Observations-

 There are more voters who support the Labour party than the Conservative party
 The number of female voters is slightly higher than the number of male voters

Numerical Variable:

Fig1: Distribution of numerical variables


Observations-

 All the numerical variables are approximately normally distributed, with some multimodal behaviour
 Outliers are present in two variables: economic.cond.household and economic.cond.national
 In a few of the boxplots, the min and max values are not clearly visible

Bivariate Analysis
Categorical variable:

 Vote and gender

Fig2: vote and gender

Female voters outnumber male voters for both parties, Labour and Conservative

Numerical Variable:

 Vote and age

Fig3: Vote and age


Voters above roughly 65 years of age mostly support the Conservative party
 vote and economic.cond.national

Fig4: vote and economic.cond.national


National economic condition ratings are slightly better among Labour voters than among Conservative voters

 vote and economic.cond.household

Fig5: vote and economic.cond.household

Household economic condition ratings are slightly better among Labour voters than among Conservative voters



 vote and Blair

Fig6: vote and Blair


Voters for the Labour party most commonly rate its leader, Blair, at 4

 vote and Hague

Fig7: vote and Hague


Voters for the Conservative party most commonly rate its leader, Hague, at 4.

 vote and Europe

Fig8: vote and Europe


Voters who support the Conservative party are more Eurosceptic



 vote and political.knowledge

Fig9: vote and political.knowledge


Labour party supporters have more knowledge of parties' positions on European integration
 Pairplot

Fig10: Pairplot
The pairplot for this dataset does not give us a clear idea of the relationships between the variables.

Outlier Check:

Fig11: Outlier check

From the plot we can see that an outlier is present in both economic.cond.household and economic.cond.national.
Since only one outlier is present in each variable, and on the lower end at that, we will not do outlier treatment.
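A sketch of an IQR-based outlier count that would back up this observation (the 1.5 × IQR rule is the usual convention):

```python
# Count IQR-based outliers per numeric column
num = df.select_dtypes(include="number")
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
outliers = ((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).sum()
print(outliers)
```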

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split
the data into train and test
Encoding the Data

Machine learning models do not work with string values, so we need to encode the categorical variables.

From the data info we have two categorical variables: vote and gender.

Vote:

Gender:

Since both variables have only two values and there is no level or order among the subcategories, either encoding
(label encoding or one-hot encoding) will give the same result.

So here we will use label encoding.
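A minimal encoding sketch (the column names 'vote' and 'gender' are assumptions; match them to the actual dataset):

```python
# Label-encode the two binary categorical columns
for col in ["vote", "gender"]:
    df[col] = df[col].astype("category").cat.codes
```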


Sample data after encoding:

Table6: Sample data after encoding

Scaling the dataset

Scaling brings variables that span widely different ranges onto a similar relative scale, which can improve model
performance.

Feature scaling is recommended for distance-based and gradient-based algorithms (e.g. KNN, logistic regression),
since they are very sensitive to the range of the data points. Tree-based methods generally do not require scaling
because they split on one feature at a time.

Here we will perform scaling for both types of models.

Since most of the variables are in the range of 0-10 except age, we will scale only the age variable, using the
Z-score method.
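A sketch of that single-column scaling step:

```python
from scipy.stats import zscore

# Z-score scale only 'age'; the other predictors already lie in a 0-10 range
df["age"] = zscore(df["age"])
```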

Data sample after scaling:

Table7: Data sample after scaling



Data Split: Splitting the data into Train and Test

We need to split the data into train and test so that we can compare the model's performance on data it was fitted on
against data it has not seen. A 70:30 train:test ratio is a common choice, and that is the split used here.

Shape of the data after splitting, where X refers to the independent variables and y to the dependent/target variable:

Train: X (1067, 8), y (1067, 1)

Test: X (458, 8), y (458, 1)
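A hedged sketch of the split (random_state and stratify are assumptions for reproducibility and class balance):

```python
from sklearn.model_selection import train_test_split

X = df.drop("vote", axis=1)   # 8 predictors
y = df["vote"]                # target

# 70:30 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (1067, 8) (458, 8)
```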

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).


Logistic Regression Model

Logistic regression models the probability of a discrete outcome given input variables. The most common form models
a binary outcome: something that can take two values such as true/false or yes/no. Multinomial logistic regression can
model scenarios where there are more than two possible discrete outcomes. Logistic regression is a useful method for
classification problems, where you are trying to determine which category a new sample best fits into.
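A minimal sketch of the fit, using the variables from the split in 1.3 (max_iter is an added assumption to ensure convergence):

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# For classifiers, .score() returns mean accuracy, not R-squared
print("Train score:", lr.score(X_train, y_train))
print("Test score:", lr.score(X_test, y_test))
```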

Feature importance after applying logistic regression

Score (mean accuracy) on Train:



Score (mean accuracy) on Test:

LDA(Linear Discriminant Analysis)

Linear Discriminant Analysis, as its name suggests, is a linear model for classification and dimensionality reduction. It
is most commonly used for feature extraction in pattern-classification problems.

Accuracy on Train: 0.84

Accuracy on Test: 0.82
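A hedged sketch of the LDA fit, under the same train/test variables as before:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("Train accuracy:", lda.score(X_train, y_train))  # ~0.84 reported above
print("Test accuracy:", lda.score(X_test, y_test))     # ~0.82 reported above
```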

Inferences:

 Both models performed well on the train and test datasets.
 There is no underfitting or overfitting, as the train and test accuracies are very close.

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN Model

 K-NN is a non-parametric algorithm, which means it makes no assumption about the underlying data.
 It is also called a lazy-learner algorithm because it does not learn from the training set immediately; instead
it stores the dataset and performs its computation at classification time.
 At the training phase KNN just stores the dataset; when it receives new data, it classifies that data into the
category most similar to it (a sketch follows this list).
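A minimal KNN sketch under the same assumptions as earlier (k = 5 is the scikit-learn default and an assumption here; the value actually tuned in the report is not shown):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Train accuracy:", knn.score(X_train, y_train))
print("Test accuracy:", knn.score(X_test, y_test))
```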

Accuracy on train data: 0.87

Accuracy on test data: 0.80

Naive Bayes Model

The Naive Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving
classification problems.

Bayes' Theorem:

 Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability of a
hypothesis given prior knowledge. It depends on conditional probability.
 The formula for Bayes' theorem is given below.
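The figure carrying the formula did not survive extraction; Bayes' theorem states

P(A|B) = [P(B|A) × P(A)] / P(B)

where P(A|B) is the posterior probability of hypothesis A given evidence B, P(B|A) the likelihood, P(A) the prior, and P(B) the evidence. A minimal model sketch, assuming the Gaussian variant commonly used for numeric features:

```python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
print("Train accuracy:", nb.score(X_train, y_train))  # ~0.83 reported below
print("Test accuracy:", nb.score(X_test, y_test))     # ~0.83 reported below
```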



Accuracy on train data: 0.83

Accuracy on test data: 0.83

Inferences:

 The results of the KNN model show that the accuracies on the train and test data are far apart.
 The results of the Naive Bayes model show that the accuracy on train and test is the same, but it is lower than
that of the Logistic Regression model.
 So we can conclude that these models didn't perform particularly well.

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting
Tuning is the process of maximizing a model's performance without overfitting or creating too high a variance. In
machine learning, this is accomplished by selecting appropriate 'hyperparameters'.

Models such as Bagging, Ada Boosting and Gradient Boosting are comparatively resistant to overfitting.
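The exact hyperparameter grids used in the report are not shown; a hedged GridSearchCV sketch for a random forest, with a purely illustrative grid:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values actually tuned in the report are not shown
param_grid = {"n_estimators": [100, 300], "max_depth": [4, 6, 8]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```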

Bagging Model (Using Random Forest)

Bootstrap Aggregating, also known as bagging, is a machine-learning ensemble meta-algorithm designed to improve
the stability and accuracy of machine-learning algorithms used in statistical classification and regression. It decreases
variance and helps avoid overfitting. It is usually applied to decision-tree methods. Bagging is a special case of
the model-averaging approach.
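A sketch of bagging with a random-forest base learner, per the assignment brief (n_estimators is an assumption; older scikit-learn versions spell the keyword base_estimator instead of estimator):

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

bag = BaggingClassifier(estimator=RandomForestClassifier(random_state=42),
                        n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
print("Train accuracy:", bag.score(X_train, y_train))  # ~0.97 reported below
print("Test accuracy:", bag.score(X_test, y_test))     # ~0.84 reported below
```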

Fig12: Bagging



Accuracy of Bagging Model on Train: 0.97

Accuracy of Bagging Model on Test: 0.84

Boosting

Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak
classifiers. It does this by building models in series: first, a model is built from the training data; then a second
model is built which tries to correct the errors of the first. This procedure continues, adding models until either
the complete training dataset is predicted correctly or the maximum number of models is reached.

Fig13: Boosting

There are many types of boosting, of which we have used two:

AdaBoost minimises a loss function related to classification error and is best used with weak learners. The method
was designed mainly for binary classification problems and can be used to boost the performance of decision trees.
Gradient Boosting optimises a differentiable loss function and can be used for both
classification and regression problems.
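A combined sketch of both boosting fits (n_estimators and random_state are assumptions):

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

ada = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
gbm = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

for name, model in [("AdaBoost", ada), ("Gradient Boosting", gbm)]:
    print(name,
          "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```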

 ADA Boosting model

Accuracy on Train: 0.85


Accuracy on Test: 0.82

 Gradient Boosting model


Accuracy on Train: 0.89
Accuracy on Test: 0.83

Inferences:



 From the accuracies, none of these ensemble models generalizes particularly well on this dataset.
 All of these models are overfitted to the train data, with train accuracy noticeably higher than test accuracy.

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models
and write inference which model is best/optimized.
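For each model below, the metrics were produced with a pattern like the following sketch (shown for the logistic-regression model fitted in 1.4; the helper name evaluate is hypothetical):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

def evaluate(model, X, y, label):
    """Print accuracy, confusion matrix and classification report; add an ROC curve."""
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]
    print(label, "accuracy:", accuracy_score(y, pred))
    print(confusion_matrix(y, pred))
    print(classification_report(y, pred))
    fpr, tpr, _ = roc_curve(y, prob)
    plt.plot(fpr, tpr, label=f"{label} (AUC = {roc_auc_score(y, prob):.3f})")

evaluate(lr, X_train, y_train, "Train")
evaluate(lr, X_test, y_test, "Test")
plt.xlabel("False positive rate"); plt.ylabel("True positive rate")
plt.legend(); plt.show()
```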
1. Logistic Regression
Confusion Matrix on Train

Fig14: Logistic Regression Confusion Matrix on Train

Confusion Matrix on Test

Fig15: Logistic Regression Confusion Matrix on Test

Classification Report on Train



Table8: Logistic Regression Classification Report on Train

Classification Report on Test

Table9: Logistic Regression Classification Report on Test

ROC curve and ROC_AUC score on Train

Fig16: Logistic Regression ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test



Fig17: Logistic Regression ROC curve and ROC_AUC score on Test

2. LDA
Confusion Matrix on Train and test

Fig18: LDA Confusion Matrix on Train and Test

Classification Report on Train and Test



Table10: LDA Classification report on Train and test

ROC curve and ROC_AUC score on Train and Test

Fig19: LDA ROC curve and ROC_AUC score on Train and Test

3. Naive Bayes



Confusion Matrix on Train

Fig20: Naive Bayes Confusion matrix on Train

Confusion Matrix on Test

Fig21: NB Confusion matrix on Test

Classification Report on Train

Table11: NB classification report on train



Classification Report on Test

Table12: NB classification report on test

ROC curve and ROC_AUC score on Train

Fig22: NB ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test

Fig23: NB ROC curve and ROC_AUC score on Test



4. KNN
Confusion Matrix on Train

Fig24: KNN Confusion matrix on Train

Confusion Matrix on Test

Fig25: KNN Confusion matrix on Test

Classification Report on Train

Table13: KNN Classification report on Train


Classification Report on Test

Table14: KNN Classification report on Test

ROC curve and ROC_AUC score on Train

Fig26: KNN ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test

Fig27: KNN ROC curve and ROC_AUC score on Test


5. Bagging
Confusion Matrix on Train

Fig28: Bagging Confusion matrix on Train

Confusion Matrix on Test

Fig29: Bagging Confusion matrix on Test

Classification Report on Train

Table15: Bagging Classification report on Train



Classification Report on Test

Table16: Bagging Classification report on Test

ROC curve and ROC_AUC score on Train

Fig30: Bagging ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test

Fig31: Bagging ROC curve and ROC_AUC score on Test


6. Ada Boosting
Confusion Matrix on Train

Fig32: Ada Boosting Confusion matrix on train

Confusion Matrix on Test

Fig33: Ada Boosting Confusion matrix on test

Classification Report on Train

Table17: Ada Boosting Classification report on Train



Classification Report on Test

Table18: Ada Boosting Classification report on Test

ROC curve and ROC_AUC score on Train

Fig34: Ada Boosting ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test

Fig35: Ada Boosting ROC curve and ROC_AUC score on Test



7. Gradient Boosting
Confusion Matrix on Train

Fig36: Gradient Boosting Confusion matrix on Train

Confusion Matrix on Test

Fig37: Gradient Boosting Confusion matrix on Test

Classification Report on Train

Table19: Gradient Boosting Classification report on Train


Classification Report on Test

Table20: Gradient Boosting Classification report on Test

ROC curve and ROC_AUC score on Train

Fig38: Gradient Boosting ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test

Fig39: Gradient Boosting ROC curve and ROC_AUC score on Test


Model Comparison

Models are compared on the basis of:

 Accuracy
 AUC
 Recall
 Precision
 F1-Score

From the above we can say that:

 On the basis of Accuracy: Logistic Regression, LDA and Naive Bayes performed well
 On the basis of AUC: Logistic Regression and LDA performed well
 On the basis of Recall: Bagging performed well
 On the basis of Precision: Logistic Regression, LDA and Naive Bayes performed well
 On the basis of F1-Score: Logistic Regression performed well

Logistic Regression clearly performed well across multiple metrics, so the best/most optimized model is the
Logistic Regression model.

1.8 Based on these predictions, what are the insights?


Inferences
Top 5 features in Logistic Regression in order of decreasing importance (by absolute coefficient):

Hague: -0.8379785998010736
Blair: 0.5746328767419848
political.knowledge: -0.48267335884793466
economic.cond.national: 0.3375643790321967
age: -0.3246813099750099

Insights and Recommendations

 The Logistic Regression model performs best on this particular dataset

 Model tuning using hyperparameters could help optimize the models' performance
 Bagging performs well on both the train and test datasets, notably on recall
 Increasing the volume of the train dataset would help the models during training, and thus better
predictions could be expected.

Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python. We will be
looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned documents.
Characters:
1. President Franklin D. Roosevelt in 1941: 7571
2. President John F. Kennedy in 1961: 7618
3. President Richard Nixon in 1973: 9991

Words:
1. President Franklin D. Roosevelt in 1941: 1536
2. President John F. Kennedy in 1961: 1546
3. President Richard Nixon in 1973: 2028

Sentences:
1. President Franklin D. Roosevelt in 1941: 68
2. President John F. Kennedy in 1961: 52
3. President Richard Nixon in 1973: 69
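These counts can be reproduced from the inaugural corpus in NLTK; a minimal sketch (the file IDs are the standard NLTK corpus ones):

```python
import nltk
from nltk.corpus import inaugural
nltk.download("inaugural")
nltk.download("punkt")

for doc in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    print(doc,
          "characters:", len(inaugural.raw(doc)),
          "words:", len(inaugural.words(doc)),
          "sentences:", len(inaugural.sents(doc)))
```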

2.2 Remove all the stopwords from all three speeches


 A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to
ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
We would not want these words to take up space in our database or valuable processing time, so we can remove them
easily by storing a list of words that we consider stop words. NLTK (Natural Language Toolkit) in Python has stopword
lists stored for 16 different languages.

Stop words were removed from all three speeches.
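A sketch of the removal, assuming NLTK's English stopword list (the isalpha filter is an added assumption, used here to drop punctuation tokens):

```python
import nltk
from nltk.corpus import inaugural, stopwords
nltk.download("stopwords")

stop = set(stopwords.words("english"))
cleaned = {doc: [w for w in inaugural.words(doc)
                 if w.isalpha() and w.lower() not in stop]
           for doc in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]}
```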

2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the
top three words (after removing the stopwords).
Results after removing stopwords:
1. President Franklin D. Roosevelt in 1941

Top 3 words: e, n, r

2. President John F. Kennedy in 1961

Top 3 words: us, world, Let


3. President Richard Nixon in 1973

Top 3 words: us, America, peace
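The 'e, n, r' result for the Roosevelt speech suggests the frequency count there ran over characters rather than words; a word-level count with NLTK's FreqDist, continuing from the cleaned lists built in 2.2, would look like this:

```python
from nltk.probability import FreqDist

# 'cleaned' comes from the stopword-removal sketch in section 2.2
for doc, words in cleaned.items():
    print(doc, FreqDist(words).most_common(3))
```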

