
Machine Learning Business Report

Prepared by:
Mitesh Kumar Agrawal
DSBA June’21



Table of Contents:
List of Tables
List of Figures

Problem 1
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test.
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?

Problem 2
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words (after removing the stopwords).



List of Tables
Table1: Sample Data
Table2: Data Info
Table3: Null Value Check
Table4: Data Summary
Table5: Skewness
Table6: Sample data after encoding
Table7: Data sample after scaling
Table8: Logistic Regression Classification Report on Train
Table9: Logistic Regression Classification Report on Test
Table10: LDA Classification report on Train and Test
Table11: NB classification report on Train
Table12: NB classification report on Test
Table13: KNN Classification report on Train
Table14: KNN Classification report on Test
Table15: Bagging Classification report on Train
Table16: Bagging Classification report on Test
Table17: Ada Boosting Classification report on Train
Table18: Ada Boosting Classification report on Test
Table19: Gradient Boosting Classification report on Train
Table20: Gradient Boosting Classification report on Test



List of Figures
Fig1: Distribution of numerical variables
Fig2: Vote and gender
Fig3: Vote and age
Fig4: Vote and economic.cond.national
Fig5: Vote and economic.cond.household
Fig6: Vote and Blair
Fig7: Vote and Hague
Fig8: Vote and Europe
Fig9: Vote and political.knowledge
Fig10: Pairplot
Fig11: Outlier check
Fig12: Bagging
Fig13: Boosting
Fig14: Logistic Regression Confusion Matrix on Train
Fig15: Logistic Regression Confusion Matrix on Test
Fig16: Logistic Regression ROC curve and ROC_AUC score on Train
Fig17: Logistic Regression ROC curve and ROC_AUC score on Test
Fig18: LDA Confusion Matrix on Train and Test
Fig19: LDA ROC curve and ROC_AUC score on Train and Test
Fig20: Naive Bayes Confusion matrix on Train
Fig21: Naive Bayes Confusion matrix on Test
Fig22: NB ROC curve and ROC_AUC score on Train
Fig23: NB ROC curve and ROC_AUC score on Test
Fig24: KNN Confusion matrix on Train
Fig25: KNN Confusion matrix on Test
Fig26: KNN ROC curve and ROC_AUC score on Train
Fig27: KNN ROC curve and ROC_AUC score on Test
Fig28: Bagging Confusion matrix on Train
Fig29: Bagging Confusion matrix on Test
Fig30: Bagging ROC curve and ROC_AUC score on Train
Fig31: Bagging ROC curve and ROC_AUC score on Test
Fig32: Ada Boosting Confusion matrix on Train
Fig33: Ada Boosting Confusion matrix on Test
Fig34: Ada Boosting ROC curve and ROC_AUC score on Train
Fig35: Ada Boosting ROC curve and ROC_AUC score on Test
Fig36: Gradient Boosting Confusion matrix on Train
Fig37: Gradient Boosting Confusion matrix on Test
Fig38: Gradient Boosting ROC curve and ROC_AUC score on Train
Fig39: Gradient Boosting ROC curve and ROC_AUC score on Test



Problem 1:
You are hired by CNBE, one of the leading news channels, which wants to analyze the recent elections. A survey was
conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the
basis of the given information, in order to create an exit poll that will help predict the overall win and the seats covered
by a particular party.

Data Ingestion
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference
on it
Sample Data:

Table1: Sample Data

Data Info:

Table2: Data Info

Inferences:

 The dataset consists of 1525 voters and 9 columns, of which 2 (vote and gender) are of object datatype and the
remaining 7 are of integer datatype



Null Value Check:

Table3: Null Value Check

As we can see from the above, there are no null values anywhere in the dataset.
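A minimal sketch of how these checks can be reproduced in pandas (the file name Election_Data.xlsx is an assumption; adjust it to the actual source):

```python
import pandas as pd

# Load the survey data (file name is assumed)
df = pd.read_excel("Election_Data.xlsx")

print(df.head())                    # sample data (Table1)
df.info()                           # datatypes and non-null counts (Table2)
print(df.isnull().sum())            # null value check (Table3)
print(df.describe(include="all"))   # descriptive statistics (Table4)
```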

Data Summary:

Table4: Data Summary

 The Vote column consists of 2 unique values, of which the 'Labour' party is the most frequent.
 The minimum and maximum ages of voters are 24 and 93 respectively, while the average age is 54.
 Female voters outnumber male voters.

Skewness:

Table5: Skewness



Skewness assesses the extent to which a variable's distribution is asymmetric. Its value is unbounded, but values
between -1 and +1 are generally taken to indicate an acceptably symmetric distribution. Here two variables, age and
Hague, are positively (right) skewed, and the rest are negatively (left) skewed.
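A one-line sketch for producing the skewness table above:

```python
# Skewness of the numeric columns (Table5); values within roughly ±1 indicate near-symmetry
print(df.skew(numeric_only=True))
```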

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers
Univariate Analysis
Categorical Variable:

Observations-

 There are more voters who support the Labour party than the Conservative party
 The number of female voters is slightly higher than the number of male voters

Numerical Variable:

Fig1: Distribution of numerical variables


Observations-

 All the numerical variables are approximately normally distributed, with some multimodal behaviour
 Outliers are present in two variables: economic.cond.household and economic.cond.national
 In a few of the boxplots, the min and max values are not clearly visible

Bivariate Analysis
Categorical variable:

 Vote and gender

Fig2: vote and gender

Female voters outnumber male voters for both parties, Labour and Conservative

Numerical Variable:

 Vote and age

Fig3: Vote and age


Voters above roughly 65 years of age mostly support the Conservative party
 vote and economic.cond.national

Fig4: vote and economic.cond.national


National economic condition ratings are slightly better among Labour voters than among Conservative voters

 vote and economic.cond.household

Fig5: vote and economic.cond.household

Household economic condition ratings are slightly better among Labour voters than among Conservative voters



 vote and Blair

Fig6: vote and Blair


Voters for the Labour party most commonly rate its leader, Blair, at 4

 vote and Hague

Fig7: vote and Hague


Voters for the Conservative party most commonly rate its leader, Hague, at 4.

 vote and Europe

Fig8: vote and Europe


Voters who support the Conservative party are more Eurosceptic



 vote and political.knowledge

Fig9: vote and political.knowledge


Labour party supporters have more knowledge of parties' positions on European integration
 Pairplot

Fig10: Pairplot
The pairplot for this dataset does not give us a clear idea of the relationships between the variables.

Outlier Check:

Fig11: Outlier check

From the plot we can see that an outlier is present in both economic.cond.household and economic.cond.national.
Since only one outlier is present in each variable, and on the lower end at that, we will not do outlier treatment.
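A sketch of an IQR-based outlier count that would back up this observation (the 1.5 × IQR rule is the usual convention):

```python
# Count IQR-based outliers per numeric column
num = df.select_dtypes(include="number")
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
outliers = ((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).sum()
print(outliers)
```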

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split
the data into train and test
Encoding the Data

Machine learning models do not work with string values, so we need to encode the categorical variables.

From the data info we have two categorical variables: vote and gender.

Vote:

Gender:

Since both variables have only two values and there is no level or order among the subcategories, either encoding
(label encoding or one-hot encoding) will give the same result.

So here we will use label encoding.
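A minimal encoding sketch (the column names 'vote' and 'gender' are assumptions; match them to the actual dataset):

```python
# Label-encode the two binary categorical columns
for col in ["vote", "gender"]:
    df[col] = df[col].astype("category").cat.codes
```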


Sample data after encoding:

Table6: Sample data after encoding

Scaling the dataset

Scaling brings variables that span widely different ranges onto a similar relative scale, which can improve model
performance.

Feature scaling is recommended for distance-based and gradient-based algorithms (e.g. KNN, logistic regression),
since they are very sensitive to the range of the data points. Tree-based methods generally do not require scaling
because they split on one feature at a time.

Here we will perform scaling for both types of models.

Since most of the variables are in the range of 0-10 except age, we will scale only the age variable, using the
Z-score method.
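A sketch of that single-column scaling step:

```python
from scipy.stats import zscore

# Z-score scale only 'age'; the other predictors already lie in a 0-10 range
df["age"] = zscore(df["age"])
```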

Data sample after scaling:

Table7: Data sample after scaling



Data Split: Splitting the data into Train and Test

We need to split the data into train and test so that we can compare the model's performance on data it was fitted on
against data it has not seen. A 70:30 train:test ratio is a common choice, and that is the split used here.

Shape of the data after splitting, where X refers to the independent variables and y to the dependent/target variable:

Train: X (1067, 8), y (1067, 1)

Test: X (458, 8), y (458, 1)
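A hedged sketch of the split (random_state and stratify are assumptions for reproducibility and class balance):

```python
from sklearn.model_selection import train_test_split

X = df.drop("vote", axis=1)   # 8 predictors
y = df["vote"]                # target

# 70:30 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (1067, 8) (458, 8)
```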

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).


Logistic Regression Model

Logistic regression models the probability of a discrete outcome given input variables. The most common form models
a binary outcome: something that can take two values such as true/false or yes/no. Multinomial logistic regression can
model scenarios where there are more than two possible discrete outcomes. Logistic regression is a useful method for
classification problems, where you are trying to determine which category a new sample best fits into.
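A minimal sketch of the fit, using the variables from the split in 1.3 (max_iter is an added assumption to ensure convergence):

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# For classifiers, .score() returns mean accuracy, not R-squared
print("Train score:", lr.score(X_train, y_train))
print("Test score:", lr.score(X_test, y_test))
```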

Feature importance after applying logistic regression

Score (mean accuracy) on Train:



Score (mean accuracy) on Test:

LDA(Linear Discriminant Analysis)

Linear Discriminant Analysis, as its name suggests, is a linear model for classification and dimensionality reduction. It
is most commonly used for feature extraction in pattern-classification problems.

Accuracy on Train: 0.84

Accuracy on Test: 0.82
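A hedged sketch of the LDA fit, under the same train/test variables as before:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("Train accuracy:", lda.score(X_train, y_train))  # ~0.84 reported above
print("Test accuracy:", lda.score(X_test, y_test))     # ~0.82 reported above
```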

Inferences:

 Both models performed well on the train and test datasets.
 There is no underfitting or overfitting, as the train and test accuracies are very close.

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN Model

 K-NN is a non-parametric algorithm, which means it makes no assumption about the underlying data.
 It is also called a lazy-learner algorithm because it does not learn from the training set immediately; instead
it stores the dataset and performs its computation at classification time.
 At the training phase KNN just stores the dataset; when it receives new data, it classifies that data into the
category most similar to it (a sketch follows this list).
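A minimal KNN sketch under the same assumptions as earlier (k = 5 is the scikit-learn default and an assumption here; the value actually tuned in the report is not shown):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Train accuracy:", knn.score(X_train, y_train))
print("Test accuracy:", knn.score(X_test, y_test))
```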

Accuracy on train data: 0.87

Accuracy on test data: 0.80

Naive Bayes Model

The Naive Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving
classification problems.

Bayes' Theorem:

 Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability of a
hypothesis given prior knowledge. It depends on conditional probability.
 The formula for Bayes' theorem is given below.
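The figure carrying the formula did not survive extraction; Bayes' theorem states

P(A|B) = [P(B|A) × P(A)] / P(B)

where P(A|B) is the posterior probability of hypothesis A given evidence B, P(B|A) the likelihood, P(A) the prior, and P(B) the evidence. A minimal model sketch, assuming the Gaussian variant commonly used for numeric features:

```python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
print("Train accuracy:", nb.score(X_train, y_train))  # ~0.83 reported below
print("Test accuracy:", nb.score(X_test, y_test))     # ~0.83 reported below
```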



Accuracy on train data: 0.83

Accuracy on test data: 0.83

Inferences:

 The results of the KNN model show that the accuracies on the train and test data are far apart.
 The results of the Naive Bayes model show that the accuracy on train and test is the same, but it is lower than
that of the Logistic Regression model.
 So we can conclude that these models didn't perform particularly well.

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting
Tuning is the process of maximizing a model's performance without overfitting or creating too high a variance. In
machine learning, this is accomplished by selecting appropriate 'hyperparameters'.

Models such as Bagging, Ada Boosting and Gradient Boosting are comparatively resistant to overfitting.
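The exact hyperparameter grids used in the report are not shown; a hedged GridSearchCV sketch for a random forest, with a purely illustrative grid:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values actually tuned in the report are not shown
param_grid = {"n_estimators": [100, 300], "max_depth": [4, 6, 8]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```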

Bagging Model (Using Random Forest)

Bootstrap Aggregating, also known as bagging, is a machine-learning ensemble meta-algorithm designed to improve
the stability and accuracy of machine-learning algorithms used in statistical classification and regression. It decreases
variance and helps avoid overfitting. It is usually applied to decision-tree methods. Bagging is a special case of
the model-averaging approach.
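A sketch of bagging with a random-forest base learner, per the assignment brief (n_estimators is an assumption; older scikit-learn versions spell the keyword base_estimator instead of estimator):

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

bag = BaggingClassifier(estimator=RandomForestClassifier(random_state=42),
                        n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
print("Train accuracy:", bag.score(X_train, y_train))  # ~0.97 reported below
print("Test accuracy:", bag.score(X_test, y_test))     # ~0.84 reported below
```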

Fig12: Bagging



Accuracy of Bagging Model on Train: 0.97

Accuracy of Bagging Model on Test: 0.84

Boosting

Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak
classifiers. It does this by building models in series: first, a model is built from the training data; then a second
model is built which tries to correct the errors of the first. This procedure continues, adding models until either
the complete training dataset is predicted correctly or the maximum number of models is reached.

Fig13: Boosting

There are many types of boosting, of which we have used two:

AdaBoost minimises a loss function related to classification error and is best used with weak learners. The method
was designed mainly for binary classification problems and can be used to boost the performance of decision trees.
Gradient Boosting optimises a differentiable loss function and can be used for both
classification and regression problems.
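A combined sketch of both boosting fits (n_estimators and random_state are assumptions):

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

ada = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
gbm = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

for name, model in [("AdaBoost", ada), ("Gradient Boosting", gbm)]:
    print(name,
          "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```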

 ADA Boosting model

Accuracy on Train: 0.85


Accuracy on Test: 0.82

 Gradient Boosting model


Accuracy on Train: 0.89
Accuracy on Test: 0.83

Inferences:



 From the accuracies, none of these ensemble models generalizes particularly well on this dataset.
 All of these models are overfitted to the train data, with train accuracy noticeably higher than test accuracy.

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models
and write inference which model is best/optimized.
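For each model below, the metrics were produced with a pattern like the following sketch (shown for the logistic-regression model fitted in 1.4; the helper name evaluate is hypothetical):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

def evaluate(model, X, y, label):
    """Print accuracy, confusion matrix and classification report; add an ROC curve."""
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]
    print(label, "accuracy:", accuracy_score(y, pred))
    print(confusion_matrix(y, pred))
    print(classification_report(y, pred))
    fpr, tpr, _ = roc_curve(y, prob)
    plt.plot(fpr, tpr, label=f"{label} (AUC = {roc_auc_score(y, prob):.3f})")

evaluate(lr, X_train, y_train, "Train")
evaluate(lr, X_test, y_test, "Test")
plt.xlabel("False positive rate"); plt.ylabel("True positive rate")
plt.legend(); plt.show()
```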
1. Logistic Regression
Confusion Matrix on Train

Fig14: Logistic Regression Confusion Matrix on Train

Confusion Matrix on Test

Fig15: Logistic Regression Confusion Matrix on Test

Classification Report on Train



Table8: Logistic Regression Classification Report on Train

Classification Report on Test

Table9: Logistic Regression Classification Report on Test

ROC curve and ROC_AUC score on Train

Fig16: Logistic Regression ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test



Fig17: Logistic Regression ROC curve and ROC_AUC score on Test

2. LDA
Confusion Matrix on Train and test

Fig18: LDA Confusion Matrix on Train and Test

Classification Report on Train and Test



Table10: LDA Classification report on Train and test

ROC curve and ROC_AUC score on Train and Test

Fig19: LDA ROC curve and ROC_AUC score on Train and Test

3. Naive Bayes



Confusion Matrix on Train

Fig20: Naive Bayes Confusion matrix on Train

Confusion Matrix on Test

Fig21: NB Confusion matrix on Test

Classification Report on Train

Table11: NB classification report on train



Classification Report on Test

Table12: NB classification report on test

ROC curve and ROC_AUC score on Train

Fig22: NB ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test

Fig23: NB ROC curve and ROC_AUC score on Test



4. KNN
Confusion Matrix on Train

Fig24: KNN Confusion matrix on Train

Confusion Matrix on Test

Fig25: KNN Confusion matrix on Test

Classification Report on Train

Table13: KNN Classification report on Train


Classification Report on Test

Table14: KNN Classification report on Test

ROC curve and ROC_AUC score on Train

Fig26: KNN ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test

Fig27: KNN ROC curve and ROC_AUC score on Test


5. Bagging
Confusion Matrix on Train

Fig28: Bagging Confusion matrix on Train

Confusion Matrix on Test

Fig29: Bagging Confusion matrix on Test

Classification Report on Train

Table15: Bagging Classification report on Train



Classification Report on Test

Table16: Bagging Classification report on Test

ROC curve and ROC_AUC score on Train

Fig30: Bagging ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test

Fig31: Bagging ROC curve and ROC_AUC score on Test


6. Ada Boosting
Confusion Matrix on Train

Fig32: Ada Boosting Confusion matrix on train

Confusion Matrix on Test

Fig33: Ada Boosting Confusion matrix on test

Classification Report on Train

Table17: Ada Boosting Classification report on Train



Classification Report on Test

Table18: Ada Boosting Classification report on Test

ROC curve and ROC_AUC score on Train

Fig34: Ada Boosting ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test

Fig35: Ada Boosting ROC curve and ROC_AUC score on Test



7. Gradient Boosting
Confusion Matrix on Train

Fig36: Gradient Boosting Confusion matrix on Train

Confusion Matrix on Test

Fig37: Gradient Boosting Confusion matrix on Test

Classification Report on Train

Table19: Gradient Boosting Classification report on Train


Classification Report on Test

Table20: Gradient Boosting Classification report on Test

ROC curve and ROC_AUC score on Train

Fig38: Gradient Boosting ROC curve and ROC_AUC score on Train

ROC curve and ROC_AUC score on Test

Fig39: Gradient Boosting ROC curve and ROC_AUC score on Test


Model Comparison

Models are compared on the basis of:

 Accuracy
 AUC
 Recall
 Precision
 F1-Score

From the above we can say that:

 On the basis of Accuracy: Logistic Regression, LDA and Naive Bayes performed well
 On the basis of AUC: Logistic Regression and LDA performed well
 On the basis of Recall: Bagging performed well
 On the basis of Precision: Logistic Regression, LDA and Naive Bayes performed well
 On the basis of F1-Score: Logistic Regression performed well

Logistic Regression clearly performed well across multiple metrics, so the best/most optimized model is the
Logistic Regression model.

1.8 Based on these predictions, what are the insights?


Inferences
Top 5 features in Logistic Regression in order of decreasing importance (by absolute coefficient):

Hague: -0.8379785998010736
Blair: 0.5746328767419848
political.knowledge: -0.48267335884793466
economic.cond.national: 0.3375643790321967
age: -0.3246813099750099

Insights and Recommendations

 The Logistic Regression model performs best on this particular dataset

 Model tuning using hyperparameters could help optimize the models' performance
 Bagging performs well on both the train and test datasets, notably on recall
 Increasing the volume of the train dataset would help the models during training, and thus better
predictions could be expected.

Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python. We will be
looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned documents.
Characters:
1. President Franklin D. Roosevelt in 1941: 7571
2. President John F. Kennedy in 1961: 7618
3. President Richard Nixon in 1973: 9991

Words:
1. President Franklin D. Roosevelt in 1941: 1536
2. President John F. Kennedy in 1961: 1546
3. President Richard Nixon in 1973: 2028

Sentences:
1. President Franklin D. Roosevelt in 1941: 68
2. President John F. Kennedy in 1961: 52
3. President Richard Nixon in 1973: 69
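These counts can be reproduced from the inaugural corpus in NLTK; a minimal sketch (the file IDs are the standard NLTK corpus ones):

```python
import nltk
from nltk.corpus import inaugural
nltk.download("inaugural")
nltk.download("punkt")

for doc in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    print(doc,
          "characters:", len(inaugural.raw(doc)),
          "words:", len(inaugural.words(doc)),
          "sentences:", len(inaugural.sents(doc)))
```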

2.2 Remove all the stopwords from all three speeches


 A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to
ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
We would not want these words to take up space in our database or valuable processing time, so we can remove them
easily by storing a list of words that we consider stop words. NLTK (Natural Language Toolkit) in Python has stopword
lists stored for 16 different languages.

Stop words were removed from all three speeches.
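A sketch of the removal, assuming NLTK's English stopword list (the isalpha filter is an added assumption, used here to drop punctuation tokens):

```python
import nltk
from nltk.corpus import inaugural, stopwords
nltk.download("stopwords")

stop = set(stopwords.words("english"))
cleaned = {doc: [w for w in inaugural.words(doc)
                 if w.isalpha() and w.lower() not in stop]
           for doc in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]}
```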

2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the
top three words (after removing the stopwords).
Results after removing stopwords:
1. President Franklin D. Roosevelt in 1941

Top 3 words: e, n, r

2. President John F. Kennedy in 1961

Top 3 words: us, world, Let


3. President Richard Nixon in 1973

Top 3 words: us, America, peace
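The 'e, n, r' result for the Roosevelt speech suggests the frequency count there ran over characters rather than words; a word-level count with NLTK's FreqDist, continuing from the cleaned lists built in 2.2, would look like this:

```python
from nltk.probability import FreqDist

# 'cleaned' comes from the stopword-removal sketch in section 2.2
for doc, words in cleaned.items():
    print(doc, FreqDist(words).most_common(3))
```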

