
Machine Learning

Prepared by: Sunira

Content

Problem 1:
You are hired by one of the leading news channels CNBE who wants to analyze recent elections. This survey was
conducted on 1525 voters with 9 variables. You have to build a model, to predict which party a voter will vote for on
the basis of the given information, to create an exit poll that will help in predicting overall win and seats covered by a
particular party.
Data Ingestion: 11 marks
1.1) Read the dataset. Describe the data briefly. Interpret the inferences for each. Initial steps like head() .info(), Data Types,
etc . Null value check, Summary stats, Skewness must be discussed.
1.2) Perform EDA (Check the null values, Data types, shape, Univariate, bivariate analysis). Also check for outliers (4 pts).
Interpret the inferences for each (3 pts) Distribution plots(histogram) or similar plots for the continuous columns. Box
plots, Correlation plots. Appropriate plots for categorical variables. Inferences on each plot. Outliers proportion should
be discussed, and inferences from above used plots should be there. There is no restriction on how the learner wishes to
implement this but the code should be able to represent the correct output and inferences should be logical and correct.
1.3) Encode the data (having string values) for Modelling. Is Scaling necessary here or not? (2 pts), Data Split: Split the data
into train and test (70:30) (2 pts). The learner is expected to check and comment about the difference in scale of
different features on the basis of an appropriate measure, for example std dev, variance, etc. Should justify whether there is
a necessity for scaling. Object data should be converted into categorical/numerical data to fit in the models
(pd.Categorical().codes, pd.get_dummies(drop_first=True)). Data split, ratio defined for the split, train-test split should
be discussed.
1.4) Apply Logistic Regression and LDA (Linear Discriminant Analysis) (2 pts). Interpret the inferences of both models (2
pts). Successful implementation of each model. Logical reason behind the selection of different values for the
parameters involved in each model. Calculate Train and Test Accuracies for each model. Comment on the validity of
models (overfitting or underfitting).
1.5) Apply KNN Model and Naïve Bayes Model (2 pts). Interpret the inferences of each model (2 pts). Successful
implementation of each model. Logical reason behind the selection of different values for the parameters involved in
each model. Calculate Train and Test Accuracies for each model. Comment on the validity of models (overfitting or
underfitting).
1.6) Model Tuning (4 pts) , Bagging ( 1.5 pts) and Boosting (1.5 pts). Apply grid search on each model (include all models)
and make models on best_params. Define a logic behind choosing particular values for different hyper-parameters for
grid search. Compare and comment on performances of all. Comment on feature importance if applicable. Successful
implementation of both algorithms along with inferences and comments on the model performances.
1.7) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix,
Plot ROC curve and get ROC_AUC score for each model, classification report (4 pts) Final Model - Compare and
comment on all models on the basis of the performance metrics in a structured tabular manner. Describe on which
model is best/optimized, After comparison which model suits the best for the problem in hand on the basis of different
measures. Comment on the final model.(3 pts)
1.8) Based on your analysis and working on the business problem, detail out appropriate insights and recommendations to
help the management solve the business objective. There should be at least 3-4 Recommendations and insights in total.
Recommendations should be easily understandable and business specific, students should not give any technical
suggestions. Full marks should only be allotted if the

Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python. We will be looking at
the following speeches of the Presidents of the United States of America:

• President Franklin D. Roosevelt in 1941
• President John F. Kennedy in 1961
• President Richard Nixon in 1973

2.1) Find the number of characters, words and sentences for the mentioned documents. (Hint: use .words(), .raw(), .sents()
for extracting counts)
2.2) Remove all the stopwords from the three speeches. Show the word count before and after the removal of stopwords.
Show a sample sentence after the removal of stopwords.
2.3) Which word occurs the most number of times in his inaugural address for each president? Mention the top three
words. (after removing the stopwords)
2.4) Plot the word cloud of each of the three speeches. (after removing the stopwords)

Business Report – Predictive Modelling Project


By- Shorya Goel

Problem 1- You are hired by one of the leading news channels CNBE who wants to analyse recent elections. This
survey was conducted on 1525 voters with 9 variables. You have to build a model, to predict which party a voter will
vote for on the basis of the given information, to create an exit poll that will help in predicting overall win and seats
covered by a particular party.

1.1 Read the dataset. Do the descriptive statistics and do the null
value condition check. Write an inference on it.

Read the dataset – “Election_Data.xlsx”

Exploratory Data Analysis:


Top 5 entries in the dataset.

“Unnamed: 0” is a variable that simply repeats the row index. It carries no information for the model, so it should be
dropped.

Also, some variable names contain the ‘.’ character, which can cause problems when the columns are referenced in code
or in model formulas, so we will replace ‘.’ with ‘_’.
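
The two clean-up steps above can be sketched as follows. This is only an illustration: the tiny DataFrame is made up, and the dotted column names are an assumption about how the raw file is labelled.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the assumed raw column layout
df = pd.DataFrame({
    "Unnamed: 0": [1, 2],
    "vote": ["Labour", "Conservative"],
    "economic.cond.national": [3, 4],
    "political.knowledge": [2, 0],
})

# Drop the index-like column, then replace '.' with '_' in every column name
df = df.drop(columns="Unnamed: 0")
df.columns = df.columns.str.replace(".", "_", regex=False)

print(list(df.columns))
```

Passing regex=False matters here, because an unescaped '.' in a regular expression would match every character.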

Shape of the Dataset


Number of rows: 1525
Number of columns: 9

Info of the Dataset



The raw file has a total of 10 variables, counting the index column “Unnamed: 0”. The 9 variables kept for analysis are:

2 Categorical variables - vote, gender.
7 Numeric type variables - age, economic_cond_national, economic_cond_household, Blair, Hague, Europe,
political_knowledge.

Descriptive Statistics of the Dataset


Numerical Columns-

Categorical Columns-

The above table gives information such as unique values, mean, median, standard deviation, five point summary,
min-max, count, etc. for all the variables present in the dataset.

Check for Null Values-

From the above, it is clear that there are no null values present in the dataset.
The isnull() function is used here to check for missing values.
The sum() function is then used to get the total number of null values present in each variable.

Check for Duplicates-


There are a total of 8 duplicate rows.

Since there is no identifier or unique code for each row, we cannot say whether a duplicate row comes from the same
person or from different people who gave identical answers. So, we will not remove the duplicates in this case.
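
A minimal sketch of the null-value and duplicate checks described above, run on a made-up four-row frame (the values are illustrative, not taken from the election data):

```python
import pandas as pd

# Made-up data: one missing value and one exact duplicate row
df = pd.DataFrame({
    "age": [43, 36, 43, 51],
    "gender": ["female", "male", "female", None],
})

# isnull() flags missing cells; sum() counts them per column
null_counts = df.isnull().sum()

# duplicated() flags rows that repeat an earlier row exactly
n_duplicates = int(df.duplicated().sum())

print(null_counts)
print("Duplicate rows:", n_duplicates)
```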

Skewness of the Dataset

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its
mean.
Only two variables are positively skewed and the rest are negatively skewed, with the strongest skew in Blair.
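
To make the sign of the skewness concrete, here is a small sketch of the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). The sample lists are invented for illustration, and pandas' skew() applies a small-sample correction, so its values differ slightly from this formula.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 1, 2, 2, 3, 9]   # long tail on the high side -> positive skew
left_tailed = [1, 7, 8, 8, 9, 9]    # long tail on the low side  -> negative skew
symmetric = [1, 2, 3, 4, 5]         # no tail on either side     -> zero skew

for name, data in [("right", right_tailed), ("left", left_tailed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3))
```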

Coefficient of Variation Check

The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard deviation to the
mean (average).
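
The ratio can be computed directly, as in the sketch below. The two sample lists are invented; they only show that a wide-ranging variable such as age can have a larger CV than a 1-5 style rating.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

ages = [24, 35, 47, 58, 71]   # illustrative ages
ratings = [2, 3, 3, 4, 3]     # illustrative ratings on a short scale

print(round(coefficient_of_variation(ages), 4))
print(round(coefficient_of_variation(ratings), 4))
```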

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.

Univariate Analysis
For Continuous variables

We can see that all the numerical variables are approximately normally distributed (not perfectly normal, and
multimodal in some instances).
There are outliers present in the “economic_cond_national” and “economic_cond_household” variables, which can
also be seen from the boxplots on the right.
The min and max values of the variables are not very clear from the boxplots; we can obtain them separately
while checking for outliers.

Bivariate Analysis-

Pairplot-

The pairplot shows the interaction of each variable with every other variable. There is no strong relationship
between any pair of variables.
There is a mixture of positive and negative relationships, though, which is expected.

Overall, it gives only a rough estimate of the interactions; a clearer picture can be obtained from the heatmap values
and from other kinds of plots.

Analysis - Blair and Age

People above the age of 45 years generally think that Blair is doing a good job.

Analysis - Hague and Age

Hague has a slightly higher concentration of neutral points than Blair for people above 50 years of age.

Catplot Analysis - Blair (count) on economic_cond_household.



Catplot Analysis - Hague (count) on economic_cond_household

Blair has more points than Hague across the economic_cond_household levels.

Catplot Analysis - Blair (count) on economic_cond_national



Catplot Analysis – Hague (count) on economic_cond_national

Blair has more points than Hague across the economic_cond_national levels.



Catplot Analysis – Blair (count) on Europe

Catplot Analysis – Hague (count) on Europe

Across the levels of the Europe variable, Blair leads Hague.



Catplot Analysis – Blair (count) on political_knowledge

Catplot Analysis – Hague (count) on political_knowledge

In terms of political knowledge Blair is considered better.



Covariance Matrix-

Correlation Matrix-

Heatmap-

Multicollinearity is an important issue that can harm the model. A heatmap is a good way of identifying it, as it
gives us a basic idea of the relationships the variables have with each other.

Observations-

• Highest positive correlation is between “economic_cond_national” and “economic_cond_household” (35%), but the
good thing is that it’s not huge.
• Highest negative correlation is between “Blair” and “Europe” (30%), but this is also not huge.

Thus, multicollinearity won’t be an issue in this dataset.

Outlier Check/Treatment-

Using boxplot-

There are outliers present in “economic_cond_national” and “economic_cond_household” variables that can be
seen from the boxplots.
We will find the upper and lower limits to get a clear picture of the outliers.

The upper and lower limits are not far apart, and the outliers lie on the lower side only, with a value of 1 where the
lower limit is 1.5.
So it is not advisable to treat the outliers in this case. We will move forward without treating them.
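
The limits mentioned above follow the usual boxplot rule, Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. The sketch below assumes quartiles of Q1 = 3 and Q3 = 4, values chosen because they reproduce the lower limit of 1.5 quoted in the text; the actual quartiles are not stated in this report, and the ratings list is invented.

```python
def iqr_limits(q1, q3, k=1.5):
    """Return (lower, upper) outlier limits from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Assumed quartiles for a 1-5 rating variable (not taken from the report)
lower, upper = iqr_limits(q1=3, q3=4)

ratings = [1, 2, 3, 3, 4, 4, 5]   # illustrative ratings
outliers = [r for r in ratings if r < lower or r > upper]

print(lower, upper, outliers)
```

With limits this close to the valid range of the scale, a rating of 1 is a legitimate answer rather than a data error, which supports leaving the values untreated.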

1.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and test (70:30).
Encoding the dataset

As many machine learning models cannot work with string values we will encode the categorical variables and
convert their datatypes to integer type.
From the info of the dataset, we know there are 2 categorical type variables, so we need to encode these 2 variables
with the suitable technique.
Those 2 variables are ‘vote’ and ‘gender’. Their distribution is given below.

Gender Distribution-

Vote Distribution-

From the above results we can see that both variables contain only two classes each.
We can use a simple categorical conversion: pd.Categorical() codes or dummy encoding with drop_first=True; both
will work here. This converts the values into 0 and 1. As there is no level or order among the categories, either
encoding gives the same result.
The datatype after conversion is int8. We can convert it to int64, but the models work even if we don’t.
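
Both encodings can be sketched on a made-up three-row frame. The category labels below are assumptions for illustration; the 0/1 assignment shown for vote matches the convention used later in this report (0 for Conservative, 1 for Labour).

```python
import pandas as pd

# Illustrative rows; the label spellings are assumptions
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "male"],
})

# Option 1: category codes, assigned in alphabetical order (Conservative=0, Labour=1)
vote_codes = pd.Categorical(df["vote"]).codes

# Option 2: dummy encoding; drop_first=True keeps one 0/1 column for a two-class variable
gender_dummies = pd.get_dummies(df["gender"], drop_first=True)

print(vote_codes, vote_codes.dtype)
print(gender_dummies)
```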

After encoding-

Info-

Data-

Now, the model can be built on this data.

Scaling the dataset



Scaling is done so that features with widely different ranges are brought onto a similar relative range, which helps
many models perform at their best.
Generally, we perform feature scaling when dealing with gradient-descent-based algorithms such as Linear and
Logistic Regression, as their optimisation is sensitive to the range of the data points. Scaling is sometimes also
discussed alongside multicollinearity, which is measured with the VIF (Variance Inflation Factor). VIF is obtained by
regressing each predictor on the remaining predictors, and standardising the features does not change the
correlations between them.
So, whether scaling is required depends on the model we are building. Distance-based methods (e.g. KNN) usually
require scaling, because features with large ranges would otherwise dominate the distance and bias the result.
Tree-based methods (e.g. Decision Trees) generally do not require scaling, because they split on one feature at a
time using thresholds.

Here, we will fit both types of models on scaled and unscaled data and check whether there is a difference in
performance.
Also, after looking at the data, we only need to scale the ‘age’ variable, as the rest of the variables lie in the
range 0-10 at most.
We will use z-score (standard) scaling for the age variable, after which it has mean = 0 and standard deviation = 1.
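
A minimal sketch of z-score scaling, z = (x - mean) / std, on invented ages. It uses the population standard deviation, which is what scikit-learn's StandardScaler uses; whether the original notebook used StandardScaler or another helper is not stated here.

```python
import statistics

def z_scores(values):
    """Standardise values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

ages = [24, 35, 47, 58, 71]   # illustrative ages
scaled_ages = z_scores(ages)

print([round(z, 3) for z in scaled_ages])
```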

Data Split: Splitting the data into test and train

Before splitting we need to identify the target variable. Here, the target variable is “vote”.
Vote data distribution-

There is a class imbalance in the target variable, as seen above, so instead of a 50:50 split we will split the data
in a 70:30 ratio. We will also use the oversampling technique SMOTE to check whether it improves the model.

Here, we will use 2 different train and test sets, one without scaled data and one with scaled data. This will help us in
understanding whether scaling can improve the performance or not.

Now we split both X and y in the ratio 70:30, where the train data is 70% and the test data is 30%.
After splitting- the shape of the data

Here,
X_train - denotes the 70% training dataset with 8 columns (all except the target column “vote”).
X_test - denotes the 30% test dataset with 8 columns (all except the target column “vote”).
y_train - denotes the 70% training dataset with only the target column “vote”.
y_test - denotes the 30% test dataset with only the target column “vote”.

Similarly, the data is divided for the scaled data and the SMOTE oversampled data.
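
The 70:30 split can be sketched with scikit-learn's train_test_split. The placeholder data below only has the same number of rows as the survey (1525), and random_state=1 is an assumed seed, not one taken from the report. With 1525 rows the test size is rounded up to 458, leaving 1067 rows for training, which agrees with the totals of the confusion matrices in section 1.7.

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels with the same number of rows as the survey
X = [[i, i % 5] for i in range(1525)]
y = [i % 2 for i in range(1525)]

# test_size=0.30 gives the 70:30 split; random_state is an assumed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

print(len(X_train), len(X_test))
```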

1.4 Apply Logistic Regression and LDA (linear discriminant analysis). Interpret the inferences of both models.
Logistic Regression Model

Before fitting the model, it is important to know the hyper-parameters that are involved in model building.
Parameters:
• penalty
• solver
• max_iter
• tol, etc.
To find the best combination of these parameters we will use the “GridSearchCV” method. It fits the model for every
combination of the supplied parameter values, scores each one with cross-validation, and returns the best-performing
combination.
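
A hedged sketch of how such a grid search might be set up is given below. The synthetic data, the candidate values in param_grid, and cv=3 are all assumptions for illustration; the grid actually used for this report is not shown in the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 8 predictors; not the election data
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# Hypothetical candidate values; 2 solvers x 2 tolerances = 4 combinations
param_grid = {
    "solver": ["lbfgs", "liblinear"],
    "tol": [1e-4, 1e-5],
    "max_iter": [1000],
}

# Each combination is scored with 3-fold cross-validation on accuracy
search = GridSearchCV(LogisticRegression(), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds a model refitted on the full training data with the winning combination.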
After performing the search the best parameters came out to be-

Now the results for unscaled data-

Intercept for the model is: [2.83418594]


Feature Importance-

Train Accuracy - 0.8303655107778819


Test Accuracy - 0.8537117903930131

Probabilities on the test set-(0 being preferring Conservative Party and 1 being preferring Labour Party)

Now the results for scaled data-

Intercept for the model is: [2.01329492]


Feature Importance-

Train Accuracy - 0.8303655107778819


Test Accuracy - 0.8493449781659389

Probabilities on the test set-



Statsmodels can also be used to build the Logistic Regression model in order to learn more about the statistics
behind the model.

Inferences
Pseudo R2 = 0.3809 shows that the model fits well, as a value between 0.2 and 0.4 is generally taken to indicate a
good fit.
The model performs slightly better on the unscaled data.
There is no under-fitting or overfitting, as the accuracies for the train and test data are not very different.

Also, I performed SMOTE (oversampling technique), whose output is discussed further in the performance model
comparison.

LDA (Linear Discriminant Analysis) Model

Before fitting the model, it is important to know the hyper-parameters that are involved in model building.
Parameters:
• solver
• shrinkage
Now after performing the GridSearchCV, the best parameters obtained are-
• shrinkage = 'auto'
• solver = 'lsqr'

Now the results for unscaled data-

Intercept for the model is: [3.72460468]


Feature Importance-

Train Accuracy- 0.8284910965323337


Test Accuracy- 0.851528384279476

Probabilities on the test set-

Now the results for scaled data-

Intercept for the model is: [2.48783541]


Feature Importance-

Train Accuracy- 0.828491096532333


Test Accuracy- 0.851528384279476

Probabilities on the test set-



Inferences
The model performed well, and the accuracy is the same for both the scaled and unscaled data.

Also, I performed SMOTE (oversampling technique), whose output is discussed further in the performance model
comparison.

1.5. Apply KNN Model and Naïve Bayes Model. Interpret the inferences of
each model.

K Nearest Neighbours Model

KNN is a distance-based supervised machine learning algorithm that can be used to solve both classification and
regression problems. Its main disadvantage is that prediction becomes very slow on large volumes of data, which
makes it an impractical choice where inferences need to be drawn quickly.

Before fitting the model, it is important to know the hyper-parameters that are involved in model building.
Parameters:
• n_neighbors
• weights
• algorithm
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'n_neighbors' = 5,
• 'weights' = uniform,
• 'algorithm' = auto

Now the results for unscaled data-

Train Accuracy- 0.8369259606373008


Test Accuracy- 0.8165938864628821

Probabilities on the test set-



Now the results for scaled data-

Train Accuracy- 0.8603561387066542


Test Accuracy- 0.8384279475982532

Probabilities on the test set-

Inference-
The model performed better with the scaled data.
Also, overall the model performed well, but there may be slight overfitting, as accuracy is higher for the train set
than for the test set.

Naive Bayes Model

The Naive Bayes classifier is a model based on applying Bayes' theorem with strong (naïve) independence
assumptions between the features. These assumptions, however, may not hold perfectly in real-life scenarios.

Bayes Theorem-
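
In its standard form, for a class C and features x_1, ..., x_n (symbols chosen here for illustration), the theorem and the simplification that the naive independence assumption allows can be written as:

```latex
P(C \mid x_1, \dots, x_n) = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}
\qquad\Longrightarrow\qquad
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The predicted class is the one with the largest value of the right-hand product; the denominator is the same for every class, so it can be ignored when comparing classes.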

Here the method that we are going to use is GaussianNB(). It works on numeric features and assumes that, within
each class, every feature follows a normal (Gaussian) distribution. (BernoulliNB() is a different variant, intended
for binary features.)
GaussianNB has very few parameters to tune compared with the other models, so we will simply fit the model with
default parameters.

Now the results for unscaled data-

Train Accuracy- 0.8219306466729147


Test Accuracy- 0.8471615720524017

Probabilities on the test set-

Now the results for scaled data-

Train Accuracy- 0.8219306466729147


Test Accuracy- 0.8471615720524017

Probabilities on the test set-



Inference-
The model performed exactly the same for both unscaled and scaled data.
The model performed well on the data, with no overfitting or under-fitting present.

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and
Boosting.
Model Tuning

Tuning is the process of maximizing a model’s performance without overfitting or creating too high a
variance. In machine learning, this is accomplished by selecting appropriate “hyper-parameters”.

Grid Search is one of the most common methods of optimizing the parameters. A set of candidate values is defined
for each parameter, and the performance of each combination of these values is evaluated using cross-validation.
The best-performing combination is then selected from among them.

Ensemble models such as Bagging, Boosting, Gradient Boosting, CatBoost, etc. can under-fit or over-fit the data.
Overfitting means that the model works very well on the train data but relatively poorly on the test data.
Under-fitting means that the model is too simple to capture the pattern, so it works poorly on both the train data
and the test data.

Bagging Model (Using Random Forest Classifier)

Bagging is an ensemble technique. Ensemble techniques are machine learning techniques that combine several
base models to get an optimal model. Bagging is designed to improve the performance of existing machine learning
algorithms used in statistical classification or regression. It is most commonly used with tree-based algorithms. It is a
parallel method.

Each base classifier is trained in parallel on its own training set, which is generated by randomly drawing, with
replacement, N data points from the training data. The training sets of the base classifiers are drawn independently
of each other.
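
The sampling step described above (drawing N rows with replacement, once per base classifier) can be sketched with the standard library. The row ids, the seed, and the number of base classifiers are illustrative choices only.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows, with replacement."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(0)              # assumed seed, for repeatability
training_rows = list(range(10))     # stand-in for 10 training row ids

# One independent bootstrap sample per base classifier (3 here)
samples = [bootstrap_sample(training_rows, rng) for _ in range(3)]

for sample in samples:
    print(sorted(sample))
```

Because the draw is with replacement, some rows appear several times in a sample and others are left out, which is what makes the base classifiers differ from one another.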

Here, we will use random forest as the base classifier. Hyper-parameters that will be used in the model are
• max_depth
• max_features
• min_samples_leaf
• min_samples_split
• n_estimators
There are other parameters as well, but we will use these for the grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'max_depth' = 5,
• 'max_features' = 7,
• 'min_samples_leaf' = 25,
• 'min_samples_split' = 60,
• 'n_estimators' = 101

Now the results for unscaled data-

Train Accuracy- 0.8303655107778819


Test Accuracy- 0.834061135371179

Probabilities on the test set-



Now the results for scaled data-

Train Accuracy- 0.8303655107778819


Test Accuracy- 0.834061135371179

Probabilities on the test set-

Inference-
The model performed exactly the same for both unscaled and scaled data.
This model performed extremely well on the data, with no overfitting or under-fitting present.

Boosting Model

Boosting is also an ensemble technique. It converts weak learners into strong learners. Unlike bagging, it is a
sequential method where the result from one weak learner becomes the input for the next, and so on, thus improving
the performance of the model.
Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative
process, and the boosting algorithm combines these weak rules into a single strong prediction rule.

Misclassified examples gain a higher weight, and examples that are classified correctly lose weight. Thus,
future weak learners focus more on the examples that previous weak learners misclassified. Boosting methods are
also usually tree-based.
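
The re-weighting idea can be illustrated with one round of the textbook discrete AdaBoost update. This is a sketch under simplifying assumptions (five equally weighted examples, one of them misclassified); library implementations differ in details such as the learning rate and the multiclass correction.

```python
import math

def reweight(weights, misclassified):
    """One AdaBoost round: raise weights of misclassified examples, lower the rest."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)   # say of this weak learner
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]    # normalise to sum to 1

weights = [0.2] * 5                               # equal starting weights
misclassified = [False, False, True, False, False]

alpha, new_weights = reweight(weights, misclassified)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the single misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.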

There are many kinds of Boosting Techniques available and for this project, the following boosting
techniques are to be used.
1. ADA Boost (Adaptive Boosting)
2. Gradient Boosting
3. Extreme Gradient Boosting
4. CAT Boost (Categorical Boosting)

ADA Boosting Model

AdaBoost was originally designed to increase the efficiency of binary classifiers, but it is now used to improve
multiclass classifiers as well. It can be applied on top of any classifier to learn from its mistakes and produce a
more accurate model, and it is often called the “best out-of-the-box classifier”.

Before fitting the model, it is important to know the hyper-parameters that are involved in model building.
Parameters:
• algorithm
• n_estimators
There are other parameters as well, but we will use these for the grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'algorithm' = 'SAMME',
• 'n_estimators' = 50

Now the results for unscaled data-

Train Accuracy- 0.8369259606373008


Test Accuracy- 0.8427947598253275

Probabilities on the test set-

Now the results for scaled data-

Train Accuracy- 0.8369259606373008


Test Accuracy- 0.8427947598253275

Probabilities on the test set-



Inference-
The model performed exactly the same for both unscaled and scaled data.
This model performed extremely well on the data, with no overfitting or under-fitting present.

Gradient Boosting Model

This model is similar to the AdaBoost model. Gradient Boosting works by sequentially adding predictors to the
ensemble, each one correcting the errors of its predecessors. The major difference lies in what it does with the
errors of the previous weak learner: instead of re-weighting the misclassified examples, this method fits the new
predictor to the residual errors made by the previous one.
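
The residual-fitting idea can be sketched on a toy one-dimensional regression problem. This is a simplified illustration with squared error and single-split stumps; the data, the learning rate of 0.5, and the five rounds are assumptions, and the classifier used in this report works with a classification loss instead.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:          # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

xs = [1, 2, 3, 4, 5, 6]                     # toy feature
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]         # toy target
learning_rate = 0.5

predictions = [sum(ys) / len(ys)] * len(xs)  # start from the mean
errors = []
for _ in range(5):
    residuals = [y - p for y, p in zip(ys, predictions)]
    errors.append(sum(r * r for r in residuals))
    stump = fit_stump(xs, residuals)         # new learner is fitted to the residuals
    predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]

print([round(e, 4) for e in errors])
```

Each round the squared error of the running prediction shrinks, because the new stump is trained on exactly what the ensemble still gets wrong.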

Before fitting the model, it is important to know the hyper-parameters that are involved in model building.
Parameters:
• criterion
• loss
• n_estimators
• max_features
• min_samples_split

There are other parameters as well, but we will use these for the grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-

• 'criterion' = 'friedman_mse',
• 'loss' = 'exponential',
• 'n_estimators' = 50,
• 'max_features' = 8,
• 'min_samples_split' = 45

Now the results for unscaled data-

Train Accuracy- 0.865979381443299


Test Accuracy- 0.8493449781659389

Probabilities on the test set-



Now the results for scaled data-

Train Accuracy- 0.865979381443299


Test Accuracy- 0.8493449781659389

Probabilities on the test set-

Inference-
The model performed exactly the same for both Unscaled and Scaled data.
Also, overall the model performed well, but there may be slight overfitting, as accuracy is higher for the train set
than for the test set.

XGBoost (eXtreme Gradient Boosting) Model

This model, as the name suggests, is based on the gradient boosting framework. However, XGBoost improves upon
the base GBM framework through systems optimization and algorithmic enhancements. It uses parallel processing
and memory optimizations to speed up the Gradient Boosting method, hence the name “extreme”.
Another advantage is that it handles null values automatically through the parameter “missing”, which defaults to
NaN. Another difference is that XGBoost does not have the parameter ‘min_samples_split’.

Before fitting the model, it is important to know the hyper-parameters that are involved in model building.
Parameters:
• max_depth
• min_samples_leaf
• n_estimators
• learning_rate
There are other parameters as well, but we will use these for the grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'max_depth': 4,
• 'min_samples_leaf': 15,
• 'n_estimators': 50,
• 'learning_rate': 0.1

Now the results for unscaled data-

Train Accuracy- 0.8847235238987816


Test Accuracy- 0.851528384279476

Probabilities on the test set-

Now the results for scaled data-

Train Accuracy- 0.8847235238987816


Test Accuracy- 0.851528384279476

Probabilities on the test set-



Inference-
The model performed exactly the same for both Unscaled and Scaled data.
Also, overall the model performed well, but there may be slight overfitting, as accuracy is higher for the train set
than for the test set.

CATBoosting Model

CatBoost (Categorical Boosting) is a machine learning algorithm that uses gradient boosting on decision trees.
It is an open-source library that is not part of the scikit-learn package, so it has to be installed separately.
CatBoost can handle large amounts of categorical data, which is usually a problem for the majority of machine
learning algorithms. It is easy to implement and very powerful, provides excellent results, and is very fast in
execution.

There are plenty of parameters to specify but we are going forward with the default parameters.

Now the results for unscaled data-

Train Accuracy- 0.9381443298969072


Test Accuracy- 0.851528384279476

Probabilities on the test set-

Now the results for scaled data-

Train Accuracy- 0.9381443298969072


Test Accuracy- 0.851528384279476

Probabilities on the test set-



Inference-
The model performed exactly the same for both unscaled and scaled data.
There is a huge difference between the accuracy values of the train and test data.
The model is overfitting here, as the train accuracy is far higher than the test accuracy.

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion
Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference
which model is best/optimized.
Performance Metrics:

Several performance metrics are used to assess how well a model has performed and to take an informed decision
on whether to go forward with the model in a real-world scenario or not.

The commonly used methods are:


• Classification Accuracy.
• Confusion Matrix.
• Classification Report.
• Area Under ROC Curve (visualization) and AUC Score
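As a rough illustration of how these metrics relate to each other, the sketch below derives accuracy, precision, recall and F1 from the four confusion-matrix counts, and computes AUC as the share of (positive, negative) pairs that the model ranks correctly. It uses plain Python rather than the Sklearn functions used in the report; the function names and the toy scores are assumptions made for illustration, while the counts are the Logistic Regression test-set counts (before scaling) reported later in this section.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Logistic Regression, test set, before scaling (counts from this report).
lr_test = classification_metrics(tn=94, fp=45, fn=22, tp=297)
print({name: round(value, 4) for name, value in lr_test.items()})

# Toy labels and predicted scores (hypothetical, for illustration only).
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

The accuracy computed from the counts agrees with the reported test accuracy, which is a quick way to check that a confusion matrix and an accuracy figure belong to the same run.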

Logistic Regression

Before Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.8537117903930131

Confusion Matrix-

For Train Data-
True Negative: 212 False Positive: 111
False Negative: 70 True Positive: 674

For Test Data-
True Negative: 94 False Positive: 45
False Negative: 22 True Positive: 297

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

Logistic Regression (Train) score: 0.877 Logistic Regression (Test) score: 0.916

After Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.8493449781659389

Confusion Matrix-

For Train Data-
True Negative: 211 False Positive: 112
False Negative: 69 True Positive: 675

For Test Data-
True Negative: 94 False Positive: 45
False Negative: 24 True Positive: 295

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

Logistic Regression (Train) score: 0.877 Logistic Regression (Test) score: 0.915

SMOTE –

Without Scaling-
Train Accuracy- 0.8245967741935484
Test Accuracy- 0.8427947598253275

With Scaling-
Train Accuracy- 0.8138440860215054
Test Accuracy- 0.8384279475982532

LDA (Linear Discriminant Analysis)

Before Scaling-
Train Accuracy- 0.8284910965323337
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data-
True Negative: 218 False Positive: 105
False Negative: 78 True Positive: 666

For Test Data-
True Negative: 100 False Positive: 39
False Negative: 29 True Positive: 290

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

LDA (Train) score: 0.877 LDA (Test) score: 0.915

After Scaling
Train Accuracy- 0.8284910965323337
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data-
True Negative: 218 False Positive: 105
False Negative: 78 True Positive: 666

For Test Data-
True Negative: 100 False Positive: 39
False Negative: 29 True Positive: 290

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

LDA (Train) score: 0.877 LDA (Test) score: 0.915

SMOTE –

Without Scaling-
Train Accuracy- 0.8245967741935484
Test Accuracy- 0.8427947598253275

With Scaling-
Train Accuracy- 0.8125
Test Accuracy- 0.8296943231441049

KNN (K Nearest Neighbours)

Before Scaling-
Train Accuracy- 0.8369259606373008
Test Accuracy- 0.8165938864628821

Confusion Matrix-
For Train Data-
True Negative: 219 False Positive: 104
False Negative: 70 True Positive: 674

For Test Data-
True Negative: 84 False Positive: 55
False Negative: 29 True Positive: 290

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

KNN (Train) score: 0.915 KNN (Test) score: 0.867

After Scaling-
Train Accuracy- 0.8603561387066542
Test Accuracy- 0.8384279475982532

Confusion Matrix-
For Train Data-
True Negative: 239 False Positive: 84
False Negative: 65 True Positive: 679

For Test Data-
True Negative: 95 False Positive: 44
False Negative: 30 True Positive: 289

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

KNN (Train) score: 0.933 KNN (Test) score: 0.877



SMOTE –

Without Scaling-
Train Accuracy- 0.8830645161290323
Test Accuracy- 0.8144104803493449

With Scaling-
Train Accuracy- 0.8918010752688172
Test Accuracy- 0.8231441048034934

Naïve Bayes

Before Scaling-
Train Accuracy- 0.8219306466729147
Test Accuracy- 0.8471615720524017

Confusion Matrix-
For Train Data-
True Negative: 223 False Positive: 100
False Negative: 90 True Positive: 654

For Test Data-
True Negative: 101 False Positive: 38
False Negative: 32 True Positive: 287

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:



For both Training and Testing:

NB (Train) score: 0.874 NB (Test) score: 0.910

After Scaling-
Train Accuracy- 0.8219306466729147
Test Accuracy- 0.8471615720524017

Confusion Matrix-
For Train Data-
True Negative: 223 False Positive: 100
False Negative: 90 True Positive: 654

For Test Data-
True Negative: 101 False Positive: 38
False Negative: 32 True Positive: 287

Classification Report-
For Train Set-

For Test Set-



Area Under ROC Curve and AUC Score:


For both Training and Testing:

NB (Train) score: 0.874 NB (Test) score: 0.910

SMOTE –

Without Scaling-
Train Accuracy- 0.8205645161290323
Test Accuracy- 0.8362445414847162

With Scaling-
Train Accuracy- 0.8077956989247311
Test Accuracy- 0.8253275109170306

Bagging

Before Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.834061135371179

Confusion Matrix-
For Train Data-
True Negative: 201 False Positive: 122
False Negative: 59 True Positive: 685

For Test Data-
True Negative: 83 False Positive: 56
False Negative: 20 True Positive: 299

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

Bagging (Train) score: 0.891 Bagging (Test) score: 0.900

After Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.834061135371179

Confusion Matrix-
For Train Data-
True Negative: 201 False Positive: 122
False Negative: 59 True Positive: 685

For Test Data-
True Negative: 83 False Positive: 56
False Negative: 20 True Positive: 299

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

Bagging (Train) score: 0.891 Bagging (Test) score: 0.900

SMOTE –

Without Scaling-
Train Accuracy- 0.831989247311828
Test Accuracy- 0.8078602620087336

With Scaling-
Train Accuracy- 0.8259408602150538
Test Accuracy- 0.8100436681222707

ADA Boosting

Before Scaling-
Train Accuracy- 0.8369259606373008
Test Accuracy- 0.8427947598253275

Confusion Matrix-
For Train Data-
True Negative: 224 False Positive: 99
False Negative: 75 True Positive: 669

For Test Data-
True Negative: 97 False Positive: 42
False Negative: 30 True Positive: 289

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

ADABoost (Train) score: 0.889 ADABoost (Test) score: 0.906



After Scaling-
Train Accuracy- 0.8369259606373008
Test Accuracy- 0.8427947598253275

Confusion Matrix-
For Train Data-
True Negative: 224 False Positive: 99
False Negative: 75 True Positive: 669

For Test Data-
True Negative: 97 False Positive: 42
False Negative: 30 True Positive: 289

Classification Report-
For Train Set-

For Test Set-



Area Under ROC Curve and AUC Score:


For both Training and Testing:

ADABoost (Train) score: 0.889 ADABoost (Test) score: 0.906

SMOTE –

Without Scaling-
Train Accuracy- 0.842741935483871
Test Accuracy- 0.8362445414847162

With Scaling-
Train Accuracy- 0.8185483870967742
Test Accuracy- 0.8013100436681223

Gradient Boosting

Before Scaling-
Train Accuracy- 0.865979381443299
Test Accuracy- 0.8493449781659389

Confusion Matrix-
For Train Data-
True Negative: 229 False Positive: 94
False Negative: 49 True Positive: 695

For Test Data-
True Negative: 94 False Positive: 45
False Negative: 24 True Positive: 295

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:
Gradient Boost (Train) score: 0.933 Gradient Boost (Test) score: 0.915

After Scaling-
Train Accuracy- 0.865979381443299
Test Accuracy- 0.8493449781659389

Confusion Matrix-
For Train Data-
True Negative: 229 False Positive: 94
False Negative: 49 True Positive: 695

For Test Data-
True Negative: 94 False Positive: 45
False Negative: 24 True Positive: 295

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

Gradient Boost (Train) score: 0.933 Gradient Boost (Test) score: 0.915

SMOTE –

Without Scaling-
Train Accuracy- 0.8716397849462365
Test Accuracy- 0.8296943231441049

With Scaling-
Train Accuracy- 0.8595430107526881
Test Accuracy- 0.8296943231441049

XGBoost

Before Scaling-
Train Accuracy- 0.8847235238987816
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data-
True Negative: 242 False Positive: 81
False Negative: 42 True Positive: 702

For Test Data-
True Negative: 96 False Positive: 43
False Negative: 25 True Positive: 294

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:



For both Training and Testing:



XGBoost (Train) score: 0.941 XGBoost (Test) score: 0.912

After Scaling-
Train Accuracy- 0.8847235238987816
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data-
True Negative: 242 False Positive: 81
False Negative: 42 True Positive: 702

For Test Data-
True Negative: 96 False Positive: 43
False Negative: 25 True Positive: 294

Classification Report-
For Train Set-

For Test Set-



Area Under ROC Curve and AUC Score:


For both Training and Testing:

XGBoost (Train) score: 0.941 XGBoost (Test) score: 0.912

SMOTE –

Without Scaling-
Train Accuracy- 0.8803763440860215
Test Accuracy- 0.8384279475982532

With Scaling-
Train Accuracy- 0.875
Test Accuracy- 0.8362445414847162

CATBoost

Before Scaling-
Train Accuracy- 0.9381443298969072
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data-
True Negative: 281 False Positive: 42
False Negative: 24 True Positive: 720

For Test Data-
True Negative: 97 False Positive: 42
False Negative: 26 True Positive: 293

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

CATBoost (Train) score: 0.978 CATBoost (Test) score: 0.914

After Scaling-
Train Accuracy- 0.9381443298969072
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data-
True Negative: 281 False Positive: 42
False Negative: 24 True Positive: 720

For Test Data-
True Negative: 97 False Positive: 42
False Negative: 26 True Positive: 293

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

CATBoost (Train) score: 0.978 CATBoost (Test) score: 0.914

SMOTE –

Without Scaling-
Train Accuracy- 0.9455645161290323
Test Accuracy- 0.834061135371179

With Scaling-
Train Accuracy- 0.9401881720430108
Test Accuracy- 0.8318777292576419

Model Comparison-
Here we compare all the models built and find the best optimised among them. There are a total
of 9 different kinds of model, and each model was built 4 times in the following fashion –
- Without Scaling
- With Scaling
- Smote Without Scaling
- Smote With Scaling
So, that makes a total of 36 models in all.
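The count of 36 is simply every model paired with every data variant. A small sketch of that enumeration, with the model and variant names taken from the headings in this report:

```python
from itertools import product

models = [
    "Logistic Regression", "LDA", "KNN", "Naive Bayes", "Bagging",
    "ADA Boosting", "Gradient Boosting", "XGBoost", "CATBoost",
]
variants = [
    "Without Scaling", "With Scaling",
    "Smote Without Scaling", "Smote With Scaling",
]

# Every (model, variant) pair corresponds to one fitted model in the comparison.
experiments = list(product(models, variants))
print(len(experiments))
```

Keeping the runs in one list like this also makes it straightforward to loop over them and collect the metrics into a single comparison table.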

The measures on which models are evaluated are known as performance metrics. The metrics on which the models
will be evaluated are-
• Accuracy
• AUC
• Recall
• Precision
• F1-Score

Without Scaling-

From the above-


- On the basis of Accuracy – Logistic Regression performed better than the others.
- On the basis of AUC Score – Logistic Regression performed better than the others.
- On the basis of Recall – Bagging performed slightly better than the others.
- On the basis of Precision – Naive Bayes performed slightly better than the others.
- On the basis of F1-Score – Logistic Regression, along with some others, performed well.

All the models performed well, with only slight differences between them (in the range of 1-5%).

With Scaling-

From the above-


- On the basis of Accuracy – LDA and XGBoost performed better than the others.
- On the basis of AUC Score – Logistic Regression and LDA performed better than the others.
- On the basis of Recall – Bagging performed slightly better than the others.
- On the basis of Precision – Naive Bayes performed slightly better than the others.
- On the basis of F1-Score – Logistic Regression, along with some others, performed well.

Smote Performance Metrics-


Here, the comparison is based on Accuracy values only. This will help in understanding whether using Smote
has a positive effect or not.

Smote Without Scaling-

From the above-


- On the basis of Accuracy, Logistic Regression performed better than the others.

Smote With Scaling-

From the above-


- On the basis of Accuracy, Logistic Regression performed better than the others.

Observations-
- From the above 4 tables it can be observed that using Smote did not improve the performance of the
models. Overall, the models without Smote performed better on both Scaled and Unscaled Data. Thus, there is
no benefit in applying Smote here.
- As for the Scaled and Unscaled Data Models, scaling improved the performance only of the distance-based
algorithm; for the other models it left the performance unchanged or slightly decreased it. Here, only KNN
on Scaled Data performed slightly better than KNN on Unscaled Data.
- Best Optimised Model – On the basis of all the comparisons and performance metrics, “Logistic
Regression” without scaling performed the best out of all.
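The observation about Smote can be checked directly from the test accuracies reported in section 1.7. The sketch below does this for the unscaled runs only; the variable names are illustrative, and the numbers are copied from the per-model results above.

```python
# Test accuracy on unscaled data: (without Smote, with Smote), from section 1.7.
test_accuracy = {
    "Logistic Regression": (0.8537117903930131, 0.8427947598253275),
    "LDA": (0.851528384279476, 0.8427947598253275),
    "KNN": (0.8165938864628821, 0.8144104803493449),
    "Naive Bayes": (0.8471615720524017, 0.8362445414847162),
    "Bagging": (0.834061135371179, 0.8078602620087336),
    "ADA Boosting": (0.8427947598253275, 0.8362445414847162),
    "Gradient Boosting": (0.8493449781659389, 0.8296943231441049),
    "XGBoost": (0.851528384279476, 0.8384279475982532),
    "CATBoost": (0.851528384279476, 0.834061135371179),
}

# Change in test accuracy caused by Smote (negative means Smote made it worse).
deltas = {name: smote - plain for name, (plain, smote) in test_accuracy.items()}

for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:20s} {delta:+.4f}")

# Models whose test accuracy went up after applying Smote.
improved = [name for name, delta in deltas.items() if delta > 0]
print("Improved by Smote:", improved)
```

For these unscaled runs no model gains test accuracy from Smote, which is consistent with the conclusion that applying it is not useful here.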

1.8) Based on your analysis and working on the business problem, detail out
appropriate insights and recommendations to help the management solve
the business objective.

Inferences
- Logistic Regression performed the best out of all the models built.
- Logistic Regression Equation for the model:
(3.05008) * Intercept + (-0.01891) * age + (0.41855) * economic_cond_national + (0.06714) *
economic_cond_household + (0.62627) * Blair + (-0.83974) * Hague + (-0.21413) * Europe +
(-0.40331) * political_knowledge + (0.10881) * gender

The above equation gives the log-odds of the predicted class. It helps in understanding the model and the
feature importance, i.e. how each feature contributes to the predicted output.
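To see how the equation turns into a prediction, the sketch below evaluates it for one voter and passes the result through the logistic (sigmoid) function. The coefficients are the ones in the equation above; the voter's feature values are hypothetical, and which party corresponds to the positive class depends on how the target was encoded, which is not restated here.

```python
import math

INTERCEPT = 3.05008
COEFFICIENTS = {
    "age": -0.01891,
    "economic_cond_national": 0.41855,
    "economic_cond_household": 0.06714,
    "Blair": 0.62627,
    "Hague": -0.83974,
    "Europe": -0.21413,
    "political_knowledge": -0.40331,
    "gender": 0.10881,
}


def log_odds(voter):
    """Linear part of the model: intercept plus coefficient * feature value."""
    return INTERCEPT + sum(coef * voter[name] for name, coef in COEFFICIENTS.items())


def probability(voter):
    """Logistic (sigmoid) transform of the log-odds into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds(voter)))


# Hypothetical voter, for illustration only (not a row from the dataset).
voter = {
    "age": 40,
    "economic_cond_national": 3,
    "economic_cond_household": 3,
    "Blair": 4,
    "Hague": 2,
    "Europe": 6,
    "political_knowledge": 2,
    "gender": 0,
}

print(f"log-odds = {log_odds(voter):.4f}")
print(f"probability of the positive class = {probability(voter):.4f}")
```

A positive coefficient (for example Blair) pushes the probability of the positive class up as the feature increases, and a negative one (for example Hague) pushes it down.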

Top 5 features in Logistic Regression Model in order of decreasing importance are-
1. Hague : |-0.8181846212178241|
2. Blair : |0.5460018962250501|
3. economic_cond_national : |0.37700497490783885|
4. political_knowledge : |-0.3459485608005413|
5. Europe : |-0.19691071679312278|
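A ranking like this can be obtained by sorting the features on the absolute value of their coefficients. The sketch below does so with the coefficients from the equation in the Inferences section; this is an assumed reconstruction of the approach, not the exact code used to produce the list.

```python
# Coefficients from the Logistic Regression equation in the Inferences section.
coefficients = {
    "age": -0.01891,
    "economic_cond_national": 0.41855,
    "economic_cond_household": 0.06714,
    "Blair": 0.62627,
    "Hague": -0.83974,
    "Europe": -0.21413,
    "political_knowledge": -0.40331,
    "gender": 0.10881,
}

# Sort by magnitude, largest first; the sign only gives the direction of the effect.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
top5 = [name for name, _ in ranked[:5]]

for position, (name, coef) in enumerate(ranked[:5], start=1):
    print(f"{position}. {name} : |{coef}|")
```

Raw coefficient magnitudes depend on the scale of each feature, so this ranking is most meaningful when the features are on comparable scales.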

Insights and Recommendations

Our main Business Objective is - “To build a model, to predict which party a voter will vote for on the basis of the
given information, to create an exit poll that will help in predicting overall win and seats covered by a particular
party.”

• Use the Logistic Regression Model without scaling for predicting the outcome, as it has the best
optimised performance.
• Hyper-parameter tuning is an important aspect of model building. It is limited by the large amount of
processing power needed to evaluate many parameter combinations, but if tuning can be done over more sets
of parameters then we might get even better results.
• Gathering more data will also help in training the models and thus improve their predictive power.
• Boosting Models can also perform well; CATBoost, for example, performed well even without tuning. Thus, if we
perform hyper-parameter tuning we might get better results.
• We can also create a function in which all the models predict the outcome in sequence. This will help in
better understanding the likely outcome and its probability.

Problem 2- In this particular project, we are going to work on the inaugural corpus from nltk in
Python. We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the
mentioned documents.
Characters
Characters in Franklin D. Roosevelt’s speech: 7571
Characters in John F. Kennedy’s speech: 7618
Characters in Richard Nixon’s speech: 9991

Words
Words in Franklin D. Roosevelt’s speech: 1536
Words in John F. Kennedy’s speech: 1546
Words in Richard Nixon’s speech: 2028

Sentences
Sentences in Franklin D. Roosevelt’s speech: 68
Sentences in John F. Kennedy’s speech: 52
Sentences in Richard Nixon’s speech: 68

2.2 Remove all the stopwords from all three speeches.


To remove the stopwords, we use the “stopwords” corpus available in the nltk.corpus library. To do so,
we need to import the following-
- from nltk.corpus import stopwords
- from nltk.stem.porter import PorterStemmer

The stopwords list contains common stop words like ‘and’, ‘a’, ‘is’, ‘to’, ‘of’, etc., that usually don’t
have any importance in understanding the sentiment and are of little use in machine learning algorithms. The
stopwords in the package are a widely used default list, and we can add to them using the (.extend()) function
or remove them as per our requirement.

Also, we need to specify the language we are working with, as stopword lists are available for many
languages. Here, we will use English.

Stemming is a process that reduces words to their base or root form by removing the affixes, so that related
word forms are treated as the same term. It is widely used in search engines. For
e.g. - eating and eats are both reduced to eat after stemming.
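The stopword-removal and counting steps can be sketched in plain Python as below. This is only an illustration of the idea: it uses a tiny hand-written stopword set and a made-up sentence so that it runs without downloading any nltk data, whereas the report itself uses nltk's English stopword list and PorterStemmer on the inaugural speeches. Stemming is left out of the sketch.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (hypothetical; nltk's English list is much longer).
STOPWORDS = {"the", "and", "a", "is", "to", "of", "for", "we",
             "not", "what", "can", "do", "but", "its"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into words and drop the stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]


def top_words(text, n=3):
    """Return the n most frequent non-stopword tokens with their counts."""
    return Counter(remove_stopwords(text)).most_common(n)


# Made-up sample sentence, not taken from any of the three speeches.
sample = ("We the people ask not what the nation can do for us, "
          "but what we can do for the nation and for its people.")

filtered = remove_stopwords(sample)
print(filtered)
print(top_words(sample, n=3))
```

Extending the stopword set, as the report does with (.extend()) on the nltk list, changes which words survive and therefore which words end up in the top three.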

Some of the stop words removed are-



2.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)

Results after removing stopwords and stemming.

For Franklin D. Roosevelt’s speech:

Here ‘spirit’ and ‘life’ share 3rd place because they have the same number of occurrences.
Most occurring word: It.

For John F. Kennedy’s speech:


Most occurring word: Us.

For Richard Nixon’s speech:

Most occurring word: Us.

2.4 Plot the word cloud of each of the speeches of the variable.
(after removing the stopwords)

Word Cloud is a data visualization technique used for representing text data in which the size of each word
indicates its frequency or importance. To generate a word cloud we need the wordcloud package. It is not
installed in the kernel by default, so we have to install it.
After importing the package, we again remove the stopwords but do not perform stemming. Removing the
stopwords filters out the unwanted words that carry little meaning for the analysis.

Word Cloud of Roosevelt’s Speech:

We can see some highlighted words like “nation”, “know”, “people”, etc., which we observed as top words in the
previous question. This shows that the bigger the word, the higher its frequency.

Word Cloud of Kennedy’s Speech:



Word Cloud of Nixon’s Speech:

Insights –
• Our objective was to look at all 3 speeches and analyse them, to find the strength and
sentiment of the speeches.
• Based on the outputs, we can see that some words are present in all the
speeches.
• These words may be the points that inspired many people and also helped win them the seat of the
President of the United States of America.
• Among all the speeches, “nation” is the word that is significantly highlighted in all three.
