PROJECT
Statement of Problem 1:
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to build a
model to predict which party a voter will vote for on the basis of the given information, in order
to create an exit poll that will help in predicting the overall win and the seats covered by a particular party.
2. age: in years
7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores represent 'Eurosceptic' sentiment.
9. gender: female or male.
index  Unnamed: 0  vote    age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender
0      1           Labour  43   3                       3                        4      1      2       2                    female
1      2           Labour  36   4                       4                        4      4      5       2                    male
2      3           Labour  35   4                       4                        5      2      3       2                    male
3      4           Labour  24   4                       2                        2      1      4       0                    female
4      5           Labour  41   2                       2                        1      1      6       2                    male
Table: First five rows of the dataset
column name              count  mean     std      min  25%  50%  75%   max
Unnamed: 0               1525   763      440.374  1    382  763  1144  1525
vote                     1525   0.69705  0.45969  0    0    1    1     1
age                      1525   54.1823  15.7112  24   41   53   67    93
economic.cond.national   1525   3.2459   0.88097  1    3    3    4     5
economic.cond.household  1525   3.14033  0.92995  1    3    3    4     5
Blair                    1525   3.33443  1.17482  1    2    4    4     5
Hague                    1525   2.74689  1.2307   1    2    2    4     5
Europe                   1525   6.72853  3.29754  1    4    6    10    11
political.knowledge      1525   1.5423   1.08332  0    0    2    2     3
gender_male              1525   0.46754  0.49911  0    0    0    1     1
Observations:
• There are 1525 rows and 10 columns in the dataset.
• It has 2 categorical variables and 8 numerical variables.
• There are no missing values in the dataset.
• There are 8 duplicated rows in the dataset.
• The 'Unnamed: 0' column has to be dropped, as it is only a serial number.
• From the five-point summary of the age column, the minimum age is 24 and the maximum age is 93, so no one below the age of 24 was surveyed.
• Female voters are more frequent than male voters.
• The mean values are slightly higher than the median values for some variables, which indicates that the data may be slightly skewed.
• From the target variable we can see that about 70% of voters chose the Labour party (a sketch of these checks follows below).
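The checks behind these observations can be reproduced with a short pandas sketch; the file name Election_Data.xlsx is an assumption, and the exact notebook code may differ:

import pandas as pd

# Assumed file name; adjust to the actual path of the survey data.
df = pd.read_excel("Election_Data.xlsx")

print(df.shape)                                  # (1525, 10)
df.info()                                        # data types and non-null counts
print(df.isnull().sum())                         # missing values per column
print(df.duplicated().sum())                     # number of duplicated rows
df = df.drop(columns=["Unnamed: 0"])             # drop the serial-number column
print(df.describe().T)                           # five-point summary of numeric columns
print(df["vote"].value_counts(normalize=True))   # Labour vs Conservative share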
1.2 Perform EDA (Check the null values, Data types, shape, Univariate, bivariate analysis).
Also check for outliers (4 pts). Interpret the inferences for each (3 pts) Distribution
plots(histogram) or similar plots for the continuous columns. Box plots, Correlation plots.
Appropriate plots for categorical variables. Inferences on each plot. Outliers proportion
should be discussed, and inferences from above used plots should be there. There is no
restriction on how the learner wishes to implement this but the code should be able to
represent the correct output and inferences should be logical and correct.
• From the above plot, the target variable's categories seem to be unbalanced.
• The Labour category has 1063 values while the Conservative category has 462 values.
• Since the two classes are in roughly a 70:30 ratio, we can consider this an acceptable spread rather than an imbalance, so we do not have to do any resampling.
Table: Crosstab of vote against gender (female, male, All)
• There are no outliers in the Blair, Hague, Europe and political.knowledge variables.
• Both the Blair and Hague features have symmetrical box plots, but the Europe feature is left skewed and the political.knowledge feature is right skewed.
• Since the political.knowledge feature has 0 as a genuine value, the left whisker is not visible because the lower quartile is equal to the minimum value.
• In the Blair feature the value 4 has the most data points, and in the Hague feature the value 3 has the most data points.
• In the Europe feature the value 11 has the most data points. Europe is an 11-point scale that measures respondents' attitudes toward European integration, with high scores representing 'Eurosceptic' sentiment, so most of the voters hold a 'Eurosceptic' sentiment.
• In the political.knowledge feature the value 2 occurs most often, but the second most frequent value is 0, which indicates that many voters do not have knowledge of the parties' positions on European integration.
• From the above heat map we can see that there is not much correlation between the features.
• Europe and age have a slight positive correlation.
• Most of the features have a weak negative correlation with each other (a plotting sketch follows this list).
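A minimal sketch of the plots these inferences are based on, assuming seaborn and matplotlib and continuing from the loading sketch above (column names follow the dataset):

import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = ["age", "economic.cond.national", "economic.cond.household",
                "Blair", "Hague", "Europe", "political.knowledge"]

# Univariate distribution plot and box plot for each numeric column.
for col in numeric_cols:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], ax=axes[0])
    sns.boxplot(x=df[col], ax=axes[1])
    plt.show()

# Count plot for the target variable and correlation heat map.
sns.countplot(x="vote", data=df)
plt.show()
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap="coolwarm")
plt.show()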
1.3 Encode the data (having string values) for Modelling. Is scaling necessary here or
not? (2 pts), Data Split: Split the data into train and test (70:30) (2 pts). The learner is
expected to check and comment about the difference in scale of different features on
the basis of an appropriate measure, for example std dev, variance, etc. Should justify
whether there is a necessity for scaling. Object data should be converted into
categorical/numerical data to fit in the models. (pd.Categorical().codes,
pd.get_dummies(drop_first=True)) Data split, ratio defined for the split, train-test split
should be discussed.
Encoding:
• One-hot encoding is done on the only 'object' type variable, i.e. 'gender'.
• A new column, gender_male, is created, with 1 indicating male and 0 indicating female.
• For the target variable, categorical label encoding is used (Labour = 1, Conservative = 0), as we are going to use it for classification (a code sketch of this step follows the table below).
• This is how the encoded data looks:
index  vote  age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender_male
1      1     43   3                       3                        4      1      2       2                    0
2      1     36   4                       4                        4      4      5       2                    1
3      1     35   4                       4                        5      2      3       2                    1
4      1     24   4                       2                        2      1      4       0                    0
5      1     41   2                       2                        1      1      6       2                    1
Table: Encoded dataset
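A minimal sketch of the encoding step described above, assuming the dataframe from the earlier sketch; the exact notebook code may differ:

import pandas as pd

# One-hot encode 'gender'; drop_first keeps a single dummy column, gender_male.
df = pd.get_dummies(df, columns=["gender"], drop_first=True)

# Label-encode the target: category codes are alphabetical, so Conservative -> 0 and Labour -> 1.
df["vote"] = pd.Categorical(df["vote"]).codes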
• After encoding, we have split the dataset into train and test sets in the given 70:30 ratio with random_state = 1.
• Since the dependent variable contains 1s and 0s, we need to ensure that the two classes are split into the training and testing datasets in the same proportion.
• This keeps the data balanced and avoids bias while training or testing the model; therefore the stratify argument is set to the target while splitting (see the sketch below).
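A minimal sketch of the split, assuming X and y are the encoded features and target:

from sklearn.model_selection import train_test_split

X = df.drop(columns=["vote"])
y = df["vote"]

# 70:30 split with random_state=1; stratify keeps the roughly 70:30 class mix in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)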
Table: Sample train dataset of independent variables (rows 534, 709, 1145, 1082 and 958 of X_train)
Index  age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender_male
275    71   2                       3                        4      2      11      0                    0
768    31   2                       2                        2      4      5       2                    1
417    35   4                       3                        2      1      7       2                    1
1034   34   4                       4                        4      2      7       0                    0
508    40   3                       4                        4      2      7       3                    1
Table: Sample test dataset of independent variables
Index vote
534 1
709 1
1145 0
1082 1
958 1
Table : Sample train dataset of dependent variable
Index vote
275 1
768 0
417 1
1034 1
508 1
Table: Sample test dataset of dependent variable
Scaling:
• Scaling is a necessity for distance-based models such as KNN, and it also helps gradient-based solvers such as Logistic Regression converge.
• Algorithms like Linear Discriminant Analysis and Naive Bayes do not require feature scaling.
• So we can use the unscaled data for the Naive Bayes and LDA algorithms, and the scaled dataset for algorithms like KNN.
• In this dataset, some of the feature values are in different units and ranges.
• For example, 'age' is a continuous variable with a wide range of data points between 24 and 93; it has a mean of 54 and a standard deviation of 15.7, which is in a totally different range from the other variables, which are ordinal.
• Most of the ordinal variables take values from 1 to 5, so their mean is around 3 and their standard deviation is around 1, which is very different from the 'age' variable.
• This difference in scale might result in inaccuracies in model predictions.
• Therefore, in order for machine learning models to interpret these features on the same scale, we need to perform feature scaling.
• We can use the MinMaxScaler function, as we have both continuous and ordinal variables in our dataset (a minimal sketch follows this list).
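A minimal scaling sketch using MinMaxScaler; note that fitting the scaler on the training data only is standard practice, though it is not stated above:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Fit the scaler on the training data only and reuse it on the test data,
# so no information from the test set leaks into the scaling parameters.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)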
1.4) Apply Logistic Regression and LDA (Linear Discriminant Analysis) (2 pts). Interpret
the inferences of both models (2 pts). Successful implementation of each model. Logical
reason behind the selection of different values for the parameters involved in each model.
Calculate Train and Test Accuracies for each model. Comment on the validness of models
(over fitting or under fitting)
Model Building
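Both models in this section are fitted with default settings; the following minimal sketch (max_iter raised only so the solver converges) illustrates the step, though it is not the exact notebook code:

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report

# Logistic Regression and LDA on the unscaled training data.
lr = LogisticRegression(max_iter=10000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

for name, model in [("Logistic Regression", lr), ("LDA", lda)]:
    print(name, "train accuracy:", model.score(X_train, y_train))
    print(name, "test accuracy:", model.score(X_test, y_test))
    print(classification_report(y_test, model.predict(X_test)))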
Logistic Regression model
Train accuracy: 0.8417 (classification report omitted)
Test accuracy: 0.8289 (classification report omitted)
Inference
• The model is neither over-fitted nor under-fitted, as the difference between the accuracy values of the train and test data is not much.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
LDA model
Train accuracy: 0.8370 (classification report omitted)
Test accuracy: 0.8333 (classification report omitted)
Inference
• The LDA model looks slightly better than the Logistic Regression model, since the accuracy values of the train and test data are nearly equal.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
1.5) Apply KNN Model and Naïve Bayes Model (2pts). Interpret the inferences of each model
(2 pts). Successful implementation of each model. Logical reason behind the selection of
different values for the parameters involved in each model. Calculate Train and Test
Accuracies for each model. Comment on the validness of models (over fitting or under fitting)
KNN Model
As the KNN model is a distance-based algorithm, we have to use the scaled dataset. After data
preprocessing and scaling, the KNN model is applied to the train and test datasets with default
hyper-parameters, so the n_neighbors value used here is 5 (a minimal sketch follows this paragraph).
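A minimal sketch of the two models in this section, with KNN on the scaled data and Gaussian Naive Bayes on the unscaled data (variable names follow the earlier sketches):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# KNN is distance based, so it is fitted on the scaled data; n_neighbors=5 is the default.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, y_train)
print("KNN train accuracy:", knn.score(X_train_scaled, y_train))
print("KNN test accuracy:", knn.score(X_test_scaled, y_test))

# Gaussian Naive Bayes does not require scaling, so the unscaled data is used.
nb = GaussianNB().fit(X_train, y_train)
print("Naive Bayes train accuracy:", nb.score(X_train, y_train))
print("Naive Bayes test accuracy:", nb.score(X_test, y_test))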
Default KNN model (n_neighbors = 5)
Train accuracy: 0.8737 (classification report omitted)
Test accuracy: 0.8158 (classification report omitted)
New KNN model
Test accuracy: 0.8553 (classification report omitted)
Inference
• In the new KNN model the test accuracy has increased and the train accuracy has decreased, but the difference between the train and test accuracies has reduced, so this is a better fit.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
Naive Bayes Model
Train accuracy: 0.8323
Classification report (train):
      precision  recall  f1-score  support
0.0   0.73       0.71    0.72      323
1.0   0.88       0.88    0.88      744
Test accuracy: 0.8202 (classification report omitted)
Inference
• The model is neither over-fitted nor under-fitted, as the difference between the accuracy values of the train and test data is very small.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
1.6) Model Tuning (4 pts) , Bagging ( 1.5 pts) and Boosting (1.5 pts). Apply grid search on each
model (include all models) and make models on best_params. Define a logic behind choosing
particular values for different hyper-parameters for grid search. Compare and comment on
performances of all. Comment on feature importance if applicable. Successful
implementation of both algorithms along with inferences and comments on the model
performances.
Random Forest Model
Train accuracy: 0.999 (classification report omitted)
Test accuracy: 0.8443 (classification report omitted)
Inference
• The model is over-fitted.
• As we can see, the train data has a 99.9% accuracy while the test data has about 84% accuracy. The difference is more than 10%, so we can infer that the Random Forest model is over-fitted.
• We will use bagging to improve the performance of the model.
Bagging Model
A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a
random subset of the original dataset, and then aggregates their individual predictions (either
by voting or by averaging) to form a final prediction. Here we have used the default parameter
values, such as n_estimators=10, because increasing it leads to a larger over-fitting problem (a minimal sketch is shown below).
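A minimal sketch of this step; using a Random Forest as the base estimator is an assumption based on the inference below, and older scikit-learn versions take base_estimator instead of estimator:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Bagging with a Random Forest base estimator (assumption) and the default n_estimators=10.
bagging = BaggingClassifier(
    estimator=RandomForestClassifier(random_state=1),
    n_estimators=10,
    random_state=1,
)
bagging.fit(X_train, y_train)
print("Bagging train accuracy:", bagging.score(X_train, y_train))
print("Bagging test accuracy:", bagging.score(X_test, y_test))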
Train accuracy: 0.9840 (classification report omitted)
Test accuracy: 0.8421 (classification report omitted)
Inference
• Even after using the bagging classifier, the over-fitting problem is still there.
• But as we can see, both the train and test accuracies have reduced.
AdaBoost Model
Train accuracy: 0.8398 (classification report omitted)
Test accuracy: 0.8355 (classification report omitted)
Inference
• The AdaBoost model looks better than the Random Forest and Bagging models, since the accuracy values of the train and test data are nearly equal.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
Gradient Boosting Model
Train accuracy: 0.8860 (classification report omitted)
Test accuracy: 0.8421 (classification report omitted)
Inference
• The model is neither over-fitted nor under-fitted, as the difference between the accuracy values of the train and test data is very small.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
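A minimal sketch of the two boosting models reported above, with default hyper-parameters; random_state=1 is an assumption:

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# AdaBoost and Gradient Boosting with default hyper-parameters.
ada = AdaBoostClassifier(random_state=1).fit(X_train, y_train)
gbc = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

for name, model in [("AdaBoost", ada), ("Gradient Boosting", gbc)]:
    print(name, "train accuracy:", model.score(X_train, y_train))
    print(name, "test accuracy:", model.score(X_test, y_test))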
Best parameters:
{'penalty': 'l2', 'solver': 'sag', 'tol': 0.0001, 'max_iter': 10000, 'n_jobs': 2}
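These parameters come from a grid search; a minimal GridSearchCV sketch is shown below, where the parameter grid is a hypothetical reconstruction rather than the exact grid used in the notebook:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Hypothetical grid; the actual grid may have contained other values.
param_grid = {
    "penalty": ["l2"],
    "solver": ["sag", "lbfgs", "liblinear"],
    "tol": [1e-4, 1e-5],
}
grid = GridSearchCV(
    LogisticRegression(max_iter=10000, n_jobs=2),
    param_grid,
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
best_lr = grid.best_estimator_
print("Tuned train accuracy:", best_lr.score(X_train, y_train))
print("Tuned test accuracy:", best_lr.score(X_test, y_test))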
Logistic Regression tuned model:
Train accuracy: 0.8398 (classification report omitted)
Test accuracy: 0.8224 (classification report omitted)
Data   Score      Regular Model (%)  Tuned Model (%)
Train  Accuracy   84.17              83.98
       Precision  87                 86
       Recall     91                 92
       F1-score   89                 89
Test   Accuracy   82.89              82.24
       Precision  86                 85
       Recall     90                 90
       F1-score   88                 88
Table: Regular vs tuned Logistic Regression
Inference
• As we can see from the above tabular comparison, there is not much difference between the performance of the regular logistic regression model and the tuned logistic regression model.
• The difference between the train and test accuracies has also been reduced.
• The values are high overall and there is no over-fitting or under-fitting; therefore both models are equally good.
Tuned LDA model
Train accuracy: 0.8369
Classification report (train):
      precision  recall  f1-score  support
0.0   0.75       0.69    0.72      323
1.0   0.87       0.90    0.89      744
Test accuracy: 0.8333 (classification report omitted)
Tuned KNN model
Train accuracy: 0.8511
Classification report (train):
      precision  recall  f1-score  support
0.0   0.75       0.77    0.76      323
1.0   0.90       0.89    0.89      744
Test accuracy: 0.8266
Classification report (test):
      precision  recall  f1-score  support
0.0   0.71       0.72    0.71      138
1.0   0.88       0.87    0.88      318
Inference
• There is no over-fitting or under-fitting in the tuned KNN model. Overall, it is a good model.
• As we can see, the regular KNN model was an even better fit than the tuned model, so model tuning has not helped the model much.
• The values are better in the regular or base KNN model.
• Therefore, the regular KNN model is the better model.
Tuned Random Forest model
Train accuracy: 0.7832 (classification report omitted)
Test accuracy: 0.7917 (classification report omitted)
Inference
• The regular Random Forest model was very over-fitted, but tuning helped the model overcome the over-fitting problem.
• The difference between the train and test accuracies has also been reduced.
• However, the accuracy has decreased considerably and is very low compared to the other classifier models.
• Recall, precision and F1-scores have improved a lot after tuning.
Bagging model (tuned)
Train accuracy: 0.9268 (classification report omitted)
Test accuracy: 0.8355 (classification report omitted)
Inference
• Even after using the bagging technique, the random forest model is still over-fitted.
Tuned AdaBoost model
Train accuracy: 0.8520 (classification report omitted)
Test accuracy: 0.8136 (classification report omitted)
Tuned Gradient Boosting model
Train accuracy: 0.8860 (classification report omitted)
Test accuracy: 0.8268 (classification report omitted)
Inference
• There is no over-fitting or under-fitting in the tuned Gradient Boosting model. Overall, it is a good model.
• As we can see, the regular Gradient Boosting model was an even better fit than the tuned model, so model tuning has not helped the model much.
• The values are better in the regular or base Gradient Boosting model.
• Therefore, the regular Gradient Boosting model is the better model.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model,
classification report (4 pts) Final Model - Compare and comment on all models on the basis of
the performance metrics in a structured tabular manner. Describe on which model is
best/optimized, After comparison which model suits the best for the problem in hand on the
basis of different measures. Comment on the final model.(3 pts)
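A reusable evaluation sketch for this section; evaluate() is a hypothetical helper, not a function from the original notebook:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

# Prints accuracy, confusion matrix, classification report and ROC AUC for both splits,
# and adds the ROC curves of the given model to the current plot.
def evaluate(model, X_tr, y_tr, X_te, y_te, label):
    for split, X, y in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        pred = model.predict(X)
        prob = model.predict_proba(X)[:, 1]
        print(label, split, "accuracy:", accuracy_score(y, pred))
        print(confusion_matrix(y, pred))
        print(classification_report(y, pred))
        print(label, split, "ROC AUC:", roc_auc_score(y, prob))
        fpr, tpr, _ = roc_curve(y, prob)
        plt.plot(fpr, tpr, label=f"{label} {split}")
    plt.plot([0, 1], [0, 1], linestyle="--")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()

For example, evaluate(best_lr, X_train, y_train, X_test, y_test, "Logistic Regression") would print the metrics below and draw the corresponding ROC curves.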
Logistic Regression Model
Accuracy (train / test): 0.8398 / 0.8224
ROC AUC (train / test): 0.892 / 0.882
Inference
• Both the regular and tuned models are neither over-fitted nor under-fitted, as the difference between the accuracy values of the train and test data is not much.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
• The ROC_AUC score and the ROC curve indicate no over-fitting.
LDA Model
Accuracy (train / test): 0.8369 / 0.8333
ROC AUC (train / test): 0.877 / 0.882
Additional reported scores (train / test): 0.8228 / 0.8531, with ROC AUC 0.877 / 0.914
Inference
• Both the regular and tuned models are neither over-fitted nor under-fitted, as the difference between the accuracy values of the train and test data is not much.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
• The ROC_AUC score and the ROC curve indicate no over-fitting.
KNN Model
Accuracy (train / test): 0.8407 / 0.8553
ROC AUC (train / test): 0.914 / 0.874
Additional reported scores (train / test): 0.8228 / 0.8266, with ROC AUC 0.922 / 0.914
Random Forest Model
Accuracy (train / test): 1.0 / 0.8443
Confusion matrix (train): [[323, 0], [0, 744]]; (test): [[93, 45], [26, 292]]
ROC AUC (train / test): 1.0 / 0.895
Tuned Random Forest: accuracy (train / test): 0.7926 / 0.7763; ROC AUC: 0.890 / 0.871
Bagging Model
Accuracy (train / test): 0.9840 / 0.8421; ROC AUC: 0.999 / 0.876
Tuned Bagging: accuracy (train / test): 0.9246 / 0.8268; ROC AUC: 0.983 / 0.881
Inference
• Even after using the bagging classifier, the over-fitting problem is still there.
• But as we can see, both the train and test accuracies have reduced.
AdaBoost Model
Accuracy (train / test): 0.8398 / 0.8355; ROC AUC: 0.900 / 0.910
Tuned AdaBoost: accuracy (train / test): 0.8520 / 0.8136; ROC AUC: 0.915 / 0.862
Inference
Gradient Boosting Model
Accuracy (train / test): 0.8860 / 0.8421; ROC AUC: 0.910 / 0.904
Tuned Gradient Boosting: accuracy (train / test): 0.9048 / 0.8289; ROC AUC: 0.969 / 0.874
Inference
• There is no over-fitting or under-fitting in the tuned Gradient Boosting model. Overall, it is a good model.
• As we can see, the regular Gradient Boosting model was an even better fit than the tuned model, so model tuning has not helped the model much.
• The values are better in the regular or base Gradient Boosting model.
• Therefore, the regular Gradient Boosting model is the better model.
Conclusion:
• There is no under-fitting or over-fitting in any of the tuned models.
• All the tuned models have high scores and every model is good, but the most consistent model across both the train and test data is the KNN model.
• The KNN model performs the best, with an 84% accuracy score on the train data and an 85% accuracy score on the test data. It also has an AUC score of about 91% on both the train and test data, which is the highest of all the models.
• It also has a precision score of 88% and a recall of 92%, which are also the highest of all the models. So we conclude that the KNN model is the best/optimized model.
Statement of Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States of
America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1) Find the number of characters, words and sentences for the mentioned documents.
(Hint: use .words(), .raw(), .sents() for extracting the counts.)
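A minimal sketch of these counts using the nltk inaugural corpus; note that character counts taken from .raw() include spaces and punctuation, so they may differ slightly from the figures reported below:

import nltk
from nltk.corpus import inaugural

nltk.download("inaugural")

# File ids of the three speeches in the inaugural corpus.
speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]

for fid in speeches:
    print(fid)
    print("  characters:", len(inaugural.raw(fid)))
    print("  words:     ", len(inaugural.words(fid)))
    print("  sentences: ", len(inaugural.sents(fid)))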
Number of characters:
• President Franklin D. Roosevelt's speech has 7571 characters.
• President John F. Kennedy's speech has 7618 characters.
• President Richard Nixon's speech has 9991 characters.
Number of words:
• There are 1536 words in President Franklin D. Roosevelt's speech.
• There are 1546 words in President John F. Kennedy's speech.
• There are 2028 words in President Richard Nixon's speech.
Number of Sentences:
2.2) Remove all the stopwords from the three speeches. Show the word count before and
after the removal of stopwords. Show a sample sentence after the removal of stopwords.
Before removing the stop-words, we changed all the letters to lowercase and removed special
characters. We also stemmed the words using the Snowball Stemmer.
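A minimal sketch of this cleaning step; the regular expression and the exact stop-word list are assumptions, so the resulting counts may differ slightly from the notebook:

import re
import nltk
from nltk.corpus import inaugural, stopwords
from nltk.stem import SnowballStemmer

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def clean(words):
    # Lowercase, strip non-alphabetic characters, drop stop-words, then stem.
    words = [re.sub(r"[^a-z]", "", w.lower()) for w in words]
    words = [w for w in words if w and w not in stop_words]
    return [stemmer.stem(w) for w in words]

for fid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    raw_words = list(inaugural.words(fid))
    cleaned = clean(raw_words)
    print(fid, "words before:", len(raw_words), "after:", len(cleaned))
    print("  sample:", " ".join(cleaned[:15]))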
2.4) Plot the word cloud of each of the three speeches. (after removing the stopwords)
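A minimal word-cloud sketch, assuming the wordcloud package is installed and reusing the clean() helper and file ids from the sketch above:

import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.corpus import inaugural

# One word cloud per speech, built from the cleaned (stop-word-free, stemmed) text.
for fid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    text = " ".join(clean(inaugural.words(fid)))
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(fid)
    plt.show()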