Machine Learning - Final Project Report - Problem 1

Final Project Report
Logistic Regression and LDA – Election Analysis
MACHINE LEARNING
Nabeel Ahmed Khan
Sep ‘21
Date: 27/09/2021
Table of Contents
Table of Figures
1. Executive Summary
2. Introduction
3. Data Details
4. Data Ingestion and EDA - Descriptive Statistics, Duplicate/Null value Check, Uni-/Bi-Variate Analysis, Outlier Check
   4.1 Data Ingestion
      Sample of the Dataset
      Data Info
      Data Shape & Data Types
      Data Description
      Duplicate Value Check
      NULL Value Check
      Inference
   4.2 Univariate Analysis
      Distribution Plots
      Count Plots
      Inference
   4.3 Transforming Categorical variables to Numeric Variables
   4.4 Bivariate Analysis
      Pairplots
      Correlation Heatmap
   4.5 Outlier Check
5. Logistic Regression
   5.1. Preparing for Model Formulation
   5.2. Formulating a Logistic Regression Model on the Training data
6. Logistic Regression: Performance Metrics
   6.1 LR: Model Score
   6.2 LR: Confusion Matrix
   6.3 LR: Classification Report
   6.4 LR: AUC Score
   6.5 LR: ROC Curve
7. Linear Discriminant Analysis
   7.1 Preparing for Model Formulation
   7.2 Formulating a LDA Model on the Training data
8. LDA: Performance Metrics
   8.1 LDA: Model Score
   8.2 LDA: Confusion Matrix
   8.3 LDA: Classification Report
   8.4 LDA: AUC Score
   8.5 LDA: ROC Curve
9. Programming Files
10. Inferences: Insights & Recommendations
   10.1 Insights
   10.2 Recommendations
Table of Figures
Figure 1: Holiday Package Data Info
Figure 2: Data Shape & Data types
Figure 3: Data Description of Integer Type Variables
Figure 4: Data Description of Object Type Variables
Figure 5: Null Value Check
Figure 6: Univariate Analysis - Distribution Plots
Figure 7: Univariate Analysis - Count Plots
Figure 8: Holiday Package Dataset Pair Plots
Figure 9: Holiday Package Dataset Correlation Heatmap
Figure 10: Box Plots for Continuous Variables
Figure 11: Box Plots for Continuous Variables post Outlier Treatment
Figure 12: LR ROC Curve for Training Data
Figure 13: LR ROC Curve for Test Data
Figure 14: LDA ROC Curve for Training Data
Figure 15: LDA ROC Curve for Test Data
Table of Tables
Table 1: Election Dataset Details
Table 2: Dataset Sample

1. Executive Summary
CNBE is a leading news channel that has shared a dataset from a survey of 1525 voters and
wants us to analyse it. Some of these voters voted for the Labour party, while others voted
for the Conservative party. The dataset has 9 variables, which provide information on the
voters who participated in the survey, along with which of the 2 parties each voter voted
for.
The channel is trying to predict which party a voter may vote for on the basis of the provided
information, so we are expected to build a model that predicts which party a voter will vote
for. This will enable them to construct an exit poll that will help in predicting the overall
winner of the election and the seats captured by a particular party. So, as part of this
Machine Learning project report, I will explore the variables available in the dataset and
try to understand their impact on a voter's decision to vote for a particular party.
2. Introduction
The intent of this project is to analyse the Election Analysis dataset. I will explore this
dataset using descriptive statistics and univariate and bivariate analysis as part of
Exploratory Data Analysis, and then apply Logistic Regression and LDA. The dataset contains
details on 1525 voters, and I will analyse the variables available in the dataset and
determine how they contribute to which political party a voter votes for.
3. Data Details
The first column holds an index variable, which is simply the serial number of the entry. I
dropped the index column because it carries no information for the model. The data attributes
in the Election dataset are as follows:
vote : Which political party a voter has voted for (Labour/Conservative)
age : Age of the voter in years
economic.cond.national : Age of the employee in years
edu : Years of formal education
no_young_children : The number of young children (younger than 7 years)
no_older_children : The number of older children (7 years or more)
foreign : Foreigner (yes/no)

Table 1: Election Dataset Details
4. Data Ingestion and EDA - Descriptive Statistics, Duplicate/Null value
Check, Uni-/Bi-Variate Analysis, Outlier Check
4.1 Data Ingestion
Sample of the Dataset
Table 2: Dataset Sample
As we can see, the Holiday Package dataset has 7 variables, and every employee record has the
same set of characteristics. Holliday_Package is the dependent (target) feature and the rest
of the variables are the independent (predictor) variables. The value (yes/no) of the
Holliday_Package variable is to be predicted from the independent variables.

Data Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 872 entries, 0 to 871
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Unnamed: 0         872 non-null    int64
 1   Holliday_Package   872 non-null    object
 2   Salary             872 non-null    int64
 3   age                872 non-null    int64
 4   educ               872 non-null    int64
 5   no_young_children  872 non-null    int64
 6   no_older_children  872 non-null    int64
 7   foreign            872 non-null    object
dtypes: int64(6), object(2)
memory usage: 54.6+ KB
Figure 1: Holiday Package Data Info
Data Shape & Data Types

(872, 7)
Holliday_Package object
Salary int64
age int64
educ int64
no_young_children int64
no_older_children int64
foreign object
dtype: object
Figure 2: Data Shape & Data types
It can be observed that, after dropping the serial number column, the dataset has 872
employee records and 7 variables. Of the 7 variables, the target feature Holliday_Package and
foreign are object type, while the others are integer type.
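The ingestion steps above can be sketched as follows. This is only an illustration under stated assumptions: the three-row CSV text is a hypothetical stand-in for the real file (which is not reproduced in this report), and the values in it are made up. Only the column names come from the Data Info output.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; the values are illustrative only.
csv_text = """,Holliday_Package,Salary,age,educ,no_young_children,no_older_children,foreign
1,no,48412,30,8,1,1,no
2,yes,37207,45,8,0,1,no
3,no,58022,46,9,0,0,yes
"""

# An empty header cell is read by pandas as the column "Unnamed: 0".
df = pd.read_csv(io.StringIO(csv_text))

# The first column is only a serial number, so it is dropped before modelling.
df = df.drop(columns=["Unnamed: 0"])

print(df.shape)   # (rows, columns) after dropping the index column
print(df.dtypes)  # data type of each remaining column
```

On the real file, the same two print statements would give the shape and data types shown in Figure 2.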
Data Description
              Salary         age        educ  no_young_children  no_older_children
count     872.000000  872.000000  872.000000         872.000000         872.000000
mean    47729.172018   39.955275    9.307339           0.311927           0.982798
std     23418.668531   10.551675    3.036259           0.612870           1.086786
min      1322.000000   20.000000    1.000000           0.000000           0.000000
25%     35324.000000   32.000000    8.000000           0.000000           0.000000
50%     41903.500000   39.000000    9.000000           0.000000           1.000000
75%     53469.500000   48.000000   12.000000           0.000000           2.000000
max    236961.000000   62.000000   21.000000           3.000000           6.000000
Figure 3: Data Description of Integer Type Variables
       Holliday_Package foreign
count               872     872
unique                2       2
top                  no      no
freq                471     656
Figure 4: Data Description of Object Type Variables
Duplicate Value Check

When I checked for any duplicate entries, I got the following output:
Duplicate Entries: 0
NULL Value Check

Holliday_Package 0
Salary 0
age 0
educ 0
no_young_children 0
no_older_children 0
foreign 0
dtype: int64
Figure 5: Null Value Check
As we can see, there are no null values in the dataset.
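A minimal sketch of how the duplicate and null checks can be run with pandas. The small DataFrame is a hypothetical stand-in for the Holiday Package data, with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the Holiday Package data.
df = pd.DataFrame({
    "Holliday_Package": ["no", "yes", "no"],
    "Salary": [48412, 37207, 58022],
    "foreign": ["no", "no", "yes"],
})

# duplicated() flags rows that repeat an earlier row in full.
n_duplicates = int(df.duplicated().sum())

# isnull().sum() counts the missing values in each column.
null_counts = df.isnull().sum()

print("Duplicate Entries:", n_duplicates)
print(null_counts)
```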

Inference
1. Our target feature is the Holliday_Package variable. It is binary: after encoding, it
equals 0 for employees who did not opt for the holiday package and 1 for employees who
did.
2. All variables other than Holliday_Package and foreign (such as Salary, age and educ) are
numeric (integer-valued) variables
3. Since the target feature Holliday_Package and the variable foreign are object type,
we need to encode them as numeric codes before creating a model for the Holiday
Package dataset
4. The first column contained an index variable, which is simply the serial number of the
entry. I dropped the index column as it is useless for the model
5. There are no Duplicate entries in the dataset
6. There are no NULL values in the dataset
4.2 Univariate Analysis
Distribution Plots
I plotted the numeric data variables below.
Figure 6: Univariate Analysis - Distribution Plots

Count Plots
I plotted the categorical data variables below.
Figure 7: Univariate Analysis - Count Plots

Inference
1. Nearly 46% of the employees have opted for the Holiday Package.
2. The majority of the employees are native to the country.
3. The majority of the employees do not have any children.
4. Most of the employees who have children have 1 young child (younger than 7 years)
and/or 1-2 older children (7 years or more).
5. Years of education mostly range from 3 to 17, and the bulk of the employees have
8-12 years of education.
6. More than two-thirds of the employees fall in the $25000 to $55000 salary range.
7. Nearly all the employees are 20-60 years old.
4.3 Transforming Categorical variables to Numeric Variables
I encoded the object-type (categorical) variables Holliday_Package and foreign as integer
codes using the .codes attribute of pandas' Categorical type.
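A minimal sketch of this encoding, assuming the conversion goes through pd.Categorical. The DataFrame is a hypothetical stand-in with made-up rows. Categories are sorted alphabetically, so "no" becomes 0 and "yes" becomes 1.

```python
import pandas as pd

# Hypothetical stand-in for the two object-type columns.
df = pd.DataFrame({
    "Holliday_Package": ["no", "yes", "no"],
    "foreign": ["no", "no", "yes"],
})

# Replace each object column with the integer code of its category.
for col in ["Holliday_Package", "foreign"]:
    df[col] = pd.Categorical(df[col]).codes

print(df)
```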
4.4 Bivariate Analysis
Before proceeding with Bivariate Analysis, I transformed the categorical variables to numeric
variables.
Pairplots
Figure 8: Holiday Package Dataset Pair Plots

Correlation Heatmap
Figure 9: Holiday Package Dataset Correlation Heatmap
From the pair plots and the correlation heatmap, it can be observed that the variables in the
Holiday Package dataset form separate clusters and are not strongly correlated with one
another. The maximum correlation between two variables is between educ and Salary, and even
that is only 0.33. Also, whatever limited correlation our target feature Holliday_Package has,
it has with the foreign variable only.
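The correlation values behind such a heatmap can be computed directly with pandas. The sketch below uses a hypothetical five-row DataFrame with made-up values, so its numbers are not those of the report. It only shows how the correlation matrix and the most strongly correlated pair of variables can be found. The heatmap itself is omitted to keep the example free of plotting dependencies.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data; the values are illustrative only.
df = pd.DataFrame({
    "Salary": [30000, 42000, 55000, 61000, 38000],
    "educ": [8, 9, 12, 14, 8],
    "age": [25, 47, 33, 52, 40],
})

# Pearson correlation between every pair of columns.
corr = df.corr()

# Hide the diagonal (each variable with itself) before searching for the maximum.
off_diagonal = ~np.eye(len(corr), dtype=bool)
pairs = corr.where(off_diagonal).abs().stack()
strongest_pair = pairs.idxmax()

print(corr.round(2))
print("Most correlated pair:", strongest_pair)
```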
4.5 Outlier Check
I drew box plots for the Holiday Package dataset, as follows.

Figure 10: Box Plots for Continuous Variables
As we can see from the box plots above, the outliers are mostly in the Salary variable, and
they may reduce the efficacy of the regression model I will build. I treated the outliers in
the dataset using limits derived from the 25th and 75th percentiles. After that, I re-checked
for outliers (please see below).
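One common percentile-based treatment is to cap values that lie more than 1.5 times the interquartile range (IQR) beyond the 25th and 75th percentiles. The sketch below assumes that approach. The helper name cap_outliers_iqr, the factor 1.5 and the choice of capping rather than dropping rows are assumptions made for illustration, not details confirmed by the report.

```python
import pandas as pd


def cap_outliers_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # clip() replaces values below/above the limits with the limits themselves.
    return series.clip(lower=lower, upper=upper)


# Illustrative salaries, including one very low and one very high value.
salary = pd.Series([1322, 35324, 41903, 53469, 236961])
capped = cap_outliers_iqr(salary)
print(capped.tolist())
```

Capping keeps all rows in the dataset, which matters here because the dataset has only 872 records.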
Figure 11: Box Plots for Continuous Variables post Outlier Treatment
5. Logistic Regression
5.1. Preparing for Model Formulation
I separated the target feature Holliday_Package from the predictors and stored it separately.
Then I used the randomized train/test splitting function from the sklearn package to split
the data into training and test datasets in the ratio 70:30 (the test set is 30% of the total
data).
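A minimal sketch of this preparation step using sklearn's train_test_split. The ten-row DataFrame is a hypothetical stand-in with made-up values, and random_state=1 is an assumed seed, not one stated in the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, already-encoded stand-in for the Holiday Package data.
df = pd.DataFrame({
    "Holliday_Package": [0, 1, 0, 1, 0, 0, 1, 1, 0, 1],
    "Salary": [48, 37, 58, 66, 41, 52, 30, 45, 39, 61],
    "foreign": [0, 0, 1, 1, 0, 0, 1, 0, 0, 1],
})

# Separate the predictors (X) from the target feature (y).
X = df.drop(columns=["Holliday_Package"])
y = df["Holliday_Package"]

# 70:30 split; random_state is an assumed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

print(len(X_train), len(X_test))
```

With test_size=0.30, sklearn rounds the test size up, so the 872 records split into 610 training and 262 test records, which agrees with the support totals in the classification reports.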
5.2. Formulating a Logistic Regression Model on the Training data
Then I applied LogisticRegression to obtain the best-fit model on the training data.
.37)*height + Intercept
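A minimal sketch of fitting a logistic regression classifier with sklearn. It runs on synthetic data generated inside the example, so the coefficients and score it prints are not those of the Holiday Package model. The max_iter value is an assumed setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 samples, 3 features; the label depends mainly on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# max_iter is raised as a precaution against convergence warnings.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# score() returns the accuracy on the data passed to it.
train_accuracy = model.score(X, y)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Training accuracy:", round(train_accuracy, 3))
```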
6. Logistic Regression: Performance Metrics
6.1 LR: Model Score
1. The Accuracy Score for the Regression Model on Training data is 0.51967
2. The Accuracy Score for the Regression Model on Test data is 0.53053
6.2 LR: Confusion Matrix
1. Confusion Matrix for the Regression Model on Training data
[[294  32]
 [261  23]]
2. Confusion Matrix for the Regression Model on Test data
[[129  16]
 [107  10]]
6.3 LR: Classification Report
1. Classification Report for the Regression Model on Training data
              precision    recall  f1-score   support

           0       0.53      0.90      0.67       326
           1       0.42      0.08      0.14       284

    accuracy                           0.52       610
   macro avg       0.47      0.49      0.40       610
weighted avg       0.48      0.52      0.42       610
2. Classification Report for the Regression Model on Test data
              precision    recall  f1-score   support

           0       0.55      0.89      0.68       145
           1       0.38      0.09      0.14       117

    accuracy                           0.53       262
   macro avg       0.47      0.49      0.41       262
weighted avg       0.47      0.53      0.44       262
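The class 1 figures in the training classification report follow directly from the training confusion matrix reported in section 6.2. The short calculation below reproduces them, assuming sklearn's convention that rows are actual classes and columns are predicted classes. That reading is consistent with the support values of 326 and 284.

```python
# Training confusion matrix from section 6.2: rows = actual, columns = predicted.
cm = [[294, 32],
      [261, 23]]

tn, fp = cm[0]  # actual 0: predicted 0, predicted 1
fn, tp = cm[1]  # actual 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total       # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted 1s that are actually 1
recall = tp / (tp + fn)            # share of actual 1s that are predicted as 1
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 5), round(precision, 2), round(recall, 2), round(f1, 2))
```

The low recall shows that the model predicts class 0 for almost every employee: only 23 of the 284 employees who opted for the package are identified.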
6.4 LR: AUC Score
AUCTrain: 0.567
AUCTest: 0.627
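A minimal sketch of how an AUC score and the points of a ROC curve can be obtained with sklearn. The four labels and scores are a toy example, not data from this project. In practice the scores would be the predicted probabilities of class 1 from the fitted model.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy example: true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Area under the ROC curve; 0.5 is chance level and 1.0 is a perfect ranking.
auc = roc_auc_score(y_true, y_score)

# False positive rate and true positive rate at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print("AUC:", auc)
print("FPR:", fpr)
print("TPR:", tpr)
```

Plotting tpr against fpr gives the ROC curve.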
6.5 LR: ROC Curve
Figure 12: LR ROC Curve for Training Data
Figure 13: LR ROC Curve for Test Data

7. Linear Discriminant Analysis
I used a separate Jupyter notebook for my Linear Discriminant Analysis of the Holiday
Package data.
7.1 Preparing for Model Formulation
I converted the categorical variables to dummy variables using pandas' get_dummies
function.
I captured the target feature (Holliday_Package_yes after conversion to dummy variables)
into separate vectors for the training set and the test set. Then I used the randomized
train/test splitting function from the sklearn package to split the data into training and
test datasets in the ratio 70:30 (the test set is 30% of the total data).
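A minimal sketch of the dummy-variable conversion. The DataFrame is a hypothetical stand-in with made-up values. drop_first=True is an assumption: it keeps one indicator column per binary variable, which is consistent with the target being named Holliday_Package_yes.

```python
import pandas as pd

# Hypothetical stand-in for the Holiday Package data.
df = pd.DataFrame({
    "Holliday_Package": ["no", "yes", "no"],
    "Salary": [48412, 37207, 58022],
    "foreign": ["no", "no", "yes"],
})

# One indicator column per categorical variable; the "no" level is dropped.
dummies = pd.get_dummies(
    df, columns=["Holliday_Package", "foreign"], drop_first=True
)

# Separate the target vector from the predictors.
y = dummies["Holliday_Package_yes"].astype(int)
X = dummies.drop(columns=["Holliday_Package_yes"])

print(dummies.columns.tolist())
```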
7.2 Formulating a LDA Model on the Training data
Then I applied LDA to obtain the best-fit model on the training data.
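A minimal sketch of fitting an LDA classifier with sklearn's LinearDiscriminantAnalysis. It uses synthetic two-class data generated inside the example, so the score it prints is not the one reported for the Holiday Package data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data: two Gaussian classes with different means.
rng = np.random.default_rng(0)
X_class0 = rng.normal(loc=0.0, size=(100, 2))
X_class1 = rng.normal(loc=2.0, size=(100, 2))
X = np.vstack([X_class0, X_class1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Accuracy on the data used for fitting.
train_accuracy = lda.score(X, y)

# Predicted probability of class 1, which is the input needed for AUC and ROC.
probabilities = lda.predict_proba(X)[:, 1]

print("Training accuracy:", round(train_accuracy, 3))
```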
8. LDA: Performance Metrics
8.1 LDA: Model Score
1. The Accuracy Score for the LDA model on Training data is 0.6721
2. The Accuracy Score for the LDA model on Test data is 0.6412
8.2 LDA: Confusion Matrix
1. Confusion Matrix for the LDA Model on Training data
[[252  74]
 [126 158]]
2. Confusion Matrix for the LDA Model on Test data
[[103  42]
 [ 52  65]]
8.3 LDA: Classification Report
1. Classification Report for the LDA Model on Training data
              precision    recall  f1-score   support

           0       0.67      0.77      0.72       326
           1       0.68      0.56      0.61       284

    accuracy                           0.67       610
   macro avg       0.67      0.66      0.66       610
weighted avg       0.67      0.67      0.67       610
2. Classification Report for the LDA Model on Test data
              precision    recall  f1-score   support

           0       0.66      0.71      0.69       145
           1       0.61      0.56      0.58       117

    accuracy                           0.64       262
   macro avg       0.64      0.63      0.63       262
weighted avg       0.64      0.64      0.64       262
8.4 LDA: AUC Score
AUCTrain: 0.742
AUCTest: 0.703
8.5 LDA: ROC Curve
Figure 14: LDA ROC Curve for Training Data
Figure 15: LDA ROC Curve for Test Data

9. Programming Files
Predictive_Modelling_Nabeel_Khan_Final_Project_Report_LDA.pdf
Predictive_Modelling_Nabeel_Khan_Final_Project_Report_LDA.ipynb
Predictive Modelling_Nabeel Khan_Final Project Report-Logistic Regression.pdf
Predictive_Modelling_Nabeel_Khan_Final_Project_Report_Logistic_Regression.ipynb
10. Inferences: Insights & Recommendations
10.1 Insights
1. The model score (accuracy) of the Logistic Regression model is 51.97% on the Training
dataset and 53.05% on the Test dataset.
2. Classification Report of the Logistic Regression model (class 1):
PrecisionLR_Train = 42% | PrecisionLR_Test = 38%
RecallLR_Train = 8% | RecallLR_Test = 9%
F1LR_Train = 14% | F1LR_Test = 14%
3. The AUC of the Logistic Regression model is 56.7% on Training data and 62.7% on Test data.
4. The Logistic Regression model does not seem to be a good fit and needs improvement.
5. The model score of the LDA model is 67.21% on the Training dataset and 64.12% on the Test
dataset. As we can see, the accuracy of the LDA model is better than that of the Logistic
Regression model.
6. Classification Report of the LDA model (class 1):
PrecisionLDA_Train = 68% | PrecisionLDA_Test = 61%
RecallLDA_Train = 56% | RecallLDA_Test = 56%
F1LDA_Train = 61% | F1LDA_Test = 58%
Clearly, the classification report for the LDA model is better than that of the Logistic
Regression model.
7. The AUC of the LDA model is 74.2% on Training data and 70.3% on Test data, which are quite
similar. Again, the AUC for the LDA model is better than that of the Logistic Regression model.
8. On the Test data, the Logistic Regression model and the LDA model correctly predict
whether an employee opts for the Holiday Package for around 53% and 64.1% of the
employees, respectively. Since the accuracy of the LDA model is better, I prefer the LDA
model.
9. In addition, the LDA model correctly identifies 56% of the employees who opted for the
Holiday Package (recall for class 1), while the Logistic Regression model identifies only
8-9% of them.
10. From the above, I can say that the LDA model is the better one for the travel agency to
use in an attempt to improve their bottom line.
10.2 Recommendations
1. The dataset has outliers in the Salary variable. Logistic Regression is generally more
robust to outliers than LDA, which assumes normally distributed predictors. Therefore, it
is recommended to treat the outliers before proceeding to use LDA.
2. As we saw above, if an employee is a foreigner and does not have any young children (this
can be seen using Holliday_Package as hue in the bivariate plots), the probability that the
employee purchases the Holiday Package is higher. Also, many employees who have older
children do not go for the Holiday Package. So, the agency can devise special promotional
programs and discounts to incentivize such employees to opt for the Holiday Package.
3. Moreover, a lot of employees with higher salaries are not purchasing the Holiday
Package (again, this can be seen using Holliday_Package as hue in the bivariate plots). So,
the travel agency can come up with a plan to create more product awareness and introduce
targeted promotions for such employees.

4. The age of the employee is not a material factor in opting for the Holiday Package, so it
can be ignored.
5. It was observed from the correlation coefficients that the target feature Holliday_Package
has a high negative correlation with no_young_children. So, it would go a long way if the
travel agency can tailor their holiday packages so as to make them more appealing to
employees with infants and young children.
