
MACHINE LEARNING

PROJECT

PROBLEM 1
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent elections. A survey was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help predict the overall win and the seats covered by a particular party.

Data Dictionary

1. vote: Party choice: Conservative or Labour

2. age: in years

3. Economic.cond.national: Assessment of current national economic conditions, 1 to 5.

4. Economic.cond.household: Assessment of current household economic conditions, 1 to 5.

5. Blair: Assessment of the Labour leader, 1 to 5.

6. Hague: Assessment of the Conservative leader, 1 to 5.


7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores represent ‘Eurosceptic’ sentiment.

8. Political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.

9. Gender: female or male.

1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.

We first import the necessary libraries, then load the Excel file in a Jupyter notebook. The head() function is used to see the first 5 rows of the dataset.
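A minimal sketch of this loading step. The Excel file name is not given in the report, so the five rows displayed below are reconstructed inline instead of calling pd.read_excel:

```python
import pandas as pd

# In the notebook the file is loaded with pd.read_excel; here the
# first five rows shown in the report are rebuilt directly so the
# snippet runs stand-alone.
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Labour", "Labour", "Labour"],
    "age": [43, 36, 35, 24, 41],
    "economic.cond.national": [3, 4, 4, 4, 2],
    "economic.cond.household": [3, 4, 4, 2, 2],
    "Blair": [4, 4, 5, 2, 1],
    "Hague": [1, 4, 2, 1, 1],
    "Europe": [2, 5, 3, 4, 6],
    "political.knowledge": [2, 2, 2, 0, 2],
    "gender": ["female", "male", "male", "female", "male"],
})
print(df.head())  # first 5 rows of the dataset
```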

   vote    age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender
0  Labour   43                       3                        3      4      1       2                    2  female
1  Labour   36                       4                        4      4      4       5                    2    male
2  Labour   35                       4                        4      5      2       3                    2    male
3  Labour   24                       4                        2      2      1       4                    0  female
4  Labour   41                       2                        2      1      1       6                    2    male

The given dataset has a list of 1525 voters with 9 variables.


Check for missing values in the dataset:
Data columns (total 9 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 vote 1517 non-null object

1 age 1517 non-null int64

2 economic.cond.national 1517 non-null int64

3 economic.cond.household 1517 non-null int64

4 Blair 1517 non-null int64

5 Hague 1517 non-null int64

6 Europe 1517 non-null int64

7 political.knowledge 1517 non-null int64

8 gender 1517 non-null object

From the above table we can see that the dataset doesn’t have any missing values. The dataset has 2 object-type columns and 7 integer-type columns.
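The null-value condition check can be sketched as follows; a tiny frame with one deliberately missing entry stands in for the full dataset:

```python
import pandas as pd

# Sketch of the null-value check done in the notebook; df.info()
# prints dtypes and non-null counts, isnull().sum() counts missing
# values per column.
df = pd.DataFrame({"vote": ["Labour", None, "Conservative"],
                   "age": [43, 36, 35]})
df.info()
missing = df.isnull().sum()
print(missing)
```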

Let’s check the summary of the dataset:

                         count unique     top  freq     mean       std  min  25%  50%  75%  max
vote                      1525      2  Labour  1063      NaN       NaN  NaN  NaN  NaN  NaN  NaN
age                       1525    NaN     NaN   NaN  54.1823   15.7112   24   41   53   67   93
economic.cond.national    1525    NaN     NaN   NaN   3.2459  0.880969    1    3    3    4    5
economic.cond.household   1525    NaN     NaN   NaN  3.14033  0.929951    1    3    3    4    5
Blair                     1525    NaN     NaN   NaN  3.33443   1.17482    1    2    4    4    5
Hague                     1525    NaN     NaN   NaN  2.74689    1.2307    1    2    2    4    5
Europe                    1525    NaN     NaN   NaN  6.72852   3.29754    1    4    6   10   11
political.knowledge       1525    NaN     NaN   NaN   1.5423   1.08331    0    0    2    2    3
gender                    1525      2  female   812      NaN       NaN  NaN  NaN  NaN  NaN  NaN

As you can see from the above table, the average age of voters is about 54 years. Most people gave their vote to the Labour party: 1063 out of 1525 voters. The averages of economic.cond.national and economic.cond.household are nearly the same. Voters had average political knowledge. The majority of voters (812 out of 1525) were female.

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
First, let’s check for duplicate rows using the .duplicated() method. There are 8 duplicated rows, and we drop them. Now we have 1517 rows and 9 variables.
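The duplicate check above can be sketched on a toy frame with one repeated row:

```python
import pandas as pd

# duplicated() flags rows identical to an earlier row;
# drop_duplicates() removes them, keeping the first occurrence.
df = pd.DataFrame({"age": [43, 36, 36, 24],
                   "Blair": [4, 4, 4, 2]})
n_dupes = int(df.duplicated().sum())  # here: 1 repeated row
df = df.drop_duplicates()
print(n_dupes, len(df))
```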
UNIVARIATE ANALYSIS:

As you can see from the above graphs, economic.cond.national is roughly normally distributed, while Blair and Europe are left-skewed. The majority of voters are 45-65 years old. Many voters come from a fairly good household economic condition and have average political knowledge.

BIVARIATE ANALYSIS:
From the above plot we can clearly see that the Labour party is the favourite across all ages.

From the above plot we can see that the Labour leader received more assessments of 5 than the Conservative leader, and fewer assessments of 1. Voters of both parties come from all economic conditions, although the Labour party has more voters from economic condition 5 than the Conservative party.

PAIRPLOT:

From the above pairplot we can see a fairly balanced distribution of the data.


CORRELATION PLOT:

From the above correlation plot, we see no strong correlation among the variables. economic.cond.national, economic.cond.household and Hague show weak positive correlations; the remaining pairs show weak negative correlations.

Check for outliers:


As you can see from the above boxplots, only economic.cond.household and economic.cond.national have outliers, so we treat those outliers first.

As you can see, we have treated all the outliers.
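One common way to treat such boxplot outliers is IQR-based capping; the exact treatment used in the notebook is not shown, so the sketch below is an assumption. The toy series mimics a 1-5 rating column with one low outlier:

```python
import pandas as pd

# IQR capping (assumed treatment): values beyond the boxplot
# whiskers are clipped to the whisker limits.
s = pd.Series([3, 3, 4, 3, 4, 3, 1, 5, 3, 4])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker limits
capped = s.clip(lower, upper)                  # pull outliers to the limits
```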


1.3. Encode the data (having string values) for Modeling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).

Encoding of the string-valued columns (vote and gender) has been done in the Jupyter notebook.

Yes, scaling is necessary here, because the variables in the dataset have different units of measurement. Since the ranges of the raw values vary widely, the objective functions of some machine learning algorithms do not work correctly without normalization. For example, many classifiers use the distance between two points; if one feature has a broad range of values, that feature dominates the distance. Therefore the ranges of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

Splitting the data into train and test sets in the ratio 70:30 is done in the Jupyter notebook.
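A sketch of these three steps, assuming a simple 0/1 label encoding for vote and gender (the notebook's exact encoding is not shown) and a toy frame in place of the real data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the voter data; the 0/1 codes below
# are an assumed encoding, not necessarily the notebook's.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "age": range(24, 44),
    "gender": ["female", "male"] * 10,
})
df["vote"] = df["vote"].map({"Conservative": 0, "Labour": 1})
df["gender"] = df["gender"].map({"female": 0, "male": 1})

X = df.drop("vote", axis=1)
y = df["vote"]
X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=1)  # 70:30 split
```

In practice the scaler is usually fit on the training split only and then applied to the test split, to avoid leaking test information into the model.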

1.4. Apply Logistic Regression and LDA (Linear Discriminant Analysis).
To apply logistic regression and LDA, we first import LogisticRegression and LinearDiscriminantAnalysis from the sklearn library. We have applied both models in the Jupyter notebook.
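The fitting step can be sketched as follows, with make_classification standing in for the voter data and sklearn default hyperparameters (the notebook's settings are not shown):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

lr_acc = accuracy_score(y_test, lr.predict(X_test))
lda_acc = accuracy_score(y_test, lda.predict(X_test))
print(classification_report(y_test, lr.predict(X_test)))
```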

LOGISTIC REGRESSION:

Accuracy of train data: 0.8289

Classification report of train data:

precision recall f1-score support

conservative 0.74 0.65 0.69 307

labour 0.86 0.91 0.89 754

accuracy 0.83 1061

macro avg 0.80 0.78 0.79 1061

weighted avg 0.83 0.83 0.83 1061

Accuracy of test data: 0.8341

Classification report of test data:

precision recall f1-score support

conservative 0.76 0.73 0.74 153

labour 0.86 0.88 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.81 456

weighted avg 0.83 0.83 0.83 456

LINEAR DISCRIMINANT ANALYSIS:

Accuracy of train data: 0.8311

Classification report of train data:

precision recall f1-score support

conservative 0.74 0.65 0.69 307

labour 0.86 0.91 0.89 754

accuracy 0.83 1061

macro avg 0.80 0.78 0.79 1061

weighted avg 0.83 0.83 0.83 1061

Accuracy of test data: 0.8341

Classification report of test data:

precision recall f1-score support

conservative 0.76 0.73 0.74 153

labour 0.86 0.88 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.81 456

weighted avg 0.83 0.83 0.83 456

Both logistic regression and linear discriminant analysis have nearly the same precision, accuracy, recall and F1 score.

1.5. Apply KNN Model and Naïve Bayes Model. Interpret the results.
To apply the KNN and Naïve Bayes models, we first import KNeighborsClassifier and GaussianNB from the sklearn library. We have applied both models in the Jupyter notebook.
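The fitting step can be sketched on the same kind of synthetic stand-in data; n_neighbors=5 is the sklearn default, as the notebook's actual k is not stated:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
nb = GaussianNB().fit(X_train, y_train)

knn_acc = knn.score(X_test, y_test)   # test accuracy
nb_acc = nb.score(X_test, y_test)
```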

KNN

Accuracy of train data: 0.857

Classification report on train data:

precision recall f1-score support

conservative 0.77 0.72 0.75 307

labour 0.89 0.91 0.90 754

accuracy 0.86 1061

macro avg 0.83 0.82 0.82 1061


weighted avg 0.86 0.86 0.86 1061

Accuracy of test data: 0.826

Classification report on test data:

precision recall f1-score support

conservative 0.76 0.71 0.73 153

labour 0.86 0.89 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.80 456

weighted avg 0.82 0.83 0.83 456

In the KNN model, the accuracy on the test data is slightly lower than on the train data.

GAUSSIAN NAÏVE BAYES

Accuracy of train data: 0.829

Classification report on train data:

precision recall f1-score support

conservative 0.71 0.69 0.70 307

labour 0.88 0.88 0.88 754

accuracy 0.83 1061

macro avg 0.79 0.79 0.79 1061

weighted avg 0.83 0.83 0.83 1061


Accuracy of test data: 0.826

Classification report on test data:

precision recall f1-score support

conservative 0.74 0.75 0.74 153

labour 0.87 0.87 0.87 303

accuracy 0.83 456

macro avg 0.81 0.81 0.81 456

weighted avg 0.83 0.83 0.83 456

Accuracy on the train and test data sets is nearly the same. Recall, precision and F1 score for the conservative class on the train data are slightly lower than on the test data.

After interpreting both models, we can say that the KNN model is slightly better optimized than the Naïve Bayes model.

1.6. Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
To apply the bagging and boosting models, we first import BaggingClassifier and GradientBoostingClassifier from the sklearn library. We have applied both models in the Jupyter notebook.
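A sketch of the fitting step, again on synthetic stand-in data with default hyperparameters (the notebook's settings, including any tuning, are not shown). BaggingClassifier's default base is a decision tree; in recent sklearn versions a RandomForestClassifier can be passed via its estimator argument, which is the "Random Forest for bagging" combination the problem statement asks for:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Defaults used here (assumption): 100 bagged decision trees and a
# default gradient-boosting ensemble.
bag = BaggingClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
gb = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

bag_train, bag_test = bag.score(X_train, y_train), bag.score(X_test, y_test)
gb_train, gb_test = gb.score(X_train, y_train), gb.score(X_test, y_test)
```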

BAGGING (RANDOM FOREST)

Accuracy of train data: 0.965

Classification report of train data:

precision recall f1-score support

conservative 0.98 0.90 0.94 307

labour 0.96 0.99 0.98 754

accuracy 0.97 1061

macro avg 0.97 0.95 0.96 1061

weighted avg 0.97 0.97 0.96 1061


Accuracy of test data: 0.828

Classification report of test data:

precision recall f1-score support

conservative 0.78 0.69 0.73 153

labour 0.85 0.90 0.88 303

accuracy 0.83 456

macro avg 0.81 0.79 0.80 456

weighted avg 0.83 0.83 0.83 456

Accuracy, precision, recall and F1 score on the train data set are noticeably higher than on the test data, which suggests some overfitting.

BOOSTING

Accuracy of train data: 0.892

Classification report of train data:

precision recall f1-score support

conservative 0.84 0.78 0.81 307

labour 0.91 0.94 0.93 754

accuracy 0.89 1061

macro avg 0.88 0.86 0.87 1061


weighted avg 0.89 0.89 0.89 1061

Accuracy of test data: 0.835

Classification report of test data:


precision recall f1-score support

conservative 0.80 0.68 0.73 153

labour 0.85 0.91 0.88 303

accuracy 0.84 456


macro avg 0.82 0.80 0.81 456
weighted avg 0.83 0.84 0.83 456

Accuracy, precision, recall and F1 score on the train data set are higher than on the test data set.

1.7. Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference on which model is best/optimized.
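The per-model metrics below were produced along these lines; the sketch uses logistic regression on synthetic stand-in data, and the same calls apply to each of the fitted models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data; logistic regression stands
# in for whichever model is being evaluated.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]   # P(class 1), used for ROC/AUC
auc = roc_auc_score(y_test, probs)          # area under the ROC curve
fpr, tpr, _ = roc_curve(y_test, probs)      # points for plotting the ROC curve
cm = confusion_matrix(y_test, model.predict(X_test))
```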
LOGISTIC REGRESSION

ROC curve for train data:


Area under the curve: 0.890

Confusion matrix for train data:

ROC curve for test data:

Area under the curve: 0.890

Confusion matrix for test data:


LINEAR DISCRIMINANT ANALYSIS

ROC curve for train data:

Area under the curve: 0.890

Confusion matrix

ROC curve for test data:


Area under the curve: 0.890

Confusion matrix:

GAUSSIAN NAÏVE BAYES:

ROC curve for train data:


Area under the curve: 0.889

Confusion matrix:

ROC curve for test data:

Area under the curve: 0.889

Confusion matrix:
KNN

ROC curve for train data:

Area under the curve: 0.930

Confusion matrix:
ROC curve for test data:

Area under the curve: 0.930

Confusion matrix:

BAGGING

ROC curve for train data:


Area under the curve: 0.930

Confusion matrix:

ROC curve for test data:

Area under the curve: 0.930

Confusion matrix:
BOOSTING

ROC curve for train data:

Area under the curve: 0.951

Confusion matrix:
ROC curve for test data:

Area under the curve: 0.951

Confusion matrix:

After analyzing all the models, we can say that bagging (Random Forest) is the best-optimized model for this dataset.

1.8. Based on these predictions, what are the insights?
 Parties should focus more on Europe’s integration and its positive effects. We can clearly see the impact of Europe’s integration on the voters, and leaders should refrain from talking about its negative impacts.
 Leaders should maintain a good public image and work towards it, as the data clearly shows that leaders with better ratings automatically attract more voters.
 Both parties should focus on capturing the votes of the young and older population. The Conservative party should especially focus on the middle-aged population.
 Parties should also try to capture the votes of people with less political knowledge, as they are easier to manipulate than people with more political knowledge.
 Parties should talk more about topics related to women, as more women come to vote than men.
