
MACHINE LEARNING

PROJECT

PROBLEM 1
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent elections. A survey was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help predict the overall win and the seats covered by a particular party.

Data Dictionary

1. vote: Party choice: Conservative or Labour

2. age: in years

3. Economic.cond.national: Assessment of current national economic conditions, 1 to 5.

4. Economic.cond.household: Assessment of current household economic conditions, 1 to 5.

5. Blair: Assessment of the Labour leader, 1 to 5.

6. Hague: Assessment of the Conservative leader, 1 to 5.


7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores represent ‘Eurosceptic’ sentiment.

8. Political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.

9. Gender: female or male.

1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.

We first import the necessary libraries, then load the Excel file in a Jupyter notebook. The head() function is used to see the first 5 rows of the dataset.
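A minimal sketch of this loading step. The Excel file name is not given in the report, so the five rows displayed below are reconstructed inline instead of calling pd.read_excel:

```python
import pandas as pd

# In the notebook the file is loaded with pd.read_excel; here the
# first five rows shown in the report are rebuilt directly so the
# snippet runs stand-alone.
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Labour", "Labour", "Labour"],
    "age": [43, 36, 35, 24, 41],
    "economic.cond.national": [3, 4, 4, 4, 2],
    "economic.cond.household": [3, 4, 4, 2, 2],
    "Blair": [4, 4, 5, 2, 1],
    "Hague": [1, 4, 2, 1, 1],
    "Europe": [2, 5, 3, 4, 6],
    "political.knowledge": [2, 2, 2, 0, 2],
    "gender": ["female", "male", "male", "female", "male"],
})
print(df.head())  # first 5 rows of the dataset
```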

   vote    age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender
0  Labour   43                       3                        3      4      1       2                    2  female
1  Labour   36                       4                        4      4      4       5                    2    male
2  Labour   35                       4                        4      5      2       3                    2    male
3  Labour   24                       4                        2      2      1       4                    0  female
4  Labour   41                       2                        2      1      1       6                    2    male

The given dataset has a list of 1525 voters with 9 variables.


Check for missing values in the dataset:
Data columns (total 9 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 vote 1517 non-null object

1 age 1517 non-null int64

2 economic.cond.national 1517 non-null int64

3 economic.cond.household 1517 non-null int64

4 Blair 1517 non-null int64

5 Hague 1517 non-null int64

6 Europe 1517 non-null int64

7 political.knowledge 1517 non-null int64

8 gender 1517 non-null object

From the above table we can see that the dataset doesn’t have any missing values. The dataset has 2 object-type columns and 7 integer-type columns.
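The null-value condition check can be sketched as follows; a tiny frame with one deliberately missing entry stands in for the full dataset:

```python
import pandas as pd

# Sketch of the null-value check done in the notebook; df.info()
# prints dtypes and non-null counts, isnull().sum() counts missing
# values per column.
df = pd.DataFrame({"vote": ["Labour", None, "Conservative"],
                   "age": [43, 36, 35]})
df.info()
missing = df.isnull().sum()
print(missing)
```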

Let’s check the summary of the dataset:

                         count unique     top  freq     mean       std  min  25%  50%  75%  max
vote                      1525      2  Labour  1063      NaN       NaN  NaN  NaN  NaN  NaN  NaN
age                       1525    NaN     NaN   NaN  54.1823   15.7112   24   41   53   67   93
economic.cond.national    1525    NaN     NaN   NaN   3.2459  0.880969    1    3    3    4    5
economic.cond.household   1525    NaN     NaN   NaN  3.14033  0.929951    1    3    3    4    5
Blair                     1525    NaN     NaN   NaN  3.33443   1.17482    1    2    4    4    5
Hague                     1525    NaN     NaN   NaN  2.74689    1.2307    1    2    2    4    5
Europe                    1525    NaN     NaN   NaN  6.72852   3.29754    1    4    6   10   11
political.knowledge       1525    NaN     NaN   NaN   1.5423   1.08331    0    0    2    2    3
gender                    1525      2  female   812      NaN       NaN  NaN  NaN  NaN  NaN  NaN

As you can see from the above table, the average age of voters is about 54 years. Most people gave their vote to the Labour party: 1063 out of 1525 voters. The averages of economic.cond.national and economic.cond.household are nearly the same. Voters had average political knowledge. The majority of voters (812 out of 1525) were female.

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
First, let’s check for duplicate rows using the .duplicated() method. There are 8 duplicated rows, and we drop them. Now we have 1517 rows and 9 variables.
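The duplicate check above can be sketched on a toy frame with one repeated row:

```python
import pandas as pd

# duplicated() flags rows identical to an earlier row;
# drop_duplicates() removes them, keeping the first occurrence.
df = pd.DataFrame({"age": [43, 36, 36, 24],
                   "Blair": [4, 4, 4, 2]})
n_dupes = int(df.duplicated().sum())  # here: 1 repeated row
df = df.drop_duplicates()
print(n_dupes, len(df))
```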
UNIVARIATE ANALYSIS:

As you can see from the above graphs, economic.cond.national is roughly normally distributed, while Blair and Europe are left-skewed. The majority of voters are 45-65 years old. Many voters come from a fairly good household economic condition and have average political knowledge.

BIVARIATE ANALYSIS:
From the above plot we can clearly see that the Labour party is the favourite across all ages.

From the above plot we can see that the Labour leader received more assessments of 5 than the Conservative leader, and fewer assessments of 1. Voters of both parties come from all economic conditions, although the Labour party has more voters from economic condition 5 than the Conservative party.

PAIRPLOT:

From the above pairplot we can see a fairly balanced distribution of the data.


CORRELATION PLOT:

From the above correlation plot, we see no strong correlation among the variables. economic.cond.national, economic.cond.household and Hague show weak positive correlations; the remaining pairs show weak negative correlations.

Check for outliers:


As you can see from the above boxplots, only economic.cond.household and economic.cond.national have outliers, so we treat those outliers first.

As you can see, we have treated all the outliers.
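One common way to treat such boxplot outliers is IQR-based capping; the exact treatment used in the notebook is not shown, so the sketch below is an assumption. The toy series mimics a 1-5 rating column with one low outlier:

```python
import pandas as pd

# IQR capping (assumed treatment): values beyond the boxplot
# whiskers are clipped to the whisker limits.
s = pd.Series([3, 3, 4, 3, 4, 3, 1, 5, 3, 4])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker limits
capped = s.clip(lower, upper)                  # pull outliers to the limits
```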


1.3. Encode the data (having string values) for Modeling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).

Encoding of the string-valued columns (vote and gender) has been done in the Jupyter notebook.

Yes, scaling is necessary here, because the variables in the dataset have different units of measurement. Since the ranges of the raw values vary widely, the objective functions of some machine learning algorithms do not work correctly without normalization. For example, many classifiers use the distance between two points; if one feature has a broad range of values, that feature dominates the distance. Therefore the ranges of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

Splitting the data into train and test sets in the ratio 70:30 is done in the Jupyter notebook.
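A sketch of these three steps, assuming a simple 0/1 label encoding for vote and gender (the notebook's exact encoding is not shown) and a toy frame in place of the real data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the voter data; the 0/1 codes below
# are an assumed encoding, not necessarily the notebook's.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "age": range(24, 44),
    "gender": ["female", "male"] * 10,
})
df["vote"] = df["vote"].map({"Conservative": 0, "Labour": 1})
df["gender"] = df["gender"].map({"female": 0, "male": 1})

X = df.drop("vote", axis=1)
y = df["vote"]
X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=1)  # 70:30 split
```

In practice the scaler is usually fit on the training split only and then applied to the test split, to avoid leaking test information into the model.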

1.4. Apply Logistic Regression and LDA (Linear Discriminant Analysis).
To apply logistic regression and LDA, we first import LogisticRegression and LinearDiscriminantAnalysis from the sklearn library. We have applied both models in the Jupyter notebook.
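The fitting step can be sketched as follows, with make_classification standing in for the voter data and sklearn default hyperparameters (the notebook's settings are not shown):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

lr_acc = accuracy_score(y_test, lr.predict(X_test))
lda_acc = accuracy_score(y_test, lda.predict(X_test))
print(classification_report(y_test, lr.predict(X_test)))
```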

LOGISTIC REGRESSION:

Accuracy of train data: 0.8289

Classification report of train data:

precision recall f1-score support

conservative 0.74 0.65 0.69 307

labour 0.86 0.91 0.89 754

accuracy 0.83 1061

macro avg 0.80 0.78 0.79 1061

weighted avg 0.83 0.83 0.83 1061

Accuracy of test data: 0.8341

Classification report of test data:

precision recall f1-score support

conservative 0.76 0.73 0.74 153

labour 0.86 0.88 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.81 456

weighted avg 0.83 0.83 0.83 456

LINEAR DISCRIMINANT ANALYSIS:

Accuracy of train data: 0.8311

Classification report of train data:

precision recall f1-score support

conservative 0.74 0.65 0.69 307

labour 0.86 0.91 0.89 754

accuracy 0.83 1061

macro avg 0.80 0.78 0.79 1061

weighted avg 0.83 0.83 0.83 1061

Accuracy of test data: 0.8341

Classification report of test data:

precision recall f1-score support

conservative 0.76 0.73 0.74 153

labour 0.86 0.88 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.81 456

weighted avg 0.83 0.83 0.83 456

Both logistic regression and linear discriminant analysis have nearly the same precision, accuracy, recall and F1 score.

1.5. Apply KNN Model and Naïve Bayes Model. Interpret the results.
To apply the KNN and Naïve Bayes models, we first import KNeighborsClassifier and GaussianNB from the sklearn library. We have applied both models in the Jupyter notebook.
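The fitting step can be sketched on the same kind of synthetic stand-in data; n_neighbors=5 is the sklearn default, as the notebook's actual k is not stated:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
nb = GaussianNB().fit(X_train, y_train)

knn_acc = knn.score(X_test, y_test)   # test accuracy
nb_acc = nb.score(X_test, y_test)
```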

KNN

Accuracy of train data: 0.857

Classification report on train data:

precision recall f1-score support

conservative 0.77 0.72 0.75 307

labour 0.89 0.91 0.90 754

accuracy 0.86 1061

macro avg 0.83 0.82 0.82 1061


weighted avg 0.86 0.86 0.86 1061

Accuracy of test data: 0.826

Classification report on test data:

precision recall f1-score support

conservative 0.76 0.71 0.73 153

labour 0.86 0.89 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.80 456

weighted avg 0.82 0.83 0.83 456

In the KNN model, the accuracy on the test data is slightly lower than on the train data.

GAUSSIAN NAÏVE BAYES

Accuracy of train data: 0.829

Classification report on train data:

precision recall f1-score support

conservative 0.71 0.69 0.70 307

labour 0.88 0.88 0.88 754

accuracy 0.83 1061

macro avg 0.79 0.79 0.79 1061

weighted avg 0.83 0.83 0.83 1061


Accuracy of test data: 0.826

Classification report on test data:

precision recall f1-score support

conservative 0.74 0.75 0.74 153

labour 0.87 0.87 0.87 303

accuracy 0.83 456

macro avg 0.81 0.81 0.81 456

weighted avg 0.83 0.83 0.83 456

Accuracy on the train and test data sets is nearly the same. Recall, precision and F1 score for the conservative class on the train data are slightly lower than on the test data.

After interpreting both models, we can say that the KNN model is slightly better optimized than the Naïve Bayes model.

1.6. Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
To apply the bagging and boosting models, we first import BaggingClassifier and GradientBoostingClassifier from the sklearn library. We have applied both models in the Jupyter notebook.
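A sketch of the fitting step, again on synthetic stand-in data with default hyperparameters (the notebook's settings, including any tuning, are not shown). BaggingClassifier's default base is a decision tree; in recent sklearn versions a RandomForestClassifier can be passed via its estimator argument, which is the "Random Forest for bagging" combination the problem statement asks for:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Defaults used here (assumption): 100 bagged decision trees and a
# default gradient-boosting ensemble.
bag = BaggingClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
gb = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

bag_train, bag_test = bag.score(X_train, y_train), bag.score(X_test, y_test)
gb_train, gb_test = gb.score(X_train, y_train), gb.score(X_test, y_test)
```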

BAGGING (RANDOM FOREST)

Accuracy of train data: 0.965

Classification report of train data:

precision recall f1-score support

conservative 0.98 0.90 0.94 307

labour 0.96 0.99 0.98 754

accuracy 0.97 1061

macro avg 0.97 0.95 0.96 1061

weighted avg 0.97 0.97 0.96 1061


Accuracy of test data: 0.828

Classification report of test data:

precision recall f1-score support

conservative 0.78 0.69 0.73 153

labour 0.85 0.90 0.88 303

accuracy 0.83 456

macro avg 0.81 0.79 0.80 456

weighted avg 0.83 0.83 0.83 456

Accuracy, precision, recall and F1 score on the train data set are noticeably higher than on the test data, which suggests some overfitting.

BOOSTING

Accuracy of train data: 0.892

Classification report of train data:

precision recall f1-score support

conservative 0.84 0.78 0.81 307

labour 0.91 0.94 0.93 754

accuracy 0.89 1061

macro avg 0.88 0.86 0.87 1061


weighted avg 0.89 0.89 0.89 1061

Accuracy of test data: 0.835

Classification report of test data:


precision recall f1-score support

conservative 0.80 0.68 0.73 153

labour 0.85 0.91 0.88 303

accuracy 0.84 456


macro avg 0.82 0.80 0.81 456
weighted avg 0.83 0.84 0.83 456

Accuracy, precision, recall and F1 score on the train data set are higher than on the test data set.

1.7. Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference on which model is best/optimized.
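The per-model metrics below were produced along these lines; the sketch uses logistic regression on synthetic stand-in data, and the same calls apply to each of the fitted models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data; logistic regression stands
# in for whichever model is being evaluated.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]   # P(class 1), used for ROC/AUC
auc = roc_auc_score(y_test, probs)          # area under the ROC curve
fpr, tpr, _ = roc_curve(y_test, probs)      # points for plotting the ROC curve
cm = confusion_matrix(y_test, model.predict(X_test))
```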
LOGISTIC REGRESSION

ROC curve for train data:


Area under the curve: 0.890

Confusion matrix for train data:

ROC curve for test data:

Area under the curve: 0.890

Confusion matrix for test data:


LINEAR DISCRIMINANT ANALYSIS

ROC curve for train data:

Area under the curve: 0.890

Confusion matrix

ROC curve for test data:


Area under the curve: 0.890

Confusion matrix:

GAUSSIAN NAÏVE BAYES:

ROC curve for train data:


Area under the curve: 0.889

Confusion matrix:

ROC curve for test data:

Area under the curve: 0.889

Confusion matrix:
KNN

ROC curve for train data:

Area under the curve: 0.930

Confusion matrix:
ROC curve for test data:

Area under the curve: 0.930

Confusion matrix:

BAGGING

ROC curve for train data:


Area under the curve: 0.930

Confusion matrix:

ROC curve for test data:

Area under the curve: 0.930

Confusion matrix:
BOOSTING

ROC curve for train data:

Area under the curve: 0.951

Confusion matrix:
ROC curve for test data:

Area under the curve: 0.951

Confusion matrix:

After analyzing all the models, we can say that bagging (Random Forest) is the best-optimized model for this dataset.

1.8. Based on these predictions, what are the insights?
 Parties should focus more on Europe’s integration and its positive effects. We can clearly see the impact of Europe’s integration on the voters, and leaders should refrain from talking about its negative impacts.
 Leaders should maintain a good public image and work towards it, as the data clearly shows that leaders with better ratings automatically attract more voters.
 Both parties should focus on capturing the votes of the young and older population. The Conservative party should especially focus on the middle-aged population.
 Parties should also try to capture the votes of people with less political knowledge, as they are easier to manipulate than people with more political knowledge.
 Parties should talk more about topics related to women, as more women come to vote than men.
