PROJECT
PROBLEM 1
You are hired by CNBE, one of the leading news channels, which wants to analyze the recent
elections. A survey was conducted on 1525 voters with 9 variables. You have to build a
model that predicts which party a voter will vote for on the basis of the given information,
in order to create an exit poll that helps predict the overall winner and the seats covered
by a particular party.
Data Dictionary
2. age: in years
We first import the necessary libraries and then load the Excel file into the Jupyter notebook. The head
function is used to view the first 5 rows of the dataset.
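The loading step can be sketched as below. The actual Excel file name is not given in the report, so to keep the sketch runnable the five rows shown in the table are rebuilt by hand:

```python
import pandas as pd

# A hypothetical load would look like:
#   df = pd.read_excel("Election_Data.xlsx")
# Here the five rows shown in the report are rebuilt by hand instead.
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Labour", "Labour", "Labour"],
    "age": [43, 36, 35, 24, 41],
    "economic.cond.national": [3, 4, 4, 4, 2],
    "economic.cond.household": [3, 4, 4, 2, 2],
    "Blair": [4, 4, 5, 2, 1],
    "Hague": [1, 4, 2, 1, 1],
    "Europe": [2, 5, 3, 4, 6],
    "political.knowledge": [2, 2, 2, 0, 2],
    "gender": ["female", "male", "male", "female", "male"],
})

print(df.head())           # first 5 rows
print(df.isnull().sum())   # check for missing values
print(df.dtypes)           # 2 object columns, 7 integer columns
```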
   vote    age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender
0  Labour   43                       3                        3      4      1       2                    2  female
1  Labour   36                       4                        4      4      4       5                    2  male
2  Labour   35                       4                        4      5      2       3                    2  male
3  Labour   24                       4                        2      2      1       4                    0  female
4  Labour   41                       2                        2      1      1       6                    2  male
From the above table we can see that the dataset doesn't have any missing values. The dataset has 2
object-type columns and 7 integer-type columns.
        count  unique  top     freq  mean  std  min  25%  50%  75%  max
vote     1525       2  Labour  1063   NaN  NaN  NaN  NaN  NaN  NaN  NaN
gender   1525       2  female   812   NaN  NaN  NaN  NaN  NaN  NaN  NaN
As you can see from the above table, the average age of voters is about 54 years. Most people voted for
the Labour party: 1063 out of 1525 voters. The averages of economic.cond.national and
economic.cond.household are nearly the same. Voters had average political knowledge. The majority of
voters were female (812 of 1525).
As you can see from the above graph, economic.cond.national is normally distributed, while Blair and
Europe are left-skewed. The majority of voters are 45-65 years old. Many voters come from fairly good
household economic conditions and have average political knowledge.
BIVARIATE ANALYSIS:
From the above plot we can clearly see that the Labour party is the favourite across all age groups.
From the above plot we can see that more voters give the Labour leader an assessment of 5 than the
Conservative leader, and fewer give the Labour leader an assessment of 1 than the Conservative leader.
Voters of both parties come from all economic conditions, although the Labour party has more voters
from economic condition 5 than the Conservative party.
PAIRPLOT:
From the above correlation plot, we see no strong correlation among the variables. economic.cond.national,
economic.cond.household and Hague are positively correlated with each other; the remaining pairs show
weak negative correlations.
Encoding of the string-valued variables has been done in the Jupyter notebook.
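A minimal encoding sketch, assuming the two object columns are vote and gender; the specific mapping values here are illustrative, not necessarily those used in the notebook:

```python
import pandas as pd

df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "male"],
})

# Map the binary target to 0/1 (assumed mapping: Labour=1, Conservative=0)
df["vote"] = df["vote"].map({"Conservative": 0, "Labour": 1})

# One-hot-encode gender, dropping the first level to avoid redundancy
df = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(df)
```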
Yes, scaling is necessary for this problem, as the variables in the dataset have different units of
measurement. Since the range of values of the raw data varies widely, the objective functions of some
machine learning algorithms do not work correctly without normalization. For example, many
classifiers compute the distance between two points using the Euclidean distance; if one of the
features has a broad range of values, that feature dominates the distance. The range of all features
should therefore be normalized so that each feature contributes approximately proportionately to the
final distance.
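The scaling described above can be sketched with scikit-learn's StandardScaler; the two columns here stand in for any pair of numeric features on different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# e.g. age and economic.cond.national, on very different scales
X = np.array([[43.0, 3.0],
              [36.0, 4.0],
              [24.0, 2.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Each column now has mean 0 and standard deviation 1,
# so no single feature dominates distance-based algorithms.
print(X_scaled)
```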
The data is split into train and test sets in the ratio 70:30 in the Jupyter notebook.
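The 70:30 split can be done with train_test_split; the random_state used here is an assumption, since the notebook's seed is not shown:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # stand-in feature matrix
y = np.array([0, 1] * 10)          # stand-in binary target

# 70:30 split, stratified so both classes appear in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)  # (14, 2) (6, 2)
```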
LOGISTIC REGRESSION:
Both logistic regression and linear discriminant analysis have nearly the same precision, accuracy, recall and F1 score.
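A sketch of this comparison, using synthetic data in place of the encoded voter data (the hyperparameters are assumptions, not the notebook's settings):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded, scaled voter data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

print("Logistic Regression:", accuracy_score(y_te, logit.predict(X_te)))
print("LDA:               ", accuracy_score(y_te, lda.predict(X_te)))
```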
KNN
In the KNN model, the accuracy on the test data is slightly higher than on the train data.
The accuracies of the train and test sets are nearly the same. Recall, precision and F1 score for the
Conservative class on the train set are slightly lower than on the test set.
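A KNN sketch on the same kind of synthetic stand-in data; scaling is applied first because KNN is distance-based, and n_neighbors=5 (scikit-learn's default) is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Scale first: KNN computes distances, so unscaled features would dominate.
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)

print("train accuracy:", knn.score(scaler.transform(X_tr), y_tr))
print("test accuracy: ", knn.score(scaler.transform(X_te), y_te))
```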
After interpreting both models, we can say that the KNN model is slightly better optimized than the Naïve Bayes model.
Accuracy, precision, recall and F1 score on the train data set are higher than on the test data set.
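The Naïve Bayes side of the comparison can be sketched with GaussianNB, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Gaussian Naïve Bayes: assumes each feature is normally
# distributed within each class, with no hyperparameters to tune.
nb = GaussianNB().fit(X_tr, y_tr)

print("train accuracy:", nb.score(X_tr, y_tr))
print("test accuracy: ", nb.score(X_te, y_te))
```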
BOOSTING
Accuracy, precision, recall and F1 score of the train data set are higher than those of the test data set.
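The bagging (Random Forest) and boosting models can be sketched as follows; n_estimators=100 and the random seeds are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Bagging: Random Forest averages many trees grown on bootstrap samples.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: AdaBoost fits trees sequentially, reweighting mistakes.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("Random Forest test accuracy:", rf.score(X_te, y_te))
print("AdaBoost test accuracy:     ", ada.score(X_te, y_te))
```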
Confusion matrices and ROC curves for the train and test data of each model (Logistic Regression, LDA, KNN, Naïve Bayes, Bagging, Boosting) are shown in the Jupyter notebook.
After analyzing all the models, we can say that bagging (Random Forest) is the best-optimized model for this data set.
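This kind of model comparison can be made quantitative with ROC AUC on the test set; the snippet below is illustrative only, using synthetic data and two of the candidate models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # AUC summarizes the ROC curve: 1.0 is perfect, 0.5 is random guessing.
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```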