You are hired by one of the leading news channels, CNBE, which wants to analyse the recent elections.
This survey was conducted on 1525 voters with 9 variables. You have to build a model to predict
which party a voter will vote for on the basis of the given information, in order to create an exit poll
that will help in predicting the overall win and the seats covered by a particular party.
   vote     age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender
0  Labour    43                       3                        3      4      1       2                    2  female
1  Labour    36                       4                        4      4      4       5                    2    male
2  Labour    35                       4                        4      5      2       3                    2    male
3  Labour    24                       4                        2      2      1       4                    0  female
4  Labour    41                       2                        2      1      1       6                    2    male
A large number of methods collectively compute descriptive statistics and other
related operations on a DataFrame. Most of these are aggregations like sum() and
mean(), but some of them, like cumsum(), produce an object of the same size as the original.
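As a small illustration of the difference, here is a sketch on a hypothetical toy column (not from the election file):

import pandas as pd

# Hypothetical toy frame, only to contrast aggregations with same-size methods
toy = pd.DataFrame({'x': [1, 2, 3, 4]})

print(toy['x'].sum())     # single number: 10
print(toy['x'].mean())    # single number: 2.5
print(toy['x'].cumsum())  # Series of the same length: 1, 3, 6, 10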
This DataFrame was read from two files; the second file contains the full data set, which we refer to as the Election data set. The target variable has two classes.
readingCsv.describe()
readingCsv.info()
# readingCsv.isnull().sum()

# Fill missing values in the numeric columns with the column mean
for column in readingCsv.columns:
    if readingCsv[column].dtype != 'object':
        mean = readingCsv[column].mean()
        readingCsv[column] = readingCsv[column].fillna(mean)
readingCsv.isnull().sum()
vote 0
age 0
economic.cond.national 0
economic.cond.household 0
Blair 0
Hague 0
Europe 0
political.knowledge 0
gender 0
dtype: int64
• Complete removal of rows with missing values can give a robust and accurate model, provided not much data is lost.
• Deleting a particular row or column that carries no specific information is acceptable, since it does not have a high weightage.
• This DataFrame now has no null values; the missing entries were handled during the dataset import.
• We can calculate the mean, median or mode of a feature and use it to replace the missing values. This is an approximation which can add variance to the data set (see the sketch after this list).
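A minimal sketch of the median/mode alternative mentioned above, assuming the same readingCsv DataFrame (the notebook itself used the mean fill shown earlier):

# Median for numeric columns, mode for object (categorical) columns --
# an alternative to the mean imputation used above.
for column in readingCsv.columns:
    if readingCsv[column].dtype != 'object':
        readingCsv[column] = readingCsv[column].fillna(readingCsv[column].median())
    else:
        readingCsv[column] = readingCsv[column].fillna(readingCsv[column].mode()[0])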
The election data contains variables such as age, gender, and national and household economic assessments, from which we can measure the voting pattern and the political knowledge of the population.
dups = readingCsv.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
readingCsv[dups]
Number of duplicate rows = 8
Out[49]:
      vote          age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender
67    Labour         35                       4                        4      5      2       3                    2    male
983   Conservative   74                       4                        3      2      4       8                    2  female
1236  Labour         36                       3                        3      2      2       6                    2  female
1244  Labour         29                       4                        4      4      2       2                    2  female
1438  Labour         40                       4                        3      4      2       2                    2    male
# Boxplot of all numeric variables on one axis (melt reshapes wide -> long)
sns.boxplot(x='variable', y='value', data=pd.melt(readingCsv.select_dtypes(include='number')))
plt.show()
From this boxplot we can easily see several variables displayed on one axis, which helps us understand the spread of each variable in the data frame. The analysis also highlights outliers; age is clearly the variable with outliers in this dataset.
1. A boxplot is good for comparing multiple types of data at once.
2. Comparing multiple variables simultaneously is another useful way to understand your data. When you have two continuous variables, a scatter plot is usually used; you can use a boxplot to compare one continuous and one categorical variable (see the sketch after this list).
3. Bivariate analysis is performed to find the relationship between each variable in the dataset and the target variable of interest, or between any two chosen variables.
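For example, a sketch of such bivariate views on this data, assuming the readingCsv DataFrame and the column names shown in the head above:

import matplotlib.pyplot as plt
import seaborn as sns

# Continuous (age) against categorical (vote): one box per party
sns.boxplot(x='vote', y='age', data=readingCsv)
plt.title('Age distribution by vote')
plt.show()

# Two rating variables against each other: scatter plot
sns.scatterplot(x='Blair', y='Hague', data=readingCsv)
plt.show()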
plt.figure(figsize=(10,10))
sns.heatmap(readingCsv.corr(), annot=True, fmt='.2f', cmap='Blues')
plt.show()
sns.pairplot(readingCsv)
1. A pairplot visualizes the data to find the relationships between variables, which can be continuous or categorical; it plots pairwise relationships across the data set (a short usage sketch follows this list).
2. Pairplot parameters: ...
3. Use a different color palette. ...
4. Use different markers for each level of the hue variable: ...
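A brief sketch of the hue, palette and marker options referred to above; this usage is assumed, not taken from the notebook:

import matplotlib.pyplot as plt
import seaborn as sns

# Colour the pairwise plots by the target class and vary markers per class
# (two markers because vote has two levels: Labour and Conservative)
sns.pairplot(readingCsv, hue='vote', palette='Set2', markers=['o', 's'])
plt.show()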
The data set looks good: the variables are well defined, the categorical variables can easily be identified with the help of the different plots, and the pair plot helps us see how the variables relate to one another.
from sklearn.preprocessing import StandardScaler

# Assumed: std_scale is a StandardScaler (its construction is not shown in the excerpt)
std_scale = StandardScaler()
readingCsv['age'] = std_scale.fit_transform(readingCsv[['age']])
readingCsv['economic.cond.national'] = std_scale.fit_transform(readingCsv[['economic.cond.national']])
readingCsv['economic.cond.household'] = std_scale.fit_transform(readingCsv[['economic.cond.household']])
readingCsv['Blair'] = std_scale.fit_transform(readingCsv[['Blair']])
readingCsv['Europe'] = std_scale.fit_transform(readingCsv[['Europe']])
readingCsv.head()
Out[55]: (first five rows after scaling — standardized age, economic.cond.national, economic.cond.household, Blair and Europe, alongside Hague, political.knowledge, vote_cat and gender_cat)
Yes, scaling is necessary for this dataset, and the reason is clear: age, gender and vote are important variables, and the data is being used for an election across a geography. Political knowledge is one of the variables that gives an indication of voting behaviour, and gender (male/female) is an important variable that has to be converted into a categorical (encoded) form, along with vote. Feature scaling is essential for machine learning algorithms that calculate distances between data points: since the range of values of the raw data varies widely, the objective functions of some machine learning algorithms do not work correctly without normalization.
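The encoding step itself is not shown in this excerpt; a minimal sketch, assuming pandas category codes are used to produce the vote_cat and gender_cat columns that appear later:

# Encode the two object columns as integer codes.
# The exact mapping (e.g. Conservative=0 / Labour=1, female=0 / male=1)
# depends on the category order pandas infers.
readingCsv['vote_cat'] = readingCsv['vote'].astype('category').cat.codes
readingCsv['gender_cat'] = readingCsv['gender'].astype('category').cat.codes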
An outlier may indicate bad data. For example, the data may have been coded incorrectly or an experiment may not have been run correctly. If it can be determined that an outlying point is in fact erroneous, then the outlying value should be deleted from the analysis (or corrected if possible).
In addition to checking the normality assumption, the lower and upper tails of the normal probability plot can be a useful graphical technique for identifying potential outliers. In particular, the plot can help determine whether we need to check for a single outlier or for multiple outliers.
We have encoded a few variables, such as gender and vote, as categorical variables, which helps us get the correct picture of this data frame; in this case, treating the outliers is also very important (one possible approach is sketched below).
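One common way to treat such outliers, sketched here under the assumption that capping values (rather than dropping rows) is acceptable; the cap_outliers helper is our own illustration:

# Cap values outside the 1.5*IQR whiskers -- the same rule the boxplot uses
def cap_outliers(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series.clip(lower=lower, upper=upper)

readingCsv['age'] = cap_outliers(readingCsv['age'])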
# Copy all the predictor variables into X dataframe
X = readingCsv.drop(['Hague','gender','vote','vote_cat'], axis=1)
#X = X.drop(['gender_Stdscale'], axis=1, retain=True)
# Copy target into the y dataframe.
y = readingCsv[['vote_cat']]
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1525 non-null float64
1 economic.cond.national 1525 non-null float64
2 economic.cond.household 1525 non-null float64
3 Blair 1525 non-null float64
4 Europe 1525 non-null float64
5 political.knowledge 1525 non-null int64
6 gender_cat 1525 non-null int8
dtypes: float64(5), int64(1), int8(1)
memory usage: 73.1 KB
# X['economic.cond.national_cat'] = round(X['economic.cond.national_cat'], 5)
# X['economic.cond.household_cat'] = round(X['economic.cond.household_cat'], 5)

# Round the scaled columns to five decimal places
X['Blair'] = round(X['Blair'], 5)
X['Europe'] = round(X['Europe'], 5)
# y['vote_cat'] = round(y['vote_cat'], 5)
# readingCsv.isnull().sum()['age_Cat', 'economic.cond.national_cat', 'economic.con.househols_cat']
We have to define the categorical variables in the X and y variables, which will help us analyse the data frame properly. As per the question, we have to divide this data set in a 70:30 ratio; on that basis we split the X and y values into train and test sets.
In [69]:
# Split X and y into training and test set in 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30 ,
random_state=1)
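The logistic regression fit itself is not reproduced in this excerpt; a minimal sketch that would produce the kind of classification report discussed below, with the object name logit and the max_iter setting being our own assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Fit on the 70% training split and evaluate on the 30% test split
logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train.values.ravel())

y_pred = logit.predict(X_test)
print(logit.score(X_test, y_test))
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))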
The Logistic Regression model gives precision values of 77% and 81% for the two classes (0 and 1), which indicates reasonably good model support. Looking at recall, class 0 has a recall of only 53% while class 1 has a recall of 93%, which shows how differently the model performs on the two classes. The F1-scores for the two classes are 62% and 87% respectively. As far as model accuracy is concerned, both figures are very close to 80%: the confusion matrix gives accuracies of 78% and 79%, with a weighted average of about 80% overall.
LDA
In [74]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clfLDA = LinearDiscriminantAnalysis()
clfLDA.fit(X_train, y_train)
y_pred=clfLDA.predict(X_test)
model_scoreLDA = clfLDA.score(X_test, y_test)
print(model_scoreLDA)
print(metrics.confusion_matrix(y_test, y_pred))
0.7991266375545851
[[ 77 64]
[ 28 289]]
C:\Users\Hp\anaconda3\lib\site-packages\sklearn\utils\validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
The weighted average accuracy of the LDA model is 79%, which is very close to the Logistic Regression model. In that case we can say that either model can be used to get good results, and the confusion matrices are comparable; the choice of model does not greatly affect the accuracy, which is again close to 80% for the LDA model.
Both models can be applied to this data set and will give almost the same result.
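The KNN object NNH is fitted below without its construction being shown; a minimal sketch of how it could be created, using the n_neighbors=7 and weights='distance' settings visible in the Out[83] line that follows:

from sklearn.neighbors import KNeighborsClassifier

# Settings read from the fitted estimator printed below (Out[83])
NNH = KNeighborsClassifier(n_neighbors=7, weights='distance')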
NNH.fit(X_train, y_train)
C:\Users\Hp\anaconda3\lib\site-packages\ipykernel_launcher.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  This is separate from the ipykernel package so we can avoid doing imports until
Out[83]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=7, p=2,
weights='distance')
predicted_labels = NNH.predict(X_test)
NNH.score(X_test, y_test)
X_train.head()
Out[100]: (X_train.head() — the first rows of the standardized training features: age, economic.cond.national, economic.cond.household, Blair, Europe, political.knowledge and gender_cat)
• test_size=0.3
  ▪ 30% of observations go to the test set
  ▪ 70% of observations go to the training set
• Data is randomly assigned unless you use the random_state hyperparameter
  ▪ If you use random_state=4, the data will be split exactly the same way every time the code is run
Naïve Bayes Model
Naive Bayes is a straightforward and fast classification algorithm, suitable for large chunks of data. The Naive Bayes classifier is used successfully in applications such as spam filtering, text classification, sentiment analysis, and recommender systems. It uses Bayes' theorem of probability to predict the class of unknown observations.
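The model object diab_model fitted below is not defined in this excerpt; a minimal sketch of how it could be created, assuming GaussianNB (the scikit-learn Naive Bayes variant usually applied to continuous features like these):

from sklearn.naive_bayes import GaussianNB

# Assumed: Gaussian Naive Bayes; the notebook excerpt only shows the later fit call on this object
diab_model = GaussianNB()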
from sklearn.model_selection import train_test_split
X_train.head()
diab_model.fit(X_train, y_train)
C:\Users\Hp\anaconda3\lib\site-packages\sklearn\naive_bayes.py:206: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
o Let us consider a training data set with X observations and Y features. First, a sample is drawn from the training data set at random with replacement.
o These steps are repeated for n trees, and the final prediction is based on the collection of predictions from the n trees (a sketch of setting up such a model follows this list).
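The clfRF object used below is not defined in this excerpt; a minimal sketch of how it could be set up and fitted, with the hyperparameters assumed (defaults plus a fixed random_state):

from sklearn.ensemble import RandomForestClassifier

# Assumed settings -- the actual hyperparameters used in the notebook are not shown
clfRF = RandomForestClassifier(n_estimators=100, random_state=1)
clfRF.fit(X_train, y_train.values.ravel())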
y_pred=clfRF.predict(X_test)
model_scoreRF = clfRF.score(X_test, y_test)
C:\Users\Hp\anaconda3\lib\site-packages\ipykernel_launcher.py:8: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
In [49]:
print(model_scoreRF)
print(metrics.confusion_matrix(y_test, y_pred))
0.7925764192139738
[[ 87 43]
[ 52 276]]
The Random Forest method gives an accuracy of about 79%; the quadrants of its confusion matrix contain 87, 43, 52 and 276 observations respectively, with the data split 70:30 between train and test.
Gradient Boosting. Usually we have to settle for a trade-off between precision and recall: depending on the use case, we may care more about minimising false negatives (priority on recall) or false positives (priority on precision).
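The gbcl object evaluated in the next cell is likewise not defined in the excerpt; a minimal sketch with assumed (default) hyperparameters:

from sklearn.ensemble import GradientBoostingClassifier

# Assumed defaults; the notebook's exact settings are not shown
gbcl = GradientBoostingClassifier(random_state=1)
gbcl.fit(X_train, y_train.values.ravel())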
# Performance Matrix on train data set
y_train_predict = gbcl.predict(X_train)
model_score = gbcl.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
0.8612933458294283
[[232 100]
[ 48 687]]
precision recall f1-score support
Bagging is a method of merging the same type of predictions. Boosting is a method of merging different types of predictions. Bagging decreases variance, not bias, and solves over-fitting issues in a model. Boosting decreases bias, not variance.
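A brief sketch contrasting the two ideas on this data, assuming scikit-learn's BaggingClassifier and AdaBoostClassifier (neither is shown in the notebook excerpt):

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

# Bagging: trees fitted on bootstrap samples in parallel, predictions averaged (reduces variance)
bag = BaggingClassifier(n_estimators=100, random_state=1)

# Boosting: weak learners added sequentially, each re-weighting the previous errors (reduces bias)
ada = AdaBoostClassifier(n_estimators=100, random_state=1)

for name, model in [('Bagging', bag), ('AdaBoost', ada)]:
    model.fit(X_train, y_train.values.ravel())
    print(name, model.score(X_test, y_test))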
print("="*100)
• If we plot two ROC curves, 'Train' and 'Test', and they do not cross each other, then one of the classifiers is clearly performing better, because for every possible FPR value you get a higher TPR; the area under that ROC curve will also be greater.
• If they do cross each other, then there is a point where the FPR and TPR are the same for both the 'Train' and 'Test' curves. You can no longer say that one ROC curve performs better, as it now depends on what trade-off you prefer (a plotting sketch follows this list).
• Both models look good: the performance on the train and test data is 84% and 83% respectively, which is very close.
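A sketch of how the train and test ROC curves described above could be drawn for the boosted model, assuming the gbcl object and scikit-learn's roc_curve/auc helpers:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

for label, X_part, y_part in [('Train', X_train, y_train), ('Test', X_test, y_test)]:
    probs = gbcl.predict_proba(X_part)[:, 1]          # probability of the positive class
    fpr, tpr, _ = roc_curve(y_part.values.ravel(), probs)
    plt.plot(fpr, tpr, label=f'{label} (AUC = {auc(fpr, tpr):.2f})')

plt.plot([0, 1], [0, 1], linestyle='--')              # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()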
1.8 Based on these predictions, what are the insights? (5 marks)
Data
The data used in this article and a Jupyter Notebook containing the code listing can be downloaded above.
Age Distribution: This category splits the demographic of constituency residents into 9 age groups, (0–
9, 10–19 and so on up until 80+) and has been included as a social indicator.
Clustering
The algorithm describes data points as nodes in a network that communicate their clustering preferences to one another. The main metric used to determine the signal magnitudes is the similarity score, which is calculated as the negative squared Euclidean distance between two data points². Using Affinity Propagation, the smallest and most discrete clusters are identified and labelled first, which explains the dark-to-light colour gradient.
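A minimal sketch of Affinity Propagation as described, assuming scikit-learn's implementation and reusing the scaled election feature frame X from earlier (the article's own constituency-level data is not available here):

from sklearn.cluster import AffinityPropagation

# Affinity Propagation exchanges "messages" between points; with the default
# euclidean affinity the similarity is the negative squared Euclidean distance.
ap = AffinityPropagation(random_state=0)
labels = ap.fit_predict(X)
print('Number of clusters found:', len(ap.cluster_centers_indices_))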
Classification
I used Random Forest and Adaboost techniques to predict the voting direction of each constituency.
The Decision Tree Classifier and Random Forest both consistently achieved an accuracy score of 79%
on the training data and between 83–84% accuracy on the test data.
The predictive power of each attribute was extracted from the Random Forest and is shown in Figure
AdaBoost performed relatively well, often using combinations of logical attributes to make good classifications, although AdaBoost can be sensitive to outlying data.
Conclusions
Using Random Forest, it was possible to extract the predictive power of each attribute and demonstrate
that they all provide a meaningful contribution to the decision process. Classifications were achieved