
Machine Learning Project Business Report

NAME : SHUBHAM PADOLE


PGD-DSBA
Online
Date: 23/11/2021
Problem 1:
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to
build a model to predict which party a voter will vote for on the basis of the given
information, to create an exit poll that will help in predicting the overall win and seats
covered by a particular party.

Data Set used : Election_data.xlsx


1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. (4 Marks)

 Importing libraries

 Loading the dataset

 Head of the dataset and description

Since the ‘vote’ variable is the target, we have ‘vote’ as the dependent variable and the remaining 8
variables as the independent or predictor variables.
Looking at the first 5 records of the dataset gives the following:
 Checking the null values and data info
The unnamed column has been dropped from the data frame, so we are left with only 8 columns. All the
independent continuous columns have an integer datatype, while some of the categorical columns have an object
datatype, which can be handled using one-hot encoding. The dependent column also has an object datatype.
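The load-and-check step above can be sketched as follows; this is a minimal illustration in which a hypothetical three-row frame stands in for Election_data.xlsx (the real report reads the full 1525-row file):

```python
import pandas as pd

# Hypothetical mini-frame standing in for Election_data.xlsx (illustrative values)
data_df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 36, 35],
    "gender": ["female", "male", "male"],
})

# Null-value check: a zero count in every column means no imputation is needed
null_counts = data_df.isnull().sum()
print(null_counts)

# Descriptive statistics for the numeric columns
print(data_df.describe().T)
```

On the real file the same two calls produce the null counts and the descriptive table shown later in this section.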

 Dropping duplicates and checking the data shape


Checking the value counts

Male and female voters are broadly divided across the “Labour” and “Conservative” parties. People prefer the Labour
party over the Conservative party, and the female voter count surpasses the number of male voters.

Viewing dtypes
1.2 Perform Univariate and Bivariate Analysis.
Do exploratory data analysis. Check for Outliers.
Univariate Analysis
Bivariate and Multivariate Analysis

The distribution of male and female voters across age is in almost equal proportion, with females slightly higher in
number in comparison to males; the bulk of both female and male voters lies between 40 and 70 years of age.
[Bar charts comparing Labour and Conservative voter counts by gender and age]
Pair plots
Heat map
Box Plot
Observation:

 It can be easily observed that relatively younger people have voted for the “Labour” party, in comparison to
older people who voted for the “Conservative” party.
 There is an even distribution of people when it comes to their knowledge of their party's
position on European integration.
 The majority of European people have voted for the “Labour” party.
 There exist outliers for economic.cond.household and economic.cond.national.

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30)
data_df.describe().T

                          count     mean      std   min   25%   50%   75%   max
age                      1525.0  54.1823  15.7121  24.0  41.0  53.0  67.0  93.0
economic.cond.national   1525.0   3.2459   0.8802   1.0   3.0   3.0   4.0   5.0
economic.cond.household  1525.0   3.1403   0.9300   1.0   3.0   3.0   4.0   5.0
Blair                    1525.0   3.3344   1.1748   1.0   2.0   4.0   4.0   5.0
Hague                    1525.0   2.7469   1.2307   1.0   2.0   2.0   4.0   5.0
Europe                   1525.0   6.7285   3.2975   1.0   4.0   6.0  10.0  11.0
political.knowledge      1525.0   1.5423   1.0833   0.0   0.0   2.0   2.0   3.0

cat1 = ['vote', 'gender']
df = pd.get_dummies(data_df, columns=cat1, drop_first=True)
df.head()

[First five rows of the encoded dataframe, with columns: age, economic.cond.national,
economic.cond.household, Blair, Hague, Europe, political.knowledge, vote_Labour, gender_male]
Observation:

 The ‘vote’ and ‘gender’ variables are encoded. The ‘Age’ variable has also been categorized into several bins of
age groups: since Age is the only numeric variable and by itself can't be meaningfully used, we
plan to categorize it into groups based on the First, Second, Third and Fourth Quartile values.

Is Scaling necessary here or not?

 Scaling doesn't seem to be required here, as there are only categorical independent variables in the
dataset.
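The 70:30 split named in the heading can be sketched with scikit-learn's train_test_split; the feature frame below is an illustrative stand-in for the encoded election dataframe, not the real file:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative encoded frame standing in for the real election dataframe
X = pd.DataFrame({"age": range(100), "gender_male": [0, 1] * 50})
Y = pd.Series([0, 1] * 50, name="vote_Labour")

# 70:30 split; random_state fixes the shuffle, stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)  # 70 rows for training, 30 for testing
```

Stratifying on the target keeps the Labour/Conservative proportions the same in both halves, which matters for the class-wise metrics reported later.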

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
Discriminant Analysis
AUC ROC curve for LDA Train
AUC ROC curve for LDA Test

Logistic Regression
AUC ROC curve for Logistic Regression Train

y_test_predict = Logistic_model.predict(X_test)
Logistic_model_score = Logistic_model.score(X_test, y_test)
print(Logistic_model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

0.8231441048034934
[[ 85  45]
 [ 36 292]]
              precision    recall  f1-score   support

           0       0.70      0.65      0.68       130
           1       0.87      0.89      0.88       328

    accuracy                           0.82       458
   macro avg       0.78      0.77      0.78       458
weighted avg       0.82      0.82      0.82       458

y_test_prob = Logistic_model.predict_proba(X_test)
pd.DataFrame(y_test_prob).head()

          0         1
0  0.993648  0.006352
1  0.689994  0.310006
2  0.993480  0.006520
3  0.477467  0.522533
4  0.157152  0.842848


AUC ROC curve for Logistic Regression Test

Observation:

 The model is defined with the above parameters and fitted on the Train and Test data. The model
gives an accuracy score of 82.66% on Train Data and 83.80% on Test Data.
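A minimal sketch of how both models can be fitted with scikit-learn, using synthetic two-class data in place of the encoded voter features (the data below is illustrative, not the report's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data standing in for the encoded voter features
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

Logistic_model = LogisticRegression(max_iter=1000).fit(X, y)
LDA_model = LinearDiscriminantAnalysis().fit(X, y)

# Both expose the same score / predict / predict_proba interface used in this report
print(Logistic_model.score(X, y), LDA_model.score(X, y))
```

Because both estimators share the score/predict_proba interface, the same confusion-matrix and ROC code can be reused for each.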

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

KNN

X = df.drop("IsLabour_or_not", axis=1)
Y = df["IsLabour_or_not"]

X.head()

[First five rows of X, with columns: age, economic.cond.national, economic.cond.household,
Blair, Hague, Europe, political.knowledge, IsMale_or_not]

from scipy.stats import zscore

cols = ['age', 'economic.cond.national', 'economic.cond.household', 'Blair',
        'Hague', 'Europe', 'political.knowledge', 'IsMale_or_not']
X[cols] = X[cols].apply(zscore)

X.head(10)
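The z-score standardisation used here can be illustrated on a single column; zscore subtracts the column mean and divides by its standard deviation:

```python
import numpy as np
from scipy.stats import zscore

# z-scoring one illustrative column of ages
ages = np.array([24.0, 41.0, 53.0, 67.0, 93.0])
scaled = zscore(ages)

# After scaling the column has mean 0 and unit standard deviation
print(scaled.mean(), scaled.std())
```

This matters for KNN in particular, since its distance computations would otherwise be dominated by the large-range 'age' column.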

[First ten rows of the z-scored feature matrix: every column is now centred near 0 with unit variance]
AUC ROC Curve KNN Train
AUC ROC Curve KNN Test
the auc curve 0.870
0.8350785340314136
[[ 84  27]
 [ 36 235]]
              precision    recall  f1-score   support

           0       0.70      0.76      0.73       111
           1       0.90      0.87      0.88       271

    accuracy                           0.84       382
   macro avg       0.80      0.81      0.80       382
weighted avg       0.84      0.84      0.84       382

y_train_predict = KNN_model.predict(x_train)
KNN_model_score = KNN_model.score(x_train, y_train)
print(KNN_model_score)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

0.8678915135608049
[[263  88]
 [ 63 729]]
              precision    recall  f1-score   support

           0       0.81      0.75      0.78       351
           1       0.89      0.92      0.91       792

    accuracy                           0.87      1143
   macro avg       0.85      0.83      0.84      1143
weighted avg       0.87      0.87      0.87      1143

y_test_predict = KNN_model.predict(x_test)
KNN_model_score = KNN_model.score(x_test, y_test)
print(KNN_model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

0.8246073298429319
[[ 81  30]
 [ 37 234]]
              precision    recall  f1-score   support

           0       0.69      0.73      0.71       111
           1       0.89      0.86      0.87       271

    accuracy                           0.82       382
   macro avg       0.79      0.80      0.79       382
weighted avg       0.83      0.82      0.83       382
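A minimal sketch of the KNN fit behind these reports, using synthetic scaled features in place of the z-scored election data (data and shapes here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic scaled features standing in for the z-scored election data
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(-1, 1, (200, 5)), rng.normal(1, 1, (200, 5))])
y = np.array([0] * 200 + [1] * 200)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Plain KNeighborsClassifier with the default k=5 neighbours
KNN_model = KNeighborsClassifier().fit(x_train, y_train)
print(KNN_model.score(x_train, y_train), KNN_model.score(x_test, y_test))
```

A train score noticeably above the test score, as in the 0.87 vs 0.82 figures above, is the usual sign of mild overfitting for small k.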
AUC ROC curve for the KNN classifier on the train data set

the auc curve 0.904

AUC ROC curve for the KNN classifier on the test data set


the auc curve 0.900
Naive Bayes
the auc 0.886
the auc curve 0.885
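A minimal Naïve Bayes sketch on synthetic data; GaussianNB is assumed here as the variant, since the report does not show which one was used:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data; GaussianNB models each feature as normal per class
rng = np.random.RandomState(7)
X = np.vstack([rng.normal(0, 1, (150, 3)), rng.normal(2, 1, (150, 3))])
y = np.array([0] * 150 + [1] * 150)

NB_model = GaussianNB().fit(X, y)
print(NB_model.score(X, y))
print(NB_model.predict_proba(X[:1]))  # per-class probabilities sum to 1
```

The predict_proba output is what feeds the ROC/AUC computation reported for this model.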

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting
Bagging Train
AUC_ROC Curve Bagging Train

AUC: 1.000
Bagging Test

AUC_ROC Curve Bagging Test

AUC: 0.877
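A sketch of the bagging step under the report's note that Random Forest should be used; synthetic data stands in for the election features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the election features
rng = np.random.RandomState(3)
X = np.vstack([rng.normal(0, 1, (150, 4)), rng.normal(1.5, 1, (150, 4))])
y = np.array([0] * 150 + [1] * 150)

# Random Forest is itself a bagging ensemble of decision trees; unpruned trees
# typically fit the training data almost perfectly, which is consistent with
# a train AUC of 1.000 next to a much lower test AUC
bag_model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
print(bag_model.score(X, y))
```

Constraining max_depth or min_samples_leaf during tuning is the usual way to pull the train and test AUCs closer together.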
1.7 Performance Metrics: Check the performance
of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and
get ROC_AUC score for each model. Final Model:
Compare the models and write inference which
model is best/optimized.
AUC: 0.913
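The metrics requested in section 1.7 can all be computed with sklearn.metrics; a self-contained toy example (the labels and probabilities below are illustrative, not the report's data):

```python
import numpy as np
from sklearn import metrics

# Toy true labels and predicted positive-class probabilities
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_prob = np.array([0.2, 0.65, 0.8, 0.9, 0.6, 0.3, 0.7, 0.55])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at the 0.5 threshold

accuracy = metrics.accuracy_score(y_true, y_pred)
cm = metrics.confusion_matrix(y_true, y_pred)
auc = metrics.roc_auc_score(y_true, y_prob)       # area under the ROC curve
fpr, tpr, _ = metrics.roc_curve(y_true, y_prob)   # points for plotting the curve
print(accuracy, auc)
print(cm)
```

Note that accuracy and the confusion matrix use the thresholded labels, while roc_auc_score is threshold-free and uses the raw probabilities, which is why the two can rank models differently.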

Gradient Boosting
AUC_ROC Curve Boosting Train

ADA Boosting Test


AUC_ROC Curve Boosting Test

AUC: 0.879

Gradient Boosting Test


Gradient Boosting AUC_ROC Curve Test

AUC: 0.904
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United
States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned documents.
We have imported the inaugural module of the nltk.corpus library to get the speeches of the Presidents of the
United States of America.
After that, we have used the inaugural module's raw method to extract the speeches of President Franklin
D. Roosevelt, President John F. Kennedy and President Richard Nixon, as below.

We have used Python’s len function on each speech to identify the number of characters in it.

Roosevelt = inaugural.raw('1941-Roosevelt.txt')
Kennedy = inaugural.raw('1961-Kennedy.txt')
Nixon = inaugural.raw('1973-Nixon.txt')

After importing the text files, we first count the total number of characters in each file separately.

number_of_characters = len(Roosevelt)
print('Number of characters in Roosevelt file:', number_of_characters)
number_of_characters = len(Kennedy)
print('Number of characters in Kennedy file:', number_of_characters)
number_of_characters = len(Nixon)
print('Number of characters in Nixon file:', number_of_characters)

Number of characters in Roosevelt file: 7571
Number of characters in Kennedy file: 7618
Number of characters in Nixon file: 9991

Number of words in each text file: below we count the total number of words in each file separately. Here we use split() to split up
the words based on the space between each word, and we count the total number of words using the len() function.

Number of words in Kennedy:

x = inaugural.raw('1961-Kennedy.txt')
words = x.split()
print('Number of words in Kennedy file:', len(words))

Number of words in Kennedy file:

Number of words in Nixon:

x = inaugural.raw('1973-Nixon.txt')
words = x.split()
print('Number of words in Nixon file:', len(words))

Number of words in Nixon file: 1819

Number of words in Roosevelt:

x = inaugural.raw('1941-Roosevelt.txt')
words = x.split()
print('Number of words in Roosevelt file:', len(words))

Number of words in Roosevelt file: 1360


Number of sentences.

Below we count the total number of sentences in each text file using a lambda function. We use pd.DataFrame to hold the text as a dictionary and
then, with the lambda function, check each token that ends with "." using the endswith() function. The code and output are as below.

# number of sentences in Nixon

y = pd.DataFrame({'Text': inaugural.raw('1973-Nixon.txt')}, index=[0])
y['sentences'] = y['Text'].apply(lambda x: len([w for w in x.split() if w.endswith('.')]))
y

                                                Text  sentences
0  Mr. Vice President, Mr. Speaker, Mr. Chief Ju...         68
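The endswith('.') counting rule can be seen on a short sample string (the sentence here is illustrative, not from the corpora):

```python
# Count tokens ending with '.' as a rough proxy for the number of sentences
text = "I have a dream. It is a dream deeply rooted. We hold these truths."
sentence_count = len([w for w in text.split() if w.endswith('.')])
print(sentence_count)  # 3
```

One caveat: abbreviations such as "Mr." also end with a period, so on texts like the Nixon speech this rule slightly overstates the true sentence count.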


2.2 Remove all the stopwords from all three speeches.

Remove all the stopwords from the Kennedy speeches.


Remove all the stopwords from the Nixon speeches
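A minimal sketch of the removal step; a tiny hand-picked stopword set stands in here for nltk.corpus.stopwords.words('english'), which the report uses (and which requires a one-time nltk download):

```python
# Illustrative stopword set standing in for nltk.corpus.stopwords.words('english')
stop_words = {"the", "of", "and", "a", "to", "in", "is", "we", "our"}

text = "We hold the freedom of the nation in our hands"

# Lower-case, split on whitespace, and keep only the non-stopword tokens
tokens = [w for w in text.lower().split() if w not in stop_words]
print(tokens)
```

The same comprehension, run with the full nltk stopword list over each speech, produces the cleaned token lists used in the next section.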
2.3 Which word occurs the most number of times
in his inaugural address for each president?
Mention the top three words. (after removing the
stopwords)

We have already removed the stopwords in the previous code using stopwords. Now we loop over the words
and count the total number of occurrences of each, and we see that the below words are the most heavily
used in the Roosevelt speech. Top 3 words: Nation, Know,
Spirit.
For roosevelt

For Kennedy
For Nixon
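The counting loop can also be written with collections.Counter; the token list below is an illustrative stand-in for the cleaned Roosevelt speech:

```python
from collections import Counter

# Tokens after stopword removal (illustrative stand-in for the real speech)
tokens = ["nation", "know", "spirit", "nation", "spirit", "nation",
          "know", "life", "spirit", "know", "nation"]

# Counter.most_common returns (word, count) pairs sorted by frequency
top_three = Counter(tokens).most_common(3)
print(top_three)
```

Running the same two lines on each president's cleaned token list yields the top-three words reported for Roosevelt, Kennedy and Nixon.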
