
Machine Learning Project Business Report

NAME : SHUBHAM PADOLE


PGD-DSBA
Online
Date: 23/11/2021
Problem 1:
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to
build a model to predict which party a voter will vote for on the basis of the given
information, to create an exit poll that will help in predicting the overall win and seats
covered by a particular party.

Data Set used : Election_data.xlsx


1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. (4 Marks)

 Importing libraries

 Loading the dataset

 Head of the dataset and description

Since the ‘vote’ variable is the target, we have ‘vote’ as the dependent variable and the remaining 8
variables as the independent or predictor variables.
Looking at the first 5 records of the dataset gives the following:
 Checking the null values and data info
The unnamed column has been dropped from the data frame, so we are left with only 8 columns. All the
independent continuous columns have an integer datatype, while some of the categorical columns have an object
datatype, which can be handled using one-hot encoding. The dependent column also has an object datatype.
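The load-and-check step above can be sketched as follows; this is a minimal illustration in which a hypothetical three-row frame stands in for Election_data.xlsx (the real report reads the full 1525-row file):

```python
import pandas as pd

# Hypothetical mini-frame standing in for Election_data.xlsx (illustrative values)
data_df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 36, 35],
    "gender": ["female", "male", "male"],
})

# Null-value check: a zero count in every column means no imputation is needed
null_counts = data_df.isnull().sum()
print(null_counts)

# Descriptive statistics for the numeric columns
print(data_df.describe().T)
```

On the real file the same two calls produce the null counts and the descriptive table shown later in this section.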

 Dropping duplicates and checking the data shape


Checking the value counts

Male and female voters are broadly divided across the “Labour” and “Conservative” parties. People prefer the Labour
party over the Conservative party, and the female voter count surpasses the number of male voters.

Viewing dtypes
1.2 Perform Univariate and Bivariate Analysis.
Do exploratory data analysis. Check for Outliers.
Univariate Analysis
Bivariate and Multivariate Analysis

The distribution of male and female voters across age is in almost equal proportion, with females slightly higher in
number in comparison to males; the bulk of both female and male voters lies between 40 and 70 years of age.
[Bar charts comparing Labour and Conservative voter counts by gender and age]
Pair plots
Heat map
Box Plot
Observation:

 It can be easily observed that relatively younger people have voted for the “Labour” party, in comparison to
older people who voted for the “Conservative” party.
 There is an even distribution of people when it comes to their knowledge of their party's
position on European integration.
 The majority of European people have voted for the “Labour” party.
 There exist outliers for economic.cond.household and economic.cond.national.

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30)
data_df.describe().T

                          count     mean      std   min   25%   50%   75%   max
age                      1525.0  54.1823  15.7121  24.0  41.0  53.0  67.0  93.0
economic.cond.national   1525.0   3.2459   0.8802   1.0   3.0   3.0   4.0   5.0
economic.cond.household  1525.0   3.1403   0.9300   1.0   3.0   3.0   4.0   5.0
Blair                    1525.0   3.3344   1.1748   1.0   2.0   4.0   4.0   5.0
Hague                    1525.0   2.7469   1.2307   1.0   2.0   2.0   4.0   5.0
Europe                   1525.0   6.7285   3.2975   1.0   4.0   6.0  10.0  11.0
political.knowledge      1525.0   1.5423   1.0833   0.0   0.0   2.0   2.0   3.0

cat1 = ['vote', 'gender']
df = pd.get_dummies(data_df, columns=cat1, drop_first=True)
df.head()

[First five rows of the encoded dataframe, with columns: age, economic.cond.national,
economic.cond.household, Blair, Hague, Europe, political.knowledge, vote_Labour, gender_male]
Observation:

 The ‘vote’ and ‘gender’ variables are encoded. The ‘Age’ variable has also been categorized into several bins of
age groups: since Age is the only numeric variable and by itself can't be meaningfully used, we
plan to categorize it into groups based on the First, Second, Third and Fourth Quartile values.

Is Scaling necessary here or not?

 Scaling doesn't seem to be required here, as there are only categorical independent variables in the
dataset.
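The 70:30 split named in the heading can be sketched with scikit-learn's train_test_split; the feature frame below is an illustrative stand-in for the encoded election dataframe, not the real file:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative encoded frame standing in for the real election dataframe
X = pd.DataFrame({"age": range(100), "gender_male": [0, 1] * 50})
Y = pd.Series([0, 1] * 50, name="vote_Labour")

# 70:30 split; random_state fixes the shuffle, stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)  # 70 rows for training, 30 for testing
```

Stratifying on the target keeps the Labour/Conservative proportions the same in both halves, which matters for the class-wise metrics reported later.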

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
Discriminant Analysis
AUC ROC curve for LDA Train
AUC ROC curve for LDA Test

Logistic Regression
AUC ROC curve for Logistic Regression Train

y_test_predict = Logistic_model.predict(X_test)
Logistic_model_score = Logistic_model.score(X_test, y_test)
print(Logistic_model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

0.8231441048034934
[[ 85  45]
 [ 36 292]]
              precision    recall  f1-score   support

           0       0.70      0.65      0.68       130
           1       0.87      0.89      0.88       328

    accuracy                           0.82       458
   macro avg       0.78      0.77      0.78       458
weighted avg       0.82      0.82      0.82       458

y_test_prob = Logistic_model.predict_proba(X_test)
pd.DataFrame(y_test_prob).head()

          0         1
0  0.993648  0.006352
1  0.689994  0.310006
2  0.993480  0.006520
3  0.477467  0.522533
4  0.157152  0.842848


AUC ROC curve for Logistic Regression Test

Observation:

 The model is defined with the above parameters and fitted on the Train and Test data. The model
gives an accuracy score of 82.66% on Train Data and 83.80% on Test Data.
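A minimal sketch of how both models can be fitted with scikit-learn, using synthetic two-class data in place of the encoded voter features (the data below is illustrative, not the report's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data standing in for the encoded voter features
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

Logistic_model = LogisticRegression(max_iter=1000).fit(X, y)
LDA_model = LinearDiscriminantAnalysis().fit(X, y)

# Both expose the same score / predict / predict_proba interface used in this report
print(Logistic_model.score(X, y), LDA_model.score(X, y))
```

Because both estimators share the score/predict_proba interface, the same confusion-matrix and ROC code can be reused for each.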

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

KNN

X = df.drop("IsLabour_or_not", axis=1)
Y = df["IsLabour_or_not"]

X.head()

[First five rows of X, with columns: age, economic.cond.national, economic.cond.household,
Blair, Hague, Europe, political.knowledge, IsMale_or_not]

from scipy.stats import zscore

cols = ['age', 'economic.cond.national', 'economic.cond.household', 'Blair',
        'Hague', 'Europe', 'political.knowledge', 'IsMale_or_not']
X[cols] = X[cols].apply(zscore)

X.head(10)
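The z-score standardisation used here can be illustrated on a single column; zscore subtracts the column mean and divides by its standard deviation:

```python
import numpy as np
from scipy.stats import zscore

# z-scoring one illustrative column of ages
ages = np.array([24.0, 41.0, 53.0, 67.0, 93.0])
scaled = zscore(ages)

# After scaling the column has mean 0 and unit standard deviation
print(scaled.mean(), scaled.std())
```

This matters for KNN in particular, since its distance computations would otherwise be dominated by the large-range 'age' column.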

[First ten rows of the z-scored feature matrix: every column is now centred near 0 with unit variance]
AUC ROC Curve KNN Train
AUC ROC Curve KNN Test
the auc curve 0.870
0.8350785340314136
[[ 84  27]
 [ 36 235]]
              precision    recall  f1-score   support

           0       0.70      0.76      0.73       111
           1       0.90      0.87      0.88       271

    accuracy                           0.84       382
   macro avg       0.80      0.81      0.80       382
weighted avg       0.84      0.84      0.84       382

y_train_predict = KNN_model.predict(x_train)
KNN_model_score = KNN_model.score(x_train, y_train)
print(KNN_model_score)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

0.8678915135608049
[[263  88]
 [ 63 729]]
              precision    recall  f1-score   support

           0       0.81      0.75      0.78       351
           1       0.89      0.92      0.91       792

    accuracy                           0.87      1143
   macro avg       0.85      0.83      0.84      1143
weighted avg       0.87      0.87      0.87      1143

y_test_predict = KNN_model.predict(x_test)
KNN_model_score = KNN_model.score(x_test, y_test)
print(KNN_model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

0.8246073298429319
[[ 81  30]
 [ 37 234]]
              precision    recall  f1-score   support

           0       0.69      0.73      0.71       111
           1       0.89      0.86      0.87       271

    accuracy                           0.82       382
   macro avg       0.79      0.80      0.79       382
weighted avg       0.83      0.82      0.83       382
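A minimal sketch of the KNN fit behind these reports, using synthetic scaled features in place of the z-scored election data (data and shapes here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic scaled features standing in for the z-scored election data
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(-1, 1, (200, 5)), rng.normal(1, 1, (200, 5))])
y = np.array([0] * 200 + [1] * 200)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Plain KNeighborsClassifier with the default k=5 neighbours
KNN_model = KNeighborsClassifier().fit(x_train, y_train)
print(KNN_model.score(x_train, y_train), KNN_model.score(x_test, y_test))
```

A train score noticeably above the test score, as in the 0.87 vs 0.82 figures above, is the usual sign of mild overfitting for small k.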
AUC ROC curve for the KNN classifier on the train data set

the auc curve 0.904

AUC ROC curve for the KNN classifier on the test data set


the auc curve 0.900
Naive Bayes
the auc 0.886
the auc curve 0.885
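A minimal Naïve Bayes sketch on synthetic data; GaussianNB is assumed here as the variant, since the report does not show which one was used:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data; GaussianNB models each feature as normal per class
rng = np.random.RandomState(7)
X = np.vstack([rng.normal(0, 1, (150, 3)), rng.normal(2, 1, (150, 3))])
y = np.array([0] * 150 + [1] * 150)

NB_model = GaussianNB().fit(X, y)
print(NB_model.score(X, y))
print(NB_model.predict_proba(X[:1]))  # per-class probabilities sum to 1
```

The predict_proba output is what feeds the ROC/AUC computation reported for this model.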

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting
Bagging Train
AUC_ROC Curve Bagging Train

AUC: 1.000
Bagging Test

AUC_ROC Curve Bagging Test

AUC: 0.877
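A sketch of the bagging step under the report's note that Random Forest should be used; synthetic data stands in for the election features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the election features
rng = np.random.RandomState(3)
X = np.vstack([rng.normal(0, 1, (150, 4)), rng.normal(1.5, 1, (150, 4))])
y = np.array([0] * 150 + [1] * 150)

# Random Forest is itself a bagging ensemble of decision trees; unpruned trees
# typically fit the training data almost perfectly, which is consistent with
# a train AUC of 1.000 next to a much lower test AUC
bag_model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
print(bag_model.score(X, y))
```

Constraining max_depth or min_samples_leaf during tuning is the usual way to pull the train and test AUCs closer together.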
1.7 Performance Metrics: Check the performance
of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and
get ROC_AUC score for each model. Final Model:
Compare the models and write inference which
model is best/optimized.
AUC: 0.913
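The metrics requested in section 1.7 can all be computed with sklearn.metrics; a self-contained toy example (the labels and probabilities below are illustrative, not the report's data):

```python
import numpy as np
from sklearn import metrics

# Toy true labels and predicted positive-class probabilities
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_prob = np.array([0.2, 0.65, 0.8, 0.9, 0.6, 0.3, 0.7, 0.55])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at the 0.5 threshold

accuracy = metrics.accuracy_score(y_true, y_pred)
cm = metrics.confusion_matrix(y_true, y_pred)
auc = metrics.roc_auc_score(y_true, y_prob)       # area under the ROC curve
fpr, tpr, _ = metrics.roc_curve(y_true, y_prob)   # points for plotting the curve
print(accuracy, auc)
print(cm)
```

Note that accuracy and the confusion matrix use the thresholded labels, while roc_auc_score is threshold-free and uses the raw probabilities, which is why the two can rank models differently.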

Gradient Boosting
AUC_ROC Curve Boosting Train

ADA Boosting Test


AUC_ROC Curve Boosting Test

AUC: 0.879

Gradient Boosting Test


Gradient Boosting AUC_ROC Curve Test

AUC: 0.904
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United
States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned documents.
We have imported the inaugural module of the nltk.corpus library to get the speeches of the Presidents of the
United States of America.
After that, we have used the inaugural module's raw method to extract the speeches of President Franklin
D. Roosevelt, President John F. Kennedy and President Richard Nixon, as below.

We have used Python’s len function on each speech to identify the number of characters in it.

Roosevelt = inaugural.raw('1941-Roosevelt.txt')
Kennedy = inaugural.raw('1961-Kennedy.txt')
Nixon = inaugural.raw('1973-Nixon.txt')

After importing the text files, we first count the total number of characters in each file separately.

number_of_characters = len(Roosevelt)
print('Number of characters in Roosevelt file:', number_of_characters)
number_of_characters = len(Kennedy)
print('Number of characters in Kennedy file:', number_of_characters)
number_of_characters = len(Nixon)
print('Number of characters in Nixon file:', number_of_characters)

Number of characters in Roosevelt file: 7571
Number of characters in Kennedy file: 7618
Number of characters in Nixon file: 9991

Number of words in each text file: below we count the total number of words in each file separately. Here we use split() to split up
the words based on the space between each word, and we count the total number of words using the len() function.

Number of words in Kennedy:

x = inaugural.raw('1961-Kennedy.txt')
words = x.split()
print('Number of words in Kennedy file:', len(words))

Number of words in Kennedy file:

Number of words in Nixon:

x = inaugural.raw('1973-Nixon.txt')
words = x.split()
print('Number of words in Nixon file:', len(words))

Number of words in Nixon file: 1819

Number of words in Roosevelt:

x = inaugural.raw('1941-Roosevelt.txt')
words = x.split()
print('Number of words in Roosevelt file:', len(words))

Number of words in Roosevelt file: 1360


Number of sentences.

Below we count the total number of sentences in each text file using a lambda function. We use pd.DataFrame to hold the text as a dictionary and
then, with the lambda function, check each token that ends with "." using the endswith() function. The code and output are as below.

# number of sentences in Nixon

y = pd.DataFrame({'Text': inaugural.raw('1973-Nixon.txt')}, index=[0])
y['sentences'] = y['Text'].apply(lambda x: len([w for w in x.split() if w.endswith('.')]))
y

                                                Text  sentences
0  Mr. Vice President, Mr. Speaker, Mr. Chief Ju...         68
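The endswith('.') counting rule can be seen on a short sample string (the sentence here is illustrative, not from the corpora):

```python
# Count tokens ending with '.' as a rough proxy for the number of sentences
text = "I have a dream. It is a dream deeply rooted. We hold these truths."
sentence_count = len([w for w in text.split() if w.endswith('.')])
print(sentence_count)  # 3
```

One caveat: abbreviations such as "Mr." also end with a period, so on texts like the Nixon speech this rule slightly overstates the true sentence count.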


2.2 Remove all the stopwords from all three speeches.

Remove all the stopwords from the Kennedy speeches.


Remove all the stopwords from the Nixon speeches
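A minimal sketch of the removal step; a tiny hand-picked stopword set stands in here for nltk.corpus.stopwords.words('english'), which the report uses (and which requires a one-time nltk download):

```python
# Illustrative stopword set standing in for nltk.corpus.stopwords.words('english')
stop_words = {"the", "of", "and", "a", "to", "in", "is", "we", "our"}

text = "We hold the freedom of the nation in our hands"

# Lower-case, split on whitespace, and keep only the non-stopword tokens
tokens = [w for w in text.lower().split() if w not in stop_words]
print(tokens)
```

The same comprehension, run with the full nltk stopword list over each speech, produces the cleaned token lists used in the next section.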
2.3 Which word occurs the most number of times
in his inaugural address for each president?
Mention the top three words. (after removing the
stopwords)

We have already removed the stopwords in the previous code using stopwords. Now we loop over the words
and count the total number of occurrences of each, and we see that the below words are the most heavily
used in the Roosevelt speech. Top 3 words: Nation, Know,
Spirit.
For roosevelt

For Kennedy
For Nixon
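The counting loop can also be written with collections.Counter; the token list below is an illustrative stand-in for the cleaned Roosevelt speech:

```python
from collections import Counter

# Tokens after stopword removal (illustrative stand-in for the real speech)
tokens = ["nation", "know", "spirit", "nation", "spirit", "nation",
          "know", "life", "spirit", "know", "nation"]

# Counter.most_common returns (word, count) pairs sorted by frequency
top_three = Counter(tokens).most_common(3)
print(top_three)
```

Running the same two lines on each president's cleaned token list yields the top-three words reported for Roosevelt, Kennedy and Nixon.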
