Professional Documents
Culture Documents
23-11-2021 (Machine Learning Project Business Report)
23-11-2021 (Machine Learning Project Business Report)
Importing library
Since the ‘vote’ variable is the target, we therefore have ‘vote’ as the dependent and rest 8
variables as the independent or predictor variables.
Looking at the first 5 records of the dataset gives the following:
Check the nullvalue and data_info
Unnamed column has been dropped from the data frame, now we left with only 8 columns in it. All the
independent continuous columns has a integer datatype although some of the category columns have object
datatype and that can be handled using one hot coding. The dependent column has object datatype
The male and female voters are briefly divided across “Labour” and “Conservative” parties. People prefer Labour
party more over the Conservative parity. Although female voter count over pass the number of male voters.
Viewing dtypes
1.2Perform Univariate and Bivariate Analysis.
Do exploratory data analysis. Check for Outliers.
Univariate Analysis
Bivariate and Multivariate Analysis
Distribution of Male and Female voters across the age is almost in equal proportion with Females slighter higher in
number in comparison to Males, also both female and male voters age lies between 40 to 70 years.
JD
- -
- -
! 6 -
- ·-
- -
Loboor
Qll' f'II
J.(I
-
H
lil
..L..
;;
R
L
- -..
5
(I
-
1..11><>.,
wi:,
-
C'fin.s.tf'll.ih'II
Pair plots
Heat map
Box Plot
Observation:
can be easily observed that relatively younger people have voted for “Labour” party in comparison to that
of older people who voted for “Conservative” party.
There is an evenly distributed number of people when it comes to thier knowledge about their party's
position on European integration.
Majority of European people have voted for “Labour” party
There exists an outlier for economic.cond.household and economic.cond.national.
age ·1525.0 54.18.22B5 Hi. 1 209 24.0 41.0 53.0 67.0 93.0
11 df .,head()
0 43 3 3 4 2 2 0
11 35 4 4 4 4 5 2
2' 35 4 4 5 2 J 2
3 24 4 2 2 4 D 0
4 4·1 2 2 6 2
Observation :
Vote’ and ‘Gender’ variables are encoded. Also ‘Age’ variable has been categorized into several bins of
age groups. Since Age variable is the only numeric variable and by itself can't be meaningfully used, we
plan to categorize the variable into groups, basis the First, Second, Third and Fourth Quartile values.
Logistic Regression
AUC ROC curve for Logistic Regression Train
LO ,,,.
o.a
0.6,
0.4
0.2
OJJ
OJJ 0.2 0.6, 0.8 LO
0 .,,823!L441M8.034934
[[ 3S 415]
[ 36 292]]
pr,ecisicn recall f!L-sccr,e SIUµ' Ort
1 y_test_prcD=Logistic_mcdel .. pre,i:lict_priotiia(X_tes )
2 pd., ata ra e(y_test_prcb). headl()
0 1
0 .93364S 0.006352
11 .600194 0.310006
2 .333480 0.006520
3, 0.5225,)3
Observation :
The model is defined with above parameters and fitted on Train and Test data. The model
gives an accuracy score as 82.66% for Train Data and 83.80% on Test Data.
KNN
X=df.drop("IsLabo r_or_not", axis=t)
Y=df["IsLabour_or_not"]
r 1 x.headO
3 2
4 5
5 2 J
2 2 4
2 6
[
[ 1
x[['age', 'econo ic. cond. national', 'economic. cond. houSehold', 'Blair',' Hague',' EU rope',' political. kno,1ledge', 'rw,ale_or_not']]
x, head(10)
" _fll M;:7?Gi:: 11 '?70?'\'IQ 0 11!7".U"I flS::AA7·tfi: 11 flilQ:J::A!.1. J'IIQ"J 711.:1 .d.r')'Jir:.t!'l. 1 R.7\'IM
AUC ROC Curve KNN Train
AUC ROC Curve KNN Test
the auc curve 0.870
0 ...835•0785340314136
[ ,8. 27]
36 235]]
priecisii:m recall fl-sccrie support
0 ..,86.78.91513S608e.ag,
[[2J6.3 88]
[ J6.3 729]]
pr,ecisicn recall f!L-sccrie support
0 ,.,82. 607329•842932
[ [ .81 30]
[ 37 234]
p r,ecisicni ecall f!L-sccrie s1uppo
AUC: 1.000
Bagging Test
AUC: 0.877
1.7 Performance Metrics: Check the performance
of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and
get ROC_AUC score for each model. Final Model:
Compare the models and write inference which
model is best/optimized.
AUC: 0.913
Gradient Boosting
AUC _ROC Curve Boosting Train
AUC: 0.879
AUC: 0.904
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United
States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for the mentioned documents.
We have imported inaugural module of nltk.corpus library to get speeches of the Presidents of the
United States of America.
After that we have used inaugural modules raw method to extract the speech of President Franklin
D. Roosevelt, President John F. Kennedy and President Richard Nixon as below
We have used python’s len function on each row speech to identify Number of characters in speech
Roosevelt= inauguraLra· ('1941-Roosevelt.txt')
2 Kenne.cJ'y= inaugural. raL<( "1%1-Kennedy. txt')
Nixcn =inaugural. rao,( "1973-Nixon. txt •)
A.fter importing lhe text file, 1ve woul<I rsl count tne tolal number ol characlers,in eaoh lile separately.
Number ol words in each text file: Below •1e a.re ,counling lhe lolal number of words from each file separately, Here we are using lhe sp!itOlo split up l e words
based on space between eaah worn a d"e are counting the lotal numberof •iords by using the lenO iunclion.
IIU ber of
Below we are counting lhe tolal number ofsente.n,ce in eaah lex! file, b using lambda function. We are using pd.Da.lalmme to move the dala as dictionary and
lhen v1ith lambda funclton we are clhecking each sentece which ,ends w[th "_• Using endswithOftmotion and the below code and oulput is as below
Tiex.t :5€-lltE!nce-s
We have already removed the stopwatch in previous code using stopwords. Now we are loop to look
for any word and count the total number of occurrences. And we see from Roosevelt file we have the
below words which are highly used in during the speech by president. Top 3 Words : Nation , Know,
Spirit.
For roosevelt
For Kennedy
For Nixon