
MACHINE LEARNING PROJECT

Sapan Parikh
BABIO – SEP’19   
Contents

Machine Learning Project
EDA
   Univariate Analysis
   Bi-variate Analysis
   Outlier
   Multi-collinearity
Data preparation – SMOTE
Model building
   Logistic regression
   KNN Model
   Naïve Bayes
   Model validation
Bagging
Actionable insights and recommendations

Machine Learning Project
The problem involves predicting the mode of commute of employees – specifically, whether an employee will take a car to the office.

EDA
# import the data

cars_data <- read.csv("Cars.csv", header = TRUE)

At first glance, we can observe that the data has 444 observations of 9 variables.

Data Dictionary

Age        Age of the employee in years
Gender     Gender of the employee
Engineer   Engineer = 1, Non-engineer = 0
MBA        MBA = 1, Non-MBA = 0
Work Exp   Work experience in years
Salary     Salary in lakhs per annum
Distance   Distance in km from home to office
license    Has a driving licence = 1, otherwise 0
Transport  Mode of transport
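
The structure and summary shown below can be reproduced with base R; a minimal sketch:

# dimensions, variable types and five-number summaries
dim(cars_data)
str(cars_data)
summary(cars_data)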

Data Structure
'data.frame': 444 obs. of 9 variables:
$ Age : int 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
$ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : int 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...

Data Summary
Age Gender Engineer MBA Work.Exp
Min. :18.00 Female:128 Min. :0.0000 Min. :0.0000 Min. : 0.0
1st Qu.:25.00 Male :316 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 3.0
Median :27.00 Median :1.0000 Median :0.0000 Median : 5.0
Mean :27.75 Mean :0.7545 Mean :0.2528 Mean : 6.3
3rd Qu.:30.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 8.0
Max. :43.00 Max. :1.0000 Max. :1.0000 Max. :24.0
NA's :1
Salary Distance license Transport
Min. : 6.50 Min. : 3.20 Min. :0.0000 2Wheeler : 83
1st Qu.: 9.80 1st Qu.: 8.80 1st Qu.:0.0000 Car : 61
Median :13.60 Median :11.00 Median :0.0000 Public Transport:300
Mean :16.24 Mean :11.32 Mean :0.2342
3rd Qu.:15.72 3rd Qu.:13.43 3rd Qu.:0.0000
Max. :57.00 Max. :23.40 Max. :1.0000

Checking the number of unique values in each column:

# cars_data_2 is assumed to be cars_data after treating the single missing MBA value
sapply(cars_data_2, function(x) length(unique(x)))

Age Gender Engineer MBA Work.Exp Salary Distance license
25 2 2 2 24 122 137 2
Transport
3

Only 13.74% of employees (61 of 444) commute by car.
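
This share can be verified directly from the target variable; a small base R sketch:

# proportion of each transport mode; Car works out to 13.74% (61 of 444)
round(prop.table(table(cars_data$Transport)) * 100, 2)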

Univariate Analysis
One-by-one analysis of each numeric variable:

[Distribution plots for Age, Salary, Work Exp. and Distance]
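
One possible way to reproduce these distribution plots (a base R sketch; column names as in the data dictionary):

# histogram of each numeric variable in a 2x2 grid
par(mfrow = c(2, 2))
for (v in c("Age", "Salary", "Work.Exp", "Distance")) {
  hist(cars_data[[v]], main = v, xlab = v)
}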

Bi-variate Analysis
Transport Vs. Age

The plot shows that employees driving a car to the office have a noticeably higher median age than users of the other two modes.
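
The group comparisons in this section can be reproduced with base R boxplots; a minimal sketch for this first one (assuming boxplots were used):

# distribution of Age for each transport mode
boxplot(Age ~ Transport, data = cars_data, ylab = "Age (years)")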

Transport Vs. Distance

As can be observed, employees driving a car to work cover a noticeably longer distance, on a median basis, than the other groups.

Transport Vs. Salary

The plot shows that the median salary of those taking their car to the office is several times higher than that of employees taking other modes of transport to work.

Transport Vs. Work Exp.

Again, based on the graph above, we can say that highly experienced employees usually take a car to work, which is also intuitive.

The above were categorical vs. numeric bi-variate analyses.

Now let us look at categorical vs. categorical bi-variate analysis.
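
One way to quantify such categorical comparisons is with cross-tabs; a minimal sketch for the first comparison below:

# counts and row-wise proportions of engineers within each transport mode
tab <- table(cars_data$Transport, cars_data$Engineer)
tab
round(prop.table(tab, margin = 1), 2)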

Transport Vs. Engineering

We can see that in the Car segment the ratio of engineers to non-engineers is the highest, close to 5:1.

Transport Vs. Gender

Again, as above, the Car segment has a higher male-to-female ratio than the other segments.

Transport Vs. MBA

The MBA/non-MBA proportion remains almost the same across all three transport categories.

Transport Vs. License

Here it is evident that, unlike in the other two segments, car drivers are far more likely to hold a driving licence.

Outlier
Code for capturing outliers in a numeric variable, illustrated with Age:

# cars_numeric holds the numeric columns; flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
IQRAge <- IQR(cars_numeric$Age)
LLAge <- quantile(cars_numeric$Age, 0.25) - 1.5 * IQRAge
ULAge <- quantile(cars_numeric$Age, 0.75) + 1.5 * IQRAge
AgeOut <- subset(cars_numeric, cars_numeric$Age < LLAge | cars_numeric$Age > ULAge)
dim(AgeOut)  # number of outlying rows

Based on that rule, the count of outliers per variable is:

- Age – 25
- Work Exp. – 38
- Salary – 59
- Distance – 9

Multi-collinearity
Correlation Plot amongst numeric variables
As we can see, there is a very high positive correlation between Work Exp. and Age, as well as between Work Exp. and Salary. This will be dealt with later on.
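
A minimal sketch of how such a correlation plot can be produced (assumes the corrplot package; cars_numeric holds the numeric columns, as in the outlier section):

library(corrplot)
# pairwise correlations; complete.obs skips the single missing MBA value
cor_matrix <- cor(cars_numeric, use = "complete.obs")
corrplot(cor_matrix, method = "number")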

Data preparation – SMOTE


SMOTE stands for Synthetic Minority Oversampling Technique. As seen earlier, the group of employees taking a car to work is only around 13.5% of the data – indeed a minority class.

We need to oversample the car commuters in the data. To simplify, we also create a binary dependent variable 'Transport1', coding "Car" as 1 and the other modes as 0.
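
A minimal sketch of that recoding and the train/test split (the 70/30 ratio and the seed are assumptions; the report does not state them):

# Car -> 1, other modes -> 0; SMOTE needs a factor target
cars_data$Transport1 <- as.factor(ifelse(cars_data$Transport == "Car", 1, 0))
cars_data$Transport <- NULL  # drop the original 3-level target

set.seed(123)  # for reproducibility
train_idx <- sample(seq_len(nrow(cars_data)), size = floor(0.7 * nrow(cars_data)))
cars_train <- cars_data[train_idx, ]
cars_test  <- cars_data[-train_idx, ]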

# SMOTE() from the DMwR package
cars_train_balanced <- SMOTE(Transport1 ~ ., cars_train, perc.over = 5000, k = 5, perc.under = 100)


Output:

table(cars_train_balanced$Transport)

0 1
2150 2193
0 = Public Transport and Two-wheeler

1 = Car

As we can see, in the balanced data the number of people taking a car to work is now roughly equal to the number choosing other modes of transport.

Model building
Logistic regression
Code:

cars_logistic <- glm(Transport1 ~ ., data = cars_train_balanced, family = binomial(link = "logit"))

Summary:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -86.37896 4.95131 -17.446 < 2e-16 ***
Age 2.87489 0.16841 17.071 < 2e-16 ***
GenderMale -0.68311 0.22855 -2.989 0.00280 **
Engineer1 0.66629 0.24383 2.733 0.00628 **
MBA1 -1.16348 0.23072 -5.043 4.59e-07 ***
Work.Exp -1.59100 0.11557 -13.767 < 2e-16 ***
Salary 0.26796 0.02415 11.095 < 2e-16 ***
Distance 0.46442 0.04925 9.430 < 2e-16 ***
license1 1.35984 0.20723 6.562 5.31e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)


Null deviance: 6020.25 on 4342 degrees of freedom
Residual deviance: 759.41 on 4334 degrees of freedom
AIC: 777.41

Number of Fisher Scoring iterations: 9

Checking for multi-collinearity


# vif() from the 'car' package
vif(cars_logistic)
Age Gender Engineer MBA Work.Exp Salary Distance license
8.633221 1.028315 1.067926 1.161522 12.328141 3.303098 1.909529 1.052074

As the VIF of Work.Exp is greater than 10, it indicates the presence of multicollinearity in the data, so we drop the Work.Exp variable.
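
The prediction step feeding the confusion matrix below is not shown in the report; a minimal sketch, assuming the model is refit without Work.Exp and the test set is scored at a 0.5 cut-off:

# refit without the collinear Work.Exp, then attach predicted probabilities
cars_logistic2 <- glm(Transport1 ~ . - Work.Exp, data = cars_train_balanced,
                      family = binomial(link = "logit"))
cars_test$log.pred <- predict(cars_logistic2, newdata = cars_test, type = "response")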

Confusion Matrix
table(cars_test$Transport1, cars_test$log.pred > 0.5)

FALSE TRUE
0 110 5
1 0 18

Sensitivity = 100% (18/18)

Specificity = 95.7% (110/115)

KNN Model
First we normalize the data and then proceed with the model building.

Code:

# preProcess() and train() are from the caret package
range_model <- preProcess(cars_train_balanced, method = "range")

cars_train_normalized <- predict(range_model, cars_train_balanced)

cars_test_normalized <- predict(range_model, cars_test)

knn_fit <- train(Transport1 ~ ., data = cars_train_normalized, method = "knn",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneLength = 10)
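
The sensitivity and specificity reported below can be obtained by scoring the fitted model on the normalized test set; a minimal sketch using caret's confusionMatrix():

# predicted classes on the hold-out set, with Car (1) as the positive class
knn_pred <- predict(knn_fit, newdata = cars_test_normalized)
confusionMatrix(knn_pred, cars_test_normalized$Transport1, positive = "1")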

Confusion matrix:
Sensitivity : 1.0000
Specificity : 0.9217

Naïve Bayes
Code:

# naiveBayes() from the e1071 package; column 9 holds the target Transport1
cars_nb <- naiveBayes(x = cars_train_balanced[, -9], y = as.factor(cars_train_balanced[, 9]))

# make sure the dependent variable is a factor
pred_nb <- predict(cars_nb, newdata = cars_test[, -9])

table(cars_test[, 9], pred_nb)

Confusion Matrix:
pred_nb
0 1
0 111 4
1 1 17

With a sensitivity of about 94% (17/18) and a specificity of about 97% (111/115), we can say Naïve Bayes is also applicable to this data set.

Model validation:
Comparing the above three models and their metrics, we can say that the KNN model performed the best.

Bagging
Since a single model's output may be biased, ensemble methods such as bagging and boosting can be used to reduce the models' inherent errors.
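
A minimal sketch of the bagging fit matching the $call shown in the summary below (assumes the ipred and rpart packages; pred.bagging matches the confusion-matrix code further down):

library(ipred)
library(rpart)
# bagged classification trees, excluding the collinear Work.Exp
cars_bagging <- bagging(Transport1 ~ . - Work.Exp, data = cars_train_balanced,
                        control = rpart.control(maxdepth = 5, minsplit = 4))
cars_test$pred.bagging <- predict(cars_bagging, newdata = cars_test)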

Summary of bagging:
$btree
n= 4343

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 4343 2136 1 (0.491825927 0.508174073)
  2) Age< 30.00368 1993 6 0 (0.996989463 0.003010537) *
  3) Age>=30.00368 2350 149 1 (0.063404255 0.936595745)
    6) Salary< 15.90422 108 21 0 (0.805555556 0.194444444) *
    7) Salary>=15.90422 2242 62 1 (0.027653880 0.972346120) *

attr(,"class")
"sclass"

$OOB
[1] FALSE

$comb
[1] FALSE

$call
bagging.data.frame(formula = Transport1 ~ . - Work.Exp, data = cars_train_balanced,
    control = rpart.control(maxdepth = 5, minsplit = 4))

attr(,"class")

Confusion Matrix:
table(cars_test$Transport1, cars_test$pred.bagging)

0 1
0 112 3
1 2 16

Here the bagged model achieves a specificity of about 97.4% (112/115) while keeping a high sensitivity of about 88.9% (16/18), giving a reliable, well-balanced performance.

Actionable insights and recommendations:


- The data shows evidence of multicollinearity, which needs to be dealt with (here, by dropping Work.Exp).
- The data has a class-imbalance problem, which we addressed by applying the SMOTE algorithm.
- Amongst the three models – logistic regression, KNN and Naïve Bayes – KNN seems to be the best model, with excellent sensitivity and specificity.
- However, by applying bagging, we can get a more realistic and reliable output.
- The increase in sensitivity can lead to improved efficiency and conversion in customer reach-out campaigns.
