
MACHINE LEARNING PROJECT

Sapan Parikh
BABIO – SEP’19   
Contents

Machine Learning Project
EDA
   Univariate Analysis
   Bi-variate Analysis
   Outlier
   Multi-collinearity
Data preparation – SMOTE
Model building
   Logistic regression
   KNN Model
   Naïve Bayes
   Model validation
Bagging
Actionable insights and recommendations

Machine Learning Project
The problem involves predicting the mode of commute of employees – specifically, whether an employee will take a car to the office.

EDA
# import the data

cars_data <- read.csv("Cars.csv", header = TRUE)

At first glance, we can observe that the data has 444 observations of 9 variables.

Data Dictionary

Age        Age of the employee in years
Gender     Gender of the employee
Engineer   Engineer = 1, Non-engineer = 0
MBA        MBA = 1, Non-MBA = 0
Work Exp   Work experience in years
Salary     Salary in lakhs per annum
Distance   Distance in km from home to office
license    Has a driving licence = 1, otherwise 0
Transport  Mode of transport
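
The structure and summary shown below can be reproduced with base R; a minimal sketch:

# dimensions, variable types and five-number summaries
dim(cars_data)
str(cars_data)
summary(cars_data)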

Data Structure
'data.frame': 444 obs. of 9 variables:
$ Age : int 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
$ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : int 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...

Data Summary
Age Gender Engineer MBA Work.Exp
Min. :18.00 Female:128 Min. :0.0000 Min. :0.0000 Min. : 0.0
1st Qu.:25.00 Male :316 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 3.0
Median :27.00 Median :1.0000 Median :0.0000 Median : 5.0
Mean :27.75 Mean :0.7545 Mean :0.2528 Mean : 6.3
3rd Qu.:30.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 8.0
Max. :43.00 Max. :1.0000 Max. :1.0000 Max. :24.0
NA's :1
Salary Distance license Transport
Min. : 6.50 Min. : 3.20 Min. :0.0000 2Wheeler : 83
1st Qu.: 9.80 1st Qu.: 8.80 1st Qu.:0.0000 Car : 61
Median :13.60 Median :11.00 Median :0.0000 Public Transport:300
Mean :16.24 Mean :11.32 Mean :0.2342
3rd Qu.:15.72 3rd Qu.:13.43 3rd Qu.:0.0000
Max. :57.00 Max. :23.40 Max. :1.0000

Checking the number of unique values in each column:

# cars_data_2 is assumed to be cars_data after treating the single missing MBA value
sapply(cars_data_2, function(x) length(unique(x)))

Age Gender Engineer MBA Work.Exp Salary Distance license
25 2 2 2 24 122 137 2
Transport
3

Only 13.74% of employees (61 of 444) commute by car.
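
This share can be verified directly from the target variable; a small base R sketch:

# proportion of each transport mode; Car works out to 13.74% (61 of 444)
round(prop.table(table(cars_data$Transport)) * 100, 2)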

Univariate Analysis
One-by-one analysis of each numeric variable:

[Distribution plots for Age, Salary, Work Exp. and Distance]
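
One possible way to reproduce these distribution plots (a base R sketch; column names as in the data dictionary):

# histogram of each numeric variable in a 2x2 grid
par(mfrow = c(2, 2))
for (v in c("Age", "Salary", "Work.Exp", "Distance")) {
  hist(cars_data[[v]], main = v, xlab = v)
}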

Bi-variate Analysis
Transport Vs. Age

The plot shows that employees driving a car to the office have a noticeably higher median age than users of the other two modes.
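
The group comparisons in this section can be reproduced with base R boxplots; a minimal sketch for this first one (assuming boxplots were used):

# distribution of Age for each transport mode
boxplot(Age ~ Transport, data = cars_data, ylab = "Age (years)")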

Transport Vs. Distance

As can be observed, employees driving a car to work cover a noticeably longer distance, on a median basis, than the other groups.

Transport Vs. Salary

The plot shows that the median salary of those taking their car to the office is several times higher than that of employees taking other modes of transport to work.

Transport Vs. Work Exp.

Again, based on the graph above, we can say that highly experienced employees usually take a car to work, which is also intuitive.

The above were categorical vs. numeric bi-variate analyses.

Now let us look at categorical vs. categorical bi-variate analysis.
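
One way to quantify such categorical comparisons is with cross-tabs; a minimal sketch for the first comparison below:

# counts and row-wise proportions of engineers within each transport mode
tab <- table(cars_data$Transport, cars_data$Engineer)
tab
round(prop.table(tab, margin = 1), 2)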

Transport Vs. Engineering

We can see that in the Car segment the ratio of engineers to non-engineers is the highest, close to 5:1.

Transport Vs. Gender

Again, as above, the Car segment has a higher male-to-female ratio than the other segments.

Transport Vs. MBA

The MBA/non-MBA proportion remains almost the same across all three transport categories.

Transport Vs. License

Here it is evident that, unlike in the other two segments, car drivers are far more likely to hold a driving licence.

Outlier
Code for capturing outliers in a numeric variable, illustrated with Age:

# cars_numeric holds the numeric columns; flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
IQRAge <- IQR(cars_numeric$Age)
LLAge <- quantile(cars_numeric$Age, 0.25) - 1.5 * IQRAge
ULAge <- quantile(cars_numeric$Age, 0.75) + 1.5 * IQRAge
AgeOut <- subset(cars_numeric, cars_numeric$Age < LLAge | cars_numeric$Age > ULAge)
dim(AgeOut)  # number of outlying rows

Based on that rule, the count of outliers per variable is:

- Age – 25
- Work Exp. – 38
- Salary – 59
- Distance – 9

Multi-collinearity
Correlation Plot amongst numeric variables
As we can see, there is a very high positive correlation between Work Exp. and Age, as well as between Work Exp. and Salary. This will be dealt with later on.
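
A minimal sketch of how such a correlation plot can be produced (assumes the corrplot package; cars_numeric holds the numeric columns, as in the outlier section):

library(corrplot)
# pairwise correlations; complete.obs skips the single missing MBA value
cor_matrix <- cor(cars_numeric, use = "complete.obs")
corrplot(cor_matrix, method = "number")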

Data preparation – SMOTE


SMOTE stands for Synthetic Minority Oversampling Technique. As seen earlier, the group of employees taking a car to work is only around 13.5% of the data – indeed a minority class.

We need to oversample the car commuters in the data. To simplify, we also create a binary dependent variable 'Transport1', coding "Car" as 1 and the other modes as 0.
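
A minimal sketch of that recoding and the train/test split (the 70/30 ratio and the seed are assumptions; the report does not state them):

# Car -> 1, other modes -> 0; SMOTE needs a factor target
cars_data$Transport1 <- as.factor(ifelse(cars_data$Transport == "Car", 1, 0))
cars_data$Transport <- NULL  # drop the original 3-level target

set.seed(123)  # for reproducibility
train_idx <- sample(seq_len(nrow(cars_data)), size = floor(0.7 * nrow(cars_data)))
cars_train <- cars_data[train_idx, ]
cars_test  <- cars_data[-train_idx, ]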

# SMOTE() from the DMwR package
cars_train_balanced <- SMOTE(Transport1 ~ ., cars_train, perc.over = 5000, k = 5, perc.under = 100)


Output:

table(cars_train_balanced$Transport)

0 1
2150 2193
0 = Public Transport and Two-wheeler

1 = Car

As we can see, in the balanced data the number of people taking a car to work is now roughly equal to the number choosing other modes of transport.

Model building
Logistic regression
Code:

cars_logistic <- glm(Transport1 ~ ., data = cars_train_balanced, family = binomial(link = "logit"))

Summary:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -86.37896 4.95131 -17.446 < 2e-16 ***
Age 2.87489 0.16841 17.071 < 2e-16 ***
GenderMale -0.68311 0.22855 -2.989 0.00280 **
Engineer1 0.66629 0.24383 2.733 0.00628 **
MBA1 -1.16348 0.23072 -5.043 4.59e-07 ***
Work.Exp -1.59100 0.11557 -13.767 < 2e-16 ***
Salary 0.26796 0.02415 11.095 < 2e-16 ***
Distance 0.46442 0.04925 9.430 < 2e-16 ***
license1 1.35984 0.20723 6.562 5.31e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)


Null deviance: 6020.25 on 4342 degrees of freedom
Residual deviance: 759.41 on 4334 degrees of freedom
AIC: 777.41

Number of Fisher Scoring iterations: 9

Checking for multi-collinearity


# vif() from the 'car' package
vif(cars_logistic)
Age Gender Engineer MBA Work.Exp Salary Distance license
8.633221 1.028315 1.067926 1.161522 12.328141 3.303098 1.909529 1.052074

As the VIF of Work.Exp is greater than 10, it indicates the presence of multicollinearity in the data, so we drop the Work.Exp variable.
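
The prediction step feeding the confusion matrix below is not shown in the report; a minimal sketch, assuming the model is refit without Work.Exp and the test set is scored at a 0.5 cut-off:

# refit without the collinear Work.Exp, then attach predicted probabilities
cars_logistic2 <- glm(Transport1 ~ . - Work.Exp, data = cars_train_balanced,
                      family = binomial(link = "logit"))
cars_test$log.pred <- predict(cars_logistic2, newdata = cars_test, type = "response")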

Confusion Matrix
table(cars_test$Transport1, cars_test$log.pred > 0.5)

FALSE TRUE
0 110 5
1 0 18

Sensitivity = 100% (18/18)

Specificity = 95.7% (110/115)

KNN Model
First we normalize the data and then proceed with the model building.

Code:

# preProcess() and train() are from the caret package
range_model <- preProcess(cars_train_balanced, method = "range")

cars_train_normalized <- predict(range_model, cars_train_balanced)

cars_test_normalized <- predict(range_model, cars_test)

knn_fit <- train(Transport1 ~ ., data = cars_train_normalized, method = "knn",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneLength = 10)
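
The sensitivity and specificity reported below can be obtained by scoring the fitted model on the normalized test set; a minimal sketch using caret's confusionMatrix():

# predicted classes on the hold-out set, with Car (1) as the positive class
knn_pred <- predict(knn_fit, newdata = cars_test_normalized)
confusionMatrix(knn_pred, cars_test_normalized$Transport1, positive = "1")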

Confusion matrix:
Sensitivity : 1.0000
Specificity : 0.9217

Naïve Bayes
Code:

# naiveBayes() from the e1071 package; column 9 holds the target Transport1
cars_nb <- naiveBayes(x = cars_train_balanced[, -9], y = as.factor(cars_train_balanced[, 9]))

# make sure the dependent variable is a factor
pred_nb <- predict(cars_nb, newdata = cars_test[, -9])

table(cars_test[, 9], pred_nb)

Confusion Matrix:
pred_nb
0 1
0 111 4
1 1 17

With a sensitivity of about 94% (17/18) and a specificity of about 97% (111/115), we can say Naïve Bayes is also applicable to this data set.

Model validation:
Comparing the above three models and their metrics, we can say that the KNN model performed the best.

Bagging
Since a single model's output may be biased, ensemble methods such as bagging and boosting can be used to reduce the models' inherent errors.
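
A minimal sketch of the bagging fit matching the $call shown in the summary below (assumes the ipred and rpart packages; pred.bagging matches the confusion-matrix code further down):

library(ipred)
library(rpart)
# bagged classification trees, excluding the collinear Work.Exp
cars_bagging <- bagging(Transport1 ~ . - Work.Exp, data = cars_train_balanced,
                        control = rpart.control(maxdepth = 5, minsplit = 4))
cars_test$pred.bagging <- predict(cars_bagging, newdata = cars_test)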

Summary of bagging:
$btree
n= 4343

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 4343 2136 1 (0.491825927 0.508174073)
  2) Age< 30.00368 1993 6 0 (0.996989463 0.003010537) *
  3) Age>=30.00368 2350 149 1 (0.063404255 0.936595745)
    6) Salary< 15.90422 108 21 0 (0.805555556 0.194444444) *
    7) Salary>=15.90422 2242 62 1 (0.027653880 0.972346120) *

attr(,"class")
"sclass"

$OOB
[1] FALSE

$comb
[1] FALSE

$call
bagging.data.frame(formula = Transport1 ~ . - Work.Exp, data = cars_train_balanced,
    control = rpart.control(maxdepth = 5, minsplit = 4))

attr(,"class")

Confusion Matrix:
table(cars_test$Transport1, cars_test$pred.bagging)

0 1
0 112 3
1 2 16

Here the bagged model achieves a specificity of about 97.4% (112/115) while keeping a high sensitivity of about 88.9% (16/18), giving a reliable, well-balanced performance.

Actionable insights and recommendations:


- The data shows evidence of multicollinearity, which needs to be dealt with (here, by dropping Work.Exp).
- The data has a class-imbalance problem, which we addressed by applying the SMOTE algorithm.
- Amongst the three models – logistic regression, KNN and Naïve Bayes – KNN seems to be the best model, with excellent sensitivity and specificity.
- However, by applying bagging, we can get a more realistic and reliable output.
- The increase in sensitivity can lead to improved efficiency and conversion in customer reach-out campaigns.
