Sapan Parikh
BABIO – SEP’19
Contents
Machine Learning Project
  EDA
    Univariate Analysis
    Bi-variate Analysis
    Outlier
    Multi-collinearity
  Data preparation – SMOTE
  Model building
    Logistic regression
    KNN Model
    Naïve Bayes
    Model validation
  Bagging
  Actionable insights and recommendations
Machine Learning Project
The problem involves predicting the mode of commute of employees, specifically whether an employee will take a car to the office.
EDA
#importing data
At first glance, we can observe that the data has 444 observations on 9 variables.
Data Dictionary
Data Structure
'data.frame': 444 obs. of 9 variables:
$ Age : int 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
$ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : int 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...
Data Summary
Age Gender Engineer MBA Work.Exp
Min. :18.00 Female:128 Min. :0.0000 Min. :0.0000 Min. : 0.0
1st Qu.:25.00 Male :316 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 3.0
Median :27.00 Median :1.0000 Median :0.0000 Median : 5.0
Mean :27.75 Mean :0.7545 Mean :0.2528 Mean : 6.3
3rd Qu.:30.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 8.0
Max. :43.00 Max. :1.0000 Max. :1.0000 Max. :24.0
NA's :1
Salary Distance license Transport
Min. : 6.50 Min. : 3.20 Min. :0.0000 2Wheeler : 83
1st Qu.: 9.80 1st Qu.: 8.80 1st Qu.:0.0000 Car : 61
Median :13.60 Median :11.00 Median :0.0000 Public Transport:300
Mean :16.24 Mean :11.32 Mean :0.2342
3rd Qu.:15.72 3rd Qu.:13.43 3rd Qu.:0.0000
Max. :57.00 Max. :23.40 Max. :1.0000
Number of distinct values per variable:
Age Gender Engineer MBA Work.Exp Salary Distance license Transport
 25      2        2   2       24    122      137       2         3
Univariate Analysis
One-by-one numeric variable analysis was carried out for Age, Salary, Work Exp., and Distance (distribution plots omitted).
Bi-variate Analysis
Transport Vs. Age
This shows that people driving a car to the office have a significantly higher median age than users of the other two modes.
As can be observed, employees driving a car to work cover a comparatively longer median distance.
The diagram above shows that the median salary of those taking their car to the office is several times higher than that of those using other modes of transport.
Likewise, the graph above suggests that highly experienced employees usually take a car to work, which is also intuitive.
The above were categorical vs. numeric bi-variate analyses.
We can see that for the Car segment, the ratio of engineers to non-engineers is the highest, close to 5:1.
Similarly, the Car segment has a higher male-to-female ratio than the other segments.
The behaviour remains almost the same across all three categories.
Here it is evident that, unlike the other two segments, car drivers have a much higher probability of holding a license.
Outlier
Code for capturing outliers (1.5 × IQR rule) for numeric variables, shown here for Age:
IQRAge <- IQR(cars_numeric$Age)
AgeOut <- subset(cars_numeric,
                 Age < quantile(Age, 0.25) - 1.5 * IQRAge |
                 Age > quantile(Age, 0.75) + 1.5 * IQRAge)
dim(AgeOut)
Outlier counts per variable:
Age – 25
Work Exp. – 38
Salary – 59
Distance – 9
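The same 1.5 × IQR rule can be wrapped in a small helper to produce per-variable counts like those above (a base-R sketch; the name outlier_count is hypothetical):

```r
# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outlier_count <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

outlier_count(c(rep(10, 20), 100))  # the single extreme value is flagged
```

Applied column-wise (e.g. via sapply over the numeric columns), this reproduces a count per variable in one pass.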
Multi-collinearity
Correlation Plot amongst numeric variables
As we can see, there is a very high positive correlation between Work.Exp and Age, as well as between Work.Exp and Salary. This will be dealt with later, during model building.
Data preparation – SMOTE
We need to oversample the car commuters in the data. To simplify, we also recode the dependent variable as a binary dummy ('Transport1' in the code that follows), coding "Car" as 1 and all other modes as 0.
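A sketch of the recoding step on a toy vector (the balancing call is commented out because the package choice and its parameters are assumptions, not shown in the original):

```r
# Binary dummy: "Car" -> 1, all other modes -> 0
transport <- c("2Wheeler", "Car", "Public Transport", "Car")
transport1 <- factor(ifelse(transport == "Car", 1, 0))
table(transport1)

# Balancing step (hypothetical example, e.g. with the DMwR package):
# library(DMwR)
# cars_train_balanced <- SMOTE(Transport1 ~ ., data = cars_train)
```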
table(cars_train_balanced$Transport)
0 1
2150 2193
0 = Public Transport and Two-wheeler
1 = Car
As we can see, after balancing, the number of people taking a car to work is almost equal to the number choosing other modes of transport.
Model building
Logistic regression
Code:
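The original code block was not reproduced here; below is a minimal sketch of such a call on a stand-in data frame (the actual fit used cars_train_balanced with all predictors, per the coefficient table that follows, so the data and variable subset here are illustrative assumptions):

```r
# Stand-in data frame; the real model was fit on cars_train_balanced
set.seed(7)
demo <- data.frame(
  Transport1 = factor(rep(c(0, 1), each = 25)),
  Salary     = c(rnorm(25, 12, 3), rnorm(25, 18, 3)),
  Distance   = c(rnorm(25, 10, 3), rnorm(25, 14, 3))
)
log_model <- glm(Transport1 ~ Salary + Distance, data = demo, family = binomial)
summary(log_model)
```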
Summary:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -86.37896 4.95131 -17.446 < 2e-16 ***
Age 2.87489 0.16841 17.071 < 2e-16 ***
GenderMale -0.68311 0.22855 -2.989 0.00280 **
Engineer1 0.66629 0.24383 2.733 0.00628 **
MBA1 -1.16348 0.23072 -5.043 4.59e-07 ***
Work.Exp -1.59100 0.11557 -13.767 < 2e-16 ***
Salary 0.26796 0.02415 11.095 < 2e-16 ***
Distance 0.46442 0.04925 9.430 < 2e-16 ***
license1 1.35984 0.20723 6.562 5.31e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As the VIF of 'Work.Exp' is greater than 10, it indicates the presence of multicollinearity in the data, so we can drop the Work.Exp variable.
Confusion Matrix
table(cars_test$Transport1 ,cars_test$log.pred>0.5)
FALSE TRUE
0 110 5
1 0 18
Sensitivity = 18/18 = 100%
Specificity = 110/115 ≈ 95.7%
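Both figures follow directly from the confusion matrix above, taking class 1 (Car) as the positive class (a base-R check):

```r
# Rows = actual (0/1), columns = predicted (FALSE/TRUE), values from the table above
cm <- matrix(c(110, 0, 5, 18), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE")))
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ])   # TP / (TP + FN)
specificity <- cm["0", "FALSE"] / sum(cm["0", ])  # TN / (TN + FP)
round(c(sensitivity = sensitivity, specificity = specificity), 3)
# sensitivity 1.000, specificity 0.957
```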
KNN Model
First we normalize the data and then proceed with the model building.
Code (reconstructed around the surviving 'tuneLength = 10' fragment; the exact caret::train arguments are an assumption):
knn_model <- train(Transport1 ~ ., data = cars_train_balanced,
                   method = "knn",
                   trControl = trainControl(method = "cv", number = 10),
                   tuneLength = 10)
Confusion matrix:
Sensitivity : 1.0000
Specificity : 0.9217
Naïve Bayes
Code:
cars_nb <- naiveBayes(x = cars_train_balanced[, -9], y = as.factor(cars_train_balanced[, 9]))
pred_nb <- predict(cars_nb, newdata = cars_test[, -9])
table(cars_test[,9],pred_nb)
Confusion Matrix:
pred_nb
0 1
0 111 4
1 1 17
With a sensitivity of 17/18 ≈ 94% and a specificity of 111/115 ≈ 97%, Naïve Bayes also performs well on this data-set.
Model validation:
Based on the above three models and their hold-out performance, we can say that the KNN model performed the best.
Bagging
As a single model's output might be biased, methods like bagging and boosting can be used to reduce the models' inherent errors.
Summary of bagging:
$btree
n= 4343
attr(,"class")
class
"sclass"
$OOB
[1] FALSE
$comb
[1] FALSE
$call
bagging.data.frame(formula = Transport1 ~ . - Work.Exp, data = cars_train_balanced,
control = rpart.control(maxdepth = 5, minsplit = 4))
attr(,"class")
Confusion Matrix:
table(cars_test$Transport1,cars_test$pred.bagging)
0 1
0 112 3
1 2 16
Here, sensitivity = 16/18 ≈ 89% and specificity = 112/115 ≈ 97%, so the bagged model maintains strong performance on both measures.