
Transport

Car Transport Prediction Modelling

R Venkataraman

21 June 2020



PGP BABI

Group 5
Index

Project Description & Objective ………………………………………………………………. 3

Project Report……………………………………………………………………………………..... 4-26

Reference……………………………………………………………………………………………….. 27
Car Transport Prediction Assessment

Project Description

This project requires you to understand what mode of transport employees prefer to
commute to their office. The attached data 'Cars.csv' includes employee information about
their mode of transport, as well as personal and professional details such as age, salary, and
work experience. We need to predict whether or not an employee will use Car as a mode of
transport, and determine which variables are significant predictors of this decision.

Project Objective

• EDA - Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers
and missing values and check the summary of the dataset
• EDA - Illustrate the insights based on EDA
• EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.
• Data Preparation (SMOTE)
• Applying Logistic Regression & Interpret results
• Applying KNN Model & Interpret results
• Applying Naïve Bayes Model & Interpret results (is it applicable here? comment and if
it is not applicable, how can you build an NB model in this case?)
• Confusion matrix interpretation
• Remarks on Model validation exercise <Which model performed the best>
• Bagging
• Boosting
• Actionable Insights and Recommendations

3|Page

Project Report

EDA - Basic data summary, Univariate, Bivariate analysis, graphs

With the necessary R libraries loaded and the working directory set, the dataset is
loaded into R. An initial glimpse of the data follows:

We have 8 independent variables and 1 dependent variable ('Transport') in the given dataset.
We have 444 rows, which can be split into train and test datasets for model building.

Data Description:

Age        Age of the Employee in Years
Gender     Gender of the Employee
Engineer   Engineer = 1, Non-Engineer = 0
MBA        MBA = 1, Non-MBA = 0
Work Exp   Experience in Years
Salary     Salary in Lakhs per Annum
Distance   Distance in Kms from Home to Office
license    Driving Licence = 1, No Licence = 0
Transport  Mode of Transport

Initial summary of the data


Data structure just after loading

Missing values are checked, with the results below:

The MBA field has one missing record, treated as below:

There are no missing values now.
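A minimal sketch of this treatment (the data frame name `mycar` and the toy values are assumptions standing in for the real data; the single NA in MBA is replaced with the column mode, 0, since roughly 75% of employees are non-MBA):

```r
# Hypothetical sketch of the one-record MBA imputation; 'mycar'
# and its values are illustrative, not the actual dataset.
mycar <- data.frame(MBA = c(1, 0, 0, NA, 0, 0, 1, 0))

mode_mba <- as.numeric(names(which.max(table(mycar$MBA))))  # most frequent value
mycar$MBA[is.na(mycar$MBA)] <- mode_mba

sum(is.na(mycar$MBA))   # 0 -> no missing values remain
```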

Final summary of the dataset


Univariate Analysis

Age:

Basic summary: Mean = 27.75, Std. dev = 4.416, skew = 0.9488. The 24-32 age group has the
maximum count, with 64%.

No of outliers: 25
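A sketch of how such outlier counts can be computed with the 1.5 × IQR boxplot rule (the `age` vector below is illustrative toy data, not the actual column):

```r
# Count values beyond the boxplot fences Q1 - 1.5*IQR / Q3 + 1.5*IQR.
age <- c(rep(25, 8), 28, 30, 45, 50)   # toy data with two high values

q   <- quantile(age, c(0.25, 0.75))
iqr <- q[2] - q[1]
out <- age < q[1] - 1.5 * iqr | age > q[2] + 1.5 * iqr
sum(out)   # 2 outliers (45 and 50)
```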

Work Experience:

Basic summary: Mean = 6.3, Std. dev = 5.11, skew = 1.344. The 0-4 year and 4-8 year work
experience groups have the maximum count, with 45% and 32% respectively.


No of outliers: 38

Salary:

Basic summary: Mean = 16.24, Std. dev = 10.45, skew = 2.031. The 0-10 Lakh and 10-20 Lakh
salary groups have the maximum count, with 28% and 55% respectively.

No of outliers: 59

Distance:

Basic summary: Mean = 11.32 km, Std. dev = 3.606, skew = 0.536.


The 6-12 km and 12-18 km distance groups have the maximum count, with 56% and 35% respectively.

No of outliers: 9

Gender:

We have 29% female and 71% male employees in the dataset.

Engineer:

We have 75% engineers and 25% non-engineers in the dataset.


MBA:

We have 25% MBA graduates and 75% non-MBA in the dataset.

License:

76% of the people have no driving license, and only the remaining 24% have one.


Transport:

This is the dependent variable to be predicted. The actual data shows that 19% use a 2Wheeler,
14% use a Car, and the remaining 67% use public transport.

Outlier treatment is not carried out in this exercise, as it is not explicitly asked for.

Bi-Variate Analysis

We will analyze how the dependent variable (Transport) trends with the independent
variables.

Age Vs Transport:

We can notice more Car usage in the 32-40 years age group.


Work Experience Vs Transport:

People with 8 or more years of experience start using Car, and the probability increases with
experience.

Salary Vs Transport:

The probability of using Car increases once the salary crosses 30 Lakhs per annum.


Distance Vs Transport:

The probability of using Car increases as the distance increases.

Gender Vs Transport:

We can notice that more males use Car than females.

Engineer Vs Transport:

The probability of using Car as transport is highest when the person is an Engineer.


MBA Vs Transport:

Having an MBA degree doesn't influence the usage of Car.

License Vs Transport:

People having a license have a higher probability of using Car.


Multicollinearity

We will check for multicollinearity among the independent variables. For this exercise, we will
remove the Gender variable.

Both the correlation coefficients and an elliptical portrayal of the same are shown below:

Age & Work experience are highly correlated.
Salary & Age are highly correlated.
Salary & Work experience are highly correlated.

We will confirm and treat the multicollinearity using the VIF function.

Work experience can be removed from the models.
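A runnable sketch of a VIF check in base R, using the identity VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the remaining ones (the toy columns below stand in for Age, Work Exp and Salary; the report itself uses a library VIF function):

```r
# Toy collinear data: work experience is strongly tied to age,
# and salary is strongly tied to work experience.
set.seed(1)
age  <- rnorm(100, 28, 4)
wexp <- age - 22 + rnorm(100, 0, 1)
sal  <- 2 * wexp + rnorm(100, 0, 3)
d    <- data.frame(age, wexp, sal)

# VIF for each column: regress it on the others, read off R^2.
vif <- sapply(names(d), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(d), v), v), data = d))$r.squared
  1 / (1 - r2)
})
round(vif, 1)   # values well above 5 (or 10) flag variables for removal
```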



Data Preparation (SMOTE):

The original dataset is extended with a new variable "tpt" using the logic below. This is
done as the current scope is to predict the usage of Car as transport.

Transport          tpt (new variable)
Car                1
2Wheeler           0
Public Transport   0

The dataset is split into training & test dataset with 70% & 30% respectively.
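A sketch of the recoding and the 70/30 split (the frame `mycar` and its ten toy rows are assumptions standing in for the real data):

```r
# Toy stand-in for the dataset, with the report's Transport levels.
mycar <- data.frame(Transport = c("Car", "2Wheeler", "Public Transport",
                                  "Public Transport", "Car", "2Wheeler",
                                  "Public Transport", "Public Transport",
                                  "Public Transport", "Car"))

# New binary target: 1 = Car, 0 = everything else.
mycar$tpt <- ifelse(mycar$Transport == "Car", 1, 0)

# 70/30 random split into train and test.
set.seed(123)
idx   <- sample(seq_len(nrow(mycar)), size = 0.7 * nrow(mycar))
train <- mycar[idx, ]
test  <- mycar[-idx, ]
c(train = nrow(train), test = nrow(test))   # 7 and 3 rows
```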

We can verify that the proportions of 0 & 1 are similar across the training and test datasets.

We will apply SMOTE using the DMwR library to even out the class imbalance in the training
dataset, as the model will be trained on this dataset.

By tuning perc.over and perc.under, the desired balance is achieved as follows:

The class imbalance has been rectified by the SMOTE method. The values 0 & 1 are now equally
present in the training dataset for modelling.


Logistic Regression:

Logistic regression is performed on the training dataset, excluding the work experience
variable (due to multicollinearity).
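A minimal sketch of the fit (the data below are simulated stand-ins using the report's column names; Work Exp is deliberately left out of the formula because of the multicollinearity found earlier):

```r
# Simulated stand-in training data with the report's columns.
set.seed(7)
n <- 200
train <- data.frame(
  Age      = rnorm(n, 28, 4),
  Salary   = rnorm(n, 16, 10),
  Distance = rnorm(n, 11, 3.6),
  license  = rbinom(n, 1, 0.24),
  MBA      = rbinom(n, 1, 0.25)
)
train$tpt <- rbinom(n, 1, plogis(-12 + 0.35 * train$Age + 0.2 * train$Distance))

# Binomial GLM = logistic regression; Work Exp excluded.
fit  <- glm(tpt ~ Age + Salary + Distance + license + MBA,
            data = train, family = binomial)
prob <- predict(fit, newdata = train, type = "response")   # P(Car) per row
```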

Summary of the model:

Variables of Importance:


The model is applied to the test dataset, with the results below:

AUC = 91%, GINI = 82%

KS = 82%
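A sketch of how these three measures relate, on toy scores: AUC via the rank identity P(score_pos > score_neg), GINI = 2 × AUC − 1, and KS as the maximum gap between the class-wise score CDFs (the scores and labels below are illustrative, not the model's output):

```r
# Toy predicted scores and true labels (1 = Car).
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   0,   0,   0,   0)

pos <- score[label == 1]
neg <- score[label == 0]

auc  <- mean(outer(pos, neg, ">"))   # P(positive outranks negative)
gini <- 2 * auc - 1
ks   <- max(abs(ecdf(pos)(score) - ecdf(neg)(score)))
round(c(auc = auc, gini = gini, ks = ks), 3)   # 0.933, 0.867, 0.800
```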

This model performed well on both the training and test datasets; the accuracy and specificity
are extremely good.

We can notice that the variables Age, Distance, license and MBA are very significant. Age,
Distance and license have positive coefficients, while MBA has a negative one.


KNN Model:

For KNN, we will normalize the data, as distance, salary and experience are on different
scales compared with the other factor variables.

After normalization, training and test data sets will be created. To overcome the imbalance in
the classes, SMOTE will be applied on the training set.
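A sketch of the min-max normalization applied before KNN, so that no single scale dominates the distance computation (the vectors below are toy values, not the actual columns):

```r
# Rescale each column to the [0, 1] range.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

salary   <- c(8, 12, 16, 30, 50)   # Lakhs
distance <- c(5, 8, 11, 14, 20)    # km

norm_df <- data.frame(salary = normalize(salary),
                      distance = normalize(distance))
sapply(norm_df, range)   # every column now lies in [0, 1]
```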

With the best tune of k = 5, the training set is used to predict the values in the test set:

mycarkpred = knn(mycarknn_trainSM[, c(1:8)], mycarknn_test[, c(1:8)],
                 mycarknn_trainSM$Transport, k = 5)

AUC = 87.35%, GINI = 75%


KS = 74.7

Interpretation: While this model has good accuracy and sensitivity, it loses out to glm on
specificity. Its main disadvantage is that it doesn't learn a model from the training set, but
simply uses the training set itself for classification.

Naïve Bayes Model:

We will use the same training and test datasets as in the logistic regression for this NB
model, with the training set balanced using SMOTE.

Applying the NB prediction (trained on the SMOTE training set) to the test dataset, we get:
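The report fits NB with a library routine; as a runnable illustration, this hand-rolled one-feature Gaussian Naïve Bayes sketch (toy ages and labels, all assumptions) shows the class-conditional scoring the model relies on:

```r
# Toy data: younger employees (0) vs Car users (1).
x <- c(20, 22, 25, 27, 35, 38, 40, 42)   # ages
y <- c(0,  0,  0,  0,  1,  1,  1,  1)    # 1 = Car

# Gaussian class-conditional likelihood for one feature.
lik <- function(v, cls) dnorm(v, mean(x[y == cls]), sd(x[y == cls]))

# Posterior P(Car | age) via Bayes' rule with empirical priors.
p_car <- function(v) {
  p1 <- lik(v, 1) * mean(y == 1)
  p0 <- lik(v, 0) * mean(y == 0)
  p1 / (p1 + p0)
}
p_car(39) > 0.5   # TRUE: a 39-year-old is scored as a Car user
```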


AUC = 97%, GINI = 94%

KS = 80%

Interpretation:

While the model has excellent confusion matrix values, AUC, GINI etc., the following are the
reasons why it cannot be used in this case:

The algorithm makes a very strong assumption that the features are independent of each
other, while in reality they are not. In our case, Age and work experience are closely
related; we have also seen that qualifications like Engineer or MBA attract higher salaries,
and Salary influences Car usage.

Not all the input variables are categorical, so Naïve Bayes cannot be used to full advantage:
salary, work experience and distance are numerical. For these, the model assumes a normal
distribution, whereas we have seen skewness in Age, work experience, salary etc.

So this model cannot be used in this exercise.


Confusion Matrix Interpretation & Model validation:

Measures             Logistic Regression   KNN Model   Naïve Bayes
Confusion Matrix
  Accuracy                  96%               95%          94%
  Sensitivity               83%               97%          96%
  Specificity               98%               78%          78%
  Balanced Accuracy         91%               87%          87%
AUC                         91%               87%          97%
GINI                        82%               75%          94%
KS                          82%               77%          80%

The confusion matrix is extremely useful for measuring Recall, Precision, Specificity,
Accuracy and, most importantly, the AUC-ROC curve.

True Positive: You predicted positive and it’s true.

True Negative: You predicted negative and it’s true.

False Positive (Type 1 Error): You predicted positive and it’s false.

False Negative (Type 2 Error): You predicted negative and it’s false.
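The definitions above can be turned into the reported measures directly from the four cell counts of a 2×2 confusion matrix (the counts below are toy values, with '1' = Car taken as the positive class):

```r
# Toy confusion-matrix cell counts.
tp <- 40; fn <- 10; fp <- 5; tn <- 145

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)                 # recall on Car users
specificity <- tn / (tn + fp)                 # recall on non-Car users
balanced    <- (sensitivity + specificity) / 2
round(c(accuracy, sensitivity, specificity, balanced), 3)
# 0.925 0.800 0.967 0.883
```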

In our exercise, we have used '0' for 2Wheeler/Public Transport and '1' for Car. To evaluate
the models, specificity is the parameter which decides how correctly we have predicted Car
usage.

As this is a classification problem, AUC is the best parameter for model validation, closely
followed by the KS parameter.

In our case, logistic regression performs best on this measure (with NB not considered, due
to the non-independence of the features). Also, we have framed the problem as Car vs no-Car,
a binary classification rather than a multi-class one, and logistic regression works well in
such cases.

Logistic Regression is the model to use for this exercise.


Bagging & Boosting:

We will use ensemble methods to fine-tune the prediction.

Bagging decreases the variance of the prediction by drawing bootstrap samples (sampling with
replacement) from the dataset to produce multiple training sets.

Boosting is an iterative technique which adjusts the weight of an observation based on the
last classification.

Bagging: We will use the ipred library in R for this function.

mycar.bagging = bagging(tpt ~ ., data = mycartrain,
                        control = rpart.control(maxdepth = 5, minsplit = 4))

mycartest$pred.tpt = predict(mycar.bagging, mycartest)

AUC = 89%, GINI = 78%

KS = 78%


Boosting: We will use the gbm library to perform gradient boosting.

mycargbm.fit = gbm(
formula = tpt ~ .,
distribution = "bernoulli",
data = mycartrain,
n.trees = 3000,
interaction.depth = 1,
shrinkage = 0.001,
n.cores = NULL,
verbose = FALSE
)

AUC = 93%, GINI = 86%

KS = 86%


Extreme Gradient Boosting (XGBoost):

### XGBoost works with numeric data only, so we convert any factors to numeric
### and create the train and test sets

After converting to numeric matrices, xgboost is run:

carxgb.fit <- xgboost(
  data = gd_features_train,
  label = gd_label_train,
  eta = 0.001,
  max_depth = 3,
  min_child_weight = 3,
  nrounds = 5000,
  nfold = 5,
  objective = "binary:logistic",
  verbose = 1,
  early_stopping_rounds = 10
)

The model is then tuned for the best values of eta, max_depth & nrounds. The fit above is
re-run holding two of the parameters constant and looping over the third:

lr <- c(0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 1)
md <- c(1, 3, 5, 7, 9, 15)
nr <- c(10, 50, 100, 500, 1000, 3000, 5000, 7500, 10000)

The best fit: eta = 0.1, max_depth = 5, nrounds = 100.
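The one-at-a-time loop described above can be sketched as follows; `score_model` here is a made-up stand-in for refitting xgboost at each eta and reading off its test AUC (it is deliberately constructed to peak at eta = 0.1 for illustration, and is not the real evaluation):

```r
# Candidate learning rates, as in the report.
lr <- c(0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 1)

# Hypothetical scoring function standing in for an xgboost refit.
score_model <- function(eta) -abs(log10(eta) + 1)

auc_by_eta <- sapply(lr, score_model)   # score each candidate
best_eta   <- lr[which.max(auc_by_eta)]
best_eta   # 0.1
```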


AUC = 89%, GINI = 78%

KS = 78%

While XGBoost has improved the accuracy, sensitivity and specificity, there is a 4% drop in
AUC.

Overall summary of all models

Measures             Logistic Regression   KNN Model   Naïve Bayes   Bagging   GBM    XGBoost
Confusion Matrix
  Accuracy                  96%               95%          94%         97%     96%     97%
  Sensitivity               83%               97%          96%         97%     98%     97%
  Specificity               98%               78%          78%        100%     84%    100%
  Balanced Accuracy         91%               87%          87%         98%     91%     98%
AUC                         91%               87%          97%         89%     93%     89%
GINI                        82%               75%          94%         78%     86%     78%
KS                          82%               77%          80%         77%     86%     78%


Actionable Insights & Recommendations

Bivariate analysis and the various models have shown Age to be an important parameter for
choosing Car as transport. The probability of choosing Car is high after the age of 32, and
there is a large population between 24 & 32 years of age (64%) who can be targeted. This can
be further filtered using the other key variables (distance, license etc.).

The probability of using Car increases once the distance exceeds 12 km. There is a large
population in the 6-12 km group which can be targeted through marketing/advertisement
campaigns.

The probability of using Car is high once the salary reaches 30 Lakhs. Again, there is a
large chunk of the population between 10-30 Lakhs who can be filtered on other variables and
targeted.

There is a gender disparity in Car usage, with more men, but this is understandable given
that 71% of the dataset is male.

Logistic regression has also shown that Age and distance have positive coefficients, and
these two variables can be used for key targeting.


References:

Great Learning Videos & Course Materials

CRAN package documentation

