
Transport

Car Transport Prediction Modelling

R Venkataraman

21 June 2020



PGP BABI

Group 5
Index

Project Description & Objective ………………………………………………………………. 3

Project Report……………………………………………………………………………………..... 4-26

Reference……………………………………………………………………………………………….. 27
Car Transport Prediction Assessment

Project Description

This project requires you to understand what mode of transport employees prefer to
commute to their office. The attached data 'Cars.csv' includes employee information about
their mode of transport, as well as personal and professional details such as age, salary, and
work experience. We need to predict whether or not an employee will use Car as a mode of
transport, and determine which variables are significant predictors of this decision.

Project Objective

• EDA - Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers
and missing values and check the summary of the dataset
• EDA - Illustrate the insights based on EDA
• EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.
• Data Preparation (SMOTE)
• Applying Logistic Regression & Interpret results
• Applying KNN Model & Interpret results
• Applying Naïve Bayes Model & Interpret results (is it applicable here? comment and if
it is not applicable, how can you build an NB model in this case?)
• Confusion matrix interpretation
• Remarks on Model validation exercise <Which model performed the best>
• Bagging
• Boosting
• Actionable Insights and Recommendations

3|Page

Project Report

EDA - Basic data summary, Univariate, Bivariate analysis, graphs

With the necessary R libraries loaded and the working directory set, the dataset is
loaded into R. An initial glimpse of the data follows:

We have 8 independent variables and 1 dependent variable ('Transport') in the given dataset.
We have 444 rows, which can be split into train and test datasets for model building.

Data Description:

Age        Age of the Employee in Years
Gender     Gender of the Employee
Engineer   Engineer = 1, Non-Engineer = 0
MBA        MBA = 1, Non-MBA = 0
Work Exp   Experience in Years
Salary     Salary in Lakhs per Annum
Distance   Distance in Kms from Home to Office
license    Driving Licence = 1, No Licence = 0
Transport  Mode of Transport

Initial summary of the data


Data structure just after loading

Missing values are checked, with the results below:

The MBA field has one missing record, treated as below:

There are no missing values now.
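A minimal sketch of this treatment (the data frame name `mycar` and the toy values are assumptions standing in for the real data; the single NA in MBA is replaced with the column mode, 0, since roughly 75% of employees are non-MBA):

```r
# Hypothetical sketch of the one-record MBA imputation; 'mycar'
# and its values are illustrative, not the actual dataset.
mycar <- data.frame(MBA = c(1, 0, 0, NA, 0, 0, 1, 0))

mode_mba <- as.numeric(names(which.max(table(mycar$MBA))))  # most frequent value
mycar$MBA[is.na(mycar$MBA)] <- mode_mba

sum(is.na(mycar$MBA))   # 0 -> no missing values remain
```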

Final summary of the dataset


Univariate Analysis

Age:

Basic summary: Mean = 27.75, Std. dev = 4.416, skew = 0.9488. The 24-32 age group has the
maximum count, with 64%.

No of outliers: 25
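A sketch of how such outlier counts can be computed with the 1.5 × IQR boxplot rule (the `age` vector below is illustrative toy data, not the actual column):

```r
# Count values beyond the boxplot fences Q1 - 1.5*IQR / Q3 + 1.5*IQR.
age <- c(rep(25, 8), 28, 30, 45, 50)   # toy data with two high values

q   <- quantile(age, c(0.25, 0.75))
iqr <- q[2] - q[1]
out <- age < q[1] - 1.5 * iqr | age > q[2] + 1.5 * iqr
sum(out)   # 2 outliers (45 and 50)
```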

Work Experience:

Basic summary: Mean = 6.3, Std. dev = 5.11, skew = 1.344. The 0-4 year and 4-8 year work
experience groups have the maximum count, with 45% and 32% respectively.


No of outliers: 38

Salary:

Basic summary: Mean = 16.24, Std. dev = 10.45, skew = 2.031. The 0-10 Lakh and 10-20 Lakh
salary groups have the maximum count, with 28% and 55% respectively.

No of outliers: 59

Distance:

Basic summary: Mean = 11.32 km, Std. dev = 3.606, skew = 0.536.


The 6-12 km and 12-18 km distance groups have the maximum count, with 56% and 35% respectively.

No of outliers: 9

Gender:

We have 29% female and 71% male employees in the dataset.

Engineer:

We have 75% engineers and 25% non-engineers in the dataset.


MBA:

We have 25% MBA graduates and 75% non-MBA in the dataset.

License:

76% of the people have no driving license, and only the remaining 24% have one.


Transport:

This is the dependent variable to be predicted. The actual data shows that 19% use a 2Wheeler,
14% use a Car, and the remaining 67% use public transport.

Outlier treatment is not carried out in this exercise, as it is not explicitly asked for.

Bi-Variate Analysis

We will analyze how the dependent variable (Transport) trends with the independent
variables.

Age Vs Transport:

We can notice more Car usage in the 32-40 years age group.


Work Experience Vs Transport:

People with 8 or more years of experience start using Car, and the probability increases with
experience.

Salary Vs Transport:

The probability of using Car increases once the salary crosses 30 Lakhs per annum.


Distance Vs Transport:

The probability of using Car increases as the distance increases.

Gender Vs Transport:

We can notice that more males use Car than females.

Engineer Vs Transport:

The probability of using Car as transport is highest when the person is an Engineer.


MBA Vs Transport:

Having an MBA degree doesn't influence the usage of Car.

License Vs Transport:

People having a license have a higher probability of using Car.


Multicollinearity

We will check for multicollinearity among the independent variables. For this exercise, we will
remove the Gender variable.

Both the correlation coefficients and an elliptical portrayal of the same are shown below:

Age & Work experience are highly correlated.
Salary & Age are highly correlated.
Salary & Work experience are highly correlated.

We will confirm and treat the multicollinearity using the VIF function.

Work experience can be removed from the models.
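A runnable sketch of a VIF check in base R, using the identity VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the remaining ones (the toy columns below stand in for Age, Work Exp and Salary; the report itself uses a library VIF function):

```r
# Toy collinear data: work experience is strongly tied to age,
# and salary is strongly tied to work experience.
set.seed(1)
age  <- rnorm(100, 28, 4)
wexp <- age - 22 + rnorm(100, 0, 1)
sal  <- 2 * wexp + rnorm(100, 0, 3)
d    <- data.frame(age, wexp, sal)

# VIF for each column: regress it on the others, read off R^2.
vif <- sapply(names(d), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(d), v), v), data = d))$r.squared
  1 / (1 - r2)
})
round(vif, 1)   # values well above 5 (or 10) flag variables for removal
```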



Data Preparation (SMOTE):

The original dataset is extended with a new variable "tpt" using the logic below. This is
done as the current scope is to predict the usage of Car as transport.

Transport          tpt (new variable)
Car                1
2Wheeler           0
Public Transport   0

The dataset is split into training & test dataset with 70% & 30% respectively.
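A sketch of the recoding and the 70/30 split (the frame `mycar` and its ten toy rows are assumptions standing in for the real data):

```r
# Toy stand-in for the dataset, with the report's Transport levels.
mycar <- data.frame(Transport = c("Car", "2Wheeler", "Public Transport",
                                  "Public Transport", "Car", "2Wheeler",
                                  "Public Transport", "Public Transport",
                                  "Public Transport", "Car"))

# New binary target: 1 = Car, 0 = everything else.
mycar$tpt <- ifelse(mycar$Transport == "Car", 1, 0)

# 70/30 random split into train and test.
set.seed(123)
idx   <- sample(seq_len(nrow(mycar)), size = 0.7 * nrow(mycar))
train <- mycar[idx, ]
test  <- mycar[-idx, ]
c(train = nrow(train), test = nrow(test))   # 7 and 3 rows
```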

We can verify that the proportions of 0 & 1 are similar across the training and test datasets.

We will apply SMOTE using the DMwR library to even out the class imbalance in the training
dataset, as the model will be trained on this dataset.

By tuning perc.over and perc.under, the desired balance is achieved as follows:

The class imbalance has been rectified by the SMOTE method. The values 0 & 1 are now equally
present in the training dataset for modelling.


Logistic Regression:

Logistic regression is performed on the training dataset, excluding the work experience
variable (due to multicollinearity).
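A minimal sketch of the fit (the data below are simulated stand-ins using the report's column names; Work Exp is deliberately left out of the formula because of the multicollinearity found earlier):

```r
# Simulated stand-in training data with the report's columns.
set.seed(7)
n <- 200
train <- data.frame(
  Age      = rnorm(n, 28, 4),
  Salary   = rnorm(n, 16, 10),
  Distance = rnorm(n, 11, 3.6),
  license  = rbinom(n, 1, 0.24),
  MBA      = rbinom(n, 1, 0.25)
)
train$tpt <- rbinom(n, 1, plogis(-12 + 0.35 * train$Age + 0.2 * train$Distance))

# Binomial GLM = logistic regression; Work Exp excluded.
fit  <- glm(tpt ~ Age + Salary + Distance + license + MBA,
            data = train, family = binomial)
prob <- predict(fit, newdata = train, type = "response")   # P(Car) per row
```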

Summary of the model:

Variables of Importance:


The model is applied to the test dataset, with the results below:

AUC = 91%, GINI = 82%

KS = 82%
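A sketch of how these three measures relate, on toy scores: AUC via the rank identity P(score_pos > score_neg), GINI = 2 × AUC − 1, and KS as the maximum gap between the class-wise score CDFs (the scores and labels below are illustrative, not the model's output):

```r
# Toy predicted scores and true labels (1 = Car).
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   0,   0,   0,   0)

pos <- score[label == 1]
neg <- score[label == 0]

auc  <- mean(outer(pos, neg, ">"))   # P(positive outranks negative)
gini <- 2 * auc - 1
ks   <- max(abs(ecdf(pos)(score) - ecdf(neg)(score)))
round(c(auc = auc, gini = gini, ks = ks), 3)   # 0.933, 0.867, 0.800
```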

This model performed well on both the training and test datasets; the accuracy and specificity
are extremely good.

We can notice that the variables Age, Distance, license and MBA are very significant. Age,
Distance and license have positive coefficients, while MBA has a negative one.


KNN Model:

For KNN, we will normalize the data, as distance, salary and experience are on different
scales compared with the other factor variables.

After normalization, training and test data sets will be created. To overcome the imbalance in
the classes, SMOTE will be applied on the training set.
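A sketch of the min-max normalization applied before KNN, so that no single scale dominates the distance computation (the vectors below are toy values, not the actual columns):

```r
# Rescale each column to the [0, 1] range.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

salary   <- c(8, 12, 16, 30, 50)   # Lakhs
distance <- c(5, 8, 11, 14, 20)    # km

norm_df <- data.frame(salary = normalize(salary),
                      distance = normalize(distance))
sapply(norm_df, range)   # every column now lies in [0, 1]
```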

With the best tune of k = 5, the training set is used to predict the values in the test set:

mycarkpred = knn(mycarknn_trainSM[, c(1:8)], mycarknn_test[, c(1:8)],
                 mycarknn_trainSM$Transport, k = 5)

AUC = 87.35%, GINI = 75%


KS = 74.7

Interpretation: While this model has good accuracy and sensitivity, it loses out to glm on
specificity. Its main disadvantage is that it doesn't learn a model from the training set, but
simply uses the training set itself for classification.

Naïve Bayes Model:

We will use the same training and test datasets as in the logistic regression for this NB
model, with the training set balanced using SMOTE.

Applying the NB prediction (trained on the SMOTE training set) to the test dataset, we get:
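The report fits NB with a library routine; as a runnable illustration, this hand-rolled one-feature Gaussian Naïve Bayes sketch (toy ages and labels, all assumptions) shows the class-conditional scoring the model relies on:

```r
# Toy data: younger employees (0) vs Car users (1).
x <- c(20, 22, 25, 27, 35, 38, 40, 42)   # ages
y <- c(0,  0,  0,  0,  1,  1,  1,  1)    # 1 = Car

# Gaussian class-conditional likelihood for one feature.
lik <- function(v, cls) dnorm(v, mean(x[y == cls]), sd(x[y == cls]))

# Posterior P(Car | age) via Bayes' rule with empirical priors.
p_car <- function(v) {
  p1 <- lik(v, 1) * mean(y == 1)
  p0 <- lik(v, 0) * mean(y == 0)
  p1 / (p1 + p0)
}
p_car(39) > 0.5   # TRUE: a 39-year-old is scored as a Car user
```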


AUC = 97%, GINI = 94%

KS = 80%

Interpretation:

While the model has excellent confusion matrix values, AUC, GINI etc., the following are the
reasons why it cannot be used in this case:

The algorithm makes a very strong assumption that the features are independent of each
other, while in reality they are not. In our case, Age and work experience are closely
related; we have also seen that qualifications like Engineer or MBA attract higher salaries,
and Salary influences Car usage.

Not all the input variables are categorical, so Naïve Bayes cannot be used to full advantage:
salary, work experience and distance are numerical. For these, the model assumes a normal
distribution, whereas we have seen skewness in Age, work experience, salary etc.

So this model cannot be used in this exercise.


Confusion Matrix Interpretation & Model validation:

Measures             Logistic Regression   KNN Model   Naïve Bayes
Confusion Matrix
  Accuracy                  96%               95%          94%
  Sensitivity               83%               97%          96%
  Specificity               98%               78%          78%
  Balanced Accuracy         91%               87%          87%
AUC                         91%               87%          97%
GINI                        82%               75%          94%
KS                          82%               77%          80%

The confusion matrix is extremely useful for measuring Recall, Precision, Specificity,
Accuracy and, most importantly, the AUC-ROC curve.

True Positive: You predicted positive and it’s true.

True Negative: You predicted negative and it’s true.

False Positive (Type 1 Error): You predicted positive and it’s false.

False Negative (Type 2 Error): You predicted negative and it’s false.
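The definitions above can be turned into the reported measures directly from the four cell counts of a 2×2 confusion matrix (the counts below are toy values, with '1' = Car taken as the positive class):

```r
# Toy confusion-matrix cell counts.
tp <- 40; fn <- 10; fp <- 5; tn <- 145

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)                 # recall on Car users
specificity <- tn / (tn + fp)                 # recall on non-Car users
balanced    <- (sensitivity + specificity) / 2
round(c(accuracy, sensitivity, specificity, balanced), 3)
# 0.925 0.800 0.967 0.883
```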

In our exercise, we have used '0' for 2Wheeler/Public Transport and '1' for Car. To evaluate
the models, specificity is the parameter which decides how correctly we have predicted Car
usage.

As this is a classification problem, AUC is the best parameter for model validation, closely
followed by the KS parameter.

In our case, logistic regression performs best on this measure (with NB not considered, due
to the non-independence of the features). Also, we have framed the problem as Car vs no-Car,
a binary classification rather than a multi-class one, and logistic regression works well in
such cases.

Logistic Regression is the model to use for this exercise.


Bagging & Boosting:

We will use ensemble methods to fine-tune the prediction.

Bagging decreases the variance of the prediction by drawing bootstrap samples (sampling with
replacement) from the dataset to produce multiple training sets.

Boosting is an iterative technique which adjusts the weight of an observation based on the
last classification.

Bagging: We will use the ipred library in R for this function.

mycar.bagging = bagging(tpt ~ ., data = mycartrain,
                        control = rpart.control(maxdepth = 5, minsplit = 4))

mycartest$pred.tpt = predict(mycar.bagging, mycartest)

AUC = 89%, GINI = 78%

KS = 78%


Boosting: We will use the gbm library to perform gradient boosting.

mycargbm.fit = gbm(
formula = tpt ~ .,
distribution = "bernoulli",
data = mycartrain,
n.trees = 3000,
interaction.depth = 1,
shrinkage = 0.001,
n.cores = NULL,
verbose = FALSE
)

AUC = 93%, GINI = 86%

KS = 86%


Extreme Gradient Boosting (XGBoost):

### XGBoost works with numeric data only, so we convert any factors to numeric
### and create the train and test sets

After converting to numeric matrices, xgboost is run:

carxgb.fit <- xgboost(
  data = gd_features_train,
  label = gd_label_train,
  eta = 0.001,
  max_depth = 3,
  min_child_weight = 3,
  nrounds = 5000,
  nfold = 5,
  objective = "binary:logistic",
  verbose = 1,
  early_stopping_rounds = 10
)

The model is then tuned for the best values of eta, max_depth & nrounds. The fit above is
re-run holding two of the parameters constant and looping over the third:

lr <- c(0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 1)
md <- c(1, 3, 5, 7, 9, 15)
nr <- c(10, 50, 100, 500, 1000, 3000, 5000, 7500, 10000)

The best fit: eta = 0.1, max_depth = 5, nrounds = 100.
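The one-at-a-time loop described above can be sketched as follows; `score_model` here is a made-up stand-in for refitting xgboost at each eta and reading off its test AUC (it is deliberately constructed to peak at eta = 0.1 for illustration, and is not the real evaluation):

```r
# Candidate learning rates, as in the report.
lr <- c(0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 1)

# Hypothetical scoring function standing in for an xgboost refit.
score_model <- function(eta) -abs(log10(eta) + 1)

auc_by_eta <- sapply(lr, score_model)   # score each candidate
best_eta   <- lr[which.max(auc_by_eta)]
best_eta   # 0.1
```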


AUC = 89%, GINI = 78%

KS = 78%

While XGBoost has improved the accuracy, sensitivity and specificity, there is a 4% drop in
AUC.

Overall summary of all models

Measures             Logistic Regression   KNN Model   Naïve Bayes   Bagging   GBM    XGBoost
Confusion Matrix
  Accuracy                  96%               95%          94%         97%     96%     97%
  Sensitivity               83%               97%          96%         97%     98%     97%
  Specificity               98%               78%          78%        100%     84%    100%
  Balanced Accuracy         91%               87%          87%         98%     91%     98%
AUC                         91%               87%          97%         89%     93%     89%
GINI                        82%               75%          94%         78%     86%     78%
KS                          82%               77%          80%         77%     86%     78%


Actionable Insights & Recommendations

Bivariate analysis and the various models have shown Age to be an important parameter for
choosing Car as transport. The probability of choosing Car is high after the age of 32, and
there is a large population between 24 & 32 years of age (64%) who can be targeted. This can
be further filtered using the other key variables (distance, license etc.).

The probability of using Car increases once the distance exceeds 12 km. There is a large
population in the 6-12 km group which can be targeted through marketing/advertisement
campaigns.

The probability of using Car is high once the salary reaches 30 Lakhs. Again, there is a
large chunk of the population between 10-30 Lakhs who can be filtered on other variables and
targeted.

There is a gender disparity in Car usage, with more men, but this is understandable given
that 71% of the dataset is male.

Logistic regression has also shown that Age and distance have positive coefficients, and
these two variables can be used for key targeting.


References:

Great Learning Videos & Course Materials

CRAN package documentation

