R Venkataraman
Car Transport Prediction Assessment
Project Description
This project requires you to understand which mode of transport employees prefer for
commuting to their office. The attached data 'Cars.csv' includes information about each
employee's mode of transport as well as personal and professional details such as age, salary
and work experience. We need to predict whether or not an employee will use a car as their
mode of transport, and determine which variables are significant predictors of this decision.
Project Objective
• EDA - Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers
and missing values and check the summary of the dataset
• EDA - Illustrate the insights based on EDA
• EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.
• Data Preparation (SMOTE)
• Applying Logistic Regression & Interpret results
• Applying KNN Model & Interpret results
• Applying Naïve Bayes Model & Interpret results (is it applicable here? comment and if
it is not applicable, how can you build an NB model in this case?)
• Confusion matrix interpretation
• Remarks on Model validation exercise <Which model performed the best>
• Bagging
• Boosting
• Actionable Insights and Recommendations
Project Report
With the necessary R libraries loaded and the working directory set, the dataset is loaded
into R. An initial glimpse of the data is as follows:
We have 8 independent variables and 1 dependent variable ('Transport') in the given dataset,
across 444 rows, which can be split into training and test sets for model building.
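The loading step itself is not shown; a minimal sketch, assuming 'Cars.csv' sits in the working directory and using `mycars` as an assumed object name:

```r
# Load the dataset (file name from the project description; object name assumed)
mycars <- read.csv("Cars.csv", stringsAsFactors = TRUE)

dim(mycars)              # expect 444 rows and 9 columns
str(mycars)              # variable types: numeric and factor
head(mycars)             # first few rows
colSums(is.na(mycars))   # quick check for missing values
```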
Data Description:
Univariate Analysis
Age:
Mean = 27.75, standard deviation = 4.416, skew = 0.9488. The 24-32 age group has the
maximum count, covering 64% of the employees.
Number of outliers: 25
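The summary statistics and outlier counts reported for each numeric variable can be reproduced with base R alone; a sketch (the `skewness` helper below is a plain moment ratio, not necessarily the exact formula of the package the report used):

```r
# Sample skewness: third central moment scaled by the cubed standard deviation
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Outlier count by the usual boxplot rule (points beyond 1.5 * IQR of the hinges)
n_outliers <- function(v) length(boxplot.stats(v)$out)

# Applied per numeric column once the data is loaded, e.g.:
# mean(mycars$Age); sd(mycars$Age); skewness(mycars$Age); n_outliers(mycars$Age)
```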
Work Experience:
Mean = 6.3, standard deviation = 5.11, skew = 1.344. Work experience of 0-4 years and
4-8 years has the maximum count, at 45% and 32% respectively.
Number of outliers: 38
Salary:
Mean = 16.24, standard deviation = 10.45, skew = 2.031. Salary groups of 0-10 lakhs and
10-20 lakhs have the maximum count, at 28% and 55% respectively.
Number of outliers: 59
Distance:
Distance groups of 6-12 km and 12-18 km have the maximum count, at 56% and 35% respectively.
Number of outliers: 9
Gender:
Engineer:
MBA:
License:
76% of the employees have no driving license; only the remaining 24% hold one.
Transport:
This is the dependent variable to be predicted. The data shows that 19% use a 2-wheeler,
14% use a car, and the remaining 67% use public transport.
Outlier treatment is not carried out in this exercise, as it is not explicitly asked for.
Bi-Variate Analysis
We will analyze how the dependent variable (Transport) trends with the independent
variables.
Age Vs Transport:
Salary Vs Transport:
Distance Vs Transport:
Gender Vs Transport:
Engineer Vs Transport:
MBA Vs Transport:
License Vs Transport
Multicollinearity
We will check for multicollinearity among the independent variables. For this exercise, the
Gender variable is removed.
Both the correlation coefficients and their elliptical representation are shown below:
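A sketch of how such a plot is typically produced (the `corrplot` package is assumed here, since the report shows an elliptical correlation plot but does not name its tooling; the column names are assumptions based on the data description):

```r
library(corrplot)   # assumed package for the elliptical correlation plot

# Correlation among the numeric predictors, with Gender removed
num_vars <- mycars[, c("Age", "Work.Exp", "Salary", "Distance")]
cor_mat  <- cor(num_vars)
print(round(cor_mat, 2))

# Elliptical portrayal: narrower ellipses indicate stronger correlation
corrplot(cor_mat, method = "ellipse")
```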
The original dataset is augmented with a new variable "tpt" using the logic below. This is
done because the current scope is to predict the use of a car as transport.
The dataset is split into training and test sets in a 70:30 ratio.
We can see that the proportions of 0 and 1 are similar in the training and test sets.
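A sketch of the recode and split, assuming the factor level is spelled "Car" and using a fixed placeholder seed (the report's own seed is not shown):

```r
# Binary target: 1 = Car, 0 = 2Wheeler / Public Transport
mycars$tpt <- factor(ifelse(mycars$Transport == "Car", 1, 0))

# 70/30 train-test split
set.seed(123)                                   # placeholder seed
idx        <- sample(nrow(mycars), 0.7 * nrow(mycars))
mycartrain <- mycars[idx, ]
mycartest  <- mycars[-idx, ]

# Class proportions should be similar in both sets
prop.table(table(mycartrain$tpt))
prop.table(table(mycartest$tpt))
```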
We will apply SMOTE, using the DMwR library, to even out the class imbalance in the
training dataset, since the model will be trained on it. By tuning perc.over and
perc.under, the desired result is achieved as follows:
The class imbalance has now been rectified by SMOTE: 0 and 1 are equally represented in
the training set for modelling.
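A sketch of the SMOTE call (note that DMwR has since been archived on CRAN; performanceEstimation::smote or the smotefamily package are alternatives; the perc.over/perc.under values below are placeholders, as the tuned values are not shown in the text):

```r
library(DMwR)   # archived on CRAN; install from the archive if needed

# perc.over oversamples the minority class; perc.under resamples the majority
mycartrainSM <- SMOTE(tpt ~ ., data = mycartrain,
                      perc.over = 250, perc.under = 150)   # placeholder values

table(mycartrainSM$tpt)   # classes should now be roughly balanced
```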
Logistic Regression:
Variables of Importance:
The model is applied to the test dataset, with the results below:
KS value = 82
This model performed well on both the training and test datasets; the accuracy,
specificity and other performance measures are very good.
The variables Age, Distance, license and MBA are highly significant. Age, Distance and
license have positive coefficients, while MBA has a negative one.
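The fit and scoring described above can be sketched as follows (object names carried over from the splitting and SMOTE steps are assumptions):

```r
# Logistic regression on the SMOTE-balanced training set
mycar.glm <- glm(tpt ~ ., data = mycartrainSM, family = binomial)
summary(mycar.glm)        # significance of Age, Distance, license, MBA

# Score the test set and classify at a 0.5 probability cutoff
pred.prob  <- predict(mycar.glm, newdata = mycartest, type = "response")
pred.class <- ifelse(pred.prob > 0.5, 1, 0)
table(actual = mycartest$tpt, predicted = pred.class)
```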
KNN Model:
For KNN, we will normalize the data, since Distance, Salary and Work Experience are on
different scales from the other (factor) variables.
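Min-max scaling is one common choice for this normalization; a sketch:

```r
# Min-max normalization: rescales a numeric vector to the [0, 1] range
normalize <- function(v) (v - min(v)) / (max(v) - min(v))

# Applied column-wise to the predictors before KNN, e.g.:
# mycarknn <- as.data.frame(lapply(mycars[, 1:8], normalize))
```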
After normalization, training and test data sets will be created. To overcome the imbalance in
the classes, SMOTE will be applied on the training set.
With the best tune of k = 5, the training set is used to predict the values in the test set:
mycarkpred = knn(mycarknn_trainSM[, c(1:8)], mycarknn_test[, c(1:8)],
                 mycarknn_trainSM$Transport, k = 5)
AUC=87.35% GINI=75%
KS=74.7
Interpretation: While this model has good accuracy and sensitivity, it loses out to the glm
on specificity. Its main disadvantage is that it does not build a model from the training
set; it simply uses the training set directly for classification (lazy learning).
Naïve Bayes Model:
We will use the same training and test datasets as in the logistic regression, with the
training set balanced using SMOTE.
Applying the NB model (trained on the SMOTE training set) to the test dataset, we get:
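A sketch of the NB fit and prediction, assuming the e1071 package (the report does not name the library used):

```r
library(e1071)   # assumed NB implementation

# Train on the SMOTE-balanced set; numeric features are modelled as Gaussian
mycar.nb <- naiveBayes(tpt ~ ., data = mycartrainSM)

# Class predictions on the held-out test set
nb.pred <- predict(mycar.nb, newdata = mycartest)
table(actual = mycartest$tpt, predicted = nb.pred)
```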
AUC=97% GINI = 94
KS = 80
Interpretation:
While the model has excellent confusion-matrix values, AUC, GINI etc., the following are
the reasons why it cannot be relied on in this case:
The algorithm makes a very strong assumption that the features are independent of each
other, which is not true in reality. In our case, Age and Work Experience are closely
related, and we have also seen that qualifications such as Engineer or MBA attract higher
salaries, and Salary influences car usage.
Not all the input variables are categorical, so Naïve Bayes cannot be used to full
advantage: Salary, Work Experience, Distance etc. are numerical. For those, the model
assumes a normal distribution, whereas we have seen skewness in Age, Work Experience,
Salary etc.
Confusion Matrix:
The confusion matrix is extremely useful for measuring Recall, Precision, Specificity,
Accuracy and, most importantly, the AUC-ROC curve.
False Positive (Type 1 Error): we predicted positive and it is false.
False Negative (Type 2 Error): we predicted negative and it is false.
In our exercise, we have used '0' for 2-wheeler/public transport and '1' for car. To
evaluate the models, specificity is the parameter that tells us how correctly we have
predicted car usage.
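The standard metrics derived from a 2x2 confusion matrix can be computed directly; a self-contained sketch (the counts below are illustrative only, not the report's actual numbers):

```r
# Metrics from confusion-matrix counts (tp/fp/fn/tn defined w.r.t. the positive class)
cm_metrics <- function(tp, fp, fn, tn) {
  c(accuracy    = (tp + tn) / (tp + fp + fn + tn),
    sensitivity = tp / (tp + fn),    # true positive rate (recall)
    specificity = tn / (tn + fp),    # true negative rate
    precision   = tp / (tp + fp))
}

# Illustrative counts only:
round(cm_metrics(tp = 15, fp = 3, fn = 2, tn = 100), 3)
```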
Being a classification problem, AUC is the best parameter for model validation, closely
followed by the KS statistic.
In our case, logistic regression performs best on these measures (with NB excluded because
its independence assumption does not hold for this data). We have also framed the problem
as a binary car vs. no-car classification rather than a multi-class one, and logistic
regression works well in such cases.
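AUC, GINI and KS are all derivable from the predicted probabilities; a sketch using the ROCR package (an assumption, since the report does not name its tooling), with GINI = 2*AUC - 1 and KS taken as the maximum TPR - FPR gap:

```r
library(ROCR)   # assumed package for ROC-based measures

perf_stats <- function(prob, actual) {
  pred <- prediction(prob, actual)
  auc  <- performance(pred, "auc")@y.values[[1]]
  roc  <- performance(pred, "tpr", "fpr")
  ks   <- max(roc@y.values[[1]] - roc@x.values[[1]])   # max TPR - FPR gap
  c(AUC = auc, GINI = 2 * auc - 1, KS = ks)
}

# e.g. perf_stats(pred.prob, mycartest$tpt) for the logistic regression model
```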
Bagging:
mycar.bagging = bagging(tpt ~ ., data = mycartrain,
                        control = rpart.control(maxdepth = 5, minsplit = 4))
mycartest$pred.tpt = predict(mycar.bagging, mycartest)
KS = 78%
Boosting:
mycargbm.fit = gbm(
  formula = tpt ~ .,
  distribution = "bernoulli",
  data = mycartrain,
  n.trees = 3000,
  interaction.depth = 1,
  shrinkage = 0.001,
  n.cores = NULL,
  verbose = FALSE
)
KS = 86
Xtreme Gradient Boosting (XGBoost):
### XGBoost works only with numeric data, so we convert any factor variables to numeric.
The model is then tuned for the best values of eta, max_depth and nrounds. This is done by
holding two of the three values constant and looping over the other.
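A sketch of the numeric conversion and the one-parameter-at-a-time tuning loop (the xgboost package's cross-validation helper is used; the candidate grid and fixed values are assumptions):

```r
library(xgboost)

# XGBoost requires a numeric matrix; data.matrix() coerces factors to integer codes
X <- data.matrix(mycartrainSM[, setdiff(names(mycartrainSM), "tpt")])
y <- as.numeric(as.character(mycartrainSM$tpt))
dtrain <- xgb.DMatrix(X, label = y)

# Tune eta while holding max_depth and nrounds constant (repeat per parameter)
for (eta in c(0.01, 0.1, 0.3)) {                 # candidate grid assumed
  cv <- xgb.cv(data = dtrain, nrounds = 100, max_depth = 3, eta = eta,
               objective = "binary:logistic", nfold = 5, verbose = FALSE)
  cat("eta =", eta, "min CV error:", min(cv$evaluation_log$test_error_mean), "\n")
}
```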
AUC = 89, GINI = 78
KS = 78
While XGBoost has improved the accuracy, sensitivity and specificity, there is a 4% drop
in AUC.
Actionable Insights and Recommendations:
Bivariate analysis and the various models have shown Age to be an important parameter in
choosing a car as transport. The probability of choosing a car is high after the age of
32, and there is a large population between 24 and 32 years of age (64%) who can be
targeted. This group can be further filtered using the other key variables (distance,
license etc.).
The probability of using a car increases once the commute distance exceeds 12 km. There is
a large population in the 6-12 km group which can be targeted through
marketing/advertisement campaigns.
The probability of using a car is high once salary reaches 30 lakhs. Again, a large chunk
of the population earns between 10 and 30 lakhs and can be filtered on the other variables
and targeted.
There is a gender disparity in car usage, with more men using cars, but this is
understandable given that 71% of the dataset is male.
Logistic regression has also shown that Age and Distance have positive coefficients, so
these two variables can be used to identify key targets.