
PREFERRED MODE OF TRANSPORTATION

Predictive Modelling: ML

Submitted By: Surabhi Sood


OBJECTIVE
The objective of this report is to explore the given cars data set ("Cars_edited.csv") in R and generate insights from it. The report consists of the following:

• Importing the dataset in R
• Understanding the structure of the dataset
• Statistical exploration
• Insights from the dataset

The objective is to use the Cars_edited.csv dataset to build an appropriate model that correctly predicts which employees prefer to use a car as their mode of transportation, and to identify which variables are significant predictors of this decision.
ASSUMPTIONS
Following are the assumptions we have taken into consideration:

Logistic Regression
• Binary Logistic Regression requires the dependent variable to be binary/dichotomous.
• For Logistic Regression, the predictor variables should be independent; there should be little or no multicollinearity among the predictor variables.
• Logistic regression assumes a linear relationship between each of the independent variables and the logit of the outcome.
• There should be no influential values (extreme values or outliers) in the continuous predictors.

Naïve Bayes
• Naïve Bayes works under the assumption that all the predictor variables are (probabilistically) independent of each other.

k Nearest Neighbour
• kNN is a non-parametric algorithm: it makes no assumptions about the distribution of the data, but it performs best when all independent variables are on a comparable scale.
STATISTICAL DATA ANALYSIS
The following steps were followed to drive the analysis of the given CARS dataset:
 
1. Data Exploration and data preparation (SMOTE)
2. Logistic Regression
3. kNN Modelling Algorithm
4. Naive Bayes model
5. Bagging and Boosting
6. Inferences
ENVIRONMENT SET UP AND DATA IMPORT
library(dplyr)              # data manipulation
install.packages('mice')
library(mice)               # missing-value imputation
library(ggplot2)            # plotting
library(gridExtra)          # arranging multiple plots
library(corrplot)           # correlation plots
library(ppcor)              # partial correlations
library(caTools)            # train/test split (sample.split)
install.packages('DMwR')
library(DMwR)               # SMOTE for class balancing
library(car)                # VIF for multicollinearity checks
library(class)              # kNN
library(e1071)              # Naive Bayes
install.packages('gbm')
library(gbm)                # gradient boosting
install.packages('xgboost')
library(xgboost)            # extreme gradient boosting
library(caret)              # confusionMatrix and other model utilities
library(ipred)              # bagging
library(rpart)              # decision trees
library(ROCR)               # ROC curves and AUC
library(ineq)               # Gini / Lorenz-curve metrics
The packages were installed, and the
libraries invoked to create the
model.
SETTING UP THE DIRECTORY
Setting a working directory at the start of the R session makes importing and exporting data files and code files easier. The working directory is the location/folder on the PC where the data, code, etc. related to the project are kept. It is set using the setwd command as below:

setwd("C:/Users/surabhi1.arora/Downloads")
EXPLORATORY DATA ANALYSIS
empdata <- read.csv("C:/Users/surabhi1.arora/Downloads/Cars.csv", header = T)

Checking the basic stats of the data:

summary(empdata)
dim(empdata)
str(empdata)
names(empdata)
describe(empdata)   # describe() comes from an additional package such as psych
Seeing the output, below are the observations:

• Number of rows in the data set: 444
• Number of columns: 9
• The dependent variable 'Transport' is a ternary variable. To make binary predictions for the employee mode of transport, i.e. to predict whether an employee uses a car or not, we will create a new categorical variable whose value is 1 if the employee uses a car and 0 if the employee uses any other mode of transport.
• Number of predictor variables: 8
• Gender, Engineer, MBA, License and Transport are categorical variables.
• Age, Work.Exp, Salary and Distance are continuous variables.
• Gender is a factor variable with levels female and male.

 
Missing Values:

sapply(empdata, function(x) sum(is.na(x)))

This shows that the MBA column has 1 missing value.

We will have to fill that value in order to have a complete dataset for further analysis. We impute this missing value using the mice package; the imputed value puts MBA = 0 in place of the NA.
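A minimal sketch of how the imputation could be done with the mice package (the number of imputations, method and seed here are illustrative assumptions, not the exact values used):

imp <- mice(empdata, m = 5, method = "pmm", seed = 123)   # multiple imputation by predictive mean matching
empdata <- complete(imp)                                  # take the completed dataset
sum(is.na(empdata))                                       # should now be 0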
 
 
 
After
imputation, the data is complete and
there are no missing values.
Analyzing the categorical predictor variables we see:

1. Gender
There are 128 females and 316 males in our data.

2. Engineer
There are 335 engineers and 109 non-engineers in our data.

3. MBA
There are 112 MBAs and 332 non-MBAs in our data.

4. License
104 people have a license while 340 do not. The chances of people with a license driving a car look significantly higher than for those without a license.

 
Univariate Analysis of Continuous Variables - Histograms and Boxplots of the variables

1. AGE

The distribution is slightly right skewed and consequently there are a few outliers. These outliers are well within the expected values for employee age and there is nothing odd about them.

 
2. WORK EXPERIENCE

This distribution is right skewed and there are outliers towards the higher end. These outliers too are well within the expected range; the senior employees in the company generally have more work experience.

 
3. SALARY

The salary distribution is right skewed in most organizations, and hence these outliers do not indicate anything unusual. None of the values is abruptly high or low.

 
4. DISTANCE OF COMMUTE

The distribution of the distance between employees' homes and the office is fairly normal. There are a few outliers, but they aren't abruptly high or low and hence do not require any action.

 
MultiCollinearity

To find the correlation between the different variables, all of them must be of int/numeric data type. We convert the Gender column into a "Female" column with the value 1 for female and 0 for male, and the Transport column into a "Car" column with the value 1 where the mode of transport is car and 0 otherwise, as in the sketch below.
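A sketch of one way this conversion and the correlation plot could be produced (the exact recoding and plotting options are assumptions for illustration):

empdata$Female <- ifelse(empdata$Gender == "Female", 1, 0)    # 1 = female, 0 = male
empdata$Car    <- ifelse(empdata$Transport == "Car", 1, 0)    # 1 = uses car, 0 = other transport

num.data <- empdata[ , !(names(empdata) %in% c("Gender", "Transport"))]   # keep numeric columns only
corrplot(cor(num.data), method = "number")                                # correlation plot with coefficients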

 
The correlation plot shows that the dependent variable 'Car' has high correlations with Salary, Work.Exp, Age, Distance and license. But Age, Salary, Work.Exp, Distance and license also have high correlations amongst themselves. In particular, Work.Exp and Salary are very highly correlated. There is a slight negative correlation between Female and license.

The correlation plot and correlation significance values indicate the presence of collinearity between Age, Work.Exp, Salary and license. Correlation plots give us the correlation between any 2 variables. To determine multicollinearity (a linear relationship between more than 2 variables), we will use VIF values when building the Logistic Regression model. Multicollinearity does not affect a kNN model.

 
DATA PREPARATION - SMOTE
From the current data set we see that the number of employees taking a car as their mode of transport is really small:

The graph shows:

Car = 0 : 383
Car = 1 : 61

The minority class, i.e. Car = 1, is only 13.74% of the data.

 
 
To fix this imbalance in the data, we will use the SMOTE technique. Before using SMOTE, we split the data into train and validation sets (80:20). We also have to convert the categorical variables to factors before dividing the dataset into train and test, as in the sketch below.

After dividing the data into test and train we see an even distribution in both partitions, matching the whole data set, in which 13.74% of employees use cars.
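A sketch of the factor conversion and the 80:20 split using caTools::sample.split (the seed and exact column names are assumptions):

empdata$Car      <- as.factor(empdata$Car)        # dependent variable as a 2-level factor
empdata$Engineer <- as.factor(empdata$Engineer)
empdata$MBA      <- as.factor(empdata$MBA)
empdata$license  <- as.factor(empdata$license)

set.seed(123)
split <- sample.split(empdata$Car, SplitRatio = 0.8)   # stratified on the dependent variable
train <- subset(empdata, split == TRUE)
test  <- subset(empdata, split == FALSE)

prop.table(table(train$Car))   # ~13.7% Car = 1 in the train partition
prop.table(table(test$Car))    # and a similar proportion in the test partition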
 
BALANCING THE DATA USING
SMOTE
 
Comparing the distribution of the data before and after running SMOTE, we see that the minority sample proportion has been synthetically improved using k = 5. That is, 294 new samples are generated using 5 nearest neighbors, as in the sketch below.
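A sketch of the SMOTE call from the DMwR package; with roughly 49 minority cases in the training split, perc.over = 600 would generate about 294 synthetic samples with k = 5, but the exact perc.over / perc.under values used are assumptions:

set.seed(123)
balanced.train <- SMOTE(Car ~ ., data = train, perc.over = 600, k = 5, perc.under = 200)
table(balanced.train$Car)    # check the new class distribution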
PREDICTIVE MODELLING
Predicting whether an employee commutes by car or not is a classification problem; hence classification models must be built. Our dependent variable 'Car' is a binary variable. We will use the balanced data created using SMOTE for training. For validation, the original validation set created via splitting will be used. We will use Logistic Regression, kNN and Naïve Bayes modelling techniques to build predictive models of the preferred mode of transport. Bagging and boosting will be used to see if they improve the predictions. We will then use model performance metrics to compare the three models' performance on the validation set.
LOGISTIC REGRESSION MODEL

Taking the balanced data as the dataset for creating the model, and using VIF to reduce the number of correlated predictor variables, we see:
 
logit.train = balanced.train

LRmodel1 = glm(Car ~ ., data = logit.train, family = 'binomial')
summary(LRmodel1)
vif(LRmodel1)   # check the variance inflation factors of the full model
 
 
 
We see the VIF of work experience is high (>5), so we will drop that column for model building.
 
logit.train = logit.train[ , -c(4)]   # drop the work-experience column (column 4)

LRmodel2 = glm(Car ~ ., data = logit.train, family = 'binomial')
summary(LRmodel2)

vif(LRmodel2)
 
 
 
Now all the VIF values are within range.

The model shows that Age, academic qualification, Salary, distance of commute, license and Gender are significant in predicting the employee's mode of commute.

To check the acceptability of the model, we will calculate the performance indices (a sketch of the calculation is shown below):
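A sketch of how these indices could be computed for the logistic regression model, assuming a 0.5 probability cutoff and the object names used above (the same calls apply to the test set by swapping in the test data):

pred.prob  <- predict(LRmodel2, newdata = logit.train, type = "response")   # predicted probabilities
pred.class <- ifelse(pred.prob > 0.5, 1, 0)                                 # classify at a 0.5 cutoff
confusionMatrix(as.factor(pred.class), as.factor(logit.train$Car), positive = "1")

pred.obj <- prediction(pred.prob, logit.train$Car)   # ROCR objects for the ROC curve
performance(pred.obj, "auc")@y.values[[1]]           # area under the curve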
 
Hence the confusion matrix on the validation data is pretty accurate (97.8%), with a sensitivity of 96.5% and specificity of 98.6%. The precision is 97.6%.

Calculating the area under the curve, we see the AUC is 97.56%.

Using this model on the test data, the performance indices are below:

The confusion matrix on the test data is accurate (94.38%), with a sensitivity of 91.6% and specificity of 94.8%. The precision is 73.3%.

Calculating the area under the curve, we see the AUC is 93.23%.
 
K NEAREST NEIGHBOR MODEL

The k-nearest neighbors (kNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. Variable scaling is very important in distance-based algorithms like kNN, to make sure that variables with larger numeric ranges do not dominate variables with smaller ranges in the distance calculation. Hence, all the variables must be brought to a similar scale / normalised, as in the sketch below.
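A sketch of the scaling and kNN fit using class::knn, assuming all predictor columns are (or can be coerced to) numeric, e.g. with the Female flag in place of Gender; k = 5 is an illustrative choice:

# Coerce the 0/1 factor predictors back to numeric so everything can be scaled
num.train <- data.frame(lapply(balanced.train[ , setdiff(names(balanced.train), "Car")],
                               function(x) as.numeric(as.character(x))))
num.test  <- data.frame(lapply(test[ , setdiff(names(test), "Car")],
                               function(x) as.numeric(as.character(x))))

train.x <- scale(num.train)                                   # centre and scale the training predictors
test.x  <- scale(num.test,
                 center = attr(train.x, "scaled:center"),     # apply the training scaling to the test set
                 scale  = attr(train.x, "scaled:scale"))

knn.pred <- knn(train = train.x, test = test.x, cl = balanced.train$Car, k = 5)
confusionMatrix(knn.pred, test$Car, positive = "1")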
 
 
Running the model, we see that the accuracy is 100%, with both sensitivity and precision equal to 1.
 
NAIVE BAYES MODEL:

Naive Bayes is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a feature in a class is unrelated to the presence of any other feature. The dependent variable in our data is a categorical variable with 2 levels and all our predictor variables are numeric, so a Naïve Bayes algorithm is applicable in such a case. However, it is not very easy to derive insights from such a model. A sketch of the model fit is shown below.
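A sketch of the Naive Bayes fit with e1071::naiveBayes, using the object names assumed above:

nb.model <- naiveBayes(Car ~ ., data = balanced.train)   # fit on the SMOTE-balanced training data
nb.pred  <- predict(nb.model, newdata = test)            # class predictions on the hold-out data
confusionMatrix(nb.pred, test$Car, positive = "1")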
 
 
To check the acceptability of the model, we will calculate the performance indices:

The confusion matrix on the validation data is pretty accurate (99%), with a sensitivity of 99.12% and specificity of 98.97%. The precision is 98.26%.

Calculating the area under the curve, we see the AUC is 99%.

Using this model on the test data, the performance indices are below:

The confusion matrix on the test data is pretty accurate (95.5%), with a sensitivity of 1 and specificity of 94.8%. The precision is 75%.

Calculating the area under the curve, we see the AUC as 97.4%.
Bagging and Boosting ensemble models

Ensemble modelling techniques like bagging and boosting help to make better predictions for imbalanced datasets. Bagging is a way to decrease the variance of the prediction by generating additional data for training from the dataset, using combinations with repetitions to produce multiple sets of the original data. Boosting is an iterative technique which adjusts the weight of an observation based on the last classification.
BAGGING:

Running the algorithm and checking the performance indices of the model on the training data, we see that the confusion matrix for the training data gives a high accuracy of 99.43%, with a sensitivity of 97.95% and specificity of 99.67%. The precision value is 97.95%.

Checking the area under the curve we have the below value:

AUC = 0.9881619

Using this model on the test dataset we see the below performance indices:

The confusion matrix shows that the model works pretty accurately on the test data (95.5%), with a sensitivity of 83.3% and specificity of 97.4%. The precision is 83.3% and the AUC is 90.36%.
 
 
 
We use decision trees for our bagging algorithm. The predictions provided by bagging algorithms are accurate even with unbalanced data, so we do not have to use synthetic sampling techniques like SMOTE when using bagging; we can use the original unbalanced data. Bagging does the sampling internally and creates many random samples in which the minority class gets decent representation. The bagging model predicts the mode of transport with high accuracy. A sketch of the bagging fit is shown below.
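A sketch of the bagged trees fit with ipred::bagging on the original (unbalanced) training data; nbagg and the rpart control settings are illustrative assumptions:

bag.model <- bagging(Car ~ ., data = train, nbagg = 25,
                     control = rpart.control(maxdepth = 5, minsplit = 4))   # bagged decision trees
bag.pred  <- predict(bag.model, newdata = test, type = "class")
confusionMatrix(bag.pred, test$Car, positive = "1")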
BOOSTING:

We will use xgboost() as our boosting algorithm. xgboost works with matrices, so all variables must be numeric. We also need to separate the training predictors from the dependent variable, as in the sketch below.
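A sketch of the xgboost set-up, separating the numeric predictor matrix from the label; the nrounds, max_depth and eta values are illustrative assumptions:

# Build numeric matrices for xgboost and pull out the 0/1 label
train.x <- as.matrix(sapply(train[ , setdiff(names(train), "Car")],
                            function(x) as.numeric(as.character(x))))
train.y <- as.numeric(as.character(train$Car))
test.x  <- as.matrix(sapply(test[ , setdiff(names(test), "Car")],
                            function(x) as.numeric(as.character(x))))

xgb.model <- xgboost(data = train.x, label = train.y, objective = "binary:logistic",
                     nrounds = 50, max_depth = 3, eta = 0.3, verbose = 0)

xgb.prob  <- predict(xgb.model, test.x)        # predicted probabilities
xgb.class <- ifelse(xgb.prob > 0.5, 1, 0)      # classify at a 0.5 cutoff
confusionMatrix(factor(xgb.class, levels = c(0, 1)), test$Car, positive = "1")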
 
To check the model performance we will look at the index values:

From the confusion matrix we see the boosting model is 97.75% accurate, with a sensitivity of 93.88% and specificity of 98.36%. The precision is 90.2%.

Checking the area under the curve we see AUC = 0.9954982.

Running the model on the test data we get:

The confusion matrix is 94.38% accurate, with a sensitivity of 83.3% and specificity of 96.1%. The precision is 76.9%. The area under the curve is 98.48%.

We used the xgboost algorithm for boosting our prediction model. Although balancing the data helps the boosting model, even without balancing the performance of the boosting algorithm is better than that of most other algorithms. The boosting model predicts the mode of transport with high accuracy.
INSIGHTS AND RECOMMENDATIONS

In this model-building exercise, we tried classification of the data using models like kNN, Naïve Bayes and Logistic Regression. But as the data is unbalanced, we first had to balance it using techniques like SMOTE before running any of these predictive algorithms. The Bagging and Boosting ensemble techniques have an advantage here, as they can easily be used on unbalanced data as well.

From the above calculated performance indices for all the models we see:

• All the models predict the mode of transport with high accuracy.
• Generally, boosting and bagging give us better predictions, especially when the data is unbalanced.
• In our present case, however, kNN and Naïve Bayes with SMOTE, and Bagging and Boosting without SMOTE, all give us high accuracy and sensitivity values.
• Except for logistic regression, all the other models have almost perfect predictions on the test data.
• More than accuracy, we should focus on sensitivity and precision when we encounter such unbalanced data.
• Key variables that help to predict whether an employee travels by car are: distance of commute, Age, Salary, Gender and whether or not the employee has a license.

If the sample is a true representation of the population, fewer employees commute using a car; most prefer other means of transport. Distance of travel is a major factor in deciding the mode of transport: if the distance is higher (>9 km), employees are more likely to use cars. Another very important factor is salary; only people with higher salaries travel by car. One small exception to the above criteria is gender: females are more likely to travel by car than males even when the salary is mid-range. Older employees are more likely to travel by car. (Since the age of the employees, their salaries and their work experience have high positive correlation, we can focus on any one of these variables at a time.)

Taking these factors into consideration, it can be predicted with great accuracy which employees are more likely to use cars for their everyday commute.
