MODE OF TRANSPORTATION
Predictive Modelling: ML
Naïve Bayes
• Naïve Bayes works under the assumption that all predictor variables are probabilistically independent of each other.
k Nearest Neighbour
kNN is a non-parametric algorithm: it makes no assumptions about the underlying distribution, but it performs best when all independent variables are on a comparable scale.
STATISTICAL DATA ANALYSIS
The following steps were followed to carry out the analysis of the given CARS dataset:
1. Data exploration and data preparation (SMOTE)
2. Logistic Regression
3. kNN
4. Naïve Bayes
5. Bagging and Boosting
6. Inferences
ENVIRONMENT SET-UP AND DATA IMPORT
# Install the packages that are not already available
install.packages('mice')
install.packages('DMwR')
install.packages('gbm')
install.packages('xgboost')

library(dplyr)      # data manipulation
library(mice)       # missing-value imputation
library(ggplot2)    # plotting
library(gridExtra)  # arranging plots
library(corrplot)   # correlation plots
library(ppcor)      # partial correlations
library(caTools)    # train/test splitting
library(DMwR)       # SMOTE
library(car)        # VIF
library(class)      # kNN
library(e1071)      # Naive Bayes
library(gbm)        # boosting
library(xgboost)    # gradient boosting
library(caret)      # model training and evaluation
library(ipred)      # bagging
library(rpart)      # decision trees
library(ROCR)       # ROC curves
library(ineq)       # Gini / Lorenz curves
The packages were installed and the libraries loaded before building the models.
SETTING UP THE DIRECTORY
Setting a working directory at the start of an R session makes importing and exporting data and code files easier. The working directory is the folder on the PC where the data, code, and other project files are kept.
The working directory is set using the setwd command, for example:
setwd("C:/Users/surabhi1.arora/Downloads")
EXPLORATORY DATA ANALYSIS
empdata <- read.csv("C:/Users/surabhi1.arora/Downloads/Cars.csv", header = T)
Checking the basic stats of the data:
summary(empdata)
dim(empdata)
str(empdata)
names(empdata)
describe(empdata)   # describe() comes from the psych package
From the output, the observations are:
• Number of rows in the data set: 444
• Number of columns: 9
• The dependent variable ‘Transport’ has three levels. To make binary predictions for the employee mode of transport, i.e. to predict whether an employee uses a car or not, we will create a new categorical variable whose value is 1 if the employee uses a car and 0 if the employee uses any other mode of transport.
• Number of predictor variables: 8
• Gender, Engineer, MBA, License and Transport are categorical variables.
• Age, Work.Exp, Salary and Distance are continuous variables.
• Gender is a factor variable with two levels – female and male.
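The binary target described above can be sketched as follows; the level name "Car" is assumed from the dataset description:

```r
# Create a binary target: 1 if the employee commutes by car, 0 otherwise.
# 'Transport' has three levels, so we collapse it for binary classification.
empdata$Cars <- ifelse(empdata$Transport == "Car", 1, 0)

# Check the class counts of the new variable
table(empdata$Cars)
```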
Missing values:
sapply(empdata, function(x) sum(is.na(x)))
This shows that the MBA column has one missing value. That value must be filled in so the whole dataset can be used in the subsequent analysis. We impute it using the mice package; the imputed value puts MBA = 0 in place of the NA.
After imputation, the data is complete and there are no missing values.
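The mice imputation might look like the sketch below; the imputation method and seed are illustrative assumptions, not taken from the original analysis:

```r
library(mice)

# Impute the missing MBA value; "pmm" (predictive mean matching)
# and the seed are illustrative choices
imp <- mice(empdata, m = 5, method = "pmm", seed = 123)
empdata <- complete(imp)

# Verify that no missing values remain
sapply(empdata, function(x) sum(is.na(x)))
```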
Analyzing the categorical predictor variables:
1. Gender
2. Engineer
3. MBA
4. License
Univariate Analysis of Continuous Variables - Histograms and Boxplots
1. AGE
The distribution is slightly right skewed and consequently there are a few outliers. These outliers are well within the expected range for employee age, and there is nothing odd about them.
2. WORK EXPERIENCE
3. SALARY
The salary distribution is right skewed in most organizations, so these outliers do not indicate anything unusual. None of the values is abruptly high or low.
4. DISTANCE OF COMMUTE
The distribution of the distance between employees’ homes and the office is fairly normal. There are a few outliers, but they are not abruptly high or low and hence do not require any action.
MultiCollinearity
To find the correlation between the different variables, all of them must be of an int/numeric data type. We convert the Gender column into “Female” (1 for female, 0 for male) and the Transport column into “Cars” (1 if the employee uses a car, 0 otherwise):
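The recoding and correlation plot described above can be sketched as below; the level names "Female" and "Car" are assumptions based on the dataset description:

```r
library(corrplot)

# Recode the categorical columns to numeric indicators
empdata$Female <- ifelse(empdata$Gender == "Female", 1, 0)
empdata$Cars   <- ifelse(empdata$Transport == "Car", 1, 0)

# Keep only numeric columns and plot their pairwise correlations
num_data <- empdata[, sapply(empdata, is.numeric)]
corrplot(cor(num_data), method = "number", type = "lower")
```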
The correlation plot shows that the dependent variable ‘Cars’ has high correlations with Salary, Work.Exp, Age, Distance and license. However, Age, Salary, Work.Exp, Distance and license also have high correlations amongst themselves; in particular, Work.Exp and Salary are very highly correlated. There is a slight negative correlation between Female and license.
The correlation plot and the correlation significance values indicate the presence of collinearity between Age, Work.Exp, Salary and license. Correlation plots only show the correlation between any two variables; to detect multicollinearity (a linear relationship among more than two variables), we will use VIF values when building the Logistic Regression model. Multicollinearity does not affect a kNN model.
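A VIF check on a logistic model might look like the sketch below; the `train_data` object and the exact predictor set are assumptions for illustration:

```r
library(car)

# Fit a logistic model on the (hypothetical) training set,
# then inspect VIF values; a common rule of thumb flags VIF > 5 (or 10)
logit_fit <- glm(Cars ~ Age + Work.Exp + Salary + Distance + license,
                 data = train_data, family = binomial)
vif(logit_fit)
```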
DATA
PREPARATION
- SMOTE
From the current data set we see that the number of employees taking a car as their mode of transport is very small. The graph shows:
Car = 0 : 383
Car = 1 : 61
The minority class, Car = 1, is only 13.74% of the data. To fix this imbalance, we will use the SMOTE technique.
Before applying SMOTE, we split the data into train and validation sets (80:20). We also convert the categorical variables to factors. After the split, the car-user proportion in the train and test sets matches that of the whole data set, which has 13.74% employees using cars.
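The stratified 80:20 split can be sketched with caTools as below; the seed is an illustrative assumption:

```r
library(caTools)

set.seed(123)  # seed is illustrative
empdata$Cars <- as.factor(empdata$Cars)

# sample.split preserves the class ratio of 'Cars' in both subsets
split <- sample.split(empdata$Cars, SplitRatio = 0.8)
train_data <- subset(empdata, split == TRUE)
test_data  <- subset(empdata, split == FALSE)

# Check that the ~13.74% car-user proportion is preserved
prop.table(table(train_data$Cars))
prop.table(table(test_data$Cars))
```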
BALANCING THE DATA USING SMOTE
Comparing the distribution of the data before and after running SMOTE, we see that the minority sample proportion has been synthetically improved using k = 5; that is, 294 new samples are generated using the 5 nearest neighbours.
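A SMOTE call with DMwR might look like the sketch below; the `perc.over` and `perc.under` values are illustrative assumptions, not the settings used in the original analysis:

```r
library(DMwR)

# DMwR's SMOTE requires a factor target; perc.over controls how many
# synthetic minority samples are created and perc.under how many
# majority samples are kept; k = 5 nearest neighbours as in the text
balanced_train <- SMOTE(Cars ~ ., data = train_data,
                        perc.over = 600, k = 5, perc.under = 150)

# Inspect the class balance after oversampling
table(balanced_train$Cars)
```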
PREDICTIVE MODELLING
Predicting whether an employee commutes by car is a classification problem; hence classification models must be built. Our dependent variable ‘Cars’ is a binary variable. We will use the balanced data created with SMOTE for training; for validation, the original validation set created via the split will be used. We will use Logistic Regression, kNN and Naïve Bayes to build predictive models of the employees’ mode of transport, and bagging and boosting will be used to see whether they improve the predictions. We will then use model performance metrics to compare the three models’ performance on the validation set.
kNN MODEL
The k-nearest neighbors (KNN) algorithm is a
simple, easy-to-implement supervised
machine learning algorithm that can be used
to solve both classification and regression
problems.
Variable scaling is very important in distance-
based algorithms like kNN, to make sure that
variables with higher numeric values do not
bias the model against variables with lower
numeric values. Hence, all the variables must
be brought to a similar scale/normalised.
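The scaling and kNN fit described above can be sketched as below; the choice of k and the column names are assumptions for illustration, and k is normally tuned via cross-validation:

```r
library(class)

# Scale the continuous predictors so that no variable dominates
# the distance computation; apply the training-set scaling to the test set
num_cols <- c("Age", "Work.Exp", "Salary", "Distance")
train_x <- scale(balanced_train[, num_cols])
test_x  <- scale(test_data[, num_cols],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Fit and predict in one step; k = 5 is illustrative
knn_pred <- knn(train = train_x, test = test_x,
                cl = balanced_train$Cars, k = 5)
table(knn_pred, test_data$Cars)
```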
Running the model, we see an accuracy of 100%, with sensitivity and precision both equal to 1.
NAÏVE BAYES MODEL
Naïve Bayes is a classification technique based on Bayes’ theorem with an assumption of independence among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of a feature in a class is unrelated to the presence of any other feature. The dependent variable in our data is a categorical variable with two levels and all our predictor variables are numeric, so a Naïve Bayes algorithm is applicable in such a case. However, it is not very easy to derive insights from such a model.
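Fitting the Naïve Bayes model with e1071 might look like the sketch below; the `balanced_train` and `test_data` objects are assumed from the earlier preparation steps:

```r
library(e1071)

# Fit Naive Bayes on the SMOTE-balanced training data
nb_model <- naiveBayes(Cars ~ ., data = balanced_train)

# Predict classes on the held-out set and cross-tabulate
nb_pred <- predict(nb_model, newdata = test_data)
table(nb_pred, test_data$Cars)
```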
To check the acceptability of the model, we calculate the performance indices. The confusion matrix on the validation data is quite accurate (99%), with a sensitivity of 99.12% and a specificity of 98.97%; the precision is 98.26%. Calculating the area under the curve, we get an AUC of 99%.
Using this model on the test data, the performance indices are as follows: the confusion matrix on the test data is quite accurate (95.5%), with a sensitivity of 1 and a specificity of 94.8%; the precision is 75%. Calculating the area under the curve, we get an AUC of 97.4%.
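The performance indices above can be computed as in the sketch below; it assumes the Naïve Bayes model and predictions from the previous step, and that the positive class is labelled "1":

```r
library(caret)
library(ROCR)

# Confusion matrix with accuracy, sensitivity, specificity and precision
confusionMatrix(nb_pred, test_data$Cars, positive = "1")

# AUC from the predicted class probabilities
nb_prob  <- predict(nb_model, newdata = test_data, type = "raw")[, "1"]
pred_obj <- prediction(nb_prob, test_data$Cars)
performance(pred_obj, "auc")@y.values[[1]]
```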
Bagging and Boosting Ensemble Models
We used the xgboost algorithm to boost our prediction model. Although balancing the data helps the boosting model, even without balancing, the boosting algorithm performs better than most other algorithms. The boosting model predicts the mode of transport with high accuracy.
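An xgboost fit for this binary problem might look like the sketch below; the feature list and all hyperparameters are illustrative assumptions, not tuned values from the original analysis:

```r
library(xgboost)

# xgboost needs a numeric matrix; the feature names follow the dataset
feat   <- c("Age", "Work.Exp", "Salary", "Distance", "license")
dtrain <- xgb.DMatrix(data  = as.matrix(train_data[, feat]),
                      label = as.numeric(as.character(train_data$Cars)))

# Hyperparameters are illustrative, not tuned
xgb_model <- xgboost(data = dtrain, objective = "binary:logistic",
                     nrounds = 50, max_depth = 3, eta = 0.3, verbose = 0)

# Predict probabilities on the test set and threshold at 0.5
xgb_prob <- predict(xgb_model, as.matrix(test_data[, feat]))
xgb_pred <- ifelse(xgb_prob > 0.5, 1, 0)
table(xgb_pred, test_data$Cars)
```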
INSIGHTS AND RECOMMENDATIONS