---
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
setwd("/Users/jaggiey/Desktop/Desktop Items/University of Texas, Austin - DSBA/Class Project 4")
```
PROJECT EXPLANATORY NOTE
The project seeks to understand employees' choice of mode of transportation. Based on that understanding, it aims to predict whether or not an employee will use a car as the mode of transport to work, while determining the critical factor or variable driving this preference. It addresses descriptive, predictive and prescriptive aspects of machine learning and modelling.
Load Libraries
```{r}
library(scales)
library(ggplot2)
library(DMwR)
library(caTools)
library(corrplot)
library(class)
library(e1071)
library(ROCR)
library(car)
library(caret)
library(ipred)
library(rpart)
library(gbm)
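library(xgboost)
library(lmtest)  # lrtest(), used in the logistic regression section
library(pscl)    # pR2(), used in the logistic regression section
```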
#Age Distribution
```{r}
ggplot(MoT, aes(Age, fill=Age))+geom_histogram(fill="aquamarine2",color="blue", bins = 15,stat="count")+stat_count(geom="text",color="blue",
```
Notes on Age Distribution: The employees' ages are almost normally distributed, with few above 35 years. Most employees are under 30 years (i.e. 80.75%), while the modal age is 26 years.
Note on Commuting Distance: The commuting or travel distance appears close to normally distributed. Most employees travel between 6 km and 14 km; very few travel beyond 20 km.
Boxplot
```{r}
boxplot(MoT)
```
#Bivariate Analysis:
#Mode of Transport based on Age
```{r}
ggplot(MoT, aes(Age, fill=Trans.Mode))+geom_histogram(color="black", bins = 20)+ggtitle("Mode of Transport by Ages ")+xlab("Age of Employees"
```
Observation: Employees younger than 30 years, who constitute the majority, use other modes of transportation, while only employees aged 30 and above use cars. There is a mix of transport modes among employees between 30 and 35 years of age, but those aged 36 and above commute solely by car.
Observation: There are more male employees than female employees. Proportionately, more male employees prefer cars as a mode of transport than f
Observation: Most employees' commuting distance to work is under 14km. Employees who live farther than approximately 14km from work, prefer
Observations: 1. The distribution of the employees' salaries is right-skewed, with most earning less than 20,000.00. Almost all employees earning below 30,000.00 commute by modes other than cars. The few exceptions live farther away, i.e. between 14 and 20 km from work.
2. Few of those earning above 30,000.00 use other modes, and they live less than 15 km from work.
3. Most employees earning 30,000.00 and above live farther away, i.e. 14 km and beyond.
While there appear to be outliers among salary earners, it may not be abnormal considering age and years of experience, considering this ma
#Multivariate Analysis
#Work Experience, Salary and Mode of Transport
```{r}
ggplot(MoT, aes(Work.Exp,Salary, size=Salary, col=Trans.Mode))+geom_point()+geom_smooth()+ggtitle("Work Experience, Salary and Mode of Transp
```
Observations: There is a positive correlation between salary and work experience, i.e. the longer the years of experience, the more the employee earns. High earners are, in turn, more likely to afford commuting by car than those earning less than 30,000.00.
```{r}
ggplot(MoT, aes(Work.Exp, Distance, size = Salary, col = Gender)) + geom_point() +
  ggtitle("Distribution Across Travel Distance, Work Experience, Salary and Gender") +
  xlab("Work Experience") + ylab("Travel Distance in km") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), title = element_text(face = "bold", hjust = 0.5))
```
Observations: The workforce is male-dominated, as there are far fewer women employed. Also, the high echelon is even more may domi
In contrast, at the lower end of the pay-versus-experience spectrum, the gender mix is almost balanced.
Most of the employees (male and female alike) travel less than 15 km to the office. More of lower earning male live after 15 km, although a fe
Observation: Most employees under the age of 30 years travel less than 14 km to the office. A proportion of employees above 30 years commute farther than 14 km.
Most young employees with lower earnings travel a shorter distance to the office, which means they live closer. This may suggest that the mode of transport is greatly influenced by employees' salaries, which would be a reason to live closer to work and reduce travel costs. This reasoning is, however, speculative and inferred only from the exploratory analysis.
Observations: Most employees prefer public transport option, particularly those traveling than 14km. These are coincidentally, most of the e
b) Determining relevant variables for model building. Few approaches are explored including variable importance test(varImp), Chi-square chec
c) The final and, in my judgment, the most challenging aspect of the problem is identifying and resolving multi-collinearity among predic
Age and Work Experience.
The proportion of outliers in the Work Experience variable is 0.96%, which is also immaterial.
The proportion of outliers in the Salary variable is 0.72% and is hence considered inconsequential.
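The outlier proportions quoted above can be checked with the boxplot (1.5 x IQR) rule. The sketch below is illustrative only: the helper name `outlier_share` and the sample vector are assumptions, and the rule actually used to obtain the figures above is not shown in this report.

```r
# Hypothetical helper: share of values flagged by the boxplot (1.5 * IQR) rule.
outlier_share <- function(x) {
  x <- x[!is.na(x)]
  flagged <- boxplot.stats(x)$out   # values lying beyond the whiskers
  length(flagged) / length(x)
}

# Made-up values for illustration (not the MoT data): one extreme value in ten.
toy_salary <- c(8, 9, 10, 11, 12, 13, 14, 15, 16, 60)
outlier_share(toy_salary)
```

With the project data, a call such as `outlier_share(MoT$Salary)` would return the proportion for the Salary variable, assuming the same rule was intended.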
```{r}
# Binary target: 1 = car, 0 = any other mode of transport
MoT$TMode.binary = as.factor(ifelse(MoT$Trans.Mode == "Car", 1, 0))
head(MoT$TMode.binary, 20)

# 70/30 train/test split that preserves the class ratio of the target
set.seed(123)
partition = sample.split(MoT$TMode.binary, SplitRatio = 0.7)
MoT.Train = subset(MoT, partition == TRUE)
MoT.Test = subset(MoT, partition == FALSE)
dim(MoT.Train)
dim(MoT.Test)
table(MoT.Train$TMode.binary)
prop.table(table(MoT.Train$TMode.binary))
table(MoT.Test$TMode.binary)
prop.table(table(MoT.Test$TMode.binary))
class(MoT.Train$TMode.binary)

#### SMOTE
MoT.Trainsmote = SMOTE(TMode.binary ~ ., data = MoT.Train, perc.over = 3000, k = 5, perc.under = 500)
dim(MoT.Trainsmote)
table(MoT.Trainsmote$TMode.binary)
prop.table(table(MoT.Trainsmote$TMode.binary))
## Car travel rate equals 0.08. The data set is unbalanced and may affect the model performance particularly as regards model sensitivity or r
```
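To see what class balance the settings `perc.over = 3000` and `perc.under = 500` should produce, the arithmetic can be sketched as below. This assumes that `DMwR::SMOTE` creates `perc.over/100` synthetic cases per minority case and then samples `perc.under/100` majority cases per synthetic case, which is my reading of the package documentation. The function name `smote_sizes` and the minority count of 10 are hypothetical.

```r
# Hypothetical helper: expected class sizes after SMOTE under the stated assumption.
smote_sizes <- function(n_minority, perc.over, perc.under) {
  synthetic <- n_minority * perc.over / 100   # new minority cases generated
  majority  <- synthetic * perc.under / 100   # majority cases sampled
  minority  <- n_minority + synthetic         # original plus synthetic cases
  c(minority = minority,
    majority = majority,
    minority_share = minority / (minority + majority))
}

# Made-up minority count, with the percentages used in the SMOTE call above.
sizes <- smote_sizes(n_minority = 10, perc.over = 3000, perc.under = 500)
sizes
```

Under this assumption the minority share depends only on the two percentages and not on the number of minority cases, so the `prop.table()` output on the SMOTE data is worth comparing with the balance that was intended.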
```{r}
# Chi-square test of independence between transport mode and each categorical predictor
Gender_chistat = chisq.test(MoT$Trans.Mode, MoT$Gender)
Engineer_chistat = chisq.test(MoT$Trans.Mode, MoT$Engineer)
MBA_chistat = chisq.test(MoT$Trans.Mode, MoT$MBA)
license_chistat = chisq.test(MoT$Trans.Mode, MoT$license)
Cat_p.values = c(Gender_chistat$p.value, Engineer_chistat$p.value, MBA_chistat$p.value, license_chistat$p.value)
Cat_parameters = c(Gender_chistat$parameter, Engineer_chistat$parameter, MBA_chistat$parameter, license_chistat$parameter)
Cat_statistics = c(Gender_chistat$statistic, Engineer_chistat$statistic, MBA_chistat$statistic, license_chistat$statistic)
Categorical_names = c("Gender", "Engineer", "MBA", "License")
Cat_relevance = data.frame(Categorical_names, Cat_statistics, Cat_parameters, Cat_p.values)
print(Cat_relevance)
```
From the variable relevance test, the result shows that only license is a significantly relevant variable. We fail to reject the null hypothesis of independence for Gender, Engineer and MBA; they are therefore considered not significant to the model-building process. For driving license, the null hypothesis is rejected: it is significant and relevant in influencing employees' transport mode preference.
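The decision rule behind this reading of the chi-square results can be sketched on two made-up contingency tables. The counts, the 0.05 significance level and the helper name `is_relevant` are assumptions for illustration and are not taken from the MoT data.

```r
# Made-up 2x2 counts: rows are transport mode, columns are licence status.
linked <- matrix(c(40, 10, 10, 40), nrow = 2,
                 dimnames = list(Mode = c("Other", "Car"),
                                 License = c("No", "Yes")))
unlinked <- matrix(c(25, 25, 25, 25), nrow = 2,
                   dimnames = dimnames(linked))

alpha <- 0.05  # assumed significance level

# Hypothetical helper: TRUE when independence is rejected at level alpha.
is_relevant <- function(tab) chisq.test(tab)$p.value < alpha

is_relevant(linked)    # association present: reject the null, keep the variable
is_relevant(unlinked)  # no association: fail to reject the null, drop the variable
```

A small p-value leads to rejecting independence, which is what marks a variable as relevant; a large p-value means we fail to reject it.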
KNN Model
```{r}
MoT
KNN = MoT
names(KNN)
head(KNN)
levels(KNN$Gender) = c("1", "0")
str(KNN)
# KNNaccuracy, KNNsensitivity and KNNspecificity come from the KNN confusion matrix (not shown here)
KNNLabel = c("Performance Metric", "Accuracy", "Sensitivity", "Specificity")
KNNOutput = c("Performance Output", KNNaccuracy, KNNsensitivity, KNNspecificity)
KNNResult_Table = data.frame(KNNLabel, KNNOutput)
print(KNNResult_Table)
```
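The three measures reported in the result table can be derived from a confusion matrix as sketched below. The helper name `class_metrics` and the toy vectors are assumptions; class "1" (car) is treated as the positive class, in line with `TMode.binary`.

```r
# Hypothetical helper: accuracy, sensitivity and specificity for 0/1 labels,
# with "1" (car) as the positive class.
class_metrics <- function(actual, predicted) {
  tab <- table(factor(actual, levels = c(0, 1)),
               factor(predicted, levels = c(0, 1)))
  TP <- tab["1", "1"]  # cars predicted as cars
  TN <- tab["0", "0"]  # other modes predicted as other modes
  FP <- tab["0", "1"]  # other modes predicted as cars
  FN <- tab["1", "0"]  # cars predicted as other modes
  c(Accuracy    = (TP + TN) / sum(tab),
    Sensitivity = TP / (TP + FN),
    Specificity = TN / (TN + FP))
}

# Made-up labels for illustration (not model output).
toy_actual    <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
toy_predicted <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 1)
metrics <- class_metrics(toy_actual, toy_predicted)
metrics
```

Because car users are the minority class, sensitivity is the measure most exposed to the class imbalance noted earlier.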
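#Naive Bayes Modelling
```{r}
# Drop columns 9 and 10 so that the target becomes column 9
# (assuming TMode.binary is column 11 of MoT.Train)
NBTrain = MoT.Train[, -c(9, 10)]
NBTest = MoT.Test[, -c(9, 10)]
names(NBTrain)
NBModel = naiveBayes(x = NBTrain[, 1:8], y = NBTrain[, 9])
predMode = predict(NBModel, NBTest)
# Posterior probability of class "1" (car), kept unthresholded for the ROC curve
predModeProb = predict(NBModel, NBTest, type = "raw")[, 2]
predModeVal = ifelse(predModeProb > 0.5, 1, 0)
NBTest.tab = table(NBTest$TMode.binary, predModeVal)
NBTest.tab
# The ROC curve needs continuous scores; thresholded 0/1 values collapse it to one point
NBRoc = prediction(predModeProb, NBTest$TMode.binary)
nb.perf = performance(NBRoc, "tpr", "fpr")
plot(nb.perf, colorize = TRUE, main = "Naive Bayes ROC Plot - Mode of Transport")
```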
str(Logitrain)
LogModel0 is affected by multi-collinearity because Age and Work.Exp correlate highly with Salary. The model is therefore improved by dropping both the Age and Work.Exp variables.
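Multi-collinearity of this kind is commonly quantified with the variance inflation factor, VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others. The sketch below computes it in base R on synthetic data. The data, the helper name `vif_manual` and the rule of thumb of 10 are assumptions, and this is not necessarily the check used for LogModel0.

```r
set.seed(42)
n <- 200

# Synthetic predictors (not the MoT data): Age is built from Work.Exp,
# Distance is generated independently of both.
Work.Exp <- runif(n, 0, 20)
Age      <- 22 + Work.Exp + rnorm(n, sd = 1)
Distance <- runif(n, 3, 24)
toy <- data.frame(Age, Work.Exp, Distance)

# Hypothetical helper: VIF of each column from its R^2 against the others.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy)
vifs
```

Values well above 10 for Age and Work.Exp, against a value near 1 for Distance, would point to the same kind of redundancy described above. The `vif()` function in the already loaded car package reports a comparable diagnostic from a fitted model.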
```{r}
# ROC AND AUC
# ROC
LogRoc = prediction(Logpred.test, Logitest$TMode.binary)
Log.perf = performance(LogRoc, "tpr", "fpr")
plot(Log.perf, colorize = TRUE, main = "ROC Plot LogiTest - Mode of Transport")
# AUC
Log.perf = performance(LogRoc, "auc")
auc = Log.perf@y.values
auc = percent(as.numeric(auc), accuracy = 0.01)
auc
```
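As a cross-check on the AUC figure, the area under the ROC curve equals the probability that a randomly chosen car user receives a higher score than a randomly chosen non-user, which can be computed from ranks alone. The sketch below is base R on made-up scores; the helper name `auc_rank` is an assumption.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
# scores: predicted probabilities; labels: 0/1 with 1 as the positive class.
auc_rank <- function(scores, labels) {
  r  <- rank(scores)            # average ranks are used for ties
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up scores for illustration (not model output).
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
toy_labels <- c(0, 0, 1, 1)
auc_rank(toy_scores, toy_labels)
```

Applied to `Logpred.test` and the numeric version of `Logitest$TMode.binary`, this should agree with the value returned by `performance(LogRoc, "auc")`.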
**Note:** Distance and Salary ranked the highest. Both also have the least correlation based on the earlier EDA.
Improving on LogModel by addressing multicollinearity and enhancing model performance.
```{r}
# AUC
Logmodel.comparison = data.frame(Loglabel, LogModel1.perf, LogModel2.perf)
print(Logmodel.comparison)
```
LogModel
```{r}
# All variables' combined log-likelihood of -13.94 is greater than the intercept log-likelihood of -82.94 (indicative of a good model)
lrtest(LogModel)
pR2(LogModel)
# exp(coef) gives the baseline odds for the intercept and odds ratios for the predictors;
# odds/(1+odds) is a probability only for the intercept term
odds = exp(coef(LogModel))
prob = odds / (1 + odds)
odds_prob1 = data.frame(odds, prob)
print(odds_prob1)
```
LogModel2
```{r}
# All variables' combined log-likelihood of -14.40 is greater than the intercept log-likelihood of -82.94 (indicative of a better model than LogModel above, albeit slightly)
lrtest(LogModel2)
```
The odds and probabilities in both models show Salary and Distance as the significant variables influencing mode of transport preference.
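How odds ratios and predicted probabilities are read from a logistic model can be sketched on synthetic data. Everything below is an assumption for illustration: the simulated Salary and Distance values, the coefficients used to generate them, and the example employee. None of it reproduces LogModel or LogModel2.

```r
set.seed(1)
n <- 500

# Synthetic data (not the MoT data): car use becomes more likely
# as Salary and Distance increase.
toy <- data.frame(Salary = runif(n, 5, 60), Distance = runif(n, 3, 24))
toy$Car <- rbinom(n, 1, plogis(-8 + 0.12 * toy$Salary + 0.25 * toy$Distance))

fit <- glm(Car ~ Salary + Distance, data = toy, family = binomial)

# Odds ratios: multiplicative change in the odds of car use per unit increase.
odds_ratios <- exp(coef(fit))[c("Salary", "Distance")]
odds_ratios

# Predicted probability of car use for one hypothetical employee.
new_emp <- data.frame(Salary = 40, Distance = 18)
p_car <- predict(fit, newdata = new_emp, type = "response")
p_car
```

An odds ratio above 1 means the odds of commuting by car rise with that variable. A probability for an individual employee comes from `predict(..., type = "response")` rather than from transforming a single coefficient.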
VI. MODEL BUILDING 2 - BAGGING, GRADIENT BOOSTING AND EXTREME GRADIENT BOOSTING
#Bagging
```{r}
```
#Gradient Boosting
```{r}
# Predictions: GBM
GBLabel = c("Performance Metric", "Accuracy", "Sensitivity", "Specificity")
GBOutput = c("Performance Output", GBTest.Accuracy, GBTest.Sensitivity, GBTest.Specificity)
GBResult_Table = data.frame(GBLabel, GBOutput)
print(GBResult_Table)
```
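#XGB - Extreme Gradient Boosting
```{r}
# Written against the classic xgboost(data = , label = ) interface
names(MoT.Train)
# Predictors: columns 1 and 5 to 7; target: column 11
xtreme.trainX = as.matrix(MoT.Train[, c(1, 5:7)])
# xgboost needs numeric 0/1 labels, so the factor is converted via character
xtreme.trainY = as.numeric(as.character(MoT.Train[, 11]))
xtreme.testX = as.matrix(MoT.Test[, c(1, 5:7)])
xtreme.testY = as.numeric(as.character(MoT.Test[, 11]))
summary(xtreme.trainX)
# nfold is an argument of xgb.cv(), not xgboost(), so it is dropped here
xgb.tuner = xgboost(
  data = xtreme.trainX,
  label = xtreme.trainY,
  eta = 0.0001,
  max_depth = 3,
  min_child_weight = 3,
  nrounds = 10000,
  objective = "binary:logistic",
  verbose = 0,
  early_stopping_rounds = 10
)
# Predictions are kept in their own vector; `$` cannot add a column to a matrix
xtreme.pred = predict(xgb.tuner, xtreme.testX)
xtreme.test.tab = table(xtreme.testY, xtreme.pred > 0.5)
xtreme.test.tab
```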
VII. COMPARATIVE ANALYSIS OF MODEL AND RECOMMENDATION
```{r}
Model_Name = as.character(c("KNN", "Naive Bayes", "Logistic Regression", "Bagging", "Gradient Boosting", "Extreme Gradient Boosting"))
```
Comparative Notes
Following the building of the bagging and boosting models, and as shown in the performance table above, both the bagging and gradient boosting models fit the test data perfectly. However, the extreme gradient boosting model lagged slightly behind in both accuracy and specificity. Hence there is nothing to choose between the Bagging model and the standard Gradient Boosting model in this case.
Likewise, the KNN and Logistic Regression models both return perfect scores across measures and therefore tie in performance.
Comparing the two classes of models, it is safe to conclude that in this project the best models of each group match in performance across the board.
1. The workforce is largely young, with most employees under the age of 30 years.
4. Most employees do not choose cars but prefer other modes of transport.
5. Salary earned and distance traveled to work largely influence the choice of the mode of transportation; logically, this is perhaps a reflection of expediency (best suited to distance) and means (affordability).
To understand and better predict the individual employee's preference of transportation mode, any of the KNN, Logistic Regression, Bagging or Boosting models may be used, as they produced exactly the same results across the relevant measures.
From the synopsis provided ab initio, it is not clear what decision the project is focused on. However, if it is related to employee welfare, the decision makers may consider: