
Classification

Random Forest

Random Forest
#Random Forest model
library(randomForest)   #provides randomForest()
modelrf <- randomForest(as.factor(left) ~ . , data = trainSplit, do.trace = TRUE)
modelrf

The random forest output tells us that it has built 500 trees and tried 3 variables at each split. The out-of-bag (OOB) estimate of the generalization error is the error rate of the out-of-bag classifier on the training set. The OOB estimate is about as accurate as using a test set of the same size as the training set, so the OOB error estimate removes the need for a set-aside test set.
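
As a minimal sketch (reusing the modelrf object fitted above), the final OOB error can also be read directly from the fitted object:

#Reading the OOB error after the last (500th) tree
#modelrf$err.rate has one row per tree and an "OOB" column
modelrf$err.rate[modelrf$ntree, "OOB"]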

Random Forest
#Checking variable importance in Random Forest
importance(modelrf)

varImpPlot(modelrf)

The variable importance plot displays the variables sorted by MeanDecreaseGini.
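
As a small sketch built on the importance() output above, the same ranking can be printed as a table:

#Predictors ordered by decreasing MeanDecreaseGini
imp <- importance(modelrf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]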

Random Forest
# Prediction and Model Evaluation using Confusion Matrix
library(caret)   #provides confusionMatrix()
predrf_tr <- predict(modelrf, trainSplit)    #Train Data
predrf_test <- predict(modelrf, testSplit)   #Test Data

confusionMatrix(predrf_tr, trainSplit$left)    #Train Data
confusionMatrix(predrf_test, testSplit$left)   #Test Data

The confusion matrix on the Train data gives an accuracy of 99%, and the confusion matrix on the Test data gives a similarly high accuracy of 99%.

As we observe, the model shows similar performance on the Train and Test data, which assures us of the stability of our Random Forest model.
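
As a minimal sketch (reusing the predictions above), the two accuracies can also be pulled out of caret's confusionMatrix objects for a quick side-by-side check:

#Extracting just the accuracy from each confusion matrix
acc_tr <- confusionMatrix(predrf_tr, trainSplit$left)$overall["Accuracy"]
acc_te <- confusionMatrix(predrf_test, testSplit$left)$overall["Accuracy"]
c(train = acc_tr, test = acc_te)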
Comparing ROC curves for Decision Tree and Random Forest

# Comparing ROC curves for the Decision Tree and the Random Forest
library(pROC)   #provides roc(), plot.roc() and roc.test()


#Decision Tree ROC (predtest holds the decision-tree predictions from the earlier section)
auc1 <- roc(as.numeric(testSplit$left), as.numeric(predtest))
plot(auc1, col = 'blue', main = paste('AUC:', round(auc1$auc[[1]], 3)))

#Random Forest ROC (predrf_test was created on the previous slide)
aucrf <- roc(as.numeric(testSplit$left), as.numeric(predrf_test), ci = TRUE)
plot(aucrf, ylim = c(0, 1), print.thres = TRUE,
     main = paste('Random Forest AUC:', round(aucrf$auc[[1]], 3)), col = 'blue')

#Comparing both ROC curves
plot(aucrf, ylim = c(0, 1), main = 'ROC Comparison: RF (blue), C5.0 (black)', col = 'blue')
par(new = TRUE)
plot(auc1)

The ROC curve for the Random Forest is better than that of the Decision Tree.
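
As a sketch (assuming the auc1 and aucrf objects built above), pROC can also compare the two curves formally:

#DeLong's test for the difference between the two AUCs
roc.test(auc1, aucrf, method = "delong")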

Classification Model
Naïve Bayes

Naïve Bayes
#Naive Bayes
library(e1071)   #provides naiveBayes()
modelnb <- naiveBayes(as.factor(left) ~ . , data = trainSplit)
modelnb

The output shows the a-priori probabilities of the classes together with the conditional probabilities for each variable in the dataset.
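
As a small sketch, the pieces behind this printed output can be inspected directly on the fitted e1071 object:

modelnb$apriori   #a-priori class distribution
modelnb$tables    #conditional probability tables, one per predictor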
Naïve Bayes
#Performance of Naïve Bayes using Confusion Matrix
prednb_tr <- predict(modelnb, trainSplit)    #Train Data
prednb_test <- predict(modelnb, testSplit)   #Test Data

confusionMatrix(prednb_tr, trainSplit$left)    #Train Data
confusionMatrix(prednb_test, testSplit$left)   #Test Data

The confusion matrix on the Train data gives an accuracy of 78.84%, while the confusion matrix on the Test data gives an accuracy of 78.58%.

As we observe, the model shows similar performance on the Train and Test data, which assures us of the stability of our Naïve Bayes model.

Classification Model
kNN Algorithm

kNN Algorithm
#Data Preparation for kNN Algorithm

library(dummies)
#Creating dummy variables for the factor variables
dummy_df = dummy.data.frame(hr_data1[, c('role_code', 'salary.code')])

hr_data2 = hr_data1
hr_data2 = cbind.data.frame(hr_data2, dummy_df)

#Removing role_code and salary.code since we have created dummy variables
hr_data2 = hr_data2[, !(names(hr_data2) %in% c('role_code', 'salary.code'))]

#Converting variables to numeric datatype
hr_data2$Work_accident = as.numeric(hr_data2$Work_accident)
hr_data2$promotion_last_5years = as.numeric(hr_data2$promotion_last_5years)
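
As a quick sanity check (a sketch, not part of the original flow), we can confirm that every remaining column is numeric before scaling:

#Every column of hr_data2 should now be numeric or integer
sapply(hr_data2, class)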

kNN Algorithm
#Data Preparation for kNN Algorithm

#Scale the variables and check their final structure
X = hr_data2[, !(names(hr_data2) %in% c('left'))]
hr_data2_scaled = as.data.frame(scale(X))

str(hr_data2_scaled)

#Splitting the data for the model building
hr_train <- hr_data2_scaled[splitIndex,]
hr_test <- hr_data2_scaled[-splitIndex,]

hr_train_labels <- hr_data2[splitIndex, 'left']
hr_test_labels <- hr_data2[-splitIndex, 'left']
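
As a small sketch, it is worth confirming the split sizes and that the class balance is similar in both parts:

#Split sizes and class proportions
c(train = nrow(hr_train), test = nrow(hr_test))
prop.table(table(hr_train_labels))
prop.table(table(hr_test_labels))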

kNN Algorithm
#Applying kNN Algorithm on the dataset
library(class)
library(gmodels)

test_pred_1 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = 1)
CrossTable(x = hr_test_labels, y = test_pred_1, prop.chisq = FALSE)

From this crosstab we can compute the accuracy of the model for k = 1:

Accuracy = (TP + TN) / Total = (3311 + 1030) / 4499 = 96.48%

kNN Algorithm
#Applying kNN Algorithm on the dataset

Just as we calculated the accuracy for k = 1, we can calculate it for k = 5, 10, 50, 100 and 122 (a loop that does this is sketched below). The accuracies for these values of k are summarized here:

K Accuracy
5 94.46%
10 94.17%
50 90.19%
100 86.48%
122 85.06%

From the accuracy table above, we can observe that the accuracy goes down as the value of k increases.
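
As a minimal sketch (reusing the hr_train, hr_test and label objects from the earlier slides), the whole table can be reproduced with a short loop:

#Test-set accuracy for several values of k
for (k in c(1, 5, 10, 50, 100, 122)) {
  pred <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = k)
  cat(sprintf("k = %3d  accuracy = %.2f%%\n", k, 100 * mean(pred == hr_test_labels)))
}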

kNN Algorithm
# Thumb rule to decide on k for k-NN: sqrt(n)/2
k = sqrt(nrow(hr_train))/2
k
# 51.2347 (which can be rounded to 51)

test_pred_rule <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = round(k))


CrossTable(x = hr_test_labels, y = test_pred_rule, prop.chisq = FALSE)
# accuracy = 4050/4499 = 90.02%

# Another method to determine k for k-NN: cross-validation with caret
set.seed(400)
ct <- trainControl(method = "repeatedcv", repeats = 3)
fit <- train(as.factor(left) ~ ., data = hr_data2, method = "knn", trControl = ct,
             preProcess = c("center", "scale"), tuneLength = 20)
fit
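
As a sketch (assuming the fit object above), the selected k and the accuracy profile can be read straight off the caret object:

fit$bestTune   #the k with the highest cross-validated accuracy
plot(fit)      #accuracy across the tuneLength = 20 candidate values of k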

# Checking accuracy of the model with k = 7
test_pred_7 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = 7)
CrossTable(x = hr_test_labels, y = test_pred_7, prop.chisq = FALSE)
# accuracy = 4357/4499 = 96.84%

#or alternatively we can use the command below
confusionMatrix(test_pred_7, as.factor(hr_test_labels))

Output on next slide..


The caret output indicates that k = 7 is the best value for this data, and it is better to go with this value because it has been cross-validated.

Step 6
Model Summarization

Summary of Model Performance

Model Accuracy
Decision Tree 97.09%
Random Forest 99%
Naïve Bayes 78.84%
kNN Algorithm (Using k = 7) 96.84%

Appendix
Packages used for the Classification Analysis:

•data.table
•reshape2
•randomForest # for Random Forest
•party # for decision tree
•rpart # for rpart
•rpart.plot # for rpart plots
•lattice # used for data visualization
•caret # for data pre-processing, k tuning and confusion matrices
•pROC # for ROC curves
•corrplot # for correlation plot
•e1071 # for Naïve Bayes (also required by caret's confusionMatrix)
•RColorBrewer
•dummies # for dummy variables
•class # for kNN
•gmodels # for CrossTable

Thank You.
