
Classification

Random Forest

Random Forest
#Random Forest model
library(randomForest)   #provides randomForest()
modelrf <- randomForest(as.factor(left) ~ . , data = trainSplit, do.trace = TRUE)
modelrf

The random forest output tells us that it has built 500 trees and tried 3 variables at each split. The out-of-bag (OOB) estimate of the generalization error is the error rate of the out-of-bag classifier on the training set. The OOB estimate is about as accurate as using a test set of the same size as the training set, so the OOB error estimate removes the need for a set-aside test set.
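
As a minimal sketch (reusing the modelrf object fitted above), the final OOB error can also be read directly from the fitted object:

#Reading the OOB error after the last (500th) tree
#modelrf$err.rate has one row per tree and an "OOB" column
modelrf$err.rate[modelrf$ntree, "OOB"]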

Random Forest
#Checking variable importance in Random Forest
importance(modelrf)

varImpPlot(modelrf)

The variable importance plot displays the variables sorted by MeanDecreaseGini.
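
As a small sketch built on the importance() output above, the same ranking can be printed as a table:

#Predictors ordered by decreasing MeanDecreaseGini
imp <- importance(modelrf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]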

Random Forest
# Prediction and Model Evaluation using Confusion Matrix
library(caret)   #provides confusionMatrix()
predrf_tr <- predict(modelrf, trainSplit)    #Train Data
predrf_test <- predict(modelrf, testSplit)   #Test Data

confusionMatrix(predrf_tr, trainSplit$left)    #Train Data
confusionMatrix(predrf_test, testSplit$left)   #Test Data

The confusion matrix on the Train data gives an accuracy of 99%, and the confusion matrix on the Test data gives a similarly high accuracy of 99%.

As we observe, the model shows similar performance on the Train and Test data, which assures us of the stability of our Random Forest model.
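
As a minimal sketch (reusing the predictions above), the two accuracies can also be pulled out of caret's confusionMatrix objects for a quick side-by-side check:

#Extracting just the accuracy from each confusion matrix
acc_tr <- confusionMatrix(predrf_tr, trainSplit$left)$overall["Accuracy"]
acc_te <- confusionMatrix(predrf_test, testSplit$left)$overall["Accuracy"]
c(train = acc_tr, test = acc_te)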
Comparing ROC curves for Decision Tree and Random Forest

# Comparing ROC curves for the Decision Tree and the Random Forest
library(pROC)   #provides roc(), plot.roc() and roc.test()


#Decision Tree ROC (predtest holds the decision-tree predictions from the earlier section)
auc1 <- roc(as.numeric(testSplit$left), as.numeric(predtest))
plot(auc1, col = 'blue', main = paste('AUC:', round(auc1$auc[[1]], 3)))

#Random Forest ROC (predrf_test was created on the previous slide)
aucrf <- roc(as.numeric(testSplit$left), as.numeric(predrf_test), ci = TRUE)
plot(aucrf, ylim = c(0, 1), print.thres = TRUE,
     main = paste('Random Forest AUC:', round(aucrf$auc[[1]], 3)), col = 'blue')

#Comparing both ROC curves
plot(aucrf, ylim = c(0, 1), main = 'ROC Comparison: RF (blue), C5.0 (black)', col = 'blue')
par(new = TRUE)
plot(auc1)

The ROC curve for the Random Forest is better than that of the Decision Tree.
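
As a sketch (assuming the auc1 and aucrf objects built above), pROC can also compare the two curves formally:

#DeLong's test for the difference between the two AUCs
roc.test(auc1, aucrf, method = "delong")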

Classification Model
Naïve Bayes

Naïve Bayes
#Naive Bayes
library(e1071)   #provides naiveBayes()
modelnb <- naiveBayes(as.factor(left) ~ . , data = trainSplit)
modelnb

The output shows the a-priori probabilities of the classes together with the conditional probabilities for each variable in the dataset.
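
As a small sketch, the pieces behind this printed output can be inspected directly on the fitted e1071 object:

modelnb$apriori   #a-priori class distribution
modelnb$tables    #conditional probability tables, one per predictor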
Naïve Bayes
#Performance of Naïve Bayes using Confusion Matrix
prednb_tr <- predict(modelnb, trainSplit)    #Train Data
prednb_test <- predict(modelnb, testSplit)   #Test Data

confusionMatrix(prednb_tr, trainSplit$left)    #Train Data
confusionMatrix(prednb_test, testSplit$left)   #Test Data

The confusion matrix on the Train data gives an accuracy of 78.84%, while the confusion matrix on the Test data gives an accuracy of 78.58%.

As we observe, the model shows similar performance on the Train and Test data, which assures us of the stability of our Naïve Bayes model.

Classification Model
kNN Algorithm

kNN Algorithm
#Data Preparation for kNN Algorithm

library(dummies)
#Creating dummy variables for the factor variables
dummy_df = dummy.data.frame(hr_data1[, c('role_code', 'salary.code')])

hr_data2 = hr_data1
hr_data2 = cbind.data.frame(hr_data2, dummy_df)

#Removing role_code and salary.code since we have created dummy variables
hr_data2 = hr_data2[, !(names(hr_data2) %in% c('role_code', 'salary.code'))]

#Converting variables to numeric datatype
hr_data2$Work_accident = as.numeric(hr_data2$Work_accident)
hr_data2$promotion_last_5years = as.numeric(hr_data2$promotion_last_5years)
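
As a quick sanity check (a sketch, not part of the original flow), we can confirm that every remaining column is numeric before scaling:

#Every column of hr_data2 should now be numeric or integer
sapply(hr_data2, class)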

kNN Algorithm
#Data Preparation for kNN Algorithm

#Scale the variables and check their final structure
X = hr_data2[, !(names(hr_data2) %in% c('left'))]
hr_data2_scaled = as.data.frame(scale(X))

str(hr_data2_scaled)

#Splitting the data for the model building
hr_train <- hr_data2_scaled[splitIndex,]
hr_test <- hr_data2_scaled[-splitIndex,]

hr_train_labels <- hr_data2[splitIndex, 'left']
hr_test_labels <- hr_data2[-splitIndex, 'left']
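
As a small sketch, it is worth confirming the split sizes and that the class balance is similar in both parts:

#Split sizes and class proportions
c(train = nrow(hr_train), test = nrow(hr_test))
prop.table(table(hr_train_labels))
prop.table(table(hr_test_labels))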

kNN Algorithm
#Applying kNN Algorithm on the dataset
library(class)
library(gmodels)

test_pred_1 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = 1)
CrossTable(x = hr_test_labels, y = test_pred_1, prop.chisq = FALSE)

From this crosstab we can compute the accuracy of the model for k = 1:

Accuracy = (TP + TN) / Total = (3311 + 1030) / 4499 = 96.48%

kNN Algorithm
#Applying kNN Algorithm on the dataset

Just as we calculated the accuracy for k = 1, we can calculate it for k = 5, 10, 50, 100 and 122 (a loop that does this is sketched below). The accuracies for these values of k are summarized here:

K Accuracy
5 94.46%
10 94.17%
50 90.19%
100 86.48%
122 85.06%

From the accuracy table above, we can observe that the accuracy goes down as the value of k increases.
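
As a minimal sketch (reusing the hr_train, hr_test and label objects from the earlier slides), the whole table can be reproduced with a short loop:

#Test-set accuracy for several values of k
for (k in c(1, 5, 10, 50, 100, 122)) {
  pred <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = k)
  cat(sprintf("k = %3d  accuracy = %.2f%%\n", k, 100 * mean(pred == hr_test_labels)))
}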

kNN Algorithm
# Thumb rule to decide on k for k-NN: sqrt(n)/2
k = sqrt(nrow(hr_train))/2
k
# 51.2347 (which can be rounded to 51)

test_pred_rule <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = round(k))


CrossTable(x = hr_test_labels, y = test_pred_rule, prop.chisq = FALSE)
# accuracy = 4050/4499 = 90.02%

# Another method to determine k for k-NN: cross-validation with caret
set.seed(400)
ct <- trainControl(method = "repeatedcv", repeats = 3)
fit <- train(as.factor(left) ~ ., data = hr_data2, method = "knn", trControl = ct,
             preProcess = c("center", "scale"), tuneLength = 20)
fit
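
As a sketch (assuming the fit object above), the selected k and the accuracy profile can be read straight off the caret object:

fit$bestTune   #the k with the highest cross-validated accuracy
plot(fit)      #accuracy across the tuneLength = 20 candidate values of k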

# Checking accuracy of the model with k = 7
test_pred_7 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = 7)
CrossTable(x = hr_test_labels, y = test_pred_7, prop.chisq = FALSE)
# accuracy = 4357/4499 = 96.84%

#or alternatively we can use the command below
confusionMatrix(test_pred_7, as.factor(hr_test_labels))

Output on next slide..


The caret output indicates that k = 7 is the best value for this data, and it is better to go with this value because it has been cross-validated.

Step 6
Model Summarization

Summary of Model Performance

Model Accuracy
Decision Tree 97.09%
Random Forest 99%
Naïve Bayes 78.84%
kNN Algorithm (Using k = 7) 96.84%

Appendix
Packages used for the Classification Analysis:

•data.table
•reshape2
•randomForest # for Random Forest
•party # for decision tree
•rpart # for rpart
•rpart.plot # for rpart plots
•lattice # used for data visualization
•caret # for data pre-processing, k tuning and confusion matrices
•pROC # for ROC curves
•corrplot # for correlation plot
•e1071 # for Naïve Bayes (also required by caret's confusionMatrix)
•RColorBrewer
•dummies # for dummy variables
•class # for kNN
•gmodels # for CrossTable

Thank You.
