10/15/21, 1:30 PM RPubs - kNN using R caret package

kNN Using caret R package


Vijayakumar Jawaharlal
April 29, 2014
I recently became familiar with the caret package. caret is an R package that provides a common interface to nearly
150 machine learning algorithms. It also provides functions for sampling the data (for training and testing),
preprocessing, and evaluating models.
To get familiar with the caret package, see the following URLs:
https://www.youtube.com/watch?v=7Jbb2ItbTC4
http://caret.r-forge.r-project.org
http://cran.r-project.org/web/packages/caret/vignettes/caret.pdf
I am going to use the same dataset as in the previous examples (Smarket from the ISLR package). The intention of this
exercise is to get familiar with the caret package.

Sampling
library(ISLR)

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

set.seed(300)

# Splitting the data into training and test sets using the createDataPartition() function from caret

indxTrain <- createDataPartition(y = Smarket$Direction,p = 0.75,list = FALSE)

training <- Smarket[indxTrain,]


testing <- Smarket[-indxTrain,]

# Checking the class distribution in the original data and in the partitioned data

prop.table(table(training$Direction)) * 100

##

## Down Up

## 48.19 51.81

prop.table(table(testing$Direction)) * 100

##

## Down Up

## 48.08 51.92

prop.table(table(Smarket$Direction)) * 100

https://rpubs.com/njvijay/16444 1/17

##

## Down Up

## 48.16 51.84

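The createDataPartition function creates the sample with very little effort and, as the proportions above show, keeps
the class distribution of Direction nearly identical in the training and test sets. We don't need to write a complex
function as in the previous example.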
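To make the idea of a class-preserving split concrete, here is a minimal base-R sketch of stratified sampling: 75% of the rows are drawn within each class separately. It is an illustration only, not caret's implementation, and the `toy` data frame is a hypothetical stand-in for Smarket.

```r
# Base-R sketch of a stratified 75/25 split (illustrative; not caret's own code).
set.seed(300)

# Hypothetical toy data standing in for Smarket: a two-level outcome and one predictor.
toy <- data.frame(
  Direction = factor(rep(c("Down", "Up"), times = c(48, 52))),
  Lag1      = rnorm(100)
)

# Sample 75% of the row indices within each class, then combine them.
rows_by_class <- split(seq_len(nrow(toy)), toy$Direction)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.75 * length(rows)))
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions (in percent) stay close to the original 48/52 split.
prop.table(table(toy_train$Direction)) * 100
prop.table(table(toy_test$Direction)) * 100
```

Sampling within each class is what keeps the Down/Up proportions of the two partitions so close to those of the full data.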

Preprocessing
kNN is distance based, so the variables need to be normalized or scaled; otherwise variables measured on larger scales
dominate the distance. caret provides facilities to preprocess data. I am going to choose centring and scaling.

trainX <- training[,names(training) != "Direction"]

preProcValues <- preProcess(x = trainX,method = c("center", "scale"))

preProcValues

##

## Call:

## preProcess.default(x = trainX, method = c("center", "scale"))

##

## Created from 938 samples and 8 variables

## Pre-processing: centered, scaled
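As a rough illustration of what centring and scaling do, the base-R sketch below estimates the means and standard deviations on a training set and applies the same constants to a test set. The data frames `train_x` and `test_x` are hypothetical stand-ins for the Smarket predictors. With caret, the stored preProcValues object would typically be applied with predict(preProcValues, newdata) instead.

```r
# Hypothetical numeric predictors standing in for the Smarket columns.
set.seed(1)
train_x <- data.frame(Lag1   = rnorm(50, mean = 2,  sd = 3),
                      Volume = rnorm(50, mean = 10, sd = 0.5))
test_x  <- data.frame(Lag1   = rnorm(20, mean = 2,  sd = 3),
                      Volume = rnorm(20, mean = 10, sd = 0.5))

# Estimate the centring and scaling constants on the training set only.
centers <- colMeans(train_x)
scales  <- apply(train_x, 2, sd)

# Apply the same constants to both sets so they share one scale.
train_scaled <- scale(train_x, center = centers, scale = scales)
test_scaled  <- scale(test_x,  center = centers, scale = scales)

colMeans(train_scaled)      # numerically zero for every column
apply(train_scaled, 2, sd)  # one for every column
```

The test set is transformed with the training constants, so its columns will be close to, but not exactly, mean 0 and standard deviation 1.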

Training and train control


set.seed(400)

ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary)

knnFit <- train(Direction ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c(
"center","scale"), tuneLength = 20)

#Output of kNN fit

knnFit


## k-Nearest Neighbors

##

## 938 samples

## 8 predictors
## 2 classes: 'Down', 'Up'

##

## Pre-processing: centered, scaled

## Resampling: Cross-Validated (10 fold, repeated 3 times)

##

## Summary of sample sizes: 844, 844, 844, 845, 844, 845, ...

##

## Resampling results across tuning parameters:

##

## k Accuracy Kappa Accuracy SD Kappa SD

## 5 0.9 0.7 0.04 0.07

## 7 0.9 0.8 0.04 0.08

## 9 0.9 0.8 0.04 0.08

## 10 0.9 0.8 0.03 0.07

## 10 0.9 0.8 0.03 0.07

## 20 0.9 0.8 0.03 0.06

## 20 0.9 0.8 0.03 0.07

## 20 0.9 0.8 0.04 0.08

## 20 0.9 0.8 0.03 0.07

## 20 0.9 0.8 0.03 0.07

## 20 0.9 0.8 0.03 0.06

## 30 0.9 0.8 0.03 0.06

## 30 0.9 0.8 0.03 0.07

## 30 0.9 0.8 0.03 0.07

## 30 0.9 0.8 0.03 0.07

## 40 0.9 0.8 0.03 0.07

## 40 0.9 0.8 0.03 0.06

## 40 0.9 0.8 0.03 0.06

## 40 0.9 0.8 0.03 0.06

## 40 0.9 0.8 0.03 0.05

##

## Accuracy was used to select the optimal model using the largest value.

## The final value used for the model was k = 23.
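The k column in the table above appears to be printed to one significant digit, which is why several rows show the same value. As far as I know, caret's default grid for method = "knn" uses odd values of k starting at 5, so tuneLength = 20 would give 5, 7, ..., 43. The sketch below rebuilds that grid under this assumption and counts the resampled fits implied by 10-fold cross-validation repeated 3 times.

```r
# Assumed candidate grid for method = "knn" with tuneLength = 20: odd k from 5 to 43.
k_grid <- seq(5, 43, by = 2)
length(k_grid)

# Resamples per candidate value of k: 10 folds x 3 repeats.
n_resamples <- 10 * 3

# Total model fits during tuning (excluding the final refit on all training rows).
total_fits <- length(k_grid) * n_resamples
total_fits
```

The selected values reported in this document, k = 23 and later k = 43, both fall on this assumed grid.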

#Plotting yields Number of Neighbours Vs accuracy (based on repeated cross validation)

plot(knnFit)


knnPredict <- predict(knnFit,newdata = testing )

#Get the confusion matrix to see accuracy value and other parameter values

confusionMatrix(knnPredict, testing$Direction )


## Confusion Matrix and Statistics

##

## Reference

## Prediction Down Up

## Down 123 8

## Up 27 154

##

## Accuracy : 0.888

## 95% CI : (0.847, 0.921)

## No Information Rate : 0.519

## P-Value [Acc > NIR] : < 2e-16

##

## Kappa : 0.774

## Mcnemar's Test P-Value : 0.00235

##

## Sensitivity : 0.820

## Specificity : 0.951

## Pos Pred Value : 0.939

## Neg Pred Value : 0.851

## Prevalence : 0.481

## Detection Rate : 0.394

## Detection Prevalence : 0.420

## Balanced Accuracy : 0.885

##

## 'Positive' Class : Down

##

mean(knnPredict == testing$Direction)

## [1] 0.8878
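The statistics printed by confusionMatrix() can be checked by hand from the four cell counts. The sketch below recomputes them in base R using the counts from the table above, with Down as the positive class; the variable names are my own.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(123, 27, 8, 154), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Reference  = c("Down", "Up")))

# "Down" is the positive class.
tp <- cm["Down", "Down"]; fn <- cm["Up", "Down"]
fp <- cm["Down", "Up"];   tn <- cm["Up", "Up"]
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, balanced = balanced, kappa = cohen_kappa), 3)
```

These values agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, Balanced Accuracy and Kappa lines of the output.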

#Now verifying 2 class summary function

ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)

knnFit <- train(Direction ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c(
"center","scale"), tuneLength = 20)

## Loading required package: pROC

## Type 'citation("pROC")' for a citation.

##

## Attaching package: 'pROC'

##

## The following objects are masked from 'package:stats':

##

## cov, smooth, var

## Warning: The metric "Accuracy" was not in the result set. ROC will be used

## instead.


#Output of kNN fit

knnFit

## k-Nearest Neighbors

##

## 938 samples

## 8 predictors
## 2 classes: 'Down', 'Up'

##

## Pre-processing: centered, scaled

## Resampling: Cross-Validated (10 fold, repeated 3 times)

##

## Summary of sample sizes: 844, 844, 845, 843, 844, 844, ...

##

## Resampling results across tuning parameters:

##

## k ROC Sens Spec ROC SD Sens SD Spec SD

## 5 0.9 0.8 0.9 0.03 0.07 0.05

## 7 1 0.8 0.9 0.02 0.06 0.05

## 9 1 0.8 0.9 0.02 0.07 0.05

## 10 1 0.8 0.9 0.02 0.07 0.05

## 10 1 0.8 0.9 0.02 0.07 0.05

## 20 1 0.9 0.9 0.02 0.06 0.04

## 20 1 0.9 0.9 0.02 0.06 0.04

## 20 1 0.9 0.9 0.02 0.07 0.03

## 20 1 0.9 0.9 0.01 0.06 0.03

## 20 1 0.9 0.9 0.01 0.06 0.03

## 20 1 0.9 0.9 0.01 0.07 0.03

## 30 1 0.9 0.9 0.01 0.06 0.03

## 30 1 0.8 0.9 0.01 0.07 0.03

## 30 1 0.8 0.9 0.01 0.06 0.03

## 30 1 0.8 0.9 0.01 0.07 0.03

## 40 1 0.8 0.9 0.01 0.06 0.03

## 40 1 0.8 0.9 0.01 0.06 0.03

## 40 1 0.8 0.9 0.01 0.06 0.02

## 40 1 0.8 0.9 0.01 0.06 0.02

## 40 1 0.8 0.9 0.01 0.06 0.03

##

## ROC was used to select the optimal model using the largest value.

## The final value used for the model was k = 43.

#Plotting yields Number of Neighbours Vs ROC (based on repeated cross validation)

plot(knnFit, print.thres = 0.5, type="S")


knnPredict <- predict(knnFit,newdata = testing )

#Get the confusion matrix to see accuracy value and other parameter values

confusionMatrix(knnPredict, testing$Direction )


## Confusion Matrix and Statistics

##

## Reference

## Prediction Down Up

## Down 123 9

## Up 27 153

##

## Accuracy : 0.885

## 95% CI : (0.844, 0.918)

## No Information Rate : 0.519

## P-Value [Acc > NIR] : < 2e-16

##

## Kappa : 0.768

## Mcnemar's Test P-Value : 0.00461

##

## Sensitivity : 0.820

## Specificity : 0.944

## Pos Pred Value : 0.932

## Neg Pred Value : 0.850

## Prevalence : 0.481

## Detection Rate : 0.394

## Detection Prevalence : 0.423

## Balanced Accuracy : 0.882

##

## 'Positive' Class : Down

##

mean(knnPredict == testing$Direction)

## [1] 0.8846

Trying to plot ROC curve to check specificity and sensitivity

library(pROC)

knnPredict <- predict(knnFit,newdata = testing , type="prob")

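# levels must name the two classes (control first, case second), so use levels() here.
# The output below was recorded with levels = rev(testing$Direction), which passes the whole
# response vector; that is the likely reason it reports 162 controls vs 162 cases and an AUC of 0.5.
knnROC <- roc(testing$Direction, knnPredict[,"Down"], levels = rev(levels(testing$Direction)))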

knnROC

##

## Call:

## roc.default(response = testing$Direction, predictor = knnPredict[, "Down"], levels = rev(testing$Direction))

##

## Data: knnPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction 2).

## Area under the curve: 0.5

plot(knnROC, type="S", print.thres= 0.5)


##

## Call:

## roc.default(response = testing$Direction, predictor = knnPredict[, "Down"], levels = rev(testing$Direction))

##

## Data: knnPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction 2).

## Area under the curve: 0.5
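An AUC of 0.5 means the scores rank the two classes no better than chance, which sits oddly with a test accuracy near 0.88, so the value is worth a sanity check. One way to check is to compute the AUC directly from its rank interpretation: the probability that a randomly chosen Down case receives a higher predicted probability of Down than a randomly chosen Up case. The sketch below does this in base R on a small hypothetical set of scores, not on the model output above.

```r
# Hypothetical predicted probabilities of "Down" for six test rows (not from the model above).
truth     <- factor(c("Down", "Down", "Down", "Up", "Up", "Up"), levels = c("Down", "Up"))
prob_down <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.1)

# Split the scores by true class.
pos <- prob_down[truth == "Down"]
neg <- prob_down[truth == "Up"]

# Compare every Down score with every Up score; ties count one half.
# The mean of these comparisons is the Mann-Whitney estimate of the AUC.
cmp <- outer(pos, neg, function(p, q) (p > q) + 0.5 * (p == q))
auc <- mean(cmp)
auc
```

Here 8 of the 9 Down/Up pairs are ordered correctly, so the AUC is 8/9. Running the same computation on the real test-set probabilities would give an independent check on the figure reported by roc().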

Applying Random Forest to see the performance improvement
set.seed(400)

ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary)

# Random forest

rfFit <- train(Direction ~ ., data = training, method = "rf", trControl = ctrl,
               preProcess = c("center","scale"), tuneLength = 20)


## Loading required package: randomForest

## randomForest 4.6-7

## Type rfNews() to see new features/changes/bug fixes.

## note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .

rfFit

## Random Forest
##

## 938 samples

## 8 predictors
## 2 classes: 'Down', 'Up'

##

## Pre-processing: centered, scaled

## Resampling: Cross-Validated (10 fold, repeated 3 times)

##

## Summary of sample sizes: 844, 844, 844, 845, 844, 845, ...

##

## Resampling results across tuning parameters:

##

## mtry Accuracy Kappa Accuracy SD Kappa SD

## 2 1 1 0.004 0.009

## 3 1 1 0.004 0.009

## 4 1 1 0.004 0.009

## 5 1 1 0.004 0.009

## 6 1 1 0.004 0.009

## 7 1 1 0.004 0.009

## 8 1 1 0.004 0.009

##

## Accuracy was used to select the optimal model using the largest value.

## The final value used for the model was mtry = 2.

plot(rfFit)


rfPredict <- predict(rfFit,newdata = testing )

confusionMatrix(rfPredict, testing$Direction )


## Confusion Matrix and Statistics

##

## Reference

## Prediction Down Up

## Down 150 0

## Up 0 162

##

## Accuracy : 1

## 95% CI : (0.988, 1)

## No Information Rate : 0.519

## P-Value [Acc > NIR] : <2e-16

##

## Kappa : 1

## Mcnemar's Test P-Value : NA

##

## Sensitivity : 1.000

## Specificity : 1.000

## Pos Pred Value : 1.000

## Neg Pred Value : 1.000

## Prevalence : 0.481

## Detection Rate : 0.481

## Detection Prevalence : 0.481

## Balanced Accuracy : 1.000

##

## 'Positive' Class : Down

##

mean(rfPredict == testing$Direction)

## [1] 1

# With twoClassSummary

ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)

# Random forest

rfFit <- train(Direction ~ ., data = training, method = "rf", trControl = ctrl,
               preProcess = c("center","scale"), tuneLength = 20)

## note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .

## Warning: The metric "Accuracy" was not in the result set. ROC will be used

## instead.

rfFit


## Random Forest
##

## 938 samples

## 8 predictors
## 2 classes: 'Down', 'Up'

##

## Pre-processing: centered, scaled

## Resampling: Cross-Validated (10 fold, repeated 3 times)

##

## Summary of sample sizes: 844, 844, 845, 845, 843, 845, ...

##

## Resampling results across tuning parameters:

##

## mtry ROC Sens Spec ROC SD Sens SD Spec SD

## 2 1 1 1 9e-05 0.007 0.005

## 3 1 1 1 8e-05 0.007 0.005

## 4 1 1 1 0 0.007 0.005

## 5 1 1 1 0 0.007 0.005

## 6 1 1 1 8e-05 0.007 0.005

## 7 1 1 1 0 0.007 0.005

## 8 1 1 1 0 0.007 0.005

##

## ROC was used to select the optimal model using the largest value.

## The final value used for the model was mtry = 4.

plot(rfFit)


#Trying plot with some more parameters

plot(rfFit, print.thres = 0.5, type="S")


rfPredict <- predict(rfFit,newdata = testing )

confusionMatrix(rfPredict, testing$Direction )


## Confusion Matrix and Statistics

##

## Reference

## Prediction Down Up

## Down 150 0

## Up 0 162

##

## Accuracy : 1

## 95% CI : (0.988, 1)

## No Information Rate : 0.519

## P-Value [Acc > NIR] : <2e-16

##

## Kappa : 1

## Mcnemar's Test P-Value : NA

##

## Sensitivity : 1.000

## Specificity : 1.000

## Pos Pred Value : 1.000

## Neg Pred Value : 1.000

## Prevalence : 0.481

## Detection Rate : 0.481

## Detection Prevalence : 0.481

## Balanced Accuracy : 1.000

##

## 'Positive' Class : Down

##

mean(rfPredict == testing$Direction)

## [1] 1

Plotting ROC curve

library(pROC)

rfPredict <- predict(rfFit,newdata = testing , type="prob")

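# levels must name the two classes (control first, case second), so use levels() here.
# The output below was recorded with levels = rev(testing$Direction), which passes the whole
# response vector; that is the likely reason it reports 162 controls vs 162 cases and an AUC of 0.5.
rfROC <- roc(testing$Direction, rfPredict[,"Down"], levels = rev(levels(testing$Direction)))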

rfROC

##

## Call:

## roc.default(response = testing$Direction, predictor = rfPredict[, "Down"], levels = rev(testing$Direction))

##

## Data: rfPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction 2).

## Area under the curve: 0.5

plot(rfROC, type="S", print.thres= 0.5)


##

## Call:

## roc.default(response = testing$Direction, predictor = rfPredict[, "Down"], levels = rev(testing$Direction))

##

## Data: rfPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction 2).

## Area under the curve: 0.5
