Grid Search For KNN

10/15/21, 1:30 PM RPubs - kNN using R caret package
kNN Using caret R package

Vijayakumar Jawaharlal
April 29, 2014
I recently got familiar with the caret package. caret is a great R package that provides a general interface to nearly
150 ML algorithms. It also provides handy functions for sampling the data (for training and testing), preprocessing,
evaluating the model, and so on.
To get familiar with the caret package, please check the following URLs:
https://www.youtube.com/watch?v=7Jbb2ItbTC4
http://caret.r-forge.r-project.org
http://cran.r-project.org/web/packages/caret/vignettes/caret.pdf
I am going to use the same dataset as in the previous examples. The intention of this exercise is to get familiar with the caret
package.
Sampling
library(ISLR)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(300)
#Splitting data into training and test sets using the createDataPartition() function from caret
indxTrain <- createDataPartition(y = Smarket$Direction,p = 0.75,list = FALSE)
training <- Smarket[indxTrain,]

testing <- Smarket[-indxTrain,]

#Checking distribution in original data and partitioned data
prop.table(table(training$Direction)) * 100
##
## Down Up
## 48.19 51.81
prop.table(table(testing$Direction)) * 100
##
## Down Up
## 48.08 51.92
prop.table(table(Smarket$Direction)) * 100
##
## Down Up
## 48.16 51.84
The createDataPartition function creates the sample effortlessly. We don’t need to write a complex function like in the previous
example.
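To make concrete what a stratified split does, here is a rough base-R sketch of the idea behind createDataPartition(): sample the same fraction of row indices within each class. The helper name stratified_split and the toy labels are hypothetical (the class counts are chosen to match the 48.16 / 51.84 balance reported above for 938 + 312 = 1250 rows), and caret's own implementation may differ in detail.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class
stratified_split <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(300)
# Toy labels with the same class balance as the full data set (assumed counts)
y <- factor(rep(c("Down", "Up"), times = c(602, 648)))
train_idx <- stratified_split(y, p = 0.75)

# Class proportions stay close in the full data and in both parts
round(prop.table(table(y)) * 100, 2)
round(prop.table(table(y[train_idx])) * 100, 2)
round(prop.table(table(y[-train_idx])) * 100, 2)
```

Because the sampling happens per class, neither part can drift far from the overall Down/Up balance, which is what the three prop.table() calls above check.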
Preprocessing
kNN requires variables to be normalized or scaled. caret provides a facility to preprocess data. I am going to choose
centering and scaling.
trainX <- training[,names(training) != "Direction"]
preProcValues <- preProcess(x = trainX,method = c("center", "scale"))
preProcValues
##
## Call:
## preProcess.default(x = trainX, method = c("center", "scale"))
##
## Created from 938 samples and 8 variables
## Pre-processing: centered, scaled
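As a sketch of what centering and scaling amount to, the base-R snippet below computes the mean and standard deviation on the training rows only and applies the same values to the test rows. The toy matrices and column names are made up for illustration; preProcess() handles this bookkeeping for you.

```r
set.seed(1)
# Toy predictors on very different scales (illustrative only)
train_x <- cbind(volume = rnorm(100, mean = 1500, sd = 300),
                 lag1   = rnorm(100, mean = 0,    sd = 1))
test_x  <- cbind(volume = rnorm(20, mean = 1500, sd = 300),
                 lag1   = rnorm(20, mean = 0,    sd = 1))

# Learn the centering and scaling values from the training data only
centers <- colMeans(train_x)
scales  <- apply(train_x, 2, sd)

# Apply the same transformation to both sets
train_scaled <- scale(train_x, center = centers, scale = scales)
test_scaled  <- scale(test_x,  center = centers, scale = scales)

round(colMeans(train_scaled), 10)  # approximately 0 for every column
apply(train_scaled, 2, sd)         # 1 for every column
```

Without this step the distance used by kNN would be dominated by whichever column has the largest numeric range.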
Training and train control

set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary)
knnFit <- train(Direction ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
#Output of kNN fit
knnFit
## k-Nearest Neighbors
##
## 938 samples
## 8 predictors
## 2 classes: 'Down', 'Up'
##
## Resampling: Cross-Validated (10 fold, repeated 3 times)
##
## Summary of sample sizes: 844, 844, 844, 845, 844, 845, ...
##
## Resampling results across tuning parameters:
##
## k Accuracy Kappa Accuracy SD Kappa SD
## 5 0.9 0.7 0.04 0.07
## 7 0.9 0.8 0.04 0.08
## 9 0.9 0.8 0.04 0.08
## 10 0.9 0.8 0.03 0.07
## 10 0.9 0.8 0.03 0.07
## 20 0.9 0.8 0.03 0.06
## 20 0.9 0.8 0.03 0.07
## 20 0.9 0.8 0.04 0.08
## 20 0.9 0.8 0.03 0.07
## 20 0.9 0.8 0.03 0.07
## 20 0.9 0.8 0.03 0.06
## 30 0.9 0.8 0.03 0.06
## 30 0.9 0.8 0.03 0.07
## 30 0.9 0.8 0.03 0.07
## 30 0.9 0.8 0.03 0.07
## 40 0.9 0.8 0.03 0.07
## 40 0.9 0.8 0.03 0.06
## 40 0.9 0.8 0.03 0.06
## 40 0.9 0.8 0.03 0.06
## 40 0.9 0.8 0.03 0.05
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 23.
#Plotting yields Number of Neighbours Vs accuracy (based on repeated cross validation)
plot(knnFit)
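tuneLength = 20 lets caret choose the candidate values of k. To show what such a grid search does, the sketch below runs a plain 10-fold cross-validation over an explicit grid of odd k values using knn() from the class package. The synthetic two-class data, the fold assignment, and the helper name cv_accuracy are assumptions made for illustration; this is not how caret implements the search internally. In caret itself, an explicit grid can be passed through the tuneGrid argument of train(), for example tuneGrid = expand.grid(k = seq(5, 43, by = 2)).

```r
library(class)  # provides knn()

set.seed(400)
# Synthetic two-class data (illustrative only): two overlapping Gaussian clouds
n <- 200
x <- rbind(matrix(rnorm(n, mean = 0),   ncol = 2),
           matrix(rnorm(n, mean = 1.5), ncol = 2))
y <- factor(rep(c("Down", "Up"), each = n / 2))

# For brevity the data are scaled once up front; caret rescales inside each resample
x <- scale(x)

# Assign every row to one of 10 folds at random
folds <- sample(rep(1:10, length.out = nrow(x)))

# Hypothetical helper: mean held-out accuracy for one value of k
cv_accuracy <- function(k) {
  acc <- sapply(1:10, function(f) {
    held_out <- folds == f
    pred <- knn(train = x[!held_out, , drop = FALSE],
                test  = x[held_out, , drop = FALSE],
                cl    = y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(acc)
}

k_grid  <- seq(5, 43, by = 2)   # 20 candidate values of k
results <- data.frame(k = k_grid, Accuracy = sapply(k_grid, cv_accuracy))
best_k  <- results$k[which.max(results$Accuracy)]
results
best_k
```

Repeating the fold assignment several times and averaging, as repeatedcv does, reduces the noise in the accuracy estimate for each k.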
knnPredict <- predict(knnFit,newdata = testing )
#Get the confusion matrix to see accuracy value and other parameter values
confusionMatrix(knnPredict, testing$Direction )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 123 8
## Up 27 154
##
## Accuracy : 0.888
## 95% CI : (0.847, 0.921)
## No Information Rate : 0.519
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.774
## Mcnemar's Test P-Value : 0.00235
##
## Sensitivity : 0.820
## Specificity : 0.951
## Pos Pred Value : 0.939
## Neg Pred Value : 0.851
## Prevalence : 0.481
## Detection Rate : 0.394
## Detection Prevalence : 0.420
## Balanced Accuracy : 0.885
##
## 'Positive' Class : Down
##
mean(knnPredict == testing$Direction)
## [1] 0.8878
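The statistics in the confusion matrix output follow directly from the four cell counts. The base-R sketch below recomputes several of them from the table printed above, taking Down as the positive class as the output states. The variable names are illustrative.

```r
# Cell counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(123, 27, 8, 154), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Reference  = c("Down", "Up")))

n  <- sum(cm)
tp <- cm["Down", "Down"]; fn <- cm["Up", "Down"]
fp <- cm["Down", "Up"];   tn <- cm["Up", "Up"]

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)

# Cohen's kappa: observed agreement corrected for the agreement expected by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity,
        PPV = ppv, NPV = npv, Kappa = kappa_hat), 3)
```

The rounded results agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, and Kappa lines reported by confusionMatrix().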
#Now verifying 2 class summary function
ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)
knnFit <- train(Direction ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Warning: The metric "Accuracy" was not in the result set. ROC will be used
## instead.
#Output of kNN fit
knnFit
## k-Nearest Neighbors
##
## 938 samples
## 8 predictors
##
##
##
##
## k ROC Sens Spec ROC SD Sens SD Spec SD
## 5 0.9 0.8 0.9 0.03 0.07 0.05
## 7 1 0.8 0.9 0.02 0.06 0.05
## 9 1 0.8 0.9 0.02 0.07 0.05
## 10 1 0.8 0.9 0.02 0.07 0.05
## 10 1 0.8 0.9 0.02 0.07 0.05
## 20 1 0.9 0.9 0.02 0.06 0.04
## 20 1 0.9 0.9 0.02 0.06 0.04
## 20 1 0.9 0.9 0.02 0.07 0.03
## 20 1 0.9 0.9 0.01 0.06 0.03
## 20 1 0.9 0.9 0.01 0.06 0.03
## 20 1 0.9 0.9 0.01 0.07 0.03
## 30 1 0.9 0.9 0.01 0.06 0.03
## 30 1 0.8 0.9 0.01 0.07 0.03
## 30 1 0.8 0.9 0.01 0.06 0.03
## 30 1 0.8 0.9 0.01 0.07 0.03
## 40 1 0.8 0.9 0.01 0.06 0.03
## 40 1 0.8 0.9 0.01 0.06 0.03
## 40 1 0.8 0.9 0.01 0.06 0.02
## 40 1 0.8 0.9 0.01 0.06 0.02
## 40 1 0.8 0.9 0.01 0.06 0.03
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 43.
#Plotting yields Number of Neighbours Vs ROC (based on repeated cross validation)
plot(knnFit, print.thres = 0.5, type="S")
knnPredict <- predict(knnFit,newdata = testing )
#Get the confusion matrix to see accuracy value and other parameter values
confusionMatrix(knnPredict, testing$Direction )
##
## Reference
## Down 123 9
## Up 27 153
##
## Accuracy : 0.885
## 95% CI : (0.844, 0.918)
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.768
## Mcnemar's Test P-Value : 0.00461
##
##
##
mean(knnPredict == testing$Direction)
## [1] 0.8846
Trying to plot ROC curve to check specificity and sensitivity
library(pROC)
knnPredict <- predict(knnFit,newdata = testing , type="prob")
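#Caution: levels expects the two class labels (controls first, then cases), e.g. rev(levels(testing$Direction)).
#Passing rev(testing$Direction) supplies every test label, which likely explains the 162 controls, 162 cases and AUC of 0.5 reported below.
knnROC <- roc(testing$Direction,knnPredict[,"Down"], levels = rev(testing$Direction))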
knnROC
##
## Call:
## roc.default(response = testing$Direction, predictor = knnPredict[, "Down"], levels = rev(testing$Direction))
##
## Data: knnPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction 2).
## Area under the curve: 0.5
plot(knnROC, type="S", print.thres= 0.5)
##
## Call:
## roc.default(response = testing$Direction, predictor = knnPredict[, "Down"], levels = rev(testing$Direction))
##
## Data: knnPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction 2).
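The output above reports 162 controls and 162 cases with an AUC of 0.5, even though the test set holds 150 Down and 162 Up observations. A likely cause is the levels argument: roc() expects the two class labels (control first, then case), whereas rev(testing$Direction) passes every test label. The sketch below, on made-up labels and scores, shows the difference between the two expressions and computes the AUC with the rank-based (Mann-Whitney) formula so that it runs without extra packages. The data and the helper name rank_auc are assumptions for illustration.

```r
set.seed(1)
# Made-up test labels and predicted probabilities of "Down" (illustrative only)
direction <- factor(rep(c("Down", "Up"), times = c(150, 162)))
p_down    <- ifelse(direction == "Down", rbeta(312, 4, 2), rbeta(312, 2, 4))

length(rev(direction))   # 312: one entry per observation, not a pair of labels
rev(levels(direction))   # "Up" "Down": controls first, then cases

# Hypothetical helper: AUC as the probability that a case scores above a control
rank_auc <- function(score, is_case) {
  n_case <- sum(is_case)
  n_ctrl <- sum(!is_case)
  (sum(rank(score)[is_case]) - n_case * (n_case + 1) / 2) / (n_case * n_ctrl)
}
auc <- rank_auc(p_down, direction == "Down")
auc

# With pROC installed, the corresponding call names the two levels explicitly
if (requireNamespace("pROC", quietly = TRUE)) {
  roc_obj <- pROC::roc(direction, p_down, levels = rev(levels(direction)))
  print(pROC::auc(roc_obj))
}
```

Applied to the fit above, the analogous call would be roc(testing$Direction, knnPredict[,"Down"], levels = rev(levels(testing$Direction))), which should report 162 controls (Up) and 150 cases (Down).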
Applying Random Forest to see the performance improvement
set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary)
# Random forest
rfFit <- train(Direction ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
## note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .
rfFit
## Random Forest
##
## 938 samples
## 8 predictors
##
##
##
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 1 1 0.004 0.009
## 3 1 1 0.004 0.009
## 4 1 1 0.004 0.009
## 5 1 1 0.004 0.009
## 6 1 1 0.004 0.009
## 7 1 1 0.004 0.009
## 8 1 1 0.004 0.009
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(rfFit)
rfPredict <- predict(rfFit,newdata = testing )
confusionMatrix(rfPredict, testing$Direction )
##
## Reference
## Down 150 0
## Up 0 162
##
## Accuracy : 1
## 95% CI : (0.988, 1)
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
##
##
mean(rfPredict == testing$Direction)
## [1] 1
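An accuracy of exactly 1 on held-out data deserves a second look. In Smarket, Today is the percentage return for the day and Direction records whether the market went up or down that day, so the formula Direction ~ . hands the model a predictor that essentially determines the label. The sketch below uses made-up returns to show that a single threshold on such a column already classifies perfectly and then, if ISLR is installed, cross-tabulates the real columns. Treat this as one plausible reading of the result, not as the author's analysis.

```r
set.seed(400)
# Made-up daily returns; the label is derived from their sign (illustrative only)
today     <- rnorm(312)
direction <- factor(ifelse(today > 0, "Up", "Down"), levels = c("Down", "Up"))

# A one-line rule on the leaking column reproduces every label
rule_pred <- factor(ifelse(today > 0, "Up", "Down"), levels = c("Down", "Up"))
mean(rule_pred == direction)

# Check against the real data when ISLR is available
if (requireNamespace("ISLR", quietly = TRUE)) {
  print(table(sign(ISLR::Smarket$Today), ISLR::Smarket$Direction))
}
```

Dropping Today from the predictors, for example with the formula Direction ~ . - Today, would give a fairer comparison between kNN and random forest.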
#With twoClassSummary
ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)
# Random forest
rfFit <- train(Direction ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
## note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .
## Warning: The metric "Accuracy" was not in the result set. ROC will be used
## instead.
rfFit
## Random Forest
##
## 938 samples
## 8 predictors
##
##
##
##
## mtry ROC Sens Spec ROC SD Sens SD Spec SD
## 2 1 1 1 9e-05 0.007 0.005
## 3 1 1 1 8e-05 0.007 0.005
## 4 1 1 1 0 0.007 0.005
## 5 1 1 1 0 0.007 0.005
## 6 1 1 1 8e-05 0.007 0.005
## 7 1 1 1 0 0.007 0.005
## 8 1 1 1 0 0.007 0.005
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
plot(rfFit)
#Trying plot with some more parameters
plot(rfFit, print.thres = 0.5, type="S")
rfPredict <- predict(rfFit,newdata = testing )
confusionMatrix(rfPredict, testing$Direction )
##
## Reference
## Down 150 0
## Up 0 162
##
## Accuracy : 1
## 95% CI : (0.988, 1)
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
##
##
mean(rfPredict == testing$Direction)
## [1] 1
Plotting ROC curve
library(pROC)
rfPredict <- predict(rfFit,newdata = testing , type="prob")
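#Caution: levels expects the two class labels (controls first, then cases), e.g. rev(levels(testing$Direction)).
#Passing rev(testing$Direction) supplies every test label, which likely explains the 162 controls and 162 cases reported below.
rfROC <- roc(testing$Direction,rfPredict[,"Down"], levels = rev(testing$Direction))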
rfROC
##
## Call:
## roc.default(response = testing$Direction, predictor = rfPredict[, "Down"], levels = rev(testing$Direction))
##
## Data: rfPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction 2).
plot(rfROC, type="S", print.thres= 0.5)
##
## Call:
## roc.default(response = testing$Direction, predictor = rfPredict[, "Down"], levels = rev(testing$Direction))
##
## Data: rfPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction 2).
