Introduction
Objectives
Pre-requisites
• Participants are expected to be familiar with the R syntax and basic plotting
functionality.
• R 3.5.1 or higher.
Machine learning is a branch of artificial intelligence based on the idea that systems
can learn from data, identify patterns and make decisions with minimal human
intervention.
Machine learning algorithms are often categorized as supervised or unsu-
pervised. In supervised learning, the learning algorithm is presented with la-
belled example inputs, where the labels indicate the desired output. Supervised
algorithms are composed of classification, where the output is categorical, and
regression, where the output is numerical. In unsupervised learning, no labels
are provided, and the learning algorithm focuses solely on detecting structure
in unlabelled input data.
Note that there are also semi-supervised learning approaches that use
labelled data to inform unsupervised learning on the unlabelled data to identify
and annotate new classes in the dataset (also called novelty detection).
Packages
R has multiple packages for machine learning. These are some of the most
popular:
install.packages("caret")
install.packages("caretEnsemble")
install.packages("tidyverse")
install.packages("corrplot")
install.packages("doSNOW")
install.packages("randomForest")
library(caret)
library(corrplot)
library(tidyverse)
We will use the wine quality data set from the UCI Machine Learning repository.
These are the results of physicochemical analyses of red and white variants of
Portuguese wine, together with a sensory quality score for each wine.
The goal is to predict wine quality, of which there are 7 values (integers 3-9).
We will turn this into a binary classification task and predict whether a wine
is ‘good’ or not, which is arbitrarily defined as a quality of 6 or higher. After getting the
hang of things one might redo the analysis as a multiclass problem, or even
toy with regression approaches; just note that there are very few 3s or 9s, so you
really only have 5 values to work with. The original data, along with a detailed
description, can be found here; aside from quality it contains predictors
such as residual sugar, alcohol content, acidity and other characteristics of the
wine.
The original data is separated into white and red data sets. I have com-
bined them and created additional variables: color and good, indicating scores
greater than or equal to 6 (denoted as ‘Good’ or ‘Bad’).
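A sketch of how this combined data set might be built (the file names are those used on the UCI repository, but the exact loading and coding steps are assumptions):
red <- read.csv("winequality-red.csv", sep = ";")
white <- read.csv("winequality-white.csv", sep = ";")
red$color <- "red"
white$color <- "white"
wine <- rbind(red, white)
wine$good <- factor(ifelse(wine$quality >= 6, "Good", "Bad"))  # 'good' wines score 6 or higher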
The following shows some basic numeric summaries of the data:
summary(wine)
## fixed.acidity volatile.acidity
## Min. : 3.800 Min. :0.0800
## 1st Qu.: 6.400 1st Qu.:0.2300
## Median : 7.000 Median :0.2900
## Mean : 7.215 Mean :0.3397
## 3rd Qu.: 7.700 3rd Qu.:0.4000
## Max. :15.900 Max. :1.5800
## citric.acid residual.sugar
## Min. :0.0000 Min. : 0.600
## 1st Qu.:0.2500 1st Qu.: 1.800
## Median :0.3100 Median : 3.000
## Mean :0.3186 Mean : 5.443
## 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide
## Min. :0.00900 Min. : 1.00
## 1st Qu.:0.03800 1st Qu.: 17.00
## Median :0.04700 Median : 29.00
## Mean :0.05603 Mean : 30.53
## 3rd Qu.:0.06500 3rd Qu.: 41.00
## Max. :0.61100 Max. :289.00
## total.sulfur.dioxide density
## Min. : 6.0 Min. :0.9871
## 1st Qu.: 77.0 1st Qu.:0.9923
## Median :118.0 Median :0.9949
## Mean :115.7 Mean :0.9947
## 3rd Qu.:156.0 3rd Qu.:0.9970
## Max. :440.0 Max. :1.0390
## pH sulphates
## Min. :2.720 Min. :0.2200
## 1st Qu.:3.110 1st Qu.:0.4300
## Median :3.210 Median :0.5100
## Mean :3.219 Mean :0.5313
We can visualize the correlations between all variables in the dataset with
the corrplot::corrplot function.
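The call itself is not shown in the original; a minimal sketch, assuming we correlate the numeric columns of wine:
numeric_cols <- sapply(wine, is.numeric)  # correlations only make sense for numeric variables
corrplot(cor(wine[, numeric_cols]))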
[Figure: correlation plot of the wine variables; among the stronger correlations are free.sulfur.dioxide with total.sulfur.dioxide (0.72), density with alcohol (−0.69) and volatile.acidity with the white indicator (−0.65).]
Data partition
set.seed(1234) # so that the indices will be the same when re-run
trainIndices <- createDataPartition(wine$good,
                                    p = 0.8,
                                    list = FALSE)
wine_train <- wine[trainIndices, -c(6, 8, 12:14)]  # remove quality and color, as well as density and others
wine_test <- wine[!1:nrow(wine) %in% trainIndices, -c(6, 8, 12:14)]
Random forest
Implementation
library(randomForest)
By “tune the forest” we mean the process of determining the optimal number
of variables to consider at each split in a decision tree. With too many predictor
variables the algorithm will over-fit; with too few it will under-fit. So first, we
use the tuneRF function to get a possible optimal number of predictor variables.
The tuneRF function takes two arguments: the predictor variables and the
response variable.
This function also returns a plot of how the error varies depending on the
number of predictor variables.
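A sketch of how tuneRF might be called here (the exact call is not shown in the original; the seed and the selection of mintree are assumptions, and column 10 of wine_train is taken to be the response good, as in the randomForest call printed below):
set.seed(1234)  # assumed, for reproducibility
rf_tune <- tuneRF(x = wine_train[, -10], y = wine_train$good,
                  ntreeTry = 500, stepFactor = 2, improve = 0.05)
mintree <- rf_tune[which.min(rf_tune[, "OOBError"]), "mtry"]  # mtry with the lowest OOB error
mintree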
[Figure: out-of-bag (OOB) error as a function of mtry (values 2, 3 and 6); the error is lowest at mtry = 3.]
tuneRF returns the number of variables randomly sampled as candidates at each split (mtry) that gives the lowest out-of-bag (OOB) error.
## [1] 3
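The forest itself is then fitted with the tuned mtry value; the call below is reconstructed from the model's printed output that follows (the object name rf_model is an assumption):
rf_model <- randomForest(x = wine_train[, -10], y = wine_train$good,
                         mtry = mintree, importance = TRUE)
rf_model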
##
## Call:
## randomForest(x = wine_train[, -10], y = wine_train$good, mtry = mintree, importance = TRUE)
plot(rf_model, main = "")
[Figure: error of the random forest as a function of the number of trees (up to 500).]
We can have a look at each variable’s influence by plotting their importance.
[Figure: variable importance from the random forest (two panels, mean decrease in accuracy and mean decrease in Gini); alcohol and volatile.acidity rank highest in both.]
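Such a plot can be produced with varImpPlot (a sketch; the original call is not shown):
varImpPlot(rf_model, main = "")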
Validation
We can check the model’s fit against the test dataset.
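A sketch of such a check (the object name preds_rf is illustrative, and the response column in wine_test is assumed to be good):
preds_rf <- predict(rf_model, newdata = wine_test)  # class predictions on held-out data
confusionMatrix(preds_rf, wine_test$good, positive = "Good")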
• https://uc-r.github.io/random_forests
• https://towardsdatascience.com/random-forest-in-r-f66adf80ec9
k-nearest neighbors
We will predict whether a wine is good or not. We have labels in the good variable, so this is a
problem of supervised classification.
Consider the typical distance matrix that is often used for cluster analysis
of observations. If we choose something like Euclidean distance as a metric,
each point in the matrix gives the value of how far an observation is from some
other, given their respective values on a set of variables.
k-NN approaches exploit this information for predictive purposes. Let us
take a classification example with k = 5 neighbors. For a given observation
x_i, find its 5 closest neighbors in terms of Euclidean distance based on the
predictor variables. The predicted class is whatever class the majority
of those neighbors are labeled as. For continuous outcomes we might take the
mean of those neighbors as the prediction.
So how many neighbors would work best? This is an example of a tuning
parameter, i.e. k, for which we have no knowledge about its value without doing
some initial digging. As such we will select the tuning parameter as part of the
validation process.
Implementation
The caret package provides several techniques for validation such as k-fold,
bootstrap, leave-one-out and others. We will use 10-fold cross validation. We
will also set up a set of values for k to try out.
train is the function used to fit the models. You can check all available
methods here. This function is used to:
• evaluate, using resampling, the effect of model tuning parameters on performance
• choose the “optimal” model across these parameters
• estimate model performance from a training set
You can control training parameters such as the resampling method and the number of
resampling iterations with caret’s trainControl function. In this case we will use k-fold
cross-validation (cv) with 10 folds, stored in an object called cv_opts.
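A sketch of that control object (the original call is not shown; 10-fold cross-validation matches the resampling summaries printed below):
cv_opts <- trainControl(method = "cv", number = 10)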
knn_opts = data.frame(.k=c(seq(3, 11, 2), 25, 51, 101)) #odd to avoid ties
knn_model = train(good~., data=wine_train, method="knn",
preProcess="range", trControl=cv_opts,
tuneGrid = knn_opts)
knn_model
## k-Nearest Neighbors
##
## 5199 samples
## 9 predictor
## 2 classes: ’Bad’, ’Good’
##
## Pre-processing: re-scaling to [0, 1] (9)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4679, 4678, 4679, 4679, 4680, 4679, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 3 0.7518692 0.4574756
## 5 0.7453311 0.4405051
## 7 0.7512978 0.4518213
## 9 0.7497564 0.4476991
## 11 0.7534099 0.4561257
## 25 0.7509106 0.4469177
## 51 0.7468703 0.4326833
## 101 0.7451462 0.4272734
##
## Accuracy was used to select the
## optimal model using the largest value.
## The final value used for the model was k
## = 11.
dotPlot(varImp(knn_model))
[Figure: variable importance for the k-NN model; alcohol and volatile.acidity have the highest importance, residual.sugar the lowest.]
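The confusion matrix below compares the k-NN predictions on the held-out test set with the true labels; a sketch of the call that produces it (not shown in the original):
confusionMatrix(predict(knn_model, newdata = wine_test), wine_test$good, positive = "Good")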
##
## Reference
## Prediction Bad Good
## Bad 276 130
## Good 200 692
##
## Accuracy : 0.7458
## 95% CI : (0.7211, 0.7693)
## No Information Rate : 0.6333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4351
##
## Mcnemar’s Test P-Value : 0.0001457
##
## Sensitivity : 0.8418
## Specificity : 0.5798
## Pos Pred Value : 0.7758
## Neg Pred Value : 0.6798
## Prevalence : 0.6333
## Detection Rate : 0.5331
## Detection Prevalence : 0.6872
## Balanced Accuracy : 0.7108
##
## ’Positive’ Class : Good
##
Neural networks
Neural nets have been around for a long while as a general concept in artificial
intelligence and even as a machine learning algorithm, and often work quite
well. In some sense they can be thought of as nonlinear regression. Visually
however, we can think of them as layers of inputs and outputs. Weighted com-
binations of the inputs are created and put through some function (e.g. the
sigmoid function) to produce the next layer of inputs. This next layer goes
through the same process to produce either another layer or to predict the
output, which is the final layer. All the layers between the input and output
are usually referred to as ‘hidden’ layers. If there were no hidden layers, the model
would reduce to a standard regression problem.
One of the issues with neural nets is determining how many hidden layers
and how many hidden units per layer to use. Overly complex neural nets will suffer
from a variance problem and be less generalizable, particularly if there is less
relevant information in the training data. Along with the complexity comes the
notion of weight decay; this is the same as the regularization we
discussed in a previous section, where a penalty term is applied to a
norm of the weights.
Parallel processing
If training takes too long, one option is to replace the method with nnet and shorten the
tuneLength to 3, which will be faster without much loss of accuracy. Also, the function we are using has only
one hidden layer, but other neural net methods accessible via the caret
package may allow for more, though the gains in prediction from additional
layers are likely to be modest relative to the added complexity and computational cost.
In addition, if the underlying function has additional arguments, you may pass
those on in the train function itself.
We will use parallel processing with a cluster of type SOCK. (If you are using Linux/GNU or
macOS you can also use the FORK type; in that case the environment is shared with all
worker processes.)
library(doSNOW)
cl <- makeCluster(3, type = "SOCK")  # start a cluster with 3 workers
registerDoSNOW(cl)                   # register it as the parallel backend
Implementation
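The call that trains this network is not shown in the original; a sketch, assuming caret’s avNNet method (the bag tuning parameter in the output below belongs to avNNet), the object name nnet_model, and the cv_opts control object sketched earlier:
set.seed(1234)  # assumed, for reproducibility
nnet_model <- train(good ~ ., data = wine_train, method = "avNNet",
                    preProcess = "range", trControl = cv_opts,
                    tuneLength = 5, trace = FALSE)  # tuneLength = 5 matches the 5 x 5 grid below
nnet_model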
##
## Pre-processing: re-scaling to [0, 1] (9)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4680, 4679, 4678, 4680, 4679, 4679, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.7284013 0.3965111
## 1 1e-04 0.7287841 0.3885902
## 1 1e-03 0.7253277 0.3791323
## 1 1e-02 0.7291661 0.4007401
## 1 1e-01 0.7314834 0.3944406
## 3 0e+00 0.7089882 0.3167041
## 3 1e-04 0.7212815 0.3746202
## 3 1e-03 0.7216783 0.3634109
## 3 1e-02 0.7289812 0.3925674
## 3 1e-01 0.7243551 0.3574783
## 5 0e+00 0.7091598 0.3136761
## 5 1e-04 0.7184094 0.3554943
## 5 1e-03 0.7243629 0.3685528
## 5 1e-02 0.7280134 0.3808471
## 5 1e-01 0.7226288 0.3688245
## 7 0e+00 0.7255075 0.3693651
## 7 1e-04 0.7243606 0.3796744
## 7 1e-03 0.7199446 0.3509379
## 7 1e-02 0.7145766 0.3321129
## 7 1e-01 0.7251306 0.3804193
## 9 0e+00 0.7258972 0.3796424
## 9 1e-04 0.7307127 0.3972374
## 9 1e-03 0.7289775 0.3849989
## 9 1e-02 0.7276339 0.3877238
## 9 1e-01 0.7305152 0.3867421
##
## Tuning parameter ’bag’ was held constant
## at a value of FALSE
## Accuracy was used to select the
## optimal model using the largest value.
## The final values used for the model
## were size = 1, decay = 0.1 and bag = FALSE.
Once you’ve finished working with your cluster, it’s good to clean up and
stop the cluster child processes (quitting R will also stop all of the child pro-
cesses).
stopCluster(cl)
Validation
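As with the previous models, the fit can be checked on the test set (a sketch, using the nnet_model object assumed above):
confusionMatrix(predict(nnet_model, newdata = wine_test), wine_test$good, positive = "Good")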
• https://www.datacamp.com/community/tutorials/neural-network-models-r
• https://www.analyticsvidhya.com/blog/2017/09/creating-visualizing-neural-network-in-r/
• https://datascienceplus.com/neuralnet-train-and-test-neural-networks-using-r/
Ensemble method
You can combine the predictions of multiple caret models using the caretEnsemble
package.
library(caretEnsemble)
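The calls that build the list of models and the resampling summary below are not shown in the original; a sketch, where the repeated 10-fold cross-validation (3 repeats, giving the 30 resamples reported below), the seed and the object names are assumptions:
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        savePredictions = "final", classProbs = TRUE)
set.seed(1234)
models <- caretList(good ~ ., data = wine_train,
                    trControl = control, metric = "Accuracy",
                    methodList = c("lda", "glm", "knn", "rf"))
results <- resamples(models)
summary(results)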
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, glm, knn, rf
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean
## lda 0.7013487 0.7325991 0.7439844 0.7428332
## glm 0.7038462 0.7258766 0.7461538 0.7418079
## knn 0.6653846 0.6826923 0.6971154 0.6958994
## rf 0.7861272 0.8177885 0.8291747 0.8275347
## 3rd Qu. Max. NA’s
## lda 0.7548077 0.7750000 0
## glm 0.7525289 0.7677543 0
## knn 0.7081131 0.7480769 0
## rf 0.8360577 0.8557692 0
##
## Kappa
## Min. 1st Qu. Median Mean
## lda 0.3258011 0.4081795 0.4271401 0.4250502
## glm 0.3289198 0.3908777 0.4299979 0.4238487
## knn 0.2476846 0.2963154 0.3305481 0.3244994
## rf 0.5292575 0.5996030 0.6225940 0.6204321
## 3rd Qu. Max. NA’s
## lda 0.4531663 0.5011152 0
## glm 0.4489842 0.4822137 0
## knn 0.3492237 0.4388797 0
## rf 0.6432326 0.6858892 0
dotplot(results)
[Figure: dotplot of the resampling results (Accuracy and Kappa, 95% confidence level); rf ranks highest, followed by lda, glm and knn.]
We can see that the random forest produces the most accurate model, with a mean
accuracy of 82.75%.
When we combine the predictions of different models using stacking, it is
desirable that the predictions made by the sub-models have low correlation.
This would suggest that the models are skillful but in different ways, allow-
ing a new classifier to figure out how to get the best from each model for an
improved score.
If the predictions of the sub-models were highly correlated (> 0.75), they
would be making the same or very similar predictions most of the time,
reducing the benefit of combining them.
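The pairwise correlations between the models’ resampling results can be checked with caret’s modelCor (a sketch of the call, which is not shown in the original):
modelCor(results)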
splom(results)
[Figure: scatter plot matrix (splom) of the resampled accuracies of the rf, knn, glm and lda models.]
We can see that LDA and GLM have a high correlation, while all other pairs of pre-
dictions have generally low correlation. Let’s eliminate the glm method be-
cause, of this correlated pair, it has the lower accuracy.
Let’s combine the predictions of the classifiers using a simple linear model.
caretStack finds a good linear combination of the chosen classification mod-
els. It can use linear regression, elastic net regression, or greedy optimization.
stackControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,  # resampling scheme assumed
                             savePredictions = TRUE,
                             classProbs = TRUE)
set.seed(1234)
stack.glm <- caretStack(models,
                        method = "glm",
                        metric = "Accuracy",
                        trControl = stackControl)
print(stack.glm)
We can see that the stacked model reaches an accuracy of 75.34%, a small
improvement over the lda, glm and knn models individually, although not over the
random forest alone (82.75%, as observed above).
We can also use more sophisticated algorithms to combine predictions in
an effort to tease out when best to use the different methods. In this case, we
can use the random forest algorithm to combine the predictions. This method
is slower than using glm.
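A sketch of the corresponding call (not shown in the original), analogous to the glm stack above:
set.seed(1234)
stack.rf <- caretStack(models, method = "rf", metric = "Accuracy", trControl = stackControl)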
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
print(stack.rf)
We can see that this has lifted the accuracy to 96.26%, an impressive im-
provement over the individual models.
k-means clustering
The k-means algorithm starts with a first group of randomly selected centroids,
which are used as the starting points for every cluster, and then performs
iterative (repetitive) calculations to optimise the positions of the centroids.
It halts creating and optimizing clusters when either:
• the centroids have stabilised (their positions no longer change between iterations), or
• the defined (maximum) number of iterations has been reached.
Implementation
In R, we use
kmeans(x, centers = 3, nstart = 10)
where
• x is a numeric data matrix,
• centers is the pre-defined number of clusters,
• the k-means algorithm has a random component and can be repeated nstart
times to improve the returned model.
To learn about k-means, let’s use the iris dataset with the sepal and petal
length variables only (to facilitate visualisation). Create such a data matrix and
name it iris_subset.
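A sketch of this step and of the clustering call whose output is shown below (the seed, centers = 3 and nstart = 10 are assumptions consistent with that output and with the tuning code further down):
iris_subset <- iris[, c("Sepal.Length", "Petal.Length")]  # two variables only, for easy plotting
set.seed(1234)  # assumed; k-means has a random start
cl <- kmeans(iris_subset, centers = 3, nstart = 10)
cl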
## [61] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1
## [81] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [101] 3 1 3 3 3 3 1 3 3 3 3 3 3 1 1 3 3 3 3 1
## [121] 3 1 3 1 3 3 1 1 3 3 3 3 3 3 3 3 3 3 1 3
## [141] 3 3 1 3 3 3 1 3 3 1
##
## Within cluster sum of squares by cluster:
## [1] 23.508448 9.893725 20.407805
## (between_SS / total_SS = 90.5 %)
##
## Available components:
##
## [1] "cluster" "centers"
## [3] "totss" "withinss"
## [5] "tot.withinss" "betweenss"
## [7] "size" "iter"
## [9] "ifault"
plot(iris_subset, col = cl$cluster)
[Figure 2: k-means algorithm on sepal and petal lengths; observations plotted by Sepal.Length and Petal.Length, coloured by cluster.]
Due to the random initialization, one can obtain different clustering results.
When k-means is run multiple times, the best outcome, i.e. the one that gen-
erates the smallest total within cluster sum of squares (SS), is selected. The
total within SS is calculated as:
For each cluster result:
• for each observation, determine the squared Euclidean distance from the obser-
vation to the centre of its cluster,
• then sum all of these distances (see the sketch below).
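A small sketch checking this by hand against the value stored in the fitted cl object (the helper names are illustrative):
centres <- cl$centers[cl$cluster, ]  # centre assigned to each observation
tot_within_manual <- sum((as.matrix(iris_subset) - centres)^2)  # squared distances to centres, summed
all.equal(tot_within_manual, cl$tot.withinss)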
We can check how the total of squared distances changes with different values of k
(centers).
ks <- 1:5
tot_within_ss <- sapply(ks, function(k) {
cl <- kmeans(iris_subset, k, nstart = 10)
cl$tot.withinss
})
plot(ks, tot_within_ss, type = "b")
[Figure: total within-cluster sum of squares (tot_within_ss) as a function of the number of clusters ks (1 to 5).]
• https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/
• https://towardsdatascience.com/clustering-analysis-in-r-using-k-means-73eca4fb7967
Literature
• http://topepo.github.io/caret/index.html
• https://lgatto.github.io/IntroMachineLearningWithR/