Ali Santacruz
R-Spatialist at amsantac.co
Key ideas
· Machine learning classifiers often fail to cope with imbalanced training datasets
· The performance of ML classifiers may become biased towards the majority class
· Common strategies for treating imbalanced datasets include undersampling, oversampling, synthetic data generation and cost-sensitive learning
· Metrics such as precision, recall, or F-score are preferred over overall accuracy as performance measures when dealing with imbalanced datasets (see the sketch below)
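To make the last point concrete, here is a minimal sketch of per-class precision, recall and F-score computed from a confusion matrix (predictions in rows, reference classes in columns). The helper name prf_by_class and the toy confusion matrix are illustrative, not part of the original example:

# Per-class precision, recall and F1 from a confusion matrix
# (rows = predicted class, columns = reference class)
prf_by_class <- function(cm){
  tp <- diag(cm)
  precision <- tp / rowSums(cm)   # of all pixels predicted as this class, how many are correct
  recall    <- tp / colSums(cm)   # of all reference pixels of this class, how many were found
  f1        <- 2 * precision * recall / (precision + recall)
  data.frame(precision, recall, f1)
}

# Toy two-class example with a dominant class "2": overall accuracy is 0.93,
# yet recall for the minority class "1" is only about 0.62
cm_toy <- matrix(c(8, 2,
                   5, 85), nrow = 2, byrow = TRUE,
                 dimnames = list(pred = c("1", "2"), ref = c("1", "2")))
prf_by_class(cm_toy)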
Remote sensing example
First, let's import the image to be classified and the shapefile with the training data
library(rgdal)
library(raster)
library(caret)

set.seed(123)

# Stack the surface-reflectance bands into a single multi-band raster
img <- brick(stack(as.list(list.files("data/", "sr_band", full.names = TRUE))))
names(img) <- c(paste0("B", 1:5), "B7")

# Training polygons with the response stored in the "class" attribute
trainData <- shapefile("data/training_15.shp")
responseCol <- "class"
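A quick visual check, not in the original slides, helps confirm that the image and the training polygons line up; the 4-3-2 band combination assumes a Landsat-style band layout:

# Optional sanity check: false-colour composite with the training polygons overlaid
plotRGB(img, r = 4, g = 3, b = 2, stretch = "lin")
plot(trainData, add = TRUE, border = "yellow")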
Extract data from image bands
# Extract the pixel values under each training polygon and label them with their class
dfAll <- data.frame(matrix(vector(), nrow = 0, ncol = length(names(img)) + 1))
for (i in 1:length(unique(trainData[[responseCol]]))){
  category <- unique(trainData[[responseCol]])[i]
  categorymap <- trainData[trainData[[responseCol]] == category, ]
  dataSet <- extract(img, categorymap)
  dataSet <- sapply(dataSet, function(x){cbind(x, class = rep(category, nrow(x)))})
  df <- do.call("rbind", dataSet)
  dfAll <- rbind(dfAll, df)
}
dim(dfAll)
[1] 80943 7
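Before modelling, it is worth looking at how the extracted pixels are distributed across classes; this short check uses only the dfAll object built above and already hints at the imbalance treated below:

# Class distribution of the extracted pixels
table(dfAll$class)
round(prop.table(table(dfAll$class)), 3)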
Create a partition into training and test sets
inBuild <- createDataPartition(y = dfAll$class, p = 0.7, list = FALSE)
training <- dfAll[inBuild, ]
testing <- dfAll[-inBuild, ]
dim(training)
[1] 56662 7
dim(testing)
[1] 24281 7
table(training$class)
1 2 3 5 6 7
4753 21626 14866 8093 3535 3789
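Since createDataPartition samples within each class, the split should preserve the class proportions; a quick check (not shown in the original slides):

# Stratified split: training and testing should show roughly the same class proportions
round(prop.table(table(training$class)), 3)
round(prop.table(table(testing$class)), 3)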
Model using imbalanced dataset
training_imb <- training[sample(1:nrow(training), 2400), ]
table(training_imb$class)
1 2 3 5 6 7
212 900 613 353 149 173
mod1_imb <- train(as.factor(class) ~ B3 + B4 + B5, method = "rf", data = training_imb)
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
mod1_imb$results[, 1:2]
mtry Accuracy
1 2 0.979454
2 3 0.977318
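The note about the truncated grid comes from caret's default random-forest grid with only three predictors. If you prefer explicit control over the resampling scheme and the mtry values tried, something along these lines should work; the 5-fold cross-validation setup and the object names ctrl and mod1_cv are assumptions, not what the slides used:

# Explicit resampling scheme and tuning grid instead of caret's defaults
ctrl <- trainControl(method = "cv", number = 5)
mod1_cv <- train(as.factor(class) ~ B3 + B4 + B5, method = "rf",
                 data = training_imb, trControl = ctrl,
                 tuneGrid = data.frame(mtry = c(1, 2, 3)))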
Balancing a dataset by undersampling
# Randomly drop samples from each class until every class has nsamples_class rows
undersample_ds <- function(x, classCol, nsamples_class){
  for (i in 1:length(unique(x[, classCol]))){
    class.i <- unique(x[, classCol])[i]
    if ((sum(x[, classCol] == class.i) - nsamples_class) != 0){
      x <- x[-sample(which(x[, classCol] == class.i),
                     sum(x[, classCol] == class.i) - nsamples_class), ]
    }
  }
  return(x)
}
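For comparison, caret ships downSample() (and upSample()) helpers that achieve a similar balancing; this minimal sketch, in which the object name training_ds is illustrative, keeps as many samples per class as the smallest class and returns the labels in a Class column:

# Equivalent balancing with caret's built-in helper
training_ds <- downSample(x = training[, names(img)],
                          y = as.factor(training$class))
table(training_ds$Class)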
Balance training dataset
(nsamples_class <- 400)
[1] 400
training_bc <- undersample_ds(training, "class", nsamples_class)
table(training_bc$class)
1 2 3 5 6 7
400 400 400 400 400 400
Model using balanced dataset
mod1_bc <- train(as.factor(class) ~ B3 + B4 + B5, method = "rf", data = training_bc)
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
mod1_bc$results[, 1:2]
mtry Accuracy
1 2 0.9797371
2 3 0.9766507
Evaluate accuracy of the two models using the testing set
# Imbalanced data
pred1_imb <- predict(mod1_imb, testing)
confusionMatrix(pred1_imb, testing$class)$overall[1]
Accuracy
0.9829496
# Balanced data
pred1_bc <- predict(mod1_bc, testing)
confusionMatrix(pred1_bc, testing$class)$overall[1]
Accuracy
0.9788312
Evaluate sensitivity of the two models using the testing set
# Imbalanced data
confusionMatrix(pred1_imb, testing$class)$byClass[, 1]
Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
0.9951644 0.9794283 0.9806938 0.9945213 1.0000000 0.9809816
# Balanced data
confusionMatrix(pred1_bc, testing$class)$byClass[, 1]
Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
0.9941973 0.9662191 0.9759849 0.9904844 1.0000000 0.9975460
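Sensitivity alone ignores false alarms; per-class precision and F-score can be pulled from the same confusion matrices, for instance with the illustrative prf_by_class helper sketched after the key ideas:

# Per-class precision/recall/F1 for both models
# (confusionMatrix tables have predictions in rows and reference classes in columns)
prf_by_class(confusionMatrix(pred1_imb, testing$class)$table)
prf_by_class(confusionMatrix(pred1_bc, testing$class)$table)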
Further resources
· For a detailed explanation, please see:
This post on my blog (includes a link for downloading the sample data and source code)
And this video on my YouTube channel
· Also check out these useful resources:
Practical guide to deal with imbalanced classification problems in R
8 tactics to combat imbalanced classes in your machine learning dataset