
Explainable Machine Learning for Land Cover Classification: An Introductory Guide

Preface
Why You are Struggling to Improve Your Land Cover Classification
(And How to Fix It)
Many years ago, I struggled to improve my land cover classification results. At the time, I was obsessed with trying the latest shiny or advanced machine learning models in order to improve land cover classification. I invested a lot of time and effort in learning how to run new and advanced machine learning models in Python or R. Yet, time and time again, the results were disappointing despite reading about successful use cases in major remote sensing and GIS journals.

So what was I doing wrong? Why was I struggling? Eventually, I realized that my mindset was fixated on applying the shiny and advanced machine learning models that were in fashion in those days. So I took a step back. I then started performing land cover classification using a simple machine learning model such as k-nearest neighbors (KNN). And guess what? The land cover classification results were similar to or even better than those of the advanced machine learning models. From that moment, I realized that there was more to it than tuning or trying to optimize advanced machine learning models. Much as it sounds like common sense, believe me, some researchers and students focus only on the newest machine learning or deep learning models.

I had to go back to the basics. I simply started by understanding the core land cover
classification problem at hand and the geography of the study area. Following that, I
looked at land cover class definitions at the appropriate scale of analysis as well as
selecting appropriate reference data, satellite imagery and ancillary data. I began to
focus on compiling reliable training sample data. Finally, I could focus on building both simple and advanced machine learning models, tuning model parameters, performing cross-validation, and evaluating the models. I also became interested in land cover classification uncertainty and errors. That is, understanding the underlying mechanisms, biases, and errors in a machine learning model. This also led me to explainable machine learning.

Researchers have recently developed methods to address the complexity and
explainability of machine learning models. But what is explainable machine learning?
How does it help to improve land cover mapping results? Explainable machine learning
is difficult to define. In this guide, explainable machine learning refers to the extent to
which the underlying mechanism of a machine learning model can be explained. That
is, explainable machine learning models allow us (humans) to explain what the model
learned and how it made predictions (post-hoc). For example, explainable machine
learning can give us insight into the algorithms used for land cover classification. This
insight can enable us to understand how the algorithm works and how it assigns classes
to pixels.


What Comes in My Guide?


Land cover classification remains challenging. The good news is that more very high resolution images are available as reference data. Even geotagged photographs are becoming available online for free download. In addition, efforts to improve transparency and accountability in machine learning models are becoming an important research topic.

This guide is not a cookbook. The guide is for those who are interested in improving
land cover classification using explainable machine learning models. The goal of the
guide is to introduce explainable machine learning in the context of land cover
classification. My hope is that you will go through all modules and try to understand
what it takes to improve land cover classification models.

Conventions Used in this Guide


The R commands or scripts are written in Arial font size 12, while the R output is prefixed with two hash signs (##). Note that long output from R code is omitted from the workbook to save space. In some cases, I use a small font size to show how the R output or results would appear; this is for illustration purposes only. Readers will see the whole R output when they execute the commands. A hash sign (#) at the start of a line of code indicates a comment.

Data, Additional R Scripts, and Exercises


Data (Sentinel-2 imagery and training samples), additional scripts, and exercises are available at Ai.Geolabs.


Table of Contents
Chapter 1. Introduction

1.1 Importance of explainable machine learning for land cover classification

1.2 What is explainable machine learning?

1.3 What will you learn?

1.4 Prerequisites

1.5 Test site and data

Chapter 2. Machine Learning Modules

2.1 Module 1: Load libraries and set up a work environment

2.2 Module 2: Prepare training data

2.3 Module 3: Perform exploratory data analysis (EDA)

2.4 Module 4: Define machine learning model tuning parameters

2.5 Module 5: Train and evaluate machine learning models

2.6 Module 6: Compute accumulated local effects (ALE)

Chapter 3. Concluding Remarks

References

Appendix


Chapter 1. Introduction
1.1 Importance of explainable machine learning for land cover
classification
During the past decades, there have been significant advancements in land cover classification due to the availability of Earth Observation (EO) data and the rapid development of machine learning algorithms. A quick scan of ScienceDirect or Google Scholar reveals many journal articles reporting successful applications of machine learning models for land cover classification. To date, many online and onsite courses, textbooks, and tutorials on machine learning models and algorithms are available. In addition, machine learning researchers have developed many machine learning libraries or packages. Many researchers use libraries and packages in free and open-source software (FOSS) and commercial off-the-shelf (COTS) software applications. In some cases, software applications have made it easy to use machine learning models without understanding machine learning algorithms, data science, or environmental remote sensing. Today, one can quickly run a machine learning model and produce a land cover map without a basic understanding of machine learning and remote sensing principles.

There is no doubt that advanced machine learning algorithms and models have
improved land cover classification. However, most of the advanced machine learning
models and algorithms are very complex. Generally, the machine learning models and
algorithms do not clearly explain how and why they make predictions. In addition, it is
challenging to trust only evaluation results from machine learning models because
evaluation measures (overall accuracy, receiver operating characteristic curve, RMSE,
etc.) are more focused on data. In practice, model performance evaluation measures
compare model output values and input data values. However, a machine learning
model can achieve high accuracy by simply memorizing features or patterns in the data.
Therefore, it is questionable to rely only on model performance evaluation and
prediction measures.

Furthermore, model evaluation measures do not provide information on how predictor
variables affect the response variable or the nature of the effects (global or local).
Hence there is a need to gain insights or understand machine learning model effects,
leading to improved model performance. The insights will also make machine learning
models more trustworthy and reliable. In terms of machine learning for geospatial
analysis (geospatial machine learning), it is also essential to acknowledge that real-
world geospatial data are imperfect due to uncertainties and errors. For example,
uncertainties in EO data due to sensor errors and lack of reliable or insufficient
reference data (Lu and Weng 2007) are typical for land cover classification. Therefore,
explainable machine learning will also provide more insights into training model
uncertainty and errors.


1.2 What is explainable machine learning?


Researchers have recently developed methods to address the complexity and
explainability of machine learning models (Roscher et al. 2019; Apley and Zhu 2020). In
addition, there have been calls to incorporate transparency and accountability in
machine learning models. As a result, many researchers are working hard to develop
explainable and interpretable machine learning models (Molnar 2019, Biecek and
Burzykowski 2020). Explainable machine learning is difficult to define. In this guide,
explainable machine learning refers to the extent to which the underlying mechanism of
a machine learning model can be explained (Biecek and Burzykowski 2020). That is,
explainable machine learning models allow us (humans) to explain what the model
learned and how it made predictions (post-hoc). Note, this is different from interpretable
machine learning (e.g., linear and logistic regression models), which refers to the extent
to which a cause and effect is observed within a model (Molnar 2019). That is, we can
understand underlying mechanisms, biases, and errors in an interpretable machine
learning model. However, domain knowledge is required to select essential variables for
an interpretable machine learning model.

In contrast, typical machine learning models focus on model prediction. That is,
standard machine learning models are good at making predictions without a solid
explanation. Hence, the need for explainable machine learning (Roscher et al. 2019).

1.3 What will you learn?


The goal of the guide is to introduce explainable machine learning in the context of land
cover classification. This guide comprises six modules. Module 1 focuses on setting up
the work environment, while module 2 focuses on preparing training data. Modules 3
and 4 focus on performing exploratory data analysis (EDA) and defining machine
learning model tuning parameters. Module 5 deals with training and evaluating the K-
nearest neighbors (KNN), random forest (RF), and support vector machines (SVMs)
models for land cover classification. We will use the Classification And REgression
Training (caret) package (Kuhn 2008) for this module. Finally, module 6 introduces
explainable machine learning models using the iml (interpretable machine learning)
package (Molnar 2020). The iml package provides tools to analyze and explain machine
learning models and predictions (post-hoc). This guide focuses on the accumulated
local effects (ALE), one of the most critical explainable machine learning methods
(Apley and Zhu 2020).

By the time you finish this guide, you will learn how to:
 import satellite imagery and training sample points;
 prepare training data;
 perform exploratory data analysis;
 tune and train machine learning models; and
 compute accumulated local effects (ALE).

1.4 Prerequisites
This guide is for people familiar with remote sensing classification, machine learning,
and R. However, beginners can also learn about remote sensing classification, machine
learning models, and R. I provide a link to further materials in the resources (see
appendix 1).

1.5 Test site and data


We are going to use Gweru - a small city in Zimbabwe - as a test site. The rainy season is from November to March, while the hottest month is October, and the coldest is July (Kamusoko et al. 2021). The average temperature ranges from 21 °C in July to 30 °C in October, while the annual rainfall is about 684 mm. In this guide, we will use only median post-rainy season Sentinel-2 imagery for land cover classification. The median post-rainy Sentinel-2 imagery (April - June 2020) was compiled in Google Earth Engine (GEE). Sentinel-2 imagery comprises 13 spectral bands with spatial resolutions ranging from 10 m to 60 m (Table 1). Reference data for the test site were digitized from very high-resolution imagery available from Google Earth™ and ESRI Satellite. We will use six land cover classes in this guide: (1) built-up; (2) bare areas; (3) cropland; (4) grass/open areas; (5) woodlands; and (6) water (see appendix 2 for the land cover class definitions).

Table 1. Spectral bands for the Sentinel-2 sensors


Sentinel-2 bands    Sentinel-2A central wavelength (nm)    Sentinel-2B central wavelength (nm)    Spatial resolution (m)

Band 1 - Coastal aerosol 442.7 442.2 60

Band 2 - Blue 492.4 492.1 10

Band 3 - Green 559.8 559.0 10

Band 4 - Red 664.6 664.9 10

Band 5 - VRE 1 704.1 703.8 20

Band 6 - VRE 2 740.5 739.1 20

Band 7 - VRE 3 782.8 779.7 20

Band 8 - NIR 832.8 832.9 10

Band 8A - Narrow NIR 864.7 864.0 20

Band 9 - Water vapour 945.1 943.2 60

Band 10 - SWIR (Cirrus) 1,373.5 1,376.9 60

Band 11 - SWIR 1 1,613.7 1,610.4 20

Band 12 - SWIR 2 2,202.4 2,185.7 20

Note: VRE - vegetation red edge; NIR - Near-infrared; and SWIR - short-wave infrared bands


Chapter 2. Machine Learning Modules


2.1 Module 1: Load libraries and set up our environment
Before getting started, install and load the necessary packages that will be used in this
guide.
# Install the necessary packages
# install.packages("caret")
# install.packages("randomForest")
# Load the libraries
rm(list=ls())
library(rgdal) # interface to the Geospatial Data Abstraction Library (GDAL)
library(raster) # provides important functions for importing and handling raster data
library(mapview) # provides functions to visualize geospatial data
library(sf) # represents simple features as native R objects
library(randomForest) # implements Breiman’s random forest (RF) algorithm
library("kernlab") # implements support vector machine (SVM) algorithm
library(dplyr) # provides basic data transformation functions
library(ggplot2) # provides extension for visualizations
library(corrplot) # provides graphical display of a correlation matrix
library(iml) # provides tools for analyzing any black box machine learning model
library(caret) # provides procedures for training and evaluating machine learning models
library(rasterVis) # provides functions to display raster objects
## Warning: package 'lattice' was built under R version 4.1.0

Next, set up your work environment. Note, this is the directory or folder that contains
your Sentinel-2 imagery and training data.

# set up your working directory


setwd("/home/ ~/Urban/Gweru/Satellite_Imagery/Sentinel/Sentinel2/2020")

Create a raster object “rvars” (that is, raster variables), which will contain the post-rainy
season Sentinel-2 imagery. Next, load the imagery into the “rvars” object using the
stack() function.

# Create a raster object “rvars” and import Sentinel-2 imagery


rvars <- stack("Gw_S2_Apr_Jun2020.tif")

Check the attributes of the post-rainy season Sentinel-2 imagery (compiled between
April and June 2020).

# Check the raster variable object
print(rvars)
## class : RasterStack
## dimensions : 2489, 2657, 6613273, 9 (nrow, ncol, ncell, nlayers)
## resolution : 10, 10 (x, y)
## extent : 783740, 810310, 7835490, 7860380 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=35 +south +datum=WGS84 +units=m +no_defs
## names : B2, B3, B4, B5, B6, B7, B8, B11, B12

For this guide, we selected only nine spectral bands. Note that the vegetation red edge
bands (B5, B6, and B7) and the short-wave infrared bands (B11 and B12) were
resampled to 10 m to match the spatial resolution of the visible and near-infrared (NIR)
bands.
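The resampling itself was done in GEE before export, but for readers working entirely in R, here is a minimal sketch of how a 20 m band could be resampled to a 10 m grid with the raster package. The file names below are hypothetical and only illustrate the idea.

# A minimal sketch (hypothetical file names): resample a 20 m band to a 10 m grid
b11_20m <- raster("B11_20m.tif") # a 20 m SWIR band
b2_10m <- raster("B2_10m.tif") # a 10 m band used as the target grid
b11_10m <- resample(b11_20m, b2_10m, method = "ngb") # nearest-neighbour resampling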

Next, create an object "ta_data" (that is, training data), which will contain training
sample data points. We will use the readOGR() function to load the training sample
points into the “ta_data” object.

# Create an object “ta_data” and import the training data


ta_data <- readOGR(getwd(), "Gweru_TA_2020_Points_Dec")

## OGR data source with driver: ESRI Shapefile


## Source: "/home/kamusoko/Documents/Projects/DL_ML_Apps/GEE_Classifications/
Urban/Gweru/Satellite_Imagery/Sentinel/Sentinel2/2020", layer:
"Gweru_TA_2020_Points_Dec"
## with 1911 features
## It has 1 fields

The training data contains 1,911 land cover features (points) and one field (column).

You can also check the training data using the print() function.

# Check the training data object


print(ta_data)
## class : SpatialPointsDataFrame
## features : 1911
## extent : 783921.7, 810123.5, 7835656, 7860248 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=35 +south +datum=WGS84 +units=m +no_defs
## variables : 1
## names : Class
## min values : Bare areas
## max values : Woodlands

Page 9
Explainable Machine Learning for Land Cover Classification: An Introductory Guide

Next, use the viewRGB() function from the mapview package to display Sentinel-2
imagery and the training data (Figure 1). Click on the training points to see the
corresponding land cover classes.

# Display Sentinel-2 imagery and training data


viewRGB(rvars, r = 3, g = 2, b = 1, map.types = "Esri.WorldImagery") +
mapview(ta_data)

Figure 1. Training data points overlaid on Sentinel-2 imagery (displayed as true color
composite; r = 3, g = 2, b = 1)

2.2 Module 2: Prepare training data


To train machine learning models, we need to create a training sample data frame that
contains spectral reflectance and the corresponding land cover classes.

First, we are going to use the extract() function to extract spectral reflectance values from the Sentinel-2 bands (rvars) at the training sample points (ta_data). This creates a data frame called "ta" (training area) that contains spectral reflectance values from the Sentinel-2 bands. This computation takes a bit of time, depending on your computer memory. So be patient!

# Extract spectral reflectance values from Sentinel-2 bands and training sample points
ta <- as.data.frame(extract(rvars, ta_data))

Next, we are going to plot the mean spectral reflectance from Sentinel-2 bands.

# Procedure to create a spectral profile plot from the Sentinel-2 imagery


# Create a data frame to store the mean spectral reflectance (msr)
msr <- aggregate(ta, list(ta_data$Class), mean, na.rm = TRUE)

# Remove the first column in order to use the actual land cover names
rownames(msr) <- msr[,1]
msr <- msr[,-1]
# Create a matrix
msr <- as.matrix(msr)

# Specify column names (bands)


colnames(msr) <- names(rvars)

# Create a land cover color palette


mycolor <- c("grey", "red", "yellow", "green", "blue", "darkgreen")

# Create an empty plot


plot(1, ylim=c(0.01, 0.35), xlim = c(1,9), xlab = "Bands", ylab = "Reflectance", xaxt='n')

# Custom X-axis
axis(1, at=1:9, lab=colnames(msr))
# add spectral reflectance
for (i in 1:nrow(msr)){
lines(msr[i,], type = "o", lwd = 3, lty = 1, col = mycolor[i])
}
# Add the title
title(main="Spectral Profile from Sentinel-2 bands", font.main = 2)
# Add the legend
legend("topleft", rownames(msr), cex=0.8, col=mycolor, lty = 1, lwd =3, bty = "n")

Figure 2. Spectral profile generated from post-rainy season Sentinel-2 bands


Figure 2 shows a spectral profile that represents mean spectral reflectance. For built-up
and bare areas classes, there is a close spectral similarity in the blue (B2), green (B3),
red (B4), vegetation red edge 1 (B5), and vegetation red edge 2 (B6) bands, and narrow
separability in the other bands. However, spectral separability is relatively wide between
built-up and bare areas classes on the one hand and cropland class on the other hand,
especially for bands 2 to 7. There is also a close spectral similarity between the cropland and grass/open areas classes during the post-rainy season (Kamusoko et al. 2021). In the test site, most crops are harvested during the
post-rainy season. As a result, cropland areas are left with a crop residue or are
covered by grass, which has the same spectral reflectance as grass/open areas. We
can also observe a relatively broader spectral separability between grass/ open areas
and woodlands classes. However, there is a close spectral similarity in the blue, green,
and red, and wide separability in the other bands for water and woodlands classes.

Note that the spectral plot represents only the mean spectral reflectance for this specific
training data in the test site and during the post-rainy season. Therefore, mean spectral
reflectance should be interpreted with caution since it depends on the particular
imagery, local conditions, the quantity and quality of training data. However, spectral
information is vital because it helps us visually assess the separability of land cover
classes. Furthermore, mean spectral reflectance helps select optimal bands and
optimize machine learning models.

Next, create a data frame with labeled training points containing all spectral reflectance
values from the training points.

# Create a data frame with training points containing all spectral reflectance values
ta_data@data = data.frame(ta_data@data,ta[match(rownames(ta_data@data),
rownames(ta)),])

Next, take a look at the whole training data set structure using the str() function.

# Check the structure of the training data


str(ta_data@data)
## 'data.frame': 1911 obs. of 10 variables:
## $ Class: Factor w/ 6 levels "Bare areas","Built-up",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ B2 : num 0.188 0.125 0.242 0.261 0.148 ...
## $ B3 : num 0.224 0.143 0.266 0.293 0.177 ...
## $ B4 : num 0.232 0.187 0.326 0.296 0.212 ...

The data frame consists of 1,911 training data points. Ten variables comprise a
response variable (Class) with six target land cover classes and nine predictor
variables. The six target land cover classes are bare areas, built-up, grass/ open areas,

cropland, woodlands, and water. In contrast, the predictor variables consist of nine
Sentinel-2 bands (see Table 1).

Next, use the complete.cases() function to check if there are missing values or NAs in the training data. Alternatively, you can use the anyNA() function to check whether the training data contains missing values. It is important to treat or remove missing values before model training because some machine learning models cannot handle them.

# Check for missing values or NAs


ta_data_complete <- complete.cases(ta_data@data)
head(ta_data_complete)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

The training data does not contain missing values or NAs. However, you can use the
na.omit() function if the training data has missing values or NAs.

Machine learning experts recommend splitting the training data into training and testing
sets. In this guide, the createDataPartition() function from the caret package splits
training data into training and testing sets. We will use 60% of the training data as a
training set and 40% as a testing set. The training set will be used to find the best
model parameters and check initial training model performance based on repeated
cross-validation. The testing set will be used for checking model accuracy. First, we
need to set a pre-defined value using set.seed() function so that results will be
repeatable. Setting the seed to the same number will ensure that we get the same
result. Choose any number you like to set the seed (Kamusoko 2019).

# Set seed and split training data


set.seed(27)
inTraining <- createDataPartition(ta_data@data$Class, p = .60, list = FALSE)
training <- ta_data@data[ inTraining,]
testing <- ta_data@data[-inTraining,]

2.3 Module 3: Perform exploratory data analysis (EDA)


Checking the summary statistics of training data and graphical visualization before
training machine learning models is very important. Unfortunately, many remote sensing
analysts ignore EDA because they assume that remote sensing imagery and reference
data (training and testing) are accurate. However, this is not true as errors are
introduced during the remote sensing data acquisition or compilation of reference data.
Therefore, it is crucial to check summary statistics as well as visualize training data.
EDA helps to identify distributions, unusual anomalies, outliers, and predictor
correlations.

We start by checking the descriptive statistics of the training data set using the
summary() function.


# Check summary statistics


summary(training)
## Class B2 B3 B4
## Bare areas :131 Min. :0.02530 Min. :0.0475 Min. :0.02960
## Built-up :446 1st Qu.:0.05040 1st Qu.:0.0769 1st Qu.:0.08205
## Cropland :156 Median :0.07320 Median :0.1037 Median :0.12010
## Grass/ open:332 Mean :0.08233 Mean :0.1133 Mean :0.12889
## Water : 25 3rd Qu.:0.10025 3rd Qu.:0.1367 3rd Qu.:0.16370
## Woodlands : 59 Max. :0.38050 Max. :0.4098 Max. :0.46060

The output above shows the number of training points per land cover class (Class) in the first column. The remaining columns provide descriptive statistics for each Sentinel-2 band. However, it is difficult to understand the nature of the distributions or to detect outliers and collinearity from summary statistics alone.

Next, we will examine the training data using graphical visualizations. Graphs help select appropriate methods for pre-processing, transformation, and classification. We are going to use the featurePlot(), ggplot(), corrplot(), and corrplot.mixed() functions.

Next, create density estimation plots for each attribute by land cover class value.

# Create density estimation plots (density plots)


featurePlot(x = training[, 2:10],
y = training$Class,
plot = "density",
labels=c("Reflectance", "Density distribution"),
scales = list(x = list(relation="free"),
y = list(relation="free")),
layout = c(3, 4),
auto.key = list(columns = 3))

Figure 3. Density plots for the training data


Figure 3 summarizes the distribution of the data for each land cover class and Sentinel-2 band. Generally, the distributions are skewed and multimodal, indicating that the underlying population distributions are not normal.

Next, display the box and whisker plots for each attribute by class value.

# Create box plots


featurePlot(x=training[,2:10],y=training$Class,
plot="box",
scales=list(y=list(relation="free"),
x=list(rot=90)),
layout=c(3,4))

Figure 4 shows the presence of outliers, especially for the bare areas, built-up, and
grass/ open areas classes. Outliers can cause problems for machine learning models.
While researchers recommend removing the outliers from the training data, this could
throw away valuable training data. Therefore, it is critical to understand the reason for
having outliers in the first place. For example, outliers in the bare and grass/ open areas
are attributed to spectral confusion between the two land cover classes.

Figure 4. Box plot for the training data
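As a quick check, the whisker rule behind the box plots can also flag outliers programmatically. Below is a minimal sketch, assuming the "training" data frame from Module 2; the choice of band and class is only an illustration.

# A minimal sketch: flag potential outliers in band 2 for the bare areas class
# using the 1.5 x IQR (boxplot) rule
b2_bare <- training$B2[training$Class == "Bare areas"]
b2_out <- boxplot.stats(b2_bare)$out # values beyond the whiskers
length(b2_out) # number of flagged observations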

Next, we are going to use the cor() and corrplot() functions to check the correlation
between predictor variables (Sentinel-2 bands).

# Check band correlations


bandCorrelations <- cor(training[, 2:10])
bandCorrelations
## B2 B3 B4 B5 B6 B7 B8
## B2 1.0000000 0.9562693 0.8636215 0.7974834 0.4679139 0.4007641 0.3947219
## B3 0.9562693 1.0000000 0.9297967 0.8743282 0.5758543 0.4707148 0.4582506
## B4 0.8636215 0.9297967 1.0000000 0.9284172 0.5623456 0.4317394 0.3983765
## B5 0.7974834 0.8743282 0.9284172 1.0000000 0.6885852 0.5578760 0.4773255
## B6 0.4679139 0.5758543 0.5623456 0.6885852 1.0000000 0.9602700 0.8842688
## B7 0.4007641 0.4707148 0.4317394 0.5578760 0.9602700 1.0000000 0.9387786
## B8 0.3947219 0.4582506 0.3983765 0.4773255 0.8842688 0.9387786 1.0000000
## B11 0.4654458 0.4803279 0.5610827 0.6347700 0.5259243 0.4998926 0.4961494
## B12 0.5670506 0.5541277 0.6290936 0.6793260 0.4206535 0.3832866 0.3678112

The raw correlation matrix is also difficult to read. So we will display a mixed plot (numbers and colored squares) to visualize the correlations between the predictors.

# Display mixed plot


corrplot.mixed(bandCorrelations,lower.col="black", number.cex = .7, upper = "color")

Figure 5 shows positive correlations in dark blue, while the red color shows negative
correlations. The color intensity and the size of the square are proportional to the
correlation coefficients. Bands 2 and 3, 3 and 4, 6 and 7, 7 and 8, and 11 and 12 are
highly correlated (correlation coefficient is above 0.9). Machine learning experts
recommend removing the highly correlated bands or performing a principal component
analysis to reduce redundancy and improve computation efficiency.

Figure 5. Mixed plot showing correlation among the Sentinel-2 bands
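As an aside, caret provides the findCorrelation() function to flag predictors for removal based on pairwise correlations. The sketch below is only an illustration; the 0.9 cutoff is an assumption, and we do not remove any bands in this guide.

# A minimal sketch: flag highly correlated bands with caret's findCorrelation()
highCor <- findCorrelation(bandCorrelations, cutoff = 0.9)
names(training[, 2:10])[highCor] # candidate bands to drop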

2.4 Module 4: Define machine learning model tuning parameters


Defining model tuning parameters is vital because most machine learning models have at least one fundamental parameter that controls model complexity. This guide will use the trainControl() function from the caret package (Kuhn 2008) to evaluate how model tuning parameters affect performance. We will also use this function to select the best model and to estimate model performance. Next, define the model tuning
parameters by specifying the "method," "number," and "repeats" parameters of the
trainControl() function as shown below.

# Define model tuning parameters


fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10)

The "method" parameter contains resampling methods such as"repeatedcv", "boot",


"cv", "LOOCV" etc. In this guide, we will use repeated cross-validation (repeatedcv).
The "number" parameter refers to the number of folds or number of resampling
iterations. The "repeats" parameter provides sets of folds to compute repeated cross-
validation. Therefore, we will run a 10-fold cross-validation and repeat it ten times
(number = 10 and repeats = 10).

Cross-validation refers to the subdivision of training data into several mutually exclusive
groups. For the k-fold cross-validation, the algorithm partitions the initial training data
randomly into k mutually exclusive subsets or folds. Training and validation are then
performed k times. The k-fold cross-validation repeats the model building based on
different subsets of the available training data. Then it evaluates the model only on
unseen data during model building. Kuhn and Johnson (2016) recommend a k value of 5 or 10 to minimize bias.
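To make the resampling concrete, the sketch below uses caret's createFolds() to show what one round of 10-fold partitioning looks like. This is for illustration only; the train() function performs the folding automatically via fitControl.

# A minimal sketch: create 10 folds of held-out indices from the training set
set.seed(27)
folds <- createFolds(training$Class, k = 10, list = TRUE)
length(folds) # 10 folds
length(folds[[1]]) # roughly one-tenth of the training samples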

2.5 Module 5: Train and evaluate machine learning models


The train() function from the caret package (Kuhn 2008) is used for tuning and training
all machine learning models. The function sets tuning parameters, fits each model, and
calculates a resampling-based performance measure. For each training data set, the
performance of held-out samples is calculated, and summary statistics are provided.
The final model with the optimal resampling statistic is selected.

2.5.1 K-nearest neighbors (KNN) model


K-nearest neighbors (KNN) is a simple non-parametric algorithm used for either classification or regression. The algorithm predicts a new data sample using the k closest samples from the training data. KNN depends on a distance metric such as the Euclidean distance (the straight-line distance between two samples). Therefore, pre-processing (centering and scaling) of all predictor variables is mandatory before model training to remove biases (Kuhn and Johnson 2016). Pre-processing is important because it allows all predictors to be treated equally when computing distances. While the KNN algorithm is simple, it is computationally inefficient since it must compute the distance to every training sample in order to find the k nearest neighbors.
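To see why centering and scaling matter, the sketch below computes the Euclidean distance between the first two training samples on the raw bands and on the standardized bands. This is an illustration only, assuming the "training" data frame from Module 2.

# A minimal sketch: Euclidean distance before and after centering and scaling
x1 <- unlist(training[1, 2:10])
x2 <- unlist(training[2, 2:10])
sqrt(sum((x1 - x2)^2)) # distance on raw reflectance values
sc <- scale(training[, 2:10]) # centered and scaled predictors
sqrt(sum((sc[1, ] - sc[2, ])^2)) # distance on standardized values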

For training the KNN model, we specify the train() function. First, we define "Class~.,"
which denotes a formula for using all attributes in the model. The "Class" is the
response variable, and the "data" contains the Sentinel-2 band reflectance values. The
"method" is defined as "kknn", while the tune control (trControl) is specified as
"fitControl". We save all results in the object "knnFit".

Let’s start to train the KNN model.

# Set-up timer to check how long it takes to run the model


timeStart <- proc.time() # Starting time

# Train a KNN model


set.seed(27)
knnFit <- train(Class ~ ., data = training,
method = "kknn",
preProcess = c("center", "scale"),
trControl = fitControl)

# Check the KNN model performance


print(knnFit)
## k-Nearest Neighbors
##
## 1149 samples
## 9 predictor
## 6 classes: 'Bare areas', 'Built-up', 'Cropland', 'Grass/ open', 'Water', 'Woodlands'
## Pre-processing: centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1034, 1036, 1033, 1036, 1034, 1032, ...
## Resampling results across tuning parameters:
##
## kmax Accuracy Kappa
## 5 0.7524870 0.6602668
## 7 0.7568685 0.6665597
## 9 0.7641887 0.6751390
##
## Tuning parameter 'distance' was held constant at a value of 2
## Tuning
## parameter 'kernel' was held constant at a value of optimal
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were kmax = 9, distance = 2 and kernel =
optimal.
proc.time() - timeStart # Finishing time
## user system elapsed
## 45.681 0.000 45.679

The output shows that the algorithm used 1,149 training samples for training. There are
nine predictors (that is, nine Sentinel-2 bands) and six land cover classes (bare areas,
built-up, cropland, grass/ open areas, woodlands, and water). The cross-validation
results show the sample sizes and final values that were selected for the best model.
The best model has a kmax = 9 and a distance = 2.

We can also display the model training performance based on overall accuracy using
the command below.

# Display the KNN model plot


plot(knnFit)

Figure 6 shows the repeated cross-validation profile for the KNN model, which is based on overall accuracy. The model achieved about 76% accuracy at kmax = 9. Accuracy decreases only slightly from kmax = 9 to kmax = 5, which suggests that increasing the k value improves accuracy only marginally.

Figure 6. Repeated cross-validation (based on overall accuracy) profile for the KNN
model

Now, check the parameters of the best model.

# Check the best model parameters


knnFit$finalModel
##
## kknn::train.kknn(formula = .outcome ~ ., data = dat, kmax = param$kmax,
## distance = param$distance, kernel = as.character(param$kernel))
## Type of response variable: nominal
## Minimal misclassification: 0.2315057
## Best kernel: optimal
## Best k: 8


Next, display variable importance using the varImp() function.

# Compute variable importance


knn_varImp <- varImp(knnFit)
ggplot(knn_varImp)

Figure 7. KNN variable importance

Figure 7 shows that the most significant band contribution is observed for the cropland,
grass/open areas, and woodland classes. The band contribution for the built-up and
bare areas classes is moderate, while the water class is relatively low. Although the
KNN model variable importance provides essential insights, the variable importance
values are suspiciously high for the grass/open areas, cropland, and woodland classes.
In addition, band 8 (NIR) has a low contribution for the woodlands class, which
contradicts the spectral profile (Figure 2). It is not easy to understand how the KNN
model came up with the variable importance ranking. Therefore, we should interpret the
variable importance metrics with caution. Strobl et al. (2008) reported that variable
importance metrics are biased when the predictor variables (Sentinel-2 bands) are
highly correlated, which leads to selecting suboptimal predictor variables.

We will assess the KNN model performance using new testing data and the
confusionMatrix() function. The KNN model results are used for prediction and building
a confusion matrix.

# Prepare a confusion matrix


pred_knn <- predict(knnFit, newdata = testing)
confusionMatrix(data = pred_knn, testing$Class)


## Confusion Matrix and Statistics


## Reference
## Prediction Bare areas Built-up Cropland Grass/ open Water Woodlands
## Bare areas 39 10 3 4 0 0
## Built-up 16 255 8 5 0 0
## Cropland 15 11 59 22 0 0
## Grass/ open 17 19 34 186 1 9
## Water 0 0 0 1 15 0
## Woodlands 0 1 0 3 0 29

## Overall Statistics
## Accuracy : 0.7651
## 95% CI : (0.7333, 0.7948)
## No Information Rate : 0.3885
## P-Value [Acc > NIR] : < 2.2e-16
## Kappa : 0.6755

## Mcnemar's Test P-Value : NA


## Statistics by Class:
##                      Class: Bare areas Class: Built-up Class: Cropland Class: Grass/ open Class: Water Class: Woodlands
## Sensitivity                    0.44828          0.8615         0.56731             0.8416      0.93750          0.76316
## Specificity                    0.97481          0.9378         0.92705             0.8521      0.99866          0.99448
## Pos Pred Value                 0.69643          0.8979         0.55140             0.6992      0.93750          0.87879
## Neg Pred Value                 0.93201          0.9142         0.93130             0.9294      0.99866          0.98765
## Prevalence                     0.11417          0.3885         0.13648             0.2900      0.02100          0.04987
## Detection Rate                 0.05118          0.3346         0.07743             0.2441      0.01969          0.03806
## Detection Prevalence           0.07349          0.3727         0.14042             0.3491      0.02100          0.04331
## Balanced Accuracy              0.71155          0.8996         0.74718             0.8469      0.96808          0.87882

The overall accuracy is 76.5%, while the "No Information Rate" is about 39%. The
producer's accuracy (sensitivity) is lower than the user's accuracy (Pos Pred Value) for
bare areas, indicating an underestimation. Concerning the built-up class, the producer's
accuracy (sensitivity) is higher than the user's accuracy (Pos Pred Value), suggesting
that the KNN model overestimated built-up areas. In contrast, the producer's accuracy
(sensitivity) is lower than the user's accuracy (Pos Pred Value) for the cropland class,
which indicates omission errors and thus an underestimation of the cropland areas.
However, the producer's accuracy is substantially higher than the user's accuracy for
the grass/ open areas class, indicating an overestimation. Generally, the woodland and
water classes have high individual accuracies. However, bare areas and cropland
classes have low individual accuracies, suggesting spectral confusion and class
imbalance problems. A closer look at the confusion matrix shows high misclassification
errors, especially between bare areas and built-up classes and grass/ open areas and
cropland classes.
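For readers who want to verify these numbers, the producer's and user's accuracies can be derived directly from the confusion matrix table. The sketch below is a minimal illustration using the objects created above.

# A minimal sketch: producer's and user's accuracies from the confusion matrix
# (columns of the table are the reference classes; rows are the predictions)
cm_knn <- confusionMatrix(data = pred_knn, testing$Class)
producers <- diag(cm_knn$table) / colSums(cm_knn$table) # = sensitivity
users <- diag(cm_knn$table) / rowSums(cm_knn$table) # = pos pred value
round(rbind(producers, users), 3)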


Next, use the KNN model to classify the Sentinel-2 imagery. We will use the predict() function from the raster package to perform the prediction or classification. We are going to pass two arguments: the first is the raster object with the Sentinel-2 bands ("rvars"), and the second is the trained model ("knnFit"). The predict() function will return a classified raster that we will save as "lc_knn".

# Predict land cover


timeStart<- proc.time() # measure computation time
lc_knn <- predict(rvars, knnFit)
proc.time() - timeStart # user time and system time.
## user system elapsed
## 321.922 3.182 325.176

We will display the land cover map using the levelplot() function (Figure 8).

# Display land cover map


# Use levelplot from the rasterVis to display the land cover map
levelplot(lc_knn, col.regions=c("grey","red","yellow","green", "blue", "darkgreen"))

# Alternatively, you can use plot() or spplot() to display the land cover map
# plot(lc_knn, col = c("grey","red","yellow","green", "blue", "darkgreen"))
# spplot(lc_knn, col.regions=c("grey","red","yellow","green", "blue", "darkgreen"))

Figure 8. Land cover classification based on post-rainy season Sentinel-2 imagery and
a KNN model


Figure 8 shows an underestimation of bare areas and a substantial overestimation of built-up areas. Severe classification problems attributed to spectral confusion are conspicuous in the test site. For example, the KNN model misclassified some grass/open areas in the south and south-western parts of the city as built-up areas. During the late post-rainy season, grass/open and cropland areas have a spectral response similar to bare soil. Therefore, it is difficult to separate them from built-up areas.

Next, calculate the land cover class statistics produced using the KNN model.

# Count of pixels per class


class_freq <- freq(lc_knn, useNA="no")
class_freq
## value count
## [1,] 1 185387
## [2,] 2 452968
## [3,] 3 903788
## [4,] 4 4577341
## [5,] 5 8483
## [6,] 6 485306

# Get resolution of land cover map


res_lc <- res(lc_knn)
# Multiply the amount of pixels per class with the area
area_km2 <- class_freq[, "count"] * prod(res_lc) * 1e-06

# Create a data frame for the land cover


lc_knn_map <- data.frame(landcover = (c("Bare areas", "Built-up", "Cropland", "Grass/
open areas", "Water", "Woodlands")), area_km2 = area_km2)

# View land cover statistics


lc_knn_map
## landcover area_km2
## 1 Bare areas 18.5
## 2 Built-up 45.3
## 3 Cropland 90.4
## 4 Grass/ open areas 457.7
## 5 Water 0.8
## 6 Woodlands 48.5

Finally, save the land cover map so that you can visualize it in GIS software.

# Save the land cover map in img format


writeRaster(lc_knn, filename="Gw_KNN_S2_LC_2020.img", type="raw",
datatype='INT1U', index=1, na.rm=TRUE, progress="window", overwrite=TRUE)


2.5.2 Random forest (RF) model


Random forests (Breiman 2001) have been used successfully for land cover
classification (Rodriguez-Galiano et al. 2012; Mellor et al. 2013). Random forests are
supervised algorithms that build many decision trees and combine them to solve
classification or regression problems. The algorithm extends the bagging method since
it uses both bagging and feature randomness to create an uncorrelated forest of
decision trees. The RF algorithm works by drawing a data sample from a training set
with replacement (bootstrap sample). Generally, one-third of the training set is used for
validation (the out-of-bag (OOB) sample). For a classification task, a majority vote is
used for class prediction. Three main tuning parameters, namely the node size, the number of trees, and mtry (the number of randomly selected predictors (k) to choose from at each split), need to be set before training. We can specify the RF model as shown below.

# Train a RF model
timeStart <- proc.time()
set.seed(27)
rf_model <- train(Class ~.,data = training,
method = "rf",
trControl = fitControl,
prox = TRUE,
fitBest = FALSE,
localImp = TRUE,
returnData = TRUE)
proc.time() - timeStart
## user system elapsed
## 343.129 0.424 343.514
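For comparison, the three tuning parameters can also be set explicitly by calling randomForest() directly. The sketch below is only an illustration with assumed values; in this guide, caret's train() tunes mtry for us.

# A minimal sketch (illustrative values, not tuned): fit an RF directly
set.seed(27)
rf_direct <- randomForest(Class ~ ., data = training,
                          ntree = 500,  # number of trees
                          mtry = 3,     # predictors sampled at each split
                          nodesize = 1) # minimum size of terminal nodes
rf_direct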

Next, check the RF model performance. Again we use the print() and plot() functions.

# Check the RF model performance


print(rf_model)
## Random Forest
##
## 1149 samples
## 9 predictor
## 6 classes: 'Bare areas', 'Built-up', 'Cropland', 'Grass/ open', 'Water', 'Woodlands'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1034, 1036, 1033, 1036, 1034, 1032, ...
## Resampling results across tuning parameters:


##
## mtry Accuracy Kappa
## 2 0.7777824 0.6893787
## 5 0.7784873 0.6916921
## 9 0.7744061 0.6870036
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.

The RF model used 1,149 training samples for training. The model achieved about 78% accuracy with mtry = 5. Note that the RF model accuracy is slightly better than the KNN model accuracy (76%).

Next, display the RF model training performance based on overall accuracy (Figure 9).
# Display the RF model plot
plot(rf_model)

Figure 9. Repeated cross-validation (based on overall accuracy) profile for the RF model

Next, check the parameters of the best model.

# Check the best model parameters


rf_model$finalModel
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE, fitBest =
FALSE, returnData = TRUE)
## Type of random forest: classification
## Number of trees: 500


## No. of variables tried at each split: 5
## OOB estimate of error rate: 22.37%
## Confusion matrix:
## Bare areas Built-up Cropland Grass/ open Water Woodlands class.error
## Bare areas 65 35 8 21 0 2 0.5038168
## Built-up 8 401 12 25 0 0 0.1008969
## Cropland 6 25 85 40 0 0 0.4551282
## Grass/ open 15 20 22 270 0 5 0.1867470
## Water 0 1 0 1 23 0 0.0800000
## Woodlands 0 0 1 10 0 48 0.1864407

The output shows a confusion matrix for the best model (after cross-validation). The RF algorithm used 500 decision trees and tried five of the nine predictor Sentinel-2 bands at each split. The out-of-bag (OOB) estimate of the error rate is about 22%, indicating poor training model performance. Note that the land cover class errors for the bare areas and cropland classes are very high, which is not good. The RF model is severely affected by spectral confusion between bare areas and the built-up, cropland, and grass/ open areas classes. For example, the RF model misclassified 35 bare areas pixels as built-up.

Next, display variable importance using the varImp() function.

# Compute variable importance


rf_varImp <- varImp(rf_model)
ggplot(rf_varImp)

Figure 10. Random forest variable importance


Figure 10 shows that the most significant band contribution is observed for the
grass/open areas, built-up, and cropland classes. The band contribution for the bare
areas and woodland classes is moderate, while the water class is relatively low. Again,
the variable importance metrics should be interpreted with caution because the
Sentinel-2 bands are highly correlated (Figure 5). The variable importance values are
suspiciously high for the grass/open areas. In addition, band 8 (NIR) has no contribution
to the woodlands class, suggesting serious model training problems.

Next, assess the RF model performance using new testing data. We are going to use
the RF model results for prediction and then build a confusion matrix.

pred_rf <- predict(rf_model, newdata = testing)


confusionMatrix(data = pred_rf, testing$Class)

## Confusion Matrix and Statistics


## Reference
## Prediction Bare areas Built-up Cropland Grass/ open Water Woodlands
## Bare areas 40 6 3 7 0 0
## Built-up 23 271 14 10 0 0
## Cropland 10 3 61 20 0 1
## Grass/ open 14 14 26 180 1 6
## Water 0 0 0 0 15 0
## Woodlands 0 2 0 4 0 31

## Overall Statistics
## Accuracy : 0.7848
## 95% CI : (0.7539, 0.8135)
## No Information Rate : 0.3885
## P-Value [Acc > NIR] : < 2.2e-16

## Kappa : 0.7002
## Mcnemar's Test P-Value : NA

## Statistics by Class:
##                      Class: Bare areas Class: Built-up Class: Cropland Class: Grass/ open Class: Water Class: Woodlands
## Sensitivity                    0.45977          0.9155         0.58654             0.8145      0.93750          0.81579
## Specificity                    0.97630          0.8991         0.94833             0.8872      1.00000          0.99171
## Pos Pred Value                 0.71429          0.8522         0.64211             0.7469      1.00000          0.83784
## Neg Pred Value                 0.93343          0.9437         0.93553             0.9213      0.99866          0.99034
## Prevalence                     0.11417          0.3885         0.13648             0.2900      0.02100          0.04987
## Detection Rate                 0.05249          0.3556         0.08005             0.2362      0.01969          0.04068
## Detection Prevalence           0.07349          0.4173         0.12467             0.3163      0.01969          0.04856
## Balanced Accuracy              0.71803          0.9073         0.76743             0.8509      0.96875          0.90375

The confusion matrix shows that the overall classification accuracy is 78%, slightly
higher than the previous KNN model. However, we observe serious classification
problems. The producer's accuracy (sensitivity) is significantly lower than the user's
accuracy (Pos Pred Value) for the bare areas, indicating an underestimation. However,
the producer's accuracy (sensitivity) is higher than the user's accuracy (Pos Pred Value) for the built-up class, indicating that the RF model overestimated built-up areas. In contrast, the cropland class has a substantially lower producer's accuracy and a higher user's accuracy, indicating omission errors. For the grass/ open areas class, the producer's accuracy is relatively higher
than the user's, indicating high commission errors. Concerning the woodland and water
classes, the producer's accuracy is slightly higher than the user's accuracy. Generally,
the RF model had difficulty separating built-up and bare areas and cropland and grass/
open areas classes.

Next, we will use the RF model to classify the Sentinel-2 imagery and create a land
cover map using the predict() function.

# Predict land cover


timeStart<- proc.time() # measure computation time
lc_rf <- predict(rvars, rf_model)
proc.time() - timeStart # user time and system time.
## user system elapsed
## 200.339 1.166 201.655

We will display the land cover map using the levelplot() function (Figure 11).

# Display land cover map


levelplot(lc_rf, col.regions=c("grey","red","yellow","green", "blue", "darkgreen"))

Figure 11. Land cover classification based on post-rainy season Sentinel-2 imagery and
an RF model


Figure 11 shows that the RF model overestimated built-up areas and underestimated
bare areas. For example, there are many misclassified built-up pixels in the northwest
and south parts of the city. In addition, the RF model misclassified some bare areas,
cropland, and grass/open areas in the south and south-western part of the city as built-
up areas.

Next, calculate the land cover class statistics produced using the RF model.

# Count of pixels per class


class_freq <- freq(lc_rf, useNA="no")
class_freq
## value count
## [1,] 1 170572
## [2,] 2 669200
## [3,] 3 922484
## [4,] 4 4358174
## [5,] 5 8386
## [6,] 6 484457

# Get resolution of land cover map


res_lc <- res(lc_rf)

# Multiply the amount of pixels per class with the area


area_km2 <- class_freq[, "count"] * prod(res_lc) * 1e-06

# Create a data frame for the land cover


lc_rf_map <- data.frame(landcover = (c("Bare areas", "Built-up", "Cropland", "Grass/
open areas", "Water", "Woodlands")), area_km2 = area_km2)

# View land cover statistics


lc_rf_map
## landcover area_km2
## 1 Bare areas 17.1
## 2 Built-up 66.9
## 3 Cropland 92.3
## 4 Grass/ open areas 435.8
## 5 Water 0.84
## 6 Woodlands 48.5

Finally, save the land cover map so that you can visualize it in GIS software.

# Save the land cover map in “img” or “geotiff” formats


writeRaster(lc_rf, filename="Gw_RF_S2_LC_2020.img", type="raw", datatype='INT1U',
index=1, na.rm=TRUE, progress="window", overwrite=TRUE)


2.5.3 Support vector machine (SVM) model


A review of the literature shows that support vector machine (SVM) models have been
used successfully for remote sensing image classification (Pal and Mather 2005; Nemmour
and Chibani 2006). SVM models are supervised machine learning methods used for
classification and regression. The goal of SVM models is to locate an optimal decision
boundary to maximize the margin between two classes (Zhang and Ma 2008; Pal and
Mather 2005). The decision boundary's location depends on the subset of training data
closest to it, known as support vectors (Cortes and Vapnik 1995; Kuhn and Johnson
2016). Boser et al. (1992) and Cortes and Vapnik (1995) improved the original SVM
classifier. For example, Cortes and Vapnik (1995) developed a cost function, while
Boser et al. (1992) developed kernel functions to allow non-linear class boundaries. The
cost function quantifies misclassification errors or penalties, while kernel functions transform the training data into a higher-dimensional feature space for non-linear classification problems (Zhang and Ma 2008; Pal and Mather 2005). Kernel functions
such as polynomial, radial basis, and hyperbolic tangent are commonly used for
transformation (Hsu et al. 2010). This guide will use the radial basis function, which has
been successfully used for remote sensing image classification. All Sentinel-2 bands
are centered and scaled to reduce the impact of predictor variables with high attribute
values (Kuhn and Johnson 2016). First, let's train the radial basis SVM model.

# Train the SVM model


timeStart <- proc.time()
set.seed(27)
svm_model<-train(Class~.,data=training,
method = "svmRadial",
trControl = fitControl,
preProc = c("center", "scale"),
tuneLength = 5)
proc.time() - timeStart
## user system elapsed
## 800.826 871.699 157.371
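As an aside, the radial basis kernel used by "svmRadial" can be computed directly with kernlab's rbfdot(). The sketch below is an illustration only; the sigma value matches the fitted value reported in the output further below.

# A minimal sketch: the radial basis kernel k(x, y) = exp(-sigma * ||x - y||^2)
rbf <- rbfdot(sigma = 0.3622052) # sigma from the fitted model
x <- unlist(training[1, 2:10])
y <- unlist(training[2, 2:10])
rbf(x, y) # kernel similarity between two samples
exp(-0.3622052 * sum((x - y)^2)) # the same value computed by hand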

Next, check the SVM model performance. Again we use the print() and plot() functions.

# Check the SVM model performance


print(svm_model)
## Support Vector Machines with Radial Basis Function Kernel
## 1149 samples
## 9 predictor
## 6 classes: 'Bare areas', 'Built-up', 'Cropland', 'Grass/ open', 'Water', 'Woodlands'
## Pre-processing: centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1034, 1036, 1033, 1036, 1034, 1032, ...
## Resampling results across tuning parameters:


## C Accuracy Kappa
## 0.25 0.7309417 0.6168089
## 0.50 0.7632173 0.6663508
## 1.00 0.7905587 0.7075065
## 2.00 0.8049228 0.7288197
## 4.00 0.8138859 0.7419757
## Tuning parameter 'sigma' was held constant at a value of 0.3622052
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.3622052 and C = 4.

The SVM model used 1,149 training samples for training. There are nine predictors (that
is, nine Sentinel-2 bands) and six land cover classes.

Next, display the SVM model training performance based on overall accuracy (Figure
12).

# Display SVM plot


plot(svm_model)

Figure 12. Repeated cross-validation (based on overall accuracy) profile for the SVM model

Figure 12 shows that the optimal model has a cost value of 4 and an accuracy of about 81%, which is relatively good.

Next, check the parameters of the best model.

# Check the best model parameters


svm_model$finalModel


## Support Vector Machine object of class "ksvm"


## SV type: C-svc (classification)
## parameter : cost C = 4
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.362205215673485
## Number of Support Vectors : 650
## Objective Function Value : -348.2444 -266.3336 -345.442 -12.5864 -36.0419 -343.6577 -343.4874 -13.323 -32.0382 -637.2282 -11.9613 -38.5012 -12.7641 -126.2647 -7.9073
## Training error : 0.105309

The SVM model selected 650 support vectors from a total of 1,149 training samples. The training error rate is relatively low (about 10.5%).

Next, display variable importance using the varImp() function.

# Compute variable importance


svm_varImp <- varImp(svm_model)
ggplot(svm_varImp)

Figure 13 also shows that the most significant band contribution is observed for the
grass/open areas, cropland, and woodland classes. We noticed the same pattern with
the KNN model variable importance. The band contribution for the built-up, bare areas,
and water classes is moderate. While the SVM model variable importance provides
essential insights, it is difficult to understand how it came up with the variable
importance ranking. Again, we should be cautious interpreting variable importance
when the spectral bands are highly correlated.

Figure 13. Variable importance for the SVM model


Next, we are going to assess the SVM model performance using new testing data. We
are going to use the SVM model results for prediction and then build a confusion matrix.

# Prepare a confusion matrix


pred_svm <- predict(svm_model, newdata = testing)
confusionMatrix(data = pred_svm, testing$Class)

Confusion Matrix and Statistics


Reference
Prediction Bare areas Built-up Cropland Grass/ open Water Woodlands
Bare areas 49 5 5 7 0 0
Built-up 12 276 11 6 0 0
Cropland 9 6 58 14 0 0
Grass/ open 17 8 29 191 1 8
Water 0 0 0 0 15 0
Woodlands 0 1 1 3 0 30

Overall Statistics
Accuracy : 0.8123
95% CI : (0.7828, 0.8395)
No Information Rate : 0.3885
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7392
Mcnemar's Test P-Value : NA

Statistics by Class:
Class: Bare areas Class: Built-up Class: Cropland Class: Grass/ open Class: Water Class: Woodlands
Sensitivity 0.56322 0.9324 0.55769 0.8643 0.93750 0.78947
Specificity 0.97481 0.9378 0.95593 0.8835 1.00000 0.99309
Pos Pred Value 0.74242 0.9049 0.66667 0.7520 1.00000 0.85714
Neg Pred Value 0.94540 0.9562 0.93185 0.9409 0.99866 0.98900
Prevalence 0.11417 0.3885 0.13648 0.2900 0.02100 0.04987
Detection Rate 0.06430 0.3622 0.07612 0.2507 0.01969 0.03937
Detection Prevalence 0.08661 0.4003 0.11417 0.3333 0.01969 0.04593
Balanced Accuracy 0.76902 0.9351 0.75681 0.8739 0.96875 0.89128

The overall accuracy is about 81%, which is higher than the KNN and RF models. In
terms of individual accuracies, the producer's accuracy (sensitivity) is lower than
the user's accuracy (Pos Pred Value) for the bare areas class, indicating an
underestimation. However, the producer's accuracy is higher than the user's accuracy
for the built-up class, indicating that the SVM model overestimated built-up areas.
In contrast, the cropland class has a substantially lower producer's accuracy than
user's accuracy, indicating high omission errors. For the grass/ open areas class,
the producer's accuracy is higher than the user's, indicating high commission errors.
Concerning the woodlands class, the producer's accuracy is similar to the user's
accuracy. The water class has the highest individual accuracies, indicating that the
SVM model classified water almost perfectly. While the SVM model still had difficulty
separating built-up from bare areas and cropland from grass/ open areas, there is a
slight improvement in accuracy over the KNN and RF models.
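
To compare the producer's and user's accuracies side by side, you can store the confusion matrix object and extract the relevant columns from its byClass element. A minimal sketch:

# Extract producer's (Sensitivity) and user's (Pos Pred Value) accuracies
cm_svm <- confusionMatrix(data = pred_svm, testing$Class)
round(cm_svm$byClass[, c("Sensitivity", "Pos Pred Value")], 3)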


Next, we will use the SVM model to classify the Sentinel-2 imagery and create a land
cover map using the predict() function.

# Predict land cover


timeStart<- proc.time() # measure computation time
lc_svm <- predict(rvars, svm_model)
proc.time() - timeStart # user time and system time.
## user system elapsed
## 4467.012 4844.401 925.576

We will display the land cover map using the levelplot() function (Figure 14).

# Display land cover map


levelplot(lc_svm, col.regions=c("grey","red","yellow","green", "blue", "darkgreen"))

Figure 14. Land cover classification based on post-rainy season Sentinel-2 imagery and
an SVM model

Figure 14 also shows the overestimation of built-up area and underestimation of bare
areas. There are many misclassified built-up pixels in the northwest and south parts of
the city. In addition, the SVM model misclassified some bare areas, cropland, and
grass/open areas in the south and south-western part of the city as built-up areas.

Next, calculate the land cover class statistics produced using the SVM model.


# Count of pixels per class


class_freq <- freq(lc_svm, useNA="no")
class_freq
## value count
## [1,] 1 151414
## [2,] 2 518488
## [3,] 3 772219
## [4,] 4 4730240
## [5,] 5 8106
## [6,] 6 432806

# Get resolution of land cover map


res_lc <- res(lc_svm)

# Multiply the number of pixels per class by the pixel area and convert to km2


area_km2 <- class_freq[, "count"] * prod(res_lc) * 1e-06
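
To see the unit conversion at work: each Sentinel-2 pixel here is 10 m x 10 m, so prod(res_lc) = 100 m2, and multiplying by 1e-06 converts square meters to square kilometers. For example, the 151,414 bare-area pixels correspond to 151414 x 100 x 1e-06, or about 15.14 km2, matching the table below.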

# Create a data frame for the land cover


lc_svm_map <- data.frame(landcover = c("Bare areas", "Built-up", "Cropland",
                                       "Grass/ open areas", "Water", "Woodlands"),
                         area_km2 = area_km2)

# View land cover statistics


lc_svm_map
## landcover area_km2
## 1 Bare areas 15.1414
## 2 Built-up 51.8488
## 3 Cropland 77.2219
## 4 Grass/ open areas 473.0240
## 5 Water 0.8106
## 6 Woodlands 43.2806

Finally, save the land cover map so that you can visualize it in GIS software.

# Save land cover map


writeRaster(lc_svm, filename="Gw_SVM_S2_LC_2020.img", type="raw",
datatype='INT1U', index=1, na.rm=TRUE, progress="window", overwrite=TRUE)

2.5.4 Compare the training performance of machine learning models


Next, compare the performance of all models based on the cross-validation statistics.
We are going to run the resamples() function and then check the resamps_ml object.

# Compare machine learning models


resamps_ml <- resamples(list(kknn = knnFit,
                             randomForest = rf_model,
                             kernlab = svm_model))
resamps_ml
## Call:
## resamples.default(x = list(kknn = knnFit, randomForest = rf_model, kernlab
## = svm_model))
##
## Models: kknn, randomForest, kernlab
## Number of resamples: 100
## Performance metrics: Accuracy, Kappa
## Time estimates for: everything, final model fit

Finally, check the summary statistics and then display the results in graphic form.

# Check model performance summary statistics


summary(resamps_ml)
## summary.resamples(object = resamps_ml)
## Models: kknn, randomForest, kernlab
## Number of resamples: 100
##
## Accuracy
##                   Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## kknn         0.6810345 0.7439931 0.7652174 0.7641887 0.7880764 0.8421053    0
## randomForest 0.6752137 0.7565217 0.7807018 0.7784873 0.7986842 0.8761062    0
## kernlab      0.7192982 0.7974022 0.8173913 0.8138859 0.8322271 0.8938053    0

# Display a graph that compares the model performance


bwplot(resamps_ml)

Figure 15 shows that the SVM model (kernlab) had the highest training accuracy,
followed by the RF and KNN models. However, there is only a slight difference in
training performance between the KNN and RF models.
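
Rather than eyeballing the box plots, you can also test whether the differences in resampled performance are statistically meaningful using caret's paired comparisons. A minimal sketch:

# Paired comparisons of resampled accuracy and Kappa between models
diffs_ml <- diff(resamps_ml)
summary(diffs_ml)  # p-values use a Bonferroni adjustment by default
dotplot(diffs_ml)  # visualize the pairwise differences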


Figure 15. Comparisons of training model performance

2.6 Module 6: Compute accumulated local effects (ALE)


2.6.1 Background
Remote sensing practitioners are interested in understanding the marginal effects
between land cover and predictor variables such as spectral bands, spectral indices,
texture, and elevation. Marginal effects show how a response (dependent) variable such
as land cover changes when a specific predictor (independent) variable such as a
spectral band changes. Researchers can infer marginal effects from conventional
statistical approaches such as logistic regression. In general, traditional statistical
methods for classification assume a linear relationship between the predictor and
response variables. However, real-world geographic features and geospatial data (such
as satellite imagery) are very complex. Therefore, it is unrealistic to rely on
linearity or normality assumptions when using remotely-sensed data for land cover
classification. As such, non-linear machine learning models such as SVM and RF have
been used for land cover classification since they do not require these assumptions.
Indeed, many remote sensing studies have shown that machine learning approaches
achieve better classification performance. However, machine learning models do not
have interpretable parameters (as conventional statistical models do) that describe
the effects of spectral bands on land cover classification. Although we computed
variable or feature importance for all models in the previous module, this did not
tell us how the predictor variables (Sentinel-2 bands) influenced the predicted land
cover. Variable importance only identifies the magnitude of a predictor variable's
relationship with the response relative to the other predictor variables used in the
model; it does not reveal how the Sentinel-2 bands affected the land cover or the
direction of the effects (positive or negative).

Explainable machine learning methods such as accumulated local effects (ALE) help us
understand the nature and direction of effects. ALE is a model-agnostic method (Apley
and Zhu 2020) that infers the relationship between the predicted land cover and the
spectral bands. In particular, the ALE values answer this question: conditional on a
given value, what is the relative effect of changing the predictor variable value on the
prediction? ALE averages the changes in the predictions within small intervals (a
local grid) of the predictor and accumulates them across those intervals. As a result,
the effect of a particular predictor can be evaluated at different values of that
predictor.
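
Formally, following Apley and Zhu (2020), the uncentered ALE of a predictor $x_j$ at value $x$ is estimated by partitioning the range of $x_j$ into intervals and accumulating the average prediction differences across the interval boundaries:

$$\hat{f}_{j,\mathrm{ALE}}(x) = \sum_{k=1}^{k_j(x)} \frac{1}{n_j(k)} \sum_{i:\, x_j^{(i)} \in N_j(k)} \left[ \hat{f}\big(z_{k,j},\, x_{\setminus j}^{(i)}\big) - \hat{f}\big(z_{k-1,j},\, x_{\setminus j}^{(i)}\big) \right],$$

where $N_j(k)$ is the $k$-th interval, $n_j(k)$ is the number of training observations falling in it, $z_{k,j}$ are the interval boundaries, $x_{\setminus j}^{(i)}$ are the observed values of the other predictors, and $k_j(x)$ is the interval containing $x$. The estimate is then centered by subtracting its mean, so ALE values are read relative to the average prediction.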

This guide will produce ALE plots because they are more robust to predictor collinearity
(Apley and Zhu 2020). Furthermore, ALE plots are faster to compute than partial
dependence (PD) plots (Molnar 2019; Apley and Zhu 2020). The ALE plots will help us
understand how the predictor variables (Sentinel-2 bands) influence the machine
learning models' predicted land cover (response variable). More importantly, ALE plots
can also help us gain insights into the underlying predictive probability model based on
our domain knowledge (e.g., environmental remote sensing).

2.6.2 Accumulated local effects (ALE) for all models


In this module, we will compute ALE using the "iml" package (Molnar 2020). To learn
more about ALE and the iml package, please refer to Apley and Zhu (2020) and Molnar
(2019).

First, we will create predictor objects that hold the machine learning models and the
data. The new objects (predictor_knn, predictor_rf, and predictor_svm) are created by
calling Predictor$new(). For each observation of a specific predictor such as Sentinel-2
band 2 (X-value), the ALE plot calculates the average change in the target prediction
(land cover) within a local multidimensional window.
# Create predictor objects for each machine learning model
set.seed(27)
X = training[which(names(training) != "Class")]  # predictor data without the class labels
predictor_knn = Predictor$new(knnFit, data = X, y = training$Class)
predictor_rf = Predictor$new(rf_model, data = X, y = training$Class)
predictor_svm = Predictor$new(svm_model, data = X, y = training$Class)

For simplicity, we will compute feature effects and create ALE plots for each band
separately for all the models. To do that, we will create objects prefixed "effect_"
that store the results of the feature effect computations for each model. Note that
you can also compute feature effects and create ALE plots for all bands at once, as
sketched below.
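
For reference, a minimal sketch of the all-bands variant using the iml FeatureEffects class (method = "ale" is the default; the RF predictor is shown, but any of the three predictor objects works):

# Compute and plot ALE for all nine bands at once
effects_all_rf <- FeatureEffects$new(predictor_rf, method = "ale")
plot(effects_all_rf)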


Let's start by computing feature effects and creating ALE plots for band 2 (blue band).

# Accumulated local effects (ALE)


set.seed(27) # Set seed
# Compute B2 ALE plots for all models
effect_knn2 = plot(FeatureEffect$new(predictor_knn, feature = c("B2"))) + ggtitle("KNN
model")
effect_rf2 = plot(FeatureEffect$new(predictor_rf, feature = c("B2"))) + ggtitle("RF
model")
effect_svm2 = plot(FeatureEffect$new(predictor_svm, feature = c("B2"))) + ggtitle("SVM
model")
# Display ALE plots for all models
grid.arrange(effect_knn2, effect_rf2, effect_svm2, ncol =3)

Figure 16 shows the ALE plots for the KNN, RF, and SVM models. The ALE plots are
centered at zero, representing the mean model prediction across all values of the
spectral band variable. Values below zero have a negative effect on the predicted land
cover (decreased land cover prediction), while values above zero have a positive effect
(increased land cover prediction). ALE values on the y axis represent the main feature
effect at a specific spectral reflectance value (on the x-axis) on the predicted land cover
compared to the overall prediction. For example, consider a point in Figure 16 with
an ALE value y = 0.3 at a spectral reflectance value x = 0.3. This means that at a
spectral reflectance of 0.3, the predicted probability of that land cover class is
0.3 (30 percentage points) higher than the average prediction.

Figure 16 shows that all models have initial positive ALE values associated with low
spectral reflectance and constant or negative ALE values related to high spectral
reflectance for the bare areas. In contrast, all models show that band 2 has a strong
positive effect on the prediction of the built-up class. The RF model shows an initial
steep increase followed by a flat and constant average prediction. Concerning the
cropland class, all models show L-shaped ALE plots, with high positive ALE values
associated with low spectral reflectance and low negative ALE values associated with
high spectral reflectance. For the grass/ open area class, all models have different
effects. However, the RF model shows an initial increase followed by a flat and constant
average prediction. In general, band 2 (B2) does not affect water and woodland
predictions that much. This effect contradicts the variable importance values, which
show that B2 had the highest contribution, suggesting model training problems. We
observe that the ALE effects for all land cover classes except the built-up class
peak at spectral reflectance values between 0 and 0.1, consistent with the spectral
profile (see Figure 2). However, the ALE effect of the built-up class for the SVM
model peaks at a spectral reflectance of about 0.4, which is not consistent with the
spectral profile. The ALE effect reflects an overestimation of the built-up areas due
to the high class imbalance (note that built-up areas constitute about 39% of the
training pixels).


Figure 16. Accumulated local effects (ALE) plots for all models (band 2). Plots
indicate how predictions of a given model change on average across local intervals of
the respective band. ALE values are centered around zero. Note the different vertical
axis scales across the machine learning models.

Next, let's compute feature effects and create ALE plots for band 3 (green band).

# Compute B3 ALE plots for all models


set.seed(27) # Set seed
effect_knn3 = plot(FeatureEffect$new(predictor_knn, feature = c("B3"))) + ggtitle("KNN
model")
effect_rf3 = plot(FeatureEffect$new(predictor_rf, feature = c("B3"))) + ggtitle("RF
model")
effect_svm3 = plot(FeatureEffect$new(predictor_svm, feature = c("B3"))) + ggtitle("SVM
model")
# Display ALE plots for all models
grid.arrange(effect_knn3, effect_rf3, effect_svm3, ncol =3)


Figure 17. ALE plots for all models (band 3)

The ALE plots for all models show a complex and non-linear relationship between bare
areas and band 3 (Figure 17). For the KNN and SVM models, the average prediction
rises with increasing spectral reflectance but falls again above the spectral reflectance
of 0.1. However, the RF model shows an initial increase followed by a flat and constant
average prediction. In contrast, the KNN model shows a strong positive effect on the
prediction of the built-up class. The ALE effect and plot shape are different for SVM and
RF models, indicating non-linear relationships.

Concerning the cropland class, the magnitude of the ALE effects differs between
models, but the plot shape and direction are broadly similar. For example, we observe
that high positive ALE values are associated with very low spectral reflectance, while
constant ALE values are associated with high spectral reflectance. For the grass/ open
area class, all models show an L-shaped ALE plot. The high positive ALE values are
related to low spectral reflectance, while low negative ALE values are associated with
high spectral reflectance. For the water and woodlands classes, all models fail to
capture the effects. Generally, we expect band 3 (green band) to have more impact on
the woodlands because the green band helps detect peak vegetation and assess
vegetation vigor. The small training data set for the woodlands class was probably the
leading cause of the negative ALE effect. The predicted built-up class for the KNN
model also peaks at a spectral reflectance of about 0.4, which is not consistent with
the spectral profile (see Figure 2).

Next, let's compute feature effects and create ALE plots for band 4 (red band).


# Compute B4 ALE plots for all models


set.seed(27) # Set seed
effect_knn4 = plot(FeatureEffect$new(predictor_knn, feature = c("B4"))) + ggtitle("KNN
model")
effect_rf4 = plot(FeatureEffect$new(predictor_rf, feature = c("B4"))) + ggtitle("RF
model")
effect_svm4 = plot(FeatureEffect$new(predictor_svm, feature = c("B4"))) + ggtitle("SVM
model")
# Display ALE plots for all models
grid.arrange(effect_knn4, effect_rf4, effect_svm4, ncol =3)

Figure 18. ALE plots for all models (band 4)

The ALE plots for band 4 (red band) reveal a strong negative effect on bare areas
predictions for the KNN and SVM models (Figure 18). However, the ALE plot for the RF
model shows a small, essentially null effect, suggesting model training problems. In contrast,
the ALE plots for the KNN and SVM models show a strong and positive association with
built-up area prediction. The average prediction of built-up areas increases with spectral
reflectance, which is the expected behavior. Generally, the red band helps discriminate
between artificial objects and vegetation.

On the contrary, band 4 has a negative effect on built-up predictions for the RF
model, suggesting serious model training problems. Concerning the cropland class,
although all models show high ALE variability, the effects decrease and flatten with
increasing spectral reflectance. For the grass/ open area class, all models show an
initial steep increase in average prediction, followed by a decrease. For the water
class, all models fail to capture the effects. However, the L-shaped ALE plots for the
woodlands class suggest poor prediction for all models.

Next, let's compute feature effects and create ALE plots for band 5 (vegetation red
edge 1 band).

# Compute B5 ALE plots for all models


set.seed(27) # Set seed
effect_knn5 = plot(FeatureEffect$new(predictor_knn, feature = c("B5"))) + ggtitle("KNN
model")
effect_rf5 = plot(FeatureEffect$new(predictor_rf, feature = c("B5"))) + ggtitle("RF
model")
effect_svm5 = plot(FeatureEffect$new(predictor_svm, feature = c("B5"))) + ggtitle("SVM
model")

# Display ALE plots for all models


grid.arrange(effect_knn5, effect_rf5, effect_svm5, ncol =3)

Figure 19. ALE plots for all models (band 5)

The ALE plots for the KNN and SVM models show a non-linear effect on bare areas
predictions (Figure 19). Interestingly, band 5 had no predictive effect on bare areas
for the RF model. In contrast, the ALE plots reveal a positive impact on built-up
areas for the KNN and SVM models, whereas the RF model shows no effect on the built-up
class. Concerning the cropland class, all models have variable effects. Nonetheless,
the ALE effects peak
at a spectral reflectance between 0.1 and 0.2. The ALE effect is consistent with the
spectral profile (see Figure 2) because reflectance in the red-edge band is sensitive to
vegetation status. For the grass/ open area class, all models show an initial steep
increase in average prediction. However, this is followed by a substantial decrease in
average prediction with an increase in spectral reflectance. As was observed before, all
models fail to capture the effects for the water class. Concerning the woodlands class,
the ALE plots exhibit a steep decrease in average prediction, characterized by an L-
shaped plot.

Next, let's compute feature effects and create ALE plots for band 7 (vegetation red
edge 3 band).

# Compute B7 ALE plots for all models


set.seed(27) # Set seed
effect_knn7 = plot(FeatureEffect$new(predictor_knn, feature = c("B7"))) + ggtitle("KNN
model")
effect_rf7 = plot(FeatureEffect$new(predictor_rf, feature = c("B7"))) + ggtitle("RF
model")
effect_svm7 = plot(FeatureEffect$new(predictor_svm, feature = c("B7"))) + ggtitle("SVM
model")
# Display ALE plots for all models
grid.arrange(effect_knn7, effect_rf7, effect_svm7, ncol =3)

Figure 20. ALE plots for all models (band 7)


The ALE plots for band 7 (vegetation red edge 3 band) show a positive effect on bare
areas predictions for the KNN model (Figure 20). In contrast, we observe a negative
effect on bare areas predictions for the RF and SVM models. All models reveal variable
effects on built-up predictions. However, we notice that the average prediction rises
with increasing spectral reflectance but then falls for the KNN model and flattens for
the RF model. On the other hand, the SVM model shows an initial decrease followed by a
sharp increase in average prediction for built-up areas. For the KNN model, we observe
a linear and negative effect on cropland predictions. While we discern non-linear
positive and negative effects on cropland predictions for the SVM model, the RF model
shows a positive effect that saturates at a spectral reflectance of 0.2.

Concerning the grass/ open area class, all models show complex and non-linear effects.
However, the ALE plot shape and patterns are similar for the KNN and SVM models.
The RF model fails to capture the effects for the water class, while the KNN and SVM
models show a decreasing and negative effect with increasing spectral reflectance.
For the woodlands class, all models exhibit different ALE plots that do not provide
meaningful information.

Next, let's compute feature effects and create ALE plots for band 8 (NIR).

# Compute B8 ALE plots for all models


set.seed(27) # Set seed
effect_knn8 = plot(FeatureEffect$new(predictor_knn, feature = c("B8"))) + ggtitle("KNN
model")
effect_rf8 = plot(FeatureEffect$new(predictor_rf, feature = c("B8"))) + ggtitle("RF
model")
effect_svm8 = plot(FeatureEffect$new(predictor_svm, feature = c("B8"))) + ggtitle("SVM
model")
# Display ALE plots for all models
grid.arrange(effect_knn8, effect_rf8, effect_svm8, ncol =3)

The ALE plots for band 8 (NIR) exhibit a positive non-linear effect on bare areas
predictions for all models (Figure 21). However, the ALE plots reveal a negative effect
on built-up predictions for all models. Concerning the cropland class, all models show
high ALE variability. However, we see positive effects that peak at a spectral reflectance
of 0.2 for the RF and SVM models. The ALE effect is consistent with the spectral
reflectance of the NIR during the post-rainy season (see Figure 2). All models have
positive ALE values for the grass/ open areas at a spectral reflectance of 0.2.
Nonetheless, we see constant and decreasing ALE values associated with high spectral
reflectance. For water, the ALE plots do not provide helpful information for the KNN and
RF models. However, water predictive probability drops rapidly with increasing spectral
reflectance and then levels off for the SVM model. Contrary to expectations, the ALE
plots for all models fail to capture the effects on woodlands. The ALE effect is
inconsistent with the spectral profile (Figure 2) and environmental remote sensing
principles because spectral reflectance is high for vegetation in the NIR band. While
there might be many reasons for the failure to capture the effects on woodlands, the
small number of training sample areas for this class might be the leading cause.

Figure 21. ALE plots for all models (band 8)

Next, let's compute feature effects and create ALE plots for band 11 (SWIR 1).

# Compute B11 ALE plots for all models


set.seed(27) # Set seed
effect_knn11 = plot(FeatureEffect$new(predictor_knn, feature = c("B11"))) +
ggtitle("KNN model")
effect_rf11 = plot(FeatureEffect$new(predictor_rf, feature = c("B11"))) + ggtitle("RF
model")
effect_svm11 = plot(FeatureEffect$new(predictor_svm, feature = c("B11"))) +
ggtitle("SVM model")
# Display ALE plots for all models
grid.arrange(effect_knn11, effect_rf11, effect_svm11, ncol =3)


Figure 22. ALE plots for all models (band 11)

Figure 22 shows that the average prediction of bare areas rises with increasing spectral
reflectance for the KNN and SVM models. However, the RF model shows a flat and
sinusoidal effect, indicating that the SWIR1 band does not affect the average prediction
of bare areas. In contrast, the ALE plots reveal a negative effect on built-up predictions
for all models. Concerning the cropland class, we observe a positive effect on
cropland predictions for the KNN and RF models. However, the cropland predictive
probability for the SVM model increases initially and then drops rapidly with
increasing spectral reflectance. We expect this behavior since crops are still healthy
in April and then mature for harvest during June. Generally, reflectance in the SWIR 1
band is sensitive to vegetation moisture content and is used for drought analysis,
suggesting that the KNN and RF models failed to capture the effect for cropland. For
all models, the grass/ open area predictive probability increases at low spectral
reflectance and then decreases.
Concerning water, the KNN and RF models fail to capture the effects. However, water
predictive probability drops rapidly with increasing spectral reflectance and then levels
off for the SVM model. For the woodlands, all models capture different effects.
However, all the models have high ALE peaks at low spectral reflectance, consistent
with the spectral profile (Figure 2).

Finally, let's compute feature effects and create ALE plots for band 12 (SWIR 2).

# Compute B12 ALE plots for all models


set.seed(27) # Set seed


effect_knn12 = plot(FeatureEffect$new(predictor_knn, feature = c("B12"))) +
ggtitle("KNN model")
effect_rf12 = plot(FeatureEffect$new(predictor_rf, feature = c("B12"))) + ggtitle("RF
model")
effect_svm12 = plot(FeatureEffect$new(predictor_svm, feature = c("B12"))) +
ggtitle("SVM model")
# Display ALE plots for all models
grid.arrange(effect_knn12, effect_rf12, effect_svm12, ncol =3)

Figure 23. ALE plots for all models (band 12)

Figure 23 shows that the ALE plots for band 12 (SWIR 2) exhibit a non-linear effect on
the prediction of bare areas. However, for the SVM model the effect decreases and
turns negative, which is consistent with the spectral profile (Figure 2). Similarly,
we observe non-linear effects on the built-up predictions for the KNN and RF models.
However, the ALE plots reveal a strong positive effect on built-up predictions for the
SVM model, contrary to the spectral reflectance profile (see Figure 2). Concerning the
cropland class, a positive effect on cropland predictions is observed for the KNN and
RF models. However, the cropland predictive probability for the SVM model decreases
with increasing spectral reflectance, which agrees with the spectral profile (Figure
2); the KNN and RF models failed to capture this effect for cropland. For all models,
the grass/ open area predictive probability increases at low spectral reflectance and
then decreases at higher reflectance, which is also expected during the post-rainy
season. Concerning water, all models fail to capture the effects. For the woodlands,
we see a negative effect that flattens out gradually for all models.

2.6.3 Summary
The ALE results revealed that the SVM model has the highest average prediction,
followed by the KNN model, while the RF model has the lowest average prediction. In
general, the ALE plots highlighted essential insights. First, most ALE plots revealed
non-monotonic and non-linear associations between Sentinel-2 bands and the predicted
land cover. Second, all models exhibited positive ALE values associated with low
spectral reflectance and constant or negative ALE values associated with high spectral
reflectance. Therefore, low spectral reflectance values are associated with a
higher-than-average prediction, while high spectral reflectance values are associated
with a lower-than-average prediction. Third, the results revealed that ALE effects for
bare areas, cropland, and grass/open areas peaked at spectral reflectance values
between 0 and 0.1, consistent with the spectral profile (see Figure 2). In contrast,
some ALE effects peaked at a spectral reflectance of about 0.4 for the SVM and KNN
models, suggesting an overestimation of the built-up areas.

Fourth, ALE plots highlighted the inherent problem of inadequate and imbalanced
training data. Class imbalance is a common problem (especially in remote sensing) with
profound implications for tuning and training machine learning models. In most cases,
inadequate and imbalanced training data is the leading cause of biased and inaccurate
machine learning models. In this guide, the class imbalance led to overestimation of
built-up areas and underestimation of water and woodland areas. Therefore, we should
not interpret the constant ALE values for the woodlands class as evidence of the
absence of an effect (strangely, the NIR band did not affect the prediction of
woodlands). Instead, this results from a small training data set for the woodlands
class, which fails to capture the spectral variability of the woodlands. Therefore,
remote sensing researchers should invest more effort in collecting training data,
especially for small land cover classes. In addition, remote sensing researchers can
use upsampling or the synthetic minority oversampling technique (SMOTE) to minimize
the class imbalance problem, as sketched below.
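
As a minimal sketch of how subsampling can be requested within the caret workflow used in this guide, the sampling argument of trainControl() resamples the training data inside each cross-validation fold. Note that "up" and "down" are the simplest options for a multi-class problem, and caret's "smote" option depends on an additional package:

# Up-sample minority classes within each resampling iteration
ctrl_up <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                        sampling = "up")  # "down" and "smote" are alternatives
svm_up <- train(Class ~ ., data = training, method = "svmRadial",
                preProcess = c("center", "scale"), trControl = ctrl_up)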

Fifth, the ALE plots revealed that all the models had some uncertainty due to spectral
confusion between cropland and grass/open areas (see the spectral profile in Figure
2). For example, the ALE effects are very weak for some classes such as cropland and
bare areas, especially for the RF model, which had high class errors for bare areas
and cropland. The ALE effects suggest that the RF model was affected by overfitting
and multicollinearity. Interestingly, the ALE plots indicated that the simple KNN
model performed better than the more advanced RF model. Therefore, it is always
worthwhile to start with simple machine learning models rather than reaching straight
for shiny and advanced ones. Last but not least, we should interpret the ALE plots
with caution because most of the Sentinel-2 bands were highly correlated. Although ALE
plots account for predictor collinearity, they can still be misleading for highly
correlated spectral bands. Generally, ALE isolates the small contributions of
individual spectral bands. In addition, spatial autocorrelation influences pixel-based
image classification, which can be problematic for land cover classification.
Therefore, we need further research on the effect of collinearity on ALE, variable
interaction effects, and spatial autocorrelation.


Chapter 3. Concluding Remarks


Remotely sensed satellite data is critical for producing land cover maps at a regional
scale. To date, many moderate- to high-spatial-resolution satellite datasets (10-50
m), such as the Landsat series and Sentinel-2, have relatively good global coverage
and are available free of charge. Furthermore, advances in machine learning and deep
learning methods have also increased land cover mapping and monitoring studies. While
machine learning has been used successfully for land cover classification, significant
challenges remain. For example, limited or poor training data (errors or
uncertainties), imbalanced data, and overfitting affect most machine learning methods.
In addition, machine learning models are complex, and it is therefore difficult to
understand their decision-making processes. For instance, it is difficult to explain
the whole forest in an RF model (compared to a single decision tree) or to understand
high dimensionality and data transformations in SVM models. Furthermore, the
complexity of machine learning models inhibits their explainability since variable
interactions and marginal effects are not understood. Therefore, post-hoc
model-agnostic explanation approaches are helpful for gaining insights into machine
learning models.

The ALE results revealed that model-agnostic approaches provide more insights than
model-specific ones. Furthermore, the ALE results highlighted machine learning model
limitations, which gives more opportunities for further improvement. Although the results
and discussions in this guide are far too limited to form the basis for firm conclusions,
they warrant further research, especially for land cover classification. However, this
guide only provided a basic workflow for building general machine learning models and
explainable machine learning models using ALE from the iml package. Therefore, there
is a need to explore other methods such as individual conditional expectation (ICE)
plots, Shapley values, and local interpretable model-agnostic explanations (LIME). In
addition, there are packages such as DALEX, auditor, ExplainPrediction, fastshap, and
lime in R or Python. Therefore, learning these methods and packages is crucial for
explaining machine learning models for land cover classification.
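
As a starting point, here is a minimal sketch of Shapley values and a LIME-style local surrogate using the iml package, reusing the predictor objects created in Module 6 (the sample index 1 is arbitrary):

# Explain a single training sample's prediction with Shapley values
shap_one <- Shapley$new(predictor_rf, x.interest = X[1, ])
plot(shap_one)

# LIME-style local surrogate model (iml's LocalModel)
lime_one <- LocalModel$new(predictor_rf, x.interest = X[1, ], k = 3)
plot(lime_one)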

References
Apley DW, Zhu J (2020) Visualizing the effects of predictor variables in black box
supervised learning models. J. Roy. Stat. Soc. B 82 (4), 1059-1086.
https://doi.org/10.1111/rssb.12377
Anchang JY, Prihodko L, Ji W, Kumar SS, Ross CW, Yu Q, Lind B, Sarr MA, Diouf AA
and Hanan NP (2020) Toward Operational Mapping of Woody Canopy Cover in
Tropical Savannas Using Google Earth Engine. Front. Environ. Sci. 8:4. DOI:
10.3389/fenvs.2020.00004
Biecek, P and Burzykowski, T (2020) Explanatory Model Analysis: Explore, Explain, and
Examine Predictive Models. With examples in R and Python.
https://ema.drwhy.ai/preface.html
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin
classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning
Theory. ACM, pp. 144-152. Available online: http://dl.acm.org/citation.cfm?id=130401
(accessed 22 February 2014)


Breiman L (2001) Random forests. Mach. Learn 45: 5-32


Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20 (3): 273-297
Friedman JH, Popescu BE (2008) Predictive learning via rule ensembles. The Annals of
Applied Statistics 2 (3), 916-954
Hsu CW, Chang CC, Lin CJ (2010) A Practical Guide to Support Vector Classification,
Department of Computer Science, National Taiwan University, Taipei, Taiwan, p.16,
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
Kamusoko C, Kamusoko OW, Chikati E, Gamba J (2021) Mapping Urban and
Peri-Urban Land Cover in Zimbabwe: Challenges and Opportunities. Geomatics 1,
114-147. https://doi.org/10.3390/geomatics1010009
Kamusoko C (2019) Remote Sensing Image Classification in R. Springer
Kuhn M (2008) Building predictive models in R using the caret package. J. Stat. Softw
28(5): 1-26
Kuhn M, Johnson K (2016) Applied Predictive Modeling. Springer
Lu D, Weng Q (2007) A survey of image classification methods and techniques for
improving classification performance. Int. J. Remote Sens: 28, 823-870
Mather PM (1999) Computer Processing of Remotely-Sensed Images-An Introduction.
New York, John Wiley & Sons
Mellor A, Haywood A, Stone C, Jones S (2013) The performance of random forests in
an operational setting for large area sclerophyll forest classification. Remote Sens 5
(6): 2838-56
Miao X, Heaton JS, Zheng S, Charlet DA, Liu H (2012) Applying tree-based ensemble
algorithms to the classification of ecological zones using multi-temporal multi-source
remote-sensing data. Int. J. Remote Sensing 33(6): 1823-49
Molnar C, Casalicchio G, Bischl B (2018) iml: An R package for Interpretable Machine
Learning. The Journal of Open Source Software 3(26), 786.
doi:10.21105/joss.00786
Molnar C (2019) Interpretable Machine Learning - A Guide for Making Black Box
Models Explainable. https://christophm.github.io/interpretable-ml-book/ Accessed 20
July 2021
Molnar C (2020) Package ‘iml’. https://cran.r-project.org/web/packages/iml/iml.pdf
Nemmour H, Chibani Y (2006) Multiple support vector machines for land cover change
detection: An application for mapping urban extensions. ISPRS J. Photogramm.
Remote Sensing 61: 125-133
Pal M, Mather PM (2005) Support vector machines for classification in remote sensing.
Int. J. Remote Sens 26(5): 1007-11
Rodriguez-Galiano VF, Chica-Olmo M, Abarca-Hernandez F, Atkinson PM, Jeganathan
C (2012) Random Forest classification of Mediterranean land cover using multi-
seasonal imagery and multi-seasonal texture. Remote Sens. Environ 121: 93-107
Roscher R, Bohn B, Duarte MF, Garcke J (2019) Explainable machine learning for
scientific insights and discoveries. arXiv:1905.08883
Strobl C, Boulesteix AL, Kneib T et al. (2008) Conditional variable importance for
random forests. BMC Bioinformatics 9, 307.
https://doi.org/10.1186/1471-2105-9-307
Zhang R, Ma J (2008) An improved SVM method P-SVM for classification of remotely-
sensed data. Int. J. Remote Sens 29: 6029-6036


Appendix
Appendix 1. Resources
Several educational and training resources are available to learn about remote sensing,
general machine learning, R, and explainable and interpretable machine learning.

Websites for remote sensing in R


1. Spatial data science (GIS and Remote sensing)
https://keen-swartz-3146c4.netlify.app/

2. Rtips. Revival 2014!


http://pj.freefaculty.org/R/Rtips.html

3. Online R resources for Beginners


http://www.introductoryr.co.uk/R_Resources_for_Beginners.html

4. About Quick-R
https://www.statmethods.net/

Websites with machine learning, explainable and interpretable machine learning,
remote sensing, and GIS
1. https://christophm.github.io/interpretable-ml-book/
2. https://ema.drwhy.ai//
3. https://dalex.drwhy.ai/
4. https://bradleyboehmke.github.io/HOML/
5. https://mlr3book.mlr-org.com/

Appendix 2. Land cover classification scheme


Land Cover            Description
Built-up (BU)         Residential, commercial, services, industrial, transportation,
                      communication, utilities, and construction sites.
Bare areas (BA)       Bare, sparsely vegetated areas with >60% soil background.
                      Includes sand and gravel mining pits and rock outcrops.
Cropland (Cr)         Cultivated land or cropland under preparation, fallow cropland,
                      and cropland under irrigation.
Woodland (Wd)         Woodlands, riverine vegetation, shrub, and bush.
Grass/open areas (Gr) Grass cover, open grass areas, golf courses, and parks.
Water (Wt)            Rivers, reservoirs, and lakes.


About The Author


Courage Kamusoko is an independent geospatial consultant based in Japan. His
expertise includes remote sensing image processing and classification, land use/cover
change modeling, geospatial machine learning, and the design and implementation of
geospatial database management systems. In addition to his focus on geospatial
research and consultancy, he has dedicated time to teaching practical machine learning
for geospatial analysis and modeling.

Recent Books:
(1) Kamusoko, C. (in press). Optical and SAR Remote Sensing of Urban Areas: A
Practical Guide. Springer.
(2) Kamusoko, C. (2019). Remote Sensing Image Classification in R. Springer.
