MASTER'S THESIS

Detecting Spatial Information from Satellite Imagery using Deep Learning for Semantic Segmentation

Author: Mohamed Othman (H00383877)
Supervisor: Dr. Abrar Ullah

August 2023
Declaration of Authorship
I, Mohamed Othman, declare that this thesis titled, 'Detecting Spatial Information from
Satellite Imagery using Deep Learning for Semantic Segmentation', and the work
presented in it are my own. I confirm that this work submitted for assessment is my own
and is expressed in my own words. Any use made within it of the works of other authors
in any form (e.g., ideas, equations, figures, text, tables, programs) is properly
acknowledged at the point of its use. A list of the references employed is included.
“The world is changed by your example not by your opinion.”
Paulo Coelho
Abstract
Detecting spatial information from satellite imagery using deep learning for semantic
segmentation is a rapidly growing field, owing to its importance in applications such as
the automated generation of vector maps, urban planning, and geographic information
systems. In this research, the use of deep learning for the semantic segmentation of
spatial information from satellite imagery is explored. The objective is to devise an
efficient and precise method for detecting and categorizing diverse features on the
Earth's surface, including road networks, building footprints, water bodies, vegetation,
and land cover, which can be used in automatic map production. The proposed technique
entails training a deep convolutional neural network to detect spatial features from a
small dataset of satellite imagery, followed by a segmentation process to classify the
various spatial features. This study conducts various experiments on satellite imagery
and achieves accuracy rates that outperform traditional image processing techniques. In
addition, this project compares models such as the U-shaped network architecture (U-Net)
and a modified U-Net (Inception ResNetV2U-Net) across various spatial features. Both
implemented models achieved higher results than those reported in other relevant
research. Although the Inception ResNetV2U-Net model produced slightly better results
than U-Net, with a validation accuracy of 87.5% and a validation coefficient of 87%, the
U-Net model also achieved a high validation accuracy and coefficient of 86.5% and 84%,
respectively. Additionally, the U-Net model exhibited significantly better training and
validation loss than the Inception ResNetV2U-Net, and showed a shorter average prediction
time for satellite imagery. Therefore, the U-Net model proves more suitable for detecting
spatial information from small satellite datasets.
Acknowledgements
I would like to show my sincere gratitude to my supervisor Dr. Abrar Ullah for his
invaluable guidance, time, and support. I am also grateful to my family, friends, and
colleagues for their support and encouragement during the entire duration of this
research. Their unwavering support has motivated me to push my limits and strive for
success.
Contents
4.2 Proposed Methodology ......................................................................................... 26
4.3 Evaluation .............................................................................................................. 28
4.3.1 Average pixel-wise intersection over union (mIoU) .......................................................... 28
4.3.2 Global accuracy ........................................................................................................... 29
4.3.3 Loss Functions ............................................................................................................. 30
4.4 Technologies and Tools.......................................................................................... 31
5 Implementation ..................................................................................................................... 32
5.1 Data Processing and Preparation........................................................................... 32
5.2 Semantic Segmentation Models ............................................................................ 33
5.2.1 U-Net Model Architecture .......................................................................................... 34
5.2.2 Inception ResNetV2U-Net Model Architecture .......................................................... 35
5.3 Training and Validation using Various Experiments .............................................. 37
6 Evaluation and Results........................................................................................................... 38
6.1 Experiments Evaluation and Results ...................................................................... 38
6.1.1 U-Net Model Experiments .......................................................................................... 38
6.1.1.1 U-Net Experiment #1...................................................................................................... 38
6.1.1.2 U-Net Experiment #2...................................................................................................... 39
6.1.1.3 U-Net Experiment #3...................................................................................................... 40
6.1.1.4 U-Net Experiment #4...................................................................................................... 41
6.1.1.5 U-Net Experiment #5...................................................................................................... 42
6.1.2 Inception ResNetV2U-Net Model Experiments .......................................................... 45
6.1.2.1 ResNetV2U-Net Experiment #1 ...................................................................... 45
6.1.2.2 ResNetV2U-Net Experiment #2 ...................................................................................... 46
6.1.2.3 ResNetV2U-Net Experiment #3 ...................................................................................... 47
6.1.2.4 ResNetV2U-Net Experiment #4 ...................................................................................... 47
6.1.2.5 ResNetV2U-Net Experiment #5 ...................................................................................... 49
6.2 Comparison between U-Net Model and ResNetV2U-Net Results ......................... 50
6.2.1 Quantitative Evaluation .............................................................................................. 51
6.2.2 Qualitative Evaluation................................................................................................. 52
6.2.2.1 Semantic Segmentation of Buildings ............................................................................. 52
6.2.2.2 Semantic Segmentation of Roads .................................................................................. 53
6.2.2.3 Semantic Segmentation of Water .................................................................................. 53
6.2.2.4 Semantic Segmentation of Lands................................................................................... 54
6.2.2.5 Semantic Segmentation of Vegetation .......................................................................... 55
6.3 Baseline Comparisons with Existing Relevant Research ........................................ 56
6.4 Comparison with Research Objectives .................................................................. 57
7 Conclusion and Future Work ................................................................................................. 58
7.1 Conclusion.............................................................................................................. 58
7.2 Future Work ........................................................................................................... 58
References ..................................................................................................................................... 60
Appendices .................................................................................................................................... 65
Appendix A: Project Plan ........................................................................................................... 65
Project Schedule ......................................................................................................................... 65
Risk Management ....................................................................................................................... 66
Appendix B: Professional, Legal, Ethical, and Social Issues ....................................................... 68
Professional issue ....................................................................................................................... 68
Legal issue................................................................................................................................... 68
Ethical Issues............................................................................................................................... 68
Social issues ................................................................................................................................ 68
List of Figures
FIGURE 4.1: LABELED CLASSES IN VARIOUS COLORS FROM THE SATELLITE IMAGE DATASET [13] .............................................25
FIGURE 4.2: SAMPLE OF SATELLITE IMAGE AND CORRESPONDING MASK FROM THE DATASET [13] ..........................................25
FIGURE 4.3: THE LIFE CYCLE MODEL OF CRISP-DM ......................................................................................................26
List of Tables
Abbreviations
ML Machine Learning
DL Deep Learning
CV Computer Vision
RS Remote Sensing
NN Neural Network
OSM OpenStreetMap
Symbols
∑ The summation symbol: adding up a series of values.
α Hyperparameter controlling the balancing strength of the loss.
γ Hyperparameter controlling the focusing strength of the loss.
log The logarithm function.
Dedicated to all the people I love.
Chapter 1
1 Introduction
1.1 Overview
A vast number of satellite images are captured by the many satellites orbiting our
planet. These images provide essential, up-to-date information on various spatial
features such as road networks, land use, land cover, agriculture, and other global
environmental changes. This data is used in a significant number of applications in
different fields such as urban planning, environmental monitoring, geographic information
systems, and agriculture [1].
Automating map production by detecting and extracting spatial information from satellite
imagery using artificial intelligence significantly reduces the time and cost of making
base maps. For instance, satellite imaging is extensively used in mapping roads, and
mapping or updating the entire road network of our planet requires a significant amount
of manual work. That is why automatically extracting roads from satellite images is
crucial for keeping maps up to date. Optimizing the automated extraction of road
networks is therefore important, since roads are a required layer in many mapping
activities such as navigation, route planning, fleet management, traffic management,
geographic information systems, and autonomous driving [38].
Machine learning (ML) and computer vision (CV) can be used to automate the extraction of
informative spatial data from satellite imagery [2]. Satellite imagery analysis and
remote sensing (RS) can benefit from advances in CV, and this intersection of RS and CV
is attracting considerable interest from researchers in both fields [3].
In addition, deep learning is increasingly applied to satellite images in many studies to
extract more insightful information [4]. In particular, semantic segmentation, a process
that aims to attach a class label to every pixel, produces an image with highlighted
objects. This process is used to identify and detect different object classes in
satellite images such as buildings, roads, land cover, and water bodies [5].
Many challenges remain for semantic segmentation models applied to satellite imagery to
detect high-resolution spatial information. For instance, using a small labeled dataset
of satellite imagery instead of a large one makes it hard for a model to detect spatial
information with high accuracy, since machine learning models generally need large
amounts of data to learn properly. Moreover, comparing various semantic segmentation
models applied to a small dataset of satellite imagery has yet to be investigated
further.
1.2 Motivations
Detecting spatial information from satellite imagery is essential for automating digital map
production, change detection, and other geographic information systems applications.
Deep learning for semantic segmentation can be used to automate the extraction of
geospatial data from satellite images, making it useful for many applications. For
instance, automating the detection and extraction of spatial information from satellite
imagery will reduce the time and cost of creating digital base maps manually. This topic
is challenging due to the large number of different classes involved in satellite images,
which makes classification difficult.
This thesis aims to develop an effective deep learning solution for semantic segmentation
to detect spatial information such as building footprints, land cover, roads, vegetation,
and water bodies from a small dataset of satellite imagery. More specifically, the
objectives of this research are as follows:
• Apply deep learning models for semantic segmentation such as U-Net and
Modified U-Net on a small dataset of labeled satellite imagery to detect spatial
information.
Chapter 2
2 Background and Literature Review
2.1 Background
Satellite images are an essential source of information for automating map production,
Geographic Information Systems (GIS), agriculture, and urban planning. Satellite imagery
contains more structured and uniform spatial data compared to traditional images. Tasks
such as road extraction, building footprint detection, and land cover classification are
based on semantic segmentation models [25]. Data extracted from satellite imagery is used
in a significant number of applications in different domains such as urban planning,
environmental monitoring, geographic information systems, fleet management, and many
more [1].
Many approaches exist for detecting and extracting spatial data from satellite imagery,
ranging from classical computer vision algorithms to deep learning models for semantic
segmentation. Many studies have discussed various approaches and models for detecting
spatial information such as roads and building footprints from satellite imagery; these
are discussed in more detail in the related work section of this chapter.
Furthermore, employing a small labeled dataset of satellite imagery rather than a large
one makes it difficult for a model to detect spatial information with high precision.
Many studies have applied semantic segmentation models to detect various spatial
information such as roads, building footprints, or vegetation, but comparing various
semantic segmentation models on small labeled satellite datasets has yet to be
investigated further. Before discussing the related work in detail with critical
analysis, it is important to define and clarify some of the common terms used in computer
vision, deep learning, and semantic segmentation in the following sections.
2.2.1 Overview
Figure 2.1 shows how deep learning is a specific type of representation learning. Deep
learning itself falls under the umbrella of machine learning, which is used for many
approaches to AI [26].
Spatial data refers to data that is associated with a particular location on the Earth's surface,
such as satellite imagery, maps, and geospatial data.
Machine learning can be applied to spatial data detection from satellite imagery. The use of
machine learning algorithms in this context can help automate the process of identifying and
analyzing patterns in satellite images, such as identifying land cover types, detecting changes
in land use over time, and monitoring natural disasters or environmental changes. By feeding
large amounts of satellite imagery data into machine learning algorithms, the algorithms can
learn to recognize patterns and features in the images. This process can be used to develop
predictive models that can help identify areas at risk of natural disasters or environmental
hazards, monitor the health of ecosystems, or support urban planning and development
[32].
One example of the use of machine learning in spatial data detection from satellite imagery
is in the field of precision agriculture. By using machine learning algorithms to analyze
satellite images of crop fields, farmers can identify areas that require water, fertilizer, or
other inputs, which can help optimize crop yields and reduce waste. Overall, machine
learning has the potential to revolutionize the way we analyze and use spatial data from
satellite imagery to support a wide range of applications and industries [37].
Machine learning can be used to automate spatial data extraction from satellite images.
However, there are many satellite images with many labels, which makes it very hard to
use conventional machine learning to extract spatial data. In addition, this data is
high-dimensional and difficult to process [2].
2.3.1 Overview
Deep learning is a machine learning technique that has been inspired by our understanding
of the human brain, as well as statistics and applied mathematics. Its development has
spanned several decades, and in recent years, it has gained significant popularity and
practicality. The increasing power of computers, larger datasets, and advances in training
techniques for deeper networks have contributed to this growth. There are both challenges
and opportunities to enhance deep learning further and explore new frontiers. [26].
Figure 2.2 shows how the different components of an AI system relate to each other
across different approaches, with the shaded portions indicating which parts learn from
data [26].
Deep learning, a contemporary method for supervised learning, offers a potent framework.
By incorporating more layers and more units within each layer, a deep network can
represent progressively complex functions. This approach is adept at tasks that involve
mapping an input vector to an output vector, which humans can perform quickly and
effortlessly. Nevertheless, it requires large models and extensive labeled training data.
Tasks that cannot be described as mapping one vector to another, or that demand
substantial human deliberation and reasoning, cannot presently be accomplished through
deep learning [26].
One application of deep learning in spatial data analysis is in the field of remote sensing. By
using deep learning algorithms to analyze satellite imagery, it is possible to identify and
classify various features on the Earth's surface, such as land cover types, vegetation density,
and urban areas. This information can be used to monitor changes in the environment over
time, such as deforestation or urbanization, and to inform decision-making in fields such as
urban planning, agriculture, and environmental conservation [33].
Neurons in a neural network are arranged in layers, each layer comprising numerous
neurons that fulfill a specific function. The input layer accepts the data for processing, while
the output layer generates the final outcome of the network. The layers between the input
and output layers are referred to as hidden layers, and they perform intermediate
computations to transform the input data into the intended output. During training, a neural
network modifies the weights and biases of its neurons to minimize the discrepancy
between the predicted output and the desired output. This process is termed
backpropagation, which entails passing the error from the output layer through the hidden
layers to adjust the weights and biases of the neurons [26].
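The forward pass and backpropagation update described above can be sketched with plain NumPy. This is an illustrative toy network, not one of the thesis models: the layer sizes, learning rate, and XOR-style data are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR-like mapping from 2-D inputs to a 1-D target.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# Weights and biases: input -> hidden layer -> output layer.
W1, b1 = rng.normal(0.0, 1.0, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0.0, 1.0, (8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)        # hidden-layer activations
    return h, sigmoid(h @ W2 + b2)  # network prediction

_, p0 = forward(X)
loss_before = float(np.mean((p0 - y) ** 2))

lr = 1.0
for _ in range(5000):
    h, p = forward(X)
    # Backpropagation: pass the output error back through the
    # hidden layer (chain rule) to get per-layer gradients.
    d_out = (p - y) * p * (1 - p)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # Adjust weights and biases to reduce the discrepancy.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)

_, p1 = forward(X)
loss_after = float(np.mean((p1 - y) ** 2))  # should be well below loss_before
```

With these assumed settings, the mean squared error after training drops substantially below its initial value, illustrating how weight adjustment minimizes the discrepancy between predicted and desired outputs.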
The utilization of artificial neural networks (ANNs) has gained significant popularity as a
means of analyzing remotely sensed data such as satellite and aerial images. Considerable
advancements have been achieved in image classification through the application of neural
networks. [35].
The convolutional neural network (CNN) is a deep learning method used for image
processing, particularly for computed tomography, X-ray images, and magnetic resonance
imaging. It consists of convolutional, pooling, and fully connected layers. Filters (kernels)
slide over preprocessed signals in the convolutional layer, generating a feature map. The
pooling layer reduces dimensionality to prevent overfitting and reduce computational load.
The final layer utilizes activation functions to introduce nonlinearity to the outputs. [30].
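As a concrete illustration of the pooling layer's dimensionality reduction, the following is a minimal 2x2 max-pooling sketch in NumPy; the feature-map values are made up for the example.

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling, stride 2: keep the largest activation in each
    non-overlapping 2x2 window, halving the height and width."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]  # trim odd edges, if any
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 6.],
                 [2., 2., 7., 8.]])
pooled = max_pool2x2(fmap)   # 4x4 feature map -> 2x2
```

Each output value is the maximum of one 2x2 window, so the map shrinks from 4x4 to 2x2 while the strongest activations are preserved.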
Deep Learning methods such as Fully convolutional networks (FCN) are used increasingly for
addressing semantic segmentation problems. FCN replaces the fully connected layers found
at the end of the classification networks with convolutional layers to output a spatial
segmentation map [7].
Convolutional networks have played a crucial role in the development of deep learning by
applying brain-inspired insights. They achieved impressive performance before the potential
of deep models was fully realized. In addition to that, convolutional networks were pioneers
in commercial applications of neural networks and continue to lead in the practical
implementation of deep learning. [26].
A convolutional neural network (CNN) is a form of neural network (NN) that forms the core
of deep learning. Deep learning has given impressive results in many areas, but CNNs
still have drawbacks in practice. For instance, a considerable volume of manually
annotated data is still necessary to train CNN models [2].
Convolutional Neural Networks (CNNs) have proven to be highly effective in processing data
with grid-like structures, such as time series and image data. Their specialized use of
convolution operations sets them apart from traditional neural networks that rely on general
matrix multiplication. CNNs have demonstrated remarkable success in practical applications
and continue to be a prominent choice for various tasks in the field of deep learning. [28].
Convolutional networks specialize the neural network framework for data with a
grid-structured topology, and they can be scaled to large sizes. This approach has been
particularly successful for two-dimensional image data. For processing one-dimensional
sequential data, however, another specialized form of the neural network framework is
used, known as the recurrent neural network [26].
The feature map produced by the convolutional layer can be written as

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)

where I is the 2D image input, K is the 2D kernel or filter bank, and S is the feature
map output [2].
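The convolution of the image I with the kernel K can be sketched directly in NumPy. This is an illustrative implementation of the 'valid' case; the sample image and kernel values are made up.

```python
import numpy as np

def conv2d_valid(I, K):
    """2-D convolution producing the feature map S ('valid' region only).
    Implemented as cross-correlation with the kernel flipped in both
    axes, which is the textbook definition of convolution."""
    Kf = K[::-1, ::-1]
    kh, kw = Kf.shape
    H, W = I.shape
    S = np.empty((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # Weighted sum of the image patch under the flipped kernel.
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * Kf)
    return S

I = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
K = np.array([[1., 0.],
              [0., -1.]])
S = conv2d_valid(I, K)   # 2x2 feature map
```

Note that most deep learning libraries actually implement the closely related cross-correlation (no kernel flip); the two operations differ only by that flip.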
Figure 2. 3: Structure of a Fully Connected Network (all connections are weighted) [26]
Figure 2.3 shows the structure of a fully connected network: an input layer, hidden
layers, and an output layer.
Figure 2.4 shows a fully connected layer, the most common layer type, where all the
nodes in adjacent layers are fully connected and there are no interconnections between
neurons within the same layer [7].
Fully Convolutional Networks replace the fully connected layers at the end of classification
networks with convolutional layers to output a spatial segmentation map. Since this
segmentation map has a lower resolution as compared to the original image, it is up-sampled
to produce the final segmentation output. In order to recover finer details that are lost
during the down-sampling phase, the outputs from some of the earlier down-sampling layers
are added to the output as shown in Figure 2.4 [7,2].
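The up-sampling and skip-addition step described above can be sketched as follows; nearest-neighbour up-sampling and made-up score maps stand in for the learned layers of a real FCN.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour up-sampling: repeat each value twice along
    both spatial axes, doubling the map's resolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# Coarse segmentation scores from the deepest (lowest-resolution) layer.
coarse = np.array([[0.2, 0.8],
                   [0.6, 0.4]])
# A finer score map taken from an earlier down-sampling layer.
fine = np.arange(16).reshape(4, 4) / 16.0

# FCN-style fusion: up-sample the coarse map, then add the earlier
# map to recover detail lost during down-sampling.
fused = upsample2x(coarse) + fine
```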
Computer vision is a field of artificial intelligence that enables computer systems to
extract information from images or videos. Semantic segmentation partitions an image
into segments by assigning a semantic label to each pixel, making it possible to detect
objects and identify their boundaries in the image [2].
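Assigning a semantic label to each pixel can be illustrated with per-class score maps and an argmax over the class axis; the class names and score values here are hypothetical.

```python
import numpy as np

# Per-class score maps for a tiny 2x2 image with three hypothetical
# classes: 0 = background, 1 = road, 2 = building.
scores = np.array([
    [[0.1, 0.7], [0.2, 0.1]],   # background scores
    [[0.8, 0.2], [0.1, 0.2]],   # road scores
    [[0.1, 0.1], [0.7, 0.7]],   # building scores
])                               # shape: (classes, height, width)

# Semantic segmentation output: the highest-scoring class per pixel.
labels = scores.argmax(axis=0)
```

Every pixel receives exactly one label, so the 2x2 output is a dense class map rather than a set of bounding boxes.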
Deep learning has exhibited remarkable accuracy in computer vision tasks and holds
tremendous potential for efficiently processing vast amounts of earth observation satellite
image data in automated workflows. [24].
The field of computer vision has been a thriving research area for deep learning applications,
primarily due to the inherent complexity of vision, which humans and animals effortlessly
perform but poses significant challenges for computers [27]. Computer vision is a highly
expansive discipline that encompasses diverse image-processing techniques and a multitude
of applications. Its scope spans from emulating human visual capabilities, like facial
recognition, to pioneering novel visual abilities. Common benchmark tasks for evaluating
deep learning algorithms in computer vision include optical character recognition and object
recognition. [26].
Semantic segmentation is one of the greatest challenges in the history of computer vision
because it requires the algorithm not only to detect objects in an image but also to
precisely segment them into their individual parts. Unlike object detection, where the
goal is to identify the location of an object in an image, semantic segmentation requires
pixel-level labeling of every object in the image. Furthermore, the need for high
precision and accuracy makes semantic segmentation particularly challenging: even small
errors in the segmentation of an object can have significant consequences in downstream
applications such as autonomous driving, where a misclassified object could result in a
collision [6].
2.4.1 Overview
As with other deep learning models, the accuracy of semantic segmentation models
depends on various factors, including the quality of the training dataset, the
architecture and parameters of the CNN model, and the complexity of the scene being
analyzed. Additionally, satellite imagery can present challenges such as atmospheric
interference or differences in lighting conditions, which may require preprocessing or
specialized techniques to address [36].
The structure shown in Figure 2.5 maps an input x to an output, referred to as the
reconstruction r, using an internal representation or code h. It comprises two
components: the encoder f (which maps x to h) and the decoder g (which maps h to r) [26].
In the U-Net architecture figure, the lowest resolution is 32x32 pixels. Each blue box
corresponds to a multi-channel feature map, with the number of channels displayed on top
and the x-y size indicated at the lower left of each box. White boxes represent copied
feature maps, while the arrows depict the different operations involved [8].
A network with U-shaped architecture (U-Net) proposed in [18], is a widely used deep
learning model for semantic segmentation, featuring encoder and decoder blocks connected
by skip connections, akin to fully convolutional networks. Initially developed for medical
image segmentation, U-Net has found success in satellite imagery segmentation with
impressive outcomes. [8].
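The encoder-decoder structure with skip connections can be sketched at the level of feature-map arrays; the map sizes and the use of nearest-neighbour up-sampling are assumptions for illustration, not the thesis configuration.

```python
import numpy as np

# An encoder feature map: height x width x channels.
enc = np.random.default_rng(1).random((8, 8, 16))

# Contracting path: 2x2 max pooling halves the spatial resolution.
pooled = enc.reshape(4, 2, 4, 2, 16).max(axis=(1, 3))   # (4, 4, 16)

# Expanding path: nearest-neighbour up-sampling restores resolution
# (a real U-Net uses learned transposed convolutions here).
decoded = pooled.repeat(2, axis=0).repeat(2, axis=1)    # (8, 8, 16)

# Skip connection: concatenate the saved encoder map along the channel
# axis, so the decoder sees coarse context and fine spatial detail.
merged = np.concatenate([decoded, enc], axis=-1)        # (8, 8, 32)
```

The channel concatenation is the defining U-Net idea: decoder layers receive both the up-sampled context and the high-resolution encoder features lost during pooling.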
Detecting spatial information from satellite imagery using deep learning for semantic
segmentation is a rapidly growing research area with tremendous potential for
applications in fields such as automated map production, geographic information systems,
urban planning, agriculture, and environmental monitoring. Deep learning models for
semantic segmentation are used to detect and classify spatial objects, for example
extracting roads to update GIS map datasets. Other common tasks in this field are
extracting land cover and building footprints using different deep learning models for
semantic segmentation.
However, the effectiveness of deep learning models for the semantic segmentation of
satellite imagery is subject to several challenges and limitations that require careful
consideration and analysis. One challenge is applying semantic segmentation models to a
small labeled satellite imagery dataset rather than a large dataset, which takes time to
label manually. Moreover, further investigation and experiments with various models to
detect different spatial information on a small dataset remain to be explored. The
effectiveness of a model also depends on the kind of spatial feature being detected from
satellite imagery, such as roads, building footprints, land cover, vegetation, and water
bodies.
In similar research [11] on the semantic segmentation of satellite images using a
modified U-Net, the authors proposed a solution for automatic area segmentation that
shows high accuracy for six classes: building, land, road, vegetation, water, and
miscellaneous. The baseline U-Net model is enhanced by incorporating the Inception
ResNet V2 model in its encoder, resulting in increased mathematical and structural
complexity. The performance of this modified model is assessed using the Dice
coefficient and pixel accuracy, yielding values of 82 percent and 87 percent
respectively. However, no comparison is made between their proposed model and other
existing models for semantic segmentation.
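The Dice coefficient reported in such studies can be computed for binary masks as follows; the small example masks are made up for illustration.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice coefficient for binary masks: 2|A ∩ B| / (|A| + |B|).
    `eps` guards against division by zero for empty masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred   = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_coefficient(pred, target)   # 2*2 / (3 + 3) ≈ 0.667
```

Unlike plain pixel accuracy, the Dice coefficient ignores true negatives, which makes it better suited to masks where the class of interest covers few pixels.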
In another similar research [12] on semantic segmentation of aerial images using the U-Net
architecture, the authors highlight that the conventional semantic segmentation process
fails to provide accurate details due to the low resolution of aerial images. To address this
issue, they propose an algorithm based on the U-Net architecture. The U-Net model
comprises two distinct paths, namely the compression path (encoder) and the symmetric
expanding path (decoder). The encoder is made up of a stack of convolutional and maximal
pooling layers and is responsible for capturing the image's context. On the other hand, the
decoder utilizes transposed convolutions to enable precise localization and dense
prediction. Unlike traditional models, U-Net is an end-to-end fully convolutional
network: it contains no dense layers and can therefore handle images of any size. The
effectiveness of the proposed U-Net model was evaluated by comparing its accuracy with
that of previous methods using dense prediction to enhance an image; the prediction of
pixels in the border region was accurate and fast [12]. However, there is no comparison
between the proposed algorithms showing which one gives better results for detecting
each spatial feature.
One of the most common tasks in this domain is detecting roads. In this paper [14], a novel
deep convolutional neural network was introduced for road extraction from high-resolution
remote sensing imagery. The model utilized a U-Net architecture and incorporated
DenseNet as the feature extractor, resulting in improved accuracy of road network
extraction by capturing both local and global road information. Comparative evaluations
were conducted against state-of-the-art semantic segmentation methods, including FCN,
U-Net, and DeepLab V3+. The experimental results demonstrated the proposed network's
accurate and effective road extraction capabilities, surpassing other machine learning
and deep learning approaches in terms of precision, recall, and harmonic mean. However,
the authors did not compare their results with other models or with previous research
using both small and large labeled image datasets.
Another common task in this field is extracting land cover. In this study [15], the authors used the U-Net architecture for image segmentation with different encoders. Their results demonstrate an accuracy of 82.2% and improved precision in extracting buildings within high-density residential areas. Another notable finding is that the Unet-ResNet50 model surpasses the standard U-Net model in data comprehension. Nonetheless, comparing the proposed model with other models, such as the modified U-Net, is not explored in their research.
In this research on extracting building footprints [16], the authors introduced a U-Net-based approach for semantic segmentation to accurately extract building footprints from satellite images. The U-Net model was enhanced by integrating multiple strategies, including data augmentation, data processing techniques, and the integration of GIS map data with satellite images. The proposed method achieved a notable F1-score of 0.704, surpassing the top three solutions in the SpaceNet Building Detection Competition by 1.1% to 12.5%. Moreover, it outperformed the standard U-Net-based method by 3.0% to 9.2%. Although remarkable results were achieved, the authors still did not compare their results with other semantic segmentation models.
This research [19] compares U-Net, Modified U-Net, and Dense-Attention Network (DAN) models for building extraction from TripleSat imagery. The
authors evaluated the performance of these architectures in Kuala Lumpur, Malaysia using
0.8m resolution TripleSat imagery. The modified U-Net achieved the highest accuracy with
an average F1-score of approximately 82.45%, surpassing both DAN with an F1-score of
80.68% and U-Net with an F1-score of 79.82%. Notably, DAN exhibited superior accuracy in
predicting larger buildings, while the modified U-Net demonstrated exceptional precision
for small and medium-sized buildings. These findings highlight the modified U-Net as the
most effective for building extraction, with the DAN excelling in accuracy for larger buildings
in 0.8m resolution TripleSat imagery.
Another comparison, conducted in this research [21], evaluates various Convolutional Neural Network (CNN) architectures for satellite image segmentation. The authors employed CNNs
to detect geo-objects on satellite images obtained from DSTL, Landsat-8, and PlanetScope
databases. They experimented with three modified versions of CNN architectures to
implement the recognition algorithm. The efficiency of the developed algorithms was
evaluated using aerial photos from the DSTL database. The results revealed that utilizing
complex CNN architectures led to improved segmentation quality for satellite images.
Another important task in this domain, for farming, is detecting vegetation and trees in high-resolution satellite images. In this research [22], the authors used the U-Net
convolutional neural network architecture and concluded that U-Net offers a means to
leverage the visual strength of texture and local spatial structure in high-resolution satellite
imagery. The research has demonstrated its capability to accurately identify the presence or
absence of trees and large shrubs across expansive landscape areas, achieving an accuracy
of approximately 90%.
There are also related studies on water body extraction, such as this research [20], in which the authors introduce a novel approach for segmenting water surfaces from
satellite images using convolutional neural networks. They investigate the application of a
U-Net model and a transfer knowledge-based model. Two different deep-learning methods
are compared for water body segmentation. The first approach explores variations of the U-
Net architecture, while the second approach utilizes distillation to enhance the U-Net
response when training images are limited. The findings reveal that the overall performance
of both models is similar.
The satellite imagery research domain has not made as much progress as other domains due
to the scarcity of large-scale labeled datasets. However, a notable study in this field utilized
OpenStreetMap (OSM) data that is publicly available to train segmentation models for aerial
imagery [9]. Although OSM data is annotated by volunteers over different images and may
not be as precise as manually labeled data for training, the study demonstrated that a
considerable amount of weakly labeled training data could compensate for the lack of high-
quality training data [2]. Therefore, achieving high, satisfactory results when detecting spatial data from a small labeled dataset is crucial, and this is part of what is investigated in this research.
2.6 Conclusion
Most of the previous studies implemented in this domain of detecting spatial information
from satellite imagery proposed a U-Net model for semantic segmentation of satellite
images to detect and extract spatial data such as roads, land cover, or building footprints.
Although the similar studies [11, 12] above used the U-Net and Modified U-Net architectures with the same dataset that is used in this research, neither compares the two proposed models to determine which gives better results for detecting each spatial feature [13].
In this research, various models such as U-Net and modified U-Net are investigated, with further customization of the models' architecture to better detect various spatial features from satellite imagery. This paper also investigates different methods and experiments for the best classification and segmentation of spatial features in a small labeled dataset of satellite imagery. Moreover, a comparison of U-Net and Modified U-Net is conducted to show which one works better for which kind of spatial feature.
Chapter 3
3 Requirements Analysis
The outcome of this research is to detect spatial information with high accuracy from a small,
labeled dataset of satellite imagery using semantic segmentation. Although this is a
research-oriented project, the requirements are divided into functional and non-functional.
The functional requirements of this research project are shown in Table 3.1 with MoSCoW requirements analysis (Must, Should, Could, or Won't).
Priorities are also considered for these requirements as high for detecting spatial
information and for applying the U-Net model. Medium priorities are stated for the other
required functions.
The above Table 3.1 shows the functional requirements of the research project with priorities (High, Medium, or Low) and MoSCoW categories (Must, Should, Could, or Won't).
The non-functional requirements for this research project, such as performance and accuracy, are shown in Table 3.2 with MoSCoW requirements analysis, which stands for Must, Should, Could, and Won't, as described in detail in the previous section.
Priority is also considered for these requirements: high for detecting highly accurate results with high F1-scores, and medium for the model running fast with high performance.
The above Table 3.2 shows the non-functional requirements of this research project with priorities (High, Medium, or Low) and MoSCoW (Must, Should, Could, or Won't).
The evaluation metrics described in section 4.3 will be used to evaluate the first requirement stated in Table 3.2 above: measuring the accuracy of the proposed model in detecting spatial information from satellite imagery with highly accurate results and high pixel-wise accuracy.
Chapter 4
4 Methodology
This chapter of the research describes how the aim and objectives will be obtained. It starts
with a section describing the dataset that will be used, followed by the proposed
methodology and approach. The third section is the evaluation metrics by which results are
evaluated against other findings. Lastly, there is a section for technologies and tools that will
be used in this research project.
4.1 Dataset
The dataset utilized in this research was annotated as part of a collaborative project with
the Mohammed Bin Rashid Space Center in Dubai, UAE. It comprises aerial imagery of Dubai
captured by MBRSC satellites and has been annotated with pixel-wise semantic segmentation across six classes [13].
The dataset includes 72 images grouped into 8 larger tiles. Each satellite tile has a corresponding mask image with color labels for the landmarks. The list of landmarks defines the classes the model can classify in the satellite images. Each satellite tile is further divided into 2×2, 3×3, or 4×4 grids of images, and sometimes 1×3 or 4×5 grids.
Figure 4. 1: Labeled classes in various colors from the satellite image dataset [13]
Figure 4. 2: Sample of satellite image and corresponding mask from the dataset [13]
Overall, the Dubai Semantic Segmentation dataset is a valuable resource for researchers and
practitioners in the computer vision community. Its high level of detail and accuracy make it
a useful tool for a wide range of applications.
In this research project, the Cross-Industry Standard Process for Data Mining (CRISP-DM) is
adopted as a structured approach to guide the process. It provides a framework that outlines
the typical phases, tasks, and interdependencies involved in the project.
The above Figure 4.3 shows that CRISP-DM encompasses six phases, with arrows denoting the critical and commonly observed dependencies between them. The sequence of these phases is not rigid, as projects often iterate and move back and forth between phases based on requirements [31].
The research methodology for this study follows the Waterfall Methodology, wherein each
phase of the research is completed sequentially, with clear and defined steps from initiation
to completion.
Semantic segmentation models such as U-Net and Modified U-Net (Inception ResNetV2U-
Net) models are proposed to be applied to the dataset mentioned in the previous section to
detect the various classes of spatial data. A comparison between both models is going to be
investigated in this research. The methodology considers and investigates the challenge of
this small, labeled dataset to obtain high accuracy without overfitting the training data.
The proposed methodology for this research project involves the following steps:
4. Semantic segmentation:
• Apply the trained model to the test set to perform semantic segmentation on the
satellite imagery.
• Detect the spatial information of interest from the segmented results such as roads,
building footprints, vegetation, land cover, and water bodies.
• Evaluate the accuracy and reliability of the spatial information extraction.
4.3 Evaluation
The requirements of this research project mentioned in Table 3.1 and Table 3.2 are evaluated using pixel-wise intersection over union and pixel accuracy to assess the performance of the semantic segmentation models.
The following evaluation metrics, including the average pixel-wise intersection over union and global accuracy, were used in this research project to assess the performance of the semantic segmentation models:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-Score = (2 × Precision × Recall) / (Precision + Recall)

IoU = TP / (TP + FP + FN)
The Jaccard Index or Intersection over Union (IoU) serves as a common pixel-wise metric
for evaluating segmentation outcomes. It considers true positives (TP), which are the
accurately predicted pixels for a specific class, false positives (FP), representing
incorrectly predicted pixels for a particular class, and false negatives (FN), indicating
pixels that were predicted to not belong to a specific class but actually do. Assuming
there are m images in the dataset, the average IoU can be calculated as follows [17]:
mIoU = (1/m) Σ_{i=1}^{m} IoU_i
Global accuracy = (TP + TN) / (TP + TN + FP + FN)
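The metrics above can all be computed directly from the four pixel counts; the following is a minimal sketch in plain Python (the function names are illustrative, not from the thesis code):

```python
def precision(tp, fp):
    # Fraction of predicted-positive pixels that are truly positive.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of truly positive pixels that were predicted positive.
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def iou(tp, fp, fn):
    # Jaccard index (IoU) for one class.
    return tp / (tp + fp + fn)

def global_accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def mean_iou(per_image_ious):
    # Average IoU over the m images in the dataset (mIoU).
    return sum(per_image_ious) / len(per_image_ious)
```

In practice these counts are accumulated per class over all pixels of the test set before the ratios are taken.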
Table 4.1 shows the confusion matrix, which typically has four entries (TP, FN, FP, and TN) that are used to calculate the above equation for global accuracy [39].
                          Predicted class
                          Positive    Negative
Actual class  Positive    TP          FN
              Negative    FP          TN

Table 4.1: Confusion matrix [39]
where:
True Positive (TP) represents the cases where the model predicted positive, and the
actual class was positive.
False Positive (FP) represents the cases where the model predicted positive, but the
actual class was negative.
False Negative (FN) represents the cases where the model predicted negative, but the
actual class was positive.
True Negative (TN) represents the cases where the model predicted negative, and the
actual class was negative.
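For a multi-class segmentation map, the four entries of Table 4.1 can be derived per class by treating that class as "positive" and all others as "negative". A small illustrative sketch (names are mine, not the thesis code):

```python
import numpy as np

def confusion_counts(pred, truth, cls):
    # pred, truth: integer label maps of equal shape; cls: the class
    # treated as "positive" for this one-vs-rest confusion matrix.
    pred_pos = (pred == cls)      # pixels predicted as this class
    true_pos = (truth == cls)     # pixels actually of this class
    tp = int(np.sum(pred_pos & true_pos))
    fp = int(np.sum(pred_pos & ~true_pos))
    fn = int(np.sum(~pred_pos & true_pos))
    tn = int(np.sum(~pred_pos & ~true_pos))
    return tp, fp, fn, tn

pred  = np.array([[0, 1], [1, 2]])   # toy 2x2 predicted label map
truth = np.array([[0, 1], [2, 2]])   # toy 2x2 ground-truth label map
tp, fp, fn, tn = confusion_counts(pred, truth, 1)
# For class 1: tp=1, fp=1, fn=0, tn=2
```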
The following are the loss functions that are used to measure how well the model is
performing and to adjust the model's hyper-parameters to achieve highly accurate and
reliable segmentation results.
2. Focal Loss:
The Focal Loss addresses the issue of class imbalance in semantic segmentation tasks.
It down-weights easy-to-classify pixels and reduces the influence of background
pixels that dominate the training. For a single pixel in the semantic segmentation
task, the equation is as the following [41]:
FL(p, q) = - Σ( α * (1 - p_i)^γ * q_i * log(p_i) )
3. Total Loss:
Total loss can be calculated using the equation below:
total_loss = dice_loss + (1 × focal_loss)
Dice loss, which contributes to the total loss, uses the Dice coefficient (F1 score) to measure the similarity between the predicted segmentation mask and the ground truth mask [41]:

Dice Loss = 1 − (2 × Σ_{i=1}^{N} p_i g_i + ε) / (Σ_{i=1}^{N} p_i + Σ_{i=1}^{N} g_i + ε)
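A hedged NumPy sketch of the three loss functions above follows. The thesis uses Keras/TensorFlow implementations, so the function names and the alpha/gamma defaults here are illustrative assumptions, not the thesis code:

```python
import numpy as np

def dice_loss(p, g, eps=1e-6):
    # p: predicted probabilities, g: binary ground truth (flattened arrays).
    intersection = np.sum(p * g)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(p) + np.sum(g) + eps)

def focal_loss(p, q, alpha=0.25, gamma=2.0):
    # Down-weights easy-to-classify pixels via the (1 - p)^gamma factor.
    p = np.clip(p, 1e-7, 1.0 - 1e-7)   # avoid log(0)
    return -np.sum(alpha * (1.0 - p) ** gamma * q * np.log(p))

def total_loss(p, g):
    # total_loss = dice_loss + (1 * focal_loss), as in the text.
    return dice_loss(p, g) + 1.0 * focal_loss(p, g)
```

A perfect prediction drives both terms towards zero, while mismatched masks push the Dice term towards one.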
The technologies and tools used for this research project are as follows:
• Interactive Development Environment: Google Colab Pro is used with a GPU and high RAM to write and run code, visualize data, and document work in a single notebook.
• Deep Learning Framework: To implement the deep learning model for semantic
segmentation, popular frameworks such as TensorFlow and Keras are used.
• Geographic Information Systems (GIS) software: QGIS and ArcMap are used for
preprocessing and visualizing satellite imagery.
Chapter 5
5 Implementation
In this research, two deep learning models are applied to a small dataset of satellite imagery
to evaluate their results throughout various experiments implemented with each model.
Before training the models, data processing and preparation techniques are applied to the
images of the dataset to prepare them for the models. At the end of the implementation, the best-performing model is proposed, which can then be used to detect spatial features from satellite images.
The source code has been implemented in Python within the Google Colab environment, using machine learning libraries such as TensorFlow and Keras with GPU acceleration rather than CPU. The GPU was chosen because the applied deep learning models are complex and require significant computational power; running the computations on the GPU leads to faster training times and improved efficiency.
In this research, a small dataset of Dubai satellite imagery was used, annotated with pixel-wise semantic segmentation across six classes, as described briefly in section 4.1. The dataset contains 72 images grouped into 8 tiles. Each satellite tile has a corresponding mask image with color labels for the 6 spatial features. Each tile is further divided into 2×2, 3×3, or 4×4 grids of images, and sometimes 1×3 or 4×5 grids [13].
Various methods and techniques have been used to process and prepare the tile images of the dataset, as described in the following steps:
• The images have been split using the Patchify library into small patches of a given patch size; the patches can later be merged back into the full satellite image. Dividing the image into smaller patches is important since it reduces memory consumption, facilitates the handling of large images, and preserves spatial information near boundaries.
• The images have been normalized using MinMaxScaler to set the pixel values within the range [0, 1]. This standardization ensures stability for the semantic segmentation models used in this research by giving all pixels equal influence. Moreover, it makes the data compatible with loss functions and activation functions that operate on the range [0, 1].
• All tile and mask images are processed to have sizes that are multiples of the patch size. Images are split into patches, and each patch is converted into a NumPy array. Each image patch is processed individually, with normalization and removal of the extra unnecessary dimension.
• Due to the diverse range of image sizes, consisting of both large and small images, the
image processing approach involved cropping the images to the nearest size divisible
by 256.
• Subsequently, all images were further subdivided into patches with dimensions of
256x256x3.
• The hex colors of the spatial features were converted to RGB, and the RGB values were then mapped to labels from 0 to 5. Therefore, each spatial feature in the masks is labeled from 0 to 5.
• Finally, the dataset is split into 80% training data and 20% testing data. The random state is set to 100 to ensure reproducible and consistent data splitting across different runs and experiments, making it easier to compare and validate results properly.
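The preparation steps above can be sketched end to end as follows. The thesis uses the Patchify library and sklearn's MinMaxScaler; both are emulated here with plain NumPy for a self-contained illustration, and all function names are mine:

```python
import numpy as np

PATCH = 256

def crop_to_multiple(img, size=PATCH):
    # Crop height and width down to the nearest multiple of the patch size.
    h = (img.shape[0] // size) * size
    w = (img.shape[1] // size) * size
    return img[:h, :w]

def split_into_patches(img, size=PATCH):
    # Split an H x W x 3 tile into non-overlapping size x size x 3 patches.
    rows, cols = img.shape[0] // size, img.shape[1] // size
    return [img[r*size:(r+1)*size, c*size:(c+1)*size]
            for r in range(rows) for c in range(cols)]

def minmax_normalize(patch):
    # Scale pixel values into [0, 1], as MinMaxScaler does.
    p = patch.astype(np.float64)
    return (p - p.min()) / (p.max() - p.min())

def rgb_to_label(mask, palette):
    # Map each class colour (R, G, B) in the mask to a label 0..5.
    labels = np.zeros(mask.shape[:2], dtype=np.uint8)
    for idx, colour in enumerate(palette):
        labels[np.all(mask == colour, axis=-1)] = idx
    return labels

tile = np.random.randint(0, 256, size=(600, 520, 3))
patches = [minmax_normalize(p)
           for p in split_into_patches(crop_to_multiple(tile))]
# A 600x520 tile crops to 512x512, giving a 2x2 grid of 256x256x3 patches.
```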
In this research, U-Net Model, and Inception ResNetV2U-Net model were applied to the
dataset to detect the spatial classes from the dataset.
The structure of the U-Net model is explained in section 2.4.2, with its architecture shown in Figure 2.6. U-Net is a U-shaped structure with an encoder path to capture features and a decoder
path to produce the final segmentation map. Skip connections enable the model to retain
spatial information and handle objects at different scales. All convolutional layers use the
ReLU activation, while the final layer uses the Softmax activation function to convert the model's output
into probability scores for each class. Through training, the model adjusts these probabilities
to align with the ground truth segmentation masks, enabling accurate pixel-level
classification [8].
The first decoder block (u6) upsamples the deepest encoder feature map and concatenates it with the corresponding feature map from the encoder path (c4) using the
concatenate function. The concatenated feature map is subsequently subjected to two 3x3
convolutional layers with ReLU activation (c6). This process is iteratively carried out for
subsequent blocks (u7 to u9) using the feature maps from the encoder path (c3, c2, c1) until
the original spatial dimensions are restored [8].
4. Skip Connections:
The model incorporates skip connections by concatenating feature maps from the encoder
path with their corresponding upsampled feature maps in the decoder path. These skip
connections facilitate the retention of fine-grained spatial information from earlier layers,
contributing to precise segmentation.
5. Output:
The final layer consists of a 1x1 convolutional layer with softmax activation, responsible for
generating pixel-wise probability maps for each class. The number of classes is determined
by the 'n_classes' parameter specified during model creation, offering adaptability for
various segmentation tasks.
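The decoder step described above, upsampling a deep feature map and concatenating it with the matching encoder map (e.g., c4), can be sketched at the shape level as follows. This is a hedged NumPy illustration: nearest-neighbour upsampling stands in for the model's transposed convolutions, and the tensor sizes are hypothetical:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of an H x W x C feature map
    # (a stand-in for the 2x2 transposed convolution in the real model).
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decoder_step(deep, skip):
    up = upsample2x(deep)                        # restore spatial resolution
    return np.concatenate([up, skip], axis=-1)   # skip connection (concat)

deep = np.zeros((16, 16, 256))   # deeper feature map (hypothetical sizes)
c4   = np.zeros((32, 32, 128))   # matching encoder feature map
u6   = decoder_step(deep, c4)
# u6 has shape (32, 32, 384): the spatial size of c4, channels of both maps
```

This shape arithmetic is why each encoder feature map must spatially match its upsampled counterpart before concatenation.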
The second model applied to the dataset is a modified version of the U-Net model: the Inception ResNetV2U-Net. This architecture integrates the Inception ResNetV2 model into the U-Net design as the contracting path. To address degradation and reduce
training time, it incorporates multiple-sized convolutions with residual connections within
the Inception ResNetV2 block. It also combines the Residual connection and Inception
frameworks. The architecture includes encoder-to-decoder linkages and feature map
concatenation to enhance localization information. Rather than relying on a single high-
dimensional convolutional filter, the model employs multiple blocks with lower levels of
convolution to retain the data's dimensionality. The expansion process involves increasing
the size of feature maps through up-sampling, followed by convolutions and rectified linear
functions. The architecture is finalized by incorporating a fully connected layer for
categorization, resulting in the complete Modified U-Net model. The main objective of this approach is to merge the benefits of both the Inception ResNetV2 and U-Net architectures.
2. Encoder:
The model employs the pre-trained InceptionResNetV2 as the encoder, which consists
of multiple layers to extract high-level features from the input image. Additionally, four
intermediate feature maps (s1, s2, s3, and s4) are obtained from specific layers of the
InceptionResNetV2 model.
3. Bridge Connection:
To facilitate information flow between the encoder and decoder, a bridge feature map
(b1) is derived from a particular layer in the InceptionResNetV2 model.
4. Decoder:
The decoder comprises four blocks: Decoder Block 1, Decoder Block 2, Decoder Block 3,
and Decoder Block 4. Each block takes the bridge feature map (b1) and various encoder
feature maps (s4, s3, s2, s1). The bridge feature map is upsampled and concatenated
with the respective encoder feature map before being processed through a conv_block
function to refine features and capture additional context.
5. Output:
The output from Decoder Block 4 undergoes a dropout layer to mitigate overfitting.
Following that, a 1x1 2D convolutional layer with six output channels is applied. The
softmax activation function is used on this layer to generate a six-channel segmentation
map, representing the probability of each pixel belonging to one of the six classes [40].
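The behaviour of this output layer can be sketched minimally: a channel-wise softmax turns the six raw scores per pixel into class probabilities. This is an illustration with random logits, not the thesis code:

```python
import numpy as np

def softmax_channels(logits):
    # logits: H x W x n_classes raw scores from the final 1x1 convolution.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

logits = np.random.randn(4, 4, 6)    # hypothetical 4x4 map, 6 classes
probs = softmax_channels(logits)     # per-pixel probabilities, sum to 1
labels = probs.argmax(axis=-1)       # per-pixel class labels 0..5
```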
In this research, 10 experiments were implemented on both the U-Net model and Inception
ResNetV2U-Net model by tuning various hyperparameters with different loss functions and
evaluation metrics. The models’ hyperparameters are optimized to achieve better results
through each experiment.
The first experiment for each model was implemented using the initial hyperparameters optimizer=Adam, loss=total_loss, batch_size=8, verbose=1, epochs=10, and evaluation=jaccard_coef, then tuning the hyperparameters in each experiment until the 5th experiment, which gave better results, as described in more detail in section 6.1. The learning rate is fixed for all experiments at the Adam optimizer's default of 0.001. The dataset is split into 80% training data and 20% testing data, since the dataset is small, containing only 72 images.
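The reproducible 80/20 split with a fixed random state can be sketched as follows. The thesis presumably uses sklearn's train_test_split; this NumPy version shows the same idea, and its function name and details are illustrative:

```python
import numpy as np

def split_dataset(items, test_frac=0.2, seed=100):
    # A fixed seed yields the same shuffle, hence the same split, every run.
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(items))
    n_test = int(len(items) * test_frac)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

train, test = split_dataset(list(range(72)))
# 72 images -> 58 training and 14 testing items, identical across runs
```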
The models’ results were evaluated by using various evaluation metrics such as Dice
Coefficient, average pixel-wise intersection over union (mIoU) called also Jaccard
Coefficient, Cross-Entropy, total loss, and accuracy which are described in detail in section
4.3. The results with the quantitive and qualitative evaluation of these experiments for each
model are discussed with their results in detail in section 6.1. In addition to that, a detailed
comparison with the spatial analysis for the semantic segmentation is conducted between
both models to explore the best model in section 6.2.
Chapter 6
6 Evaluation and Results
Moreover, a discussion is held on the comparison between the final proposed models that achieved better results. Additionally, a baseline comparison is made between this research's results and existing related research papers, as well as against the research objectives.
For this model, 5 experiments are conducted to train the model in a systematic methodology to reach better results for detecting spatial information from this small dataset. In each experiment, various hyperparameters are optimized, up to the 5th experiment, which led to better results. The learning rate used is the Adam optimizer's default of 0.001 for all experiments, with verbose=1.
In this experiment, the following initial parameters are used to train the U-Net Model. The
following table shows the parameters used to train the U-Net Model with its evaluation
results extracted from epoch number 10.
Chapter 6. Evaluation and Results 39
Hyperparameters: loss=total_loss, batch_size=8, epochs=10, evaluation=jaccard_coef
Loss: 0.9332   Accuracy: 0.8103   Coefficient: 0.6223   Val_loss: 0.9346   Val_accuracy: 0.8041   Val_coefficient: 0.6166
Table 6.1: U-Net Experiment #1 evaluation
The evaluation metrics in the table above show that the results with these initial parameters are not optimal and need to be improved by tuning the hyperparameters.
The diagrams above compare the training and validation data with respect to IoU (Jaccard coefficient), loss, and accuracy, all of which need to be optimized.
The depicted figures and diagrams above indicate inadequate segmentation of the spatial
features in terms of semantic segmentation. Therefore, in the forthcoming experiments, a
deliberate adjustment of hyperparameters will be undertaken with the aim of achieving
improved outcomes.
In this experiment, Dice Coefficient is used instead of Jaccard Coefficient to evaluate the
model results with another coefficient. The following table shows the parameters used to
train the U-Net Model with its evaluation results extracted from epoch number 10.
Hyperparameters: loss=total_loss, batch_size=8, epochs=10, evaluation=dice_coef
Loss: 0.9316   Accuracy: 0.8166   Coefficient: 0.7725   Val_loss: 0.9611   Val_accuracy: 0.7272   Val_coefficient: 0.6850
Table 6.2: U-Net Experiment #2 evaluation
The table above shows that the Dice coefficient for the training and validation data is higher than the Jaccard coefficient from experiment #1. Therefore, the Dice coefficient will be used in the next experiments.
The diagram above confirms that the Dice coefficient for the validation and training datasets is better than the Jaccard coefficient from experiment #1, and it can be improved further by tuning other hyperparameters.
In this experiment, cross-entropy loss is used instead of the total loss (the sum of focal loss and dice loss). The following table shows the parameters used to train the U-Net model, with its evaluation results extracted from epoch number 10.
Hyperparameters   Loss   Accuracy   Coefficient   Val_loss   Val_accuracy   Val_coefficient
The table above shows that the cross-entropy validation loss is around 30% lower than the total loss used in the previous experiment. This suggests that cross-entropy works better than total loss for this model, as it is preferable to reduce the loss to below 0.5. Therefore, cross-entropy will be used in the upcoming experiments.
The diagram above shows that the cross-entropy loss for the validation and training datasets is better than the total loss from experiment #2, and it can be improved further by tuning other hyperparameters. In the next experiments, cross-entropy will be used instead of total loss.
In this experiment, the number of epochs is increased from 10 to 50 and the batch size from 8 to 16. The following table shows the parameters used to train the U-Net model with its evaluation results.
Hyperparameters: loss=cross-entropy, batch_size=16, epochs=50, evaluation=dice_coef
Loss: 0.3205   Accuracy: 0.8866   Coefficient: 0.8380   Val_loss: 0.5585   Val_accuracy: 0.8366   Val_coefficient: 0.7974
Table 6.4: U-Net Experiment #4 evaluation
The table above shows that the loss, coefficient, and accuracy are much better for both training and validation when the number of epochs is increased from 10 to 50.
The diagram above shows that the evaluation metrics for the training and validation datasets improved over the previous experiment. It also shows that increasing the batch size from 8 to 16 reduces the model's capacity to generalize.
The figure above shows the semantic segmentation of sample images predicted by the U-Net model. The results can still be improved by tuning some parameters; therefore, in the next experiment, the number of epochs will be increased to 100 to try to obtain higher results with efficient segmentation for all 6 classes of spatial features.
In this final experiment, the number of epochs is increased from 50 to 100, and the batch size is decreased from 16 back to 8, since increasing the batch size reduced the model's capacity to generalize in the previous experiment. The following table shows the parameters used to train the U-Net model with its evaluation results.
The table above shows that the loss, coefficient, and accuracy for training and validation improved when the number of epochs was increased from 50 to 100. The validation accuracy reaches around 86.5% and the validation coefficient 84%, which is higher than in the previous experiments.
The diagram above shows that the loss, coefficient, and accuracy for training and validation improved over the last experiment. Although the gap between the validation loss and the training loss increased after epoch 60, the accuracy and coefficient improved for both training and validation.
The figure above shows that the semantic segmentation improved over the previous experiment: the spatial features are predicted properly, matching the mask images. Moreover, the last image in Figure 6.8 shows that the model segmented roads that are not present in the mask image. This means that the model's results are acceptable for this experiment.
The figure above shows the activation heat map for the first image in Figure 6.8, created for the final proposed U-Net model to understand the activation and gradient output for one satellite image from the validation dataset. Heat maps facilitate understanding of the regions that contribute the most to the U-Net model's decision-making when predicting the spatial classes. Hence, these heatmaps show that the U-Net model correctly segments spatial classes such as roads, vegetation, and buildings.
The U-Net model used in experiment #5 is the final proposed U-Net model, since the validation accuracy is high at 86.5% and the validation coefficient at 84%, with improved semantic segmentation results for the spatial features, as shown in Figure 6.8.
For this model, 5 experiments are conducted to train the model in a systematic methodology to reach better results for detecting spatial information from the dataset. In each experiment, various hyperparameters were tuned until the 5th experiment, which led to the best results. The learning rate used is the Adam optimizer's default of 0.001 for all experiments, with verbose=1.
In this experiment, the following initial parameters are used to train the Inception
ResNetV2U-Net model. The following table shows the parameters used to train this Model
with its evaluation results extracted from epoch number 10.
Hyperparameters: loss=total_loss, batch_size=8, epochs=10, evaluation=jaccard_coef
Loss: 0.9053   Accuracy: 0.8869   Coefficient: 0.7423   Val_loss: 0.9179   Val_accuracy: 0.8487   Val_coefficient: 0.7052
Table 6.6: ResNetV2U-Net Experiment #1 evaluation
The evaluation metrics in the table above show that the results with these initial parameters are not optimal and need to be improved by tuning the hyperparameters. The diagrams above compare training and validation in terms of IoU (Jaccard coefficient), loss, and accuracy, all of which need to be improved.
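The IoU (Jaccard coefficient) metric used in this experiment can be implemented in TensorFlow roughly as follows; the smoothing constant is an assumption, added to avoid division by zero, and is not taken from the thesis code:

```python
import tensorflow as tf

def jaccard_coef(y_true, y_pred, smooth=1.0):
    """Jaccard coefficient (intersection over union) for soft masks."""
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    union = tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) - intersection
    return (intersection + smooth) / (union + smooth)
```

Passed to `model.compile(metrics=[jaccard_coef])`, this yields the Coefficient and Val_coefficient columns reported in the tables.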
The figures and diagrams above indicate inadequate semantic segmentation of the spatial features. Therefore, in the forthcoming experiments, hyperparameters will be deliberately adjusted with the aim of achieving improved outcomes.
In this experiment, the Dice coefficient was used instead of the Jaccard coefficient to evaluate the Inception ResNetV2U-Net model with another coefficient. The table below shows the parameters used to train this model, with the evaluation results extracted from epoch number 10.
Hyperparameters: loss = total_loss, batch_size = 8, epochs = 10, evaluation = dice_coef

Loss    Accuracy  Coefficient  Val_loss  Val_accuracy  Val_coefficient
0.9041  0.8870    0.8518       0.9183    0.8507        0.8246

Table 6.7 ResNetV2U-Net Experiment #2 Evaluation
The results above show that the Dice coefficient is higher than the Jaccard coefficient from experiment #1 by around 10% for both training and validation.
The diagram above shows that the Dice coefficient for the validation and training datasets is better than the Jaccard coefficient from experiment #1. In addition, the results can be improved further by tuning other hyperparameters. Therefore, the Dice coefficient will be used instead of the Jaccard coefficient in the next experiments.
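The Dice coefficient used from this experiment onward can be sketched as follows (again with an assumed smoothing constant). Note that for the same prediction the Dice score D and Jaccard score J satisfy D = 2J/(1+J) >= J, which is consistent with the roughly 10% higher coefficient values observed here:

```python
import tensorflow as tf

def dice_coef(y_true, y_pred, smooth=1.0):
    """Dice coefficient for soft masks; `smooth` avoids division by zero."""
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)
```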
In this experiment, cross-entropy loss is used instead of the total loss, which is the sum of focal loss and dice loss. The table below shows the parameters used to train the Inception ResNetV2U-Net model, with its evaluation results.
Hyperparameters: loss = cross-entropy, batch_size = 8, epochs = 10, evaluation = dice_coef

Table 6.8 ResNetV2U-Net Experiment #3 Evaluation
The table above shows that the cross-entropy validation loss is lower than the total loss used in the previous experiment by around 50%. This means that cross-entropy works better than the total loss with this model, since the loss decreases to below 0.5. Therefore, cross-entropy will be used in the upcoming experiments.
The diagram above shows that the cross-entropy loss for the validation and training datasets is better than the total loss from experiment #2. In addition, the results can be improved further by tuning other hyperparameters. In the next experiments, cross-entropy will be used instead of the total loss.
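For reference, the total loss replaced here (focal loss plus dice loss) can be sketched as follows; the gamma value follows the focal-loss paper [41], and the smoothing/clipping constants are assumptions rather than values from the thesis code:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    # 1 - Dice coefficient, so that a perfect overlap gives zero loss.
    intersection = tf.reduce_sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    # Down-weights easy examples; gamma = 2 follows Lin et al. [41].
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    return -tf.reduce_mean(
        y_true * tf.pow(1.0 - y_pred, gamma) * tf.math.log(y_pred))

def total_loss(y_true, y_pred):
    # The combined loss used in experiments #1 and #2 above.
    return focal_loss(y_true, y_pred) + dice_loss(y_true, y_pred)
```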
In this experiment, the number of epochs is increased from 10 to 50 and the batch size from 8 to 16; note that increasing the batch size can reduce a model's capacity to generalize. The table below shows the parameters used to train this model, with the evaluation results extracted from the final epoch.
Hyperparameters: loss = cross-entropy, batch_size = 16, epochs = 50, evaluation = dice_coef

Loss    Accuracy  Coefficient  Val_loss  Val_accuracy  Val_coefficient
0.0721  0.9720    0.9594       0.6438    0.8784        0.8725

Table 6.9 ResNetV2U-Net Experiment #4 Evaluation
The table above shows that the loss, coefficient, and accuracy for both training and validation are much better when the number of epochs is increased from 10 to 50, compared with the previous experiment.
Figure 6.13 shows that the evaluation metrics for the training and validation datasets improved over the previous experiment. It also suggests that increasing the batch size from 8 to 16 can reduce the model's capacity to generalize.
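The training call for this experiment can be sketched as below, using a toy stand-in model purely to show the fit() hyperparameters (batch_size = 16, 50 epochs); the real model is the Inception ResNetV2U-Net, and all data/model names here are illustrative:

```python
import numpy as np
import tensorflow as tf

# Toy stand-in model; only the fit() hyperparameters mirror experiment #4.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

x = np.random.rand(64, 8).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 6, 64), 6)

history = model.fit(x, y, validation_split=0.25,
                    batch_size=16, epochs=50, verbose=0)
# history.history holds the per-epoch loss, accuracy, val_loss and
# val_accuracy values plotted in the training/validation diagrams.
```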
The figure above shows the semantic segmentation of sample images predicted by the Inception ResNetV2U-Net model. The results can still be improved by tuning some parameters. Therefore, in the next experiment, the number of epochs will be increased to 100 to try to obtain better results with efficient segmentation for all six classes of spatial features.
In this final experiment, the number of epochs is increased from 50 to 100, while the batch size remains 16. The table below shows the parameters used to train this model, with the evaluation results extracted from the final epoch.
The table and diagrams above show that the loss, coefficient, and accuracy for training and validation improved when the number of epochs was increased from 50 to 100. The validation accuracy reaches around 87.5% and the validation coefficient 87%. The validation loss is 0.73, which is still high and can lead to errors in the semantic segmentation.
The figure above shows that the semantic segmentation improved over the previous experiment, but there are still some errors in predicting and segmenting spatial features such as roads and vegetation.
In the previous section 6.1, ten experiments were conducted across the U-Net model and the Inception ResNetV2U-Net model to find the best results for each model. In the upcoming sections, quantitative and qualitative evaluations of both models are discussed. The evaluation metrics used to assess the semantic segmentation models are described in detail in section 4.3. The best training results for each model were obtained in the 5th experiment, as per the metric scores in section 6.1. Therefore, a comparison between both models from experiment #5 is conducted to identify the best model based on the evaluation metrics. The 5th experiment's hyperparameters are the same for both models: optimizer = adam, loss = cross-entropy, batch_size = 8, verbose = 1, evaluation coefficient = jaccard_coef, and epochs = 100.
Model / Evaluation Metric    U-Net     Inception ResNetV2U-Net
Loss                         0.1974    0.0415
Accuracy                     0.9307    0.9834
Coefficient                  0.8985    0.9758
Val_loss                     0.5748    0.7337
Val_accuracy                 0.8645    0.8751
Val_coefficient              0.8409    0.8720
Accuracy Diagram
Dice-Coefficient Diagram
The table above shows that although the accuracy and coefficient for the training and validation data of the U-Net model are slightly lower than their counterparts for Inception ResNetV2U-Net, the training and validation loss of the U-Net model is much better than that of the ResNetV2U-Net model. Moreover, the average time taken to predict sample satellite imagery with the U-Net model is less than the time taken by Inception ResNetV2U-Net. Prediction time is crucial when working with a large satellite imagery dataset.
Therefore, the U-Net model is more suitable for a small satellite dataset: it gave a high validation accuracy and coefficient of 86.5% and 84% respectively, with an acceptable loss score, and its average time to predict one image is lower than that of the other model.
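The average per-image prediction time discussed above can be measured with a simple timing loop; this is an illustrative sketch, as the thesis does not show its timing code:

```python
import time

import numpy as np
import tensorflow as tf

def mean_prediction_time(model, images, repeats=5):
    """Average wall-clock seconds per predict() call over `repeats` runs."""
    model.predict(images, verbose=0)              # warm-up (graph tracing)
    start = time.perf_counter()
    for _ in range(repeats):
        model.predict(images, verbose=0)
    return (time.perf_counter() - start) / repeats
```

Dividing the result by the number of images in the batch gives a per-image figure for comparing the two models.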
In this section, the semantic segmentation of spatial features in sample images taken from the testing dataset is analyzed for each spatial feature class for both applied models, U-Net and Inception ResNetV2U-Net. This comparison clarifies which model works best for which spatial feature.
The table below shows a sample of the predicted images containing the buildings feature class, with the predicted image from each model.
The figure above shows that both models detect the building feature class correctly. This validates that both models work well for detecting building footprints from satellite imagery, which is essential for applications such as geographic information systems, building base maps, and urban planning.
The table below shows a sample of the predicted images containing the road feature class, with the predicted image from each model.
The figure above shows that the roads are segmented properly by both the U-Net and Inception ResNetV2U-Net models. However, some roads are segmented as unconnected in both models even though they are connected in the ground-truth images. From the perspective of geospatial road mapping, unconnected roads can be interpreted as roads still under construction. The cause of this error is the presence of trees or shadows on the roads, which can mislead the models while predicting this feature from the satellite imagery.
The table below shows a sample of the predicted images containing the water feature class, with the predicted image from each model.
The figure above shows that both models detect the water areas exceptionally well. However, the first image in Table 6.20 shows that the U-Net model is slightly better at detecting the water area than ResNetV2U-Net, since ResNetV2U-Net makes some errors in this region. As indicated in the quantitative evaluation in section 6.2.1, the loss of ResNetV2U-Net is higher than that of the U-Net model, which in turn can lead to errors in detecting spatial features from the satellite imagery.
The table below shows a sample of the predicted images containing the lands (unpaved areas) feature class, with the predicted image from each model.
The figure above shows that both models detect lands or unpaved areas exceptionally well. This validates that both models work well for detecting lands from satellite imagery, which is essential for applications such as urban planning and land administration.
The table below shows a sample of the predicted images containing the vegetation feature class, with the predicted image from each model.
The figure above shows that the U-Net model is better than Inception ResNetV2U-Net at detecting vegetation or green areas, even for small details such as small green areas between other features.
Overall, the U-Net model shows better semantic segmentation results for spatial features such as water areas and vegetation than the Inception ResNetV2U-Net model. The detection of some spatial features by the ResNetV2U-Net model exhibits more errors than the U-Net model, attributable to the higher loss observed for ResNetV2U-Net. Both models could achieve more accurate results given fully accurate masked images for every detail in the satellite imagery and a larger dataset.
In this research, various experiments with detailed quantitative and qualitative analyses were implemented on the U-Net model and the Inception ResNetV2U-Net model to discover which works best with a small dataset of satellite imagery. Moreover, a detailed comparison was conducted between both models' results, along with a geospatial analysis for each feature detected in the satellite imagery to measure each model's performance.
Although the research in [11], implemented on the same dataset, stated that Inception ResNetV2U-Net is better than the U-Net model with a validation accuracy of 87% and a training accuracy of 92%, the Inception ResNetV2U-Net model implemented in our research achieved better results, with a validation accuracy of 87.5% and a training accuracy of 98%. In addition, in this work, the U-Net model proved to be more suitable for a small dataset compared with the other model. Although the proposed U-Net model showed slightly lower accuracy and coefficient values for both training and validation data compared with ResNetV2U-Net, it exhibits significantly better training and validation loss. Additionally, the U-Net model demonstrates a shorter average prediction time for sample satellite imagery than Inception ResNetV2U-Net. Moreover, the proposed U-Net provided better semantic segmentation results for the spatial features. As a result, the U-Net model proves to be more suitable for small satellite datasets.
The U-Net model in this research achieved a high validation accuracy and coefficient of 86.5% and 84%, respectively, with a good loss score, which exceeds the U-Net results of the related research [12] implemented on the same dataset with a validation accuracy of only 77%.
These results therefore compare favorably with [11, 12], which utilized semantic segmentation models on the same dataset.
This research achieved all objectives stated in section 1.3: the U-Net and modified U-Net (Inception ResNetV2U-Net) deep learning models were applied to explore better results for the semantic segmentation of spatial feature classes from a small dataset of satellite imagery. Diverse experiments and thorough geospatial analyses, encompassing quantitative and qualitative evaluations as well as comparisons, were executed to evaluate the performance and results of both models. In addition, a comparison with existing studies was conducted to discuss the results achieved by the proposed model. This work also validated deep learning's effectiveness in domains such as detecting roads, building footprints, vegetation, and water areas for geographic information systems, building base maps, and urban planning applications.
Chapter 7
Conclusion and Future Work
This research project successfully applied two deep learning models, U-Net and a modified U-Net (Inception ResNetV2U-Net), to perform semantic segmentation on a small dataset of satellite imagery. The models exhibited high accuracy in identifying spatial features such as building footprints, land cover, roads, vegetation, and water bodies. Various experiments and comprehensive analyses, incorporating both quantitative and qualitative evaluations and comparisons, were conducted to assess the performance and outcomes of the two models. Additionally, a comparison with existing studies was conducted to highlight the results attained by the proposed model. Both models achieved better results in detecting the spatial features than existing research papers using the same dataset. Despite the slightly better results achieved by the Inception ResNetV2U-Net model, with a validation accuracy of 87.5% and a validation coefficient of 87%, the U-Net model proved more suitable for working with a small dataset. The U-Net model achieved a high validation accuracy and coefficient of 86.5% and 84%, respectively. In addition, the U-Net model exhibited significantly better training and validation loss than ResNetV2U-Net. Furthermore, the U-Net model demonstrated a shorter average prediction time for satellite imagery than Inception ResNetV2U-Net, and it provided more efficient semantic segmentation of the spatial features than the modified U-Net when compared against the existing masked images of the dataset. As a result, the U-Net model proves to be more suitable for detecting spatial information from small satellite datasets with high performance.
Using a small dataset of satellite imagery was one of the main challenges faced while training the models. The challenge lies in performing semantic segmentation on such a limited dataset, since any deep learning model requires a large dataset to learn and provide good results. Moreover, the variety and complexity of spatial features, such as complex buildings or the shadows of buildings and trees in the satellite imagery, make it very hard for the models to detect some spatial features, such as roads, properly. Therefore, providing a large number of masked satellite images for the same dataset is crucial for more accurate results. Such a large annotated and masked satellite dataset should cover various areas and geographic regions, which would help the model learn, predict, and provide better results for the semantic segmentation of complex spatial features.
Furthermore, the dataset's satellite images were manually masked and annotated by humans, which has resulted in wrong masks and segmentation for certain classes in some images. This misleading data can negatively impact the model's learning process, potentially affecting the quality of its predictions. The presence of segmentation errors for some spatial features in the dataset may lead the model to predict some spatial features incorrectly from satellite images. Therefore, it is recommended to generate masked images automatically, by developing a script or software that can generate a mask for each spatial feature in the satellite image without manual intervention.
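As a hedged sketch of such automatic mask generation, polygon annotations (for example, exported from GIS vector data) can be rasterized into per-pixel class masks; the input format below is an assumption for illustration, not a description of the thesis's dataset tooling:

```python
import numpy as np
from PIL import Image, ImageDraw

def rasterize_masks(polygons_by_class, width, height):
    """Render one label mask from per-class polygon lists.

    `polygons_by_class` maps an integer class id to a list of polygons,
    each given as a list of (x, y) pixel coordinates. This input format
    is assumed; a real pipeline would read it from GIS vector data.
    """
    mask = Image.new("L", (width, height), 0)   # 0 = background class
    draw = ImageDraw.Draw(mask)
    for class_id, polygons in polygons_by_class.items():
        for poly in polygons:
            draw.polygon(poly, fill=class_id)   # paint the class id
    return np.array(mask)
```

Generating masks this way removes the manual annotation step that introduced the labeling errors discussed above, at the cost of requiring reliable vector data for each region.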
References
[1] Sisodiya, N., Dube, N., & Thakkar, P. (2020). Next-Generation Artificial Intelligence
Techniques for Satellite Data Processing. International Journal of Advanced Research in
Computer Science and Software Engineering, 10(3), 34-39.
[2] Gupta, A. (2022). Deep Learning for Semantic Feature Extraction in Aerial Imagery.
Journal of Remote Sensing, 14(2), 23-31.
[3] Tuia, D., Wegner, J. D., Hansch, R., Le Saux, B., Yokoya, N., Demir, I., Jacobs, N., Kopersky,
K., Pacifici, F., Raskar, R., Brown, M., & Burhin, M. (2020). EARTHVISION 2020: Large Scale
Computer Vision for Remote Sensing Imagery. Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops (pp. 3823-3826). IEEE.
[4] Ma, L., Liu, Y., Zhang, X., Ye, Y., Yin, G., & Johnson, B. A. (2021). Deep learning in remote
sensing applications: A meta-analysis and review. ISPRS Journal of Photogrammetry and
Remote Sensing, 180, 197-213.
[5] Selea, T., & Neagul, M. (2017). Using Deep Networks for Semantic Segmentation of
Satellite Images. In 2017 19th International Symposium on Symbolic and Numeric Algorithms
for Scientific Computing (SYNASC) (pp. 409-415). IEEE. doi: 10.1109/SYNASC.2017.00074.
[6] Guo, Y., Liu, Y., Georgiou, T., & Lew, M. S. (2018). A review of semantic segmentation
using deep neural networks. International Journal of Multimedia Information Retrieval, 7(2),
87-93.
[7] Shelhamer, E., Long, J., & Darrell, T. (2017). Fully Convolutional Networks for Semantic
Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640-651.
[8] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Lecture Notes in Computer Science, volume 9351, pages 234-241. ISBN 9783319245737. doi: 10.1007/978-3-319-24574-4_28.
[9] Kaiser, P., Wegner, J. D., Lucchi, A., Jaggi, M., Hofmann, T., & Schindler, K. (2017). Learning
Aerial Image Segmentation from Online Maps. IEEE Transactions on Geoscience and Remote
Sensing. doi: 10.1109/TGRS.2017.2719738.
[10] Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., & Garcia-Rodriguez, J.
(2017). A review on deep learning techniques applied to semantic segmentation. arXiv Prepr.
arXiv1704.06857.
[11] Patil, D., Patil, K., Nale, R., & Chaudhari, S. (2022, March). Semantic Segmentation of
Satellite Images using Modified U-Net. In 2022 IEEE Region 10 Symposium (TENSYMP) (pp.
1-6). IEEE. doi: 10.1109/TENSYMP54529.2022.9864504.
[12] Ali, K. and Hussien, S. (2022) Semantic Segmentation of Aerial Images Using U-Net
Architecture.
[13] Human in the Loop. (2023). Semantic Segmentation Dataset. Available at:
https://humansintheloop.org/resources/datasets/semantic-segmentation-dataset-2/
(Accessed: March 9, 2023).
[14] Xu, Y., Xie, Z., Feng, Y., & Chen, Z. (2018). Road extraction from high-resolution remote
sensing imagery using deep learning. Remote Sensing, 10, 1461.
[15] Alsabhan, W., & Alotaiby, T. (2022). Automatic building extraction on satellite images
using Unet and ResNet50.
[16] Li, W., He, C., Fang, J., Zheng, J., Fu, H., & Yu, L. (2019). Semantic Segmentation-Based
Building Footprint Extraction Using Very High-Resolution Satellite Images and Multi-Source
[17] Singh, N. J., & Nongmeikapam, K. (2023). Semantic segmentation of satellite images
using Deep-Unet. Arabian Journal of Science and Engineering, 48, 1193-1205.
[18] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for
biomedical image segmentation. In International Conference on Medical Image Computing
and Computer-Assisted Intervention.
[19] Wang, X., Huang, X., Chen, C., Zhou, B., He, J., & Chen, T. (2019). Comparison between
UNet, modified UNet, and dense-attention network (DAN) for building extraction from
TripleSat imagery.
[20] Gonzalez, J., Sankaran, K., Ayma, V., & Beltran, C. (2020). Application of Semantic
Segmentation with Few Labels in the Detection of Water Bodies from Perusat-1 Satellite’s
Images. In 2020 IEEE Latin American GRSS & ISPRS Remote Sensing Conference (LAGIRS) (pp.
483-487). Santiago, Chile: IEEE. doi: 10.1109/LAGIRS48042.2020.9165643.
[21] Khryashchev, V., Ivanovsky, L., Pavlov, V., Ostrovskaya, A., & Rubtsov, A. (2018).
Comparison of Different Convolutional Neural Network Architectures for Satellite Image
Segmentation. In 2018 23rd Conference of Open Innovations Association (FRUCT) (pp. 172-
179). Bologna, Italy: IEEE. doi: 10.23919/FRUCT.2018.8588071.
[22] Flood, N., Watson, F., & Collett, L. (2019). Using a U-net convolutional neural network
to map woody vegetation extent from high-resolution satellite imagery across Queensland,
Australia. Remote Sensing, 11(13), 1571.
[23] Gupta, A., Watson, S., & Yin, H. (2020). Deep Learning-based Aerial Image Segmentation
with Open Data for Disaster Impact Assessment. arXiv preprint arXiv:2006.05575.
[24] Storie, C. D., & Henry, C. J. (2018). Deep learning neural networks for land use land cover
mapping. In IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing
[25] Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D.,
& Raskar, R. (2018). DeepGlobe 2018: A challenge to parse the earth through satellite
images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 19-26. doi: 10.1109/CVPRW.2018.00009.
[26] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning Book. MIT Press.
[27] Ballard, D. H., Hinton, G. E., & Sejnowski, T. J. (1983). Parallel vision computation.
Nature.
[28] LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D.,
Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications of neural
network chips and automatic learning. IEEE Communications Magazine, 27(11), 41–46.
[29] Livshin, I. (2019). Learning About Neural Networks. In: Artificial Neural Networks with
Java. Apress, Berkeley, CA.
[30] Mohseni-Dargah, M., Falahati, Z., Dabirmanesh, B., Nasrollahi, P., & Khajeh, K. (2022).
Machine learning in surface plasmon resonance for environmental monitoring. In Artificial
Intelligence and Data Science in Environmental Sensing.
[32] Nikparvar, B., & Thill, J. C. (2021). Machine Learning of Spatial Data. ISPRS International
Journal of Geo-Information, 10(9), 600.
[33] Li, W., He, C., Fang, J., Zheng, J., Fu, H., & Yu, L. (2019). Semantic Segmentation-Based
Building Footprint Extraction Using Very High-Resolution Satellite Images and Multi-Source
[35] Mas, J. F. and Flores, J. J. (2008). The application of artificial neural networks to the
analysis of remotely sensed data. International Journal of Remote Sensing, 29(3), 617-663.
[36] Kedzierski, M., Wierzbicki, D., Sekrecka, A., Fryskowska, A., Walczykowski, P., & Siewert,
J. (2019). Influence of Lower Atmosphere on the Radiometric Quality of Unmanned Aerial
Vehicle Imagery. Remote Sensing, 11(10), 1214.
[37] Mhango, J. K., Harris, E. W., Green, R., & Monaghan, J. M. (2021). Mapping Potato Plant
Density Variation Using Aerial Imagery and Deep Learning Techniques for Precision
Agriculture. Remote Sensing, 13(14), 2705.
[38] Singh, V.P., Srivastava, P.K., Agrawal, S., & Kandpal, M. (2020). Automated Road
Extraction from High-Resolution Satellite Images: A Review. Remote Sensing, 12(7), 1194.
[39] Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed.). O'Reilly Media.
[40] Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-
ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence.
[41] Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object
Detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV),
2980-2988.
Appendices
Appendix A: Project Plan
Project Schedule
The research project commenced in January and continued until mid-April, with the research
report being completed during this time. Subsequently, the project is scheduled to resume
in early May. The implementation phase encompasses data preparation and processing,
model building, conducting training and testing experiments, evaluation, measurement, and
dissertation writing. The project's submission is planned for mid-August.
The Gantt chart above shows the project schedule and timeline for all tasks and subtasks of this research project. The project was executed using the waterfall methodology, a sequential approach in which each phase is completed before moving on to the next, as shown in the timeline of Table A.1 above.
Risk Management
In Table A.1 above, the top right corner of the matrix identifies the top risks that should be identified and mitigated. A mitigation strategy is required to prevent them from happening, or to reduce their impact.
To implement risk management, the following steps are performed, as per the ISO risk management process [34]:
1. Identify the potential risks of this research project, as stated in Table A.2 below.
2. Assess the probability of each risk occurring.
3. Assess the impact level of each risk.
4. Set a strategy to deal with the potential issues.
Table A.2 above lists the risks with their likelihood, impact, and mitigation strategy to reduce or avoid the impact. Each risk related to this research project is evaluated based on its impact and likelihood level (low, medium, or high).
Professional issues
The code was developed and tested with meticulous attention to high standards, ensuring
clarity through comprehensive comments. It adheres to the codes of conduct set forth by
the British Computing Society. Third-party libraries and software are utilized only in
compliance with their respective licenses. Proper referencing and citations are included for
any external information used.
Legal issues
The permission to access and download the dataset is granted by filling out the form stated
on the Humans in the Loop website [13]. This dataset is dedicated to the public domain
by Humans in the Loop under CC0 1.0 license.
Ethical issues
As this research-based project does not involve users or sensitive datasets, there is no risk
of breaching ethics codes. Furthermore, safety concerns are not applicable since the project
does not involve users in either the implementation or evaluation process.
Social issues