MASTER'S THESIS

Detecting Spatial Information from Satellite Imagery using Deep Learning for Semantic Segmentation

Author: Mohamed Othman (H00383877)
Supervisor: Dr. Abrar Ullah

August 2023
Declaration of Authorship
I, Mohamed Othman, declare that this thesis titled, 'Detecting Spatial Information from
Satellite Imagery using Deep Learning for Semantic Segmentation', and the work
presented in it are my own. I confirm that this work submitted for assessment is my own
and is expressed in my own words. Any use made within it of the works of other authors
in any form (e.g., ideas, equations, figures, text, tables, programs) is properly
acknowledged at the point of its use. A list of the references employed is included.
“The world is changed by your example not by your opinion.”
Paulo Coelho
Abstract
Detecting spatial information from satellite imagery using deep learning for semantic
segmentation is a rapidly growing field, owing to its importance in applications such as
the automated generation of vector maps, urban planning, and geographic information
systems. In this research, the use of deep learning for the semantic segmentation of
spatial information from satellite imagery is explored. The objective is to devise an
efficient and precise method for detecting and categorizing diverse features on the
Earth's surface, including road networks, building footprints, water bodies, vegetation,
and land cover, which can be used in automatic map production. The proposed technique
entails training a deep convolutional neural network to detect spatial features from a
small dataset of satellite imagery, followed by a segmentation process to classify the
various spatial features. This study conducts various experiments on satellite imagery
and achieves accuracy rates that outperform traditional image processing techniques. In
addition, this project compares models such as the U-shaped network architecture (U-Net)
and a modified U-Net (Inception ResNetV2U-Net) across various spatial features. Both
implemented models achieved higher results than those reported in other relevant
research. Although the Inception ResNetV2U-Net model produced slightly better results
than U-Net, with a validation accuracy of 87.5% and a validation coefficient of 87%, the
U-Net model also achieved a high validation accuracy and coefficient of 86.5% and 84%,
respectively. Additionally, the U-Net model exhibited significantly better training and
validation loss than the Inception ResNetV2U-Net, and showed a shorter average prediction
time for satellite imagery. Therefore, the U-Net model proves more suitable for detecting
spatial information from small satellite datasets.
Acknowledgements
I would like to show my sincere gratitude to my supervisor Dr. Abrar Ullah for his
invaluable guidance, time, and support. I am also grateful to my family, friends, and
colleagues for their support and encouragement during the entire duration of this
research. Their unwavering support has motivated me to push my limits and strive for
success.
Contents
4.2 Proposed Methodology ......................................................................................... 26
4.3 Evaluation .............................................................................................................. 28
4.3.1 Average pixel-wise intersection over union (mIoU) .......................................................... 28
4.3.2 Global accuracy ........................................................................................................... 29
4.3.3 Loss Functions ............................................................................................................. 30
4.4 Technologies and Tools.......................................................................................... 31
5 Implementation ..................................................................................................................... 32
5.1 Data Processing and Preparation........................................................................... 32
5.2 Semantic Segmentation Models ............................................................................ 33
5.2.1 U-Net Model Architecture .......................................................................................... 34
5.2.2 Inception ResNetV2U-Net Model Architecture .......................................................... 35
5.3 Training and Validation using Various Experiments .............................................. 37
6 Evaluation and Results........................................................................................................... 38
6.1 Experiments Evaluation and Results ...................................................................... 38
6.1.1 U-Net Model Experiments .......................................................................................... 38
6.1.1.1 U-Net Experiment #1...................................................................................................... 38
6.1.1.2 U-Net Experiment #2...................................................................................................... 39
6.1.1.3 U-Net Experiment #3...................................................................................................... 40
6.1.1.4 U-Net Experiment #4...................................................................................................... 41
6.1.1.5 U-Net Experiment #5...................................................................................................... 42
6.1.2 Inception ResNetV2U-Net Model Experiments .......................................................... 45
6.1.2.1 ResNetV2U-Net Experiment #1 ...................................................................... 45
6.1.2.2 ResNetV2U-Net Experiment #2 ...................................................................................... 46
6.1.2.3 ResNetV2U-Net Experiment #3 ...................................................................................... 47
6.1.2.4 ResNetV2U-Net Experiment #4 ...................................................................................... 47
6.1.2.5 ResNetV2U-Net Experiment #5 ...................................................................................... 49
6.2 Comparison between U-Net Model and ResNetV2U-Net Results ......................... 50
6.2.1 Quantitative Evaluation .............................................................................................. 51
6.2.2 Qualitative Evaluation................................................................................................. 52
6.2.2.1 Semantic Segmentation of Buildings ............................................................................. 52
6.2.2.2 Semantic Segmentation of Roads .................................................................................. 53
6.2.2.3 Semantic Segmentation of Water .................................................................................. 53
6.2.2.4 Semantic Segmentation of Lands................................................................................... 54
6.2.2.5 Semantic Segmentation of Vegetation .......................................................................... 55
6.3 Baseline Comparisons with Existing Relevant Research ........................................ 56
6.4 Comparison with Research Objectives .................................................................. 57
7 Conclusion and Future Work ................................................................................................. 58
7.1 Conclusion.............................................................................................................. 58
7.2 Future Work ........................................................................................................... 58
References ..................................................................................................................................... 60
Appendices .................................................................................................................................... 65
Appendix A: Project Plan ........................................................................................................... 65
Project Schedule ......................................................................................................................... 65
Risk Management ....................................................................................................................... 66
Appendix B: Professional, Legal, Ethical, and Social Issues ....................................................... 68
Professional issue ....................................................................................................................... 68
Legal issue................................................................................................................................... 68
Ethical Issues............................................................................................................................... 68
Social issues ................................................................................................................................ 68
List of Figures
FIGURE 4.1: LABELED CLASSES IN VARIOUS COLORS FROM THE SATELLITE IMAGE DATASET [13] .............................................25
FIGURE 4.2: SAMPLE OF SATELLITE IMAGE AND CORRESPONDING MASK FROM THE DATASET [13] ..........................................25
FIGURE 4.3: THE LIFE CYCLE MODEL OF CRISP-DM ......................................................................................................26
List of Tables
Abbreviations
ML Machine Learning
DL Deep Learning
CV Computer Vision
RS Remote Sensing
NN Neural Network
OSM OpenStreetMap
Symbols
∑ The summation symbol: adding up a series of values.
α Hyperparameter controlling the balancing strength of the loss.
γ Hyperparameter controlling the focusing strength of the loss.
log The logarithm function.
Dedicated to all the people I love.
Chapter 1
1 Introduction
1.1 Overview
A vast number of satellite images are captured by the many satellites orbiting our
planet. These images provide essential, up-to-date information on various spatial
features such as road networks, land use, land cover, agriculture, and other global
environmental changes. This data is used in a significant number of applications in
different fields such as urban planning, environmental monitoring, geographic information
systems, and agriculture [1].
Automating map production by detecting and extracting spatial information from satellite
imagery using artificial intelligence significantly reduces the time and cost of making
base maps. For instance, satellite imaging is extensively used in mapping roads, and
mapping or updating the entire road network of our planet requires a significant amount
of manual work. That is why automatically extracting roads from satellite images is
crucial for keeping maps up to date. Optimizing the automated extraction of road
networks is therefore important, since roads are a required layer in many mapping
activities such as navigation, route planning, fleet management, traffic management,
geographic information systems, and autonomous driving [38].
Machine learning (ML) and computer vision (CV) can be used to automate the extraction of
informative spatial data from satellite imagery [2]. Satellite imagery analysis and
remote sensing (RS) can benefit from advances in CV, and this intersection of RS and CV
is attracting considerable interest from researchers in both fields [3].
In addition, deep learning is increasingly applied to satellite images in many studies to
extract more insightful information [4]. In particular, semantic segmentation, a process
that aims to attach a class label to every pixel, produces an image with highlighted
objects. This process is used to identify and detect different object classes in
satellite images such as buildings, roads, land cover, and water bodies [5].
Many challenges remain for semantic segmentation models applied to satellite imagery to
detect high-resolution spatial information. For instance, using a small labeled dataset
of satellite imagery instead of a large one makes it hard for a model to detect spatial
information with high accuracy, since machine learning models generally need large
amounts of data to learn properly. Moreover, comparing various semantic segmentation
models applied to a small dataset of satellite imagery has yet to be investigated
further.
1.2 Motivations
Detecting spatial information from satellite imagery is essential for automating digital map
production, change detection, and other geographic information systems applications.
Deep learning for semantic segmentation can be used to automate the extraction of
geospatial data from satellite images, making it useful for many applications. For
instance, automating the detection and extraction of spatial information from satellite
imagery will reduce the time and cost of creating digital base maps manually. This topic
is challenging due to the large number of different classes involved in satellite images,
which makes classification difficult.
This thesis aims to develop an effective deep learning solution for semantic segmentation
to detect spatial information such as building footprints, land cover, roads, vegetation,
and water bodies from a small dataset of satellite imagery. More specifically, the
objectives of this research are as follows:
• Apply deep learning models for semantic segmentation such as U-Net and
Modified U-Net on a small dataset of labeled satellite imagery to detect spatial
information.
Chapter 2
2 Background and Literature Review
2.1 Background
Satellite images are an essential source of information for automating map production,
Geographic Information Systems (GIS), agriculture, and urban planning. Satellite imagery
contains more structured and uniform spatial data compared to traditional images. Tasks
such as road extraction, building footprint detection, and land cover classification are
based on semantic segmentation models [25]. Data extracted from satellite imagery is used
in a significant number of applications in different domains such as urban planning,
environmental monitoring, geographic information systems, fleet management, and many
more [1].
Many approaches exist for detecting and extracting spatial data from satellite imagery,
ranging from classical computer vision algorithms to deep learning models for semantic
segmentation. Many studies have discussed various approaches and models for detecting
spatial information such as roads and building footprints from satellite imagery; these
are discussed in more detail in the related work section of this chapter.
Furthermore, employing a small labeled dataset of satellite imagery rather than a large
one makes it difficult for a model to detect spatial information with high precision.
Many studies have applied semantic segmentation models to detect various spatial
information such as roads, building footprints, or vegetation, but comparing various
semantic segmentation models on small labeled satellite datasets has yet to be
investigated further. Before discussing the related work in detail with critical
analysis, it is important to define and clarify some of the common terms used in computer
vision, deep learning, and semantic segmentation in the following sections.
2.2.1 Overview
Figure 2.1 shows how deep learning is a specific type of representation learning. Deep
learning itself falls under the umbrella of machine learning, which is used for many
approaches to AI [26].
Spatial data refers to data that is associated with a particular location on the Earth's surface,
such as satellite imagery, maps, and geospatial data.
Machine learning can be applied to spatial data detection from satellite imagery. The use of
machine learning algorithms in this context can help automate the process of identifying and
analyzing patterns in satellite images, such as identifying land cover types, detecting changes
in land use over time, and monitoring natural disasters or environmental changes. By feeding
large amounts of satellite imagery data into machine learning algorithms, the algorithms can
learn to recognize patterns and features in the images. This process can be used to develop
predictive models that can help identify areas at risk of natural disasters or environmental
hazards, monitor the health of ecosystems, or support urban planning and development
[32].
One example of the use of machine learning in spatial data detection from satellite imagery
is in the field of precision agriculture. By using machine learning algorithms to analyze
satellite images of crop fields, farmers can identify areas that require water, fertilizer, or
other inputs, which can help optimize crop yields and reduce waste. Overall, machine
learning has the potential to revolutionize the way we analyze and use spatial data from
satellite imagery to support a wide range of applications and industries [37].
Machine learning can be used to automate spatial data extraction from satellite images.
However, there are many satellite images with many labels, which makes it very hard to
use conventional machine learning to extract spatial data. In addition, this data is
high-dimensional and difficult to process [2].
2.3.1 Overview
Deep learning is a machine learning technique that has been inspired by our understanding
of the human brain, as well as statistics and applied mathematics. Its development has
spanned several decades, and in recent years, it has gained significant popularity and
practicality. The increasing power of computers, larger datasets, and advances in training
techniques for deeper networks have contributed to this growth. There are both challenges
and opportunities to enhance deep learning further and explore new frontiers. [26].
Figure 2.2 shows how the different components of an AI system relate to each other
across different approaches, with the shaded portions indicating which parts learn from
data [26].
Deep learning, a contemporary method for supervised learning, offers a potent framework.
By incorporating more layers and more units within each layer, a deep network can
represent progressively complex functions. This approach is adept at tasks that involve
mapping an input vector to an output vector, which humans can perform quickly and
effortlessly. Nevertheless, it requires large models and extensive labeled training data.
Tasks that cannot be described as mapping one vector to another, or that demand
substantial human deliberation and reasoning, cannot presently be accomplished through
deep learning [26].
One application of deep learning in spatial data analysis is in the field of remote sensing. By
using deep learning algorithms to analyze satellite imagery, it is possible to identify and
classify various features on the Earth's surface, such as land cover types, vegetation density,
and urban areas. This information can be used to monitor changes in the environment over
time, such as deforestation or urbanization, and to inform decision-making in fields such as
urban planning, agriculture, and environmental conservation [33].
Neurons in a neural network are arranged in layers, each layer comprising numerous
neurons that fulfill a specific function. The input layer accepts the data for processing, while
the output layer generates the final outcome of the network. The layers between the input
and output layers are referred to as hidden layers, and they perform intermediate
computations to transform the input data into the intended output. During training, a neural
network modifies the weights and biases of its neurons to minimize the discrepancy
between the predicted output and the desired output. This process is termed
backpropagation, which entails passing the error from the output layer through the hidden
layers to adjust the weights and biases of the neurons [26].
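The forward pass and backpropagation update described above can be sketched with plain NumPy. This is an illustrative toy network, not one of the thesis models: the layer sizes, learning rate, and XOR-style data are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR-like mapping from 2-D inputs to a 1-D target.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# Weights and biases: input -> hidden layer -> output layer.
W1, b1 = rng.normal(0.0, 1.0, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0.0, 1.0, (8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)        # hidden-layer activations
    return h, sigmoid(h @ W2 + b2)  # network prediction

_, p0 = forward(X)
loss_before = float(np.mean((p0 - y) ** 2))

lr = 1.0
for _ in range(5000):
    h, p = forward(X)
    # Backpropagation: pass the output error back through the
    # hidden layer (chain rule) to get per-layer gradients.
    d_out = (p - y) * p * (1 - p)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # Adjust weights and biases to reduce the discrepancy.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)

_, p1 = forward(X)
loss_after = float(np.mean((p1 - y) ** 2))  # should be well below loss_before
```

With these assumed settings, the mean squared error after training drops substantially below its initial value, illustrating how weight adjustment minimizes the discrepancy between predicted and desired outputs.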
The utilization of artificial neural networks (ANNs) has gained significant popularity as a
means of analyzing remotely sensed data such as satellite and aerial images. Considerable
advancements have been achieved in image classification through the application of neural
networks. [35].
The convolutional neural network (CNN) is a deep learning method used for image
processing, particularly for computed tomography, X-ray images, and magnetic resonance
imaging. It consists of convolutional, pooling, and fully connected layers. Filters (kernels)
slide over preprocessed signals in the convolutional layer, generating a feature map. The
pooling layer reduces dimensionality to prevent overfitting and reduce computational load.
The final layer utilizes activation functions to introduce nonlinearity to the outputs. [30].
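As a concrete illustration of the pooling layer's dimensionality reduction, the following is a minimal 2x2 max-pooling sketch in NumPy; the feature-map values are made up for the example.

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling, stride 2: keep the largest activation in each
    non-overlapping 2x2 window, halving the height and width."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]  # trim odd edges, if any
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 6.],
                 [2., 2., 7., 8.]])
pooled = max_pool2x2(fmap)   # 4x4 feature map -> 2x2
```

Each output value is the maximum of one 2x2 window, so the map shrinks from 4x4 to 2x2 while the strongest activations are preserved.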
Deep Learning methods such as Fully convolutional networks (FCN) are used increasingly for
addressing semantic segmentation problems. FCN replaces the fully connected layers found
at the end of the classification networks with convolutional layers to output a spatial
segmentation map [7].
Convolutional networks have played a crucial role in the development of deep learning by
applying brain-inspired insights. They achieved impressive performance before the potential
of deep models was fully realized. In addition to that, convolutional networks were pioneers
in commercial applications of neural networks and continue to lead in the practical
implementation of deep learning. [26].
A convolutional neural network (CNN) is a form of neural network (NN) that forms the core
of deep learning. Deep learning has given impressive results in many areas, but CNNs
still have drawbacks in practice. For instance, a considerable volume of manually
annotated data is still necessary to train CNN models [2].
Convolutional Neural Networks (CNNs) have proven to be highly effective in processing data
with grid-like structures, such as time series and image data. Their specialized use of
convolution operations sets them apart from traditional neural networks that rely on general
matrix multiplication. CNNs have demonstrated remarkable success in practical applications
and continue to be a prominent choice for various tasks in the field of deep learning. [28].
Convolutional networks specialize the neural network framework for data with a
grid-structured topology, and they can be scaled to large sizes. This approach has been
particularly successful for two-dimensional image data. For processing one-dimensional
sequential data, however, another specialized form of the neural network framework is
used, known as the recurrent neural network [26].
The feature map produced by the convolutional layer can be written as

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)

where I is the 2D image input, K is the 2D kernel or filter bank, and S is the feature
map output [2].
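The convolution of the image I with the kernel K can be sketched directly in NumPy. This is an illustrative implementation of the 'valid' case; the sample image and kernel values are made up.

```python
import numpy as np

def conv2d_valid(I, K):
    """2-D convolution producing the feature map S ('valid' region only).
    Implemented as cross-correlation with the kernel flipped in both
    axes, which is the textbook definition of convolution."""
    Kf = K[::-1, ::-1]
    kh, kw = Kf.shape
    H, W = I.shape
    S = np.empty((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # Weighted sum of the image patch under the flipped kernel.
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * Kf)
    return S

I = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
K = np.array([[1., 0.],
              [0., -1.]])
S = conv2d_valid(I, K)   # 2x2 feature map
```

Note that most deep learning libraries actually implement the closely related cross-correlation (no kernel flip); the two operations differ only by that flip.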
Figure 2. 3: Structure of a Fully Connected Network (all connections are weighted) [26]
Figure 2.3 shows the structure of a fully connected network: an input layer, hidden
layers, and an output layer.
Figure 2.4 shows a fully connected layer, the most common layer type, where all the
nodes in adjacent layers are fully connected and there are no interconnections between
neurons within the same layer [7].
Fully Convolutional Networks replace the fully connected layers at the end of classification
networks with convolutional layers to output a spatial segmentation map. Since this
segmentation map has a lower resolution as compared to the original image, it is up-sampled
to produce the final segmentation output. In order to recover finer details that are lost
during the down-sampling phase, the outputs from some of the earlier down-sampling layers
are added to the output as shown in Figure 2.4 [7,2].
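The up-sampling and skip-addition step described above can be sketched as follows; nearest-neighbour up-sampling and made-up score maps stand in for the learned layers of a real FCN.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour up-sampling: repeat each value twice along
    both spatial axes, doubling the map's resolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# Coarse segmentation scores from the deepest (lowest-resolution) layer.
coarse = np.array([[0.2, 0.8],
                   [0.6, 0.4]])
# A finer score map taken from an earlier down-sampling layer.
fine = np.arange(16).reshape(4, 4) / 16.0

# FCN-style fusion: up-sample the coarse map, then add the earlier
# map to recover detail lost during down-sampling.
fused = upsample2x(coarse) + fine
```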
Computer vision is a field of artificial intelligence that enables computer systems to
extract information from images or videos. Semantic segmentation partitions an image
into segments by assigning a semantic label to each pixel, making it possible to detect
objects and identify their boundaries in the image [2].
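Assigning a semantic label to each pixel can be illustrated with per-class score maps and an argmax over the class axis; the class names and score values here are hypothetical.

```python
import numpy as np

# Per-class score maps for a tiny 2x2 image with three hypothetical
# classes: 0 = background, 1 = road, 2 = building.
scores = np.array([
    [[0.1, 0.7], [0.2, 0.1]],   # background scores
    [[0.8, 0.2], [0.1, 0.2]],   # road scores
    [[0.1, 0.1], [0.7, 0.7]],   # building scores
])                               # shape: (classes, height, width)

# Semantic segmentation output: the highest-scoring class per pixel.
labels = scores.argmax(axis=0)
```

Every pixel receives exactly one label, so the 2x2 output is a dense class map rather than a set of bounding boxes.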
Deep learning has exhibited remarkable accuracy in computer vision tasks and holds
tremendous potential for efficiently processing vast amounts of earth observation satellite
image data in automated workflows. [24].
The field of computer vision has been a thriving research area for deep learning applications,
primarily due to the inherent complexity of vision, which humans and animals effortlessly
perform but poses significant challenges for computers [27]. Computer vision is a highly
expansive discipline that encompasses diverse image-processing techniques and a multitude
of applications. Its scope spans from emulating human visual capabilities, like facial
recognition, to pioneering novel visual abilities. Common benchmark tasks for evaluating
deep learning algorithms in computer vision include optical character recognition and object
recognition. [26].
Semantic segmentation is one of the greatest challenges in the history of computer vision
because it requires the algorithm not only to detect objects in an image but also to
precisely segment them into their individual parts. Unlike object detection, where the
goal is to identify the location of an object in an image, semantic segmentation requires
pixel-level labeling of every object in the image. Furthermore, the need for high
precision and accuracy makes semantic segmentation particularly challenging: even small
errors in the segmentation of an object can have significant consequences in downstream
applications such as autonomous driving, where a misclassified object could result in a
collision [6].
2.4.1 Overview
As with other deep learning models, the accuracy of semantic segmentation models
depends on various factors, including the quality of the training dataset, the
architecture and parameters of the CNN model, and the complexity of the scene being
analyzed. Additionally, satellite imagery can present challenges such as atmospheric
interference or differences in lighting conditions, which may require preprocessing or
specialized techniques to address [36].
The structure shown in Figure 2.5 maps an input x to an output, referred to as the
reconstruction r, using an internal representation or code h. It comprises two
components: the encoder f (which maps x to h) and the decoder g (which maps h to r) [26].
In the U-Net architecture figure, the lowest resolution is 32x32 pixels. Each blue box
corresponds to a multi-channel feature map, with the number of channels displayed on top
and the x-y size indicated at the lower left of each box. White boxes represent copied
feature maps, while the arrows depict the different operations involved [8].
A network with U-shaped architecture (U-Net) proposed in [18], is a widely used deep
learning model for semantic segmentation, featuring encoder and decoder blocks connected
by skip connections, akin to fully convolutional networks. Initially developed for medical
image segmentation, U-Net has found success in satellite imagery segmentation with
impressive outcomes. [8].
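The encoder-decoder structure with skip connections can be sketched at the level of feature-map arrays; the map sizes and the use of nearest-neighbour up-sampling are assumptions for illustration, not the thesis configuration.

```python
import numpy as np

# An encoder feature map: height x width x channels.
enc = np.random.default_rng(1).random((8, 8, 16))

# Contracting path: 2x2 max pooling halves the spatial resolution.
pooled = enc.reshape(4, 2, 4, 2, 16).max(axis=(1, 3))   # (4, 4, 16)

# Expanding path: nearest-neighbour up-sampling restores resolution
# (a real U-Net uses learned transposed convolutions here).
decoded = pooled.repeat(2, axis=0).repeat(2, axis=1)    # (8, 8, 16)

# Skip connection: concatenate the saved encoder map along the channel
# axis, so the decoder sees coarse context and fine spatial detail.
merged = np.concatenate([decoded, enc], axis=-1)        # (8, 8, 32)
```

The channel concatenation is the defining U-Net idea: decoder layers receive both the up-sampled context and the high-resolution encoder features lost during pooling.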
Detecting spatial information from satellite imagery using deep learning for semantic
segmentation is a rapidly growing research area with tremendous potential for
applications in fields such as automated map production, geographic information systems,
urban planning, agriculture, and environmental monitoring. Deep learning models for
semantic segmentation are used to detect and classify spatial objects, for example
extracting roads to update GIS map datasets. Other common tasks in this field are
extracting land cover and building footprints using different deep learning models for
semantic segmentation.
However, the effectiveness of deep learning models for the semantic segmentation of
satellite imagery is subject to several challenges and limitations that require careful
consideration and analysis. One challenge is applying semantic segmentation models to a
small labeled satellite imagery dataset rather than a large dataset, which takes time to
label manually. Moreover, further investigation and experiments with various models to
detect different spatial information on a small dataset remain to be explored. The
effectiveness of a model also depends on the kind of spatial feature being detected from
satellite imagery, such as roads, building footprints, land cover, vegetation, and water
bodies.
In similar research [11] on the semantic segmentation of satellite images using a
modified U-Net, the authors proposed a solution for automatic area segmentation that
shows high accuracy for six classes: building, land, road, vegetation, water, and
miscellaneous. The baseline U-Net model is enhanced by incorporating the Inception
ResNet V2 model in its encoder, resulting in increased mathematical and structural
complexity. The performance of this modified model is assessed using the Dice
coefficient and pixel accuracy, yielding values of 82 percent and 87 percent
respectively. However, no comparison is made between their proposed model and other
existing models for semantic segmentation.
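The Dice coefficient reported in such studies can be computed for binary masks as follows; the small example masks are made up for illustration.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice coefficient for binary masks: 2|A ∩ B| / (|A| + |B|).
    `eps` guards against division by zero for empty masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred   = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_coefficient(pred, target)   # 2*2 / (3 + 3) ≈ 0.667
```

Unlike plain pixel accuracy, the Dice coefficient ignores true negatives, which makes it better suited to masks where the class of interest covers few pixels.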
In another similar research [12] on semantic segmentation of aerial images using the U-Net
architecture, the authors highlight that the conventional semantic segmentation process
fails to provide accurate details due to the low resolution of aerial images. To address this
issue, they propose an algorithm based on the U-Net architecture. The U-Net model
comprises two distinct paths, namely the compression path (encoder) and the symmetric
expanding path (decoder). The encoder is made up of a stack of convolutional and maximal
pooling layers and is responsible for capturing the image's context. On the other hand, the
decoder utilizes transposed convolutions to enable precise localization and dense
prediction. Unlike traditional models, U-Net is an end-to-end fully convolutional
network: it contains no dense layers and can therefore handle images of any size. The
effectiveness of the proposed U-Net model was evaluated by comparing its accuracy with
that of previous methods using dense prediction to enhance an image; the prediction of
pixels in the border region was accurate and fast [12]. However, there is no comparison
between the proposed algorithms showing which one gives better results for detecting
each spatial feature.
One of the most common tasks in this domain is detecting roads. In this paper [14], a novel
deep convolutional neural network was introduced for road extraction from high-resolution
remote sensing imagery. The model utilized a U-Net architecture and incorporated
DenseNet as the feature extractor, resulting in improved accuracy of road network
extraction by capturing both local and global road information. Comparative evaluations
were conducted against state-of-the-art semantic segmentation methods, including FCN,
U-Net, and DeepLab V3+. The experimental results demonstrated the proposed network's
accurate and effective road extraction capabilities, surpassing other machine learning
and deep learning approaches in terms of precision, recall, and harmonic mean. However,
the authors did not compare their results with other models or with previous research
using both small and large labeled image datasets.
Another common task in this field is extracting land cover. In this study [15], the authors used the U-Net architecture for image segmentation with different encoders. Their results demonstrate an accuracy of 82.2% and improved precision in extracting buildings within high-density residential areas. Another notable finding is that the Unet-ResNet50 model surpasses the standard U-Net model in data comprehension. Nonetheless, comparing the proposed model with other models, such as the modified U-Net, is not explored in their research.
In this research on extracting building footprints [16], the authors introduced a U-Net-based approach for semantic segmentation to accurately extract building footprints from satellite images. The U-Net model was enhanced by integrating multiple strategies, including data augmentation, data processing techniques, and the integration of GIS map data with satellite images. The proposed method achieved a notable F1-score of 0.704, surpassing the top three solutions in the SpaceNet Building Detection Competition by 1.1% to 12.5%. Moreover, it outperformed the standard U-Net-based method by 3.0% to 9.2%. Although remarkable results were achieved, the authors still did not compare their results with other semantic segmentation models.
This research [19] compares U-Net, Modified U-Net, and Dense-Attention Network (DAN) models for building extraction from TripleSat imagery. The
authors evaluated the performance of these architectures in Kuala Lumpur, Malaysia using
0.8m resolution TripleSat imagery. The modified U-Net achieved the highest accuracy with
an average F1-score of approximately 82.45%, surpassing both DAN with an F1-score of
80.68% and U-Net with an F1-score of 79.82%. Notably, DAN exhibited superior accuracy in
predicting larger buildings, while the modified U-Net demonstrated exceptional precision
for small and medium-sized buildings. These findings highlight the modified U-Net as the
most effective for building extraction, with the DAN excelling in accuracy for larger buildings
in 0.8m resolution TripleSat imagery.
Another comparison, conducted in this research [21], evaluates various Convolutional Neural Network (CNN) architectures for satellite image segmentation. The authors employed CNNs
to detect geo-objects on satellite images obtained from DSTL, Landsat-8, and PlanetScope
databases. They experimented with three modified versions of CNN architectures to
implement the recognition algorithm. The efficiency of the developed algorithms was
evaluated using aerial photos from the DSTL database. The results revealed that utilizing
complex CNN architectures led to improved segmentation quality for satellite images.
Another important task in this domain, for farming, is detecting vegetation and trees in high-resolution satellite images. In this research [22], the authors used the U-Net
convolutional neural network architecture and concluded that U-Net offers a means to
leverage the visual strength of texture and local spatial structure in high-resolution satellite
imagery. The research has demonstrated its capability to accurately identify the presence or
absence of trees and large shrubs across expansive landscape areas, achieving an accuracy
of approximately 90%.
There are also related studies on water body extraction, such as this research [20], in which the authors introduce a novel approach for segmenting water surfaces from
satellite images using convolutional neural networks. They investigate the application of a
U-Net model and a transfer knowledge-based model. Two different deep-learning methods
are compared for water body segmentation. The first approach explores variations of the U-
Net architecture, while the second approach utilizes distillation to enhance the U-Net
response when training images are limited. The findings reveal that the overall performance
of both models is similar.
The satellite imagery research domain has not made as much progress as other domains due
to the scarcity of large-scale labeled datasets. However, a notable study in this field utilized
OpenStreetMap (OSM) data that is publicly available to train segmentation models for aerial
imagery [9]. Although OSM data is annotated by volunteers over different images and may
not be as precise as manually labeled data for training, the study demonstrated that a
considerable amount of weakly labeled training data could compensate for the lack of high-
quality training data [2]. Therefore, achieving high, satisfactory results when detecting spatial data from a small labeled dataset is crucial, and this is part of what is investigated in this research.
2.6 Conclusion
Most of the previous studies implemented in this domain of detecting spatial information
from satellite imagery proposed a U-Net model for semantic segmentation of satellite
images to detect and extract spatial data such as roads, land cover, or building footprints.
Although the similar studies [11, 12] above used the U-Net and Modified U-Net architectures with the same dataset that is used in this research, neither compares the two proposed models to determine which gives better results for detecting each spatial feature [13].
In this research, various models such as U-Net and modified U-Net are investigated, with further customization of the models' architecture to better detect various spatial features from satellite imagery. This paper also investigates different methods and experiments for the best classification and segmentation of spatial features in a small labeled dataset of satellite imagery. Moreover, a comparison of U-Net and Modified U-Net is conducted to show which one works better for which kind of spatial feature.
Chapter 3
3 Requirements Analysis
The outcome of this research is to detect spatial information with high accuracy from a small,
labeled dataset of satellite imagery using semantic segmentation. Although this is a
research-oriented project, the requirements are divided into functional and non-functional.
The functional requirements of this research project are shown in Table 3.1 with MoSCoW requirements analysis (Must, Should, Could, or Won't).
Priorities are also considered for these requirements as high for detecting spatial
information and for applying the U-Net model. Medium priorities are stated for the other
required functions.
The above Table 3.1 shows the functional requirements of the research project with priorities (High, Medium, or Low) and MoSCoW categories (Must, Should, Could, or Won't).
The non-functional requirements for this research project, such as performance and accuracy, are shown in Table 3.2 with MoSCoW requirements analysis, which stands for Must, Should, Could, and Won't, as described in detail in the previous section.
Priority is also considered for these requirements: high for detecting highly accurate results with high F1-scores, and medium for the model running fast with high performance.
The above Table 3.2 shows the non-functional requirements of this research project with priorities (High, Medium, or Low) and MoSCoW (Must, Should, Could, or Won't).
The evaluation metrics described in section 4.3 will be used to evaluate the first requirement stated in Table 3.2 above: measuring the accuracy of the proposed model in detecting spatial information from satellite imagery with highly accurate results and high pixel-wise accuracy.
Chapter 4
4 Methodology
This chapter of the research describes how the aim and objectives will be obtained. It starts
with a section describing the dataset that will be used, followed by the proposed
methodology and approach. The third section is the evaluation metrics by which results are
evaluated against other findings. Lastly, there is a section for technologies and tools that will
be used in this research project.
4.1 Dataset
The dataset utilized in this research was annotated as part of a collaborative project with
the Mohammed Bin Rashid Space Center in Dubai, UAE. It comprises aerial imagery of Dubai
captured by MBRSC satellites and has been annotated with pixel-wise semantic segmentation across six classes [13].
The dataset includes 72 images grouped into 8 larger tiles. Each satellite tile has a corresponding mask image with color labels for the landmarks. The list of landmarks defines the classes the model can classify in the satellite images. Each satellite tile is further divided into 2×2, 3×3, or 4×4 grids of images, and sometimes 1×3 or 4×5 grids.
Figure 4. 1: Labeled classes in various colors from the satellite image dataset [13]
Figure 4. 2: Sample of satellite image and corresponding mask from the dataset [13]
Overall, the Dubai Semantic Segmentation dataset is a valuable resource for researchers and
practitioners in the computer vision community. Its high level of detail and accuracy make it
a useful tool for a wide range of applications.
In this research project, the Cross-Industry Standard Process for Data Mining (CRISP-DM) is
adopted as a structured approach to guide the process. It provides a framework that outlines
the typical phases, tasks, and interdependencies involved in the project.
The above Figure 4.3 shows that CRISP-DM encompasses six phases, with arrows denoting the critical and commonly observed dependencies between them. The sequence of these phases is not rigid, as projects often iterate and move back and forth between phases based on requirements [31].
The research methodology for this study follows the Waterfall Methodology, wherein each
phase of the research is completed sequentially, with clear and defined steps from initiation
to completion.
Semantic segmentation models such as U-Net and Modified U-Net (Inception ResNetV2U-
Net) models are proposed to be applied to the dataset mentioned in the previous section to
detect the various classes of spatial data. A comparison between both models is going to be
investigated in this research. The methodology considers and investigates the challenge of
this small, labeled dataset to obtain high accuracy without overfitting the training data.
The proposed methodology for this research project involves the following steps:
4. Semantic segmentation:
• Apply the trained model to the test set to perform semantic segmentation on the
satellite imagery.
• Detect the spatial information of interest from the segmented results such as roads,
building footprints, vegetation, land cover, and water bodies.
• Evaluate the accuracy and reliability of the spatial information extraction.
4.3 Evaluation
The requirements of this research project mentioned in Table 3.1 and Table 3.2 are evaluated using pixel-wise intersection over union and pixel accuracy to assess the performance of the semantic segmentation models.
The following evaluation metrics, including the average pixel-wise intersection over union and global accuracy, were used in this research project to assess the performance of the semantic segmentation models:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-Score = (2 × Precision × Recall) / (Precision + Recall)

IoU = TP / (TP + FP + FN)
The Jaccard Index or Intersection over Union (IoU) serves as a common pixel-wise metric
for evaluating segmentation outcomes. It considers true positives (TP), which are the
accurately predicted pixels for a specific class, false positives (FP), representing
incorrectly predicted pixels for a particular class, and false negatives (FN), indicating
pixels that were predicted to not belong to a specific class but actually do. Assuming
there are m images in the dataset, the average IoU can be calculated as follows [17]:
mIoU = (1/m) Σ_{i=1}^{m} IoU_i
Global accuracy = (TP + TN) / (TP + TN + FP + FN)
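The metrics above can all be computed directly from the four pixel counts; the following is a minimal sketch in plain Python (the function names are illustrative, not from the thesis code):

```python
def precision(tp, fp):
    # Fraction of predicted-positive pixels that are truly positive.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of truly positive pixels that were predicted positive.
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def iou(tp, fp, fn):
    # Jaccard index (IoU) for one class.
    return tp / (tp + fp + fn)

def global_accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def mean_iou(per_image_ious):
    # Average IoU over the m images in the dataset (mIoU).
    return sum(per_image_ious) / len(per_image_ious)
```

In practice these counts are accumulated per class over all pixels of the test set before the ratios are taken.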
Table 4.1 shows the confusion matrix, which typically has four entries (TP, FN, FP, and TN) that are used to calculate the above equation for global accuracy [39].
                          Predicted class
                          Positive    Negative
Actual class  Positive    TP          FN
              Negative    FP          TN

Table 4.1: Confusion matrix [39]
where:
True Positive (TP) represents the cases where the model predicted positive, and the
actual class was positive.
False Positive (FP) represents the cases where the model predicted positive, but the
actual class was negative.
False Negative (FN) represents the cases where the model predicted negative, but the
actual class was positive.
True Negative (TN) represents the cases where the model predicted negative, and the
actual class was negative.
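For a multi-class segmentation map, the four entries of Table 4.1 can be derived per class by treating that class as "positive" and all others as "negative". A small illustrative sketch (names are mine, not the thesis code):

```python
import numpy as np

def confusion_counts(pred, truth, cls):
    # pred, truth: integer label maps of equal shape; cls: the class
    # treated as "positive" for this one-vs-rest confusion matrix.
    pred_pos = (pred == cls)      # pixels predicted as this class
    true_pos = (truth == cls)     # pixels actually of this class
    tp = int(np.sum(pred_pos & true_pos))
    fp = int(np.sum(pred_pos & ~true_pos))
    fn = int(np.sum(~pred_pos & true_pos))
    tn = int(np.sum(~pred_pos & ~true_pos))
    return tp, fp, fn, tn

pred  = np.array([[0, 1], [1, 2]])   # toy 2x2 predicted label map
truth = np.array([[0, 1], [2, 2]])   # toy 2x2 ground-truth label map
tp, fp, fn, tn = confusion_counts(pred, truth, 1)
# For class 1: tp=1, fp=1, fn=0, tn=2
```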
The following are the loss functions that are used to measure how well the model is
performing and to adjust the model's hyper-parameters to achieve highly accurate and
reliable segmentation results.
2. Focal Loss:
The Focal Loss addresses the issue of class imbalance in semantic segmentation tasks.
It down-weights easy-to-classify pixels and reduces the influence of background
pixels that dominate the training. For a single pixel in the semantic segmentation
task, the equation is as the following [41]:
FL(p, q) = - Σ( α * (1 - p_i)^γ * q_i * log(p_i) )
3. Total Loss:
Total loss can be calculated using the equation below:
total_loss = dice_loss + (1 × focal_loss)
Dice loss, which contributes to the total loss, uses the Dice coefficient (F1 score) to measure the similarity between the predicted segmentation mask and the ground truth mask [41]:

Dice Loss = 1 − (2 × Σ_{i=1}^{N} p_i g_i + ε) / (Σ_{i=1}^{N} p_i + Σ_{i=1}^{N} g_i + ε)
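A hedged NumPy sketch of the three loss functions above follows. The thesis uses Keras/TensorFlow implementations, so the function names and the alpha/gamma defaults here are illustrative assumptions, not the thesis code:

```python
import numpy as np

def dice_loss(p, g, eps=1e-6):
    # p: predicted probabilities, g: binary ground truth (flattened arrays).
    intersection = np.sum(p * g)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(p) + np.sum(g) + eps)

def focal_loss(p, q, alpha=0.25, gamma=2.0):
    # Down-weights easy-to-classify pixels via the (1 - p)^gamma factor.
    p = np.clip(p, 1e-7, 1.0 - 1e-7)   # avoid log(0)
    return -np.sum(alpha * (1.0 - p) ** gamma * q * np.log(p))

def total_loss(p, g):
    # total_loss = dice_loss + (1 * focal_loss), as in the text.
    return dice_loss(p, g) + 1.0 * focal_loss(p, g)
```

A perfect prediction drives both terms towards zero, while mismatched masks push the Dice term towards one.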
The technologies and tools used for this research project are as follows:
• Interactive Development Environment: Google Colab Pro is used with a GPU and high RAM to write and run code, visualize data, and document work in a single notebook.
• Deep Learning Framework: To implement the deep learning model for semantic
segmentation, popular frameworks such as TensorFlow and Keras are used.
• Geographic Information Systems (GIS) software: QGIS and ArcMap are used for
preprocessing and visualizing satellite imagery.
Chapter 5
5 Implementation
In this research, two deep learning models are applied to a small dataset of satellite imagery
to evaluate their results throughout various experiments implemented with each model.
Before training the models, data processing and preparation techniques are applied to the
images of the dataset to prepare them for the models. At the end of the implementation, the best-performing model is proposed, which can then be used to detect spatial features from satellite images.
The source code has been implemented in Python within the Google Colab environment, using machine learning libraries such as TensorFlow and Keras with GPU acceleration rather than CPU. The GPU was chosen because the applied deep learning models are complex and require significant computational power; running the computations on the GPU leads to faster training times and improved efficiency.
In this research, a small dataset of Dubai satellite imagery was used, annotated with pixel-wise semantic segmentation across six classes, as described briefly in section 4.1. The dataset contains 72 images grouped into 8 tiles. Each satellite tile has a corresponding mask image with color labels for the 6 spatial features. Each tile is further divided into 2×2, 3×3, or 4×4 grids of images, and sometimes 1×3 or 4×5 grids [13].
Various methods and techniques have been used to process and prepare the tile images of the dataset, as described in the following steps:
• The images have been split using the Patchify library into small patches of a given patch size; the patches can later be merged back into the full satellite image. Dividing the image into smaller patches is important since it reduces memory consumption, facilitates the handling of large images, and preserves spatial information near boundaries.
• The images have been normalized using MinMaxScaler to set the pixel values within the range [0, 1]. This standardization ensures stability for the semantic segmentation models used in this research by giving all pixels equal influence. Moreover, it makes the data compatible with loss functions and activation functions that operate on the range [0, 1].
• All tile and mask images are processed to have sizes that are multiples of the patch size. Images are split into patches, and each patch is converted into a NumPy array. Each image patch is processed individually, with normalization and removal of the extra unnecessary dimension.
• Due to the diverse range of image sizes, consisting of both large and small images, the
image processing approach involved cropping the images to the nearest size divisible
by 256.
• Subsequently, all images were further subdivided into patches with dimensions of
256x256x3.
• The hex colors of the spatial features were converted to RGB, and the RGB values were then mapped to labels from 0 to 5. Therefore, each spatial feature in the masks is labeled from 0 to 5.
• Finally, the dataset is split into 80% training data and 20% testing data. The random state is set to 100 to ensure reproducible and consistent data splitting across different runs and experiments, making it easier to compare and validate results properly.
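The preparation steps above can be sketched end to end as follows. The thesis uses the Patchify library and sklearn's MinMaxScaler; both are emulated here with plain NumPy for a self-contained illustration, and all function names are mine:

```python
import numpy as np

PATCH = 256

def crop_to_multiple(img, size=PATCH):
    # Crop height and width down to the nearest multiple of the patch size.
    h = (img.shape[0] // size) * size
    w = (img.shape[1] // size) * size
    return img[:h, :w]

def split_into_patches(img, size=PATCH):
    # Split an H x W x 3 tile into non-overlapping size x size x 3 patches.
    rows, cols = img.shape[0] // size, img.shape[1] // size
    return [img[r*size:(r+1)*size, c*size:(c+1)*size]
            for r in range(rows) for c in range(cols)]

def minmax_normalize(patch):
    # Scale pixel values into [0, 1], as MinMaxScaler does.
    p = patch.astype(np.float64)
    return (p - p.min()) / (p.max() - p.min())

def rgb_to_label(mask, palette):
    # Map each class colour (R, G, B) in the mask to a label 0..5.
    labels = np.zeros(mask.shape[:2], dtype=np.uint8)
    for idx, colour in enumerate(palette):
        labels[np.all(mask == colour, axis=-1)] = idx
    return labels

tile = np.random.randint(0, 256, size=(600, 520, 3))
patches = [minmax_normalize(p)
           for p in split_into_patches(crop_to_multiple(tile))]
# A 600x520 tile crops to 512x512, giving a 2x2 grid of 256x256x3 patches.
```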
In this research, U-Net Model, and Inception ResNetV2U-Net model were applied to the
dataset to detect the spatial classes from the dataset.
The structure of the U-Net model is explained in section 2.4.2, with its architecture shown in Figure 2.6. U-Net is a U-shaped structure with an encoder path to capture features and a decoder
path to produce the final segmentation map. Skip connections enable the model to retain
spatial information and handle objects at different scales. All convolutional layers use the
ReLU activation, while the final layer uses the Softmax activation function to convert the model's output
into probability scores for each class. Through training, the model adjusts these probabilities
to align with the ground truth segmentation masks, enabling accurate pixel-level
classification [8].
The first decoder block (u6) upsamples the deepest encoder feature map and concatenates it with the corresponding feature map from the encoder path (c4) using the
concatenate function. The concatenated feature map is subsequently subjected to two 3x3
convolutional layers with ReLU activation (c6). This process is iteratively carried out for
subsequent blocks (u7 to u9) using the feature maps from the encoder path (c3, c2, c1) until
the original spatial dimensions are restored [8].
4. Skip Connections:
The model incorporates skip connections by concatenating feature maps from the encoder
path with their corresponding upsampled feature maps in the decoder path. These skip
connections facilitate the retention of fine-grained spatial information from earlier layers,
contributing to precise segmentation.
5. Output:
The final layer consists of a 1x1 convolutional layer with softmax activation, responsible for
generating pixel-wise probability maps for each class. The number of classes is determined
by the 'n_classes' parameter specified during model creation, offering adaptability for
various segmentation tasks.
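The decoder step described above, upsampling a deep feature map and concatenating it with the matching encoder map (e.g., c4), can be sketched at the shape level as follows. This is a hedged NumPy illustration: nearest-neighbour upsampling stands in for the model's transposed convolutions, and the tensor sizes are hypothetical:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of an H x W x C feature map
    # (a stand-in for the 2x2 transposed convolution in the real model).
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decoder_step(deep, skip):
    up = upsample2x(deep)                        # restore spatial resolution
    return np.concatenate([up, skip], axis=-1)   # skip connection (concat)

deep = np.zeros((16, 16, 256))   # deeper feature map (hypothetical sizes)
c4   = np.zeros((32, 32, 128))   # matching encoder feature map
u6   = decoder_step(deep, c4)
# u6 has shape (32, 32, 384): the spatial size of c4, channels of both maps
```

This shape arithmetic is why each encoder feature map must spatially match its upsampled counterpart before concatenation.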
The second model applied to the dataset is a modified version of the U-Net model: the Inception ResNetV2U-Net. This architecture integrates the Inception ResNetV2 model into the U-Net design as the contracting path. To address degradation and reduce
training time, it incorporates multiple-sized convolutions with residual connections within
the Inception ResNetV2 block. It also combines the Residual connection and Inception
frameworks. The architecture includes encoder-to-decoder linkages and feature map
concatenation to enhance localization information. Rather than relying on a single high-
dimensional convolutional filter, the model employs multiple blocks with lower levels of
convolution to retain the data's dimensionality. The expansion process involves increasing
the size of feature maps through up-sampling, followed by convolutions and rectified linear
functions. The architecture is finalized by incorporating a fully connected layer for
categorization, resulting in the complete Modified U-Net model. The main objective of this approach is to merge the benefits of both the Inception ResNetV2 and U-Net architectures.
2. Encoder:
The model employs the pre-trained InceptionResNetV2 as the encoder, which consists
of multiple layers to extract high-level features from the input image. Additionally, four
intermediate feature maps (s1, s2, s3, and s4) are obtained from specific layers of the
InceptionResNetV2 model.
3. Bridge Connection:
To facilitate information flow between the encoder and decoder, a bridge feature map
(b1) is derived from a particular layer in the InceptionResNetV2 model.
4. Decoder:
The decoder comprises four blocks: Decoder Block 1, Decoder Block 2, Decoder Block 3,
and Decoder Block 4. Each block takes the bridge feature map (b1) and various encoder
feature maps (s4, s3, s2, s1). The bridge feature map is upsampled and concatenated
with the respective encoder feature map before being processed through a conv_block
function to refine features and capture additional context.
5. Output:
The output from Decoder Block 4 undergoes a dropout layer to mitigate overfitting.
Following that, a 1x1 2D convolutional layer with six output channels is applied. The
softmax activation function is used on this layer to generate a six-channel segmentation
map, representing the probability of each pixel belonging to one of the six classes [40].
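The behaviour of this output layer can be sketched minimally: a channel-wise softmax turns the six raw scores per pixel into class probabilities. This is an illustration with random logits, not the thesis code:

```python
import numpy as np

def softmax_channels(logits):
    # logits: H x W x n_classes raw scores from the final 1x1 convolution.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

logits = np.random.randn(4, 4, 6)    # hypothetical 4x4 map, 6 classes
probs = softmax_channels(logits)     # per-pixel probabilities, sum to 1
labels = probs.argmax(axis=-1)       # per-pixel class labels 0..5
```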
In this research, 10 experiments were implemented on both the U-Net model and Inception
ResNetV2U-Net model by tuning various hyperparameters with different loss functions and
evaluation metrics. The models’ hyperparameters are optimized to achieve better results
through each experiment.
The first experiment for each model was implemented using the initial hyperparameters optimizer=Adam, loss=total_loss, batch_size=8, verbose=1, epochs=10, and evaluation=jaccard_coef, then tuning the hyperparameters in each experiment until the 5th experiment, which gave better results, as described in more detail in section 6.1. The learning rate is fixed for all experiments at the Adam optimizer's default of 0.001. The dataset is split into 80% training data and 20% testing data, since the dataset is small, containing only 72 images.
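The reproducible 80/20 split with a fixed random state can be sketched as follows. The thesis presumably uses sklearn's train_test_split; this NumPy version shows the same idea, and its function name and details are illustrative:

```python
import numpy as np

def split_dataset(items, test_frac=0.2, seed=100):
    # A fixed seed yields the same shuffle, hence the same split, every run.
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(items))
    n_test = int(len(items) * test_frac)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

train, test = split_dataset(list(range(72)))
# 72 images -> 58 training and 14 testing items, identical across runs
```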
The models’ results were evaluated by using various evaluation metrics such as Dice
Coefficient, average pixel-wise intersection over union (mIoU) called also Jaccard
Coefficient, Cross-Entropy, total loss, and accuracy which are described in detail in section
4.3. The results with the quantitive and qualitative evaluation of these experiments for each
model are discussed with their results in detail in section 6.1. In addition to that, a detailed
comparison with the spatial analysis for the semantic segmentation is conducted between
both models to explore the best model in section 6.2.
Chapter 6
6 Evaluation and Results
Moreover, a discussion is held on the comparison between the final proposed models that achieved better results. Additionally, a baseline comparison is made between this research's results and existing related research papers, as well as against the research objectives.
For this model, 5 experiments are conducted to train the model in a systematic methodology to reach better results for detecting spatial information from this small dataset. In each experiment, various hyperparameters are optimized, up to the 5th experiment, which led to better results. The learning rate used is the Adam optimizer's default of 0.001 for all experiments, with verbose=1.
In this experiment, the following initial parameters are used to train the U-Net Model. The
following table shows the parameters used to train the U-Net Model with its evaluation
results extracted from epoch number 10.
Chapter 6. Evaluation and Results 39
Hyperparameters: loss=total_loss, batch_size=8, epochs=10, evaluation=jaccard_coef
Loss: 0.9332   Accuracy: 0.8103   Coefficient: 0.6223   Val_loss: 0.9346   Val_accuracy: 0.8041   Val_coefficient: 0.6166
Table 6.1: U-Net Experiment #1 evaluation
The evaluation metrics in the table above show that the results with these initial parameters are not optimal and need to be improved by tuning the hyperparameters.
The diagrams above compare the training and validation data with respect to IoU (Jaccard coefficient), loss, and accuracy, all of which need to be optimized.
The depicted figures and diagrams above indicate inadequate segmentation of the spatial
features in terms of semantic segmentation. Therefore, in the forthcoming experiments, a
deliberate adjustment of hyperparameters will be undertaken with the aim of achieving
improved outcomes.
In this experiment, Dice Coefficient is used instead of Jaccard Coefficient to evaluate the
model results with another coefficient. The following table shows the parameters used to
train the U-Net Model with its evaluation results extracted from epoch number 10.
Hyperparameters: loss=total_loss, batch_size=8, epochs=10, evaluation=dice_coef
Loss: 0.9316   Accuracy: 0.8166   Coefficient: 0.7725   Val_loss: 0.9611   Val_accuracy: 0.7272   Val_coefficient: 0.6850
Table 6.2: U-Net Experiment #2 evaluation
The table above shows that the Dice coefficient for the training and validation data is higher than the Jaccard coefficient from experiment #1. Therefore, the Dice coefficient will be used in the next experiments.
The diagram above confirms that the Dice coefficient for the validation and training datasets is better than the Jaccard coefficient from experiment #1, and it can be improved further by tuning other hyperparameters.
In this experiment, cross-entropy loss is used instead of the total loss (the sum of focal loss and dice loss). The following table shows the parameters used to train the U-Net model, with its evaluation results extracted from epoch number 10.
Hyperparameters   Loss   Accuracy   Coefficient   Val_loss   Val_accuracy   Val_coefficient
The table above shows that the cross-entropy validation loss is around 30% lower than the total loss used in the previous experiment. This suggests that cross-entropy works better than total loss for this model, as it is preferable to reduce the loss to below 0.5. Therefore, cross-entropy will be used in the upcoming experiments.
The diagram above shows that the cross-entropy loss for the validation and training datasets is better than the total loss from experiment #2, and it can be improved further by tuning other hyperparameters. In the next experiments, cross-entropy will be used instead of total loss.
In this experiment, the number of epochs is increased from 10 to 50 and the batch size from 8 to 16. The following table shows the parameters used to train the U-Net model with its evaluation results.
Hyperparameters: loss=cross-entropy, batch_size=16, epochs=50, evaluation=dice_coef
Loss: 0.3205   Accuracy: 0.8866   Coefficient: 0.8380   Val_loss: 0.5585   Val_accuracy: 0.8366   Val_coefficient: 0.7974
Table 6.4: U-Net Experiment #4 evaluation
The table above shows that the loss, coefficient, and accuracy are much better for both training and validation when the number of epochs is increased from 10 to 50.
The diagram above shows that the evaluation metrics for the training and validation datasets improved over the previous experiment. It also shows that increasing the batch size from 8 to 16 reduces the model's capacity to generalize.
The figure above shows the semantic segmentation of sample images predicted by the U-Net model. The results can still be improved by tuning some parameters; therefore, in the next experiment, the number of epochs will be increased to 100 to try to obtain higher results with efficient segmentation for all 6 classes of spatial features.
In this final experiment, the number of epochs is increased from 50 to 100, and the batch size is decreased from 16 back to 8, since increasing the batch size reduced the model's capacity to generalize in the previous experiment. The following table shows the parameters used to train the U-Net model with its evaluation results.
The table above shows that the loss, coefficient, and accuracy for training and validation improved when the number of epochs was increased from 50 to 100. The validation accuracy reaches around 86.5% and the validation coefficient 84%, which is higher than in the previous experiments.
The diagram above shows that the loss, coefficient, and accuracy for training and validation improved over the last experiment. Although the gap between the validation loss and the training loss increased after epoch 60, the accuracy and coefficient improved for both training and validation.
The figure above shows that the semantic segmentation improved over the previous experiment: the spatial features are predicted properly, matching the mask images. Moreover, the last image in Figure 6.8 shows that the model segmented roads that are not present in the mask image. This means that the model's results are acceptable for this experiment.
The figure above shows the activation heat map for the first image in Figure 6.8, created for the final proposed U-Net model to understand the activation and gradient output for one satellite image from the validation dataset. Heat maps facilitate understanding of the regions that contribute the most to the U-Net model's decision-making when predicting the spatial classes. Hence, these heatmaps show that the U-Net model correctly segments spatial classes such as roads, vegetation, and buildings.
The U-Net model used in experiment #5 is the final proposed U-Net model, since the validation accuracy is high at 86.5% and the validation coefficient at 84%, with improved semantic segmentation results for the spatial features, as shown in Figure 6.8.
For this model, 5 experiments are conducted to train the model in a systematic methodology to reach better results for detecting spatial information from the dataset. In each experiment, various hyperparameters were tuned until the 5th experiment, which led to the best results. The learning rate used is the Adam optimizer's default of 0.001 for all experiments, with verbose=1.
In this experiment, the following initial parameters are used to train the Inception
ResNetV2U-Net model. The following table shows the parameters used to train this Model
with its evaluation results extracted from epoch number 10.
Hyperparameters: loss=total_loss, batch_size=8, epochs=10, evaluation=jaccard_coef
Loss: 0.9053   Accuracy: 0.8869   Coefficient: 0.7423   Val_loss: 0.9179   Val_accuracy: 0.8487   Val_coefficient: 0.7052
Table 6.6: ResNetV2U-Net Experiment #1 evaluation
The evaluation metrics in the table above show that the results with these initial parameters are not optimal and need to be improved by tuning the hyperparameters. The diagrams above compare training and validation in terms of IoU (Jaccard coefficient), loss, and accuracy, all of which need to be improved.
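The IoU (Jaccard coefficient) metric used in this experiment can be implemented in TensorFlow roughly as follows; the smoothing constant is an assumption, added to avoid division by zero, and is not taken from the thesis code:

```python
import tensorflow as tf

def jaccard_coef(y_true, y_pred, smooth=1.0):
    """Jaccard coefficient (intersection over union) for soft masks."""
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    union = tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) - intersection
    return (intersection + smooth) / (union + smooth)
```

Passed to `model.compile(metrics=[jaccard_coef])`, this yields the Coefficient and Val_coefficient columns reported in the tables.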
The figures and diagrams above indicate inadequate semantic segmentation of the spatial features. Therefore, in the forthcoming experiments, hyperparameters will be deliberately adjusted with the aim of achieving improved outcomes.
In this experiment, the Dice coefficient was used instead of the Jaccard coefficient to evaluate the Inception ResNetV2U-Net model with another coefficient. The table below shows the parameters used to train this model, with the evaluation results extracted from epoch number 10.
Hyperparameters: loss = total_loss, batch_size = 8, epochs = 10, evaluation = dice_coef

Loss    Accuracy  Coefficient  Val_loss  Val_accuracy  Val_coefficient
0.9041  0.8870    0.8518       0.9183    0.8507        0.8246

Table 6.7 ResNetV2U-Net Experiment #2 Evaluation
The results above show that the Dice coefficient is higher than the Jaccard coefficient from experiment #1 by around 10% for both training and validation.
The diagram above shows that the Dice coefficient for the validation and training datasets is better than the Jaccard coefficient from experiment #1. In addition, the results can be improved further by tuning other hyperparameters. Therefore, the Dice coefficient will be used instead of the Jaccard coefficient in the next experiments.
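The Dice coefficient used from this experiment onward can be sketched as follows (again with an assumed smoothing constant). Note that for the same prediction the Dice score D and Jaccard score J satisfy D = 2J/(1+J) >= J, which is consistent with the roughly 10% higher coefficient values observed here:

```python
import tensorflow as tf

def dice_coef(y_true, y_pred, smooth=1.0):
    """Dice coefficient for soft masks; `smooth` avoids division by zero."""
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)
```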
In this experiment, cross-entropy loss is used instead of the total loss, which is the sum of focal loss and dice loss. The table below shows the parameters used to train the Inception ResNetV2U-Net model, with its evaluation results.
Hyperparameters: loss = cross-entropy, batch_size = 8, epochs = 10, evaluation = dice_coef

Table 6.8 ResNetV2U-Net Experiment #3 Evaluation
The table above shows that the cross-entropy validation loss is lower than the total loss used in the previous experiment by around 50%. This means that cross-entropy works better than the total loss with this model, since the loss decreases to below 0.5. Therefore, cross-entropy will be used in the upcoming experiments.
The diagram above shows that the cross-entropy loss for the validation and training datasets is better than the total loss from experiment #2. In addition, the results can be improved further by tuning other hyperparameters. In the next experiments, cross-entropy will be used instead of the total loss.
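For reference, the total loss replaced here (focal loss plus dice loss) can be sketched as follows; the gamma value follows the focal-loss paper [41], and the smoothing/clipping constants are assumptions rather than values from the thesis code:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    # 1 - Dice coefficient, so that a perfect overlap gives zero loss.
    intersection = tf.reduce_sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    # Down-weights easy examples; gamma = 2 follows Lin et al. [41].
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    return -tf.reduce_mean(
        y_true * tf.pow(1.0 - y_pred, gamma) * tf.math.log(y_pred))

def total_loss(y_true, y_pred):
    # The combined loss used in experiments #1 and #2 above.
    return focal_loss(y_true, y_pred) + dice_loss(y_true, y_pred)
```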
In this experiment, the number of epochs is increased from 10 to 50 and the batch size from 8 to 16; note that increasing the batch size can reduce a model's capacity to generalize. The table below shows the parameters used to train this model, with the evaluation results extracted from the final epoch.
Hyperparameters: loss = cross-entropy, batch_size = 16, epochs = 50, evaluation = dice_coef

Loss    Accuracy  Coefficient  Val_loss  Val_accuracy  Val_coefficient
0.0721  0.9720    0.9594       0.6438    0.8784        0.8725

Table 6.9 ResNetV2U-Net Experiment #4 Evaluation
The table above shows that the loss, coefficient, and accuracy for both training and validation are much better when the number of epochs is increased from 10 to 50, compared with the previous experiment.
Figure 6.13 shows that the evaluation metrics for the training and validation datasets improved over the previous experiment. It also suggests that increasing the batch size from 8 to 16 can reduce the model's capacity to generalize.
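The training call for this experiment can be sketched as below, using a toy stand-in model purely to show the fit() hyperparameters (batch_size = 16, 50 epochs); the real model is the Inception ResNetV2U-Net, and all data/model names here are illustrative:

```python
import numpy as np
import tensorflow as tf

# Toy stand-in model; only the fit() hyperparameters mirror experiment #4.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

x = np.random.rand(64, 8).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 6, 64), 6)

history = model.fit(x, y, validation_split=0.25,
                    batch_size=16, epochs=50, verbose=0)
# history.history holds the per-epoch loss, accuracy, val_loss and
# val_accuracy values plotted in the training/validation diagrams.
```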
The figure above shows the semantic segmentation of sample images predicted by the Inception ResNetV2U-Net model. The results can still be improved by tuning some parameters. Therefore, in the next experiment, the number of epochs will be increased to 100 to try to obtain better results with efficient segmentation for all six classes of spatial features.
In this final experiment, the number of epochs is increased from 50 to 100, while the batch size remains 16. The table below shows the parameters used to train this model, with the evaluation results extracted from the final epoch.
The table and diagrams above show that the loss, coefficient, and accuracy for training and validation improved when the number of epochs was increased from 50 to 100. The validation accuracy reaches around 87.5% and the validation coefficient 87%. The validation loss is 0.73, which is still high and can lead to errors in the semantic segmentation.
The figure above shows that the semantic segmentation improved over the previous experiment, but there are still some errors in predicting and segmenting spatial features such as roads and vegetation.
In the previous section 6.1, ten experiments were conducted across the U-Net model and the Inception ResNetV2U-Net model to find the best results for each model. In the upcoming sections, quantitative and qualitative evaluations of both models are discussed. The evaluation metrics used to assess the semantic segmentation models are described in detail in section 4.3. The best training results for each model were obtained in the 5th experiment, as per the metric scores in section 6.1. Therefore, a comparison between both models from experiment #5 is conducted to identify the best model based on the evaluation metrics. The 5th experiment's hyperparameters are the same for both models: optimizer = adam, loss = cross-entropy, batch_size = 8, verbose = 1, evaluation coefficient = jaccard_coef, and epochs = 100.
Model / Evaluation Metric    U-Net     Inception ResNetV2U-Net
Loss                         0.1974    0.0415
Accuracy                     0.9307    0.9834
Coefficient                  0.8985    0.9758
Val_loss                     0.5748    0.7337
Val_accuracy                 0.8645    0.8751
Val_coefficient              0.8409    0.8720
Accuracy Diagram
Dice-Coefficient Diagram
The table above shows that although the accuracy and coefficient for the training and validation data of the U-Net model are slightly lower than their counterparts for Inception ResNetV2U-Net, the training and validation loss of the U-Net model is much better than that of the ResNetV2U-Net model. Moreover, the average time taken to predict sample satellite imagery with the U-Net model is less than the time taken by Inception ResNetV2U-Net. Prediction time is crucial when working with a large satellite imagery dataset.
Therefore, the U-Net model is more suitable for a small satellite dataset: it gave a high validation accuracy and coefficient of 86.5% and 84% respectively, with an acceptable loss score, and its average time to predict one image is lower than that of the other model.
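The average per-image prediction time discussed above can be measured with a simple timing loop; this is an illustrative sketch, as the thesis does not show its timing code:

```python
import time

import numpy as np
import tensorflow as tf

def mean_prediction_time(model, images, repeats=5):
    """Average wall-clock seconds per predict() call over `repeats` runs."""
    model.predict(images, verbose=0)              # warm-up (graph tracing)
    start = time.perf_counter()
    for _ in range(repeats):
        model.predict(images, verbose=0)
    return (time.perf_counter() - start) / repeats
```

Dividing the result by the number of images in the batch gives a per-image figure for comparing the two models.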
In this section, the semantic segmentation of spatial features in sample images taken from the testing dataset is analyzed for each spatial feature class for both applied models, U-Net and Inception ResNetV2U-Net. This comparison clarifies which model works best for which spatial feature.
The table below shows a sample of the predicted images containing the buildings feature class, with the predicted image from each model.
The figure above shows that both models detect the building feature class correctly. This validates that both models work well for detecting building footprints from satellite imagery, which is essential for applications such as geographic information systems, building base maps, and urban planning.
The table below shows a sample of the predicted images containing the road feature class, with the predicted image from each model.
The figure above shows that the roads are segmented properly by both the U-Net and Inception ResNetV2U-Net models. However, some roads are segmented as unconnected in both models even though they are connected in the ground-truth images. From the perspective of geospatial road mapping, unconnected roads can be interpreted as roads still under construction. The cause of this error is the presence of trees or shadows on the roads, which can mislead the models while predicting this feature from the satellite imagery.
The table below shows a sample of the predicted images containing the water feature class, with the predicted image from each model.
The figure above shows that both models detect the water areas exceptionally well. However, the first image in Table 6.20 shows that the U-Net model is slightly better at detecting the water area than ResNetV2U-Net, since ResNetV2U-Net makes some errors in this region. As indicated in the quantitative evaluation in section 6.2.1, the loss of ResNetV2U-Net is higher than that of the U-Net model, which in turn can lead to errors in detecting spatial features from the satellite imagery.
The table below shows a sample of the predicted images containing the lands (unpaved areas) feature class, with the predicted image from each model.
The figure above shows that both models detect lands or unpaved areas exceptionally well. This validates that both models work well for detecting lands from satellite imagery, which is essential for applications such as urban planning and land administration.
The table below shows a sample of the predicted images containing the vegetation feature class, with the predicted image from each model.
The figure above shows that the U-Net model is better than Inception ResNetV2U-Net at detecting vegetation or green areas, even for small details such as small green areas between other features.
Overall, the U-Net model shows better semantic segmentation results for spatial features such as water areas and vegetation than the Inception ResNetV2U-Net model. The detection of some spatial features by the ResNetV2U-Net model exhibits more errors than the U-Net model, attributable to the higher loss observed for ResNetV2U-Net. Both models could achieve more accurate results given fully accurate masked images for every detail in the satellite imagery and a larger dataset.
In this research, various experiments with detailed quantitative and qualitative analyses were implemented on the U-Net model and the Inception ResNetV2U-Net model to discover which works best with a small dataset of satellite imagery. Moreover, a detailed comparison was conducted between both models' results, along with a geospatial analysis for each feature detected in the satellite imagery to measure each model's performance.
Although the research in [11], implemented on the same dataset, stated that Inception ResNetV2U-Net is better than the U-Net model with a validation accuracy of 87% and a training accuracy of 92%, the Inception ResNetV2U-Net model implemented in our research achieved better results, with a validation accuracy of 87.5% and a training accuracy of 98%. In addition, in this work, the U-Net model proved to be more suitable for a small dataset compared with the other model. Although the proposed U-Net model showed slightly lower accuracy and coefficient values for both training and validation data compared with ResNetV2U-Net, it exhibits significantly better training and validation loss. Additionally, the U-Net model demonstrates a shorter average prediction time for sample satellite imagery than Inception ResNetV2U-Net. Moreover, the proposed U-Net provided better semantic segmentation results for the spatial features. As a result, the U-Net model proves to be more suitable for small satellite datasets.
The U-Net model in this research achieved a high validation accuracy and coefficient of 86.5% and 84%, respectively, with a good loss score, which exceeds the U-Net results of the related research [12] implemented on the same dataset with a validation accuracy of only 77%.
These results therefore compare favorably with [11, 12], which utilized semantic segmentation models on the same dataset.
This research achieved all objectives stated in section 1.3: the U-Net and modified U-Net (Inception ResNetV2U-Net) deep learning models were applied to explore better results for the semantic segmentation of spatial feature classes from a small dataset of satellite imagery. Diverse experiments and thorough geospatial analyses, encompassing quantitative and qualitative evaluations as well as comparisons, were executed to evaluate the performance and results of both models. In addition, a comparison with existing studies was conducted to discuss the results achieved by the proposed model. This work also validated deep learning's effectiveness in domains such as detecting roads, building footprints, vegetation, and water areas for geographic information systems, building base maps, and urban planning applications.
Chapter 7
Conclusion and Future Work
This research project successfully applied two deep learning models, U-Net and a modified U-Net (Inception ResNetV2U-Net), to perform semantic segmentation on a small dataset of satellite imagery. The models exhibited high accuracy in identifying spatial features such as building footprints, land cover, roads, vegetation, and water bodies. Various experiments and comprehensive analyses, incorporating both quantitative and qualitative evaluations and comparisons, were conducted to assess the performance and outcomes of the two models. Additionally, a comparison with existing studies was conducted to highlight the results attained by the proposed model. Both models achieved better results in detecting the spatial features than existing research papers using the same dataset. Despite the slightly better results achieved by the Inception ResNetV2U-Net model, with a validation accuracy of 87.5% and a validation coefficient of 87%, the U-Net model proved more suitable for working with a small dataset. The U-Net model achieved a high validation accuracy and coefficient of 86.5% and 84%, respectively. In addition, the U-Net model exhibited significantly better training and validation loss than ResNetV2U-Net. Furthermore, the U-Net model demonstrated a shorter average prediction time for satellite imagery than Inception ResNetV2U-Net, and it provided more efficient semantic segmentation of the spatial features than the modified U-Net when compared against the existing masked images of the dataset. As a result, the U-Net model proves to be more suitable for detecting spatial information from small satellite datasets with high performance.
Using a small dataset of satellite imagery was one of the main challenges faced while training the models. The challenge lies in performing semantic segmentation on such a limited dataset, since any deep learning model requires a large dataset to learn and provide good results. Moreover, the variety and complexity of spatial features, such as complex buildings or the shadows of buildings and trees in the satellite imagery, make it very hard for the models to detect some spatial features, such as roads, properly. Therefore, providing a large number of masked satellite images for the same dataset is crucial for more accurate results. Such a large annotated and masked satellite dataset should cover various areas and geographic regions, which would help the model learn, predict, and provide better results for the semantic segmentation of complex spatial features.
Furthermore, the dataset's satellite images were manually masked and annotated by humans, which has resulted in wrong masks and segmentation for certain classes in some images. This misleading data can negatively impact the model's learning process, potentially affecting the quality of its predictions. The presence of segmentation errors for some spatial features in the dataset may lead the model to predict some spatial features incorrectly from satellite images. Therefore, it is recommended to generate masked images automatically, by developing a script or software that can generate a mask for each spatial feature in the satellite image without manual intervention.
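As a hedged sketch of such automatic mask generation, polygon annotations (for example, exported from GIS vector data) can be rasterized into per-pixel class masks; the input format below is an assumption for illustration, not a description of the thesis's dataset tooling:

```python
import numpy as np
from PIL import Image, ImageDraw

def rasterize_masks(polygons_by_class, width, height):
    """Render one label mask from per-class polygon lists.

    `polygons_by_class` maps an integer class id to a list of polygons,
    each given as a list of (x, y) pixel coordinates. This input format
    is assumed; a real pipeline would read it from GIS vector data.
    """
    mask = Image.new("L", (width, height), 0)   # 0 = background class
    draw = ImageDraw.Draw(mask)
    for class_id, polygons in polygons_by_class.items():
        for poly in polygons:
            draw.polygon(poly, fill=class_id)   # paint the class id
    return np.array(mask)
```

Generating masks this way removes the manual annotation step that introduced the labeling errors discussed above, at the cost of requiring reliable vector data for each region.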
References
[1] Sisodiya, N., Dube, N., & Thakkar, P. (2020). Next-Generation Artificial Intelligence
Techniques for Satellite Data Processing. International Journal of Advanced Research in
Computer Science and Software Engineering, 10(3), 34-39.
[2] Gupta, A. (2022). Deep Learning for Semantic Feature Extraction in Aerial Imagery.
Journal of Remote Sensing, 14(2), 23-31.
[3] Tuia, D., Wegner, J. D., Hansch, R., Le Saux, B., Yokoya, N., Demir, I., Jacobs, N., Kopersky,
K., Pacifici, F., Raskar, R., Brown, M., & Burhin, M. (2020). EARTHVISION 2020: Large Scale
Computer Vision for Remote Sensing Imagery. Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops (pp. 3823-3826). IEEE.
[4] Ma, L., Liu, Y., Zhang, X., Ye, Y., Yin, G., & Johnson, B. A. (2021). Deep learning in remote
sensing applications: A meta-analysis and review. ISPRS Journal of Photogrammetry and
Remote Sensing, 180, 197-213.
[5] Selea, T., & Neagul, M. (2017). Using Deep Networks for Semantic Segmentation of
Satellite Images. In 2017 19th International Symposium on Symbolic and Numeric Algorithms
for Scientific Computing (SYNASC) (pp. 409-415). IEEE. doi: 10.1109/SYNASC.2017.00074.
[6] Guo, Y., Liu, Y., Georgiou, T., & Lew, M. S. (2018). A review of semantic segmentation
using deep neural networks. International Journal of Multimedia Information Retrieval, 7(2),
87-93.
[7] Shelhamer, E., Long, J., & Darrell, T. (2017). Fully Convolutional Networks for Semantic
Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640-651.
[8] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Lecture Notes in Computer Science, volume 9351, pages 234-241. ISBN 9783319245737. doi: 10.1007/978-3-319-24574-4_28.
[9] Kaiser, P., Wegner, J. D., Lucchi, A., Jaggi, M., Hofmann, T., & Schindler, K. (2017). Learning
Aerial Image Segmentation from Online Maps. IEEE Transactions on Geoscience and Remote
Sensing. doi: 10.1109/TGRS.2017.2719738.
[10] Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., & Garcia-Rodriguez, J.
(2017). A review on deep learning techniques applied to semantic segmentation. arXiv Prepr.
arXiv1704.06857.
[11] Patil, D., Patil, K., Nale, R., & Chaudhari, S. (2022, March). Semantic Segmentation of
Satellite Images using Modified U-Net. In 2022 IEEE Region 10 Symposium (TENSYMP) (pp.
1-6). IEEE. doi: 10.1109/TENSYMP54529.2022.9864504.
[12] Ali, K. and Hussien, S. (2022) Semantic Segmentation of Aerial Images Using U-Net
Architecture.
[13] Human in the Loop. (2023). Semantic Segmentation Dataset. Available at:
https://humansintheloop.org/resources/datasets/semantic-segmentation-dataset-2/
(Accessed: March 9, 2023).
[14] Xu, Y., Xie, Z., Feng, Y., & Chen, Z. (2018). Road extraction from high-resolution remote
sensing imagery using deep learning. Remote Sensing, 10, 1461.
[15] Alsabhan, W., & Alotaiby, T. (2022). Automatic building extraction on satellite images
using Unet and ResNet50.
[16] Li, W., He, C., Fang, J., Zheng, J., Fu, H., & Yu, L. (2019). Semantic Segmentation-Based
Building Footprint Extraction Using Very High-Resolution Satellite Images and Multi-Source
[17] Singh, N. J., & Nongmeikapam, K. (2023). Semantic segmentation of satellite images
using Deep-Unet. Arabian Journal of Science and Engineering, 48, 1193-1205.
[18] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for
biomedical image segmentation. In International Conference on Medical Image Computing
and Computer-Assisted Intervention.
[19] Wang, X., Huang, X., Chen, C., Zhou, B., He, J., & Chen, T. (2019). Comparison between
UNet, modified UNet, and dense-attention network (DAN) for building extraction from
TripleSat imagery.
[20] Gonzalez, J., Sankaran, K., Ayma, V., & Beltran, C. (2020). Application of Semantic
Segmentation with Few Labels in the Detection of Water Bodies from Perusat-1 Satellite’s
Images. In 2020 IEEE Latin American GRSS & ISPRS Remote Sensing Conference (LAGIRS) (pp.
483-487). Santiago, Chile: IEEE. doi: 10.1109/LAGIRS48042.2020.9165643.
[21] Khryashchev, V., Ivanovsky, L., Pavlov, V., Ostrovskaya, A., & Rubtsov, A. (2018).
Comparison of Different Convolutional Neural Network Architectures for Satellite Image
Segmentation. In 2018 23rd Conference of Open Innovations Association (FRUCT) (pp. 172-
179). Bologna, Italy: IEEE. doi: 10.23919/FRUCT.2018.8588071.
[22] Flood, N., Watson, F., & Collett, L. (2019). Using a U-net convolutional neural network
to map woody vegetation extent from high-resolution satellite imagery across Queensland,
Australia. Remote Sensing, 11(13), 1571.
[23] Gupta, A., Watson, S., & Yin, H. (2020). Deep Learning-based Aerial Image Segmentation
with Open Data for Disaster Impact Assessment. arXiv preprint arXiv:2006.05575.
[24] Storie, C. D., & Henry, C. J. (2018). Deep learning neural networks for land use land cover
mapping. In IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing
[25] Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D.,
& Raskar, R. (2018). DeepGlobe 2018: A challenge to parse the earth through satellite
images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 19-26. doi: 10.1109/CVPRW.2018.00009.
[26] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning Book. MIT Press.
[27] Ballard, D. H., Hinton, G. E., & Sejnowski, T. J. (1983). Parallel vision computation.
Nature.
[28] LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D.,
Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications of neural
network chips and automatic learning. IEEE Communications Magazine, 27(11), 41–46.
[29] Livshin, I. (2019). Learning About Neural Networks. In: Artificial Neural Networks with
Java. Apress, Berkeley, CA.
[30] Mohseni-Dargah, M., Falahati, Z., Dabirmanesh, B., Nasrollahi, P., & Khajeh, K. (2022).
Machine learning in surface plasmon resonance for environmental monitoring. In Artificial
Intelligence and Data Science in Environmental Sensing.
[32] Nikparvar, B., & Thill, J. C. (2021). Machine Learning of Spatial Data. ISPRS International
Journal of Geo-Information, 10(9), 600.
[33] Li, W., He, C., Fang, J., Zheng, J., Fu, H., & Yu, L. (2019). Semantic Segmentation-Based
Building Footprint Extraction Using Very High-Resolution Satellite Images and Multi-Source
[35] Mas, J. F. and Flores, J. J. (2008). The application of artificial neural networks to the
analysis of remotely sensed data. International Journal of Remote Sensing, 29(3), 617-663.
[36] Kedzierski, M., Wierzbicki, D., Sekrecka, A., Fryskowska, A., Walczykowski, P., & Siewert,
J. (2019). Influence of Lower Atmosphere on the Radiometric Quality of Unmanned Aerial
Vehicle Imagery. Remote Sensing, 11(10), 1214.
[37] Mhango, J. K., Harris, E. W., Green, R., & Monaghan, J. M. (2021). Mapping Potato Plant
Density Variation Using Aerial Imagery and Deep Learning Techniques for Precision
Agriculture. Remote Sensing, 13(14), 2705.
[38] Singh, V.P., Srivastava, P.K., Agrawal, S., & Kandpal, M. (2020). Automated Road
Extraction from High-Resolution Satellite Images: A Review. Remote Sensing, 12(7), 1194.
[39] Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed.). O'Reilly Media.
[40] Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-
ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence.
[41] Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object
Detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV),
2980-2988.
Appendices
Appendix A: Project Plan
Project Schedule
The research project commenced in January and continued until mid-April, with the research
report being completed during this time. Subsequently, the project is scheduled to resume
in early May. The implementation phase encompasses data preparation and processing,
model building, conducting training and testing experiments, evaluation, measurement, and
dissertation writing. The project's submission is planned for mid-August.
The Gantt chart above shows the project schedule and timeline for all tasks and subtasks of this research project. The project was executed using the waterfall methodology, a sequential approach in which each phase is completed before moving on to the next, as shown in the timeline of Table A.1 above.
Risk Management
In Table A.1 above, the top right corner of the matrix identifies the top risks that should be identified and mitigated. A mitigation strategy is required to prevent them from happening, or to reduce their impact.
To implement risk management, the following steps are performed, as per the ISO risk management process [34]:
1. Identify the potential risks of this research project, as stated in Table A.2 below.
2. Assess the probability of each risk occurring.
3. Assess the impact level of each risk.
4. Set a strategy to deal with the potential issues.
Table A.2 above lists the risks with their likelihood, impact, and mitigation strategy to reduce or avoid the impact. Each risk related to this research project is evaluated based on its impact and likelihood level (low, medium, or high).
Professional issues
The code was developed and tested with meticulous attention to high standards, ensuring
clarity through comprehensive comments. It adheres to the codes of conduct set forth by
the British Computing Society. Third-party libraries and software are utilized only in
compliance with their respective licenses. Proper referencing and citations are included for
any external information used.
Legal issues
The permission to access and download the dataset is granted by filling out the form stated
on the Humans in the Loop website [13]. This dataset is dedicated to the public domain
by Humans in the Loop under CC0 1.0 license.
Ethical issues
As this research-based project does not involve users or sensitive datasets, there is no risk
of breaching ethics codes. Furthermore, safety concerns are not applicable since the project
does not involve users in either the implementation or evaluation process.
Social issues