
Machine Learning for Visual Spatio-Temporal

Data

Seminar Paper on Object Detection in Remote Sensing

Submitted by

Name: Roquia Salam

Student Id Number: 542552

Email address: rsalam@uni-muenster.de

Table of Contents
Abstract
1. Introduction
2. Related Work
3. Method Description: Faster R-CNN
4. Results
4.1 Dataset
4.2 Methodology
4.3 Outputs from the testing stage
4.4 Performance evaluation of the model
4.5 Comparison of the performance of the used model with other models for detecting vehicles
5. Discussion and Conclusion
References

Abstract
Object detection (OD) in remote sensing plays a pivotal role in numerous real-world applications, such as agricultural monitoring, smart city planning, environmental monitoring, and disaster management. However, several significant challenges, including scale variation, occlusion, viewpoint variation, and limited datasets, impact the performance and applicability of OD techniques. With the development of deep learning, this field has improved substantially. This study therefore provides a comprehensive review of the paradigm shift in OD techniques along with their challenges, potential datasets, processing steps, and opportunities. In addition, the widely used Faster R-CNN algorithm, which has gained broad acceptance for producing reliable results with low error rates, is discussed in detail. An application study is then presented to demonstrate the performance of the Faster R-CNN model. In that application, the overall accuracy of the model on various datasets (LISA, DAWN, and CDNet 2014) covering different weather conditions (sunny, cloudy, rainy, foggy, sandstorm, snow, and blizzard) ranged from 89.21% to 95.42%, indicating very good overall performance. The model also outperformed other models, such as YOLOv3, RetinaNet, Mask R-CNN, Casc R-CNN, and TridentNet, performing the same task on the same datasets. The outcome of this study should help scholars select an accurate and reliable model for OD when solving real-world problems.

Keywords: Object Detection, Remote Sensing, Deep Learning, Convolutional Neural Network, Faster R-CNN

1. Introduction
In computer vision, the term "object detection" has gained much popularity in recent decades; it refers to the technique of locating one or more objects in an image or video (Amit et al., 2020). With the advancement of science and technology, the quality and quantity of remote sensing (RS) images have increased significantly, which has accelerated the use of OD by scientists and researchers worldwide to locate objects for many purposes (Li et al., 2020). OD in the RS field makes a major contribution to several vital application areas (Cheng and Han, 2016), for instance, natural hazard and disaster management (Pi et al., 2020); agricultural monitoring (Yi et al., 2021); urban planning (Moschos et al., 2023; Knura et al., 2021); and military operations (Janakiramaiah et al., 2023).

Over the past two decades, OD techniques have shifted from traditional detection to deep learning (DL) based detection for a number of reasons. The period from 1990 to 2014 is known as the conventional detection period, while the period after 2014 has been dominated by DL-based detection (Zou et al., 2023). In the context of OD, several significant challenges have been identified that impact the performance and applicability of DL-based models (Ding et al., 2021). Scale variation poses difficulties in accurately detecting objects of different sizes, while occlusion and cluttered backgrounds hinder precise localization (Lowe, 1999). Viewpoint variation and deformations further challenge a model's ability to recognize objects under varying conditions (Wang et al., 2017). Additionally, limited training data and class imbalance affect a model's generalization and can lead to biased predictions (Ghosh, 2021). Real-time processing requirements demand a careful balance between accuracy and inference speed. Generalization to new environments and anomaly detection in security applications add complexity to the task. Furthermore, computational resource constraints are crucial for the feasibility of practical deployments. Understanding and addressing these challenges is essential for developing robust and effective object detection systems suitable for diverse real-world applications (Ding et al., 2021). For this reason, scholars have continued to propose improved models to overcome these issues.

Because no effective image representation was available, handcrafted features were the basis of early-stage object detection (Zou et al., 2023). In 2001, for the first time in the history of computer vision, Viola and Jones achieved reliable real-time detection of human faces (Dundar et al., 2017). In 2005, Dalal and Triggs proposed a new OD technique, the histogram of oriented gradients (HOG), which was primarily developed for pedestrian detection (Cintra et al., 2018). The HOG detector is regarded as the basis of many object detectors. In 2008, Felzenszwalb et al. (2008) proposed the "Deformable Part-Based Model (DPM)", which is known as the paragon of conventional OD techniques and was later improved by Girshick (2012). For Felzenszwalb and Girshick's invaluable contributions to improving OD, the Pascal Visual Object Classes Challenge (VOC2010) awarded both of them "lifetime achievement" honors in 2010 (Zou et al., 2023). Conventional techniques, however, are not robust enough: they generalize poorly and cannot cope with changing illumination (Zhiqiang and Jun, 2017). Moreover, progress with these techniques was very slow between 2010 and 2012 (Zhiqiang and Jun, 2017), so many researchers began contributing to improving object detection performance. Then a very popular DL algorithm, the convolutional neural network (CNN), emerged as a powerful and successful OD technique because of its much stronger ability to generalize and discriminate among extracted features (Szegedy et al., 2013).

In 2012, Krizhevsky et al. (2012) applied a CNN to a classification problem and found that it produced an error rate of 15.3%, a much better performance than the conventional techniques (whose error rate was 26.2%), winning first prize at the ImageNet competition that year (Li et al., 2020). This achievement is regarded as the turning point of OD: since then, CNNs have dominated over conventional techniques, and the era of DL-based OD has begun (Zhiqiang and Jun, 2017). CNN-based OD systems fall into two families: two-stage and one-stage detection (Zou et al., 2023). In one-stage OD, the whole process is performed in a single step, which boosts computation speed (Li et al., 2022). In two-stage detection, by contrast, the process is divided into two steps: in the first step, a set of candidate proposals is generated, and in the second step, the generated proposals are refined to produce the final results (Li et al., 2020).

In 2014, Girshick et al. (2014; 2015) took the lead by proposing Regions with CNN features (R-CNN), a two-stage OD method. The overall process of R-CNN is not complex. First, the algorithm produces approximately 2000 proposals using an algorithm called selective search (SS) (Li et al., 2022). These proposals are then fed into a pre-trained CNN model, AlexNet, to extract features. Finally, a linear support vector machine (SVM) is used to predict the existence of any object. However, because of the large number of proposals, the algorithm detects objects very slowly (Zou et al., 2023). To address the fixed-size input issue of R-CNN, He et al. (2016) developed spatial pyramid pooling networks (SPPNet), which compute about 20 times faster than R-CNN. Although SPPNet improves on R-CNN, it still has drawbacks. To address them, Girshick (2015) proposed the "Fast R-CNN" algorithm, which speeds up both the training and testing phases while also improving accuracy, thus improving on both R-CNN and SPPNet. Although Fast R-CNN performs better than its predecessors, its speed is still limited by external proposal generation. To overcome this limitation, Ren et al. (2017) proposed the more sophisticated Faster R-CNN, which processes the entire image at once and generates proposals within the network, reducing computation time and yielding higher speed. Following Faster R-CNN, further improvements to OD such as Mask R-CNN and Feature Pyramid Networks (FPNs) were introduced (Zou et al., 2023).

One-stage OD algorithms were developed by excluding proposal generation to reduce computation time. In 2015, the first one-stage OD algorithm, You Only Look Once (YOLO), was proposed by Redmon et al. (2016). It is considerably faster than two-stage algorithms because it applies only one convolutional network to the whole image. Although faster in computation, its accuracy lags behind that of two-stage OD algorithms. Subsequently, Liu et al. (2016) proposed the Single-Shot Multibox Detector (SSD), which improves accuracy and increases computation speed over the earlier one-stage OD algorithm; SSD is particularly effective for detecting small-scale objects. Following Liu et al. (2016), Lin et al. (2017) improved the accuracy and speed of one-stage OD by proposing RetinaNet, which adopts the "focal loss" function and achieves accuracy comparable to two-stage OD algorithms at a comparable computation speed. All the above-mentioned algorithms suffer from category imbalance, a large number of hand-designed hyperparameters, and long convergence times (Zou et al., 2023). To address these issues, Law and Deng (2018) proposed the CornerNet algorithm, which uses additional embedding information to generate the essential bounding boxes (BBox) and outperforms the previous one-stage OD algorithms. Zhou et al. (2019) proposed the CenterNet algorithm, which is end-to-end differentiable, elegant, simple, and fast, and can also detect 3D objects. All the above-mentioned algorithms have greatly contributed to computer vision by improving the OD process in the field of RS.

In the field of RS, there are five main steps to complete OD using DL (Figure 1):
a) data preprocessing, which includes data augmentation (geometric, color, and blur transformations) and clipping when the dataset is small, since DL requires a large dataset to perform reliably;
b) feature extraction and processing, known as the foundation of DL; a number of algorithms are used for feature extraction, for instance, AlexNet (Krizhevsky et al., 2012), ResNet (He et al., 2016), and DenseNet (Huang et al., 2017);
c) BBox generation, which is essential for detecting the exact location of an object in the image;
d) detection, which filters out useless information using a fully connected layer (two-stage algorithms) or executes the convolutional operation on the last layer (one-stage algorithms); and finally,
e) post-processing, which is not mandatory for obtaining the final outputs (Li et al., 2022).
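A minimal illustration of how these five steps fit together is sketched below in Python, assuming PyTorch/torchvision; a pre-trained Faster R-CNN stands in for steps b) to d), and the dummy input image, augmentation choices, and 0.5 score threshold are placeholder assumptions rather than part of any specific RS pipeline.

```python
# Sketch of the five OD-in-RS steps; inputs and thresholds are illustrative.
import numpy as np
import torch
import torchvision
from PIL import Image
from torchvision import transforms

# a) Data preprocessing: augmentation for a small RS dataset.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # geometric transformation
    transforms.ColorJitter(brightness=0.2),   # color transformation
    transforms.GaussianBlur(kernel_size=3),   # blurred transformation
    transforms.ToTensor(),
])

# b)-d) Feature extraction, BBox generation, and detection are bundled
# inside a two-stage detector; a pre-trained Faster R-CNN stands in here.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.fromarray(np.zeros((512, 512, 3), dtype=np.uint8))  # dummy image
with torch.no_grad():
    detections = model([augment(image)])[0]   # dict: 'boxes', 'labels', 'scores'

# e) Post-processing (optional): keep only confident detections.
keep = detections["scores"] > 0.5
print(detections["boxes"][keep])
```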

Figure 1: Flowchart showing the steps of OD in RS using DL (Source: Li et al., 2022)

Datasets are the main resource for OD in RS, as the performance of a DL algorithm is directly tied to the size and quality of the dataset used (Sun et al., 2022). Cheng et al. (2014) released the NWPU VHR-10 dataset, collected from Google Earth, containing 715 images in 10 categories. Because of its small number of instances (only 3775), this dataset is not ideal for representing the real Earth. Zhang et al. (2019) introduced a dataset called HRRSD, which has a very high spatial resolution of 0.13 m to 1.2 m and was collected from both Google Earth and Baidu Maps. However, because of the small size of its images (227 × 227 pixels), other sources with larger images are needed. Xia et al. (2018) introduced a larger pool of original data, the "Dataset for Object Detection in Aerial Images (DOTA)", in which 2806 images of 800 × 800 to 4000 × 4000 pixels across 15 categories were collected from a number of satellites and platforms. Li et al. (2020) proposed a publicly available dataset derived from Google Earth for Detection in Optical Remote sensing (DIOR), which consists of 23,463 images in 20 categories, with about 1,200 images (800 × 800 pixels) per category. These two datasets are typically used for generic OD in RS. Zhu et al. (2018) released a large pool of data collected by drones, known as VisDrone2018, primarily to address the gap of changing viewpoints, scales, and occlusion. The VisDrone dataset contains 10,209 well-annotated images along with 263 videos, and adds new value in computer vision for single- and multi-object tracking. The SAR ship detection dataset (SSDD) was the first open data source released for detecting ships from SAR data (Zhang et al., 2021). The images stored in SSDD range from very high to medium spatial resolution (1 m to 15 m) with a size of 500 × 500 pixels; the dataset has a total of 1160 images, with on average more than two ships per image. Other SAR image sources used for OD include the SAR ship datasets (Wang et al., 2019), AIR-SARShip-1.0 (Xian et al., 2019), and SpaceNet 6 Multi-Sensor All-Weather (Shermeyer et al., 2020). To date, the most diverse dataset is xView, released by Lam et al. (2018). Its data fall in the very high resolution category, with a spatial resolution of 0.3 m, collected by the WorldView-3 satellite. It contains about 1 million object instances in 60 classes, with images of 2000 × 2000 to 4000 × 4000 pixels.

In the later sections of this study, the popular two-stage algorithm Faster R-CNN is discussed in depth to show how it works. An application of Faster R-CNN is then discussed in the results section. Finally, a comparison with other algorithms justifies why Faster R-CNN is the best algorithm for OD in RS.

2. Related Work
Zhigang et al. (2018) used the region-based fully convolutional network (R-FCN) to identify four different types of vehicles: SUVs, buses, vans, and coupes. It outperformed traditional OD techniques in terms of computation time and target detection rate, the latter reaching 87.48%. However, the R-FCN requires a larger dataset to perform efficiently, which was a limitation of their study. It is also challenging to detect small-sized objects with this method, and the authors recommended that researchers working later on the same topic address this issue.

Ren et al. (2018) proposed a modified Faster R-CNN to address the poor performance of the original Faster R-CNN when applied directly to small-sized OD in optical RS. They modified the RPN stage of Faster R-CNN and applied "random rotation" data augmentation during training. The algorithm proved effective at detecting small-scale objects.

Liu et al. (2020) proposed an improved Faster R-CNN algorithm to detect small-sized objects such as overpasses, oil tanks, aircraft, and playgrounds. In that study, they compared the performance of their proposed algorithm with the original Faster R-CNN. Their algorithm produced a precision of 87.9%, much higher than the 45.1% of the original Faster R-CNN, because the improved version includes a multi-scale detection structure that the original lacks.

Yi et al. (2021) investigated the detection of an agricultural pest, "grasshopper plagues", using a probabilistic Faster R-CNN. This algorithm performed strongly in terms of mean precision (0.91), maximum F1 score (0.92), a reduced false positive rate, and mean miss rate (0.36). Such studies are important for managing agricultural productivity sustainably and ensuring food security.

Farid et al. (2023) detected vehicles on roads using the YOLO-v5 algorithm on the publicly available COCO, DAWN, and PKU datasets, with a view to improving traffic systems. The authors found that their algorithm outperformed other conventional vehicle detection techniques in terms of accuracy and computation time. A challenge during their study was annotating all the images manually.

Chen et al. (2020) proposed a method combining a CNN with a modified Generative Adversarial Network (GAN) for accurately detecting small ships. In their study, they artificially generated images based on the original ones using a Gaussian Mixture Wasserstein GAN with Gradient Penalty, and then applied their proposed method to both the original and generated data. Their analyses indicate that the proposed method detects small-scale ships reliably, with robust and improved results. The method can be applied to ensure ship safety autonomously.

Liu et al. (2020) proposed a new method for detecting aircraft of different sizes, types, backgrounds, and poses from very high-resolution satellite images, an important task in military operations. Their method combines corner clustering with a CNN to address the limitation of classical aircraft detection algorithms, in which region generation and feature extraction have to be done manually. They also applied image segmentation techniques to minimize the interference present in the backgrounds of the images. They found the method robust, producing higher precision with fewer false alarms. However, they did not consider adjusting the position of the aircraft.

Körschens et al. (2018) combined a DL-based CNN and an ML-based SVM to automatically detect elephants as part of biodiversity monitoring, using a face detection approach. Their proposed algorithm achieved top-1 and top-10 accuracies of 74% and 88%, respectively, during the testing stage.

Nava et al. (2021) applied CNNs to both optical and SAR images for detecting earthquake-triggered landslides, using both pre- and post-event images. The CNN model achieved its best overall accuracy when optical images were used for landslide detection: 98.96% with optical images, dropping slightly to 96% with SAR images.

Younis et al. (2020) used the SSD algorithm with the MobileNet DL model to detect three categories of objects (chairs, cars, and people) from video, aiming to improve on the accuracy of the plain SSD algorithm. The results showed mean precisions of 71.07% for chairs, 99.76% for cars, and 97.76% for people, indicating improved accuracy, which is much needed for monitoring everyday indoor and outdoor objects.

Li et al. (2021) used a meta-learning-based algorithm built on YOLO-v3 to detect objects from very few samples, introducing an alternative to standard CNNs, which usually need a larger dataset for reliable accuracy. Their analyses show that the proposed model achieves acceptable performance using only a limited number of images, and that it performed better than several baseline methods for OD in RS.

Janakiramaiah et al. (2023) introduced a CNN-based architecture called the capsule network (CapsNet) for detecting five different categories of military objects in wartime using a small amount of data. The proposed algorithm achieved an accuracy of 96.54%, indicating that it paves the way for automatic OD during wartime. The model can be applied in other military operations as well.

3. Method Description: Faster R-CNN

In this section, the popular Faster R-CNN algorithm (Ren et al., 2017) is discussed. This region-based CNN algorithm has three main parts (Figure 2): a) the VGG layers, named after the Visual Geometry Group, which form the foundation block of Faster R-CNN; b) a region proposal network (RPN), which generates region proposals efficiently; and finally, c) a Fast R-CNN stage for final classification along with bounding box regression to detect the final objects. A more detailed explanation is given below:

Figure 2: The typical architecture of the Faster R-CNN algorithm (Source: Deng et al., 2018)

a) Convolution layers: This part belongs to the VGG network and initially processes the input image to commence the pipeline (Figure 2). The resulting convolutional features are shared by both the RPN and Fast R-CNN in the next stage (Gad, 2020). The VGG network consists of 16 to 19 layers of convolutions with max pooling. A 2D feature map is produced by the convolutional layers using appropriate filter sizes (such as a 3 × 3 pixel window), whereas max pooling (such as a 2 × 2 pixel window) discards low-valued pixels to reduce the number of features (Khazri, 2019). After these convolution and max-pooling operations, a single feature map is created, which is used separately by the RPN and Fast R-CNN.
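As a minimal sketch of this step, assuming PyTorch/torchvision and an arbitrary 800 × 800 input (the exact input size is not fixed by the architecture), the shared feature map can be produced from the VGG-16 convolution and max-pooling layers as follows:

```python
# Producing the shared feature map with the VGG-16 conv + max-pool layers.
import torch
import torchvision

vgg = torchvision.models.vgg16(weights="DEFAULT")
backbone = vgg.features                 # conv + max-pooling layers, no FC head

image = torch.randn(1, 3, 800, 800)     # a dummy RGB input image
with torch.no_grad():
    feature_map = backbone(image)       # shared by the RPN and Fast R-CNN
print(feature_map.shape)                # torch.Size([1, 512, 25, 25])
```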

b) RPN: The RPN takes the feature map created by the VGG layers as input and produces an output called "region proposals (RP)", which is used in the last part of the Faster R-CNN algorithm (Ren et al., 2017). A small network is slid over the feature map obtained from the VGG layers, and every sliding window is mapped to a lower-dimensional vector. This vector is then fed into two sibling fully connected (FC) layers (implemented as 1 × 1 convolutions): a box-regression layer and a box-classification layer. At each sliding-window position, k region proposals are predicted simultaneously, each parameterized relative to a reference box called an "anchor box".

Each anchor box has two elements: a scale and an aspect ratio. Ren et al. (2017) recommend using 3 scales and 3 aspect ratios, resulting in 9 anchors in total at each sliding-window position. The classification layer is primarily a binary classifier with two outputs, one for background and one for an object. Each classifier output therefore has two elements: if the first and second elements are 1 and 0, respectively, the RP is background, and if the values are reversed, the RP is an object.
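The anchor construction can be sketched as follows, using 3 scales (128, 256, and 512 pixels, the values used by Ren et al. (2017)) and 3 aspect ratios to produce the 9 anchors per sliding-window position:

```python
# 3 scales x 3 aspect ratios = 9 anchors per sliding-window position.
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = 9 anchors centred at (cx, cy) as (x1, y1, x2, y2)."""
    anchors = []
    for s in scales:            # anchor area is roughly s * s
        for r in ratios:        # r is the width-to-height ratio
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

print(make_anchors(400, 400).shape)   # (9, 4): nine anchors per position
```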

For training the RPN, either a positive or a negative label is assigned to each anchor based on the Intersection-over-Union (IoU), the ratio of the overlap area between the anchor box and the ground-truth box (GTB) to the area of their union. A positive label is assigned to an anchor that has the highest IoU with a GTB, or an IoU above 0.7 with any GTB; a positive anchor label means the output is an object. A negative label is assigned to an anchor whose IoU is below 0.3 for all GTBs. Anchors that are neither positive nor negative do not contribute to training.
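The IoU computation and this labelling rule can be sketched as follows (the highest-IoU tie-break rule is omitted for brevity; the thresholds are those of Ren et al. (2017)):

```python
# IoU and the anchor-labelling rule; boxes are (x1, y1, x2, y2).
def iou(box_a, box_b):
    """Intersection area divided by union area of two boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes):
    """+1 = object, 0 = background, -1 = ignored during training."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best > 0.7:
        return 1
    if best < 0.3:
        return 0
    return -1

print(label_anchor((0, 0, 100, 100), [(5, 5, 95, 95)]))   # 1 (IoU = 0.81)
```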

Here, to train the RPN, the multi-task loss of the Fast R-CNN algorithm is used as well; the loss function is:

L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) .................... (1)

where i indexes the anchors; p_i, p_i^*, t_i, and t_i^* represent the predicted probability, the ground-truth label, the vector of predicted bounding-box coordinates, and the GTB associated with a positive anchor, respectively. L_{cls} is the classification loss over background versus object, and L_{reg} is the regression loss. The outputs of the classification and regression layers consist of the p_i and t_i, respectively. The two terms are normalized by N_{cls} and N_{reg} and balanced by a weight λ.
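A compact PyTorch sketch of this multi-task loss is given below; the anchor labels, predictions, and λ = 10 (the value suggested by Ren et al. (2017)) are illustrative, and the per-term normalizations are folded into the means computed by the built-in losses:

```python
# Multi-task RPN loss of Eq. (1); synthetic inputs for illustration.
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, labels, box_pred, box_targets, lam=10.0):
    """cls_logits: (N, 2) object-vs-background scores for N anchors;
    labels: (N,) with 1 = object, 0 = background, -1 = ignored;
    box_pred / box_targets: (N, 4) parameterized coordinates (Eqs. 2-9)."""
    used = labels >= 0                        # drop ignored anchors
    l_cls = F.cross_entropy(cls_logits[used], labels[used])
    pos = labels == 1                         # p_i* = 1: positives only
    l_reg = F.smooth_l1_loss(box_pred[pos], box_targets[pos])
    return l_cls + lam * l_reg

# Tiny synthetic example with four anchors:
logits = torch.randn(4, 2)
labels = torch.tensor([1, 0, -1, 1])
pred, target = torch.randn(4, 4), torch.randn(4, 4)
print(rpn_loss(logits, labels, pred, target))
```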

For the regression part, a parameterization of the four box coordinates is adopted, given by equations (2) to (9):

t_x = \frac{x - x_a}{w_a} .................... (2)

t_y = \frac{y - y_a}{h_a} .................... (3)

t_w = \log\frac{w}{w_a} .................... (4)

t_h = \log\frac{h}{h_a} .................... (5)

t_x^* = \frac{x^* - x_a}{w_a} .................... (6)

t_y^* = \frac{y^* - y_a}{h_a} .................... (7)

t_w^* = \log\frac{w^*}{w_a} .................... (8)

t_h^* = \log\frac{h^*}{h_a} .................... (9)

where x and y denote the coordinates of the box center, and w and h its width and height. The variables x, x_a, and x^* refer to the predicted box, the anchor box, and the GTB, respectively (and likewise for y, w, and h).
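A short sketch of this encoding, with boxes given as (center x, center y, width, height) and illustrative numbers, is shown below; equations (6) to (9) apply the same mapping to the GTB, so the regression loss compares the two encodings:

```python
# Coordinate parameterization of Eqs. (2)-(5): a box relative to its anchor.
import math

def encode(box, anchor):
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    tx = (x - xa) / wa         # Eq. (2)
    ty = (y - ya) / ha         # Eq. (3)
    tw = math.log(w / wa)      # Eq. (4)
    th = math.log(h / ha)      # Eq. (5)
    return tx, ty, tw, th

print(encode(box=(110, 95, 64, 32), anchor=(100, 100, 64, 64)))
# (0.15625, -0.078125, 0.0, -0.693...)
```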

c) Fast R-CNN: In this stage, the feature map produced by the VGG layers and the RPs produced by the RPN are used to execute the Fast R-CNN process (Figure 2). The feature map is fed into a Region of Interest (RoI) pooling layer with one pyramid level. The RoI pooling layer extracts a fixed-length feature vector from the feature map for each region proposal generated by the RPN, regardless of the proposal's size. The features produced by RoI pooling are then passed to the FC layer, from which two branches emerge: classification by a softmax classifier and BBox regression. Finally, the outputs of the softmax classifier and the BBox regression are used to calculate the loss with the following loss function (Girshick, 2015):

L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v) .................... (10)

where L_{cls} represents the classification loss, also known as the log loss of the true class u, and L_{loc} is the localization loss:

L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i) .................... (11)

where x, y, w, and h are the center coordinates, width, and height of the BBox produced by the regression, and

\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} .................... (12)

Because RoI pooling extracts features from each proposal one by one, it requires more computation time than usual, which makes the whole Faster R-CNN pipeline somewhat slower (Gad, 2020).
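The Fast R-CNN stage can be sketched with torchvision's RoI pooling operator as follows; the feature-map size, the 7 × 7 output bins, the FC width, and the 20-class setting are illustrative assumptions rather than fixed choices of the algorithm:

```python
# RoI pooling followed by the two sibling heads of the Fast R-CNN stage.
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 25, 25)              # from the VGG layers
proposals = torch.tensor([[0., 40., 40., 360., 360.],  # (batch_idx, x1, y1, x2, y2)
                          [0., 100., 120., 300., 280.]])

# Fixed-length features regardless of proposal size (7 x 7 bins here);
# spatial_scale maps 800-pixel image coordinates onto the 25-cell map.
rois = roi_pool(feature_map, proposals, output_size=(7, 7),
                spatial_scale=25 / 800)

fc = nn.Linear(512 * 7 * 7, 1024)
cls_head = nn.Linear(1024, 21)                 # 20 classes + background
reg_head = nn.Linear(1024, 21 * 4)             # per-class BBox regression
x = torch.relu(fc(rois.flatten(1)))
scores, boxes = cls_head(x).softmax(dim=1), reg_head(x)
print(scores.shape, boxes.shape)               # (2, 21) and (2, 84)
```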

4. Results

Figure 3: Images of cars collected from (a) and (b) CDNet 2014 datasets (in blizzard and
snowfall, respectively); (c) and (d) LISA datasets (in sunny and dense weather, respectively)
(Source: Ghosh, 2021)
In the field of sustainable transportation, road safety is very important, so an automated method for monitoring roads and detecting vehicles in all situations is essential. A practical application of Faster R-CNN for detecting vehicles on roads in different weather conditions was conducted by Ghosh (2021). In that study, Ghosh (2021) used four RPNs of different sizes to produce the RoIs, so that vehicles of diverse sizes could be detected.

4.1 Dataset

Three different datasets were used in this study: DAWN, CDNet 2014, and LISA. The DAWN dataset contains 1000 images collected in four weather conditions: sandstorm, fog, rain, and snow. The CDNet 2014 dataset contains videos captured in both good weather (2500 consecutive frames in sunny conditions) and bad weather (3500, 6500, and 7000 consecutive frames in wet snow, snowfall, and blizzard conditions, respectively). Finally, the LISA dataset contains videos captured in three different weather situations: about 1600, 300, and 300 consecutive frames retrieved on sunny evenings, cloudy mornings, and sunny afternoons, respectively. Figure 3 shows four sample images of vehicles from CDNet 2014 and LISA. Each dataset from the three sources is divided into training and testing parts at a ratio of 3:1.
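This 3:1 split can be reproduced with a one-liner, sketched here with scikit-learn on placeholder file names and labels standing in for one of the loaded datasets:

```python
# A 3:1 train/test split, as used for each of the three datasets.
from sklearn.model_selection import train_test_split

images = [f"frame_{i:04d}.jpg" for i in range(1000)]   # placeholder frames
labels = [0] * 1000                                    # placeholder annotations

train_x, test_x, train_y, test_y = train_test_split(
    images, labels, test_size=0.25, random_state=42)   # 3:1 ratio
print(len(train_x), len(test_x))                       # 750 250
```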

4.2 Methodology

Figure 4: Architecture of the last two parts of the Faster R-CNN incorporating multiple RPNs to create the RoI and Fast R-CNN (Source: Ghosh, 2021)

Before the several RPNs are applied, the feature map is created using the convolutional (VGG) layers. The feature map is then used as input to the RPNs for creating RoIs. Figure 4 shows the architecture of the remaining two parts of the algorithm, the RPNs and the Fast R-CNN.

The receptive-field dimensions of RPN1, RPN2, RPN3, and RPN4 are 128 × 128, 312 × 312, 574 × 574, and 968 × 968 pixels, respectively, while the dimension of the input image (feature map) is 1200 × 1200. The four arrays generated by the four RPNs are merged into a single array by a RoI merger layer, and this merged array is fed into RoI pooling. The remaining Fast R-CNN steps, the FC layer followed by classification for vehicle detection and regression for loss estimation, are then performed as in the standard Faster R-CNN architecture. An illustrative sketch of the merging step is given below.
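The sketch below illustrates one plausible reading of the RoI merger layer, concatenating the proposals from the four RPNs and keeping the 80 highest-scoring RoIs (Table 1); this is an assumption for illustration, not Ghosh's actual implementation:

```python
# Merging proposals from four RPNs into one RoI array before RoI pooling.
import torch

def merge_rois(per_rpn_proposals, per_rpn_scores, keep=80):
    """Concatenate (batch_idx, x1, y1, x2, y2) proposals from all RPNs
    and keep the `keep` highest-scoring ones (80 RoIs, as in Table 1)."""
    all_rois = torch.cat(per_rpn_proposals, dim=0)
    all_scores = torch.cat(per_rpn_scores, dim=0)
    top = all_scores.topk(min(keep, all_scores.numel())).indices
    return all_rois[top]

# Four RPNs, each contributing 50 random proposals with scores:
rois = [torch.rand(50, 5) * 1200 for _ in range(4)]
scores = [torch.rand(50) for _ in range(4)]
print(merge_rois(rois, scores).shape)       # torch.Size([80, 5])
```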

Table 1: Optimized parameters of the Faster R-CNN incorporating four RPNs (Source: Ghosh, 2021)

| Hyperparameter | Value |
| --- | --- |
| Batch size | 512 |
| Overlap threshold for RoI | 0.8 |
| Number of RPNs | 4 |
| Number of RoIs | 80 |
| Learning rate | 0.05 |
| Weight decay for regularization | 0.005 |

Parameter optimization is necessary for reliable results; Table 1 shows the optimized parameter values used when executing the Faster R-CNN incorporating four RPNs.

4.3 Outputs from the testing stage

Most of the test images have produced true positive results in several weather conditions by
applying the Faster R-CNN with several RPNs (Figure 5). However, a small number of test
images have yielded false positive results. In these cases, the used architecture failed to
accurately detect vehicles and mistakenly enclosed other objects instead. The errors occurred
in various weather conditions, including wet snow or blizzard environments with extremely
poor visibility, as well as normal weather conditions where obstructions hindered the system's
ability to correctly identify vehicles in the images. Some examples of these inaccurate
detections are illustrated in Figure 6.

Figure 5: Correct detection of vehicles while on road by applying the Faster R-CNN
incorporating four RPNs on a few test images of the DAWN dataset in (a) Fog, (b) Snow, and
(c) Rain conditions. The red bounding box indicates the ground-truth and the green one
indicates the detecting box (Source: Ghosh, 2021)

Figure 6: A few incorrect detections of vehicles in (a) Wet snow, (b) Normal weather, and (c)
Blizzard conditions. The red bounding box indicates the ground-truth and the green one
indicates the detecting box (Source: Ghosh, 2021)

4.4 Performance evaluation of the model

The performance of the architecture has been evaluated using several statistical metrics: accuracy, precision, recall, and the receiver operating characteristic (ROC) curve. Table 2 shows the overall accuracy, precision, and recall of the model. The overall accuracy of the model across all the datasets, collected in different weather conditions, ranged from 89.21% to 95.42%, indicating very good overall performance. A high precision (89.04% to 95.28%) indicates that the model's detections are mostly true vehicles, while a high recall (89.12% to 95.34%) indicates that the model successfully found a large share of the actual vehicles. Overall, the model detects vehicles most accurately on the LISA datasets.
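For reference, these metrics follow directly from the detection counts; the sketch below uses made-up counts purely to illustrate the formulas:

```python
# Accuracy, precision, and recall from detection counts (synthetic numbers).
def detection_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # share of detections that are real vehicles
    recall = tp / (tp + fn)      # share of real vehicles that were detected
    return accuracy, precision, recall

acc, prec, rec = detection_metrics(tp=930, fp=48, tn=20, fn=52)
print(f"accuracy={acc:.2%}, precision={prec:.2%}, recall={rec:.2%}")
```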

Table 2: Values of the performance metrics produced from the output of the Faster R-CNN
with multiple RPNs (Source: Ghosh, 2021)

Figure 7 shows the ROC curve, which plots the model's sensitivity (true positive rate) against its false positive rate. The ROC curve also shows that the model performs best on the LISA datasets.
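An ROC curve of this kind can be produced from per-detection scores and ground-truth labels; the sketch below uses synthetic stand-in data with scikit-learn and matplotlib:

```python
# Plotting an ROC curve from detection scores (synthetic stand-in data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)                        # 1 = vehicle present
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, 500), 0, 1)

fpr, tpr, _ = roc_curve(y_true, y_score)                # false/true positive rates
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```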

Figure 7: ROC curve using Faster R-CNN with several RPNs (Source: Ghosh, 2021)

4.5 Comparison of the performance of the used model with other models for detecting vehicles

The performance of a model may vary across datasets: the same model can perform well on one dataset and poorly on another (Wang et al., 2022). So, when comparing the performance of a specific algorithm with other models for a specific purpose, it is always better to consider work in which the same datasets were used (Ghosh, 2021). This section therefore compares the performance of the Faster R-CNN with other models used for detecting vehicles on roads using the same data as Ghosh (2021).

Table 3: Comparison between the Faster R-CNN and other models to detect vehicles on the road

| Reference | Method | Backbone | Mean precision (%), LISA | Mean precision (%), DAWN | Mean precision (%), CDNet 2014 |
| --- | --- | --- | --- | --- | --- |
| Cai and Vasconcelos (2018) | Casc-RCNN | ResNet-101-FPN | 85.21 | 81.31 | 83.04 |
| Hu et al. (2018) | SINet | VGGNet-16 | 83.45 | 79.23 | 80.82 |
| Hassaballah et al. (2020) | LBP | NA | 86.72 | 82.46 | 83.38 |
| Chabot et al. (2017) | Deep MANTA | VGGNet-16 | 84.28 | 80.54 | 81.46 |
| Xiang et al. (2017) | SubCategory | ResNet-101 | 81.82 | 78.34 | 79.94 |
| He et al. (2017) | Mask R-CNN | ResNeXt-101 | 86.38 | 81.74 | 83.12 |
| Lin et al. (2017) | RetinaNet | ResNet-101-FPN | 74.62 | 71.23 | 72.54 |
| Redmon and Farhadi (2018) | YOLOv3 | DarkNet-53 | 93.78 | 88.54 | 89.72 |
| Li et al. (2019) | TridentNet | ResNet-101 | 79.58 | 75.42 | 77.18 |
| Ghosh (2021) | Faster R-CNN with four RPNs | Faster R-CNN | 96.16 | 89.48 | 91.20 |

Table 3 shows the performance of the Faster R-CNN and other models used by several scholars at different times to detect vehicles on roads, evaluated on three datasets: LISA, DAWN, and CDNet 2014. Mean precision is used to assess the accuracy of each method. Ghosh (2021), with Faster R-CNN utilizing four RPNs, achieves the highest mean precision across all datasets, demonstrating superior accuracy compared to the other methods. Redmon and Farhadi (2018), using YOLOv3, also stand out with exceptional results on the LISA dataset. ResNet-101 and ResNeXt-101 backbones are prevalent in several methods and consistently deliver competitive performance, while the VGGNet-16 and DarkNet-53 backbones yield good results but fall slightly behind their ResNet counterparts. Some methods, such as RetinaNet (Lin et al., 2017) and TridentNet (Li et al., 2019), exhibit relatively low mean precision, particularly on the LISA dataset; of all the works in Table 3, RetinaNet performed worst across the three datasets. In conclusion, the choice of the most suitable method depends on the specific application requirements and dataset characteristics, necessitating continuous exploration of cutting-edge techniques to enhance object detection accuracy.

5. Discussion and Conclusion

These days, OD is an essential part of computer vision based surveillance systems (Shivappriya et al., 2021). OD is no longer restricted to manufacturing and robotics; it is also extensively used in medical imaging, home automation, security and safety control, autonomous driving, traffic monitoring, hazard and disaster monitoring, smart city planning, agricultural monitoring, and so on (Janakiramaiah et al., 2023; Shivappriya et al., 2021; Yi et al., 2021; Talukdar et al., 2018). This move of OD into real-world applications did not happen in a single day: the initial OD architectures were quite simple and could solve only simple problems. Moreover, methods developed for natural image detection cannot be applied directly to RS images (Ding et al., 2021). Over time, an emerging need arose to use OD techniques in complex RS applications, which those early techniques could not serve. Many scholars have therefore contributed to improving OD techniques over the past two decades so that they can be used in broader contexts.

In this regard, the transformation from traditional techniques to deep learning based techniques has been made so that OD can be applied efficiently in the RS field (Zou et al., 2023). In the era of deep learning, many methods have been developed and proposed (Li et al., 2020). The field of OD has thus witnessed significant advances in DL-based models, including R-FCN, YOLO, SSD, RetinaNet, Fast R-CNN, and Faster R-CNN, each offering distinct features and capabilities (Li et al., 2022). However, not all models perform well across applications, given the difficulties of managing large RS datasets, extracting features, handling data collected in diverse weather, computation speed and efficiency, generalizability, achieving competitive accuracy, transfer learning, adaptability, and so on (Luo et al., 2021; Ghosh, 2021). In this regard, Faster R-CNN emerges as the model that outperforms the others with respect to most of the above-mentioned challenges (Yi et al., 2021).

Its three-stage design, incorporating VGG, RPN, and Fast R-CNN, leads to improved localization accuracy and robustness, which is particularly beneficial for detecting the small and densely packed objects prevalent in remote sensing imagery (Ren et al., 2017). Furthermore, Faster R-CNN consistently exhibits higher precision and recall scores, outperforming other models in challenging scenarios (Ghosh, 2021). While one-stage models like YOLO and SSD boast real-time processing capabilities, Faster R-CNN strikes a favorable balance between accuracy and speed, bolstered by recent optimizations, making it suitable for real-time applications (Luo et al., 2021). Moreover, Faster R-CNN's transfer learning capabilities on pre-trained models enable robust object detection with limited remote sensing data (Raman et al., 2023). Its adaptability to various environmental conditions, weather, and sensor types further reinforces its practicality in diverse RS tasks (Ghosh, 2021; Luo et al., 2021). In conclusion, after a comprehensive analysis, it can be stated that Faster R-CNN is the most effective and reliable model for object detection in remote sensing, presenting researchers and practitioners with an ideal solution to address their detection challenges.

References

Amit, Y., Felzenszwalb, P., & Girshick, R. (2020). Object detection. Computer Vision: A
Reference Guide, 1-9, https://doi.org/10.1007/978-3-030-03243-2_660-1.

Cai, Z., & Vasconcelos, N. (2018). Cascade r-cnn: Delving into high quality object detection.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp.
6154-6162), https://doi.org/10.48550/arXiv.1712.00726.

Chabot, F., Chaouch, M., Rabarisoa, J., Teuliere, C., & Chateau, T. (2017). Deep manta: A
coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular
image. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 2040-2049), https://doi.org/10.48550/arXiv.1703.07570.

Chen, Z., Chen, D., Zhang, Y., Cheng, X., Zhang, M., & Wu, C. (2020). Deep learning for
autonomous ship-oriented small ship detection. Safety Science, 130, 104812,
https://doi.org/10.1016/j.ssci.2020.104812.

Cheng, G., & Han, J. (2016). A survey on object detection in optical remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 117, 11-28, https://doi.org/10.1016/j.isprsjprs.2016.03.014.

Cheng, G., Han, J., Zhou, P., & Guo, L. (2014). Multi-class geospatial object detection and
geographic image classification based on collection of part detectors. ISPRS Journal of
Photogrammetry and Remote Sensing, 98, 119-132,
https://doi.org/10.1016/j.isprsjprs.2014.10.002.

Cintra, R. J., Duffner, S., Garcia, C., & Leite, A. (2018). Low-complexity approximate
convolutional neural networks. IEEE transactions on neural networks and learning
systems, 29(12), 5981-5992, https://doi.org/10.1109/TNNLS.2018.2815435.

Deng, Z., Sun, H., Zhou, S., Zhao, J., Lei, L., & Zou, H. (2018). Multi-scale object detection
in remote sensing imagery with convolutional neural networks. ISPRS journal of
photogrammetry and remote sensing, 145, 3-22,
https://doi.org/10.1016/j.isprsjprs.2018.04.003.

Ding, J., Wang, J., Yang, W., & Xia, G. S. (2021). Object detection in remote sensing. Deep
Learning for the Earth Sciences: A Comprehensive Approach to Remote Sensing,
Climate Science, and Geosciences, 67-89, https://doi.org/10.1002/9781119646181.ch6.

Dundar, A., Jin, J., Martini, B., & Culurciello, E. (2017). Embedded streaming deep neural networks accelerator with applications. IEEE Transactions on Neural Networks and Learning Systems, 28(7), 1572-1583, https://doi.org/10.1109/TNNLS.2016.2545298.

Farid, A., Hussain, F., Khan, K., Shahzad, M., Khan, U., & Mahmood, Z. (2023). A Fast and
Accurate Real-Time Vehicle Detection Method Using Deep Learning for
Unconstrained Environments. Applied Sciences, 13(5), 3059,
https://doi.org/10.3390/app13053059.

Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008, June). A discriminatively trained,
multiscale, deformable part model. In 2008 IEEE conference on computer vision and
pattern recognition (pp. 1-8). IEEE, https://doi.org/10.1109/CVPR.2008.4587597.

Gad, A. F. (2020). Faster R-CNN Explained for Object Detection Tasks. PaperSpace. [URL:
https://blog.paperspace.com/faster-r-cnn-explained-object-detection]

Ghosh, R. (2021). On-road vehicle detection in varying weather conditions using faster R-CNN
with several region proposal networks. Multimedia Tools and Applications, 80(17),
25985-25999, https://doi.org/10.1007/s11042-021-10954-5.

Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE international conference on


computer vision (pp. 1440-1448), https://doi.org/10.48550/arXiv.1504.08083.

Girshick, R. B. (2012). From rigid templates to grammars: Object detection with structured
models. The University of Chicago.

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate
object detection and semantic segmentation. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 580-587),
https://doi.org/10.48550/arXiv.1311.2524

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2015). Region-based convolutional networks
for accurate object detection and segmentation. IEEE transactions on pattern analysis
and machine intelligence, 38(1), 142-158,
https://doi.org/10.1109/TPAMI.2015.2437384.

Hassaballah, M., Kenk, M. A., & El-Henawy, I. M. (2020). Local binary pattern-based on-road
vehicle detection in urban traffic scene. Pattern Analysis and Applications, 23(4), 1505-
1521, https://doi.org/10.1007/s10044-020-00874-9.

He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE
international conference on computer vision (pp. 2961-2969),
https://doi.org/10.48550/arXiv.1703.06870.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Spatial pyramid pooling in deep convolutional
networks for visual recognition. IEEE transactions on pattern analysis and machine
intelligence, 37(9), 1904-1916, https://doi.org/10.1109/TPAMI.2015.2389824.

Hu, X., Xu, X., Xiao, Y., Chen, H., He, S., Qin, J., & Heng, P. A. (2018). SINet: A scale-
insensitive convolutional neural network for fast vehicle detection. IEEE transactions
on intelligent transportation systems, 20(3), 1010-1019,
https://doi.org/10.1109/TITS.2018.2838132.

Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition (pp. 4700-4708), https://doi.org/10.1109/CVPR.2017.243.

Janakiramaiah, B., Kalyani, G., Karuna, A., Prasad, L. N., & Krishna, M. (2023). Military
object detection in defense using multi-level capsule networks. Soft Computing, 27(2),
1045-1059, https://doi.org/10.1007/s00500-021-05912-0.

Khazri, A. (2019, April 9). Faster R-CNN Object Detection. Towards Data Science. [URL:
https://towardsdatascience.com/faster-rcnn-object-detection-f865e5ed7fc4]

Knura, M., Kluger, F., Zahtila, M., Schiewe, J., Rosenhahn, B., & Burghardt, D. (2021). Using
object detection on social media images for urban bicycle infrastructure planning: a
case study of Dresden. ISPRS International Journal of Geo-Information, 10(11), 733,
https://doi.org/10.3390/ijgi10110733.

Körschens, M., Barz, B., & Denzler, J. (2018). Towards automatic identification of elephants
in the wild. arXiv preprint arXiv:1812.04418,
https://doi.org/10.48550/arXiv.1812.04418.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. Advances in neural information processing systems, 25.

Lam, D., Kuzma, R., McGee, K., Dooley, S., Laielli, M., Klaric, M., Bulatov, Y., & McCord, B. (2018). xView: Objects in context in overhead imagery. arXiv preprint arXiv:1802.07856, https://doi.org/10.48550/arXiv.1802.07856.

Law, H., & Deng, J. (2018). Cornernet: Detecting objects as paired keypoints. In Proceedings
of the European conference on computer vision (ECCV) (pp. 734-750),
https://doi.org/10.48550/arXiv.1808.01244.

Li, K., Wan, G., Cheng, G., Meng, L., & Han, J. (2020). Object detection in optical remote
sensing images: A survey and a new benchmark. ISPRS journal of photogrammetry and
remote sensing, 159, 296-307, https://doi.org/10.1016/j.isprsjprs.2019.11.023.

Li, X., Deng, J., & Fang, Y. (2021). Few-shot object detection on remote sensing images. IEEE
Transactions on Geoscience and Remote Sensing, 60, 1-14,
https://doi.org/10.1109/TGRS.2021.3051383.

Li, Y., Chen, Y., Wang, N., & Zhang, Z. (2019). Scale-aware trident networks for object
detection. In Proceedings of the IEEE/CVF international conference on computer
vision (pp. 6054-6063), https://doi.org/10.48550/arXiv.1901.01892.

Li, Z., Wang, Y., Zhang, N., Zhang, Y., Zhao, Z., Xu, D., ... & Gao, Y. (2022). Deep learning-
based object detection techniques for remote sensing images: A survey. Remote
Sensing, 14(10), 2385, https://doi.org/10.3390/rs14102385.

Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object
detection. In Proceedings of the IEEE international conference on computer vision (pp.
2980-2988), https://doi.org/10.48550/arXiv.1708.02002.

Liu, Q., Xiang, X., Wang, Y., Luo, Z., & Fang, F. (2020). Aircraft detection in remote sensing
image based on corner clustering and deep learning. Engineering Applications of
Artificial Intelligence, 87, 103333, https://doi.org/10.1016/j.engappai.2019.103333.

Liu, R., Yu, Z., Mo, D., & Cai, Y. (2020, July). An improved faster-RCNN algorithm for object
detection in remote sensing images. In 2020 39th Chinese Control Conference
(CCC) (pp. 7188-7192). IEEE, https://doi.org/10.23919/CCC50068.2020.9189024.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). Ssd:
Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European

Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I
14 (pp. 21-37). Springer International Publishing, https://doi.org/10.1007/978-3-319-
46448-0_2.

Lowe, D. G. (1999, September). Object recognition from local scale-invariant features. In


Proceedings of the seventh IEEE international conference on computer vision (Vol. 2,
pp. 1150-1157). IEEE, https://doi.org/10.1109/ICCV.1999.790410.

Luo, J. Q., Fang, H. S., Shao, F. M., Zhong, Y., & Hua, X. (2021). Multi-scale traffic vehicle
detection based on faster R–CNN with NAS optimization and feature
enrichment. Defence Technology, 17(4), 1542-1554,
https://doi.org/10.1016/j.dt.2020.10.006.

Moschos, S., Charitidis, P., Doropoulos, S., Avramis, A., & Vologiannidis, S. (2023).
StreetScouting dataset: A Street-Level Image dataset for finetuning and applying
custom object detectors for urban feature detection. Data in Brief, 48, 109042,
https://doi.org/10.1016/j.dib.2023.109042.

Nava, L., Monserrat, O., & Catani, F. (2021). Improving landslide detection on SAR data
through deep learning. IEEE Geoscience and Remote Sensing Letters, 19, 1-5,
https://doi.org/10.1109/LGRS.2021.3127073.

Pi, Y., Nath, N. D., & Behzadan, A. H. (2020). Convolutional neural networks for object
detection in aerial imagery for disaster response and recovery. Advanced Engineering
Informatics, 43, 101009, https://doi.org/10.1016/j.aei.2019.101009.

Raman, D. R., Vidhya, S., & Navinkumar, N. M. (2023, May). Satellite Imagery-based
Prediction of Poverty using Faster R-CNN. In 2023 7th International Conference on
Intelligent Computing and Control Systems (ICICCS) (pp. 41-46). IEEE,
https://doi.org/10.1109/ICICCS56967.2023.10142570.

Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint
arXiv:1804.02767, https://doi.org/10.48550/arXiv.1804.02767.

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-
time object detection. In Proceedings of the IEEE conference on computer vision and
pattern recognition (pp. 779-788), https://doi.org/10.48550/arXiv.1506.02640.

Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster r-cnn: Towards real-time object detection
with region proposal networks. Advances in neural information processing systems, 28,
https://doi.org/10.48550/arXiv.1506.01497.

Ren, Y., Zhu, C., & Xiao, S. (2018). Small object detection in optical remote sensing images
via modified faster R-CNN. Applied Sciences, 8(5), 813,
https://doi.org/10.3390/app8050813.

Shermeyer, J., Hogan, D., Brown, J., Van Etten, A., Weir, N., Pacifici, F., ... & Lewis, R. (2020).
SpaceNet 6: Multi-sensor all weather mapping dataset. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 196-
197), https://doi.org/10.48550/arXiv.2004.06500.

Shivappriya, S. N., Priyadarsini, M. J. P., Stateczny, A., Puttamadappa, C., & Parameshachari,
B. D. (2021). Cascade object detection and remote sensing object detection method
based on trainable activation function. Remote Sensing, 13(2), 200,
https://doi.org/10.3390/rs13020200.

Sun, X., Wang, P., Yan, Z., Xu, F., Wang, R., Diao, W., ... & Fu, K. (2022). FAIR1M: A
benchmark dataset for fine-grained object recognition in high-resolution remote
sensing imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 184, 116-
130, https://doi.org/10.1016/j.isprsjprs.2021.12.004.

Szegedy, C., Toshev, A., & Erhan, D. (2013). Deep neural networks for object
detection. Advances in neural information processing systems, 26.

Talukdar, J., Gupta, S., Rajpura, P. S., & Hegde, R. S. (2018, February). Transfer learning for
object detection using state-of-the-art deep neural networks. In 2018 5th international
conference on signal processing and integrated networks (SPIN) (pp. 78-83). IEEE,
https://doi.org/10.1109/SPIN.2018.8474198.

Wang, X., Shrivastava, A., & Gupta, A. (2017). A-fast-rcnn: Hard positive generation via
adversary for object detection. In Proceedings of the IEEE conference on computer
vision and pattern recognition (pp. 2606-2615),
https://doi.org/10.1109/CVPR.2017.324.

Wang, Y., Bashir, S. M. A., Khan, M., Ullah, Q., Wang, R., Song, Y., ... & Niu, Y. (2022).
Remote sensing image super-resolution and object detection: Benchmark and state of

the art. Expert Systems with Applications, 197, 116793,
https://doi.org/10.1016/j.eswa.2022.116793.

Wang, Y., Wang, C., Zhang, H., Dong, Y., & Wei, S. (2019). A SAR dataset of ship detection for deep learning under complex backgrounds. Remote Sensing, 11(7), 765, https://doi.org/10.3390/rs11070765.

Xia, G. S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., ... & Zhang, L. (2018). DOTA: A
large-scale dataset for object detection in aerial images. In Proceedings of the IEEE
conference on computer vision and pattern recognition (pp. 3974-3983),
https://doi.org/10.48550/arXiv.1711.10398.

Xian, S., Wang, Z., Sun, Y., Diao, W., Zhang, Y., & Fu, K. (2019). AIR-SARShip-1.0: High-resolution SAR ship detection dataset. Journal of Radars, 8(6), 852-863, https://doi.org/10.12000/JR19097.

Xiang, Y., Choi, W., Lin, Y., & Savarese, S. (2017). Subcategory-aware convolutional neural
networks for object proposals and detection. In 2017 IEEE winter conference on
applications of computer vision (WACV) (pp. 924-933). IEEE,
https://doi.org/10.1109/WACV.2017.108.

Yi, D., Su, J., & Chen, W. H. (2021). Probabilistic faster R-CNN with stochastic region
proposing: Towards object detection and recognition in remote sensing
imagery. Neurocomputing, 459, 290-301,
https://doi.org/10.1016/j.neucom.2021.06.072.

Younis, A., Shixin, L., Jn, S., & Hai, Z. (2020, January). Real-time object detection using pre-
trained deep learning models MobileNet-SSD. In Proceedings of 2020 the 6th
international conference on computing and data engineering (pp. 44-48),
https://doi.org/10.1145/3379247.3379264.

Zhang, T., Zhang, X., Li, J., Xu, X., Wang, B., Zhan, X., ... & Wei, S. (2021). SAR ship
detection dataset (SSDD): Official release and comprehensive data analysis. Remote
Sensing, 13(18), 3690, https://doi.org/10.3390/rs13183690.

Zhang, Y., Yuan, Y., Feng, Y., & Lu, X. (2019). Hierarchical and robust convolutional neural
network for very high-resolution remote sensing object detection. IEEE Transactions

on Geoscience and Remote Sensing, 57(8), 5535-5548,
https://doi.org/10.1109/TGRS.2019.2900302.

Zhigang, Z., Huan, L., Pengcheng, D., Guangbing, Z., Nan, W., & Wei-Kun, Z. (2018, June).
Vehicle target detection based on R-FCN. In 2018 Chinese Control And Decision
Conference (CCDC) (pp. 5739-5743). IEEE.

Zhiqiang, W., & Jun, L. (2017, July). A review of object detection based on convolutional
neural network. In 2017 36th Chinese control conference (CCC) (pp. 11104-11109).
IEEE, https://doi.org/10.23919/ChiCC.2017.8029130.

Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint
arXiv:1904.07850, https://doi.org/10.48550/arXiv.1904.07850

Zhu, P., Wen, L., Du, D., Bian, X., Ling, H., Hu, Q., ... & Song, Z. (2018). Visdrone-det2018:
The vision meets drone object detection in image challenge results. In Proceedings of
the European Conference on Computer Vision (ECCV) Workshops (pp. 0-0).

Zou, Z., Chen, K., Shi, Z., Guo, Y., & Ye, J. (2023). Object detection in 20 years: A
survey. Proceedings of the IEEE, https://doi.org/10.1109/JPROC.2023.3238524.

