
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 60, 2022, Art. no. 4707312

Detecting Occluded and Dense Trees in Urban Terrestrial Views With a High-Quality Tree Detection Dataset

Yongzhen Wang, Xuefeng Yan, Hexiang Bao, Yiping Chen, Senior Member, IEEE, Lina Gong, Mingqiang Wei, Senior Member, IEEE, and Jonathan Li, Senior Member, IEEE

Abstract—Urban trees are often densely planted along the two sides of a street. When observed from a fixed view, they are inevitably occluded by each other and by passing vehicles. The high density and occlusion of urban tree scenes significantly degrade the performance of object detectors. This article raises an intriguing learning-related question—if a module is developed to enable the network to adaptively cope with occluded and unoccluded regions while enhancing its feature extraction capabilities, can the performance of a cutting-edge detection model be improved? To answer it, a lightweight yet effective object detection network is proposed for discerning occluded and dense urban trees, called occluded and dense-urban tree detection network (OD-UTDNet). The main contribution is a newly designed dilated attention cross stage partial (DACSP) module. DACSP can expand the fields of view of OD-UTDNet for paying more attention to the unoccluded region, while enhancing the network's feature extraction ability in the occluded region. This work further explores both the self-calibrated (SC) convolution module and GFocal loss, which enhance OD-UTDNet's ability to resolve the challenging problem of high densities and occlusions. Finally, to facilitate the detection task of urban trees, a high-quality urban tree detection dataset is established, named Urban Tree Detection (UTD); to our knowledge, it is the first of its kind. Extensive experiments show clear improvements of the proposed OD-UTDNet over 12 representative object detectors on UTD. The code and dataset are available at https://github.com/yz-wang/OD-UTDNet.

Index Terms—Dilated attention cross stage partial module (DACSP), high density and occlusion, occluded and dense-urban tree detection network (OD-UTDNet), urban tree detection (UTD), UTD dataset.

Manuscript received 20 December 2021; revised 5 May 2022; accepted 14 June 2022. Date of publication 20 June 2022; date of current version 30 June 2022. This work was supported in part by the National Natural Science Foundation of China under Grant 62172218; in part by the Joint Fund of National Natural Science Foundation of China and Civil Aviation Administration of China under Grant U2033202; in part by the 14th Five-Year Planning Equipment Pre-Research Program under Grant JCKY2020605C003; and in part by the Free Exploration of Basic Research Project, Local Science and Technology Development Fund Guided by the Central Government of China under Grant 2021Szvup060. (Corresponding author: Xuefeng Yan.)

Yongzhen Wang, Xuefeng Yan, Hexiang Bao, and Lina Gong are with the School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China (e-mail: wangyz@nuaa.edu.cn; yxf@nuaa.edu.cn; bhxhhxx@nuaa.edu.cn; gonglina@nuaa.edu.cn).

Yiping Chen is with the Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University, Xiamen 361005, China (e-mail: chenyiping@xmu.edu.cn).

Mingqiang Wei is with the Shenzhen Research Institute, Nanjing University of Aeronautics and Astronautics, Shenzhen 518038, China (e-mail: mingqiang.wei@gmail.com).

Jonathan Li is with the Department of Geography and Environmental Management and the Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: junli@uwaterloo.ca).

Digital Object Identifier 10.1109/TGRS.2022.3184300

I. INTRODUCTION

THERE is a proverb in China: "Up above there is heaven; down below there are Suzhou and Hangzhou." It means that these two cities have picturesque nature. Besides Suzhou and Hangzhou, many cities in the world are beautiful and attractive, in which trees play an important role. The trees in the city are called the lungs of the city. There is a great demand for determining the quantity and monitoring the health status of urban trees. Accordingly, tree detection becomes an essential and fundamental prerequisite for resource appraisal and urban vegetation management [1]–[4].

The conventional tree detection methods are mainly manual and can only be conducted one tree at a time, which is costly and time-consuming. Subsequently, remote sensing techniques such as aerial photography and light detection and ranging (LiDAR) have been developed for assessing urban trees, both directly and indirectly [5]–[9]. Although these efforts are effective in detecting urban trees, the collection of remote sensing data is difficult and costly to commission. Many works attempt to study image-based urban tree detection in light of the convenient acquisition of red, green, and blue (RGB) images. Lin et al. [10] propose a detection framework for detecting individual trees in unmanned aerial vehicle (UAV) images. Chen et al. [11] develop an improved species-based particle swarm optimization algorithm termed K-Dimensional Tree-Species-based Particle Swarm Optimization (KDT-SPSO) for palm tree detection. KDT-SPSO uses a K-dimensional (KD)-tree structure to accelerate the nearest neighbor search and obtains promising detection results. However, with these conventional tree detection methods, users have to tweak parameters multiple times to obtain satisfactory detection results in practical scenarios. This inconvenience heavily discounts their efficiency and user experience.

With advances in deep learning, numerous excellent neural-network-based detection frameworks have emerged in the field of object detection [12]–[16], which bring great opportunities for urban tree detection. Different from the conventional detection methods, learning-based detectors commonly use convolutional neural networks (CNNs) to detect trees directly from the captured images in an end-to-end fashion. Accordingly, these algorithms can produce promising detection results in a variety of scenarios.

However, since the majority of urban trees are planted along roads, coupled with heavy occlusion in complex and crowded urban spaces, most existing detectors cannot accurately detect dense urban trees. As demonstrated in Fig. 1, urban trees may be occluded by vehicles, and trees may be occluded by each other. Furthermore, most existing learning-based detectors have numerous parameters and high computational costs, making them unsuitable for deployment on various memory-constrained mobile devices. To this end, this work aims to develop a lightweight yet effective detection network for urban tree detection, especially under occluded and dense scenarios.

In this article, we respond to an intriguing learning-related question. That is, whether developing a module that enables the network to adaptively cope with occluded regions, while enhancing its feature extraction capability, can improve the performance of cutting-edge detection models. A novel detection network is proposed for discerning occluded and dense urban trees, called OD-UTDNet. Note that the trees here are mostly street trees and those commonly found in city parks, which are usually mature trees with clear stems. In urban spaces, in addition to complex and heavy traffic, most of the trees planted along the roads are usually quite dense, which creates a severe occlusion problem for tree detection (see Fig. 1). To address this issue, we develop a dilated attention cross stage partial (DACSP) module to expand the receptive field of OD-UTDNet and enhance its feature extraction capabilities. In this way, DACSP can make our network pay more attention to the unoccluded areas, thus improving the detection accuracy of the model in occluded cases. In addition, an effective feature enhancement module (i.e., self-calibrated convolutions) [19] is used to further enlarge the fields of view of each convolutional layer and enrich the output features. Moreover, to cope with the problem of dense detection in this work, the GFocal loss [20] is introduced in the design of OD-UTDNet, which has been proven effective in dense object detection.

One challenge for applying deep learning techniques to urban tree detection is the need for a benchmark dataset. The lack of such datasets is a pervasive problem in the field of urban tree detection due to expensive data collection and annotation costs. To the best of our knowledge, there is currently no public image dataset related to urban trees. Therefore, we capture and collect 1860 images of urban tree scenes and label them in the format of Pattern Analysis, Statistical Modelling and Computational Learning-Visual Object Classes (PASCAL-VOC) to establish an urban tree detection dataset, named UTD.

Extensive experiments on the UTD benchmark demonstrate that our network outperforms 12 representative state-of-the-art object detection algorithms. The main contributions of our method are summarized as follows:
1) A novel lightweight yet effective detection framework is proposed for discerning occluded and dense urban trees, called OD-UTDNet.
2) To address the heavy occlusion problem in urban trees, OD-UTDNet leverages a well-designed DACSP module and self-calibrated (SC) convolutions to expand the receptive field of our model and enhance its feature extraction ability. In addition, the GFocal loss is introduced in the model to cope with the problem of dense tree detection.
3) We establish and release an object detection dataset of urban tree images, dubbed UTD. To the best of our knowledge, this is the first image dataset that focuses on urban tree detection.
4) The proposed OD-UTDNet is compared with 12 representative state-of-the-art object detection algorithms via extensive experiments. Consistently and substantially, OD-UTDNet performs favorably against them.

The rest of this work is organized as follows. In Section II, the related work is briefly reviewed from three aspects: conventional tree detection techniques, deep-learning-based tree detection techniques, and general object detection techniques. Section III describes the details of the proposed OD-UTDNet for occluded and dense tree detection. Section IV first exhibits the established UTD dataset and then presents the conducted experiments and discusses the results, followed by conclusions and future work in Section V.

II. RELATED WORK

This section roughly divides the discussion into three parts: conventional tree detection techniques, deep-learning-based tree detection techniques, and general object detection techniques.

A. Conventional Tree Detection Techniques

As a long-standing and fundamental task in both remote sensing and computer vision, tree detection has attracted a great deal of research attention in academia and industry [7], [9], [21]–[23]. The traditional tree detection methods are commonly based on manual field measurements, which are costly and time-consuming. Afterward, aerial photography and LiDAR techniques are often used to directly or indirectly assess urban trees [5], [6], [8], [24]. However, given that the collection of remote sensing data is difficult and expensive, many efforts have begun to study image-based urban tree detection, particularly with RGB images. Srestasathiern et al. [25] propose a palm tree detection approach based on high-resolution satellite images and achieve a detection rate of about 90%. Jiang et al. [26] propose a graphics processing unit (GPU)-accelerated scale-space filtering (SSF) algorithm to detect papaya trees with UAV images. SSF shows a clear improvement over other algorithms in both speed and accuracy. Donmez et al. [7] use a connected components labeling (CCL) algorithm to detect citrus trees based on high-resolution UAV images and achieve a high accuracy rate.

B. Tree Detection via Deep Learning

Advances in deep learning bring a big opportunity for automatic tree detection. Iqbal et al. [27] present a deep learning approach for detection and segmentation of coconut trees in aerial imagery. They use a Mask R-CNN model with a ResNet50/ResNet101-based architecture and achieve an overall 91% mean average precision (AP) for coconut tree detection. Hartling et al. [28] conduct extensive experiments and reveal that DenseNet is more effective for urban tree classification and detection, outperforming the popular random forest (RF) and support vector machine (SVM) techniques.


Fig. 1. Detection results by different methods on two typical examples of occluded and dense urban scenes. From (a) to (d): the detection results by
(a) RetinaNet [14], (b) CenterNet [17], (c) YOLOv5s [18], and (d) our OD-UTDNet, respectively. Occluded and dense scenes strongly degrade the performance
of various object detectors. Even humans have difficulty determining the number of trees in such challenging images. As observed, the proposed OD-UTDNet
can discern ten trees from these two samples, while other detectors can only detect up to three trees, which indicates that our method can more robustly detect
the trees in such scenarios with higher confidence.

Dong et al. [29] develop a single-tree detection algorithm for high-resolution remote sensing images based on a cascade neural network. They design a classifier with a back propagation (BP) neural network and analyze the differences between tree and nontree samples. Liu et al. [30] propose a point-based neural network named LayerNet for tree species classification in forest regions. LayerNet can divide multiple overlapping layers in Euclidean space to extract the local structural features of the tree. Briechle et al. [31] propose a dual-CNN-based network called Silvi-Net for the classification of 3-D trees. The experimental results demonstrate that Silvi-Net outperforms other state-of-the-art 3-D tree detection approaches, especially in the case of small samples. Ferreira et al. [32] develop a fully convolutional neural network model for individual tree detection and species classification in Amazonian regions. They adopt a low-cost UAV to capture RGB images and then detect trees in these images. Xie et al. [33] present an end-to-end trainable framework for street tree detection based on the Fast R-CNN network. The experimental results show a clear improvement of their method over other object detectors.

C. Object Detection

Object detection aims at predicting both the class and bounding box of the target object, which is an important research area in computer vision [34], [35]. Recently, with the rapid development of CNNs, learning-based algorithms have dominated modern object detection for years.

Current object detection techniques can roughly fall into two categories, i.e., one stage and two stage. Two-stage approaches first adopt region proposal methods [36] to produce a sparse set of candidate proposals and then refine their locations and predict their specific categories. The most representative two-stage detector is R-CNN [35], which is the first successful attempt to replace the classifier with a CNN, achieving large improvements in detection accuracy. Since then, numerous variants based on this framework have been developed, including Fast R-CNN [37], Faster R-CNN [38], Cascade R-CNN [39], and Grid R-CNN [40]. Despite achieving remarkable detection accuracy, two-stage detectors are not satisfactory in terms of inference speed. Accordingly, to achieve a better speed–performance trade-off, various one-stage detectors are proposed for real-time detection. Representative algorithms include the You Only Look Once (YOLO) series [13], [18], [41]–[43], RetinaNet [14], CenterNet [17], the Single Shot MultiBox Detector (SSD) [12], etc. In general, one-stage detectors are faster, but their detection performance is slightly weaker than that of two-stage detectors.

III. OD-UTDNET

Recall a time when you walk along a street. You may find that the urban trees planted along the two sides of the street are dense and occluded by each other and by passing vehicles. Thus, you can only observe a part of these trees from your view, potentially leading to misunderstanding these trees. Similarly, the high density and occlusion of urban tree scenes significantly degrade the performance of cutting-edge object detectors, since these detectors can only perceive a part of the trees and pay much attention to the occluded region.

Is it possible to make a detector pay more attention to the unoccluded region while enhancing its feature extraction ability in the occluded region? If the answer is positive, the performance of cutting-edge detection models for handling high density and occlusion problems can be improved. This is the focus of this work.

In this section, we first describe the overall architecture of the urban tree detection network, namely, OD-UTDNet. After that, the proposed DACSP module is elaborated to demonstrate how we address the heavy occlusion problem in the task of urban tree detection. Finally, the SC convolutions and GFocal loss are introduced to optimize OD-UTDNet to further enhance the detection accuracy in dense tree scenes.


Fig. 2. Overall architecture of OD-UTDNet. It mainly consists of three parts: the backbone network, the neck module, and the detection head. Given an
urban tree image, we first use the backbone network to extract complex and deep features from the input image. After that, the neck module is adopted to
further fuse the multiscale features and transmit them to the detection head to predict the final detection result. DACSP refers to DACSP network, and SC
Conv refers to SC convolutions. The main contribution is a newly designed DACSP module. DACSP expands the fields of view of the network for paying
more attention to the unoccluded region and enhances the network’s feature extraction ability in the occluded region.
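For illustration, the backbone-neck-head pipeline summarized in Fig. 2 can be pictured as a simple composition of three modules. The sketch below is a minimal, hypothetical PyTorch arrangement (the class name and module internals are placeholders, not the authors' released implementation).

```python
import torch
import torch.nn as nn

class ODUTDNetSketch(nn.Module):
    """Illustrative three-part layout following Fig. 2 (not the released code)."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # extracts multiscale deep features (with DACSP blocks)
        self.neck = neck          # PANet-style fusion of the multiscale features
        self.head = head          # predicts class scores, boxes, and confidences

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)  # list of feature maps at several strides
        fused = self.neck(feats)      # fused feature pyramid
        return self.head(fused)       # per-level detection outputs
```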

A. Overview of OD-UTDNet

For urban tree detection, the first consideration is how to address the problem of heavy occlusion in dense urban spaces, which greatly degenerates the detection accuracy of existing detectors.

As is known, the YOLO series (e.g., the latest version, YOLOv5) [13], [18], [41]–[43] has been successfully applied in object detection. Although YOLOv5 has achieved promising results on various object detection benchmarks (e.g., Microsoft Common Objects in Context (MSCOCO) [44] and PASCAL-VOC [45]), there are still many challenging yet unsolved problems. First, the YOLOv5 family is originally designed for object detection in general yet easy scenarios, without considering how to cope with nontrivial dense object detection scenes. Second, like most existing detectors, the YOLOv5 family is susceptible to the occlusion problems in the urban tree detection task, resulting in a significant decrease in detection accuracy. Third, YOLOv5s is the lightweight and efficient member of the YOLOv5 family, which is promising for resource-limited mobile devices, but its detection accuracy is thus largely degenerated for high-density and occluded urban tree scenes.

To this end, we start from these three aspects and propose an enhanced object detection network based on YOLOv5s for occluded and dense urban tree detection. Here, the proposed OD-UTDNet is built on top of the latest version of the YOLO series, namely, YOLOv5s [18] (the smallest version of YOLOv5). Actually, our OD-UTDNet has the potential to benefit from a more complex version of YOLOv5, such as YOLOv5m or YOLOv5l, to further improve its performance. However, we choose the smallest version because a lightweight model is more desirable for deploying automatic tree detection on many memory-constrained devices.

As illustrated in Fig. 2, the proposed network consists of three main components, i.e., the backbone network, the neck module, and the detection head. Given an input image, we first leverage the focus operation in the backbone network to divide the image into different granularities and aggregate them together.
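The focus (slicing) step mentioned above can be viewed as a space-to-depth rearrangement. The snippet below is a minimal sketch of such an operation, written in the style of the YOLOv5 Focus layer as an assumption; it is illustrative only and not claimed to be the exact operation used in OD-UTDNet.

```python
import torch
import torch.nn as nn

class FocusSketch(nn.Module):
    """Illustrative space-to-depth slicing: each 2x2 neighborhood is split into
    four subsampled views, concatenated along channels, and mixed by a convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        patches = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )  # (B, 4C, H/2, W/2)
        return self.conv(patches)
```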


Then, several enhanced cross stage partial (CSP) modules [46], i.e., dilated attention CSP (DACSP) modules, are used to extract complex and deep features from the restructured image. DACSP is a novel feature enhancement module developed to expand the receptive field of our model, thus making the network pay more attention to the unoccluded areas to reduce the impact of heavy occlusions. After that, the neck module is adopted to produce the feature pyramids based on the path aggregation network (PANet) [47] and then transmit the feature maps to the detection head. Finally, the detection head module is used to generate the final class probability score, bounding boxes, and confidence score.

Moreover, to further boost the detection performance of OD-UTDNet and address the dense detection problem in urban spaces, we introduce a more recent multiscale feature enhancement module (SC convolutions) and an up-to-date GFocal loss into our network. Both of them have been widely used in object detection tasks and have been demonstrated to be effective in improving detection accuracy, as described in the following subsections.

B. DACSP Network

Theoretically, the feature extraction ability of the network directly determines the performance of the model. Thus, we argue that there are two solutions to reduce the impact of occlusions on urban tree detection. One is to expand the receptive field of the network to help the model detect trees from the unoccluded areas. The other is to enhance the feature extraction ability of the network so that trees can be detected directly from the occluded areas. To improve the feature extraction capability of the network, a natural idea is to deepen the network or use more complex network architectures (e.g., graph neural networks (GNNs) [48] and Transformers [49]). However, a lightweight model is more desirable to deploy for automatic tree detection tasks due to the limited hardware and computing ability of intelligent detection systems (e.g., intelligent vehicles and UAVs). Given this, we consider using a dilated convolution module (expanding the receptive field) and a lightweight attention network (enhancing feature extraction ability) in the design of the CSP network to achieve a better parameter–performance trade-off. This enhanced CSP network is called dilated attention CSP (DACSP), and its architecture is depicted in Fig. 3.

Fig. 3. Architecture of the dilated attention CSP network. In our design, we use a dilated convolution module to expand the receptive field of the convolutional layer and a lightweight attention network to enhance the feature extraction ability of the CSP network. Both of them can help the model address severe occlusions in urban spaces and improve the detection accuracy of OD-UTDNet. FCANet represents the frequency channel attention network. DConv3 × 3 refers to a dilated convolution with a kernel size of 3 (dilation rate = 3).

Dilated convolutions can expand the receptive field of the network without increasing the computational effort, which has been demonstrated to be effective in improving the performance of networks. However, due to the special calculation method of dilated convolution, continuous information of the image will be destroyed, resulting in the loss of partial features. To address such a problem, we combine dilated convolutions with conventional convolutions to expand the receptive field of the network while ensuring that the image information will not be lost. Specifically, we first use the dilated convolution module and the conventional convolution network to calculate the input feature maps and then concatenate them through a residual connection to extract multiscale features while expanding the receptive field of the CSP network (see Fig. 3). In this way, our model can focus on more unoccluded areas, thus reducing the impact of occlusions on detection accuracy.

Attention mechanisms have been widely used in improving the performance of neural networks [50], [51]. We consider adopting a lightweight attention module to boost the feature extraction capability of the CSP network, thus enhancing the detection accuracy of OD-UTDNet. Motivated by Qin et al. [52], an up-to-date frequency channel attention network (FCANet) is used in the design of the CSP network to improve the detection performance of our model, as exhibited in Fig. 3. FCANet is extended on the basis of the Squeeze-and-Excitation Network (SENet) [51] and combined with the discrete cosine transform (DCT) to develop a novel multispectral channel attention mechanism. It enables the CSP network to learn the weights of different features adaptively, thus enhancing the detection performance of OD-UTDNet.
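To make the DACSP idea more concrete, the following is a minimal sketch of a block that runs a dilated 3 × 3 convolution and a conventional 3 × 3 convolution in parallel, fuses them with a residual connection, and applies a lightweight channel attention. The class names are hypothetical, and the attention here is a simplified SENet-style stand-in for FCANet rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class ChannelAttentionSketch(nn.Module):
    """Simplified SENet-style channel attention (stand-in for FCANet)."""
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)  # per-channel re-weighting

class DACSPBlockSketch(nn.Module):
    """Parallel dilated + conventional 3x3 convolutions, concatenated, attended,
    and added back to the input (illustrative only)."""
    def __init__(self, ch: int, dilation: int = 3):
        super().__init__()
        self.dconv = nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation)
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)
        self.attn = ChannelAttentionSketch(ch)

    def forward(self, x):
        wide = self.dconv(x)    # enlarged receptive field
        local = self.conv(x)    # preserves continuous local detail
        fused = self.fuse(torch.cat([wide, local], dim=1))
        return self.attn(fused) + x  # residual connection
```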


C. Self-Calibrated Convolutions

The SC convolution network [19] is an enhanced CNN structure, which is used to build long-range spatial and interchannel dependencies around each spatial location. That is, it can enlarge the receptive field of each convolutional layer and improve the feature extraction ability of CNNs. Therefore, we consider using the SC convolution network as a feature enhancement module to help OD-UTDNet address the aforementioned occlusion problem and boost the detection accuracy of the model.

The architecture of SC convolutions is demonstrated in Fig. 4. Given an input feature map X with C channels, we first split it into two feature maps X1 and X2 with C/2 channels each. Then, X1 is sent to the SC branch for feature transformation and fusion. In this branch, three filters (i.e., K2, K3, and K4) are adopted to extract and fuse multiscale features from X1. Next, the filter K1 is used to transform X2 and extract features from it to obtain the other half of the result, Y2. Finally, we concatenate Y1 and Y2 to produce the final output Y. In our design, the SC convolutions are introduced into the neck module of OD-UTDNet to expand the fields of view of the convolutional layers and extract multiscale features, thus enforcing the network to pay more attention to the unoccluded areas. In this way, our model can well address the unpredictable occlusion scenarios in urban tree detection tasks.

Fig. 4. Architecture of the SC convolution network. It can build long-range spatial and interchannel dependencies around each spatial location, thus expanding the receptive field of each convolutional layer and enhancing the feature extraction ability of CNNs.

D. Generalized Focal Loss

GFocal loss (GFL) [20] is an improved version of focal loss [14], which aims to solve the problem of imbalance between positive and negative samples in dense object detection tasks. In the training process, focal loss reduces the weight of numerous simple negative samples, making the network pay more attention to difficult samples (i.e., dense objects), thus improving the detection accuracy of dense objects. Therefore, a natural idea is to use focal loss in the design of our framework to cope with dense scenes in urban tree detection tasks. However, focal loss itself has many limitations; for example, it can only handle discrete labels such as 0 or 1 but cannot deal with continuous labels between 0 and 1. In addition, the inconsistency of the classification and regression calculation methods during training and inference will degenerate the detection accuracy of the model. To this end, Li et al. [20] develop an enhanced focal loss (GFocal loss) to address the above problems and further improve the accuracy of dense object detection tasks. GFocal loss mainly improves focal loss from two aspects: one is to incorporate quality estimation into the class prediction vector to eliminate the risk of inconsistency between the training and inference phases, and the other is to leverage a vector instead of the Dirac delta distribution to accurately depict the flexible distributions in real data.

GFocal loss consists of quality focal loss (QFL) and distribution focal loss (DFL). To address the above-mentioned inconsistency problem between the training and test phases, QFL adopts a joint representation of the localization quality (IoU score) and the classification score, where the standard one-hot category label y ∈ {0, 1} turns into a possible float target y ∈ [0, 1] on the corresponding category. Note that y = 0 represents the negative samples with a quality score equal to 0, and 0 < y ≤ 1 refers to the positive samples with the target IoU score y. In addition, QFL uses the IoU score between the estimated bounding box and its corresponding ground-truth label to represent the localization quality label y, which takes a value between 0 and 1. However, the original form of focal loss only supports {0, 1} discrete labels, whereas the new joint representation labels contain decimals. Therefore, we employ QFL to extend the focal loss for enabling successful training under the new continuous labels as

QFL(σ) = −|y − σ|^β ((1 − y) log(1 − σ) + y log(σ))    (1)

where −((1 − y) log(1 − σ) + y log(σ)) denotes the complete cross-entropy loss function, y ∈ [0, 1] denotes the new quality label, and σ ∈ (0, 1) represents the predicted result. Note that the sigmoid operator σ(·) is used for the multiclass implementation, with the output being denoted as σ for simplicity. Similar to the focal loss, the term |y − σ|^β is used to adjust the weights of positive/negative samples during the training phase. As the quality estimation becomes accurate (σ → y), the modulating factor goes to 0 and the weight for well-estimated examples is reduced, where the parameter β is used to control the weighting rate (β = 2 works best in our experiments). QFL still retains the classification vector, but the physical meaning of the corresponding category position confidence is no longer a classification score but a quality prediction score. Furthermore, to accurately describe the distribution in real data, DFL is used to accelerate the network's training.

DFL first converts the integral over the continuous domain into a discrete representation, i.e., {y_0, y_1, ..., y_i, ..., y_n}. Therefore, given the discrete distribution property Σ_{i=0}^{n} P(y_i) = 1, the predicted regression value ŷ (y_0 ≤ ŷ ≤ y_n) is formulated as

ŷ = Σ_{i=0}^{n} P(y_i) y_i.    (2)

Hence, the general distribution P(x) can be directly implemented through a Softmax S(·) layer with n + 1 units. For simplicity, the output of P(y_i) is marked as S_i. In this way, ŷ can be trained in an end-to-end manner with existing loss objectives like the IoU loss [53]. Furthermore, considering that the most appropriate underlying location would not be far away from the coarse label, DFL is used to help the network rapidly focus on the values near the label y by enlarging the probabilities of y_i and y_{i+1} (y_i ≤ y ≤ y_{i+1}). Consequently, DFL is extended on the basis of the complete cross-entropy part in QFL as

DFL(S_i, S_{i+1}) = −((y_{i+1} − y) log(S_i) + (y − y_i) log(S_{i+1})).    (3)

Intuitively, DFL is designed to focus on enlarging the probabilities of the values around y (i.e., y_i and y_{i+1}) and can thus improve the training efficiency.
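As a concrete reference for (1)–(3), the snippet below sketches the quality focal loss, the expected-value decoding of (2), and the distribution focal loss in PyTorch. Tensor shapes, reductions, and integration with the detector head are simplifying assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def quality_focal_loss_sketch(logit: torch.Tensor, y: torch.Tensor, beta: float = 2.0):
    """Eq. (1): soft IoU-aware target y in [0, 1], sigmoid prediction sigma."""
    sigma = logit.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logit, y, reduction="none")
    return ((y - sigma).abs() ** beta) * ce  # modulating factor |y - sigma|^beta

def dfl_decode_sketch(logits: torch.Tensor) -> torch.Tensor:
    """Eq. (2): expected value of the discretized distribution (n + 1 bins)."""
    bins = torch.arange(logits.size(-1), dtype=torch.float32, device=logits.device)
    return (logits.softmax(dim=-1) * bins).sum(dim=-1)

def distribution_focal_loss_sketch(logits: torch.Tensor, y: torch.Tensor):
    """Eq. (3): y is a continuous offset in [0, n]; logits has n + 1 bins."""
    yi = y.long().clamp(max=logits.size(-1) - 2)   # left bin index y_i
    wr = y - yi.float()                            # weight for the right bin y_{i+1}
    wl = 1.0 - wr                                  # weight for the left bin y_i
    log_p = F.log_softmax(logits, dim=-1)
    return -(wl * log_p.gather(-1, yi.unsqueeze(-1)).squeeze(-1)
             + wr * log_p.gather(-1, (yi + 1).unsqueeze(-1)).squeeze(-1))
```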


Finally, QFL and DFL can be unified to form the GFocal loss. Assume that a network estimates probabilities for two variables y_l and y_r (y_l < y_r) as p_{y_l} and p_{y_r} (p_{y_l} ≥ 0, p_{y_r} ≥ 0, p_{y_l} + p_{y_r} = 1), with a final estimation of their linear combination being ŷ = y_l p_{y_l} + y_r p_{y_r} (y_l ≤ ŷ ≤ y_r). Note that the corresponding continuous label y for the prediction ŷ also satisfies y_l ≤ y ≤ y_r. Taking the absolute distance |y − ŷ|^β as the modulating factor, the generalized focal loss (GFL) is formulated as

GFL(p_{y_l}, p_{y_r}) = −|y − (y_l p_{y_l} + y_r p_{y_r})|^β ((y_r − y) log(p_{y_l}) + (y − y_l) log(p_{y_r})).    (4)

Through extensive experiments, we find that DFL provides our model with relatively limited performance gains while adding additional computational overhead. Hence, only QFL is adopted in the design of OD-UTDNet.

IV. EXPERIMENTS AND DISCUSSION

In this section, we evaluate OD-UTDNet on the proposed UTD benchmark and conduct comparisons with SSD300 [12], RetinaNet [14], YOLOv3 [43], CenterNet [17], Grid R-CNN [40], Cascade R-CNN [39], YOLOv4 [42], You Only Look One-level Feature (YOLOF) [54], Sparse R-CNN [55], and YOLOv5 [18]. After that, a comprehensive ablation study is implemented to analyze the effectiveness of each component designed in our OD-UTDNet. We also describe the details of the established UTD dataset and the specific training settings. The details are as follows.

A. Dataset

Since there is no publicly available urban tree image dataset, to train and evaluate the proposed network, a new object detection dataset of urban trees is established, called UTD. To the best of our knowledge, UTD is the first image dataset that focuses on urban tree detection research. Specifically, our UTD contains a total of 1860 tree images captured in various urban scenes. These images are collected from the Internet or captured by us. Considering that an accurately labeled dataset is essential for object detection tasks, we use the Colabeler tool [56] to manually annotate our UTD in the PASCAL-VOC format. Moreover, to further ensure the accuracy of the labels, experts from Nanjing Forestry University are invited to proofread all the annotated images. The task of annotating is difficult due to the heavy occlusion of trees in some images, and it takes about 5–10 min to label each image. After labeling these 1860 images, we obtain 5540 ground-truth bounding boxes. To train our OD-UTDNet, the dataset is divided into three groups, i.e., the training set, the validation set, and the test set, as exhibited in Table I. In addition, to demonstrate our dataset more intuitively, four image examples from the UTD dataset are exhibited in Fig. 5, each of them containing more than three urban trees. Although the proposed UTD contains various categories of urban trees, all the different categories of trees are cast as one class here, since we mainly focus on the tree detection task in this work; we will further classify the detected trees in subsequent work. Furthermore, the established UTD will be released on our GitHub website to accelerate the research on automatic tree detection tasks.

TABLE I. Detailed Information About the UTD Dataset.

Fig. 5. Examples of different scenes in the UTD dataset, where the red boxes refer to ground truths.

B. Implementation Details

1) Training Details: The proposed OD-UTDNet is implemented using PyTorch 1.9 on a system with an Intel(R) Core(TM) i7-9700 CPU, 16-GB random access memory (RAM), and an Nvidia GeForce RTX 3090 GPU. To optimize our model, we use the stochastic gradient descent (SGD) optimizer with a mini-batch size of 32, where the momentum parameter and weight decay are set to 0.937 and 0.0005, respectively. The total number of epochs and the initial learning rate are set to 100 and 0.01, respectively. Considering that training the network on pretrained weights can accelerate the convergence of the model, our OD-UTDNet is trained on UTD with the MSCOCO [44] pretrained weights.

2) Evaluation Settings: To quantitatively evaluate the performance of OD-UTDNet, AP and AP50 are used as the evaluation metrics, which are the most widely used evaluation indexes in object detection tasks. The proposed OD-UTDNet is compared with several state-of-the-art (SOTA) object detection methods. These detectors can be classified into two categories: 1) the two-stage-based Grid R-CNN [40] and Cascade R-CNN [39] and 2) the one-stage-based SSD300 [12], RetinaNet [14], YOLOv3 [43], CenterNet [17], YOLOv4 [42], YOLOF [54], and YOLOv5 [18]. Moreover, we also compare OD-UTDNet with a recent sparse-based object detection framework, Sparse R-CNN [55].
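For reference, the training setup described in Section IV-B can be captured in a small configuration sketch; the dictionary keys and the helper function are hypothetical names for illustration, not part of the released training script.

```python
import torch

# Hyperparameters quoted from Section IV-B.1; the surrounding structure is illustrative.
TRAIN_CFG = {
    "momentum": 0.937,
    "weight_decay": 5e-4,
    "initial_lr": 0.01,
    "epochs": 100,
    "batch_size": 32,
    "pretrained": "mscoco",  # start from MSCOCO-pretrained weights
}

def build_optimizer_sketch(params, cfg=TRAIN_CFG):
    # SGD with the momentum and weight-decay settings reported in the paper.
    return torch.optim.SGD(
        params,
        lr=cfg["initial_lr"],
        momentum=cfg["momentum"],
        weight_decay=cfg["weight_decay"],
    )
```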


TABLE II. Comparison of OD-UTDNet With State-of-the-Art Object Detectors on the UTD Test Set. We Evaluate the Run Time With Batch Size 1 on a Single NVIDIA GeForce RTX 3090 GPU.
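Table II reports AP and AP50. As a reminder of the criterion behind AP50, the snippet below sketches the intersection-over-union test that decides whether a predicted tree box matches a ground-truth box at the 0.5 threshold; it is an illustrative helper, not the evaluation code used to produce the table.

```python
def iou_xyxy(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); returns intersection-over-union in [0, 1].
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def matches_at_ap50(pred_box, gt_box):
    # A detection counts as a true positive for AP50 when IoU >= 0.5.
    return iou_xyxy(pred_box, gt_box) >= 0.5
```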

Fig. 6. Detection results by different methods in a heavy occlusion case. From (a) to (h): detection results by (a) SSD300 [12], (b) RetinaNet [14], (c) Cascade
R-CNN [39], (d) CenterNet [17], (e) YOLOv4 [42], (f) YOLOF [54], (g) YOLOv5s [18], and (h) our OD-UTDNet, respectively. As can be seen, there are
a total of six trees in the image, and only three methods, namely, Cascade R-CNN, YOLOF, and OD-UTDNet, can detect all the trees. However, Cascade
R-CNN and YOLOF suffer from the mis-detection problem as they discern these six trees as 11 and seven trees, respectively. In contrast, our OD-UTDNet
can effectively address unpredictable occlusion scenarios.

C. Comparison With State-of-the-Arts

We report the AP, AP50, and frames per second (FPS) metrics of 12 state-of-the-art detection algorithms (including three versions of YOLOv5) on the established UTD test set in Table II. To make a fair comparison, all the compared approaches are retrained on the UTD dataset, following the settings in their articles. As observed, the proposed OD-UTDNet achieves the best performance, with 37.7 AP and 78.5 AP50, compared with the SOTA approaches. OD-UTDNet realizes an excellent parameter–performance trade-off, since its parameter amount and calculation cost are acceptable and rank second among the compared methods. In addition, the inference speed of OD-UTDNet is fast, and it can detect 62 images per second, which is suitable for deployment on memory-constrained devices.

To further evaluate the effectiveness of OD-UTDNet under occluded and dense scenes, we test our method and other approaches in two heavy occlusion cases. Visual comparisons of the detection results are depicted in Figs. 6 and 7. As observed, the captured images contain severe occlusion, where trees occlude each other and merge into a whole. It is practically impossible for a human to count the quantity of trees in these images. As demonstrated in Figs. 6 and 7, most compared detection algorithms miss some trees due to heavy occlusion. For Cascade R-CNN and YOLOF, although they have overcome the impact of occlusion to some extent, the confidence of the detected trees is quite low. In addition, the Cascade R-CNN method suffers from significant mis-detection, and it can easily identify other objects as trees. In contrast, the proposed DACSP module enhances the feature extraction capability of OD-UTDNet while expanding the receptive field of the model, thus reducing the impact of occlusions on detection accuracy. Consequently, our OD-UTDNet exhibits better results in detecting occluded and dense trees.

Likewise, to verify the generalization ability of OD-UTDNet in general scenes (without severe occlusion), we present a visual comparison of the detection results in cases of few occlusions, as exhibited in Figs. 8 and 9. It can be observed that even in the case of only a small amount of occlusion, many detection algorithms still miss some trees.


Fig. 7. Detection results by different detectors in a heavy occlusion case. Only Cascade R-CNN and our OD-UTDNet can detect all the trees in the image,
while other detectors have more or less missed trees. Similar to the results in Fig. 6, Cascade R-CNN still has the problem of mis-detection. Comparatively, our
method can handle occlusion cases well. (a) SSD300. (b) RetinaNet. (c) Cascade R-CNN. (d) CenterNet. (e) YOLOv4. (f) YOLOF. (g) YOLOv5s. (h) Ours.

Fig. 8. Detection results by different approaches in a case of few occlusions. The results of SSD300, RetinaNet, YOLOv4, and YOLOF obviously miss
some trees. For CenterNet and YOLOv5s, although they can discern two trees in this image, the confidence of the detected trees is quite low. Similar to
previous cases, Cascade R-CNN identifies these two trees as three. In contrast, our OD-UTDNet can generalize well in both heavy occlusion and few occlusion
situations. (a) SSD300. (b) RetinaNet. (c) Cascade R-CNN. (d) CenterNet. (e) YOLOv4. (f) YOLOF. (g) YOLOv5s. (h) Ours.

Similar to the results in Figs. 6 and 7, the detection results produced by Cascade R-CNN still have the problem of mis-detection, and one tree is detected as multiple trees. Compared with these SOTA detectors, our OD-UTDNet can detect more trees with higher confidence, which demonstrates that our model performs well in both heavy occlusion and few occlusion situations.

D. Ablation Study

OD-UTDNet exhibits superior detection performance compared with 12 representative state-of-the-art methods. To further validate the effectiveness of the proposed OD-UTDNet, comprehensive ablation studies are conducted to analyze its different components, including the dilated attention CSP module, the SC convolutions, and the GFocal loss.

We first construct the base model with the original YOLOv5s as the baseline of the detection network and then train this model with the implementation details mentioned above. Subsequently, different modules are incrementally added to the base model as follows:
1) base model + dilated attention CSP module → V1;
2) V1 + SC convolutions → V2;
3) V2 + GFocal loss → V3 (our full model).

All these models are retrained in the same way as before and tested on the UTD test set. The performances of these variants are exhibited in Table III.

As observed, each module in our OD-UTDNet contributes to object detection, especially the well-designed dilated attention CSP network, which achieves a 4.2 AP and 4.3 AP50 improvement over our base model.


Fig. 9. Detection results by different methods in a case of few occlusions. It can be observed that only our OD-UTDNet can detect all five trees in this figure,
while none of the compared algorithms can detect the far-left tree. For SSD300 and YOLOv4, they cannot even detect one tree in the image. Once again, the
Cascade R-CNN detector suffers from mis-detection problems. (a) SSD300. (b) RetinaNet. (c) Cascade R-CNN. (d) CenterNet. (e) YOLOv4. (f) YOLOF.
(g) YOLOv5s. (h) Ours.

The introduction of SC convolutions and GFocal loss also greatly improves the performance of the network. Finally, it can be found that the combination of all three modules produces the highest detection performance, which indicates that the three modules complement each other.

TABLE III. Ablation Study on OD-UTDNet. As Observed, Our Full Model (V3) Outperforms the Other Alternatives.

V. CONCLUSIONS AND FUTURE WORK

In this work, for the first time, we have established a high-quality urban tree image dataset to facilitate research on the automatic detection of trees. Accordingly, a lightweight yet effective detection framework, called OD-UTDNet, is proposed for automatically detecting occluded and dense urban trees. To cope with heavy occlusion in tree detection tasks, a DACSP module is developed to expand the receptive field of the model while enhancing its feature extraction capability. In addition, the SC convolution network is introduced to further enlarge the fields of view of each convolutional layer and enrich the output features, thus reducing the impact of occlusion on tree detection. Moreover, GFocal loss is introduced to address dense scenes in urban tree detection. Finally, extensive evaluations demonstrate that our OD-UTDNet performs favorably against 12 representative state-of-the-art algorithms.

In future work, we plan to use other powerful network architectures, such as graph convolutional networks (GCNs) or Transformers, to develop a more effective urban tree detection framework. In addition, we will further label the trees in UTD according to their species and conduct a classification study.

ACKNOWLEDGMENT

The authors thank the anonymous reviewers for their careful reading and valuable comments.

REFERENCES

[1] L. Wallace, A. Lucieer, and C. S. Watson, "Evaluating tree detection and segmentation routines on very high resolution UAV LiDAR data," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 12, pp. 7619–7628, Dec. 2014.
[2] A. Fekete and M. Cserep, "Tree segmentation and change detection of large urban areas based on airborne LiDAR," Comput. Geosci., vol. 156, Nov. 2021, Art. no. 104900.
[3] D. Chi, J. Degerickx, K. Yu, and B. Somers, "Urban tree health classification across tree species by combining airborne laser scanning and imaging spectroscopy," Remote Sens., vol. 12, no. 15, p. 2435, Jul. 2020.
[4] A. Harikumar, F. Bovolo, and L. Bruzzone, "A local projection-based approach to individual tree detection and 3-D crown delineation in multistoried coniferous forests using high-density airborne LiDAR data," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 2, pp. 1168–1182, Feb. 2019.
[5] M. Hirschmugl, M. Ofner, J. Raggam, and M. Schardt, "Single tree detection in very high resolution remote sensing data," Remote Sens. Environ., vol. 110, no. 4, pp. 533–544, 2007.
[6] W. Yao and Y. Wei, "Detection of 3-D individual trees in urban areas by combining airborne LiDAR data and imagery," IEEE Geosci. Remote Sens. Lett., vol. 10, no. 6, pp. 1355–1359, Nov. 2013.
[7] C. Donmez, O. Villi, S. Berberoglu, and A. Cilek, "Computer vision-based citrus tree detection in a cultivated environment using UAV imagery," Comput. Electron. Agricult., vol. 187, Aug. 2021, Art. no. 106273.
[8] E. M. da Cunha Neto et al., "Using high-density UAV-lidar for deriving tree height of araucaria angustifolia in an urban Atlantic rain forest," Urban Forestry Urban Greening, vol. 63, Aug. 2021, Art. no. 127197.
[9] E. G. Parmehr, M. Amati, E. J. Taylor, and S. J. Livesley, "Estimation of urban tree canopy cover using random point sampling and remote sensing methods," Urban Forestry Urban Greening, vol. 20, pp. 160–171, Dec. 2016.
[10] Y. Lin, M. Jiang, Y. Yao, L. Zhang, and J. Lin, "Use of UAV oblique imaging for the detection of individual trees in residential environments," Urban Forestry Urban Greening, vol. 14, no. 2, pp. 404–412, 2015.
[11] Z. Y. Chen, I. Y. Liao, and A. Ahmed, "KDT-SPSO: A multimodal particle swarm optimisation algorithm based on k-d trees for palm tree detection," Appl. Soft Comput., vol. 103, May 2021, Art. no. 107156.
[12] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. (ECCV), Amsterdam, The Netherlands, Oct. 2016, pp. 21–37.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.


[14] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[15] D. Liang et al., "Anchor retouching via model interaction for robust object detection in aerial images," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–13, 2022.
[16] M. Li, X. Zhao, J. Li, and L. Nan, "ComNet: Combinational neural network for object detection in UAV-borne thermal images," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 8, pp. 6662–6673, Aug. 2021.
[17] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," 2019, arXiv:1904.07850.
[18] G. Jocher, A. Stoken, and J. Borovec. (Jan. 2021). Ultralytics/YOLOv5: V4.0–nn.Silu() Activations, Weights & Biases Logging, Pytorch Hub Integration. [Online]. Available: https://zenodo.org/record/4418161
[19] J.-J. Liu, Q. Hou, M.-M. Cheng, C. Wang, and J. Feng, "Improving convolutional networks with self-calibrated convolutions," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10096–10105.
[20] X. Li et al., "Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection," in Proc. Annu. Conf. Neural Inf. Process. Syst. (NeurIPS), Dec. 2020, pp. 21002–21012.
[21] R. Mcroberts and E. Tomppo, "Remote sensing support for national forest inventories," Remote Sens. Environ., vol. 110, no. 4, pp. 412–419, Oct. 2007.
[22] H. Kaartinen et al., "An international comparison of individual tree detection and extraction using airborne laser scanning," Remote Sens., vol. 4, no. 4, pp. 950–974, 2012.
[23] S. Malek, Y. Bazi, N. Alajlan, H. AlHichri, and F. Melgani, "Efficient framework for palm tree detection in UAV images," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 12, pp. 4692–4703, Dec. 2014.
[24] J. Secord and A. Zakhor, "Tree detection in urban regions using aerial LiDAR and image data," IEEE Geosci. Remote Sens. Lett., vol. 4, no. 2, pp. 196–200, Apr. 2007.
[25] P. Srestasathiern and P. Rakwatin, "Oil palm tree detection with high resolution multi-spectral satellite imagery," Remote Sens., vol. 6, no. 10, pp. 9749–9774, Oct. 2014.
[26] H. Jiang, S. Chen, D. Li, C. Wang, and J. Yang, "Papaya tree detection with UAV images using a GPU-accelerated scale-space filtering method," Remote Sens., vol. 9, no. 7, p. 721, Jul. 2017.
[27] M. Shakaib Iqbal, H. Ali, S. N. Tran, and T. Iqbal, "Coconut trees detection and segmentation in aerial imagery using mask region-based convolution neural network," 2021, arXiv:2105.04356.
[28] S. Hartling, V. Sagan, P. Sidike, M. Maimaitijiang, and J. Carron, "Urban tree species classification using a WorldView-2/3 and LiDAR data fusion approach and deep learning," Sensors, vol. 19, no. 6, p. 1284, Mar. 2019.
[29] D. Tianyang, Z. Jian, G. Sibin, S. Ying, and F. Jing, "Single-tree detection in high-resolution remote-sensing images based on a cascade neural network," ISPRS Int. J. Geo-Inf., vol. 7, no. 9, p. 367, 2018.
[30] M. Liu, Z. Han, Y. Chen, Z. Liu, and Y. Han, "Tree species classification of LiDAR data based on 3D deep learning," Measurement, vol. 177, Jun. 2021, Art. no. 109301.
[31] S. Briechle, P. Krzystek, and G. Vosselman, "Silvi-Net–A dual-CNN approach for combined classification of tree species and standing dead trees from remote sensing data," Int. J. Appl. Earth Observ. Geoinf., vol. 98, Jun. 2021, Art. no. 102292.
[32] M. P. Ferreira et al., "Individual tree detection and species classification of Amazonian palms using UAV images and deep learning," Forest Ecol. Manage., vol. 475, Nov. 2020, Art. no. 118397.
[33] Q. Xie, D. Li, Z. Yu, J. Zhou, and J. Wang, "Detecting trees in street images via deep learning with attention module," IEEE Trans. Instrum. Meas., vol. 69, no. 8, pp. 5395–5406, Aug. 2020.
[34] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2005, pp. 886–893.
[35] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[36] J. Uijlings, K. van de Sande, and T. Gevers, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, pp. 154–171, Oct. 2013.
[37] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[38] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 28, Dec. 2015, pp. 91–99.
[39] Z. Cai and N. Vasconcelos, "Cascade R-CNN: High quality object detection and instance segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 5, pp. 1483–1498, May 2021.
[40] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan, "Grid R-CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7363–7372.
[41] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[42] A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[43] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[44] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV), Zurich, Switzerland, Sep. 2014, pp. 740–755.
[45] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[46] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshops, Jun. 2020, pp. 390–391.
[47] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 8759–8768.
[48] W. Shi and R. Rajkumar, "Point-GNN: Graph neural network for 3D object detection in a point cloud," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1711–1719.
[49] D.-J. Chen, H.-Y. Hsieh, and T.-L. Liu, "Adaptive image transformer for one-shot object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12247–12256.
[50] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 286–301.
[51] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7132–7141.
[52] Z. Qin, P. Zhang, F. Wu, and X. Li, "FcaNet: Frequency channel attention networks," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 783–792.
[53] L. Tychsen-Smith and L. Petersson, "Improving object localization with fitness NMS and bounded IoU loss," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 6877–6885.
[54] Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, and J. Sun, "You only look one-level feature," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13039–13048.
[55] P. Sun et al., "Sparse R-CNN: End-to-end object detection with learnable proposals," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 14454–14463.
[56] (May 2021). Colabeler. [Online]. Available: http://www.colabeler.com

Yongzhen Wang received the M.S. degree from Xi'an Technological University, Xi'an, China, in 2019. He is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China.
His research interests include deep learning, image processing, and computer vision, particularly in the domains of object detection, image dehazing, and image deraining. He has served as a Program Committee (PC) Member for the AAAI Conference on Artificial Intelligence (AAAI) 2022.


Xuefeng Yan received the Ph.D. degree from the Beijing Institute of Technology, Beijing, China, in 2005.
He is a Professor with the School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China. He was a Visiting Scholar with Georgia State University, Atlanta, GA, USA, in 2008 and 2012. His research interests include intelligent computing, model-based systems engineering (MBSE)/complex system modeling, simulation, and evaluation.

Hexiang Bao received the B.E. degree from Hefei University, Hefei, China, in 2017. He is currently pursuing the M.S. degree in computer science and technology with the Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China.
His research interests include deep learning, image processing, and object detection.

Yiping Chen (Senior Member, IEEE) received the Ph.D. degree in information and communications engineering from the National University of Defense Technology, Changsha, China, in 2011.
She is a Research Associate Professor with the Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University, Xiamen, China. From 2007 to 2011, she was an Assistant Researcher with The Chinese University of Hong Kong, Hong Kong. Her research interests include remote sensing image processing, mobile laser scanning data analysis, 3-D point cloud computer vision, and autonomous driving. She has published more than 70 articles in refereed journals, including IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, and IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, and conferences, including the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), the IEEE International Symposium on Geoscience and Remote Sensing (IGARSS), and ISPRS.
Dr. Chen was a recipient of the 2020 Best Reviewer Award of the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING.

Lina Gong received the Ph.D. degree in computer software and theory from the China University of Mining and Technology, Beijing, China, in 2020.
She is currently a Lecturer with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China. She also spent one year as a visiting student with the Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen's University, Kingston, ON, Canada. Her research interests include deep learning, software analysis, software testing, and mining software repositories.

Mingqiang Wei (Senior Member, IEEE) received the Ph.D. degree in computer science and engineering from The Chinese University of Hong Kong (CUHK), Hong Kong, in 2014.
He is a Professor with the School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing. Before joining NUAA, he served as an Assistant Professor with the Hefei University of Technology, Hefei, China, and a Post-Doctoral Fellow at CUHK.
Dr. Wei was a recipient of the CUHK Young Scholar Thesis Award in 2014. He is now an Associate Editor for The Visual Computer, the Journal of Electronic Imaging, and the Journal of Image and Graphics, and a Guest Editor for IEEE TRANSACTIONS ON MULTIMEDIA. His research interests focus on 3-D vision, computer graphics, and deep learning.

Jonathan Li (Senior Member, IEEE) received the Ph.D. degree in geomatics engineering from the University of Cape Town, Cape Town, South Africa, in 2000.
He is a Professor with the Department of Geography and Environmental Management and cross-appointed with the Department of Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada, and a Fellow of the Engineering Institute of Canada. His main research interests include image and point cloud analytics, mobile mapping, and artificial intelligence (AI)-powered information extraction from LiDAR point clouds and Earth observation images. He has coauthored over 500 publications, including 300+ in refereed journals and 200+ in conference proceedings.
Dr. Li was a recipient of the 2021 Geomatica Award, the 2020 Samuel Gamble Award, and the 2019 Outstanding Achievement Award in Mobile Mapping Technology. He is currently serving as the Editor-in-Chief of the International Journal of Applied Earth Observation and Geoinformation, and an Associate Editor of IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, and the Canadian Journal of Remote Sensing.

