
Exploring the Potential of Vegetation Indices for Urban Tree Segmentation in Street View Images

This paper was downloaded from TechRxiv (https://www.techrxiv.org).

LICENSE
CC BY 4.0

SUBMISSION DATE / POSTED DATE
20-01-2023 / 24-01-2023

CITATION
Arevalo-Ramirez, Tito; Alfaro, Anali; Saavedra, José M.; Recabarren, Matías; Ponce-Donoso, Mauricio; Delpiano, José (2023): Exploring the Potential of Vegetation Indices for Urban Tree Segmentation in Street View Images. TechRxiv. Preprint. https://doi.org/10.36227/techrxiv.21933291.v1

DOI
10.36227/techrxiv.21933291.v1

Exploring the Potential of Vegetation Indices for Urban Tree Segmentation in Street View Images

Tito Arevalo-Ramirez, Anali Alfaro, José M. Saavedra, Matías Recabarren, Mauricio Ponce-Donoso, José Delpiano

Tito Arevalo-Ramirez, Anali Alfaro, José M. Saavedra, Matías Recabarren, and José Delpiano are with the Faculty of Engineering and Applied Sciences, Universidad de los Andes, Santiago, Chile. Mauricio Ponce-Donoso is with the Sociedad Chilena de Arboricultura, Santiago, Chile.

Abstract—Urban forests play a crucial role in the development of cities because of the urban ecosystem services they provide. Previous works have alleviated urban forest monitoring by discriminating tree species and performing tree inventories using street view images and convolutional neural networks. However, the characterization of trees from street-view images remains a challenging task. Determining tree structural parameters has been limited because of inaccurate tree segmentation caused by combined, occluded, or leaf-off trees. Therefore, the current work evaluates the potential of vegetation indices derived from red, green, blue, and synthesized near-infrared and red-edge spectral bands for urban tree segmentation. In particular, we attempt to show whether or not vegetation indices add relevant information to deep neural segmentation networks when few fine-tuning training samples are available. A conditional adversarial network generates red-edge and near-infrared images in urban environments, reaching average structural similarity indices of 0.86 and 0.81, respectively. Furthermore, we note that by using appropriate multispectral vegetation indices, one can boost the average intersection over union by 5.07 % to 13.7 %. Specifically, we suggest the SegFormer segmentation network pre-trained with the CityScapes dataset and the Red Edge Modified Simple Ratio index for improving urban tree segmentation. However, if no multispectral data is available, the DeepLabV3 network pre-trained with the ADE20k dataset is suggested because it achieves the best RGB outcome, with an average IoU of 0.671.

Index Terms—Urban trees, Semantic Segmentation, Image-to-Image Translation, Multispectral Features, Neural Networks

Fig. 1: Tree segmentation challenges. The tree of interest is enclosed by cyan lines; unwanted objects are shown by magenta regions.

I. INTRODUCTION

Urban forests have become essential in developing sustainable cities in the last decades. Air and water quality control, microclimate regulation, or carbon sequestration, among other ecosystem services, are usually determined by the characterization of urban trees [1], [2]. For instance, crown projection area (the area under the tree dripline) and leaf area are used to calculate rainfall interception for estimating stormwater-runoff reduction benefits [3]. Moreover, urban trees can be associated with social inequality and can facilitate its quantification and mitigation [4]. Further, urban trees can be used to retrieve economic compensation metrics for communities and local governments [5]. In this sense, proper management and characterization of urban trees enable environmental and socioeconomic benefits.

In situ measurements of tree dendrometric parameters constitute the traditional and most accurate approach for determining tree characteristics. Nevertheless, they require long periods (e.g., 18 months) and considerable funding [6]. In order to alleviate the characterization and evaluation of urban trees, previous works have proposed different artificial intelligence-based strategies; see [6], [7], [8], [9], [10], [11], [12] and the references therein. It is essential to highlight that the aforementioned works use street view images in the Red, Green, and Blue (RGB) bands of the electromagnetic spectrum.

Most previous works focus on identifying tree species by first detecting trees using object detection algorithms (e.g., the You Only Look Once, YOLO, deep learning approach). Information about tree species and the number of individuals significantly alleviates tasks related to urban tree inventories [6], [13]. Nevertheless, tree characterization remains a challenging task because it usually depends on pixel-wise identification of trees, which is commonly obscured by tree occlusion or crown combinations; see Fig. 1. In particular, based on the literature review, we found that only a few works tackle the segmentation of urban trees and the computation of their dendrometric parameters [8], [10].

The work in [8] aims to automatically determine tree profile information such as the tree height, diameter at breast height (DBH), and tree species. Specifically, the proposed methodology starts by detecting trees within a bounding box using YOLOv3. Next, pixel-wise segmentation is performed for the detailed identification of trees. The Panoptic-DeepLab framework performs the semantic segmentation using the Cityscapes dataset for the training and validation stages [14]. Although the authors achieve acceptable outcomes, the main drawback is that the model is evaluated using ideal tree images, that is, images in which trees are detectable without being occluded or overlapped by obstacles or other trees.

The research presented in [10] detects trees using the YOLO network with MobileNet as the backbone and computes their height from the pixel coordinates of the tree bounding box.
Note that this work handles tree occlusion and multiple tree bounding boxes in the dataset generation stage. Images that compose the dataset are taken at a distance that frames the tree of interest. Further, a person holding a reference object (an object with known dimensions) stands close to that tree. The reference object has two purposes: it serves as a scale for determining tree height, and it alleviates the identification of the tree of interest, since the tree under analysis is selected as the tree closest to the reference object. Even though the proposed data acquisition protocol could alleviate the recognition of the tree of interest, it might share some of the disadvantages of field surveys. For instance, data collection could take long periods and require trained staff, because two people are needed: one to take photographs of the trees and another to carry the reference object and stand next to the tree of interest.

Based on the research works mentioned above, a reliable segmentation of the tree to be assessed must be available for an appropriate tree characterization. However, large urban datasets that include tree instances have the disadvantage that they do not discriminate individual occurrences of trees [14], [15]. Moreover, as mentioned by [8], it is challenging to generate a customized semantic segmentation dataset comparable to object detection datasets [16] in terms of training samples, because semantic segmentation requires pixel-wise labeled instances, which is an arduous task. Therefore, in the present work, we explore a strategy to improve the segmentation of urban trees by two state-of-the-art deep neural networks (i.e., DeepLabV3 and SegFormer [17], [18]). Both networks are fine-tuned using a limited set of training, validation, and testing samples.

Since the custom dataset we have created has only one hundred RGB samples, we attempt to incorporate information into the segmentation models through vegetation indices (VIs) computed from the visible, red-edge, and near-infrared bands of the electromagnetic spectrum. We hypothesize that vegetation indices could add valuable knowledge about urban trees, which might not be decoded in the training stage of the segmentation models due to the lack of training samples. The belief that vegetation indices could boost urban tree segmentation is supported by a previous work, which shows that multispectral indices improve the classification of ground points in forested regions [19].

In this context, we evaluate the behavior and segmentation performance of the DeepLabV3 and SegFormer models when they are directly fed with a four-channel image (i.e., RGB channels plus a vegetation index channel). A total of 19 vegetation indices were computed, ten based on visible bands and nine on red-edge and near-infrared bands. The latter indices are calculated using multispectral data synthesized from the RGB channels by a conditional adversarial network. By integrating vegetation indices during the training process, we have found that one could improve the segmentation performance by about 13.7 % when using an appropriate deep neural network and vegetation index, even if the index is computed from synthesized multispectral data.

Fig. 2: Field sampling scheme. The tree of interest is represented in cyan. The binary mask is generated using an image manipulation program [21].

II. DATA ACQUISITION

The RGB images for this study were obtained by volunteers using smartphones under the Arbocensus project in the Santiago metropolitan region, Chile. In particular, urban trees were mapped in the Las Condes and La Reina communes. The Santiago region's climate can be described as continental Mediterranean, with average coldest and warmest temperatures of about 9.4 and 21.9 degrees Celsius, respectively, and annual precipitation of around 275 millimeters [20]. Regarding the urban tree photographs, volunteers (citizen scientists) captured about three thousand images of roughly one hundred species. However, we randomly chose one hundred images to evaluate whether vegetation indices could improve urban tree segmentation when using a small number of tree samples. A detailed depiction of the urban tree sampling process is shown in Fig. 2.

After the random selection of the one hundred urban tree images, the binary masks were created using an image manipulation program [21]. The primary motivation to employ this software was to create fine binary masks such as the one shown in Fig. 2. The detailed segmentation masks were obtained manually by pixel-wise labeling of the tree of interest using free selection and color selection tools.

III. METHODOLOGY

Once we obtain the RGB images and their corresponding binary masks, the VIs are computed. Visible-based VIs are determined straightforwardly using each image's red, green, and blue channels. Multispectral indices, in turn, are generated using synthesized red-edge and near-infrared channels. A supervised image-to-image translation model, derived by training and validating a conditional adversarial network, generates the synthesized red-edge and near-infrared channels. Next, we compute 19 sets of four-channel images (i.e., red, green, blue, and VI) to evaluate whether any proposed VI improves the performance of the segmentation models. The segmentation outcomes using RGB images are used as a reference. Furthermore, it is essential to highlight that the deep neural networks are pre-trained with two urban datasets (ADE20k and CityScapes), which include urban tree instances. Figure 3 shows the general scheme of the proposed methodology.

Fig. 3: General scheme of the proposed methodology for evaluating visible and synthesized red-edge and near-infrared vegetation indices. The visible and multispectral indices are described in Table I. The light gray block describes the computation of vegetation indices based on visible and predicted multispectral bands. The gray block pictures the fine-tuning of pre-trained segmentation networks. Note that I2I refers to the image-to-image translation model.

A. Visible, Red-Edge, and Near-Infrared Processing

Before explaining how to compute the VIs, we describe the procedure for determining the artificial red-edge and near-infrared channels.

1) Conditional Adversarial Network: This network determines a model for mapping a pixel from a source image to a target image; see [22] for more details. It has been demonstrated that image-to-image translation models can predict the near-infrared channel from aerial crop images [23]. In our case, however, the source and target images are street view RGB and red-edge/near-infrared images, respectively. Note that the RGB-to-multispectral mapping is learned by two different models, one for red-edge and the other for near-infrared. Each model is trained, validated, and tested from scratch using a hyperspectral city dataset [24]. Specifically, we use 1054 and 100 images to train and validate each model. Both stages are performed over one thousand epochs with default parameters, using UNet as the generator and PatchGAN as the discriminator. Then, the red-edge and near-infrared mapping models are evaluated using the structural similarity index measure (SSIM) with 176 images not seen in the training or validation stages.

After the red-edge and near-infrared mapping models are determined, we use them to compute multispectral channels from our dataset's RGB images. Next, the VIs are calculated as operations and transformations between visible or multispectral channels. We have chosen ten indices based on RGB channels and nine based on multispectral channels, which have been reported in previous works for evaluating the status of vegetation [25].

2) Visible Vegetation Indices: These indices are computed using only the red, green, and blue channels of the electromagnetic spectrum. Table I shows the description of each VI.

3) Multispectral Vegetation Indices: Multispectral indices use bands in the visible, red-edge, and near-infrared regions to estimate the status of vegetation. Table I explains each index. In this sense, nineteen sets of one hundred images, which retrieve knowledge about the red, green, and blue channels and a VI, are generated. In particular, we use four-channel images where the first three channels correspond to red, green, and blue information, and the fourth channel provides knowledge about a specific vegetation index.

B. Segmentation Models

The pixel-wise identification is performed by two state-of-the-art semantic segmentation deep networks implemented in a PyTorch open-source toolbox, MMSegmentation [27].

1) DeepLabV3: A deep convolutional neural network that exploits the potential of atrous convolution to improve its performance in semantic image segmentation tasks [17]. This segmentation model achieves a mean Intersection over Union (mIoU) of 42.42 % and 79.09 % on the ADE20k and Cityscapes datasets, respectively. Table II shows how the intersection over union is computed.

2) SegFormer: A semantic segmentation deep network that unifies transformers with lightweight multilayer perceptron decoders, which avoids complex decoders [18]. When using the ADE20k and Cityscapes datasets, SegFormer yields a mIoU of about 37.85 % and 76.54 %, correspondingly.

Note that the segmentation networks are pre-trained with three-channel images (i.e., red, green, and blue). A detailed description of the pre-training stages and models can be found in [27].

3) Training and Validation: Using our dataset, we took advantage of transfer learning for training, validating, and fine-tuning the segmentation models. Specifically, we use 74, 16, and 10 images for training, validating, and testing the models. Note that default parameters are used for the stages mentioned above. Nevertheless, when using four-channel images, the networks' input is set to four channels. Further, 15 thousand iterations with a batch size of one are employed for fine-tuning the segmentation models.
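The paper does not detail how MMSegmentation handles the extra input channel, so the following sketch only illustrates one plausible way of setting a segmentation network's input to four channels, using torchvision's DeepLabV3 as a stand-in. Initializing the fourth-channel filters as the mean of the RGB filters is our assumption, not the authors' procedure.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

def extend_first_conv_to_four_channels(model: nn.Module) -> nn.Module:
    """Replace the backbone's first convolution so the network accepts RGB + VI input.

    The three existing RGB filters are kept; the new fourth-channel filters are
    initialized as the mean of the RGB filters (a common heuristic, assumed here).
    """
    old = model.backbone.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
    new = nn.Conv2d(4, old.out_channels, kernel_size=old.kernel_size,
                    stride=old.stride, padding=old.padding, bias=old.bias is not None)
    with torch.no_grad():
        new.weight[:, :3] = old.weight                       # keep existing RGB filters
        new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)  # init VI filters
    model.backbone.conv1 = new
    return model

# Illustration only: a randomly initialized DeepLabV3 adapted to four-channel input.
# In a real fine-tuning run the pre-trained checkpoint would be loaded first.
model = extend_first_conv_to_four_channels(
    deeplabv3_resnet50(weights=None, weights_backbone=None)
)
model.eval()
logits = model(torch.randn(1, 4, 512, 512))["out"]  # shape: (1, 21, 512, 512)
```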

TABLE I: Visible and multispectral VIs, where ρ_α is the reflectance in the α band. The red, green, blue, red-edge, and near-infrared channels are represented by R, G, B, RE, and NIR, respectively. References for each vegetation index equation can be found in [19], [25], [26].

Visible
  Color Index of Vegetation Extraction (cive):    18.78745 + 0.44ρ_R − 0.88ρ_G + 0.385ρ_B
  Excess Green Index (exg):                       2ρ_G − ρ_R − ρ_B
  Excess Red Index (exr):                         1.4ρ_R − ρ_G
  Excess Green Minus Red Index (exgr):            exg − exr
  Green Leaf Index (gli):                         (2ρ_G − ρ_R − ρ_B)/(2ρ_G + ρ_R + ρ_B)
  Modified Green Red Vegetation Index (mgrvi):    (ρ_G² − ρ_R²)/(ρ_G² + ρ_R²)
  Modified Photochemical Reflectance Index (mpri): (ρ_G − ρ_R)/(ρ_G + ρ_R)
  Normalized Difference Index (ndi):              128((ρ_G − ρ_R)/(ρ_G + ρ_R) + 1)
  Red Green Blue Vegetation Index (rgbvi):        (ρ_G² − ρ_R ρ_B)/(ρ_G² + ρ_R ρ_B)
  Triangular Greenness Index (tgi):               0.5((ρ_R − ρ_B) − (ρ_R − ρ_G)) − ((ρ_R − ρ_G) − (ρ_R − ρ_B))

Multispectral
  Chlorophyll Absorption Reflectance Index (cari):          (ρ_RE − ρ_R) − 0.2(ρ_RE − ρ_G)
  Enhanced Vegetation Index (evi):                          2.5(ρ_NIR − ρ_R)/(ρ_NIR + 6ρ_R − 7.5ρ_B + 1)
  Green Normalized Difference Vegetation Index (gndvi):     (ρ_NIR − ρ_G)/(ρ_NIR + ρ_G)
  Modified CARI (mcari):                                    ρ_RE((ρ_RE − ρ_R) − 0.2(ρ_RE − ρ_G))/ρ_R
  Modified Soil Adjusted Vegetation Index (msavi):          (2ρ_NIR + 1 − sqrt((2ρ_NIR + 1)² − 8(ρ_NIR − ρ_R)))/2
  Normalized Difference Vegetation Index (ndvi):            (ρ_NIR − ρ_R)/(ρ_NIR + ρ_R)
  Optimized Soil Adjusted Vegetation Index (osavi):         1.16(ρ_NIR − ρ_R)/(ρ_NIR + ρ_R + 0.16)
  Red Edge Modified Simple Ratio (remsr):                   (ρ_NIR/ρ_RE − 1)/sqrt(ρ_NIR/ρ_RE + 1)
  Red Edge Normalized Difference Vegetation Index (rendvi): (ρ_NIR − ρ_RE)/(ρ_NIR + ρ_RE)
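As a concrete reading of Table I, the sketch below computes a few of the listed indices from per-band reflectance arrays with NumPy. The small epsilon guarding the divisions is our addition and is not part of the original definitions.

```python
import numpy as np

def vegetation_indices(r, g, b, re=None, nir=None, eps=1e-12):
    """A few of the indices from Table I, computed from per-band reflectance arrays.

    r, g, b are the visible channels; re and nir are the (synthesized) red-edge
    and near-infrared channels. eps guards against division by zero.
    """
    vi = {
        "exg":  2 * g - r - b,                             # Excess Green Index
        "mpri": (g - r) / (g + r + eps),                   # Modified Photochemical Reflectance Index
        "gli":  (2 * g - r - b) / (2 * g + r + b + eps),   # Green Leaf Index
    }
    if re is not None and nir is not None:
        vi["ndvi"]   = (nir - r) / (nir + r + eps)
        vi["rendvi"] = (nir - re) / (nir + re + eps)
        vi["remsr"]  = (nir / (re + eps) - 1) / np.sqrt(nir / (re + eps) + 1)
    return vi
```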

TABLE II: Quantitative metrics for evaluating segmentation performance, where ToI is the tree of interest, IoU refers to the intersection over union, p is the precision, and r is the recall.

                              Predicted
                          ToI       non-ToI        Metric
Ground-truth   ToI         a           b           IoU = a/(a + b + c)
               non-ToI     c           d           p = a/(a + c)
                                                   r = a/(a + b)

After training and validating the segmentation models on our dataset, we evaluate them using the IoU metric detailed in Table II on the testing set, whose samples have not been seen in the previous stages.
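The definitions in Table II translate directly into code; a minimal sketch for a single binary prediction/ground-truth pair:

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray):
    """IoU, precision, and recall for the tree of interest, following Table II.

    pred and gt are boolean masks where True marks tree-of-interest (ToI) pixels.
    """
    a = np.logical_and(pred, gt).sum()    # ToI correctly predicted as ToI
    b = np.logical_and(~pred, gt).sum()   # ToI missed (predicted as non-ToI)
    c = np.logical_and(pred, ~gt).sum()   # non-ToI wrongly predicted as ToI
    iou = a / (a + b + c)
    precision = a / (a + c)
    recall = a / (a + b)
    return iou, precision, recall
```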

IV. RESULTS

For training, validating, and testing the image-to-image translation models, we extracted the reflectance at 718 nm (red-edge) and 840 nm (near-infrared) from the hyperspectral city dataset. These bands are close to the ones used by the MicaSense RedEdge multispectral camera. The performance of the conditional adversarial network is described in Table III. The worst prediction outcomes are shown in Fig. 4.

Since the average SSIM is over 0.8 for the synthesized red-edge and near-infrared channels, and no previous works evaluate this information in urban tree segmentation, we used the corresponding image-to-image models to predict the multispectral channels for our dataset. It should be highlighted that poorly illuminated environments yield the worst outcomes, specifically dark images with no light sources. Nevertheless, this reconstruction behavior might not occur in our dataset because all images were captured on sunny days.

Once the multispectral channels are determined, the vegetation indices are computed straightforwardly from the equations in Table I. Then, the RGB and VI information is fed to the segmentation models. The behavior of these models on the testing set is shown in Table IV.

Figure 5 shows qualitative outcomes of the best and worst segmentation instances when using the fine-tuning set that helps to enhance segmentation performance.

Fig. 4: Worst cases for qualitative results of the conditional adversarial network for the red-edge and near-infrared channels. Both are associated with conditions of extremely low illumination. (a) Red-Edge: worst case, SSIM = 0.14. (b) Near-Infrared: worst case, SSIM = 0.09.

TABLE III: Structural similarity index measure outcomes for the image-to-image translation using 176 testing samples from the hyperspectral city dataset.

            Red-Edge          Near-Infrared
Metric     avg     std        avg     std
SSIM       0.86    0.14       0.81    0.17
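The SSIM figures in Table III can be obtained, for instance, with scikit-image. The sketch below assumes the synthesized and measured single-band images are already co-registered and scaled to [0, 1], which is our assumption about the evaluation protocol.

```python
import numpy as np
from skimage.metrics import structural_similarity

def average_ssim(synthesized: np.ndarray, measured: np.ndarray) -> float:
    """Mean SSIM over a test set of synthesized vs. measured single-band images.

    synthesized, measured: arrays of shape (N, H, W) with values scaled to [0, 1].
    """
    scores = [
        structural_similarity(s, m, data_range=1.0)
        for s, m in zip(synthesized, measured)
    ]
    return float(np.mean(scores))
```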

TABLE IV: Segmentation performance of the deep neural networks using the IoU metric on the ten testing samples (S1-S10). The visible and multispectral VIs are described in Table I. Values marked with an asterisk (*) achieve the best average IoU for each model and pretraining dataset.

Model       Pretraining   Set          S1     S2     S3     S4     S5     S6     S7     S8     S9     S10    avg
DeepLabV3   ADE20k        RGB          0.859  0.922  0.718  0.632  0.486  0.643  0.453  0.736  0.526  0.739  0.671
                          RGB&exg      0.830  0.894  0.776  0.693  0.490  0.576  0.470  0.741  0.579  0.723  0.677
                          RGB&gndvi    0.900  0.904  0.697  0.668  0.486  0.611  0.524  0.719  0.561  0.751  0.682*
                          RGB&ndvi     0.864  0.926  0.729  0.675  0.481  0.630  0.519  0.706  0.540  0.724  0.679
            CityScapes    RGB          0.843  0.921  0.646  0.584  0.475  0.623  0.483  0.687  0.575  0.765  0.660
                          RGB&exg      0.863  0.919  0.716  0.685  0.479  0.561  0.533  0.659  0.612  0.656  0.668
                          RGB&mpri     0.913  0.921  0.766  0.662  0.504  0.614  0.573  0.725  0.587  0.692  0.696*
                          RGB&tgi      0.898  0.913  0.678  0.595  0.466  0.552  0.568  0.678  0.556  0.787  0.669
                          RGB&evi      0.862  0.924  0.767  0.720  0.530  0.606  0.537  0.722  0.541  0.746  0.696*
                          RGB&mcari    0.883  0.900  0.756  0.625  0.532  0.625  0.521  0.699  0.520  0.777  0.684
                          RGB&msavi    0.874  0.861  0.735  0.651  0.502  0.673  0.531  0.652  0.588  0.760  0.683
                          RGB&osavi    0.871  0.918  0.730  0.736  0.508  0.626  0.533  0.662  0.530  0.787  0.690
                          RGB&rendvi   0.859  0.909  0.717  0.728  0.510  0.617  0.536  0.754  0.511  0.721  0.686
SegFormer   ADE20k        RGB          0.759  0.925  0.645  0.613  0.534  0.706  0.431  0.642  0.642  0.767  0.667
                          RGB&rgbvi    0.823  0.911  0.710  0.656  0.496  0.686  0.428  0.707  0.684  0.712  0.681
                          RGB&cari     0.848  0.922  0.720  0.609  0.454  0.669  0.594  0.781  0.674  0.771  0.704*
                          RGB&mcari    0.862  0.913  0.709  0.621  0.515  0.644  0.478  0.741  0.640  0.676  0.680
            CityScapes    RGB          0.627  0.925  0.506  0.695  0.502  0.722  0.440  0.687  0.370  0.728  0.620
                          RGB&tgi      0.874  0.913  0.596  0.764  0.574  0.679  0.518  0.632  0.172  0.675  0.640
                          RGB&cari     0.852  0.902  0.563  0.684  0.513  0.649  0.552  0.781  0.349  0.492  0.634
                          RGB&evi      0.854  0.919  0.498  0.674  0.485  0.700  0.438  0.742  0.580  0.720  0.661
                          RGB&mcari    0.858  0.916  0.559  0.560  0.498  0.627  0.559  0.624  0.639  0.552  0.639
                          RGB&msavi    0.858  0.905  0.722  0.606  0.454  0.700  0.428  0.739  0.658  0.765  0.683
                          RGB&ndvi     0.893  0.872  0.717  0.519  0.492  0.699  0.615  0.688  0.510  0.767  0.677
                          RGB&osavi    0.863  0.916  0.663  0.548  0.423  0.608  0.521  0.682  0.532  0.790  0.655
                          RGB&remsr    0.881  0.913  0.787  0.687  0.436  0.642  0.485  0.737  0.675  0.808  0.705*
                          RGB&rendvi   0.867  0.910  0.597  0.700  0.460  0.694  0.428  0.532  0.564  0.776  0.653
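For clarity, the improvement percentages quoted in the Discussion below are relative gains of the best RGB&VI average IoU over the corresponding RGB baseline in Table IV; they can be reproduced as follows (values copied from the table):

```python
# Best average IoU per model/pretraining block and its RGB baseline, from Table IV.
baseline = {"DeepLabV3/ADE20k": 0.671, "DeepLabV3/CityScapes": 0.660,
            "SegFormer/ADE20k": 0.667, "SegFormer/CityScapes": 0.620}
best_vi  = {"DeepLabV3/ADE20k": 0.682, "DeepLabV3/CityScapes": 0.696,
            "SegFormer/ADE20k": 0.704, "SegFormer/CityScapes": 0.705}

for key in baseline:
    gain = 100 * (best_vi[key] - baseline[key]) / baseline[key]
    print(f"{key}: +{gain:.2f} %")  # 1.64, 5.45, 5.55, 13.71 (quoted as 13.7 %)
```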

V. DISCUSSION

The image-to-image translation models can retrieve reliable red-edge and near-infrared channels, with an SSIM greater than 0.8. Nevertheless, one should be aware that in environments with inadequate illumination, the model fails to provide a fair representation of the multispectral channels; see Fig. 4. Although a previous work [23] achieved SSIM values above 0.9, there are significant differences in the environments mapped. First, [23] captured aerial photographs of crop fields, which contain two classes, vegetation and terrain. This might alleviate the prediction of near-infrared channels. In our case, however, street view images of urban trees and objects such as buildings could change the illumination conditions by introducing unexpected shadows or sparkles that might affect the spectral reflectance recorded by the hyperspectral sensor.

On the other hand, it is essential to highlight that [23] performs a radiometric calibration using a reflectance panel to obtain absolute spectral information. In our case, the hyperspectral city dataset does not state whether the spectral reflectance values are absolute or relative. Nevertheless, we considered this dataset and the resulting outcomes suitable for urban tree segmentation purposes. We could not further assess the synthesized red-edge and near-infrared channels because our dataset lacks multispectral data. Moreover, the assessment of multispectral reconstruction is beyond the focus of the present work.

The image-to-image translation model's reconstruction performance might influence the segmentation networks' performance. Although the red-edge and near-infrared channels achieve an SSIM over 0.8, the synthesized multispectral data might not be as informative as measured data. In this context, we encourage future works to evaluate vegetation indices computed from genuine multispectral channels because they could add knowledge not retrieved by synthesized data. Note that this work is the first attempt to improve urban tree segmentation performance using multispectral information. Therefore, the presented strategy could be used as a baseline for future works that seek to improve the pixel-wise identification of urban trees.

As expected, the knowledge provided by vegetation indices helps to improve the segmentation performance of the DeepLabV3 and SegFormer networks. However, not all 19 sets (RGB&VI) boost the segmentation behavior. Moreover, the enhancement depends on the pretraining dataset. For instance, when DeepLabV3 and SegFormer are pre-trained with the CityScapes dataset, more sets (RGB&VI) yield higher IoU values than RGB images alone.

The segmentation outcomes presented in Table IV show that vegetation indices could improve the segmentation of urban trees. Specifically, when using DeepLabV3, one could expect a boost in the average IoU of 1.64 % and 5.45 % for the ADE20k and CityScapes datasets, respectively. For SegFormer, the enhancement is about 5.55 % and 13.7 % for ADE20k and CityScapes, correspondingly. Note that SegFormer achieves the largest IoU difference with respect to RGB images when it uses the RGB&remsr set and is pre-trained with the CityScapes dataset; see Table IV.

Regarding visible and multispectral VIs, the latter could be the ones that add more information for boosting the segmentation networks. Specifically, the EVI and REMSR vegetation indices allow us to achieve the best segmentation performance using the DeepLabV3 and SegFormer networks, respectively.
Note that the MPRI index, an RGB-based index, shows performance similar to a multispectral index (EVI); see Table IV. Despite these segmentation improvements, one should be aware that the best outcome obtained with RGB&VI data (SegFormer pre-trained with CityScapes) is only 5.07 % greater than the best RGB outcome (DeepLabV3 pre-trained with ADE20k). These results might suggest that deep neural networks can decode the knowledge carried by vegetation indices during the training and validation stages. In particular, the ADE20k dataset might have a fair number of vegetation and tree instances that alleviates the transfer learning procedure for segmenting the tree of interest with a small quantity of urban tree samples.

Based on the above, the selection of pretraining samples plays a crucial role in alleviating further steps in the pixel-wise identification of urban trees. In particular, for RGB images, we suggest using the DeepLabV3 network pre-trained with the ADE20k dataset for future works related to urban tree segmentation, because it retrieves the best IoU values for RGB images. If multispectral information is available, the SegFormer model pre-trained with the CityScapes dataset should be used because of its performance.

On the other hand, the differences between DeepLabV3 and SegFormer are shown in Fig. 5. In particular, for the worst case, the SegFormer network shows a greater region of false positives than DeepLabV3. Further, for both the best and worst outcomes, SegFormer achieves lower precision than DeepLabV3. However, since the difference between their best average IoU outcomes is about 0.009, more experiments should be conducted to assess each segmentation network for urban tree segmentation.

Fig. 5: Qualitative results of the segmentation networks; the best and worst outcomes are presented. Dark cyan represents true positive pixels, dark magenta represents false positive pixels, and the light green region shows false negative pixels. Samples S2 and S5 achieve the best and worst outcomes, respectively. (a) DeepLabV3: pre-trained on CityScapes and fine-tuned using RGB&evi; p and r are 0.949 and 0.973 for the best outcome, and 0.539 and 0.969 for the worst outcome. (b) SegFormer: pre-trained on CityScapes and fine-tuned using RGB&remsr; p and r are 0.931 and 0.979 for the best outcome, and 0.437 and 0.995 for the worst outcome.

Although the proposed work alleviates urban tree segmentation, this task should still be considered challenging. Specifically, among the testing samples evaluated, the ones that yield the lowest IoU values are occluded trees and trees with combined crowns. Instances of these cases are shown in Fig. 1. In this sense, future works should focus on fusing street view images, aerial images, and tree georeferences to tackle the still unsolved issues regarding tree segmentation. Moreover, we also suggest performing aerial surveys and applying the methodologies proposed by previous researchers [28], [29] to improve the segmentation of urban trees.

Our experiments also revealed that deep neural segmentation networks can, by themselves, decode the information retrieved by vegetation indices based on visible or synthesized multispectral bands. We infer this because, out of the 19 VIs, just three indices yield an average IoU greater than the one obtained with RGB images when using the ADE20k dataset. This behavior might be due to all VIs being determined from a single source of information, the RGB images. The network's ability to decode vegetation indices might depend on the vegetation and tree instances available in the pretraining sets. For instance, by pre-training the segmentation models with the ADE20k dataset, one can achieve the best IoU scores using solely RGB images for fine-tuning these models. Conversely, the CityScapes dataset might need to include more vegetation or tree examples for inferring the information retrieved by vegetation indices.

Finally, the insight retrieved by the current work alleviates and guides future works regarding the visible and multispectral assessment of urban trees and related tasks.

VI. CONCLUSIONS

The assessment of visible and multispectral vegetation indices shows that the knowledge derived from these indices can improve the pixel-wise identification of urban trees. Note that the multispectral indices are computed using red-edge and near-infrared channels estimated by an image-to-image translation model; the structural similarity index for the red-edge and near-infrared channels was 0.86 and 0.81, respectively. Regarding segmentation outcomes, one could improve the IoU score from 0.620 to 0.705 (13.7 %) by using the SegFormer segmentation network pre-trained with the CityScapes dataset and RGB images combined with the Red Edge Modified Simple Ratio index. However, more experiments with measured multispectral information are suggested, given that the segmentation improvements were achieved with synthesized red-edge and near-infrared channels. Moreover, if just RGB images are available, we advise employing DeepLabV3 pre-trained with the ADE20k dataset as the base network for further fine-tuning with a custom urban tree dataset. Specifically, this configuration achieves the best RGB IoU value (i.e., 0.671) when fine-tuned with our RGB custom tree dataset.

VII. ACKNOWLEDGEMENTS

This work is supported by the Agencia Nacional de Investigación y Desarrollo (ANID) under grants Fondecyt 11220510, FONDEF ID21I10360, and Tree Fund 21-JD-01. JD thankfully acknowledges funding from the Advanced Center of Electrical and Electronic Engineering, AC3E (ANID/FB0008).

REFERENCES

[1] Gillner, S., Vogt, J., Tharang, A., Dettmann, S., and Roloff, A., 2015. "Role of street trees in mitigating effects of heat and drought at highly sealed urban sites". Landscape and Urban Planning, 143, pp. 33-42.
[2] Ponce-Donoso, M., Vallejos-Barra, O., Ingram, B., and Daniluk-Mosquera, G., 2020. "Urban trees and environmental variables relationships in a city of central Chile". Arboriculture & Urban Forestry, 46(2), pp. 84-95.
[3] McPherson, G., Simpson, J. R., Peper, P. J., Maco, S. E., and Xiao, Q., 2005. "Municipal forest benefits and costs in five US cities". Journal of Forestry, 103(8), pp. 411-416.
[4] Schwarz, K., Fragkias, M., Boone, C. G., Zhou, W., McHale, M., Grove, J. M., O'Neil-Dunne, J., McFadden, J. P., Buckley, G. L., Childers, D., et al., 2015. "Trees grow on money: Urban tree canopy cover and environmental justice". PLoS ONE, 10(4), p. e0122051.
[5] Mullaney, J., Lucke, T., and Trueman, S. J., 2015. "A review of benefits and challenges in growing street trees in paved urban environments". Landscape and Urban Planning, 134, pp. 157-166.
[6] Beery, S., Wu, G., Edwards, T., Pavetic, F., Majewski, B., Mukherjee, S., Chan, S., Morgan, J., Rathod, V., and Huang, J., 2022. "The Auto Arborist dataset: A large-scale benchmark for multiview urban forest monitoring under domain shift". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21294-21307.
[7] Branson, S., Wegner, J. D., Hall, D., Lang, N., Schindler, K., and Perona, P., 2018. "From Google Maps to a fine-grained catalog of street trees". ISPRS Journal of Photogrammetry and Remote Sensing, 135, pp. 13-30.
[8] Choi, K., Lim, W., Chang, B., Jeong, J., Kim, I., Park, C.-R., and Ko, D. W., 2022. "An automatic approach for tree species detection and profile estimation of urban street trees using deep learning and Google Street View images". ISPRS Journal of Photogrammetry and Remote Sensing, 190, pp. 165-180.
[9] Jodas, D. S., Brazolin, S., Yojo, T., De Lima, R. A., Velasco, G. D. N., Machado, A. R., and Papa, J. P., 2021. "A deep learning-based approach for tree trunk segmentation". In 2021 34th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), IEEE, pp. 370-377.
[10] Jodas, D. S., Yojo, T., Brazolin, S., Velasco, G. D. N., and Papa, J. P., 2022. "Detection of trees on street-view images using a convolutional neural network". International Journal of Neural Systems, 32(01), p. 2150042.
[11] Lumnitz, S., Devisscher, T., Mayaud, J. R., Radic, V., Coops, N. C., and Griess, V. C., 2021. "Mapping trees along urban street networks with deep learning and street-level imagery". ISPRS Journal of Photogrammetry and Remote Sensing, 175, pp. 144-157.
[12] Wang, Y., Yan, X., Bao, H., Chen, Y., Gong, L., Wei, M., and Li, J. "Detecting occluded and dense trees in urban terrestrial views with a high-quality tree detection dataset". IEEE Transactions on Geoscience and Remote Sensing, 60, pp. 1-12.
[13] Berland, A., and Lange, D. A., 2017. "Google Street View shows promise for virtual street tree surveys". Urban Forestry & Urban Greening, 21, pp. 11-15.
[14] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B., 2016. "The Cityscapes dataset for semantic urban scene understanding". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., and Torralba, A., 2017. "Scene parsing through ADE20K dataset". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633-641.
[16] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[17] Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H., 2017. "Rethinking atrous convolution for semantic image segmentation". arXiv preprint arXiv:1706.05587.
[18] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., and Luo, P., 2021. "SegFormer: Simple and efficient design for semantic segmentation with transformers". arXiv preprint arXiv:2105.15203.
[19] Arevalo-Ramirez, T., Guevara, J., Rivera, R. G., Villacrés, J., Menéndez, O., Fuentes, A., and Cheein, F. A., 2021. "Assessment of multispectral vegetation features for digital terrain modeling in forested regions". IEEE Transactions on Geoscience and Remote Sensing, 60, pp. 1-9.
[20] Climates to Travel, 2022. Climate - Santiago, Chile. https://www.climatestotravel.com/climate/chile/santiago. [Online; accessed 20-December-2022].
[21] The GIMP Development Team. GIMP.
[22] Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A., 2017. "Image-to-image translation with conditional adversarial networks". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Aslahishahri, M., Stanley, K. G., Duddu, H., Shirtliffe, S., Vail, S., Bett, K., Pozniak, C., and Stavness, I., 2021. "From RGB to NIR: Predicting of near infrared reflectance from visible spectrum aerial images of crops". In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1312-1322.
[24] Huang, Y., Ren, T., Shen, Q., Fu, Y., and You, S., 2021. HSICityV2: Urban Scene Understanding via Hyperspectral Images, July.
[25] Jaya, N., Harmayani, K., Widhiawati, I., Atmaja, G., and Jaya, I., 2022. "Spatial analysis of vegetation density classification in determining environmental impacts using UAV imagery". ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 3, pp. 417-422.
[26] Fu, H., Wang, C., Cui, G., She, W., and Zhao, L., 2021. "Ramie yield estimation based on UAV RGB images". Sensors, 21(2), p. 669.
[27] MMSegmentation Contributors, 2020. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation.
[28] Wallace, L., Lucieer, A., and Watson, C. S., 2014. "Evaluating tree detection and segmentation routines on very high resolution UAV LiDAR data". IEEE Transactions on Geoscience and Remote Sensing, 52(12), pp. 7619-7628.
[29] Harikumar, A., Bovolo, F., and Bruzzone, L., 2018. "A local projection-based approach to individual tree detection and 3-D crown delineation in multistoried coniferous forests using high-density airborne LiDAR data". IEEE Transactions on Geoscience and Remote Sensing, 57(2), pp. 1168-1182.
