a School of Geosciences and Info-Physics, Central South University, Changsha, Hunan 410083, China
b School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
Keywords: Road extraction; Semantic segmentation; Spatial information inference structure; Road-specific contextual information

Abstract: Deep neural networks perform well in road extraction from very high-resolution satellite imagery. A network with certain reasoning ability will give more satisfactory road network extraction results. In this study, we designed a spatial information inference structure, which enables multidirectional message passing between pixels when it is integrated into a typical semantic segmentation framework. Since the spatial information can be propagated and reinforced via inter-layer propagation, the proposed road extraction network can learn both the local visual characteristics of the road and the global spatial structure information (such as the continuity and trend of the road). As a result, this method can effectively handle occlusions and preserve the continuity of the extracted road. The validation experiments using three large datasets of very high-resolution (VHR) satellite imagery show that the proposed method can improve road extraction accuracy and provide an output that is more in line with human expectations.
1. Introduction

Road extraction using remote sensing technology has wide applications in urban design, navigation, and geographic information updating. Very high-resolution (VHR) satellite imagery, with highly structured and unified data, is an excellent information source for road network extraction. However, the manual interpretation of remote sensing imagery costs a lot of time and effort. Automatic extraction of road information from VHR satellite imagery would improve the efficiency of transportation database acquisition and updating.

Over the past decades, various road extraction algorithms have been proposed (Mena, 2003; Das et al., 2011; Chaudhuri et al., 2012). They either detect the skeletons of roads (Shi et al., 2014; Sujatha and Selvathi, 2015) or extract all the road pixels. These methods can be divided into feature-based approaches and object-based approaches. Early feature-based methods only used spectral features. Spatial features, such as lines (Quackenbush, 2004), edges (Unsalan and Sirmacek, 2012), and ridges (Nevatia and Babu, 1980; Treash and Amaratunga, 2000), were taken into consideration later. Moreover, prior knowledge such as direction, magnitude and geometric features (Gamba et al., 2006; Poullis and You, 2010; Liu et al., 2017) is also used as constraints for road extraction. Object-based methods consider the spectral and spatial features of roads, but they extract road segments from images and then further refine them using custom rules (Yuan et al., 2011; Grinias et al., 2016; Maboudi et al., 2018). These feature-based and object-based algorithms have several points in common:

Combination of spectral features and spatial features: Since high-resolution satellite imagery has less spectral information and more spatial information, a road detector using only spectral features is unreliable, due to the occlusion of neighboring objects, such as buildings, trees and even shadows (Mirnalinee et al., 2011). Therefore, combining spectral features with spatial features may provide more reliable results for road extraction (Das et al., 2011; Rao et al., 2004; Gupta et al., 2007; Plaza et al., 2009; Shi et al., 2014).

Fusion of multiscale information: Many empirical studies (Peng et al., 2008; Huang and Zhang, 2009; Wang et al., 2014; Miao et al., 2016) have suggested that road and road centerline extraction algorithms can be improved by fusing multiscale spatial features. Analyzing the reasons for the improvement, we believe that most traditional road extraction algorithms based on a single scale make judgments only on the basis of the pixels in the neighborhood or within the coverage of an operator of a certain size (Chaudhuri et al., 2012; Mirnalinee et al., 2011). Therefore, for a fixed-scale input image, the amount of information received by the feature extraction operator and the classifier is fixed. As a result, too little information leads to improper classification, and too much redundant information may interfere with the judgment. In conclusion, multiscale information is critical for road extraction, but the method for multiscale information fusion needs to be designed properly.
⁎ Corresponding author. E-mail address: lihaifeng@csu.edu.cn (H. Li).
https://doi.org/10.1016/j.isprsjprs.2019.10.001
Received 5 June 2019; Received in revised form 1 October 2019; Accepted 3 October 2019
0924-2716/ © 2019 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
C. Tao, et al. ISPRS Journal of Photogrammetry and Remote Sensing 158 (2019) 155–166
Fig. 2. Principles of (a) ASPP and (b) SIIS for contextual information modeling.
However, they failed to extract satisfactory road information from remote sensing images, especially when roads are occluded by other objects, such as trees, shadows, and buildings. The main reason is that blindly fusing MSML features and enlarging the receptive field may introduce invalid context information and lead to errors in the extracted road information (Fig. 2(a)). Furthermore, the effective road-specific contextual information should be the local road topology (Fig. 2(b)).

All in all, modeling effective road-specific contextual information is the key to solving the problem of fractures in the extracted roads, especially in the case of occlusion. To this end, we propose the spatial information inference structure (SIIS) to capture the local topology of roads (Fig. 2(b)). By integrating SIIS, most classic segmentation networks can utilize the road-specific contextual information to overcome obstacles and improve the completeness of the extracted roads.

3. Method

In this study, we designed a spatial information inference structure (SIIS) to better model the road-specific contextual information. SIIS can explicitly transmit information along roads by an RNN-based information processing unit. In the following, we first describe the basic structure of the proposed SIIS and then detail the involved spatial information inference mechanism. Finally, we show how to incorporate SIIS into a traditional semantic segmentation framework to perform end-to-end training.

3.1. The spatial information inference structure

SIIS is agnostic to the base semantic segmentation framework, and its overall structure is depicted in Fig. 3. As shown in part I of Fig. 3, the input of SIIS is a tensor (feature map) of size C × H × W output from the backbone network's encoder (i.e. the feature extractor), where C, H, and W denote the number of channels, rows, and columns, respectively. The tensor is first split into k chunks along H, so that the thickness of each chunk is w = H/k. Then, the obtained sequence of chunks S_1 = {C_1^1, C_1^2, …, C_1^k} is sent into CRNN_1 one by one, where CRNN_1 denotes the first information processing unit in SIIS and will be introduced in Section 3.2. Specifically, the first chunk C_1^1 is optimized by CRNN_1 to generate a new chunk C_2^1 of equal size. When CRNN_1 optimizes the second chunk C_1^2, the new chunk C_2^1 is also taken as input to provide contextual information. This process continues until the last chunk C_1^k is updated, during which the context information is continuously transmitted downward.

In part II, the new chunks C_2^1, C_2^2, …, C_2^k form a sequence S_2 = {C_2^k, …, C_2^2, C_2^1} from bottom to top, which is then sent into CRNN_2 for optimization in the same way as in part I, producing k new chunks. After that, these new chunks are concatenated along the H dimension to form a complete tensor of size C × H × W. To increase the directions of information propagation, the new tensor is re-split along the W dimension and processed similarly in parts III and IV. As shown in Fig. 3, the four parts of SIIS correspond to the four main directions of context information propagation: downward, upward, rightward and leftward.

3.2. The information processing unit

The key step of road context information modeling is to transmit useful context information, such as the local topology information of roads. Nevertheless, the basic framework of SIIS only ensures that each pixel indiscriminately receives information from distant pixels in four directions. Thus, it is necessary to design a refined message passing mechanism that can remember useful messages and forget unrelated ones. Since this mechanism behaves like a memory machine, we adopted an RNN as the information processing unit, which has been demonstrated to be an effective tool for modeling long-term memory in a sequence.

However, traditional RNN units can only take one-dimensional input, resulting in the loss of spatial information that is useful for reducing noise and fine-tuning the directions of road context information
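The four-direction chunk scan described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `update` callable stands in for the Conv3d-RNN unit of Section 3.2, and a toy running average is used in its place.

```python
import numpy as np

def directional_pass(t, k, axis, reverse=False, update=None):
    """One SIIS part: split the tensor along `axis` into k chunks and update
    each chunk using the previously updated chunk as context."""
    if update is None:
        # Stand-in for the Conv3d-RNN unit: blend the chunk with its context.
        update = lambda chunk, ctx: chunk if ctx is None else 0.5 * (chunk + ctx)
    chunks = np.split(t, k, axis=axis)
    if reverse:
        chunks = chunks[::-1]
    out, ctx = [], None
    for c in chunks:
        c = update(c, ctx)   # context from the previous chunk flows in
        out.append(c)
        ctx = c
    if reverse:
        out = out[::-1]
    return np.concatenate(out, axis=axis)

def siis(t, k):
    """Four sequential parts on a C x H x W feature map, one per direction."""
    t = directional_pass(t, k, axis=1)                # part I: downward along H
    t = directional_pass(t, k, axis=1, reverse=True)  # part II: upward along H
    t = directional_pass(t, k, axis=2)                # part III: rightward along W
    t = directional_pass(t, k, axis=2, reverse=True)  # part IV: leftward along W
    return t
```

Note that the output has the same C × H × W size as the input, which is what lets SIIS sit between an encoder and a decoder.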
Fig. 3. Basic structure of the proposed spatial information inference structure (SIIS).
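The recurrent update performed by the information processing unit can be illustrated with a toy convolutional recurrent cell. This sketch is a deliberate simplification of the Conv3d-RNN discussed around Fig. 6: a single tanh gate and a hand-rolled 3 × 3 convolution on one channel, rather than 3D convolutions over full C × w × W chunks.

```python
import numpy as np

def conv3x3(x, w):
    """'Same' 2D correlation of a single-channel map with a 3x3 kernel."""
    H, W = x.shape
    pad = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            out += w[i, j] * pad[i:i + H, j:j + W]
    return out

def conv_rnn(chunks, wx, wh):
    """Recurrent update h_t = tanh(conv(x_t; wx) + conv(h_{t-1}; wh)).
    The spatial convolutions let the hidden state keep its 2D structure,
    unlike a plain RNN that would flatten each chunk to a vector."""
    h = np.zeros_like(chunks[0], dtype=float)
    outs = []
    for x in chunks:
        h = np.tanh(conv3x3(x, wx) + conv3x3(h, wh))
        outs.append(h)
    return outs
```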
Fig. 6. The process of the Conv3d-RNN units gradually controlling the contextual information received by an element in the feature map, where T0, T1, T2, T3 represent different stages of model training.
contextual information that can correct the prediction of this pixel is retained (T2). As useful contextual information accumulates, the pixel has a larger chance of being classified as road (T3).

3.4. Network integration with SIIS and end-to-end training

In this study, we selected a high-performance typical semantic segmentation network, DeepLabv3+, as the backbone to build the SII network (SII-Net). As shown in Fig. 7, we insert SIIS, with a bottleneck layer named conv3, between the encoder and decoder to make it fully functional. Specifically, the feature map extracted by the encoder and the ASPP structure of the backbone network mainly contains appearance features, so it should be further optimized by SIIS with its contextual information transmission and filtering mechanism. As a result, the optimized feature map containing the road-specific contextual information can finally improve the road extraction results of the decoder.

To train this end-to-end SII-Net for road extraction, we used a comprehensive index as the loss function to measure the difference between the predictions P_i (i = 0, 1, 2, …, n) and the ground truth Gt_i (i = 0, 1, 2, …, n), where n is the number of training samples. The loss function is defined as:

Loss(W) = λM + (1 − λ)(1 − J),  (3)

where M is the mean squared error (MSE) between P_i and Gt_i, and J is the Jaccard index (Intersection over Union). Since M measures the basic category difference of pixels corresponding to P_i and Gt_i, and J emphasizes the deviation between the predicted road and the real road, the weight parameter λ is used to adjust the contribution ratio of M and J to the total loss.

4. Experiments and analysis

4.1. Overall details of the experiments

4.1.1. Dataset

The validation experiments used the DEEPGLOBE-CVPR 2018 road extraction sub-challenge dataset¹ (referred to as the CVPR dataset hereafter), the Massachusetts road dataset² and the RoadTracer dataset (Bastani et al., 2018). The CVPR dataset contains 6226 satellite images with a paired mask for road labels (Demir et al., 2018). These images, collected by DigitalGlobe satellites, have a size of 1024 × 1024 pixels and a resolution of 50 cm/pixel. In the experiment, they were divided into a training set, a validation set, and a test set. The Massachusetts road dataset (Mnih and Hinton, 2010) consists of 1171 images, including 1108 images for training, 14 images for validation, and 49 images for testing. Each image has a size of 1500 × 1500 pixels and a resolution of 120 cm/pixel. The RoadTracer dataset contains 300

¹ https://competitions.codalab.org/competitions/18467#participate.
² https://www.cs.toronto.edu/vmnih/data/.
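The two ingredients of Section 3.4, plugging SIIS between encoder and decoder and the composite loss of Eq. (3), can be sketched as below. The soft (product-based) Jaccard term and the generic stage callables are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def road_loss(pred, gt, lam=0.7, eps=1e-7):
    """Eq. (3): Loss(W) = lam * M + (1 - lam) * (1 - J).
    pred, gt: arrays of road probabilities / binary masks in [0, 1]."""
    m = np.mean((pred - gt) ** 2)              # M: mean squared error
    inter = np.sum(pred * gt)                  # soft intersection
    union = np.sum(pred) + np.sum(gt) - inter  # soft union
    j = inter / (union + eps)                  # J: Jaccard index (IoU)
    return lam * m + (1.0 - lam) * (1.0 - j)

def sii_net(encoder, siis, decoder):
    """SII-Net-style pipeline as in Fig. 7: encoder -> SIIS -> decoder.
    Each stage is any callable mapping feature maps to feature maps."""
    def forward(x):
        return decoder(siis(encoder(x)))
    return forward
```

With lam = 0.7 the MSE term dominates (as used for the CVPR dataset); lam = 0.1 shifts weight to the Jaccard term (as used for the centerline-style datasets).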
Fig. 7. Flowchart of the SII-Net. The bottleneck layer is used to reduce the dimension of the feature map input to SIIS to avoid a large computational load.
Fig. 8. The process of the category ratio cropping method being applied to a typical sample of the CVPR dataset.
images with a size of 4096 × 4096 pixels and a resolution of 60 cm/pixel, with 180 images for training and 120 images for testing.

4.1.2. Data pre-processing

For the CVPR dataset and the Massachusetts road dataset, the misidentification of road pixels as background pixels is the main source of the loss value, since the background (non-road) pixels are much more numerous than the road pixels in the satellite image (Figs. 9 and 10). Therefore, the optimization may reduce the loss, but the optimized semantic segmentation networks have a large chance of misidentifying uncertain pixels as background rather than road. To solve this problem, we adopted a simple and effective data pre-processing strategy, the category ratio cropping (CRC) method.

Take an image I in the training set and its corresponding ground truth label L as an example (Fig. 8). First, {I; L} were slide-cropped with the same stride s and a w × w cropping window to get a set of sub-images and the corresponding sub-labels {I_s^i, L_s^i}, where s = 128 and w = 512. Second, L_s^i was used to calculate the ratios R_i = {n_1/n_s, n_2/n_s, …, n_c/n_s}, with n_c denoting the number of pixels belonging to category c in L_s^i and n_s denoting the total number of pixels of L_s^i. Then, the smallest value in R_i, min(R_i), was compared against a threshold. For the pairs of I_s^i and L_s^i, only those with min(R_i) greater than the threshold were kept. We set the ratio threshold to 0.01; it is a user-defined constant, and the influence of this parameter on the final road extraction results is analysed in Section 4.6.

After the CRC data pre-processing, the imbalance between the number of road and background samples is effectively alleviated, so the performance of the trained model can be improved. Finally, we got 88,689 labelled images with a size of 512 × 512 from the CVPR
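The CRC procedure above can be sketched as follows. The threshold argument name `gamma` is an assumption (the symbol is not legible in this copy), and a binary road/background labelling is assumed.

```python
import numpy as np

def category_ratio_crop(image, label, s=128, w=512, gamma=0.01, num_classes=2):
    """Category ratio cropping (CRC): slide-crop {I; L} with stride s and a
    w x w window, keeping only crops whose rarest class ratio min(R_i)
    exceeds the threshold (crops that barely contain roads are dropped)."""
    kept = []
    H, W = label.shape
    for y in range(0, H - w + 1, s):
        for x in range(0, W - w + 1, s):
            sub_lab = label[y:y + w, x:x + w]
            n_s = sub_lab.size
            # R_i = {n_1/n_s, ..., n_c/n_s}: per-class pixel ratios
            ratios = [np.sum(sub_lab == c) / n_s for c in range(num_classes)]
            if min(ratios) > gamma:
                kept.append((image[y:y + w, x:x + w], sub_lab))
    return kept
```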
dataset and 48,290 from the Massachusetts roads dataset for training. The roads in the RoadTracer dataset are very dense, so we did not perform the CRC method. We directly used slide cropping with a stride of 256, and finally got 8820 labelled images for training, each 512 × 512 pixels in size.

4.1.3. Training details

The training sets obtained by data pre-processing were further augmented by a series of common data enhancement methods. We cropped the image and mask to a random size and aspect ratio, and flipped and rotated (by 90 degrees) them randomly. Then, random brightness (0.5–1.5), contrast (0.7–1.3), saturation (0.8–1.2) and hue (−0.05 to 0.05) adjustments for spectral augmentation were also used to increase the data diversity. Finally, limited by GPU video memory, all images were scaled to 256 × 256 for feeding the networks. Afterwards, all models were trained with the same parameter settings and environment. Specifically, we trained the models using the Adam optimizer on the Ubuntu 16.04 platform with one GTX1080Ti (11 GB memory), which allows a batch size of 16 images. The learning rate was initially set to 1e−5 and reduced by a factor of 0.02 per epoch. Since the CRC method divided the original large images into many small images, the model iterated many more times in each epoch on the CVPR dataset and the Massachusetts roads dataset. Moreover, we used an ImageNet pre-trained encoder in our network, which can further accelerate convergence. As a result, the proposed network converged in only 15 epochs on these two datasets. The sample size of the RoadTracer dataset is much smaller than that of the other two datasets, so convergence of the proposed network took up to 50 epochs.

Besides, the weight parameter λ in the loss function (Eq. (3)) was set to 0.7 for the CVPR dataset and 0.1 for the Massachusetts roads dataset and the RoadTracer dataset. On the one hand, λ is determined by precision evaluation on the test set. On the other hand, it should conform to the characteristics of the dataset. Unlike the CVPR dataset with complete road surface masks, roads in the Massachusetts dataset and the RoadTracer dataset are marked with centerlines of equal width. As a result, the ratio of road pixels to background pixels in the masks of these two datasets is much smaller than that in the CVPR dataset. In this case, the misidentification of road pixels as background pixels only leads to a small M defined in Eq. (3), so the proportion of M should be reduced while the proportion of J should be increased to promote the convergence of the model.

4.1.4. Evaluation metrics

To assess the performance of the road extraction methods, we adopted three measures as follows:

• F1 score is the harmonic mean of precision (P) and recall (R), and it can be calculated by Eq. (4). We used the relaxed P and R and set the slack parameter to 3 as previous studies did (Mnih and Hinton, 2010; Zhang et al., 2017; Saito et al., 2016).

F1 = 2 × (P × R) / (P + R)  (4)

Mean NRB = (1/n) × Σ_{i=1}^{n} NRB_i,  (5)

where n denotes the number of images in the test set.

4.2. Experiment using the CVPR dataset

In this experiment, we took the road extraction task as a semantic segmentation problem and focused on extracting the complete road surface. We compared the proposed SII-Net with four semantic segmentation based road extraction methods, including U-Net (Ronneberger et al., 2015), Deep Residual U-Net (Zhang et al., 2017), HF-FCN (Zuo et al., 2016) and the original DeepLabv3+ (Chen et al., 2018). As shown in rows 1 and 2 of Fig. 9, the proposed SII-Net successfully extracts the roads under occlusion. This indicates that SII-Net does not solely depend on the visual features of roads but gains reasoning ability by modeling road-specific contextual information. In the third row, SII-Net does the best job of depicting the roads in the red circle. Moreover, in complex situations (the last row of Fig. 9), the SII-Net extraction results have less noise than the results of the other methods.

A quantitative assessment was also done to compare the effectiveness of these methods. As shown in Table 1, the proposed SII-Net achieved the largest F1 score of 0.9279 with a Mean IoU of 0.8344, larger than the original DeepLabv3+ (F1 score of 0.9158 and Mean IoU of 0.8247). The performance improvement was consistent across all indicators compared to the other methods, including U-Net, ResUnet and HF-FCN. For example, when compared with U-Net, SII-Net achieved an increase in F1 score and Mean IoU of 9.69% and 8.46%, respectively. Since SII-Net uses the road-specific contextual information, the number of false fractures in the extracted roads is reduced considerably, nearly 2 times lower than DeepLabv3+ (from 6.03 to 3.49) and nearly 3 times lower than ResUnet (from 9.77 to 3.49).

4.3. Experiment using the Massachusetts roads dataset

In the Massachusetts roads dataset experiment, we compared the proposed method with the four semantic segmentation based road extraction methods mentioned above. As shown in Table 2, the proposed SII-Net gets a slight increase of 1.08–2.19% for F1 score and 1.13–3.75% for Mean IoU, respectively. Different from these indicators, which are calculated by area, Mean NRB only measures the topological integrity of the extracted roads and thus reflects road connectivity more directly. For example, DeepLabv3+ demonstrates some advantages over U-Net in the area-based metrics, but it fails in the topology-based metric. The proposed SII-Net achieves better performance than the other methods in both the area-based metrics and the topology-based metric. By contrast, the dramatic improvement in Mean NRB further demonstrates the advantages of SII-Net in solving the problem of fractures in the extracted roads (e.g. 5.32 lower than DeepLabv3+ and 9.32 lower than HF-FCN). The results in Fig. 10 also indicate that the road network extracted by the proposed approach has more details and fewer false fractures.
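The relaxed F1 of Eq. (4) and the Mean NRB of Eq. (5) can be sketched as follows. The square (Chebyshev) neighborhood used for the slack ρ = 3 and the pure-Python dilation are illustrative simplifications, and the per-image NRB counts are assumed to be computed elsewhere.

```python
import numpy as np

def dilate(mask, r):
    """Binary dilation with a (2r+1) x (2r+1) square structuring element."""
    out = np.zeros_like(mask)
    H, W = mask.shape
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            ys = slice(max(dy, 0), H + min(dy, 0))
            yd = slice(max(-dy, 0), H + min(-dy, 0))
            xs = slice(max(dx, 0), W + min(dx, 0))
            xd = slice(max(-dx, 0), W + min(-dx, 0))
            out[yd, xd] |= mask[ys, xs]
    return out

def relaxed_f1(pred, gt, rho=3):
    """Relaxed F1 (Eq. (4)) on boolean masks: a predicted road pixel counts
    as correct if a true road pixel lies within rho pixels, and vice versa
    for recall."""
    p = np.sum(pred & dilate(gt, rho)) / max(np.sum(pred), 1)  # relaxed precision
    r = np.sum(gt & dilate(pred, rho)) / max(np.sum(gt), 1)    # relaxed recall
    return 2 * p * r / max(p + r, 1e-7)

def mean_nrb(nrb_per_image):
    """Eq. (5): Mean NRB = (1/n) * sum_i NRB_i over the n test images."""
    return sum(nrb_per_image) / len(nrb_per_image)
```

With rho = 0 the metric reduces to the ordinary pixel-exact F1.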
Fig. 9. Road extraction results using the CVPR dataset. (a) Ground truth. (b) U-Net. (c) ResUnet. (d) HF-FCN. (e) DeepLabv3+. (f) Proposed SII-Net.
Table 1
Road extraction results of five approaches using the CVPR dataset.

Method        F1 score   Mean IoU   Mean NRB
U-Net         0.8314     0.7498     7.27
ResUnet       0.8310     0.7537     9.77
HF-FCN        0.8897     0.7997     6.84
DeepLabv3+    0.9158     0.8247     6.03
SII-Net       0.9279     0.8344     3.49

Table 2
Road extraction results of five approaches using the Massachusetts roads dataset.

Method        F1 score   Mean IoU   Mean NRB
U-Net         0.8999     0.7441     11.59
ResUnet       0.8928     0.7490     14.06
HF-FCN        0.8888     0.7392     17.73
DeepLabv3+    0.8994     0.7654     13.73
SII-Net       0.9107     0.7767     8.41

Table 3
Road extraction results obtained by the five segmentation-based approaches from the RoadTracer dataset.

Method        F1 score   Mean IoU   Mean NRB
U-Net         0.6432     0.5965     748.13
ResUnet       0.7327     0.6612     649.07
HF-FCN        0.7490     0.6676     543.33
DeepLabv3+    0.7752     0.6884     499.00
SII-Net       0.7923     0.7016     410.60

ground truth by expanding the single-pixel width centerlines to 11 pixels wide as training labels. For the comparison, in the test stage, we first converted the ground truth and the predicted results of SII-Net and the four segmentation-based methods into road centerlines with single-pixel widths, which is consistent with the results of RoadTracer. Then, we expanded the single-pixel width ground truth and predictions to 8 pixels wide, because F1 score and IoU cannot properly evaluate road topology on single-pixel width centerlines.

As shown in Table 3, SII-Net is superior to the semantic segmentation based road extraction methods in accuracy, with an F1 score 1.71–14.91% higher and a Mean IoU 1.32–10.51% higher than the other methods. In addition, the Mean NRB of SII-Net is 88.40, 132.73, 238.47 and 337.53 smaller than those of DeepLabv3+, HF-FCN, ResUnet and U-Net, respectively, indicating that it has an obvious advantage in solving the problem of fractures in the extracted roads. The results in Fig. 11 also demonstrate that the road network extracted by the proposed approach has
Fig. 10. Road extraction results using the Massachusetts roads dataset. (a) Ground truth. (b) U-Net. (c) ResUnet. (d) HF-FCN. (e) DeepLabv3+. (f) Proposed SII-Net.
Fig. 11. Road extraction results using the RoadTracer dataset. (a) Ground truth. (b) U-Net. (c) ResUnet. (d) HF-FCN. (e) DeepLabv3+. (f) Proposed SII-Net.
more details and fewer false fractures. Note that the complete test images, each composed of 4 images, have a size of 8192 × 8192 pixels, so the calculated Mean NRB is much larger than the results of the first two datasets.

For the RoadTracer method, we directly used the well-trained model (available online) and the test-time strategies recommended by the authors of RoadTracer, including the steps and the settings of the relevant thresholds,³ to achieve the best evaluation result. As shown in Fig. 12, the road graphs generated by RoadTracer have good connectivity, but not good completeness. The proposed approach, however, obtains more accurate road details while maintaining the road connectivity. For quantitative

Method        F1 score   Mean IoU   Mean NRB
RoadTracer    0.6678     0.6317     487.33
SII-Net       0.7931     0.7013     378.80

Table 5
Comparison of road extraction results between Conv3d-RNN and Conv1D on three datasets.

Dataset                  SII-Net (Conv1D)        SII-Net (Conv3d-RNN)
                         F1 score   Mean IoU     F1 score   Mean IoU
CVPR dataset             0.9168     0.8210       0.9279     0.8344
Massachusetts dataset    0.9057     0.7678       0.9107     0.7767
RoadTracer dataset       0.7600     0.6775       0.7923     0.7016

In this section, we analysed the influence of the threshold, a key parameter of the data pre-processing method CRC, on the final road extraction. As shown in Table 6, the number of sub-images and labels obtained by CRC varies greatly with the threshold, which also leads to different final performance of the trained model. Taking the results of the Massachusetts dataset as an example, when the threshold is 0.01, many sub-images and labels that barely contain roads are filtered out, resulting in a much smaller sample size than with a threshold of 0. However, this alleviates the imbalance between road and background, which is conducive to better convergence of the model and thus improves the performance of the proposed method. Besides, the value of the threshold cannot be large, because that may cause too much data to be eliminated and reduce the accuracy of the proposed method.

5. Discussion

The above comparison experiments have demonstrated the effectiveness of the proposed SII-Net for road extraction, especially for solving occlusions and preserving the continuity of the extracted road. In this section, we further discuss the extensibility and robustness of SIIS.

³ In this case, the width of the blank area is close to 200 pixels to avoid going beyond the image border. For details, the code and paper of RoadTracer are available at https://roadmaps.csail.mit.edu/roadtracer.
Table 6
Different thresholds in the CRC data pre-processing method.

Sample size   F1 score   Mean IoU      Sample size   F1 score   Mean IoU

Table 7
Comparison of semantic segmentation networks with and without SIIS.

Method        CVPR dataset        Massachusetts dataset        RoadTracer dataset
Fig. 13. Road extraction results with artificial occlusions (yellow circles). (a) Input image. (b) U-Net. (c) U-Net with SIIS. (d) ResUnet. (e) ResUnet with SIIS. (f) DeepLabv3+. (g) DeepLabv3+ with SIIS. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 14. Road extraction results obtained by SII-Net in diverse road scenes, including (a) road segments with high curvature, (b) a complex scene with severe occlusion near a road intersection and (c) a rural narrow road without a clear boundary.
5.1. The extensibility of SIIS

Since the input and output of SIIS are feature maps with flexible sizes, it can be easily incorporated into most typical segmentation networks to improve their performance for road extraction. To verify the extensibility of the SIIS structure, we integrated SIIS with two classic semantic segmentation models, ResUnet and U-Net, and compared them with the original ones. As shown in Table 7, the performance of both models is significantly improved by combining them with SIIS. For U-Net, the F1 score and Mean IoU increased by nearly 5% on the CVPR dataset, by 0.9% and 1.5% on the Massachusetts dataset, and by 3.1% and 1.9% on the RoadTracer dataset, respectively. For ResUnet with SIIS and DeepLabv3+ with SIIS, similar improvements were achieved on the three datasets. These results demonstrate that integrating SIIS with traditional semantic segmentation models can improve the accuracy of road extraction.

5.2. The robustness of SIIS

In this section, we discuss the robustness of SIIS in the following two aspects:

First, robustness analysis in unseen occlusion scenarios. We manually attached some occlusions to the test samples. As Fig. 13 shows, the first row of images has a copy of a large truck attached on the road, and the
second row of images has attached rectangular areas of woods. The features of the attached occlusions are unnatural and distinguishable from the original training set. In this case, traditional semantic segmentation networks, which make decisions according to visual features or just memorize similar occlusion scenes from the training samples, cannot handle these extreme situations (columns b, d and f), but the proposed SIIS can help these models to extract the road information completely (columns c, e and g). This demonstrates the robustness and reasoning ability of the proposed SII-Net.

Second, robustness analysis in the case of diverse road scenes. Due to the complexity and diversity of roads, some extracted roads are prone to fractures, such as road segments with high curvature, road intersections and rural narrow roads. Fig. 14 shows the road extraction results of the proposed method in these challenging road scenes. As can be seen, the proposed method showed high robustness for curved road segments (Fig. 14(a)). The road scene in Fig. 14(b) is also challenging, because severe occlusion is close to the road intersection. However, the proposed method can also handle it. Our method failed to detect the path in the third challenging scene (Fig. 14(c)), which is an extremely narrow country path near a T-junction without obvious boundaries. The main reason for this failure is that this illegible part is at the end of the path, not in the middle, which causes SIIS to fail to gather enough evidence for reasoning.

6. Conclusions

In this study, we analysed the defects of traditional semantic segmentation networks in road extraction and proposed a novel spatial information inference structure that is based on road-specific contextual information. Experiments on three datasets showed the advantages of the proposed method for road extraction, especially in the face of occlusion. We also demonstrated that the road extraction performance of classic semantic segmentation networks can be significantly improved by integrating SIIS. Moreover, using the road-specific contextual information, the proposed road extraction methods are highly robust in handling complex roads, such as road segments with high curvature, roads under severe occlusion, and road intersections. Further work could examine the possibility of applying the proposed method to large VHR satellite datasets of other categories, such as rivers or other slender objects. Another future work aims at

References

Das, S., Mirnalinee, T., Varghese, K., 2011. Use of salient features for the design of a multistage framework to extract roads from high-resolution multispectral satellite images. IEEE Trans. Geosci. Remote Sens. 49 (10), 3906–3931.
Chaudhuri, D., Kushwaha, N., Samal, A., 2012. Semi-automated road detection from high resolution satellite images by directional morphological enhancement and segmentation techniques. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sensing 5 (5), 1538–1544.
Shi, W., Miao, Z., Debayle, J., 2014. An integrated method for urban main-road centerline extraction from optical remotely sensed imagery. IEEE Trans. Geosci. Remote Sens. 52 (6), 3359–3372.
Sujatha, C., Selvathi, D., 2015. Connected component-based technique for automatic extraction of road centerline in high resolution satellite images. Eurasip J. Image Video Process. 2015 (1), 8.
Quackenbush, L.J., 2004. A review of techniques for extracting linear features from imagery. Photogramm. Eng. Remote Sens. 70 (12), 1383–1392.
Unsalan, C., Sirmacek, B., 2012. Road network detection using probabilistic and graph theoretical methods. IEEE Trans. Geosci. Remote Sens. 50 (11), 4441–4453.
Nevatia, R., Babu, K.R., 1980. Linear feature extraction and description. Comput. Graphics Image Process. 13 (3), 257–269.
Treash, K., Amaratunga, K., 2000. Automatic road detection in grayscale aerial images. J. Comput. Civil Eng. 14 (1), 60–69.
Gamba, P., Dell'Acqua, F., Lisini, G., 2006. Improving urban road extraction in high-resolution images exploiting directional filtering, perceptual grouping, and simple topological concepts. IEEE Geosci. Remote Sens. Lett. 3 (3), 387–391.
Poullis, C., You, S., 2010. Delineation and geometric modeling of road networks. ISPRS J. Photogramm. Remote Sensing 65 (2), 165–181.
Liu, J., Qin, Q., Li, J., Li, Y., 2017. Rural road extraction from high-resolution remote sensing images based on geometric feature inference. ISPRS Int. J. Geo-Inf. 6 (10), 314.
Yuan, J., Wang, D.L., Wu, B., Yan, L., Li, R., 2011. Legion-based automatic road extraction from satellite imagery. IEEE Trans. Geosci. Remote Sens. 49 (11), 4528–4538.
Grinias, I., Panagiotakis, C., Tziritas, G., 2016. MRF-based segmentation and unsupervised classification for building and road detection in peri-urban areas of high-resolution satellite images. ISPRS J. Photogramm. Remote Sensing 122, 145–166.
Maboudi, M., Amini, J., Malihi, S., Hahn, M., 2018. Integrating fuzzy object based image analysis and ant colony optimization for road extraction from remotely sensed images. ISPRS J. Photogramm. Remote Sensing 138, 151–163.
Mirnalinee, T., Das, S., Varghese, K., 2011. An integrated multistage framework for automatic road extraction from high resolution satellite imagery. J. Indian Soc. Remote Sensing 39 (1), 1–25.
Rao, S.G., Puri, M., Das, S., 2004. Unsupervised segmentation of texture images using a combination of Gabor and wavelet features. In: Proceedings of the Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP), pp. 370–375.
Gupta, L., Pathangay, V., Patra, A., Dyana, A., Das, S., 2007. Indoor versus outdoor scene classification using probabilistic neural network. Eurasip J. Adv. Signal Process. 2007 (1), 1–10.
Plaza, A., Benediktsson, J.A., Boardman, J.W., Brazile, J., Bruzzone, L., Camps-Valls, G., Chanussot, J., Fauvel, M., Gamba, P., Gualtieri, A., 2009. Recent advances in techniques for hyperspectral image processing. Remote Sens. Environ. 113 (1), S110–S122.
Shi, W., Miao, Z., Wang, Q., Zhang, H., 2014. Spectral–spatial classification and shape features for urban road centerline extraction. IEEE Geosci. Remote Sens. Lett. 11 (4),
788–792.
extending the proposed model to the task of road detection using hy-
Peng, T., Jermyn, I.H., Prinet, V., Zerubia, J., 2008. Incorporating generic and specific
perspectral data, and further using hyperspectral inversion model (Tao prior knowledge in a multiscale phase field model for road extraction from vhr
et al., 2018, 2019) for road material classification. images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sensing 1 (2), 139–146.
Huang, X., Zhang, L., 2009. Road centreline extraction from high-resolution imagery
based on multiscale structural features and support vector machines. Int. J. Remote
Declaration of Competing Interest Sens. 30 (8), 1977–1987.
Wang, J., Qin, Q., Yang, X., Wang, J., Ye, X., Qin, X., 2014. Automated road extraction
The authors declared that there is no conflict of interest. from multi-resolution images using spectral information and texture. In: Proceedings
of the 2014 IEEE Geoscience and Remote Sensing Symposium (IGARSS). IEEE, pp.
533–536.
Acknowledgment Miao, Z.L., Shi, W.Z., Samat, A., Lisini, G., Gamba, P., 2016. Information fusion for urban
road extraction from vhr optical satellite images. IEEE J. Sel. Top. Appl. Earth Observ.
Remote Sensing 9 (5), 1817–1829.
This work was supported by National key research and development Hung, W.C., Tsai, Y.H., Shen, X., Lin, Z., Sunkavalli, K., Lu, X., Yang, M.H., 2017. Scene
projects (Grant No. 2018YFB0504500), National Natural Science parsing with global context embedding. In: Proceedings of the IEEE International
Foundation of China (Grant No. 41771458, 41301453), Young Elite Conference on Computer Vision, pp. 2631–2639.
Liu, Y., Wang, R., Shan, S., Chen, X., 2018. Structure inference net: object detection using
Scientists Sponsorship Program by Hunan Province of China under scene-level context and instance-level relationships. In: Proceedings of the IEEE
Grant 2018RS3012, and Hunan Science and Technology Department Conference on Computer Vision and Pattern Recognition, pp. 6985–6994.
Innovation Platform Open Fund Project under Grant 18K005. Tao, C., Mi, L., Li, Y., Qi, J., Xiao, Y., Zhang, J., 2019. Scene context-driven vehicle de-
tection in high-resolution aerial images. IEEE Trans. Geosci. Remote Sens. 57 (10),
7339–7351.
Appendix A. Supplementary material Grote, A., Heipke, C., Rottensteiner, F., 2012. Road network extraction in suburban areas.
Photogramm. Rec. 27 (137), 8–28.
Zhang, Z., Zhang, X., Sun, Y., Zhang, P., 2018. Road centerline extraction from very-high-
Supplementary data associated with this article can be found, in the
resolution aerial image and lidar data based on road connectivity. Remote Sensing 10
online version, at https://doi.org/10.1016/j.isprsjprs.2019.10.001. (8), 1284.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
References Mnih, V., Hinton, G.E., 2010. Learning to detect roads in high-resolution aerial images. In:
Proceedings of the European Conference on Computer Vision (ECCV), pp. 210–223.
Zhou, L., Zhang, C., Wu, M., 2018. D-linknet: Linknet with pretrained encoder and dilated
Mena, J.B., 2003. State of the art on automatic road extraction for gis update: a novel convolution for high resolution satellite imagery road extraction. In: Proceedings of
classification. Pattern Recogn. Lett. 24 (16), 3037–3058. the IEEE Conference on Computer Vision and Pattern Recognition Workshops
165
C. Tao, et al. ISPRS Journal of Photogrammetry and Remote Sensing 158 (2019) 155–166
(CVPRW). IEEE, pp. 192–196. Yu, F., Koltun, V., 2015. Multi-scale context aggregation by dilated convolutions, arXiv
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biome- preprint arXiv:1511.07122.
dical image segmentation. In: Proceedings of the International Conference on Medical Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G., 2018. Understanding
Image Computing and Computer-Assisted Intervention (MICCAI), vol. 9351. convolution for semantic segmentation. In: Proceedings of the 2018 IEEE Winter
Springer, pp. 234–241. Conference on Applications of Computer Vision (WACV). IEEE, pp. 1451–1460.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2017. Deeplab:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Semantic image segmentation with deep convolutional nets, atrous convolution, and
(CVPR), pp. 770–778. fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848.
Zhang, Z., Liu, Q., Wang, Y., 2017. Road extraction by deep residual u-net. IEEE Geosci. Chen, L.C., Papandreou, G., Schroff, F., Adam, H., 2017 Rethinking atrous convolution for
Remote Sens. Lett. PP (99), 1–5. semantic image segmentation, arXiv preprint arXiv:1706.05587.
Cheng, G., Wang, Y., Xu, S., Wang, H., Xiang, S., Pan, C., 2017. Automatic road detection Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018. Encoder-decoder with
and centerline extraction via cascaded end-to-end convolutional neural network. atrous separable convolution for semantic image segmentation. In: Proceedings of the
IEEE Trans. Geosci. Remote Sens. 55 (6), 3322–3337. European Conference on Computer Vision (ECCV), pp. 801–818.
Buslaev, A., Seferbekov, S., Iglovikov, V., Shvets, A., 2018. Fully convolutional network Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c., 2015.
for automatic road extraction from satellite imagery. In: Proceedings of the IEEE Convolutional lstm network: A machine learning approach for precipitation now-
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. casting. In: Proceedings of the Advances in Neural Information Processing Systems
207–210. (NIPS), pp. 802–810.
Bastani, F., He, S., Abbar, S., Alizadeh, M., Balakrishnan, H., Chawla, S., Madden, S., Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D.,
DeWitt, D., 2018. Roadtracer: Automatic extraction of road networks from aerial Raskar, R., Deepglobe, 2018: A challenge to parse the earth through satellite images.
images. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern
Pattern Recognition (CVPR). IEEE, New York, pp. 4720–4728. Recognition Workshops (CVPRW), pp. 172–181.
Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic Saito, S., Yamashita, T., Aoki, Y., 2016. Multiple object extraction from aerial imagery
segmentation. In: Proceedings of the IEEE Conference on Computer Vision and with convolutional neural networks. Electronic Imaging 60 (1), 1–9.
Pattern Recognition (CVPR), pp. 3431–3440. Zuo, T., Feng, J., Chen, X., 2016. Hf-fcn: Hierarchically fused fully convolutional network
Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. Segnet: A deep convolutional encoder- for robust building extraction. In: Proceedings of the Asian Conference on Computer
decoder architecture for image segmentation, arXiv preprint arXiv:1511.00561 39 Vision (ACCV). Springer, pp. 291–302.
(12), 2481–2495. Tao, C., Wang, Y.J., Cui, W.B., Zou, B., Zou, Z.R., Tu, Y.L., 2019. A transferable spec-
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017. Pyramid scene parsing network. In: troscopic diagnosis model for predicting arsenic contamination in soil. Sci. Total
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Environ. 669, 964–972.
(CVPR), pp. 2881–2890. Tao, C., Wang, Y.J., Zou, B., Tu, Y.L., Jiang, X.L., 2018. Assessment and analysis of mi-
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2014. Semantic image grations of heavy metal lead and zinc in soil with hyperspectral inversion model.
segmentation with deep convolutional nets and fully connected crfs, arXiv preprint Spectrosc. Spectral Anal. 38 (6), 1850–1855.
arXiv:1412.7062.
166