Tao 2019

ISPRS Journal of Photogrammetry and Remote Sensing 158 (2019) 155–166
Contents lists available at ScienceDirect
ISPRS Journal of Photogrammetry and Remote Sensing

journal homepage: www.elsevier.com/locate/isprsjprs
Spatial information inference net: Road extraction using road-specific T

contextual information
Chao Taoa, Ji Qia, Yansheng Lib, Hao Wanga, Haifeng Lia,
⁎
a
School of Geosciences and Info-Physics, Central South University, Changsha, Hunan 410083, China
b
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
ARTICLE INFO ABSTRACT
Keywords: Deep neural networks perform well in road extraction from very high-resolution satellite imagery. A network
Road extraction with certain reasoning ability will give more satisfactory road network extraction results. In this study, we
Semantic segmentation designed a spatial information inference structure, which enables multidirectional message passing between
Spatial information inference structure pixels when it is integrated to a typical semantic segmentation framework. Since the spatial information could be
Road-specific contextual information
propagated and reinforced via inter layer propagation, the proposed road extraction network can learn both the
local visual characteristics of the road and the global spatial structure information (such as the continuity and
trend of the road). As a result, this method can effectively solve occlusions and preserve the continuity of the
extracted road. The validation experiments using three large datasets of very high-resolution (VHR) satellite
imagery show that the proposed method can improve road extraction accuracy and provide an output that is
more in line with human expectations.
1. Introduction 2011; Grinias et al., 2016; Maboudi et al., 2018). These feature-based
and object-based algorithms have several points in common:
Road extraction using the remote sensing technology has got wide Combination of spectral features and spatial features: Since high-re-
applications in urban design, navigation, and geographic information solution satellite imagery has less spectral information and more spatial
update. The very high-resolution (VHR) satellite imagery with highly information, the road detector using only spectral features is unreliable,
structured and unified data is an excellent information source for road due to the occlusion of neighboring objects, such as buildings, trees and
network extraction. However, the manual interpretation of remote even shadows (Mirnalinee et al., 2011). Therefore, combining spectral
sensing imagery costs a lot of time and efforts. Automatic extraction of features with spatial features may provide more reliable results for road
road information from VHR satellite imagery would be able to improve extraction (Das et al., 2011; Rao et al., 2004; Gupta et al., 2007; Plaza
the efficiency of transportation database acquisition and updating. et al., 2009; Shi et al., 2014).
Over the past decades, various road extraction algorithms have been Fusion of multiscale information: Many empirical studies (Peng et al.,
proposed (Mena, 2003; Das et al., 2011; Chaudhuri et al., 2012). They 2008; Huang and Zhang, 2009; Wang et al., 2014; Miao et al., 2016)
either detect the skeletons of roads (Shi et al., 2014; Sujatha and have suggested that road and road centerlines extraction algorithms can
Selvathi, 2015) or extract all the road pixels out. These methods can be be improved by fusing multi-scale spatial features. Analyzing the rea-
divided into feature-based approaches and object-based approaches. sons for the improvement, we believe that most traditional road ex-
Early feature-based methods only used spectral features. Spatial fea- traction algorithms based on single scale make judgment only on the
tures, such as line (Quackenbush, 2004), edge (Unsalan and Sirmacek, basis of the pixels in the neighborhood or within the coverage of a
2012), and ridge (Nevatia and Babu, 1980; Treash and Amaratunga, certain size operator (Chaudhuri et al., 2012; Mirnalinee et al., 2011).
2000) were taken into consideration later. Moreover, prior knowledge Therefore, for a fixed-scale input image, the amount of information
such as direction, magnitude and geometric features (Gamba et al., receives by the feature extraction operator and the classifier is fixed. As
2006; Poullis and You, 2010; Liu et al., 2017) are also used as con- a result, too little information leads to improper classification, and too
straints for road extraction. Object-based methods consider the spectral much redundant information may interfere the judgment. In conclu-
and spatial features of roads, but they extract road segments from sion, multiscale information is critical for road extraction, but the
images and then further refine them using custom rules (Yuan et al., method for multi-scale information fusion need to be designed properly.
⁎
Corresponding author.
E-mail address: lihaifeng@csu.edu.cn (H. Li).
https://doi.org/10.1016/j.isprsjprs.2019.10.001
Received 5 June 2019; Received in revised form 1 October 2019; Accepted 3 October 2019
0924-2716/ © 2019 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
C. Tao, et al. ISPRS Journal of Photogrammetry and Remote Sensing 158 (2019) 155–166
Contextual information: In object detection and semantic segmenta-

tion, modeling of context can reduce error and improve accuracy (Hung
et al., 2017; Liu et al., 2018; Tao et al., 2019). For road detection, there
are three levels of contextual information: the global-level, local-level
and object-level. The global-level context is the main road direction in
the study area (Gamba et al., 2006). The local-level context is calcu-
lated from the pixels in a small range (Poullis and You, 2010). The
object-level context, which considers the relationship between the road
segments, or the relationship between the road and the surrounding
objects (Grote et al., 2012), has got a lot of attention. On the basis of the
observation of the spatial distribution patterns of road segments, Miao
and Shi (2016) proposed a series of reasonable spatial regularization
strategies to select and connect candidate road segments (Miao et al.,
2016). To connect broken road segments, Zhang and Zhang (2018) used
the relationship between road segments and interfered objects to con-
nect discrete road segments and obtained a more complete road net-
work (Zhang et al., 2018).
The idea of fusing multiscale information and combining spectral
and spatial features can be realized by the convolution neural networks
(CNN)-based methods. CNN has the structural characteristics of local
receptive fields, weight sharing, and subsampling (Bishop, 2006).
Therefore, its convolutional layer can collaboratively extract spectral
and spatial features, and adaptively fuse the multiscale information by
integrating low-level features into high-level features layer by layer.
With the powerful feature extraction ability, CNN has been used in
Fig. 1. Road extraction results of (a) the Deep Residual U-Net and (b) the
many road extraction researches (Das et al., 2011; Mnih and Hinton,
DeepLabv3+. Neither of both semantic segmentation networks can completely
2010; Zhou et al., 2018). For example, Zhang and Liu (2017) combined extract the road coverd by trees.
U-Net (Ronneberger et al., 2015) with residual units (He et al., 2016)
and proposed the Deep Residual U-Net (Zhang et al., 2017) that is easy
to train and has good road extraction performance. Cheng and Wang details of the proposed SII-Net are presented in Section 3. Detailed
(2017) proposed a cascaded convolutional neural network (CasNet), comparisons between the proposed SII-Net and five state-of-the-art road
which can complete road detection and centerline extraction simulta- extraction methods are provided in Section 4. Finally, discussions and
neously (Cheng et al., 2017). Buslaev and Seferbekov (2018) proposed a conclusions are presented in Section 5 and 6, respectively.
modified U-Net and used a lot of tricks during training and predicting,
and finally achieved good results in the DEEPGLOBE-CVPR 2018 road 2. The shortcomings of traditional semantic segmentation
extraction sub-challenge (Buslaev et al., 2018). Recently, Bastani and networks for road detection
He (2018) proposed RoadTracer, which uses an iterative CNN-guided
search strategy to construct the road map from aerial images (Bastani The fully convolution networks (FCNs) were developed in 2015, on
et al., 2018). the basis of which the semantic segmentation network was established
These end-to-end deep learning methods performed well in road (Long et al., 2015). Following this, a lot of work has been done to
extraction, but they had difficulties in depicting the roads covered by improve the semantic segmentation ability using the contextual in-
trees or non-road objects. As Fig. 1 shows, the extracted roads have formation.
severe fractures caused by the occlusion of trees. The main reason is Some studies aggregated multi-scale and multi-level features (MSML
that visual features are not reliable when there are occlusions. In ad- features) to exploit the contextual information mainly by two struc-
dition, the context information modeling mechanism in current deep tures: the encoder-decoder structure with skip connections and the
learning based road extraction methods is not very suitable for the road spatial pyramid pooling (SPP). U-Net and SegNet are the typical net-
extraction. Therefore, even if those methods have strong visual features works that use the encoder-decoder structure to improve the details of
extraction ability, it is hard to identify roads under occlusions. the prediction results by fusing the MSML features (Ronneberger et al.,
To solve this problem, this study focused on modeling the road- 2015; Badrinarayanan et al., 2017). PSPNet employed the SPP structure
specific contextual information, so that the network not only relies on to collect MSML feature as the contextual information to significantly
the appearance features but also combines suitable contextual in- reduce local errors in large object segmentations (Zhao et al., 2017).
formation for reasoning. To this end, we proposed a well-designed Another major way to exploit contextual information is to enlarge
spatial information inference structure (SIIS) to capture and transmit the receptive field using dilated convolutions (or atrous convolutions)
the road-specific contextual information. Different from traditional (Chen et al., 2014; Yu and Koltun, 2015). The dilated convolution,
CNNs, in which information is transmitted layer by layer, SIIS splits the which can effectively aggregate longer-range contextual information
feature map into chunks along row or column and applies the in- without increasing the computation load, is to insert “holes” (zeros)
formation processing units sequentially. In this way, information can be between elements in an ordinary convolution. Wang and Chen (2017)
roughly propagated along roads. Moreover, we designed an Recurrent employed the hybrid dilated convolution (HDC) to further enlarge the
Neural Network (RNN) based information processing unit, which is receptive fields of the network (Wang et al., 2018). Then, Deeplabv2
used to further alleviate noise and fine-tune the transmission of the proposed atrous spatial pyramid pooling (ASPP) to aggregate con-
contextual information. In general, the network with SIIS (SII-Net) textual information (Chen et al., 2017). Consisting of parallel dilated
achieves performance improvement by capturing and utilizing the road- convolutions with different dilated rates, ASPP can capture the large-
specific contextual information. range multi-scale contextual information and has been extended in
The remainder of this paper is structured as follows: Section 2 DeepLabv3 and DeepLabv3+ (Chen et al., 2017, 2018).
analyzes the shortcomings of traditional semantic segmentation net- With these improvements, traditional semantic segmentation net-
works in road detection and provides corresponding solutions. The works perform well on natural image semantic segmentation dataset.
156
Fig. 2. Principles of (a) ASPP and (b) SIIS for contextual information modeling.
However, they failed to extract satisfactory road information from re- thickness of each chunk is w = H / k . Then, the obtained sequence of
mote sensing images, especially when roads are occluded by other chunks S1 = {C11, C12, …, C1k} is sent into CRNN1 one by one, where
objects, such as trees, shadows, and buildings. The main reason is that CRNN1 denotes the first information processing unit in SIIS and will be
blindly fusing MSML features and enlarging the receptive field may introduced in Section 3.2. Specifically, the first chunk C11 is optimized
introduce invalid context information and lead to errors in the ex- by CRNN1 to generate a new chunk C21 with an equal size. When CRNN1
tracted road information (Fig. 2(a)). Furthermore, the effective road- optimizes the second chunk C12 , the last new chunk C21 will be also
specific contextual information should be the local road topology taken as the input to provide the contextual information. This process
(Fig. 2(b)). would continue until the last chunk C1k is updated, during which the
All in all, modeling effective road-specific contextual information is context information is continuously transmitted downward.
the key to solve the problem of the extracted road with fractures, In part II, the new chunks C21, C22, …, C2k form a sequence
especially in the case of occlusion. To this end, we proposed the spatial S2 = {C2k , …, C22, C21} from bottom to top, which is then sent into CRNN2
information inference structure (SIIS) to capture the local topology of for optimization in the way as part I does and to produce k new chunks.
roads (Fig. 2(b)). By integrating SIIS, most classic segmentation net- After that, these new chunks will be concatenated in the H dimension to
works can utilize the road-specific contextual information to overcome form a complete tensor with a size of C × H × W . To increase the di-
obstacles and improve the completeness of the extracted roads. rections for information propagation, the new tensor is re-split along
the W dimension and is processed similarly in part III and part IV. As
3. Method shown in Fig. 3, the four parts of SIIS correspond to the four main di-
rections of context information propagation, which are downward,
In this study, we designed a spatial information inference structure upward, rightward and leftward.
(SIIS) to better model the road-specific contextual information. SIIS can
explicitly transmit information along roads by an RNN based informa- 3.2. The information processing unit
tion processing unit. In the following, we first described the basic
structure of the proposed SIIS, and then detailed the involved spatial The key step of road context information modeling is to transmit
information inference mechanism. Finally, we showed how to in- useful context information, such as the local topology information of
corporate SIIS into traditional semantic framework to perform end-to- roads. Nevertheless, the basic framework of SIIS only ensures that each
end training. pixel indiscriminately receives the information from distant pixels in
four directions. Thus, it is necessary to design a refined message passing
3.1. The spatial information inference structure mechanism that can remember useful messages and forget unrelated
ones. Since this mechanism behaves like a memory machine, we
SIIS is agnostic to the base semantic segmentation framework, and adopted RNN as the information processing unit, which has demon-
its overview structure is depicted in Fig. 3. As shown in part I of Fig. 3, strated to be an effective tool to model the long-term memory in a se-
the input of SIIS is a tensor (feature map) sized C × H × W outputted quence.
from the backbone network’s encoder (i.e. the feature extractor), where However, traditional RNN units can only input one dimensional
C , H , and W denote the number of channels, rows, and columns, re- data, resulting in the loss of spatial information that is useful for re-
spectively. The tensor is firstly split into k chunks along H, and the ducing noise and fine-tuning the directions of road context information
157
Fig. 3. Basic structure of the proposed spatial information inference structure (SIIS).
introduce a trainable coefficient (initialized to 1), which determines

the impact of the previous moment on the current moment, to avoid
over-connection to some extent.
In SIIS, four independent Conv3d-RNN units are set in four parts to
refine the road contextual information transmitted in multi-direction.
After the entire model is trained and converged, SIIS can adaptively
transmit useful contextual information inside the feature map along the
road direction.
3.3. Contextual information transmitting and filtering
The road contextual information transmitting and filtering can be

Fig. 4. The Conv3d-RNN unit. represents element-wise addition. understood as the continuous change of the information received by
each pixel of the feature map. By changing the source of the informa-
propagation. Inspired by the work of (Xingjian et al., 2015), we de- tion received by each pixel, SIIS realizes the modeling of road-specific
veloped the 3-D convolutional RNN (Conv3d-RNN) by replacing all the contextual information. Finally, the feature map is optimized as most
matrix operations in the traditional RNN unit with the 3D convolution, pixels are updated by the received more appropriate contextual in-
because 3-D convolution is more conducive to transmit road contextual formation. In the following, we visualized some intuitive and re-
information along the road segments within chunks. Finally, the in- presentative channels of feature maps to show how the SIIS units ex-
formation processing unit, Conv3d-RNN, as shown in Fig. 4, can filter tract and transmit the context information. Specifically, we selected an
noise and fine-tune the direction of information propagation. image with severe occlusions as the input. Then, we plot some strong
In the Conv3d-RNN unit, all matrix multiplications are replaced by activations of the feature maps and projected them down to pixel space
3-D convolution operations. Correspondingly, 3-D tensors are taken as using minmax normalization.
the input and output. In addition, its structure and activation functions As shown in Fig. 5, in the feature map obtained by the Conv3 layer,
have also been modified for road contextual information modeling. the activation value of the road segment covered by trees is very low,
Take the first Conv3d-RNN unit in SIIS as an example (CRNN1 in part I which is almost consistent with that of the surrounding farmland and
of Fig. 3), a sequence of chunks S1 = {C11, C12, …, C1k} is the input of bare land. After the processing by the proposed SIIS units, the activa-
CRNN1 and the new chunks C21, C22, …, C2k are the output. The specific tion value of the road segment occluded by the tree increases and be-
details are shown in Fig. 4 and Eqs. (1) and (2). comes the same as that of normal roads. This indicates that the pro-
posed SIIS can gradually capture and transmit the road-specific
C21 = g (Whz g (Wxh C11) + g (Wzh C11)), (1) contextual information for road extraction when visual features are not
reliable.
ht = g (Wxh C1i + 1) + g (Wzh C2i ) i = 1, 2, …, k 1 Fig. 6 is another example for the filtering and transmitting process
,
C2i + 1 = g (Whz ht ) t = 1, 2, …, k 1 of the contextual information. In the image, a road pixel is blocked by
(2)
trees. In the early stages of model training (T0 ), the pixel receives all
where ‘’ and denote the 3-D convolution operation and the element- basic information from its neighboring pixels. Since all the neighboring
wise multiplication, respectively. g is the ReLU activate function. Note pixels belong to trees, this pixel is not recognized as a road. According
that, Wxh Wzh , and Whz here are all trainable 3-D convolution kernels and to the contextual information transmission mechanism of SIIS, the pixel
ht is the intermediate result. can gradually receive information from distant pixels (T1). However,
In this process, the previous result C2i always joins the current there is still a lot of redundant and useless information. Since such in-
processing, and the contextual information will be transmitted ac- formation is not conducive to obtain correct prediction results, the loss
cordingly. And the previous results C2i is filtered by 3-D convolution to is still large. Afterward, errors are propagated through backpropagation
remove useless information before fusing with the current input C1i + 1. algorithm, causing the network to adjust the learnable weights of
Moreover, since the road may end at any point on the image, we con- Conv3d-RNN units. As the weights are iteratively updated, these units
sider the road context information is locally reliable. Therefore, we can gradually filter useless information (T 1 T 2 ) and finally only the
158
Fig. 5. Visualization of feature maps in a fully trained SII-Net.
Fig. 6. The process of the Conv3d-RNN units gradually controlling the contextual information received by an element in the feature map, where T 0, T 1, T 2, T 3
represent different stages of model training.
contextual information that can correct prediction of this pixel is re- category difference of pixels corresponding to Pi and Gti , and J em-
tained (T2 ). As useful contextual information accumulates, it has larger phasizes the deviation between the predicted road and the real road,
chances to be classified as a road (T3). the weight parameter is used to adjust the contribution ratio of M and
J to total loss.
3.4. Network integration with SIIS and end-to-end training
4. Experiments and analysis
In this study, we selected the high-performance typical semantic
segmentation network, DeepLabv3+, as the backbone to build the SII
4.1. Overall details of the experiments
network (SII-Net). As shown in Fig. 7, we insert SIIS with the bottleneck
layer named conv3 between the encoder and decoder to make it fully
4.1.1. Dataset
functional. Specifically, the feature map extracted by the encoder and
The validation experiments used the DEEPGLOBE-CVPR 2018 road
the ASPP structure of backbone network mainly contain appearance
extraction sub-challenge dataset1 (referred to as the CVPR dataset
features, so it should be further optimized by SIIS with contextual in-
hereafter), the Massachusetts road dataset2 and the RoadTracer dataset
formation transmission and filtering mechanism. As a result, the opti-
(Bastani et al., 2018). The CVPR dataset contains 6226 satellite images
mized feature map containing the road-specifically contextual in-
with a paired mask for road labels (Demir et al., 2018). These images,
formation can finally improve the road extraction results of the
collected by the DigitalGlobe’s satellite, have a size of 1024 × 1024
decoder.
pixels and a resolution of 50 cm/pixel. In the experiment, they were
To train this end-to-end SII-Net for road extraction, we used a
divided into the training set, validation set, and the test set. The Mas-
comprehensive index as the loss function to measure the difference
sachusetts road dataset (Mnih and Hinton, 2010) consists of 1171
between the predictions Pi (i = 0, 1, 2, …, n) and the ground truth
images, including 1108 images for training, 14 images for validation,
Gti (i = 0, 1, 2, …, n) , where n is the number of training samples. The
and 49 images for testing. Each image has a size of 1500 × 1500 pixels
loss function is defined as:
and a resolution of 120 cm/pixel. The RoadTracer dataset contains 300
Loss(W ) = M + (1 )(1 J ), (3)
where M is the mean squared error (MSE) between Pi and Gti , and J is 1
https://competitions.codalab.org/competitions/18467#participate.
the Jaccard index (Intersection Over Union). Since M measures basic 2
https://www.cs.toronto.edu/vmnih/data/.
159
Fig. 7. Flowchart of the SII-Net. The bottleneck layer is used to reduce the dimension of feature map input SIIS to avoid large computational load.
Fig. 8. The process of the category ratio cropping method being applied to a typical sample of the CVPR dataset.
images with size of 4096 × 4096 pixels and a resolution of 60 cm/pixel, the same stride s and a w × w cropping window to get a set of sub
180 images for training and 120 images for testing. images and the corresponding sub labels {Isi , Lsi } , where s = 128 and
n n n
w = 512 . Second, Lsi was used to calculate the ratio Ri = n1 , n2 , …, nc ,
4.1.2. Data pre-processing s s
with nc denoting the number of pixels belonging to category c in Lsi and
s
For the CVPR dataset and the Massachusetts road dataset, the mis- ns denoting the total number of pixels of Lsi . Then, the smallest value in
identification of road pixels as background pixels is the main source of Ri , min(Ri ) , was used to compare against the threshold . For the pairs
loss value, since the background (non-road) pixels are much more than of Isi and Lsi , only those with the min(Ri ) greater than were kept. We
the road pixels in the satellite image (Figs. 9 and 10). Therefore, the set the ratio threshold as 0.01, which is a user-defined constant. The
optimization may reduce loss, but the optimized semantic segmentation influence of this parameter on the final road extraction results is ana-
networks have large chances to misidentify uncertain pixels as back- lysed in Section 4.6.
ground rather than road. To solve this problem, we adopted a simple After the CRC data pre-processing, the imbalance between the
and effective data pre-processing strategy, the category ratio cropping number of road and background samples is effectively alleviated, so the
(CRC) method. performance of the trained model can be improved. Finally, we got
Take an image I in the training set and its corresponding ground 88,689 labelled images with the size of 512 × 512 from the CVPR
truth label L as an example (Fig. 8). First, {I ;L} were slide cropped with
160
dataset and 48,290 from the Massachusetts roads dataset for training. 1
n
The roads in the RoadTracer dataset are very dense, so we did not Mean NRB = NRBi ,
n (5)
perform the CRC method. We directly use slide cropping with the stride
i=1
of 256, and finally got 8820 labelled images for training, each 512 × 512 where n donates the number of images in the test set.
pixel in size.
4.2. Experiment using the CVPR dataset
4.1.3. Train details
The training sets obtained by data pre-processing was preprocessed
In this experiment, we took the road extraction task as a semantic
by a series of common data enhancement methods. We cropped the
segmentation problem and focused on extracting complete road surface.
image and mask to random size and aspect ratio, flip and rotate (by 90
We compared the proposed SII-Net with four semantic segmentation
degrees) them randomly. Then, the random brightness (0.5–1.5), con-
based road extraction methods, including U-Net (Ronneberger et al.,
trast (0.7–1.3), saturation (0.8–1.2) and hue (−0.05 to 0.05) for
2015), Deep Residual U-Net (Zhang et al., 2017), HF-FCN (Zuo et al.,
spectral augmentation were also used to increase the data diversity.
2016) and the original DeepLabv3+ (Chen et al., 2018). As shown in
Finally, limited by GPU video memory, all images were scaled to
rows 1 and 2 in Fig. 9, the proposed SII-Net successfully extracts the
256 × 256 for feeding networks. Afterwards, all models were trained
roads under occlusion. This indicates that SII-Net does not solely de-
with the same parameter settings and environment. Specifically, we
pend on the visual features of roads but has reasoning ability by
trained the models using the Adam optimizer in the Ubuntu 16.04
modeling road-specific contextual information. In the third row, SII-Net
platform with one GTX1080Ti (memory 11 GB) that allows a batch size
does the best job of depicting the roads in the red circle. Moreover, in
of 16 images. The learning rate was initially set to be 1e−5 and re-
complex situations (the last row of Fig. 9), the SII-Net extraction results
duced by a factor of 0.02 per epoch. Since the CRC method divided the
have less noise than the results of other methods.
original large image into many small images, the model iterated much
A quantitative assessment was also done to compare the effective-
more times in each epoch on the CVPR dataset and the Massachusetts
ness of these methods. As shown in Table 1, the proposed SII-Net
roads dataset. Moreover, we used the ImageNet pre-trained encoder in
achieved the largest F1 score of 0.9279 with Mean IoU of 0.8344, larger
our network, which can further accelerate the convergence. As a result,
than the original DeepLabv3+ (F1 score of 0.9158 and Mean IoU of
the proposed network converged in only 15 epochs on these two da-
0.8247). The performance improvement was consistently in all in-
tasets. The sampling size of the RoadTracer dataset is much smaller
dicators compared to other methods, including the U-Net, ResUnet and
than the other two datasets, so the convergence cost up to 50 epochs by
HF-FCN. For example, when compared with the U-Net, the SII-Net
the proposed network.
achieved an increase in F1 score and Mean IoU of 9.69% and 8.46%,
Besides, the weight parameter in the loss function (Eq. (3)) was set
respectively. Since SII-Net uses the road-specific contextual informa-
to be 0.7 for the CVPR dataset and 0.1 for the Massachusetts roads
tion, the number of fault fractures of extracted roads are reduced
dataset and the RoadTracer dataset. On the one hand, is determined
considerably, which is nearly 2 times less than DeepLabv3+ (from 6.03
by precision evaluation on the test set. On the other hand, it should
to 3.49) and 3 times less than ResUnet (from 9.77 to 3.49).
conform to the characteristics of the dataset. Unlike CVPR datasets with
complete road surface masks, roads in the Massachusetts datasets and
the RoadTracer dataset are marked with centerlines with equally width. 4.3. Experiment using the Massachusetts roads dataset
As a result, the road pixels to background pixels ratio in the masks of
these two datasets is much smaller than that in the CVPR dataset. In this In the Massachusetts roads dataset experiment, we compared the
case, the misidentification of road pixels as background pixels only proposed method with the four semantic segmentation based road ex-
leads to a small M defined in Eq. (3), so the proportion of M should be traction methods as mentioned above. As shown in Table 2, the pro-
reduced while the proportion of J should be increased to promote the posed SII-Net gets a slight increase of 1.08–2.19% for F1 score and
convergence of the model. 1.13–3.75% for Mean IoU, respectively. Different from these indicators
that are calculated by area, Mean NRB only measures the topological
4.1.4. Evaluation metrics integrity of the extracted roads and thus reflects road connectivity more
To assess the performance of the road extraction methods, we directly. For example, DeepLabv3+demonstrates some advantages over
adopted three measures as follows: U-Net in area-based metrics, but it fails in the topology-based metric.
The proposed SII-Net method achieves better performance than other
• F1 score is an evaluation metric for the harmonic mean of precision methods in both area-based metrics and topology-based metrics. By
contrast, the dramatic increase in Mean NRB still demonstrated ad-
(P) and recall (R), and it can be calculated by Eq. (4). We used the
relaxed P and R and set the slack parameter to be 3 as previous vantages of SII-Net in solving the problem of the extracted road with
studies did (Mnih and Hinton, 2010; Zhang et al., 2017; Saito et al., fractures (e.g. 5.32 lower than DeepLabv3+and 9.32 lower than HF-
2016). FCN). The results in Fig. 10 also indicate that the road network ex-
tracted by the proposed approach has more details and fewer false
P×R
F1 = 2 × fractures.
P+R (4)
4.4. Experiment using the RoadTracer dataset

• Mean IoU is the average ratio of the correctly classified pixels in a
class to the union of predicted pixels of this class and ground truth. In the RoadTracer dataset experiment, we also compared the pro-
It is commonly used to evaluate the accuracy of the semantic seg- posed method with the four semantic segmentation based road ex-
mentation. traction methods. Since the ground truths for RoadTracer dataset are
• The number of road breaks (NRB) represents the number of false road centerlines, we further compared SII-Net with the RoadTracer
fractures in the road extraction result compared with its corre- (Bastani et al., 2018) that focuses on extracting road centerlines. To
sponding ground truth label. For the whole test set, the mean of NRB ensure normal training and convergence of SII-Net and other segmen-
(Mean NRB) was adopted to measure the overall effectiveness. tation base methods, we reduced the non-overlap area between road the
161
Fig. 9. Road extraction results using the CVPR dataset. (a) Ground truth. (b) U-Net. (c) ResUnet. (d) HF-FCN. (e) DeepLabv3+. (f) Proposed SII-Net.
Table 1 Table 3
Road extraction results of five approaches using the CVPR dataset. Road extraction results obtained by the five segmentation base approaches from
the RoadTracer dataset.
Method F1 score Mean IoU Mean NRB
U-Net 0.8314 0.7498 7.27
ResUnet 0.8310 0.7537 9.77 U-Net 0.6432 0.5965 748.13
HF-FCN 0.8897 0.7997 6.84 ResUnet 0.7327 0.6612 649.07
DeepLabv3+ 0.9158 0.8247 6.03 HF-FCN 0.7490 0.6676 543.33
SII-Net 0.9279 0.8344 3.49 DeepLabv3+ 0.7752 0.6884 499.00
SII-Net 0.7923 0.7016 410.60
Table 2
Road extraction results of five approaches using the Massachusetts roads da- four segmentation based methods into road centerlines with single-
taset. pixel widths, which is consistent with the results of RoadTracer. Then,
we expanded the single-pixel width ground truth and prediction to 8
pixels wide, because F1 score and IoU cannot properly evaluate road
U-Net 0.8999 0.7441 11.59 topology on single-pixel width centerlines.
ResUnet 0.8928 0.7490 14.06 As shown in Table 3, SII-Net is superior to semantic segmentation
HF-FCN 0.8888 0.7392 17.73
based road extraction methods in accuracy, with the F1 score
DeepLabv3+ 0.8994 0.7654 13.73
SII-Net 0.9107 0.7767 8.41 1.71–14.91% higher and the Mean IoU 1.32–10.51% higher than the
other methods. In addition, the Mean NRB of SII-Net is 88.40, 132.73,
238.47 and 337.53 smaller than DeepLabv3+, HF-FCN, ResUnet and U-
ground truth by expanding the single-pixel width centerlines to 11 Net, indicating that it has obvious advantage in solving the problem of
pixels wide as training labels. For the comparison, in the test stage, we the extracted road with fractures. The results in Fig. 11 also demon-
first converted the ground truth and the predicted result of SII-Net and strate that the road network extracted by the proposed approach has
Fig. 10. Road extraction results using the Massachusetts roads dataset. (a) Ground truth. (b) U-Net. (c) ResUnet. (d) HF-FCN. (e) DeepLabv3+. (f) Proposed SII-Net.
162
Fig. 11. Road extraction results using the RoadTracer dataset. (a) Ground truth. (b) U-Net. (c) ResUnet. (d) HF-FCN. (e) DeepLabv3+. (f) Proposed SII-Net.
comparison, we excluded the border area with a width of 200 pixels

around the prediction of RoadTracer and SII-Net (the same treatment
was done for ground truth), since the detection range of RoadTracer
cannot cover the entire area.3 The comparison between the results of
the proposed method and the RoadTracer method as well as four se-
mantic segmentation based road extraction methods in Table 4 indicate
that the proposed method is better in terms of the three evaluation
indicators.
4.5. Performance comparison between the Conv3d-RNN units and Conv1D
We also compared the performances between the proposed Conv3d-

RNN units with traditional Conv1D using three datasets, because the
input of the Conv3d-RNN is a sequence of chunks, which is similar to
the input of Conv1D. However, Conv1D can only convolve the feature
map locally. By contrast, the Conv3d-RNN unit in SIIS can accept other
Fig. 12. Road extraction results using the Massachusetts roads dataset. (a) part of feature map as input to transmit context information. As can be
Ground truth. (b) RoadTracer. (c) Proposed SII-Net. seen in Table 5, the Conv3d-RNN units have higher F1 score and Mean
IoU than Conv1D for three datasets.
Table 4
Comparison between RoadTracer and SII-Net using the RoadTracer dataset.
4.6. Analysis of threshold in CRC
RoadTracer 0.6678 0.6317 487.33 In this section, we analysed the influence of threshold , a key
SII-Net 0.7931 0.7013 378.80 parameter of data pre-processing method CRC, on the final road ex-
traction. As shown in Table 6, the number of sub images and labels
obtained by CRC varies greatly with the threshold , which also leads to
Table 5 different final performance of the trained model. Take the results of the
Comparson of road extraction results between Conv3d-RNN and Conv1D on Massachusetts dataset as an example, when = 0.01, a lot of sub images
three datasets. and labels that barely contain roads are filtered out, resulting in a much
4]90ptDataset SII-Net (Conv1D) SII-Net (Conv3d-RNN) smaller sample size than that of = 0 . However, this alleviates the
imbalance between the road and the background, which is conducive to
F1 score Mean IoU F1 score Mean IoU
the better convergence of the model and thus improves the performance
CVPR dataset 0.9168 0.8210 0.9279 0.8344 of the proposed method. Besides, the value of cannot be large, because
Massachusetts dataset 0.9057 0.7678 0.9107 0.7767 it may cause too much data to be eliminated and reduce the accuracy of
RoadTracer dataset 0.7600 0.6775 0.7923 0.7016 the proposed method.
more details and fewer false fractures. Note that, the size of the com- 5. Discussion
plete test images composed of 4 images is 8192 × 8192 pixels, so the
calculated Mean NRB is much larger than the results of the first two The above comparison experiments have demonstrated the effec-
datasets. tiveness of the proposed SII-Net for road extraction, especially for sol-
For the RoadTracer method, we directly used the well-trained model ving occlusions and preserve the continuity of the extracted road. In
(available online) and the test time strategies recommended by the this section, we further discuss the extensibility and robustness of SIIS.
authors of RoadTracer, including steps and setting of relevant thresh-
olds, to achieve the best evaluation result. As shown in Fig. 12, the road
graphs generated by RoadTracer have good connectivity, but not good
3
completeness. The proposed approach, however, obtains more accurate In this case, the width of the blank area is close to 200 pixels to avoid being
beyond the image border. For details, the code and paper of RoadTracer are
road details while maintaining the road connectivity. For quantitative
available at https://roadmaps.csail.mit.edu/roadtracer.
163
Table 6
Different threshold in the CRC data pre-processing method.
490pt ] CVPR dataset Massachusetts dataset
Sample size F1 score Mean IoU Sample size F1 score Mean IoU
0.00 115,225 0.9248 0.8321 56,897 0.8935 0.7708

0.01 88,689 0.9279 0.8344 48,290 0.9107 0.7767
0.10 18,385 0.9062 0.8149 7,717 0.8422 0.7447
Table 7
Comparison of semantic segmentation networks with and without SIIS.
190ptMethod] CVPR dataset Massachusetts dataset RoadTracer dataset
F1 score Mean IoU F1 score Mean IoU F1 score Mean IoU
U-Net 0.8314 0.7498 0.8867 0.7965 0.6432 0.5965

U-Net + SIIS 0.8803 0.7930 0.9103 0.8208 0.6745 0.6156
ResUnet 0.8310 0.7537 0.8887 0.8016 0.7327 0.6612
ResUnet + SIIS 0.8985 0.8091 0.9086 0.8207 0.7530 0.6771
Fig. 13. Road extraction results with artificial occlusions (yellow circles). (a) Input image. (b) U-Net. (c) U-Net with SIIS. (d) ResUnet. (e) ResUnet with SIIS. (f)
DeepLabv3+. (g) DeepLabv3 + with SIIS. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 14. Road extraction results obtained by SII-Net in diverse road scences, including (a) road segments with high curvatuse, (b) a complex scene with severe
occlusion near road intersection and (c) rural narrow road without clear boundary.
5.1. The extensibility of SIIS DeepLabv3+ with SIIS, similar improvements were achieved for three
datasets. These results demonstrate that integrating the SIIS with tra-
Since the input and output of SIIS are feature maps with flexible ditional semantic segmentation models can improve the accuracy of
sizes, they can be easily incorporated into most typical segmentation road extraction.
networks to improve the performance for road extraction. To verify the
extensibility of SIIS structure, we integrated SIIS with two classic se-
5.2. The robustness of SIIS
mantic segmentation models, ResUnet and U-Net, and compared them
with the original ones. As shown in Table 7, the performance of both
In this section, we discussed the robustness of SIIS in the following
models is significant improved by combining with SIIS. For U-Net, F1
two aspects:
score and Mean IoU increased by nearly 5% on the CVPR dataset, 0.9%
First, robustness analysis in unseen occlusion scenarios. We manually
and 1.5% on the Massachusetts dataset, and 3.1% and 1.9% on the
attached some occlusions to the test samples. As Fig. 13 shows, the first
RoadTracer dataset respectively. For ResUnet with SIIS and
row of images are attached a copy of a large truck on the road and the
164
second row of images are attached rectangular areas of woods. The Das, S., Mirnalinee, T., Varghese, K., 2011. Use of salient features for the design of a
features of the attached occlusions are unnatural and distinguishable multistage framework to extract roads from high-resolution multispectral satellite
images. IEEE Trans. Geosci. Remote Sens. 49 (10), 3906–3931.
from the original training set. In this case, traditional semantic seg- Chaudhuri, D., Kushwaha, N., Samal, A., 2012. Semi-automated road detection from high
mentation networks, which make decision according to visual features resolution satellite images by directional morphological enhancement and segmen-
or just memorize similar occlusion scenes from the training samples, tation techniques. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sensing 5 (5),
1538–1544.
can’t handle these extremes situation (columns b, d and f), but the Shi, W., Miao, Z., Debayle, J., 2014. An integrated method for urban main-road centerline
proposed SIIS can help these models to extract the road information extraction from optical remotely sensed imagery. IEEE Trans. Geosci. Remote Sens.
completely (columns c, e and g). This demonstrates the robustness and 52 (6), 3359–3372.
Sujatha, C., Selvathi, D., 2015. Connected component-based technique for automatic
reasoning ability of the proposed SII-Net. extraction of road centerline in high resolution satellite images. Eurasip J. Image
Second, robustness analysis in case of diverse road scenes. Due to the Video Process. 2015 (1), 8.
complexity and diversity of road, some extracted roads are prone to Quackenbush, L.J., 2004. A review of techniques for extracting linear features from
imagery. Photogramm. Eng. Remote Sens. 70 (12), 1383–1392.
fractures like road segments with high curvature, road intersections and
Unsalan, C., Sirmacek, B., 2012. Road network detection using probabilistic and graph
rural narrow road. Fig. 14 shows the road extraction results of the theoretical methods. IEEE Trans. Geosci. Remote Sens. 50 (11), 4441–4453.
proposed method in these challenging road scenes. As can be seen, the Nevatia, R., Babu, K.R., 1980. Linear feature extraction and description. Comput.
proposed method showed highly robustness for curved road segments Graphics Image Process. 13 (3), 257–269.
Treash, K., Amaratunga, K., 2000. Automatic road detection in grayscale aerial images. J.
(Fig. 14(a)). The road scene in Fig. 14(b) is also challenging, because Comput. Civil Eng. 14 (1), 60–69.
severe occlusion is close to the road intersection. However, the pro- Gamba, P., Dell’Acqua, F., Lisini, G., 2006. Improving urban road extraction in high-
posed method can also handle it. Our method failed to detect the path resolution images exploiting directional filtering, perceptual grouping, and simple
topological concepts. IEEE Geosci. Remote Sens. Lett. 3 (3), 387–391.
in the third challenging scene (Fig. 14(c)), which is the extremely Poullis, C., You, S., 2010. Delineation and geometric modeling of road networks. ISPRS J.
narrow country path near the T-junction without obvious boundaries. Photogramm. Remote Sensing 65 (2), 165–181.
The main reason for this failure can be considered that this illegible part Liu, J., Qin, Q., Li, J., Li, Y., 2017. Rural road extraction from high-resolution remote
sensing images based on geometric feature inference. ISPRS Int. J. Geo-Inf. 6 (10),
is the end of the path, not in the middle, which causes SIIS to fail to 314.
gather enough evidence for reasoning. Yuan, J., Wang, D.L., Wu, B., Yan, L., Li, R., 2011. Legion-based automatic road extraction
from satellite imagery. IEEE Trans. Geosci. Remote Sens. 49 (11), 4528–4538.
Grinias, I., Panagiotakis, C., Tziritas, G., 2016. Mrf-based segmentation and unsupervised
6. Conclusions classification for building and road detection in peri-urban areas of high-resolution
satellite images. ISPRS J. Photogramm. Remote Sensing 122, 145–166.
In this study, we analysed the defects of traditional semantic seg- Maboudi, M., Amini, J., Malihi, S., Hahn, M., 2018. Integrating fuzzy object based image
analysis and ant colony optimization for road extraction from remotely sensed
mentation networks in road extraction and proposed a novel spatial
images. ISPRS J. Photogramm. Remote Sensing 138, 151–163.
information inference structure that is based on road-specific con- Mirnalinee, T., Das, S., Varghese, K., 2011. An integrated multistage framework for au-
textual information. Experiments on two datasets showed the ad- tomatic road extraction from high resolution satellite imagery. J. Indian Soc. Remote
vantages of the proposed method for road extraction, especially in the Sensing 39 (1), 1–25.
Rao, S.G., Puri, M., Das, S., 2004. Unsupervised segmentation of texture images using a
face of occlusion. We also demonstrated that the road extraction per- combination of gabor and wavelet features. In: Proceedings of the Indian Conference
formance of classic semantic segmentation networks can be sig- on Computer Vision, Graphics & Image Processing (ICVGIP), pp. 370–375.
nificantly improved by integrating SIIS. Moreover, using the road-spe- Gupta, L., Pathangay, V., Patra, A., Dyana, A., Das, S., 2007. Indoor versus outdoor scene
classification using probabilistic neural network. Eurasip J. Adv. Signal Process. 2007
cific contextual information, the proposed road extraction methods (1), 1–10.
have highly robustness in handling complex roads, like road segments Plaza, A., Benediktsson, J.A., Boardman, J.W., Brazile, J., Bruzzone, L., Camps-Valls, G.,
with high curvature, road under severe occlusions, road intersections. Chanussot, J., Fauvel, M., Gamba, P., Gualtieri, A., 2009. Recent advances in tech-
niques for hyperspectral image processing. Remote Sens. Environ. 113 (1),
Further work could be conducted to examine the possibility of applying S110–S122.
the proposed method to large VHR satellite datasets of other categories Shi, W., Miao, Z., Wang, Q., Zhang, H., 2014. Spectral–spatial classification and shape
such as rivers or other slender object. Another future work aims at features for urban road centerline extraction. IEEE Geosci. Remote Sens. Lett. 11 (4),
788–792.
extending the proposed model to the task of road detection using hy-
Peng, T., Jermyn, I.H., Prinet, V., Zerubia, J., 2008. Incorporating generic and specific
perspectral data, and further using hyperspectral inversion model (Tao prior knowledge in a multiscale phase field model for road extraction from vhr
et al., 2018, 2019) for road material classification. images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sensing 1 (2), 139–146.
Huang, X., Zhang, L., 2009. Road centreline extraction from high-resolution imagery
based on multiscale structural features and support vector machines. Int. J. Remote
Declaration of Competing Interest Sens. 30 (8), 1977–1987.
Wang, J., Qin, Q., Yang, X., Wang, J., Ye, X., Qin, X., 2014. Automated road extraction
The authors declared that there is no conflict of interest. from multi-resolution images using spectral information and texture. In: Proceedings
of the 2014 IEEE Geoscience and Remote Sensing Symposium (IGARSS). IEEE, pp.
533–536.
Acknowledgment Miao, Z.L., Shi, W.Z., Samat, A., Lisini, G., Gamba, P., 2016. Information fusion for urban
road extraction from vhr optical satellite images. IEEE J. Sel. Top. Appl. Earth Observ.
Remote Sensing 9 (5), 1817–1829.
This work was supported by National key research and development Hung, W.C., Tsai, Y.H., Shen, X., Lin, Z., Sunkavalli, K., Lu, X., Yang, M.H., 2017. Scene
projects (Grant No. 2018YFB0504500), National Natural Science parsing with global context embedding. In: Proceedings of the IEEE International
Foundation of China (Grant No. 41771458, 41301453), Young Elite Conference on Computer Vision, pp. 2631–2639.
Liu, Y., Wang, R., Shan, S., Chen, X., 2018. Structure inference net: object detection using
Scientists Sponsorship Program by Hunan Province of China under scene-level context and instance-level relationships. In: Proceedings of the IEEE
Grant 2018RS3012, and Hunan Science and Technology Department Conference on Computer Vision and Pattern Recognition, pp. 6985–6994.
Innovation Platform Open Fund Project under Grant 18K005. Tao, C., Mi, L., Li, Y., Qi, J., Xiao, Y., Zhang, J., 2019. Scene context-driven vehicle de-
tection in high-resolution aerial images. IEEE Trans. Geosci. Remote Sens. 57 (10),
7339–7351.
Appendix A. Supplementary material Grote, A., Heipke, C., Rottensteiner, F., 2012. Road network extraction in suburban areas.
Photogramm. Rec. 27 (137), 8–28.
Zhang, Z., Zhang, X., Sun, Y., Zhang, P., 2018. Road centerline extraction from very-high-
Supplementary data associated with this article can be found, in the
resolution aerial image and lidar data based on road connectivity. Remote Sensing 10
online version, at https://doi.org/10.1016/j.isprsjprs.2019.10.001. (8), 1284.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
References Mnih, V., Hinton, G.E., 2010. Learning to detect roads in high-resolution aerial images. In:
Proceedings of the European Conference on Computer Vision (ECCV), pp. 210–223.
Zhou, L., Zhang, C., Wu, M., 2018. D-linknet: Linknet with pretrained encoder and dilated
Mena, J.B., 2003. State of the art on automatic road extraction for gis update: a novel convolution for high resolution satellite imagery road extraction. In: Proceedings of
classification. Pattern Recogn. Lett. 24 (16), 3037–3058. the IEEE Conference on Computer Vision and Pattern Recognition Workshops
165
(CVPRW). IEEE, pp. 192–196. Yu, F., Koltun, V., 2015. Multi-scale context aggregation by dilated convolutions, arXiv
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biome- preprint arXiv:1511.07122.
dical image segmentation. In: Proceedings of the International Conference on Medical Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G., 2018. Understanding
Image Computing and Computer-Assisted Intervention (MICCAI), vol. 9351. convolution for semantic segmentation. In: Proceedings of the 2018 IEEE Winter
Springer, pp. 234–241. Conference on Applications of Computer Vision (WACV). IEEE, pp. 1451–1460.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2017. Deeplab:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Semantic image segmentation with deep convolutional nets, atrous convolution, and
(CVPR), pp. 770–778. fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848.
Zhang, Z., Liu, Q., Wang, Y., 2017. Road extraction by deep residual u-net. IEEE Geosci. Chen, L.C., Papandreou, G., Schroff, F., Adam, H., 2017 Rethinking atrous convolution for
Remote Sens. Lett. PP (99), 1–5. semantic image segmentation, arXiv preprint arXiv:1706.05587.
Cheng, G., Wang, Y., Xu, S., Wang, H., Xiang, S., Pan, C., 2017. Automatic road detection Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018. Encoder-decoder with
and centerline extraction via cascaded end-to-end convolutional neural network. atrous separable convolution for semantic image segmentation. In: Proceedings of the
IEEE Trans. Geosci. Remote Sens. 55 (6), 3322–3337. European Conference on Computer Vision (ECCV), pp. 801–818.
Buslaev, A., Seferbekov, S., Iglovikov, V., Shvets, A., 2018. Fully convolutional network Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c., 2015.
for automatic road extraction from satellite imagery. In: Proceedings of the IEEE Convolutional lstm network: A machine learning approach for precipitation now-
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. casting. In: Proceedings of the Advances in Neural Information Processing Systems
207–210. (NIPS), pp. 802–810.
Bastani, F., He, S., Abbar, S., Alizadeh, M., Balakrishnan, H., Chawla, S., Madden, S., Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D.,
DeWitt, D., 2018. Roadtracer: Automatic extraction of road networks from aerial Raskar, R., Deepglobe, 2018: A challenge to parse the earth through satellite images.
images. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern
Pattern Recognition (CVPR). IEEE, New York, pp. 4720–4728. Recognition Workshops (CVPRW), pp. 172–181.
Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic Saito, S., Yamashita, T., Aoki, Y., 2016. Multiple object extraction from aerial imagery
segmentation. In: Proceedings of the IEEE Conference on Computer Vision and with convolutional neural networks. Electronic Imaging 60 (1), 1–9.
Pattern Recognition (CVPR), pp. 3431–3440. Zuo, T., Feng, J., Chen, X., 2016. Hf-fcn: Hierarchically fused fully convolutional network
Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. Segnet: A deep convolutional encoder- for robust building extraction. In: Proceedings of the Asian Conference on Computer
decoder architecture for image segmentation, arXiv preprint arXiv:1511.00561 39 Vision (ACCV). Springer, pp. 291–302.
(12), 2481–2495. Tao, C., Wang, Y.J., Cui, W.B., Zou, B., Zou, Z.R., Tu, Y.L., 2019. A transferable spec-
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017. Pyramid scene parsing network. In: troscopic diagnosis model for predicting arsenic contamination in soil. Sci. Total
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Environ. 669, 964–972.
(CVPR), pp. 2881–2890. Tao, C., Wang, Y.J., Zou, B., Tu, Y.L., Jiang, X.L., 2018. Assessment and analysis of mi-
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2014. Semantic image grations of heavy metal lead and zinc in soil with hyperspectral inversion model.
segmentation with deep convolutional nets and fully connected crfs, arXiv preprint Spectrosc. Spectral Anal. 38 (6), 1850–1855.
arXiv:1412.7062.
166

Tao 2019

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Tao 2019

Uploaded by

Copyright:

Available Formats

ISPRS Journal of Photogrammetry and Remote Sensing 158 (2019) 155–166

Contents lists available at ScienceDirect

ISPRS Journal of Photogrammetry and Remote Sensing

Spatial information inference net: Road extraction using road-specific T

ARTICLE INFO ABSTRACT

Contextual information: In object detection and semantic segmenta-

introduce a trainable coefficient (initialized to 1), which determines

3.3. Contextual information transmitting and filtering

The road contextual information transmitting and filtering can be

Fig. 5. Visualization of feature maps in a fully trained SII-Net.

4.4. Experiment using the RoadTracer dataset

comparison, we excluded the border area with a width of 200 pixels

4.5. Performance comparison between the Conv3d-RNN units and Conv1D

We also compared the performances between the proposed Conv3d-

490pt ] CVPR dataset Massachusetts dataset

0.00 115,225 0.9248 0.8321 56,897 0.8935 0.7708

F1 score Mean IoU F1 score Mean IoU F1 score Mean IoU

U-Net 0.8314 0.7498 0.8867 0.7965 0.6432 0.5965

You might also like