
sensors

Article
Road Extraction from Unmanned Aerial Vehicle
Remote Sensing Images Based on Improved
Neural Networks
Yuxia Li 1 , Bo Peng 1 , Lei He 2, *, Kunlong Fan 1 , Zhenxu Li 1 and Ling Tong 1
1 School of Automation Engineering, University of Electronic Science and Technology of China,
Chengdu 611731, China; liyuxia@uestc.edu.cn (Y.L.); pengbo1@std.uestc.edu.cn (B.P.);
fankunlong@std.uestc.edu.cn (K.F.); lizhenxu@std.uestc.edu.cn (Z.L.); tongling@uestc.edu.cn (L.T.)
2 School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
* Correspondence: helei1978@cuit.edu.cn; Tel.: +86-189-808-88155

Received: 16 August 2019; Accepted: 18 September 2019; Published: 23 September 2019 

Abstract: Roads are vital components of infrastructure, the extraction of which has become a topic of
significant interest in the field of remote sensing. Because deep learning has been a popular method
in image processing and information extraction, researchers have paid more attention to extracting
roads using neural networks. This article proposes improved neural networks to extract
roads from Unmanned Aerial Vehicle (UAV) remote sensing images. D-LinkNet was first considered
for its high performance; however, the huge scale of the net reduced computational efficiency. Focusing
on the low computational efficiency of the popular D-LinkNet, this article made some
improvements: (1) Replace the initial block with a stem block. (2) Rebuild the entire network based
on ResNet units with a new structure, allowing for the construction of an improved neural network,
D-LinkNetPlus. (3) Add a 1 × 1 convolution layer before DBlock to reduce the input feature maps,
reducing parameters and improving computational efficiency. Add another 1 × 1 convolution layer
after DBlock to recover the required number of output channels. Accordingly, another improved
neural network, B-D-LinkNetPlus, was built. Comparisons were performed between the neural nets,
and verification was carried out with the Massachusetts Roads Dataset. The results show that the improved
neural networks help to reduce the network size and improve the precision of road extraction.

Keywords: road; UAV sensors; image processing; convolutional neural net

1. Introduction
Roads are vital components of infrastructure; they are essential to urban construction,
transportation, military, and economic development. How to extract road information is currently
a topic of significant research interest. Recently, Unmanned Aerial Vehicle (UAV) exploration
has seen rapid development in remote sensing fields, being advantageous for its high flexibility, targeting
capability, and low cost [1].
Recently, deep learning has been widely used in the field of information extraction with remote
sensing images. Liu et al. [2] used a full-convolution neural network to extract the initial road area,
then filtered the interference pattern and connected the road fracture area according to the road
length–width ratio, morphological operation, and Douglas-Peucker (DP) algorithm. Li et al. [3]
proposed a Y-Net network including a two-arm feature extraction model and one feature fusion model.
Liu et al. [4] introduced a feature learning method based on deep learning to extract robust road features
automatically, and proposed a method of extracting road centerlines based on a multi-scale Gabor filter
and multi-direction non-maximum suppression. Xu et al. [5] constructed a new road extraction method




based on DenseNet through the introduction of local and global attention units, which can effectively
extract road network from remote sensing images with local and global information. Zhang et al. [6]
proposed a new network for road extraction from remote sensing images, which combines ResNet
and U-Net. Wang et al. [7] proposed a road extraction method based on deep neural network (DNN)
and finite state machine (FSM). Sun et al. [8] proposed a new road extraction method based on
superimposed U-Net and multi-output network model, and used mixed loss function to solve the
problem of sample imbalance in multi-objective classification. Kestur et al. [9] proposed U-FCN (U-type
Fully Convolutional Network) model to realize road extraction from UAV low-altitude remote sensing
images, which consists of a set of convolution stacks and corresponding mirror deconvolution stacks.
Sameen et al. [10] proposed a road segmentation framework for high-resolution orthophotos based on deep
learning. In this method, a deep convolutional auto-encoder is used to extract the road from the orthophoto
image. Costea et al. [11] proposed three successive steps to extract the road in remote sensing images:
(1) Extract the initial road results by combining multiple U-Net networks; (2) Use a Convolutional
Neural Network(CNN) to recover the details of the output results in the first network; and (3) Connect
the road breakage area by reasoning to improve the road extraction accuracy. Filin et al. [12] proposed
a method combining deep learning and image post-processing to extract road information from remote
sensing images. The road extraction results were improved through four steps: vectorization of
road results, detection of road width, extension of interrupted roads, and removal of non-road areas.
Buslaev et al. [13] constructed a full convolution network similar to U-Net structure by using ResNet_34
pre-trained in ImageNet.
D-LinkNet [14] was introduced here because it won the CVPR DeepGlobe 2018 Road Extraction
Challenge with the highest Intersection over Union (IoU) scores on both the validation set and the
test set. D-LinkNet could achieve higher IoU scores if it used a ResNet encoder deeper than
ResNet_34 [15]; however, the scale of the model would be extremely large because of the design
of its center part, called DBlock [16]. To address this problem, the article proposes improved neural
networks to extract road information based on D-LinkNet: (1) Replace the initial block with a stem
block; (2) Rebuild the entire network based on ResNet units with a new structure so that a new neural
network, D-LinkNetPlus, can be built; and (3) Add a 1 × 1 convolution layer before DBlock to
reduce the input feature-maps, to reduce the parameters, and to improve computational efficiency.
Meanwhile, add another 1 × 1 convolution layer after DBlock to recover the required number of output
channels. Accordingly, an improved neural network, B-D-LinkNetPlus, was built. The modified models
were trained using our own training dataset after data augmentation [17,18], and the accuracy of their
road extraction results was then evaluated.

2. Materials and Methods


UAV remote sensing images are the first processing material required for the research; however,
the data obtained from the camera sensor have several disadvantages. Accordingly, we first had to apply
some transformations before labeling and training. Further, the convolution neural net was introduced
for processing and extracting road information from UAV images.

2.1. Materials

2.1.1. Image Data Source


As seen in Figure 1, the experimental images from optical sensors were taken in different areas
and at different times. The selected areas in the article are from the Luoyang region in Henan Province,
as seen in Figure 2. The UAV images of the Luoyang area were acquired on 18 March 2018,
and the spatial resolution of the original single images is at the centimeter level.
Figure 1. Unmanned Aerial Vehicle (UAV) Remote Sensing Images.

Figure 2. Mosaic Unmanned Aerial Vehicle images (Luoyang Region).

2.1.2. Label Data Making

Because of the large number of parameters in the convolution neural network, if the model is directly applied to actual road detection after randomly initializing these parameters, the effect cannot be guaranteed. Therefore, before applying the convolution neural network to the actual situation, the network parameters first had to be trained. To evaluate the effect of the network training, the first step was to manually label the training and test samples.
This paper used labelImg Plus-master, an open source sample annotation tool written in Python on GitHub, to label the images. The software combines Qt to create an accessible human–computer interaction interface. Because only the road information needs to be extracted for the research, the road areas should be labeled and expressed in red on the software interface. The initial software interface and annotation results are shown in Figure 3.
Figure 3. Initial Software Interface and Annotation Results.

The road category label was set as 1, and the other area labels were set as 0. To display the data conveniently, the pixel value of the labeled area was set to 255. The labeled image is shown in Figure 4.

Figure 4. Label Result Image.
2.1.3. Training Data Augmentation
The large number of samples required by a convolutional network greatly restricts its wide application. In this article, manual labeling of sample data was far from meeting the requirements of network training for the number of samples. However, prior research [17] fully demonstrated the effectiveness of data augmentation in network model training. Therefore, network training samples were added through data augmentation according to the actual scene.
(1) Hue, Saturation, Value (HSV) Contrast Transform
If an image of the same area is taken at a different time, the information carried by the image differs because of changes in weather. An effective method is to transform the original training images from RGB color space into hue, saturation, and value (HSV) color space; subsequently, the hue, saturation, and value of these images were changed to obtain the corresponding image in HSV color space. Further, the image was restored to RGB color space to obtain an additional training image. The change values of hue and saturation should not be too large; otherwise, the image restored to RGB color space will be seriously distorted. Reference [14] suggested that, in HSV color space, the change values of the H channel, S channel, and V channel be set below 15%, 15%, and 30%, respectively.
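As an illustration, a minimal sketch of this kind of HSV-based augmentation with OpenCV and NumPy is given below. The jitter limits follow the percentages quoted above, while the function and variable names are our own assumptions rather than the original implementation.

```python
import cv2
import numpy as np

def hsv_jitter(image_bgr, max_h=0.15, max_s=0.15, max_v=0.30, rng=None):
    """Randomly perturb hue, saturation, and value within the given fractions.

    image_bgr: uint8 BGR image as loaded by cv2.imread.
    max_h, max_s, max_v: maximum relative change of the H, S, and V channels,
    following the <=15%, <=15%, <=30% limits suggested in the text.
    """
    rng = rng or np.random.default_rng()
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    # One random scale factor per channel, e.g. 1 +/- 0.15 for hue.
    scales = 1.0 + rng.uniform([-max_h, -max_s, -max_v], [max_h, max_s, max_v])
    hsv *= scales
    # OpenCV stores H in [0, 179] and S, V in [0, 255] for uint8 images.
    hsv[..., 0] = np.clip(hsv[..., 0], 0, 179)
    hsv[..., 1:] = np.clip(hsv[..., 1:], 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

# Example: create one extra training image from an original sample.
# augmented = hsv_jitter(cv2.imread("train_0001.png"))
```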

(2) Spatial Geometric Transformation
Label quality significantly impacts the deep learning model. When the label space or quantity is insufficient, it will seriously affect the training result or lead to an insufficient generalization degree of the trained model, which will reduce the recognition rate and accuracy rate. To obtain diverse label data, spatial geometric transformations including rotation, flip, zoom, and affine transformation are necessary. Partial examples of the transformation results are shown in Figure 5. The data augmentation was processed extensively according to References [14,15,17]. The road direction in the original label is from northwest to southeast, as seen in Figure 5. After image geometric transformation, the road direction and its surrounding environment were changed greatly. For the training of the convolutional neural network, more information about the image features can therefore be acquired.

Figure 5. The results of spatial geometric transformation of the label: (a) original label; (b) vertical flip; (c) horizontal flip; (d) main diagonal rotation; (e) sub diagonal transpose; (f) 30 degree clockwise rotation; (g) 60 degree clockwise rotation; (h) 90 degree clockwise rotation; (i) 180 degree clockwise rotation; (j) 270 degree clockwise rotation; (k,l) random shift and scaling (approximately −10–10%).
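As an illustration of the transformations listed in Figure 5, the sketch below applies the same flips and rotations to an image and its label with OpenCV. It is an assumed implementation for illustration, not the authors' code; nearest-neighbor interpolation is used for the label so that it stays binary.

```python
import cv2

def geometric_augment(image, label):
    """Return (image, label) pairs for flips, 90-degree rotations, and one
    arbitrary-angle rotation, mirroring the cases illustrated in Figure 5."""
    pairs = [(image, label)]
    # Vertical flip, horizontal flip, and transpose (main-diagonal reflection).
    pairs.append((cv2.flip(image, 0), cv2.flip(label, 0)))
    pairs.append((cv2.flip(image, 1), cv2.flip(label, 1)))
    pairs.append((cv2.transpose(image), cv2.transpose(label)))
    # 90, 180, and 270 degree clockwise rotations.
    for code in (cv2.ROTATE_90_CLOCKWISE, cv2.ROTATE_180,
                 cv2.ROTATE_90_COUNTERCLOCKWISE):
        pairs.append((cv2.rotate(image, code), cv2.rotate(label, code)))
    # Arbitrary clockwise rotation (e.g. 30 degrees) about the image centre.
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), -30, 1.0)  # negative angle = clockwise
    pairs.append((cv2.warpAffine(image, m, (w, h)),
                  cv2.warpAffine(label, m, (w, h), flags=cv2.INTER_NEAREST)))
    return pairs
```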

After the HSV contrast transformation and spatial geometric transformation of the original images, the training data increased significantly in quantity, which is helpful for eliminating overfitting on the training set during convolution neural network training.
(3) Image Clipping
If large-sized training images are used directly for training, the demands on computer performance are higher and the training phase will be longer. In this experiment, the selected computer contained one RTX 2080 Ti graphics card and 32 GB of running memory. Further, each training image was cut into a suitable size for this article. To guarantee that the net could learn the global characteristics, the overlap of adjacent image patches was set to 100 pixels.
After the above steps, 47,152 pairs of training data and 1012 test data without data augmentation were obtained. To avoid the model being over-adapted to the image characteristics of a certain region in the training phase, the training data of all regions were arranged out of order and stored in a folder. The original images and the corresponding label images were stored in different folders. TensorFlow provides an efficient way to read training data; accordingly, the training images were converted into its standard format, TFRecord. All training images were written into a single large file.
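The following sketch illustrates one possible way to crop overlapping patches and serialize image/label pairs into a single TFRecord file with TensorFlow. The patch size, file name, and feature keys are assumptions made for illustration rather than the exact settings used in the experiments.

```python
import tensorflow as tf

PATCH, OVERLAP = 512, 100  # assumed patch size; 100-pixel overlap as in the text

def crop_patches(image, label, patch=PATCH, overlap=OVERLAP):
    """Cut a large image/label pair (numpy arrays) into overlapping patches."""
    step = patch - overlap
    h, w = image.shape[:2]
    for y in range(0, max(h - patch, 0) + 1, step):
        for x in range(0, max(w - patch, 0) + 1, step):
            yield image[y:y + patch, x:x + patch], label[y:y + patch, x:x + patch]

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_tfrecord(pairs, path="train.tfrecord"):
    """Serialize (image, label) patch pairs into one TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for img, lab in pairs:
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": _bytes_feature(tf.io.encode_png(img).numpy()),
                # add a channel axis so the single-band label can be PNG-encoded
                "label": _bytes_feature(tf.io.encode_png(lab[..., None]).numpy()),
            }))
            writer.write(example.SerializeToString())
```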

2.2. Methods

With the popularization of deep learning, designing a network structure has become simpler. However, a convolution neural network is stacked from common components such as convolution layers, pooling layers, and activation layers. Different network structures, formed by different stacking orders and different numbers of these components, have different effects on the accuracy of the results; therefore, the accuracy of the final result of a convolution network will differ. In practical applications, the problem is how to design a good network structure.

2.2.1. Stem Block
For D-LinkNet, there is an initial block [15] (a 7 × 7 convolution layer with stride = 2, followed by a 3 × 3 max pooling with stride = 2). However, references [19–22] showed that two iterations of consecutive down-sampling will lose feature information, making it difficult to recover detailed information in the decoding stage. Inspired by Inception v3 [20] and Inception v4 [23], reference [24] replaced the initial block with a stem block. The stem block consists of three 3 × 3 convolution layers and one 2 × 2 mean pooling layer. The stride of the first convolution layer is 2 and the others are 1. The output channels of all three convolution layers are 64. The experiment result in [19] demonstrated that the stem block can successfully detect objects, especially small objects. The contrast between the initial block and the stem block is shown in Figure 6.

Figure 6. The contrast of initial block and stem block: (a) initial block and (b) stem block.
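A minimal Keras-style sketch of the two entry blocks compared in Figure 6 is given below, assuming 64 output channels as stated above. It is meant only to illustrate the structure, not to reproduce the authors' exact layer configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def initial_block(x):
    """D-LinkNet entry: 7x7 conv (stride 2) followed by 3x3 max pooling (stride 2)."""
    x = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)

def stem_block(x):
    """Stem block: three 3x3 convs (first with stride 2) and 2x2 average pooling."""
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=1, padding="same", activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=1, padding="same", activation="relu")(x)
    return layers.AveragePooling2D(pool_size=2, strides=2, padding="same")(x)

# Both blocks reduce a 512 x 512 input to 128 x 128 feature maps with 64 channels.
# inputs = tf.keras.Input(shape=(512, 512, 3)); features = stem_block(inputs)
```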

2.2.2. D-LinkNetPlus Building

When training a network, to obtain better precision, we constantly build a deeper network by stacking more layers. In an extreme case, the added layers do not learn any useful features in terms of accuracy, but only duplicate the characteristics of the shallower layers onto the deeper layers; that is, the characteristics of the new layers identically map the shallow features. In this case, the performance of the deep network is at least the same as that of the shallow network, and degradation should not occur. On the basis of this idea, He [15,16] proposed a new network structure, ResNet (Residual Neural Network). The ResNet module can be expressed by the following Formulas (1) and (2):

y_l = h(x_l) + F(x_l, W_l)    (1)

x_{l+1} = f(y_l)    (2)

where x_l and x_{l+1} indicate the input and the output of the l-th residual unit, F is the residual function, f indicates the activation function, and h indicates an identity mapping. If f is also taken as an identity mapping, i.e., x_{l+1} = y_l, the formulas above can be written as:

x_{l+1} = x_l + F(x_l, W_l)    (3)

So,

x_{l+2} = x_{l+1} + F(x_{l+1}, W_{l+1}) = x_l + F(x_l, W_l) + F(x_{l+1}, W_{l+1})    (4)

The output of the L-th layer can be expressed as:

x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)    (5)

The feature of an arbitrary level x_L can be expressed by an earlier feature x_l plus the residual structure \sum_{i=l}^{L-1} F(x_i, W_i). Therefore, even if the network is deep, the shallow features can be transmitted to the deep layers through the identity mapping. If l = 0, the formula can be written as:

x_L = x_0 + \sum_{i=0}^{L-1} F(x_i, W_i)    (6)

We find that the feature of an arbitrary layer can be obtained by adding the input x_0 and the residual structure \sum_{i=0}^{L-1} F(x_i, W_i), so the input can be joined with the output directly. Further, the gradient of any layer can be determined by the chain rule:

\partial loss / \partial x_l = (\partial loss / \partial x_L) \cdot (\partial x_L / \partial x_l) = (\partial loss / \partial x_L) \cdot (1 + \partial / \partial x_l \sum_{i=l}^{L-1} F(x_i, W_i))    (7)

So, ResNet can address the problem of gradient dispersion very well even if the network is very deep. The original structure of the residual unit is shown in Figure 7a. Its output can be expressed by the following Formulas (8) and (9):

y_l = x_l + F(x_l, W_l)    (8)

x_{l+1} = f(y_l)    (9)

where f indicates the activation function. The new structure of the residual unit, the ResNet variant introduced in [16], is shown in Figure 7b. Its output can be expressed by the following Formula (10):

x_{l+1} = x_l + F(x_l, W_l)    (10)
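The difference between Formulas (8)–(9) and Formula (10) can be seen in the following Keras-style sketch of the two residual units of Figure 7 (identity-shortcut case only). It is a simplified illustration under the assumption that the input already has `filters` channels, not the exact layer configuration used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def original_residual_unit(x, filters):
    """Figure 7a: conv-BN-ReLU stack, addition, then a final ReLU (f is not identity)."""
    f = layers.Conv2D(filters, 3, padding="same")(x)
    f = layers.BatchNormalization()(f)
    f = layers.ReLU()(f)
    f = layers.Conv2D(filters, 3, padding="same")(f)
    f = layers.BatchNormalization()(f)
    y = layers.Add()([x, f])          # y_l = x_l + F(x_l, W_l)
    return layers.ReLU()(y)           # x_{l+1} = f(y_l)

def new_residual_unit(x, filters):
    """Figure 7b: pre-activation (BN-ReLU-conv) residual branch, identity addition only."""
    f = layers.BatchNormalization()(x)
    f = layers.ReLU()(f)
    f = layers.Conv2D(filters, 3, padding="same")(f)
    f = layers.BatchNormalization()(f)
    f = layers.ReLU()(f)
    f = layers.Conv2D(filters, 3, padding="same")(f)
    return layers.Add()([x, f])       # x_{l+1} = x_l + F(x_l, W_l)
```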


Figure 7. The structure of the different residual units: (a) original residual unit and (b) new residual unit.

In contrast to the original residual unit, the new residual unit is essential for making information propagation smooth and obtaining higher accuracy, as seen in Figure 7.
D-LinkNet uses several dilated convolution layers with central skip connections to increase the receptive field. It contains dilated convolutions in both cascade mode and parallel mode. Each dilated convolution has a different dilation rate, so the receptive field of each path is different; the block can also combine features from different scales. D-LinkNet takes advantage of multi-scale context aggregation so that it has great performance for road extraction. We call this center part DBlock in this paper, and its structure is shown in Figure 8.

Figure 8. The structure of DBlock.

In conclusion, we rebuilt D-LinkNet based on the stem block and the ResNet variant. The resulting network is called D-LinkNetPlus and its structure is shown in Figure 9. C indicates the classification category, and the structure of the Res-block is shown in Figure 7b.
Figure 9. D-LinkNetPlus structure chart.
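To make the overall layout of Figure 9 easier to follow, the sketch below assembles the main stages (stem block, encoder stages, DBlock center, LinkNet-style decoder with skip connections) in Keras. It is a structural sketch only: the encoder stages here use simplified two-convolution pre-activation units rather than the exact bottleneck configuration of Table 1, and the dilation pattern of the DBlock is an assumption based on Figure 8.

```python
import tensorflow as tf
from tensorflow.keras import layers

def stem_block(x):
    for stride in (2, 1, 1):
        x = layers.Conv2D(64, 3, strides=stride, padding="same", activation="relu")(x)
    return layers.AveragePooling2D(2, 2, padding="same")(x)

def residual_stage(x, filters, blocks, downsample):
    """Simplified stand-in for a ResNet-variant encoder stage (pre-activation units)."""
    if downsample:
        x = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    elif x.shape[-1] != filters:
        x = layers.Conv2D(filters, 1, padding="same")(x)
    for _ in range(blocks):
        f = layers.BatchNormalization()(x)
        f = layers.ReLU()(f)
        f = layers.Conv2D(filters, 3, padding="same")(f)
        f = layers.BatchNormalization()(f)
        f = layers.ReLU()(f)
        f = layers.Conv2D(filters, 3, padding="same")(f)
        x = layers.Add()([x, f])
    return x

def dblock(x):
    """Cascaded dilated convolutions whose outputs are summed (multi-scale context)."""
    c = x.shape[-1]
    out, branch = x, x
    for rate in (1, 2, 4, 8):
        branch = layers.Conv2D(c, 3, padding="same",
                               dilation_rate=rate, activation="relu")(branch)
        out = layers.Add()([out, branch])
    return out

def decoder_block(x, filters_out):
    c = x.shape[-1]
    x = layers.Conv2D(c // 4, 1, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(c // 4, 3, strides=2, padding="same", activation="relu")(x)
    return layers.Conv2D(filters_out, 1, padding="same", activation="relu")(x)

def d_linknet_plus(input_shape=(512, 512, 3), num_classes=1):
    inputs = tf.keras.Input(shape=input_shape)
    x = stem_block(inputs)                                # 128 x 128 x 64
    e1 = residual_stage(x, 256, 3, downsample=False)      # 128 x 128
    e2 = residual_stage(e1, 512, 4, downsample=True)      # 64 x 64
    e3 = residual_stage(e2, 1024, 6, downsample=True)     # 32 x 32
    e4 = residual_stage(e3, 2048, 3, downsample=True)     # 16 x 16
    center = dblock(e4)
    d4 = layers.Add()([decoder_block(center, 1024), e3])  # skip connections
    d3 = layers.Add()([decoder_block(d4, 512), e2])
    d2 = layers.Add()([decoder_block(d3, 256), e1])
    d1 = decoder_block(d2, 128)                           # 256 x 256
    f1 = layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu")(d1)
    logits = layers.Conv2D(num_classes, 3, padding="same")(f1)
    return tf.keras.Model(inputs, logits)
```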

2.2.3. DBlock Structure Optimization
Because the network structure of D-LinkNet came from the ResNet framework, the channel number of the feature-maps increases gradually with deeper depths. At the end of the encoding stage, the number of output channels of the feature-map is highest. Next, the feature-maps go through the DBlock and the output is taken as the input of the decoding stage. The number of input and output channels is the same in every dilated convolution, leading to numerous parameters throughout the whole network. Even if the network performance is excellent, the network scale would limit the application. The 1 × 1 convolution, introduced as a bottleneck layer in [15], is usually applied to change the number of channels of feature-maps, which can not only effectively fuse features but also greatly reduce the size of the network parameters. This article added a 1 × 1 convolution before DBlock, which reduces the number of channels of the input feature-maps by half. Further, another 1 × 1 convolution was added after DBlock to recover the required channels. The structure, known as DBlockPlus, combines two 1 × 1 convolution layers with DBlock, and its structure is shown in Figure 10. We replaced the DBlock with DBlockPlus in D-LinkNetPlus to build a new network that we called B-D-LinkNetPlus.

Figure 10. The structure of DBlockPlus.
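A sketch of this bottleneck arrangement is shown below: a 1 × 1 convolution halves the channels before the dilated-convolution center, and another 1 × 1 convolution restores them afterwards. The `dblock` helper stands for the dilated-convolution structure of Figure 8 and is repeated here only so the snippet is self-contained; the code is illustrative rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dblock(x):
    """Stand-in for the DBlock of Figure 8: cascaded dilated 3x3 convolutions
    (dilation rates 1, 2, 4, 8) whose intermediate outputs are summed."""
    channels = x.shape[-1]
    out, branch = x, x
    for rate in (1, 2, 4, 8):
        branch = layers.Conv2D(channels, 3, padding="same",
                               dilation_rate=rate, activation="relu")(branch)
        out = layers.Add()([out, branch])
    return out

def dblock_plus(x):
    """DBlockPlus: 1x1 bottleneck, DBlock on half the channels, 1x1 expansion."""
    m = x.shape[-1]
    squeezed = layers.Conv2D(m // 2, 1, padding="same", activation="relu")(x)
    center = dblock(squeezed)          # dilated convolutions now act on m/2 channels
    return layers.Conv2D(m, 1, padding="same", activation="relu")(center)

# With m = 2048 input channels, every dilated convolution works on 1024 channels,
# which roughly quarters its parameter count compared with the plain DBlock.
```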

2.2.4. Net Parameters

In this paper, the network structure of D-LinkNetPlus is constructed based on the ResNet variant. Using the 50-layer and 101-layer configurations [15] of the ResNet variant as the encoding stage, two networks were built, called D-LinkNetPlus_50 and D-LinkNetPlus_101, respectively. The network parameters are shown in Table 1. The parameters of B-D-LinkNetPlus are the same as those of D-LinkNetPlus.

Table 1. D-LinkNetPlus structure parameters.

Layer Name   Output Size   D-LinkNetPlus_50                                         D-LinkNetPlus_101
Stem block   128 × 128     Conv [3 × 3, k, stride 2; 3 × 3, k, stride 1; 3 × 3, k, stride 1]; 2 × 2 avg pool, stride 2   (same as D-LinkNetPlus_50)
Encoder1     128 × 128     Conv [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3              Conv [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3
Encoder2     64 × 64       Conv [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4            Conv [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4
Encoder3     32 × 32       Conv [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6           Conv [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 23
Encoder4     16 × 16       Conv [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3           Conv [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3
Center       16 × 16       DBlock                                                   DBlock
Decoder4     32 × 32       Conv 1 × 1, 512; Deconv 3 × 3, 512; Conv 1 × 1, 1024     (same)
Decoder3     64 × 64       Conv 1 × 1, 256; Deconv 3 × 3, 256; Conv 1 × 1, 512      (same)
Decoder2     128 × 128     Conv 1 × 1, 128; Deconv 3 × 3, 128; Conv 1 × 1, 256      (same)
Decoder1     256 × 256     Conv 1 × 1, 64; Deconv 3 × 3, 64; Conv 1 × 1, 128        (same)
F1           512 × 512     Deconv [4 × 4, 32]                                       Deconv [4 × 4, 32]
Logits       512 × 512     Conv [3 × 3, C, stride 1]                                Conv [3 × 3, C, stride 1]

3. Results and Discussion

3.1. Implementation Details


A RTX 2080 Ti graphics card was used for all network training and testing in this article. We used
TensorFlow as the deep learning framework to train and test all networks. All the networks applied
the same loss function defined in Equation (11):

Loss = L2_loss + \sum_{i=1}^{N} BCE_loss(P_i, GT_i) − lambda · Jaccard_loss    (11)

where L2_loss represents the L2 regularization loss over all the parameters in the model, \sum_{i=1}^{N} BCE_loss(P_i, GT_i) indicates the binary cross entropy loss over the N training images, P_i is the prediction result, and GT_i is the corresponding label image. The term \sum_{i=1}^{N} |P_i − GT_i| represents the pixel-level differences between the prediction results and the label images, and lambda indicates how important this item is for the total loss. The article suggests lambda be set to around 0.7. Further, Adam was chosen as the optimizer. Following [14], the learning rate was originally set at 0.001 and reduced by a factor of 10 when the training loss was observed to decrease slowly. The learning rate decayed by a factor of 0.997 every four epochs. The batch size during the training phase was fixed at four; the epoch size equals 2000.
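A sketch of this combined loss in TensorFlow is given below. The soft Jaccard formulation, the weight-decay factor, and the epsilon constant are assumptions made so that the example runs; `lambda_weight = 0.7` follows the value suggested above.

```python
import tensorflow as tf

def combined_loss(model, labels, logits, lambda_weight=0.7,
                  weight_decay=1e-4, eps=1e-7):
    """L2 regularization + binary cross entropy - lambda * (soft) Jaccard index,
    mirroring Equation (11). labels and logits are float tensors of shape
    [batch, H, W, 1]; labels contain 0/1 road masks."""
    probs = tf.sigmoid(logits)
    # L2 regularization over all trainable convolution kernels.
    l2 = weight_decay * tf.add_n(
        [tf.nn.l2_loss(v) for v in model.trainable_variables if "kernel" in v.name])
    # Pixel-wise binary cross entropy between predictions and ground truth.
    bce = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
    # Soft Jaccard (IoU) computed from probabilities; higher is better, so it is subtracted.
    intersection = tf.reduce_sum(probs * labels)
    union = tf.reduce_sum(probs) + tf.reduce_sum(labels) - intersection
    jaccard = (intersection + eps) / (union + eps)
    return l2 + bce - lambda_weight * jaccard
```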

3.2. Result and Discussion

3.2.1. D-LinkNetPlus Result and Discussion


The road extraction model of D-LinkNet, based on ResNet variant, is called D-LinkNetPlus.
Similar to D-LinkNet, there are also 50 layers of D-LinkNetPlus_50 and 101 layers of D-LinkNetPlus_101
in the encoding stage. We divided the test images into three levels: simple scene, general scene,
and complicated scene according to the complexity of image content scene. In the simple scene,
the road areas are highlighted and there are small parts of the road areas that are covered by shadow.
In a general scene, the road areas are around building groups; some parts of the road areas are covered
by buildings, trees, and shadows. Other objects may have the same features as the road areas, such as
highlighted greenhouses and highlighted buildings. These objects would affect the extraction results of
the road areas. In a complicated scene, a long part of the road area is covered by other objects, such as
trees or shadows, which make the road areas hard to extract. Partial experimental comparison results
of three scene levels are shown in Figures 11–13, respectively.
The new network with 101 layers in the encoding stage had better results than that with 50 layers
in the simple scene experiments.
In the third row of the image, D-LinkNet_50 failed to detect a part of the road area blocked by the
shadow of trees; however, the other three networks were able to connect the blocked area. The result
boundary of the road area extracted by D-LinkNet_101 was smoother and closer to the manual label
than that of D-LinkNetPlus_50 and D-LinkNetPlus_101. With regard to the highway in the fourth
line, the extraction results from D-LinkNet_101 were better than those of the other three network models not only
broadly but also in terms of detail, which might be attributed to the fact that the stem block structure
replaced the initial block structure.

Figure 11. Comparison results of simple scene experiment. (a) Original image; (b) label image; (c) D-
LinkNet_50; (d) D-LinkNet_101; (e) D-LinkNetPlus_50; and (f) D-LinkNetPlus_101.
Figure 12. Comparison results of general scene experiments. (a) Input image; (b) label image; (c) D-LinkNet_50; (d) D-LinkNet_101; (e) D-LinkNetPlus_50; and (f) D-LinkNetPlus_101.
The experimental results from D-LinkNet_101 for the general scene were better than those of the other three networks, but the overall effect was acceptable. All four networks were able to detect the peninsula ring area, where the two road materials intersect in the third line image. For the road in the lower left corner of the image in the fourth row, its shape and color are similar to the two roads in the upper right corner of the image. However, only D-LinkNetPlus could detect it, and the detection effect of D-LinkNetPlus_50 was better than that of D-LinkNetPlus_101, which may be due to the influence of the bare space on the left side of the road in the lower right corner.

Figure 13. Comparison results of complex scene experiment. (a) Original image; (b) label image; (c) D-LinkNet_50; (d) D-LinkNet_101; (e) D-LinkNetPlus_50; and (f) D-LinkNetPlus_101.

As can be seen from Figure 13, the experimental results of D-LinkNet_101 show that the road area had better integrity without independent patches. From the third and fourth rows of the image results, D-LinkNetPlus_101 is better than the other networks in connecting road area breakpoints blocked by buildings or trees. However, in the first and second rows, many road areas are wrongly detected in the road extraction result of D-LinkNetPlus; many independent patches are left in the image, which will affect the calculation of the overall accuracy.
The sizes of the network models D-LinkNet and D-LinkNetPlus, and the calculated results of IoU accuracy, are shown in Table 2. The ResNet variant advances the nonlinear mapping element to a pixel-by-pixel addition. Theoretically, the two networks have the same model parameters; however, the size of D-LinkNetPlus is 100 megabytes less than that of D-LinkNet. According to the precision results calculated by IoU, the accuracy of D-LinkNetPlus_101 is higher than that of the other three models, which is also consistent with the previous analysis of the results of the different scenes.

Table 2. Comparison results of network size and road precision between D-LinkNet and D-LinkNetPlus.

Network Name         Network Size    IoU
D-LinkNet_50         792 M           51.02%
D-LinkNet_101        0.98 G          52.67%
D-LinkNetPlus_50     686 M           51.85%
D-LinkNetPlus_101    758 M           52.87%

3.2.2. B-D-LinkNetPlus Results and Discussion

The D-LinkNetPlus with the bottleneck layer is called B-D-LinkNetPlus. Similar to D-LinkNetPlus, there are also 50-layer and 101-layer versions in the encoding stage, known as B-D-LinkNetPlus_50 and B-D-LinkNetPlus_101, respectively. A comparison of some experimental results for the three scene levels is shown in Figures 14–16, respectively.
Figure 14. Comparison results of simple scene experiment. (a) Input image; (b) label image; (c) D-LinkNet_50; (d) D-LinkNet_101; (e) B-D-LinkNetPlus_50; and (f) B-D-LinkNetPlus_101.

As seen in Figure 14, with a focus on the network of D-LinkNet or B-D-LinkNetPlus, the image
detection effect of layer 101 is better than that of layer 50 in the encoding stage. Further, fewer
independent patches were left in the image and the road area overall had a higher integrity. From the
details of road, D-LinkNet_101 is better than B-D-LinkNetPlus_101. But the high-road test showed
that B-D-LinkNetPlus _101 was better than D-LinkNet_101, with neater edges.

Figure 15. Comparison results of general scene experiments. (a) Input image; (b) label image; (c) D-LinkNet_50; (d) D-LinkNet_101; (e) B-D-LinkNetPlus_50; and (f) B-D-LinkNetPlus_101.

As seen in Figure 15, D-LinkNet and B-D-LinkNetPlus have their own advantages in the detection results for different scenes. In terms of connecting road fractures, it can be seen from the third line picture that D-LinkNet has a better detection effect. However, for the field road area in the lower right corner of the third line image, this area can be judged as a road area or not a road area. Because this area does not belong to the main road, it was not marked as a road area when the label was made. Although B-D-LinkNetPlus detected this area, which cannot be considered a false detection of the network, it will reduce the evaluation accuracy of the network.

Figure 16. Comparison results of complex scene experiment. (a) Input image; (b) label image; (c) D-LinkNet_50; (d) D-LinkNet_101; (e) B-D-LinkNetPlus_50; and (f) B-D-LinkNetPlus_101.
As seen in Figure 14, with a focus on the network of D-LinkNet or B-D-LinkNetPlus, the image detection effect of the 101-layer version is better than that of the 50-layer version in the encoding stage. Further, fewer independent patches were left in the image and the road area overall had a higher integrity. From the details of the road, D-LinkNet_101 is better than B-D-LinkNetPlus_101. But the highway test showed that B-D-LinkNetPlus_101 was better than D-LinkNet_101, with neater edges.
As seen in Figure 16, whether using D-LinkNet or B-D-LinkNetPlus, the overall image road detection results are very close to the manual label images. However, in terms of broken road connection, D-LinkNet and B-D-LinkNetPlus are better than D-LinkNetPlus. With regard to the last line of images, the road area is covered by trees; accordingly, the characteristics of the road cannot be detected. However, they can be inferred from the context, and the fractured area can be connected.
The size of network models, D-LinkNet and B-D-LinkNetPlus, and the calculated results of IoU
accuracy are shown in Table 3. In comparison with D-LinkNet, B-D-LinkNetPlus reduces the channel
of input feature-maps of DBlock by half by adding 1 × 1 convolution before DBlock, greatly reducing
model parameters and shortening training time.

Table 3. D-LinkNet and B-D-LinkNetPlus network size and precision comparison results.

Network Name Network Model Size IoU


D-LinkNet_50 792 M 51.02%
D-LinkNet_101 0.98 G 52.67%
B-D-LinkNetPlus_50 298 M 52.86%
B-D-LinkNetPlus_101 370 M 52.94%

3.2.3. Validation on Public Dataset


To verify the effectiveness of the proposed method, we also tested our method with the
Massachusetts Roads Dataset, which consists of 1108 train images and 14 validation images. The original
image size was 1500 × 1500 pixels. The dataset was formulated as a binary segmentation problem, in
which roads are labeled as foreground and other objects are labeled as background. We performed
HSV Contrast Transform, Spatial Geometric Transformation, and image clipping on the original data
set. Through training models and experiments, we obtained model size and IoU index. The size of the
network models, D-LinkNet, D-LinkNetPlus, B-D-LinkNetPlus, and the calculated results of IoU index
are shown in Table 4.

Table 4. Comparison results of road precision between D-LinkNet, D-LinkNetPlus, and B-D-LinkNetPlus.

Network Name Network Model Size IoU


D-LinkNet_50 702 M 57.18%
D-LinkNet_101 758 M 57.43%
D-LinkNetPlus_50 582 M 57.58%
D-LinkNetPlus_101 636 M 57.64%
B-D-LinkNetPlus_50 371 M 59.29%
B-D-LinkNetPlus_101 434 M 59.45%

The comparison of the experimental results of these six models is shown in Figures 17 and 18, respectively.
Figure 17. Comparison results of complex scene experiment. (a) Input image; (b) label image; (c) D-LinkNet_50; and (d) D-LinkNet_101.

According to the precision results calculated by IoU with the validation set, for D-LinkNetPlus and
B-D-LinkNetPlus, the IoU of the road detection results are better than that of D-LinkNet. Meanwhile,
the size of B-D-LinkNetPlus is also better than D-LinkNet. The extracted result of D-LinkNetPlus is
similar to that of D-LinkNet because a certain gap existed in the images in comparison with the label.
But D-LinkNetPlus greatly reduces model parameters and size; it also shortens the model training
time. With a focus on the extracted results from the Massachusetts Roads Dataset, we can see that
B-D-LinkNetPlus is the most effective among the three nets. In terms of the B-D-LinkNetPlus_101,
it yielded better results in comparison with the other network models. The results are also consistent
with the previous analysis on the results of different scenes.
SensorsFigure
2019, 19,17.
4115Comparison
results of complex scene experiment. (a) Input image; (b) label
17 of 19
image; (c) D-LinkNet_50; and (d) D-LinkNet_101.

Figure 18.
Figure 18.Comparison
Comparison results
results of complex
of complex scene experiment.
scene experiment. (a) D-LinkNetPlus_50;
(a) D-LinkNetPlus_50; (b) D-
(b) D-LinkNetPlus_101;
LinkNetPlus_101;
(c) (c) B-D-LinkNetPlus_50;
B-D-LinkNetPlus_50; and (d) B-D-LinkNetPlus_101.
and (d) B-D-LinkNetPlus_101.

In addition,totothe
According further analyze
precision the results
results of the
calculated by model
IoU withB-D-LinkNetPlus_101,
the validation set, for different rotating
D-LinkNetPlus
images were tested andthe
and B-D-LinkNetPlus, analyzed.
IoU of theWeroad
tookdetection
an imageresults
and rotated it counterclockwise
are better than that of D-LinkNet.30◦ , 45◦ ,
and ◦
60 . Further, we inputted the four images intobetter
the B-D-LinkNetPlus_101
Meanwhile, the size of B-D-LinkNetPlus is also than D-LinkNet. Theforextractedprediction, andof
result then
D-
transformed
LinkNetPlusthem back to
is similar to that
theiroforiginal orientation.
D-LinkNet becauseThe experimental
a certain results
gap existed show
in the that the
images IoU of four
in comparison
output
with the results
label.are
Butdifferent. The research
D-LinkNetPlus greatlyonreduces
data augmentation for deep
model parameters and learning
size; it is interesting
also shortensand the
valuable. Accordingly,
model training the asensitivity
time. With of rotation
focus on the extractedshould befrom
results given more
the attention inRoads
Massachusetts our future work.
Dataset, we
can see that B-D-LinkNetPlus is the most effective among the three nets. In terms of the B-D-
4. Conclusions
This article presented improved neural networks to extract road information from remote sensing
images acquired by a camera sensor mounted on a UAV. Because training data play a key role in deep
learning methods, the sensitivity to rotation was discussed and rotated images were added to the
training set. The D-LinkNet was then introduced for road extraction, and a variant of ResNet was
adopted to improve the neural net, known as D-LinkNetPlus, which has fewer
parameters but offers similar IoU scores. However, the large net still requires a large amount of
computation; accordingly, the D-block was modified to further develop the improved neural
net called B-D-LinkNetPlus. The original D-LinkNet_50, D-LinkNet_101, D-LinkNetPlus_50,
D-LinkNetPlus_101, B-D-LinkNetPlus_50, and B-D-LinkNetPlus_101 were then compared and analyzed
using two evaluation indicators, model size and IoU score. The improved neural net
B-D-LinkNetPlus_101 was the best choice because of its
higher IoU score and smaller net size. Verification was also made using the public Massachusetts Roads
Dataset, and the results proved the value of the improved neural net. Furthermore, this research is helpful
for extracting information about typical targets with convolutional neural nets in remote sensing fields.
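
To make the D-block modification mentioned above more concrete, the following PyTorch-style sketch wraps a cascade of dilated convolutions between two 1 × 1 convolutions, so the dilated convolutions operate on a reduced number of channels. It is only an illustration of the bottleneck idea; the dilation rates, the channel-reduction ratio, and the way the dilated outputs are combined here are assumptions, not the authors' exact D-block implementation.

```python
import torch.nn as nn

class BottleneckDilationBlock(nn.Module):
    """Illustrative bottlenecked dilation block: 1x1 reduce -> dilated 3x3 cascade -> 1x1 restore."""

    def __init__(self, channels, reduced_channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced_channels, kernel_size=1)      # shrink channel count
        self.dilated = nn.ModuleList(
            nn.Conv2d(reduced_channels, reduced_channels, kernel_size=3,
                      padding=d, dilation=d)                                    # growing receptive field
            for d in dilations
        )
        self.restore = nn.Conv2d(reduced_channels, channels, kernel_size=1)     # recover channel count
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.reduce(x))
        out = y
        for conv in self.dilated:
            y = self.relu(conv(y))
            out = out + y                          # accumulate multi-scale context
        return self.relu(self.restore(out)) + x    # residual connection keeps the block easy to train
```

Because the 3 × 3 dilated convolutions run on the reduced channel count, such a block uses fewer parameters and less computation than applying them at full width, which is the motivation stated for B-D-LinkNetPlus.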

Author Contributions: Methodology, Y.L.; software, K.F. and Z.L.; validation, B.P.; formal analysis, L.H.; investigation,
K.F. and Z.L.; resources, Y.L.; data curation, B.P.; writing—original draft preparation, Y.L.; writing—review and
editing, L.H.; supervision, L.T.; funding acquisition, Y.L.
Funding: This research was funded by the PRE-RESEARCH PROJECT (No. 301021704), the NSFC PROJECT (No. 41571333
and 60841006), and the SCIENCE AND TECHNOLOGY SUPPORT PROJECT OF SICHUAN PROVINCE (No. 2018SZ0286,
2018CC0093 and 2018GZDZX0030).
Acknowledgments: Thanks to the support from the team at the University of Electronic Science and Technology
of China, and Chengdu University of Information Technology.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Kim, J.I.; Kim, T.; Shin, D.; Kim, S. Fast and robust geometric correction for mosaicking UAV images with
narrow overlaps. Int. J. Remote Sens. 2017, 38, 2557–2576. [CrossRef]
2. Liu, X.; Wang, G.; Yang, H. Road extraction method of full convolution neural network remote sensing image.
Remote Sens. Inf. 2018, 33, 69–75.
3. Li, Y.; Xu, L.; Rao, J.; Guo, L.; Yan, Z.; Jin, S. A Y-Net deep learning method for road segmentation using
high-resolution visible remote sensing images. Remote Sens. Lett. 2019, 10, 381–390. [CrossRef]
4. Liu, R.; Miao, Q.; Song, J.; Quan, Y.; Li, Y.; Xu, P.; Dai, J. Multiscale road centerlines extraction from
high-resolution aerial imagery. Neurocomputing 2019, 329, 384–396. [CrossRef]
5. Xu, Y.; Xie, Z.; Feng, Y.; Chen, Z. Road Extraction from High-Resolution Remote Sensing Imagery Using
Deep Learning. Remote Sens. 2018, 10, 1461. [CrossRef]
6. Zhang, Z.; Liu, Q.; Wang, Y. Road extraction by deep residual u-net. IEEE Geosci. Remote Sens. Lett.
2018, 15, 749–753. [CrossRef]
7. Wang, J.; Song, J.; Chen, M.; Yang, Z. Road network extraction: A neural-dynamic framework based on deep
learning and a finite state machine. Int. J. Remote Sens. 2015, 36, 3144–3169. [CrossRef]
8. Sun, T.; Chen, Z.; Yang, W.; Wang, Y. Stacked u-nets with multi-output for road extraction. In Proceedings of
the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake
City, UT, USA, 18–22 June 2018; pp. 202–206.
9. Kestur, R.; Farooq, S.; Abdal, R.; Mehraj, E.; Narasipura, O.S.; Mudigere, M. UFCN: A fully convolutional
neural network for road extraction in RGB imagery acquired by remote sensing from an unmanned aerial
vehicle. J. Appl. Remote Sens. 2018, 12, 016020. [CrossRef]
10. Sameen, M.I.; Pradhan, B. A novel road segmentation technique from orthophotos using deep convolutional
autoencoders. Korean J. Remote Sens. 2017, 33, 423–436.
11. Costea, D.; Marcu, A.; Slusanschi, E.; Leordeanu, M. Roadmap Generation using a Multi-Stage Ensemble
of Deep Neural Networks with Smoothing-Based Optimization. In Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA,
18–22 June 2018; pp. 220–224.
12. Filin, O.; Zapara, A.; Panchenko, S. Road detection with EOSResUNet and post vectorizing algorithm.
In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 211–215.
13. Buslaev, A.; Seferbekov, S.S.; Iglovikov, V.; Shvets, A. Fully convolutional network for automatic road
extraction from satellite imagery. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4321–4324.
14. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with Pretrained Encoder and Dilated Convolution for
High Resolution Satellite Imagery Road Extraction. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 182–186.
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016;
pp. 770–778.
16. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In European Conference on
Computer Vision; Springer: Cham, Switzerland, 2016; pp. 630–645.
17. Perez, L.; Wang, J. The effectiveness of data augmentation in image classification using deep learning.
arXiv 2017, arXiv:1712.04621.
18. Wiedemann, C.; Heipke, C.; Mayer, H.; Jamet, O. Empirical evaluation of automatically extracted road axes.
In Empirical Evaluation Techniques in Computer Vision; IEEE Computer Society Press: Los Alamitos, CA, USA,
1998; pp. 172–187.
19. Zhou, P.; Ni, B.; Geng, C.; Hu, J.; Xu, Y. Scale-transferrable object detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018;
pp. 528–537.
20. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV,
USA, 26 June–1 July 2016; pp. 2818–2826.
21. Perciano, T.; Tupin, F.; Hirata, R., Jr.; Cesar, R.M., Jr. A two-level Markov random field for road
network extraction and its application with optical, SAR, and multitemporal data. Int. J. Remote Sens.
2016, 37, 3584–3610. [CrossRef]
22. Li, M.; Stein, A.; Bijker, W.; Zhan, Q. Region-based urban road extraction from VHR satellite images using
Binary Partition Tree. Int. J. Appl. Earth Obs. Geoinf. 2016, 44, 217–225. [CrossRef]
23. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual
connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence,
San Francisco, CA, USA, 4–9 February 2017.
24. Shen, Z.; Liu, Z.; Li, J.; Jiang, Y.G.; Chen, Y.; Xue, X. Dsod: Learning deeply supervised object detectors
from scratch. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy,
22–29 October 2017; pp. 1919–1927.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
