Professional Documents
Culture Documents
branches to jointly optimize two complementary tasks A. Fully Convolutional Network for Semantic Segmentation
– namely, vehicle semantic segmentation and semantic Long et al. [30] first proposed FCN architecture for semantic
boundary detection. The latter subproblem is beneficial segmentation tasks, which is both efficient and effective. Later,
for differentiating vehicles with an extremely close dis- some extensions of the FCN model have been proposed to
tance and further improving instance segmentation per- improve semantic segmentation performance. To name a few,
formance. in [31], the authors removed some of the max-pooling opera-
The remainder of this paper is organized as follows. After tions and, accordingly, introduced atrous/dilated convolutions
the introductory Section I, detailing vehicle extraction from in their network, which can expand the field of view without
high-resolution remote sensing imagery, we enter Section II, increasing the number of parameters. As post-processing, a
dedicated to the details of the proposed semantic boundary- dense conditional random field (CRF) was trained separately to
aware multi-task learning network for vehicle instance seg- refine the estimated category score maps for further improve-
mentation. Section III then provides dataset information, the ment. Zhang et al. [32] introduced a new form of network
network setup, and experimental results and discussion. Fi- that combines FCN- and CRF-based probabilistic graphical
nally, Section IV concludes the paper. modeling to simulate mean-field approximate inference for the
CRF, with Gaussian pairwise potentials as the recurrent neural
network (RNN).
II. M ETHODOLOGY
B. Residual Fully Convolutional Network (ResFCN)
We formulate the vehicle instance segmentation task by
Here, we first explain how to construct a ResFCN according
two subproblems, namely vehicle detection and semantic
to existing works in literature; mainly, the ResNet [29] and
segmentation. The training set is denoted by {(xi , yi , zi )},
FCN [30]. Then, we theoretically analyze why ResFCN is
where i = 1, 2, · · · , N and N is the number of training
able to offer better performance than other FCNs based on
samples. Since we consider each image independently, the
the traditional feedforward network architectures (e.g., VGG
subscript i is dropped hereafter for notational simplicity.
Nets [33]).
x = {xj , j = 1, 2, · · · , |x|} represents a raw input image,
Network design. Several recent studies in computer vi-
y = {yj , j = 1, 2, · · · , |x|, yj ∈ {0, 1}} denotes its corre-
sion [26–28] have shown that ResNet [29] is capable of
sponding manually annotated pixel-wise segmentation mask,
offering better features for pixel-wise prediction tasks such as
and z = {rk , k = 0, 1, · · · , K} is the instance label, where
semantic segmentation [26, 27] and depth estimation [28]. We,
rk indicates a set of pixels inside the k-th region1 . K is the
therefore, make use of ResNet to construct the segmentation
total number of vehicle instances in the image, and r0 is the
network in our work. We initialize a ResFCN from the original
background area. When k takes other values, it denotes the
version of ResNet [29], instead of the newly presented pre-
corresponding vehicle instance. Note that instance labels only
activation version [34]. Unlike [30], we directly remove the
count vehicle instances, thus they are commutative. Our aim
fully connected layers from the original ResNet but do not
is to segment vehicles while ensuring that all instances are
convolutionalize these layers so as to make one prediction per
differentiated. In this work, we approximate vehicle detec-
spatial location. Moreover, we keep the 7 × 7 convolutional
tion by semantic boundary detection2 . We generate semantic
layer and 3 × 3 max-pooling layer, which can enlarge the field
boundary labels b through z to train a boundary detector, in
of view for feature representations. One of recent trend in
which b = {bj , j = 1, 2, · · · , |x|, bj ∈ {0, 1}} and bj equals
network architecture design is stacking convolutional layers
1 when it belongs to boundaries.
with small convolution kernels (e.g., 3 × 3 and 1 × 1) in the
In this section, we describe in detail our proposed se- entire network because the stacked small kernels are more
mantic boundary-aware multi-task learning network for accu- efficient than a large filter, given the same computational
rate vehicle instance segmentation. We start by introducing complexity. However, a recent study [35] found that the
the FCN architecture for end-to-end semantic segmentation large filter also plays an important role when classification
in Section II-A. Furthermore, we propose to exploit multi- and localization tasks are performed simultaneously. This
level contextual feature representations, generated by different can be easily understood through the analogy of individuals
stages of a residual network, to construct a residual FCN commonly confirming the category of a pixel by referring to
for producing better likelihood maps of vehicle regions or its surrounding context region.
semantic boundaries (see Section II-B). Then, in Section II-C, By now, the output feature maps are only 1/32 the reso-
we elaborate the semantic boundary-aware unified multi-task lution of their original input image, which is apparently too
learning network drawn from the residual FCN for effective low to precisely differentiate individual pixels. To deal with
instance segmentation by jointly optimizing the complemen- this problem, Long et al. [30] made use of backwards strided
tary tasks. convolutions that upsample the feature maps and output score
masks. The motivation behind this is that the convolutional
1 Regions in the image satisfy r ∩ r = ∅, ∀k 6= t and ∪r = Ω, in
k t k layers and max-pooling layers focus on extracting high-level
where Ω is the whole image region. abstract features whereas the backwards strided convolutions
2 Semantic boundary detection is to detect the boundaries of each object
instance in the images. Compared to edge detection, it focuses more on the estimate the score masks in a pixel-wise way. Ghiasi et
association of boundaries and their object instances. al. [36] proposed a multi-resolution reconstruction architecture
,IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, IN PRESS 4
Fig 2. The network architecture of the ResFCN we use, as illustrated in Section II-B. We incorporate multi-level contextual features from the last 32 × 32,
16 × 16, and 8 × 8 layers of a classification ResNet since making use of information from fairly early fine-grained layers is beneficial to segmenting small
objects such as vehicles. To get the desired full resolution output, we use 1 × 1 convolutional layers followed by upsampling operations to upsample back to
the spatial resolution of the input image. Then, predictions from different residual blocks are fused together with a summing operation.
based on a Laplacian pyramid that uses skip connections in training and test errors (i.e., so-called degradation problem).
from higher resolution feature maps and multiplicative gating A residual learning-based network is composed of a sequence
to successively refine segment boundaries reconstructed from of residual blocks and exhibits significantly improved training
lower-resolution maps. Inspired by the existing works, in this characteristics, providing the opportunity to make network
paper, we exploit multi-level contextual feature representations depths that were previously unattainable. The output xn of
that include information from different residual blocks (i.e., the n-th residual block in a ResNet can be computed as
different levels of contextual information). Fig. 2 shows the
illustration of the ResFCN architecture we use with multi- xn = H(xn−1 ; Θn ) + xn−1 , (2)
level contextual features. More specifically, we incorporate
feature representations from the last 32 × 32, 16 × 16, and where H(xn−1 ; Θn ) is the residual, which is parametrized
8 × 8 layers of the original ResNet since making use of by Θn . The core insight of ResNet is that the addition of a
information from fairly early fine-grained layers is beneficial shortcut connection from the input xn−1 to the output xn
to segmenting small objects such as vehicles. To get the bypasses two or more convolutional layers by performing
desired full resolution output, we used a 1 × 1 convolutional identity mapping and is then added together with the output of
layer, which adaptively squashes the number of channels down stacked convolutions. By doing so, H only computes a residual
to the number of labels (1 for binary classification), takes instead of computing the output xn directly.
advantage of the upsampling operation to upsample back to the In the experiments, we found that ResFCN can offer better
spatial resolution of the input image, and makes predictions performance than other FCNs based on traditional feedforward
based on contextual cues from the given fields of view. Then, network architecture, such as VGG-FCN. What is the reason
these predictions are fused together with a summing operation, behind this? To answer this question, we need to go deeper.
and the final segmentation results are generated after sigmoid According to the characteristics of ResFCN, we can easily get
classification. the following recurrence formula
Why residual learning? Until recently, the majority of
m−1
feedforward networks, like AlexNet [37] and VGG Nets [33], X
were made up of a linear sequence of layers. xn−1 and xn xm = H(xi ; Θi+1 ) + xn−1 , (3)
i=n−1
are denoted as the input and output of the n-th layer/block,
respectively, and each layer in such a network learns the for any deeper residual block m and any shallower residual
mapping function F: block n. Eq. (3) shows that ResFCN creates a direct path
xn = F(xn−1 ; Θn ) , (1) for propagating information of shallow layers (i.e., xn−1 )
through the entire network. Several recent studies [38, 39] that
where Θn is the parameters of the n-th layer. This kind of attempt to reveal what were learned by CNNs show that deeper
network is also often referred to as a traditional feedforward layers exploit filters to grasp global high-level information
network. while shallower layers capture low-level details, such as object
According to a study by He et al. [29], simply deepening boundaries and edges, which are of great importance in small
traditional feedforward networks usually leads to an increase object detection/segmentation. In addition, when we dive into
,IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, IN PRESS 5
Fig 3. Overall architecture of the proposed semantic boundary-aware ResFCN. We propose to use such a unified multi-task learning network for vehicle
instance segmentation, which creates two separate, yet identical branches to jointly optimize two complementary tasks, namely, vehicle semantic segmentation
and semantic boundary detection. The latter subproblem is beneficial for differentiating “touching” vehicles and further improving the instance segmentation
performance.
the backward propagation process, according to the chain rule Kirillov et al. [40] propose InstanceCut, which represents
of backpropagation, we can obtain instance segmentation by two modalities, namely a seman-
∂E ∂E ∂xm tic segmentation and all instance-boundaries. The former is
= computed from a CNN for semantic segmentation, and the
∂xn−1 ∂xm ∂xn−1
latter is derived from a instance-aware edge detector. But
m−1 (4)
∂E ∂ X this approach does not address end-to-end learning. In the
= (1 + H(xi ; Θi+1 )) ,
∂xm ∂xn−1 i=n−1 remote sensing community, Marmanis et al. [41] propose a
two-step model that learns a CNN to separately output edge
where E is the loss function of the network. As exhibited likelihoods at multiple scales from color-infrared (CIR) and
in Eq. (4), the gradient ∂x∂En−1
can be decomposed into two height data. Then, the boundaries detected with each source
∂E ∂
Pm−1
i=n−1 H) that passes
additive terms: the term ∂x ( are added as an extra channel to each source, and a network
m ∂xn−1
information through the weight layers, and the term ∂x∂E
that is trained for semantic segmentation purposes. The intuition
m
directly propagates without concerning any weight layers. The behind this work is that using predicted boundaries helps
latter term ensures that the information can also be directly to achieve sharper segmentation maps. In contrast, we train
propagated back to any shallower residual block n. one end-to-end network that takes as input color images and
In brief, the properties of the forward and backward prop- predicts segmentation maps and object boundaries, in order to
agation procedures of the ResFCN make it possible to shuttle augment the performance of segmentation at instance level.
the low-level visual information directly across the network, To this end, we train a deep semantic boundary-aware Res-
which is quite helpful for our vehicle (small object) instance FCN for effective vehicle instance segmentation (i.e., segment-
segmentation tasks. ing the vehicles and splitting clustered instances into individual
ones). Fig. 3 shows an overview of the proposed network.
Specifically, we formulate it as a unified multi-task learning
C. Semantic Boundary-Aware ResFCN network architecture by exploring the complementary informa-
By exploiting the multi-level contextual features, the Res- tion (i.e., vehicle region and semantic boundaries), instead of
FCN is capable of producing good likelihood maps of vehicles. treating the vehicle segmentation problem as an independent
It is, however, still difficult to differentiate vehicles with a very and single task, which can simultaneously learn the detections
close distance by only leveraging the probability of vehicles, of vehicle regions and corresponding semantic boundaries. As
due to the ambiguity in “touching” regions. This is rooted shown in Fig. 3, the feature representations extracted from
in the loss of spatial details caused by max-pooling layers multiple residual blocks are upsampled with two separate, yet
(downsampling) along with feature abstraction. The semantic identical branches to predict the semantic segmentation masks
boundaries of vehicles provide good complementary cues that of vehicles and semantic boundaries, respectively. In each
can be used for separating instances. branch, the mask is estimated by the ResFCN with multi-level
Some approaches in computer vision and remote sensing contextual features as illustrated in Section II-B. Since we
have been explored for modeling segmentation and boundary have only two categories (foreground/vehicles vs. background
prediction jointly in a combinatorial framework. For example, and semantic boundaries vs. non-boundaries), sigmoid and
,IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, IN PRESS 6
TABLE I
V EHICLE C OUNTS AND N UMBER OF V EHICLE P IXELS IN ISPRS P OTSDAM DATASET
Test Set
Training Set
2 12 5 12 7 7 7 8 7 9
Vehicle Count 4,433 123 427 301 309 305
Number of Pixels 1,184,789 36,236 122,332 76,892 77,669 74,404
TABLE II
V EHICLE C OUNTS AND N UMBER OF V EHICLE P IXELS IN B USY PARKING L OT UAV V IDEO DATASET
train the network, since, for this task, it shows much faster
convergence than standard stochastic gradient descent (SGD)
with momentum [45] or Adam [46]. We fixed almost all of
the parameters of Nesterov Aadam as recommended in [43]:
β1 = 0.9, β2 = 0.999, = 1e−08, and a schedule decay of
0.004, making use of a fairly small learning rate of 2e−04. All
weights in the newly added layers are initialized with a Glorot
uniform initializer [47] that draws samples from a uniform
distribution. In our experiments, we note that the pixel-wise
F1 score of the network is less sensitive to the parameter λ,
and the instance-level performance is relatively sensitive to λ.
Based on the sensitivity analysis (cf. Fig. 5), we set it as 0.1.
Fig 6. Frame@1s from the proposed Busy Parking Lot UAV Video dataset
for vehicle instance segmentation. Four zoomed-in areas are shown on the
bottom.
TABLE III
P IXEL - LEVEL OA S AND F1- SCORES FOR THE CAR CLASS ON ISPRS
P OTSDAM DATASET
Fig 5. A sensitivity analysis for the parameter λ on ISPRS Potsdam dataset. avoid overfitting. Moreover, we use fairly small mini-batches
of 8 image pairs because, in a sense, every pixel is a training
The networks are trained on the training set of the ISPRS sample. We train our network on a single NVIDIA GeForce
Potsdam dataset to predict instance segmentation maps. The GTX TITAN with 12 GB of GPU memory, which takes about
training set has only 931 unique 256 × 256 patches. We make two hours.
use of the data augmentation technique to increase the number
of training samples. The RGB patches and corresponding
pixel-wise ground truth are transformed by horizontally and C. Qualitative Evaluation
vertically flipping three-quarters of the patches. By doing so, Some vehicle instance segmentation results are shown in
the number of training samples increases to 14,896. To monitor Fig. 7 (test set of ISPRS Potsdam dataset) and Fig. 9 (Busy
overfitting during training, we randomly select 10% of the Parking Lot dataset), respectively, in order to qualitatively
training samples as the validation set; i.e., splitting the training illustrate the efficacy of our model. First, we compare various
set into 13,406 training and 1,490 validation pairs. We train CNN variants used for FCN architecture to determine which
the network for 50 epochs and make use of early stopping to one is the best-suited for our task. In Fig. 7, we qualitatively
,IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, IN PRESS 8
Fig 7. Instance segmentation results of ISPRS Potsdam dataset (from left to right): ground truth, VGG-FCN, Inception-FCN, Xception-FCN, ResFCN, and
B-ResFCN (different colors denote individual vehicle objects). The three areas are derived from Fig. 4.
TABLE IV
D ETECTION R ESULTS OF D IFFERENT N ETWORKS ON ISPRS P OTSDAM S EMANTIC L ABELING DATASET (I NSTANCE - LEVEL F1 S CORE , P RECISION , AND
R ECALL )
2 12 5 12 7 7 7 8 7 9
Model
F1 P R F1 P R F1 P R F1 P R F1 P R
VGG-FCN 66.04 70.00 62.50 57.00 61.45 53.14 59.21 61.95 56.70 57.21 66.84 50.00 61.31 65.91 57.31
B-VGG-FCN 70.27 68.42 72.22 69.85 67.42 72.47 71.03 68.47 73.79 67.96 66.86 69.09 66.47 60.96 73.08
Inception-FCN 51.91 55.45 48.80 31.65 37.42 27.42 40.00 43.41 37.08 27.79 31.70 24.74 40.87 45.02 37.42
B-Inception-FCN 55.15 50.61 60.58 46.14 47.42 44.92 53.81 52.91 54.75 43.47 42.45 44.54 50.74 47.49 54.47
Xception-FCN 96.92 98.21 95.65 83.55 81.11 86.14 93.33 94.59 92.11 92.05 93.10 91.01 93.92 96.59 91.40
B-Xception-FCN 97.00 100 94.17 88.40 88.60 88.19 93.65 96.47 91.00 93.58 97.54 89.94 94.63 97.50 91.92
ResFCN 97.93 100 95.93 83.88 80.84 87.15 94.72 96.86 92.67 95.62 97.93 93.42 95.25 96.23 94.30
B-ResFCN 98.31 100 96.67 88.57 87.08 90.11 96.43 97.12 95.74 95.19 97.88 92.64 95.76 97.83 93.77
TABLE V
D ETECTION R ESULTS OF D IFFERENT M ETHODS ON PROPOSED B USY PARKING L OT UAV V IDEO DATASET (I NSTANCE - LEVEL F1 S CORE , P RECISION ,
AND R ECALL )
harmonic mean of instance-level precision P and recall R, high-density parking, strong light conditions, critical effects
defined as: of shadow, and a slightly blurry image quality lead to the fact
2P R Ntp Ntp that networks achieved a more inferior performance on the
F1 = ,P = ,R = , (7) proposed dataset than on the Potsdam scene.
P +R Ntp + Nf p Ntp + Nf n
2) Segmentation: The dice similarity coefficient is often
where Ntp , Nf p , and Nf n are the number of true positives, used to evaluate segmentation performance. Given a set of
false positives, and false negatives, respectively. Here, the pixels V denoted as a segmented vehicle and a set of pixels
ground truth for each segmented vehicle is the object in G annotated as a ground truth object, the Dice similarity
the manually labeled segmentation mask that has maximum coefficient is defined as:
overlap with the segmented vehicle. When calculating Ntp and
Nf p , a segmented vehicle that intersects with at least 50% D(V , G) = 2(|V ∩ G|)/(|V | + |G|) . (8)
of its ground truth is considered a true positive; otherwise This, however, is not suitable for segmentation evaluation on
it is regarded as a false positive. For Nf n , a false negative individual objects (i.e., instance segmentation). Instead, in this
indicates a ground truth object that has less than 50% of its paper, an instance-level Dice similarity coefficient is defined
area overlapped by its corresponding segmented vehicle or has and employed as:
no corresponding segmented vehicle.
NV NG
The detection results of different networks on the ISPRS 1X X
Potsdam dataset and Busy Parking Lot scene are shown in Dins (V , G) = [ ωi D(Vi , Gi )+ ω̃j D(Ṽj , G̃j )] , (9)
2 i=1 j=1
Table IV and Table V, respectively. Among the networks
without semantic boundary component, ResFCN surpasses where Vi , Gi , G̃j , and Ṽj are the i-th segmented vehicle,
all other models (VGG-FCN, Inception-FCN, and Xception- ground truth object that maximally overlaps Vi , j-th ground
FCN), highlighting the strength of residual learning-based truth object, and segmented vehicle that maximally overlaps
FCN architecture with multi-level contextual feature represen- G̃j , respectively. NV and NG respectively denote the total
tations in our task. The network with the semantic boundary number of segmented vehicles and ground truth objects. Fur-
component – i.e., B-ResFCN – achieved the best results on thermore, ωi and ω̃j are both coefficients and can be calculated
most test images of the ISPRS Potsdam scene and surpassed as:
the others by a significant margin on the Busy Parking |Vi | |G̃j |
Lot dataset, demonstrating the effectiveness of the semantic ωi = PNV , ω̃j = PNG . (10)
boundary-aware multi-task learning network in this instance k=1 |Vk | k=1 |G̃k |
segmentation problem. From Table IV and Table V, we Table VII and Table VI show the segmentation results of
observe that all networks yield a fairly lower instance-level different approaches on the Potsdam scene and Busy Parking
F1, precision, and recall on the Busy Parking Lot dataset Lot dataset, respectively. We can see that our B-ResFCN
than on the ISPRS Potsdam dataset. This mainly comes from achieves the best performance on both these two datasets.
the different difficulty levels of the two datasets. Specifically, Compared to the ResFCN, there is a 1.16% increment in terms
,IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, IN PRESS 10
TABLE VI
S EGMENTATION R ESULTS OF D IFFERENT M ETHODS ON B USY PARKING L OT UAV V IDEO DATASET (I NSTANCE - LEVEL D ICE S IMILARITY C OEFFICIENT )
Fig 8. Example segmentations using the proposed B-ResFCN in several frames of the Busy Parking Lot dataset.
analysis,” in IEEE International Geoscience and Remote tures,” in IEEE International Geoscience and Remote
Sensing Symposium (IGARSS), 2016. Sensing Symposium (IGARSS), 2012.
[10] G. Kopsiaftis and K. Karantzalos, “Vehicle detection [12] T. Moranduzzo and F. Melgani, “Automatic car count-
and traffic density monitoring from very high resolution ing method for unmanned aerial vehicle images,” IEEE
satellite video data,” in IEEE International Geoscience Transactions on Geoscience and Remote Sensing, vol. 52,
and Remote Sensing Symposium (IGARSS), 2015. no. 3, pp. 1635–1647, 2014.
[11] W. Shao, W. Yang, G. Liu, and J. Liu, “Car detection [13] ——, “Detecting cars in UAV images with a catalog-
from high-resolution aerial imagery using multiple fea- based approach,” IEEE Transactions on Geoscience and
,IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, IN PRESS 12
Fig 9. Instance segmentation maps of Busy Parking Lot dataset (from left to right): ground truth, Inception-FCN, Xception-FCN, ResFCN, and B-ResFCN
(different colors denote individual vehicles). The four areas are derived from Fig. 6.
Remote Sensing, vol. 52, no. 10, pp. 6356–6367, 2014. Transactions on Geoscience and Remote Sensing, vol. 55,
[14] K. Liu and G. Mattyus, “Fast multiclass vehicle detection no. 7, pp. 3639–3655, 2017.
on aerial images,” IEEE Geoscience and Remote Sensing [20] X. Chen, S. Xiang, C.-L. Liu, and C.-H. Pan, “Vehicle
Letters, vol. 12, no. 9, pp. 1938–1942, 2015. detection in satellite images by hybrid deep convolutional
[15] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, neural networks,” IEEE Geoscience and Remote Sensing
and F. Fraundorfer, “Deep learning in remote sensing: Letters, vol. 11, no. 10, pp. 1797–1801, 2014.
A comprehensive review and list of resources,” IEEE [21] N. Ammour, H. Alhichri, Y. Bazi, B. Benjdira, N. Ala-
Geoscience and Remote Sensing Magazine, vol. 5, no. 4, jlan, and M. Zuair, “Deep learning approach for car
pp. 8–36, 2017. detection in UAV imagery,” Remote Sensing, vol. 9, no. 4,
[16] L. Mou and X. X. Zhu, “IM2HEIGHT: Height p. 312, 2017.
estimation from single monocular imagery via fully [22] N. Audebert, B. Le Saux, and S. Lefèvre, “Segment-
residual convolutional-deconvolutional network,” before-detect: Vehicle detection and classification
arXiv:1802.10249, 2018. through semantic segmentation of aerial images,”
[17] L. Mou, P. Ghamisi, and X. X. Zhu, “Unsupervised spec- Remote Sensing, vol. 9, no. 4, p. 368, 2017.
tralspatial feature learning via deep residual convdeconv [23] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet:
network for hyperspectral image classification,” IEEE A deep convolutional encoder-decoder architecture for
Transactions on Geoscience and Remote Sensing, vol. 56, image segmentation,” arXiv:1511.00561, 2015.
no. 1, pp. 391–406, 2018. [24] M. Kampffmeyer, A.-B. Salberg, and R. Jenssen, “Detec-
[18] L. Mou, L. Bruzzone, and X. X. Zhu, “Learning spectral- tion of small objects, land cover mapping and modelling
spatial-temporal features via a recurrent convolutional of uncertainty in urban remote sensing images using deep
neural network for change detection in multispectral convolutional neural networks,” in IEEE International
imagery,” arXiv:1803.02642, 2018. Conference on Computer Vision and Pattern Recognition
[19] L. Mou, P. Ghamisi, and X. Zhu, “Deep recurrent neural (CVPR) Workshop, 2016.
networks for hyperspectral image classification,” IEEE [25] D. Eigen and R. Fergus, “Predicting depth, surface
,IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, IN PRESS 13
normals and semantic labels with a common multi- Recognition (CVPR), 2015.
scale convolutional architecture,” in IEEE International [40] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy,
Conference on Computer Vision and Pattern Recognition and C. Rother, “Instancecut: from edges to instances
(CVPR), 2015. with multicut,” in IEEE International Conference on
[26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid Computer Vision and Pattern Recognition (CVPR), 2017.
scene parsing network,” in IEEE International Con- [41] D. Marmanis, K. Schindler, J. D. W. amd S. Galliani,
ference on Computer Vision and Pattern Recognition M. Datcu, and U. Stilla, “Classification with an edge:
(CVPR), 2017. Improving semantic image segmentation with boundary
[27] Z. Wu, C. Shen, and A. van den Hengel, “High- detection,” ISPRS Journal of Photogrammetry and Re-
performance semantic segmentation using very deep fully mote Sensing, vol. 135, pp. 158–172, 2018.
convolutional networks,” arXiv:1604.04339, 2016. [42] F. Rottensteiner, G. Sohn, J. Jung, M. Gerke, C. Baillard,
[28] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and S. Benitez, and U. Breitkopf, “The ISPRS benchmark
N. Navab, “Deeper depth prediction with fully convolu- on urban object classification and 3D building recon-
tional residual networks,” in International Conference on struction,” ISPRS Annals of the Photogrammetry, Remote
3D Vision (3DV), 2016. Sensing and Spatial Information Sciences, vol. 1, no. 3,
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual pp. 293–298, 2012.
learning for image recognition,” in IEEE International [43] T. Dozat, “Incorporating Nesterov momentum into
Conference on Computer Vision and Pattern Recognition Adam,” http://cs229.stanford.edu/proj2015/054 report.
(CVPR), 2016. pdf, online.
[30] J. Long, E. Shelhamer, and T. Darrell, “Fully convolu- [44] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On
tional networks for semantic segmentation,” in IEEE In- the importance of initialization and momentum in deep
ternational Conference on Computer Vision and Pattern learning,” in IEEE International Conference on Machine
Recognition (CVPR), 2015. Learning (ICML), 2013.
[31] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and [45] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard,
A. L. Yuille, “Deeplab: Semantic image segmentation W. Hubbard, and L. Jackel, “Backpropagation applied to
with deep convolutional nets, atrous convolution, and handwritten zip code recognition,” Neural Computation,
fully connected CRFs,” in arXiv:1606.00915, 2016. vol. 1, no. 4, pp. 541–551, 1989.
[32] S. Zhang, S. Jayasumana, B. Romera-Paredes, V. Vineet, [46] D. P. Kingma and J. L. Ba, “Adam: A method for stochas-
Z. Su, D. Du, C. Huang, and P. H. S. Torr, “Conditional tic optimization,” in IEEE International Conference on
random fields as recurrent neural networks,” in IEEE Learning Representations (ICLR), 2015.
International Conference on Computer Vision (ICCV), [47] X. Glorot and Y. Bengio, “Understanding the difficulty of
2015. training deep feedforward neural networks,” in Interna-
[33] K. Simonyan and A. Zisserman, “Very deep convolu- tional Conference on Artificial Intelligence and Statistics
tional networks for large-scale image recognition,” in (AISTATS), 2010.
IEEE International Conference on Learning Represen- [48] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wo-
tation (ICLR), 2015. jna, “Rethinking the inception architecture for computer
[34] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings vision,” in IEEE International Conference on Computer
in deep residual networks,” in European Conference on Vision and Pattern Recognition (CVPR), 2016.
Computer Vision (ECCV), 2016. [49] F. Chollet, “Xception: Deep learning with depthwise sep-
[35] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large arable convolutions,” in IEEE International Conference
kernel matters – improve semantic segmentation by on Computer Vision and Pattern Recognition (CVPR),
global convolutional network,” in IEEE International 2017.
Conference on Computer Vision and Pattern Recognition
(CVPR), 2017.
[36] G. Ghiasi and C. C. Fowlkes, “Laplacian pyramid re-
construction and refinement for semantic segmentation,” Lichao Mou (S’16) received the bachelor’s degree
in automation from the Xi’an University of Posts and
in European Conference on Computer Vision (ECCV), Telecommunications, Xi’an, China, in 2012, and the
2016. master’s degree in signal and information processing
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet from the University of Chinese Academy of Sci-
ences, Beijing, China, in 2015. In 2015, he spent
classification with deep convolutional neural networks,” six months at the Computer Vision Group at the
in Advances in Neural Information Processing Systems University of Freiburg in Germany. He is currently
(NIPS), 2012. working toward the Ph.D. degree at the German
Aerospace Center (DLR), Wessling, Germany, and
[38] M. D. Zeiler and R. Fergus, “Visualizing and understand- the Technical University of Munich (TUM), Munich,
ing convolutional networks,” in European Conference on Germany. His research interests include remote sensing, computer vision, and
Computer Vision (ECCV), 2014. machine/deep learning, especially remote sensing video analysis and deep
networks with their applications in remote sensing. He was the recipient of
[39] A. Mahendran and A. Vedaldi, “Understanding deep the first place in the 2016 IEEE GRSS Data Fusion Contest and a finalist
image representations by inverting them,” in IEEE In- for the Best Student Paper Award at the 2017 Joint Urban Remote Sensing
ternational Conference on Computer Vision and Pattern Event.
,IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, IN PRESS 14