
Multimed Tools Appl (2018) 77:22475–22488

https://doi.org/10.1007/s11042-018-6056-8

RGB-D joint modelling with scene geometric information for indoor semantic segmentation

Hong Liu 1 & Wenshan Wu 1 & Xiangdong Wang 1 & Yueliang Qian 1

Received: 15 September 2017 / Revised: 4 April 2018 / Accepted: 24 April 2018 / Published online: 21 May 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract This paper focuses on the problem of RGB-D semantic segmentation for indoor
scenes. We introduce a novel gravity direction detection method based on vertical line fitting
that combines 2D visual information and 3D geometric information to improve the original HHA
depth encoding. To fuse the two-stream deep convolutional networks trained on RGB and on the
depth encoding, we propose a joint modelling method that learns a weighted summing layer
to fuse the prediction results. Finally, to refine the pixel-wise score maps, we adopt a fully-
connected CRF as post-processing and propose a pairwise potential function with an additional
normal kernel to exploit geometric information. Experimental results show that our proposed
approach achieves state-of-the-art performance for RGB-D semantic segmentation on a public dataset.

Keywords RGB-D · Gravity direction · Semantic segmentation · CRF

1 Introduction

Semantic segmentation aims to label every pixel in a scene image with the category of the
object it belongs to, such as person, car, or tree. This task is challenging because it combines
traditional tasks such as segmentation and multi-label recognition in a single process.

* Hong Liu
hliu@ict.ac.cn

Wenshan Wu
wuwenshan@ict.ac.cn
Xiangdong Wang
xdwang@ict.ac.cn
Yueliang Qian
ylqian@ict.ac.cn

1 Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing
Technology, Chinese Academy of Sciences, Beijing 100190, China

Some previous research focuses on outdoor scenes [8, 13, 20, 22, 23]. Recently,
semantic segmentation of indoor RGB-D images has attracted increasing attention [2, 5, 6, 10,
11, 14, 21, 24, 25]. Some of the earlier methods used manually designed features [1, 9, 10, 15, 21,
24]. More recently, Convolutional Neural Networks (CNNs) were shown to be effective for RGB
visual tasks including image classification [16, 26, 27], object detection [16, 26, 27] and
semantic segmentation [1, 8, 11, 20]. Inspired by this work, CNNs have also been adopted for RGB-D
semantic segmentation to learn representations from the depth image and the RGB image together.
Farabet [8] treated the depth image as one extra channel and combined it with RGB into a
four-channel input to a CNN, but the improvement over RGB-only semantic segmentation was small.
Instead of using the depth image directly, most recent work trains CNN models on the HHA encoding
of depth data proposed by Gupta [11], which consists of Horizontal disparity, Height above the
ground, and the Angle between the normal and gravity. The gravity direction in the point cloud is
a key quantity in HHA encoding, needed to compute the height above the ground and the angle
between the normal and gravity. Gupta [11] estimated the gravity direction by finding the
direction to which analogous horizontal and vertical planes are most aligned or most orthogonal,
based on locally estimated surface normal directions. One limitation is that there might not
be enough horizontal and vertical planes in some scenes. We observe that most indoor scenes
contain a number of vertical lines, such as the edges of doors and cabinets or the corners of
walls, and these verticals are consistent with the gravity direction of the scene. However,
detecting vertical lines directly in the 3D point cloud is time-consuming. We therefore propose a
novel algorithm that first detects candidate vertical lines in the 2D RGB image and then confirms
them in the 3D point cloud, reducing the computational complexity.
With the depth encoding, we can learn a geometric representation of the scene by training a CNN
model. For RGB-D semantic segmentation, how to combine the RGB model and the depth-encoding
model, and how to jointly model visual and geometric representations, are important problems.
Existing work is mainly divided into early fusion and late fusion, for example feature
fusion [8, 28] and prediction fusion [19]. Long [19] extended the FCN network to RGB-D
semantic segmentation by training two CNNs separately and then averaging the two prediction
results at the final layer. In this paper, we adopt the DCNN model [4] to train the RGB
and depth models, and propose a joint modelling method that learns a weighted summing
layer to fuse the prediction results of the RGB-DCNN and HHA-DCNN models.
In addition, the semantic segmentation results from CNN or DCNN models are very coarse at
object edges, losing the fine contour information of the objects. Several papers [4, 18] used a
Conditional Random Field (CRF) to promote label consistency for RGB scene images. To the best
of our knowledge, ours is the first work to adopt a CRF as post-processing for RGB-D semantic
segmentation, refining the output of the jointly modelled DCNN. In our pairwise potential, we add
pixel-wise normal information as a geometric constraint alongside the RGB kernel to
promote consistency of the pixel-wise labelling. Our main contributions are as follows:

(1) We propose a fast gravity direction detection algorithm based on vertical line fitting. To
reduce computing time, we first detect vertical lines in the 2D RGB image, then fit these
lines in the 3D point cloud, and finally use the mean direction as the gravity direction.
The new gravity detection method improves the robustness of the HHA depth encoding.
(2) We adopt the stronger DCNN network to train on RGB and on the HHA depth encoding. We then
learn a weighted summing convolution layer to jointly model the prediction results from
RGB-DCNN and HHA-DCNN. This two-stream network can be learned end-to-end.
(3) We propose a CRF pairwise potential function that combines normal information with color
and location information as a geometric constraint, and use a phased grid search to reduce
the computational complexity of parameter selection.
(4) We achieve state-of-the-art performance on the public NYUDv2 dataset for the RGB-D
semantic segmentation task.

2 Related work

2.1 CNN and CRF in semantic segmentation

CNNs can learn hierarchical feature representations of visual objects and have made great
progress in image classification and object detection. For the semantic segmentation task,
however, a CNN only produces a coarse output map, whereas pixel-wise semantic segmentation
needs dense labelling at the original image resolution. To solve this problem, Long [19] proposed
fully convolutional networks (FCN), which achieve end-to-end dense labelling.
Chen [4] used deep convolutional nets (DCNN) to improve pixel-wise labelling and
refined the boundaries with a fully connected CRF. However, the above work focused on semantic
segmentation of RGB images.

2.2 RGB-D semantic segmentation

Farabet [8] extended CNN hierarchical features to RGB-D images, using the depth image
along with the RGB image as a four-channel input to a CNN to extract multi-scale features, but
this brought little improvement over the RGB-only model. For RGB-D indoor semantic
segmentation, Gupta [11] extended RCNN to RGB-D object detection and segmentation
with the HHA depth encoding they proposed, which considers horizontal disparity,
height above the ground, and the angle between the normal and gravity. These features contain
rich scene geometric information, and there is common structure between HHA images and RGB
images, so a network designed for RGB images can also learn a suitable representation
from HHA images.
Long [19] extended the FCN framework to RGB-D indoor semantic segmentation with HHA.
They trained two networks separately and then averaged the two prediction results at the
final layer, which achieved state-of-the-art results on the NYUDv2 dataset [25]. Eigen [7] used
RGB, depth and normals as input and concatenated multi-scale CNN features. For RGB-D
feature fusion, Wang [28] proposed a feature transformation network that complements RGB and
depth features with a deconvolution network. Li [17] proposed an LSTM fusion layer for RGB
and HHA features, which achieved the best mean accuracy on NYUDv2 for the forty-class
task. He [12] concatenated pool-4 features of the RGB and HHA models and used super-pixels
to refine boundaries during upsampling, achieving state-of-the-art performance.
Most of the above methods use HHA as the depth encoding, and the gravity direction in the point
cloud is a key quantity for HHA. Gupta [11] estimated the gravity direction by finding
the direction of analogous horizontal and vertical planes, which is limited when there are not
enough horizontal and vertical planes in the scene. Moreover, no previous work uses a CRF for
RGB-D semantic segmentation.

3 Our proposed method

3.1 Our framework

We propose the system framework shown in Fig. 1. First, we improve the HHA depth encoding
based on a novel gravity direction detection method, so that a better representation can be
learned from the scene geometric information. Second, we use DCNN [4] to train the RGB and HHA
models separately, then add a fusion layer that learns the weights of the two networks and
computes a weighted sum of the two prediction results; this two-stream network is jointly
modelled and trained end-to-end. Third, we adopt a CRF as post-processing for RGB-D semantic
segmentation to further improve the consistency of the segmentation. In our CRF, we use the
DCNN output as the unary potential and add normal information into the pairwise potential
function to refine the segmentation results. We introduce the details in the following.

Fig. 1 Our proposed system framework

3.2 Gravity direction based on vertical lines fitting

Most existing work adopts HHA as the depth encoding for RGB-D semantic segmentation, and
two dimensions of HHA, the height and the angle, are computed with respect to gravity, so
detecting the gravity direction is important for HHA. We find that most indoor objects and
structures have clear vertical edges that are consistent with the gravity direction of the
scene, such as the edges of doors and cabinets or the corners of walls. We propose a novel
method to detect the gravity direction by vertical line fitting. Since detecting vertical
lines directly in the 3D point cloud is time-consuming, we first extract lines in the 2D RGB
image using the probabilistic Hough transform, then filter these lines to obtain candidate
vertical lines, and finally fit the candidate lines in the 3D point cloud to obtain the gravity
direction of the scene.

3.2.1 Candidate vertical lines detection in RGB image

First, we detect all lines in the 2D RGB image with the Hough transform. To obtain the most
reliable verticals from the scene, we sort these lines by the length of their projection along
the Y-axis. We then keep the lines that are visually vertical by selecting those within a
certain angle of the Y-axis. However, some horizontal edges also appear visually vertical in the
two-dimensional RGB image, so in the 3D point cloud we check the normal information of the
candidate lines to reject these near-horizontal lines. Figure 2 shows some vertical line
detection results in RGB images, with the detected lines marked in red. Algorithm 1 gives the
details of candidate vertical line detection in the 2D image.
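Algorithm 1 itself is not reproduced here; purely as an illustration of this style of candidate detection, the following is a minimal OpenCV/NumPy sketch. The Canny thresholds, Hough parameters and angle threshold are illustrative assumptions, not the exact values used in the paper.

```python
import cv2
import numpy as np

def detect_candidate_verticals(rgb, angle_thresh_deg=10.0, top_k=20):
    """Detect lines that are visually vertical (close to the image Y-axis)."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                      # edge map for the Hough transform
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=50,
                            minLineLength=40, maxLineGap=5)
    if lines is None:
        return []

    candidates = []
    for x1, y1, x2, y2 in lines[:, 0]:
        dx, dy = x2 - x1, y2 - y1
        # angle between the line and the vertical (Y) axis
        angle = np.degrees(np.arctan2(abs(dx), abs(dy)))
        if angle < angle_thresh_deg:
            candidates.append(((x1, y1), (x2, y2), abs(dy)))

    # keep the lines with the longest projection along the Y-axis
    candidates.sort(key=lambda c: c[2], reverse=True)
    return [c[:2] for c in candidates[:top_k]]
```

The surviving candidates would then be checked against the point-cloud normals, as described above, before the 3D fitting step.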

3.2.2 Direction of gravity based on vertical fitting

Fig. 2 Vertical line detection results in two scenes, with detected lines marked in red

After extracting near-vertical lines from the RGB image, the next step is to fit these verticals
in the 3D point cloud to obtain the gravity direction. We use the least-squares line fitting
proposed by Wang [3]. As Fig. 3 shows, let (l, m, n) be the estimated line direction,
(x_0, y_0, z_0) a point on the line, and (x_1, y_1, z_1), ..., (x_k, y_k, z_k) the points to be
fitted. The line equation in 3D space is

\frac{x - x_0}{l} = \frac{y - y_0}{m} = \frac{z - z_0}{n} \qquad (1)

The cost function is the sum of the squared distances from all the points to the fitted line:

\min \sum_{i=1}^{k} d_i^2 \quad \text{s.t.} \quad l^2 + m^2 + n^2 = 1 \qquad (2)

where d_i is the distance from (x_i, y_i, z_i) to the fitted line:

d_i^2 = (x_i - x_0)^2 + (y_i - y_0)^2 + (z_i - z_0)^2 - \left[ l(x_i - x_0) + m(y_i - y_0) + n(z_i - z_0) \right]^2 \qquad (3)

The fitted line must pass through the centroid of all the points. We therefore centre the points
at their centroid, construct the covariance matrix and apply singular value decomposition; the
eigenvector corresponding to the largest eigenvalue gives the line direction. We then adopt the
mean direction of the fitted vertical lines as the final gravity direction.
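To make this step concrete, here is a minimal NumPy sketch of the fitting procedure, under the assumption that the 3D points belonging to each candidate line have already been gathered from the point cloud. The sign-flipping convention for averaging directions is an illustrative choice.

```python
import numpy as np

def fit_line_direction(points):
    """Least-squares 3D line direction for an (N, 3) array of points."""
    centroid = points.mean(axis=0)          # the fitted line passes through the centroid
    _, _, vt = np.linalg.svd(points - centroid)
    return vt[0]                            # direction of largest variance

def estimate_gravity(vertical_lines_points):
    """Average the fitted directions of several candidate vertical lines."""
    directions = []
    for pts in vertical_lines_points:
        d = fit_line_direction(np.asarray(pts, dtype=np.float64))
        if d[1] < 0:                        # flip so all directions share the same Y sign
            d = -d
        directions.append(d)
    g = np.mean(directions, axis=0)
    return g / np.linalg.norm(g)            # unit gravity direction estimate
```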

3.2.3 Improved HHA depth encoding HHA_adv

Fig. 3 Line fitting by least squares in 3D

We improve the HHA depth encoding based on the new gravity algorithm above. When the ground
does not appear in the scene, the lowest point along the Y-axis lies above the true ground
level, so a height compensation is needed. This requires two parameters: a threshold on the
lowest point along the Y-axis below which we consider the ground to be absent (we call this an
abnormal scene), and the normal distance between the ground and the camera. We analyse the
distribution of the distance between the ground and the camera, and find that most distances
fall in 110 cm~160 cm. The original HHA takes −90 as the threshold for an abnormal scene and
−130 as the normal distance: when the lowest point is less than 90 cm below the camera, the
heights are compensated so that the lowest point corresponds to the normal distance of 130 cm.
In this paper, we test three thresholds for the abnormal scene to obtain the improved HHA
(HHA_adv), and train the HHA models using DCNN [4]. The details are discussed in the
experiments section.
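The height-compensation rule can be summarised by the sketch below. The sign convention (camera at the origin, heights in cm, points below the camera negative) and the exact shifting behaviour are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def compensate_height(height_map, abnormal_thresh=-110.0, normal_dist=-130.0):
    """If the lowest point suggests the ground is not visible (abnormal scene),
    shift all heights so the lowest point sits at the assumed ground level."""
    lowest = height_map.min()
    if lowest > abnormal_thresh:                    # ground probably not visible
        height_map = height_map + (normal_dist - lowest)
    return height_map
```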

3.3 Joint modelling of RGB-D models

We use the DCNN proposed in [4], which uses no padding during training and is more
efficient than FCN [19]. In addition, DCNN introduces a multi-scale prediction method to
increase the accuracy of boundary localization. Since we have two-stream networks trained on RGB
and on HHA, we test both early fusion and late fusion to combine them, i.e. feature fusion and
result fusion respectively. For early fusion, we concatenate the pool-5 features of the RGB and
HHA networks and only fine-tune the layers after pool-5, which achieves results comparable to
late fusion. We also notice that the RGB and HHA models have different strengths when
recognizing different objects. For late fusion, we test the average sum [19] and a weighted sum
of the outputs of the two networks. As shown in Fig. 4, the predictions from both networks are
combined in a fusion layer and the loss is back-propagated through both streams, so the whole
model is trained end-to-end.
Long [19] used the average sum for late fusion, which applies fixed weights to the RGB and
HHA networks for all objects. However, we find from the recognition results that the RGB model
and the depth model perform differently on different objects. Unlike the average sum, where the
final layer is a concatenation layer in Caffe, our final layer in the weighted sum is a
convolution layer that learns weights for the two-stream network. We propose a collaborative
training method that learns this weighted fusion layer from the two-stream DCNNs: the weighted
fusion layer convolves the prediction results of the two networks, and back-propagation of the
errors completes the end-to-end training.
On the 40-class semantic segmentation task on the NYUDv2 RGB-D dataset [25], for each pixel
P_i, let the output probability of the RGB model be p_i^c = (p_{i1}^c, ..., p_{i40}^c), the
output probability of the depth model be p_i^d = (p_{i1}^d, ..., p_{i40}^d), and let the learned
weight W be an 80 × 40 matrix. For each category, the fused result is a weighted sum of the
probability outputs of the two models. After weighted fusion, the output probability of pixel
P_i is [p_i^c, p_i^d] × W, which gives a probability distribution over the 40 classes.

Fig. 4 Joint model training of the two-stream network
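A minimal NumPy sketch of this weighted fusion follows. In the actual model the fusion is a learned 1×1 convolution layer over the concatenated 80-channel score maps; here it is written as an explicit matrix product for clarity, with illustrative variable names.

```python
import numpy as np

def weighted_fusion(p_rgb, p_depth, W_fuse):
    """p_rgb, p_depth: (H, W, 40) per-pixel class probabilities from the two streams.
    W_fuse: (80, 40) learned fusion weights. Returns (H, W, 40) fused scores."""
    concat = np.concatenate([p_rgb, p_depth], axis=-1)   # (H, W, 80) = [p^c, p^d]
    return concat @ W_fuse                               # per-pixel [p^c p^d] x W

# Note: the average-sum baseline is the special case
#   W_fuse = 0.5 * np.vstack([np.eye(40), np.eye(40)])
```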

3.4 Fully-connected CRF combined with normal information

As Fig. 5 shows, the output score maps of the DCNN are quite smooth. Chen [4] proposed a fully
connected pairwise CRF model, which can largely improve the performance of a boosting-based
pixel-level classifier and capture fine edge details for RGB semantic segmentation. In this
paper, we extend Chen's CRF model to RGB-D semantic segmentation by adding normal information
as a geometric constraint, so that the results are refined using depth information.

Fig. 5 DCNN output and CRF output for a kitchen scene: (a) RGB image, (b) DCNN predictions, (c) DCNN+CRF result, (d) ground truth
The model employs an energy function with a unary potential φ(x_i) and a pairwise potential
ψ(x_i, x_j):

E(x) = \sum_{i \in N} \varphi(x_i) + \sum_{i,j \in N} \psi(x_i, x_j) \qquad (4)

The unary potential comes from the label probability at each pixel given by the RGB-HHA fused
model above. The pairwise potential is
ψ(x_i, x_j) = μ(x_i, x_j) \sum_{m=1}^{K} w_m \, k_m(f_i, f_j),
which is fully connected over every pair of pixels. Each k_m is a Gaussian kernel weighted by
the parameter w_m. Chen [4] adopted bilateral position and color terms as kernels for RGB
semantic segmentation. In this paper, we add a normal term and propose the following kernels:

w_1 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_{\alpha 1}^2} - \frac{\lVert I_i - I_j \rVert^2}{2\sigma_{\beta 1}^2} \right)
+ w_2 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_{\gamma}^2} \right)
+ w_3 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_{\alpha 2}^2} - \frac{\lVert N_i - N_j \rVert^2}{2\sigma_{\beta 2}^2} \right) \qquad (5)

where the first kernel depends on both pixel positions p and pixel color intensities I, the
second kernel depends only on pixel positions, and the third kernel depends on both pixel
positions and normal intensities N. The hyper-parameters σ_{α1}, σ_{β1}, σ_{γ}, σ_{α2} and
σ_{β2} control the "scale" of the Gaussian kernels.
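For clarity, the following NumPy sketch evaluates the kernel of Eq. (5) for a single pair of pixels. A real dense CRF implementation would evaluate it over all pairs using efficient high-dimensional filtering, as in [4]; this snippet is only an illustration of the formula.

```python
import numpy as np

def pairwise_kernel(p_i, p_j, I_i, I_j, N_i, N_j,
                    w1, s_a1, s_b1, w2, s_g, w3, s_a2, s_b2):
    """Eq. (5): appearance (position + color), smoothness (position only)
    and geometric (position + normal) Gaussian kernels."""
    dp = np.sum((np.asarray(p_i) - np.asarray(p_j)) ** 2)   # squared position distance
    dI = np.sum((np.asarray(I_i) - np.asarray(I_j)) ** 2)   # squared color distance
    dN = np.sum((np.asarray(N_i) - np.asarray(N_j)) ** 2)   # squared normal distance

    k_color  = w1 * np.exp(-dp / (2 * s_a1 ** 2) - dI / (2 * s_b1 ** 2))
    k_smooth = w2 * np.exp(-dp / (2 * s_g ** 2))
    k_normal = w3 * np.exp(-dp / (2 * s_a2 ** 2) - dN / (2 * s_b2 ** 2))
    return k_color + k_smooth + k_normal
```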
There are eight parameters to determine in the above formulation. We take the default values of
w_2 and σ_γ from [4]. For the other six parameters, to reduce the inference cost of parameter
selection, we adopt a phased grid search that tunes the normal kernel and the color kernel in
separate stages: taking the output of HHA-DCNN as the unary potential and the normal kernel plus
the position kernel as the pairwise potential, we grid-search the three parameters w_3, σ_{α2}
and σ_{β2}; the color kernel parameters are searched in the same way. With this phased grid
search, the number of parameter combinations is reduced from 5^6 to 5^3 per stage, the square
root of a full grid search, and the tuning time drops from about one month to about one day.
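A sketch of the phased grid search under the stated ranges is shown below. The helper `evaluate` is an assumption: it stands for running CRF inference on the validation subset with the given kernel parameters and returning the chosen metric.

```python
from itertools import product

def phased_grid_search(evaluate):
    """Two-stage (phased) grid search for the CRF kernel parameters.
    Each stage searches on the order of 5^3 combinations instead of 5^6 jointly."""
    grid_w  = list(range(1, 21, 5))       # w  in [1:5:20]
    grid_sa = list(range(10, 51, 10))     # sigma_alpha in [10:10:50]
    grid_sb = list(range(10, 51, 10))     # sigma_beta  in [10:10:50]

    # Stage 1: tune the normal kernel (w3, sigma_a2, sigma_b2) with the HHA-DCNN unary.
    best_normal = max(product(grid_w, grid_sa, grid_sb),
                      key=lambda p: evaluate(kernel="normal", params=p))
    # Stage 2: tune the color kernel (w1, sigma_a1, sigma_b1), keeping the normal kernel fixed.
    best_color = max(product(grid_w, grid_sa, grid_sb),
                     key=lambda p: evaluate(kernel="color", params=p,
                                            normal_params=best_normal))
    return best_color, best_normal
```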

4 Experimental results

4.1 Dataset and setting

We conduct three sets of experiments to evaluate our proposed method: the improved depth
encoding as input preprocessing, the joint modelling of RGB and HHA, and the CRF post-processing
with a normal kernel for RGB-D images.
We test our framework on the public RGB-D dataset NYUDv2 [25], which contains 1449 RGB-D
images with pixel-wise labels; we use the 40-class semantic segmentation task defined by
Gupta [10]. We report results on the standard split of 795 training images and 654 testing
images.
Instead of a CNN or FCN [19], we adopt DCNN [4] for training and use the VGG-16 network
pre-trained on ImageNet. For each single model, we fine-tune the VGG-16 network on the NYUDv2
40-class pixel-classification task by stochastic gradient descent on the cross-entropy loss. We
use a mini-batch of 10 images and an initial learning rate of 0.0001 (0.001 for the final
classifier layer), multiplying the learning rate by 0.1 every 2000 iterations. We use a momentum
of 0.99 and a weight decay of 0.0005.
This paper adopts four metrics commonly used in semantic segmentation and scene parsing
evaluations, which are variations on pixel accuracy and region intersection over union (IU),
following [19]:

- pixel accuracy: \sum_i n_{ii} / \sum_i t_i
- mean accuracy: (1/n_{cl}) \sum_i n_{ii} / t_i
- mean IU: (1/n_{cl}) \sum_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})
- frequency weighted IU: (\sum_k t_k)^{-1} \sum_i t_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})

where n_{ij} is the number of pixels of class i predicted to belong to class j, n_{cl} is the
number of classes, and t_i = \sum_j n_{ij} is the total number of pixels of class i.
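All four metrics can be computed directly from the confusion matrix. The following is a minimal NumPy sketch (not the evaluation code of [19]); classes absent from the ground truth are skipped via nanmean.

```python
import numpy as np

def segmentation_metrics(n):
    """n: (ncl, ncl) confusion matrix with n[i, j] = # pixels of class i predicted as j."""
    n = n.astype(np.float64)
    t = n.sum(axis=1)                       # t_i: total pixels of class i
    diag = np.diag(n)                       # n_ii: correctly classified pixels
    denom = t + n.sum(axis=0) - diag        # t_i + sum_j n_ji - n_ii

    with np.errstate(divide="ignore", invalid="ignore"):
        pixel_acc = diag.sum() / t.sum()
        mean_acc = np.nanmean(diag / t)
        iu = diag / denom
        mean_iu = np.nanmean(iu)
        fw_iu = np.nansum(t * iu) / t.sum()
    return pixel_acc, mean_acc, mean_iu, fw_iu
```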

4.2 HHA_adv model based on improved gravity direction detection

We use the vertical line fitting method to obtain the gravity direction and improve the HHA
description. As described in Section 3.2.3, we try three thresholds, −90, −100 and −110, for
the abnormal scene and produce three groups of improved HHA (HHA_adv).

Table 1 Model comparison using different thresholds for the HHA model

threshold   data            pixel acc.   mean acc.   mean IU   f.w IU
−90         DCNN-HHA_adv    61.3         40.9        29.7      45.0
−100        DCNN-HHA_adv    61.7         40.7        29.8      45.3
−110        DCNN-HHA_adv    61.7         40.8        29.6      45.3
−90         FCN-HHA [19]    57.1         35.2        24.2      40.4

Table 2 Comparison of fusion performance for RGB-D

Method                                        pixel acc.   mean acc.   mean IU   f.w IU
RGB                                           66.1         47.7        35.8      50.0
HHA_adv                                       61.7         40.8        29.6      45.3
RGB + HHA_adv (early fusion, pool5)           68.8         51.4        39.4      53.3
RGB + HHA_adv (late fusion, average sum)      69.0         51.0        39.4      53.8
RGB + HHA_adv (late fusion, weighted sum)     69.3         51.5        39.4      53.8

As Table 1 shows, the threshold of −110 gives the better results, and our HHA_adv models
trained with DCNN outperform the original HHA [19]. For the NYUDv2 dataset, we find some
complex or dark scenes where the detected verticals are not reliable; in these cases we fall
back to the original gravity direction of Gupta [11], which applies to 267 of the 1449 images.
All the results are better than FCN-HHA [19] by about 4%.

4.3 RGB-D joint modelling based on two-stream network

After training the single RGB and HHA_adv models with DCNN, we fine-tune the two-stream
network in three ways: early fusion, average sum and weighted sum, as described in Section 3.3.
Table 2 shows the semantic segmentation results of RGB and HHA_adv with the DCNN model for the
three fusion strategies. Simply concatenating the pool5 features of the two networks brings
about a 2.7% improvement over the 66.1% pixel accuracy of the RGB model. Our proposed weighted
sum performs better than the average sum, with 3.2% and 7.6% improvements over the original RGB
and HHA_adv models respectively.

4.4 CRF combined with geometric information

As described in Section 3.4, to exploit the geometric information of the RGB-D scene, we add a
normal kernel to the pairwise potential. To speed up parameter selection, we use the default
values w_2 = 3 and σ_γ = 3 as in [4] and search for the best values of w_1, σ_{α1}, σ_{β1},
w_3, σ_{α2} and σ_{β2} by cross-validation on a small subset of the validation set (we use
100 images). We use the phased grid search, solving the two groups of parameters in separate
stages. After the DCNN has been fine-tuned for RGB and HHA_adv respectively, we cross-validate
the parameters of the fully connected CRF model. The search ranges of the parameters are
w_1 ∈ [1 : 5 : 20], σ_{α1} ∈ [10 : 10 : 50] and σ_{β1} ∈ [10 : 10 : 50] for the RGB kernel, and
the same for the corresponding parameters of the normal kernel.
For the HHA_adv geometric model, we also try the same kernel as the RGB visual model to compare
against the normal kernel. Table 3 shows that the CRF with the normal kernel works better than
the one with the color kernel for the HHA model, and also improves by 1.3% over the original
DCNN with HHA_adv.

Table 3 CRF performance with different kernels on the HHA_adv model

Method                            pixel acc.   mean acc.   mean IU   f.w IU
HHA_adv                           61.7         40.8        29.6      45.3
HHA_adv + CRF (color kernel)      62.3         41.3        30.7      45.5
HHA_adv + CRF (normal kernel)     63.0         41.2        31.3      45.7

Table 4 Comparison of 40-class semantic segmentation on NYUDv2 with state-of-the-art methods

Method                                                                 pixel acc.   mean acc.   mean IU   f.w IU
Long [19]          RGB + HHA / FCN / average sum                       65.4         46.1        34.0      49.5
Liu [18]           RGBD super-pixel / CNN / CRF                        63.1         39.0        29.5      48.4
Eigen [7]          RGB + Depth + Normal / CNN / feature concatenate    65.6         45.1        34.1      51.4
Li [17]            RGB + HHA / CNN / LSTM fusion layer                 –            49.4        –         –
Wang [28]          RGB + HHA / feature transform / Deconv              –            47.3        –         –
He [12]            RGB + HHA / super-pixel upsample                    68.4         52.1        38.1      54.0
He (Multi) [12]    RGB + HHA / multi-frame / super-pixel               70.1         53.8        40.1      55.7
DCNN (ours)        RGB + HHA_adv / DCNN / weighted sum                 69.4         51.6        40.0      53.7
DCNN+CRF (ours)    RGB + HHA_adv / DCNN / weighted sum / CRF-normal    70.3         51.7        41.2      54.2

After all eight parameters are determined, we take the two-stream network output as the unary
potential and combine it with these pairwise potentials for CRF inference. We compare our
DCNN+CRF model with other state-of-the-art methods on NYUDv2 for the 40-class task. As
Table 4 shows, our DCNN (RGB + HHA_adv) and DCNN+CRF (RGB + HHA_adv) achieve
state-of-the-art performance compared with existing methods, especially on the pixel accuracy
metric. Among these methods, only He [12] uses multiple frames to enlarge the training and
testing samples from NYUDv2 videos; all the other methods, including ours, use only a single
frame. Our proposed method therefore achieves the best performance among single-frame methods.

5 Conclusion

In this paper, we explore the geometric information of RGB-D scenes to improve the performance
of semantic segmentation. We propose a novel gravity detection algorithm to improve the
robustness of the HHA depth encoding. We then learn a weighted sum layer to combine the
two-stream networks trained on RGB and HHA_adv, and finally we extend the pairwise potential
function of the CRF with normal information to refine the segmentation results. Experimental
results show that our proposed method achieves state-of-the-art performance on the NYUDv2
dataset for the 40-class task.
In future work, we will consider improving the robustness of the gravity detection algorithm by
randomly selecting points to fit the vertical lines. We also plan to improve the multi-model
fusion method for indoor semantic segmentation.

Acknowledgments This work is supported in part by Beijing Natural Science Foundation: 4142051.

References

1. Anand A, Koppula HS, Joachims T, Saxena A (2013) Contextually guided semantic labeling and search for
three-dimensional point clouds. Int J Robot Res 32(1):19–34
2. Banica D, Sminchisescu C (2015) Second-order constrained parametric proposals and sequential search-
based structured prediction for semantic segmentation in rgb-d images. In: Computer Vision and Pattern
Recognition

3. Bingjie W, Junpeng Z, Chunjie W (2014) Spatial straightness error evaluation based on three-dimensional
least squares method. Journal of Beijing University of Aeronautics and Astronautics 40:1477–1480 (in
Chinese)
4. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2014) Semantic image segmentation with deep
convolutional nets and fully connected crfs. Comp Sci 357–361. https://arxiv.org/abs/1412.7062
5. Couprie C, Farabet C, Najman L, LeCun Y (2013) Indoor semantic segmentation using depth information.
In: International Conference on Learning Representations. arXiv preprint arXiv:1301.3572
6. Deng Z, Todorovic S, Latecki L J (2015) Semantic segmentation of rgbd images with mutex constraints. In:
ICCV
7. Eigen D, Fergus R (2015) Predicting depth, surface normals and semantic labels with a common multi-scale
convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision, pp.
2650–2658
8. Farabet C, Couprie C, Najman L, LeCun Y (2013) Learning hierarchical features for scene labeling. IEEE
Trans Pattern Anal Mach Intell 35(8):1915–1929
9. Filliat D, Battesti E, Bazeille S, et al (2012) RGBD object recognition and visual texture classification for
indoor semantic mapping. Technologies for Practical Robot Applications (TePRA), 2012 I.E. International
Conference on IEEE, pp. 127–132
10. Gupta S, Arbelaez P, Malik J (2013) Perceptual organization and recognition of indoor scenes from rgb-d
images. In: CVPR. 564–571
11. Gupta S, Girshick R, Arbelaez P, Malik J (2014) Learning rich features from RGB-D images for object
detection and segmentation. In: ECCV
12. He Y, Chiu WC, Keuper M, Fritz M (2017) Std2p: rgbd semantic segmentation using spatio-temporal data-
driven pooling. In CVPR, 7158–7167
13. Hong S, Noh H, Han B (2015) Decoupled deep neural network for semi-supervised semantic segmentation.
NIPS 2015
14. Khan S, Bennamoun M, Sohel F, Togneri R (2014) Geometry driven semantic labeling of indoor scenes.
ECCV 2014 8689:679–694
15. Koppula H S, Anand A, Joachims T, et al (2011) Semantic labeling of 3D point clouds for indoor scenes.
International Conference on Neural Information Processing Systems. Curran Associates Inc, pp. 244–252
16. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural
networks. In NIPS
17. Li Z, Gan Y, Liang X, et al (2016) LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for
RGB-D Scene Labeling. In: European Conference on Computer Vision. Springer International Publishing,
541–557
18. Liu F, Lin G, Shen C (2016) Discriminative Training of Deep Fully-connected Continuous CRF with Task-
specific Loss. arXiv preprint arXiv:1601.07649
19. Long J, Shelhamer E, and Darrell T (2015) Fully convolutional networks for semantic segmentation, In
CVPR, pp. 3431–3440
20. Noh H, Hong S, Han B (2015) Learning deconvolution network for semantic segmentation. arXiv preprint
arXiv:1505.04366
21. Ren X, Bo L, Fox D (2012) Rgb-(d) scene labeling: features and algorithms. In: CVPR 2759–2766
22. Shuai B, Zuo Z, Wang B, et al (2016) DAG-recurrent neural networks for scene labeling. In: Computer
Vision and Pattern Recognition. IEEE, pp. 3620–3629
23. Shuai B, Zuo Z, Wang G, Wang B (2016) Scene parsing with integration of parametric and non-parametric
models. IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society
25(5):2379–2391
24. Silberman N, Fergus R (2011) Indoor scene segmentation using a structured light sensor. In: ICCV
Workshops 601–608
25. Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from rgbd
images. In: ECCV, pp. 746–760
26. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition.
CoRR, abs/1409.1556
27. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, and Rabinovich A
(2014) Going deeper with convolutions. CoRR, abs/1409.4842
28. Wang J, Wang Z, Tao D, et al (2016) Learning common and specific features for rgb-d semantic
segmentation with deconvolutional networks. In: European Conference on Computer Vision. Springer
International Publishing, pp. 664–679

Hong Liu is an associate professor in the Institute of Computing Technology, Chinese Academy of Sciences,
Beijing, China. She received her Ph.D. degree in computer application technology at Institute of Computing
Technology, Chinese Academy of Sciences, in 2007. Her current research interests are in the fields of pattern
recognition, computer vision, intelligent surveillance system and intelligent human-computer interaction.

Wenshan Wu received her MS degree in the Institute of Computing Technology, Chinese Academy of Sciences,
in 2017. She is currently an engineer at Tencent. Her research interests include pattern recognition and image
processing.

Xiangdong Wang is an associate professor in the Institute of Computing Technology, Chinese Academy of
Sciences, Beijing, China. He received his Ph.D. degree in computer application technology at the Institute of
Computing Technology, Chinese Academy of Sciences, in 2007. His main research focus is on pattern recognition,
multimedia technology and intelligent human-computer interaction.

Yueliang Qian is a professor in the Institute of Computing Technology, Chinese Academy of Sciences, Beijing,
China. His research interests include multimedia technology, natural language processing and intelligent human-
computer interaction.
