Bo Li1,2, Chunhua Shen2,3, Yuchao Dai4, Anton van den Hengel2,3, Mingyi He1
1 Northwestern Polytechnical University, China
2 University of Adelaide, Australia; 3 Australian Centre for Robotic Vision
4 Australian National University
[Figure 1 diagram: input patches rescaled to 227x227; shared fixed-weight layers (96, 256, 384, 384, 256 conv channels; FC1 of 4096); per-scale FC1 outputs concatenated into a 16384-d vector and fed to learned 4096-d layers; output.]
Figure 1: Visualization of our multi-scale framework. Each patch goes through five convolutional layers and the first fully-connected layer
(here transferred from AlexNet). The features are concatenated before they are fed to two additional fully-connected layers. We then refine
the predictions from the CNN by inference of a hierarchical CRF (not shown here; see text for details).
closed-form solution. An overall CNN architecture is presented in Fig. 1. Note that the fixed-weights part of the CNN can be transferred from pre-trained models such as Krizhevsky et al.'s AlexNet [15] or the deeper VGGNet [22].

3.1. Depth regression with CNNs

Most existing work predicts depth by regressing with hand-crafted features. We note that, for pixel-based approaches, a local feature vector extracted from a local neighborhood can be insufficient to predict the depth label. Thus, some form of context, combining information from neighboring pixels to capture the spatial configuration, has to be used. To encode the depth, we use a deep network and formulate depth estimation as a regression problem, as shown in Fig. 1. Here the first five convolutional layers and the first fully-connected layer (FC1) are transferred from AlexNet, and their weights are fixed and shared by all input patches. The outputs of FC1 are concatenated and fed into two additional fully-connected layers (FC-a and FC-b). The weights of FC-a and FC-b are learned from training data.

To predict the depth of a pixel/super-pixel, we first extract patches of different sizes around that point, then resize all the patches to 227 x 227 pixels to form the multi-scale inputs. The network and training details are described in Section 3.3.

The multi-scale structure is inspired by the relationship between depth and scale. In addition, context information often carries rich cues about the depth of a point. In our experiments, we provide extensive comparisons and analysis to demonstrate that large patches with rich context information and multi-scale observations are critical to the task of depth regression.

Effect of multi-scale features and long-range context. In our depth regression deep network we use multi-scale patches to extract depth cues. Since local features alone may not be sufficient to estimate depth at a point, we need to consider the global context of the image [10]. As the patch size increases, more information is included and the image context encodes depth more accurately. Therefore, to regress depth or surface normals from image patches, it makes sense to use large patches that capture non-local information.

In real-world data sets, there is generally a scale change between scenes due to varying focal lengths. One strategy to deal with these scaling effects is to extract the characteristic scale for each point according to scale-space theory [23]. Here, for efficiency, we instead apply a discrete multi-scale approach: we extract patches of several fixed sizes to capture the scaling effect across the data set.

To analyze the effects of both multi-scale input and context, we conducted experiments evaluating depth estimation on the NYU V2 data set with increasing sizes of a single patch. The results are reported in Table 1. Depth estimation performance gradually improves as the patch size grows, and improves further as the number of scales (and hence the number of image patches) increases. Both experiments demonstrate that multi-scale image patches and large image context are of critical importance for achieving good depth estimation performance.
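To make the multi-scale input construction concrete, the following is a minimal sketch using NumPy and OpenCV. The patch sizes follow Table 1 and the 227 x 227 rescaling follows the text; the zero-padding of patches that extend past the image border is our own assumption, not something the paper specifies.

```python
import cv2
import numpy as np

def extract_multiscale_patches(image, center, patch_sizes=(55, 121, 271, 407),
                               out_size=227):
    """Crop square patches of several sizes around `center` = (x, y) and
    rescale each to out_size x out_size for the AlexNet-style input.
    `image` is assumed to be an H x W x 3 array; border patches are
    zero-padded (our choice, not specified in the text)."""
    h, w = image.shape[:2]
    x, y = center
    patches = []
    for s in patch_sizes:
        half = s // 2
        x0, y0 = x - half, y - half
        x1, y1 = x0 + s, y0 + s
        crop = image[max(0, y0):min(h, y1), max(0, x0):min(w, x1)]
        # Pad so that every patch has the full s x s extent before resizing.
        pad = ((max(0, -y0), max(0, y1 - h)),
               (max(0, -x0), max(0, x1 - w)),
               (0, 0))
        crop = np.pad(crop, pad, mode="constant")
        patches.append(cv2.resize(crop, (out_size, out_size)))
    return patches  # one 227 x 227 patch per scale
```

Each returned patch would then be passed through the shared fixed-weight layers to obtain its 4096-d FC1 feature before concatenation.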
input patch size                      δ < 1.25   δ < 1.25²   δ < 1.25³    rel     log10    rms
55 × 55 pixels                         47.00%     76.85%      91.19%     0.328    0.129    1.052
121 × 121 pixels                       53.48%     82.89%      94.41%     0.280    0.112    0.972
271 × 271 pixels                       57.68%     86.27%      95.84%     0.254    0.103    0.889
407 × 407 pixels                       59.15%     86.71%      96.13%     0.247    0.099    0.8717
Final result (4 scales of patches)     62.07%     88.61%      96.78%     0.232    0.094    0.821

Table 1: Depth estimation results on the NYU V2 data set for single-scale image patches of different sizes and for the multi-scale setting. The error metric definitions can be found in Section 4.
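The exact metric definitions are given in Section 4 of the paper (not included in this excerpt). The sketch below implements the definitions that are standard in this literature and that match the column headers of Table 1 (mean relative error, mean log10 error, RMSE, and the δ-threshold accuracies); treat it as our reading rather than the paper's exact formulas.

```python
import numpy as np

def depth_error_metrics(d_gt, d_pred):
    """Standard single-image depth metrics over all valid points:
    mean relative error, mean log10 error, RMSE, and the fraction of
    points whose ratio max(d/d_hat, d_hat/d) falls below 1.25^k."""
    d_gt = np.asarray(d_gt, dtype=float).ravel()
    d_pred = np.asarray(d_pred, dtype=float).ravel()
    rel = np.mean(np.abs(d_gt - d_pred) / d_gt)
    log10 = np.mean(np.abs(np.log10(d_gt) - np.log10(d_pred)))
    rms = np.sqrt(np.mean((d_gt - d_pred) ** 2))
    ratio = np.maximum(d_gt / d_pred, d_pred / d_gt)
    deltas = {f"delta<1.25^{k}": np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)}
    return rel, log10, rms, deltas
```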
3.2. Refining the results via hierarchical CRF

So far we have shown how depth may be predicted for super-pixels using regression. Our goal now is to refine the predicted depth or surface normals from the super-pixel level to the pixel level. To address this problem, we formulate a hierarchical CRF built upon both the super-pixel and pixel levels. The structure of our hierarchical CRF is illustrated in Fig. 2.

More specifically, let D = {d_1, ..., d_n} be the set of per-pixel depth values and S = {s_1, ..., s_m} the set of super-pixels, where n is the total number of pixels in one image and m is the number of super-pixels. In our model, we assume the depth value of a super-pixel to be the same as that of its centroid pixel; thus we remove the explicit super-pixel variables from our formulation. Our energy function is:

E(d) = \sum_{i \in S} \phi_i(d_i) + \sum_{(i,j) \in E_S} \phi_{ij}(d_i, d_j) + \sum_{C \in P} \phi_C(d_C),    (1)

where E_S denotes the set of pairs of super-pixels that share a common boundary, and P is the set of patches defined at the pixel level, aiming at capturing the local relationships in the depth map.

Generally speaking, this is similar to a high-order CRF defined on both the super-pixel and pixel levels. We now explain the potentials used in the energy function Eq. (1); the first two potentials are defined at the super-pixel level, while the third is defined at the pixel level.

Potential 1: Data term

\phi_i(d_i) = (d_i - \bar{d}_i)^2,    (2)

where \bar{d}_i denotes the depth regression result from our multi-scale deep network. This term is defined at the super-pixel level and measures the quadratic distance between the estimated depth d_i and the regressed depth \bar{d}_i.

Potential 2: Smoothness at the super-pixel level

\phi_{ij}(d_i, d_j) = w_1 \frac{(d_i - d_j)^2}{\lambda_{ij}},    (3)

This pairwise term enforces coherence between neighbouring super-pixels and defines smoothness at the super-pixel level. The quadratic distance is weighted by \lambda_{ij}, i.e. the color difference between connected super-pixels in the CIELUV color space [24].

Potential 3: Auto-regression model. Here we use an auto-regression model to characterize the local correlation structure in the depth map; such models have been used in image colorization [25], depth in-painting [4], and depth image super-resolution [26, 27]. Depth maps of generic 3D scenes contain mainly smooth regions separated by curves, and the auto-regression model can characterize such local structure well. Its key insight is that a depth map can be represented locally by the depth map itself. Denoting by d_u the depth value at location u, the depth predicted by the model can be expressed as:

d_u = \sum_{r \in C \setminus u} \alpha_{ur} d_r,    (4)
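Because every potential in Eqs. (1)-(4) is quadratic in the depth variables, minimizing the energy amounts to solving a sparse linear system, which is the closed-form solution referred to earlier. The sketch below illustrates this optimization structure for a simplified, single-level version of the model: the trade-off weights w1 and w2 and the auto-regression coefficients alpha_ur are taken as given inputs (their computation is not specified in this excerpt), so this is an illustration under our assumptions rather than the paper's full hierarchical CRF.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_quadratic_crf(d_reg, edges, lam, ar_coeffs, w1=1.0, w2=1.0):
    """Minimize a simplified quadratic CRF energy
        sum_i (d_i - d_reg_i)^2
      + w1 * sum_(i,j) (d_i - d_j)^2 / lam_ij
      + w2 * sum_u (d_u - sum_r alpha_ur * d_r)^2
    by solving the sparse normal equations (closed form).

    d_reg     : (n,) depths regressed by the deep network (data-term targets)
    edges     : list of (i, j) index pairs of neighbouring nodes
    lam       : (len(edges),) colour-difference weights lambda_ij
    ar_coeffs : (n, n) sparse matrix with entry [u, r] = alpha_ur, zero diagonal
    w1, w2    : hypothetical trade-off weights (not given in this excerpt)
    """
    n = len(d_reg)
    # Data-term rows: d_i = d_reg_i
    A_blocks = [sp.identity(n, format="csr")]
    b_blocks = [np.asarray(d_reg, dtype=float)]

    # Smoothness rows: sqrt(w1 / lam_ij) * (d_i - d_j) = 0
    rows, cols, vals = [], [], []
    for k, (i, j) in enumerate(edges):
        c = np.sqrt(w1 / lam[k])
        rows += [k, k]
        cols += [i, j]
        vals += [c, -c]
    A_blocks.append(sp.csr_matrix((vals, (rows, cols)), shape=(len(edges), n)))
    b_blocks.append(np.zeros(len(edges)))

    # Auto-regression rows: sqrt(w2) * (d_u - sum_r alpha_ur * d_r) = 0
    A_blocks.append(np.sqrt(w2) * (sp.identity(n, format="csr") - sp.csr_matrix(ar_coeffs)))
    b_blocks.append(np.zeros(n))

    A = sp.vstack(A_blocks, format="csr")
    b = np.concatenate(b_blocks)
    # Normal equations A^T A d = A^T b give the energy minimizer.
    return spla.spsolve((A.T @ A).tocsc(), A.T @ b)
```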
… we add a ReLU layer and a dropout layer after these two layers. The size of layer F-cat is 16384, and the sizes of the FC-a and FC-b layers are both 4096. For more detail about the "shared weights", please refer to [29]. The learning rate is initialized to 0.01 and divided by 10 after 5 cycles through the training set. In our experiments, we trained the network for roughly 20 to 30 epochs on both data sets.
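As an illustration of the learned part of the network and the training schedule just described, here is a minimal PyTorch sketch (the paper's experiments used Caffe [29]); the scalar-output layer and the 0.5 dropout rate are our assumptions, and the step schedule is one reading of "divided by 10 after 5 cycles through the training set".

```python
import torch
import torch.nn as nn

class DepthRegressionHead(nn.Module):
    """Learned layers on top of the fixed conv1-5 + FC1 features.
    Four scale patches each give a 4096-d FC1 feature; concatenating them
    yields the 16384-d F-cat vector, followed by FC-a and FC-b (4096 each)
    with ReLU and dropout, as described in the text."""
    def __init__(self, n_scales=4, fc1_dim=4096, hidden=4096):
        super().__init__()
        self.fc_a = nn.Linear(n_scales * fc1_dim, hidden)
        self.fc_b = nn.Linear(hidden, hidden)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(0.5)        # dropout rate assumed, not given in the text
        self.out = nn.Linear(hidden, 1)    # scalar depth output (our assumption)

    def forward(self, fc1_feats):          # fc1_feats: (batch, n_scales, 4096)
        x = fc1_feats.flatten(1)           # F-cat: concatenated 16384-d feature
        x = self.drop(self.relu(self.fc_a(x)))
        x = self.drop(self.relu(self.fc_b(x)))
        return self.out(x).squeeze(1)

# Learning rate 0.01, divided by 10 every 5 passes through the training set.
model = DepthRegressionHead()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
```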
… Here d is the ground-truth depth, d̂ is the estimated depth, and T denotes the set of all points in the images.

4.1. NYU2 data set

The NYU V2 data set contains 1449 images, of which 795 are used as a training set and 654 as a testing set. All images were resized to 427 × 561 pixels in order to preserve the aspect ratio of the originals. In Table 2, we compare our method with state-of-the-art methods: depth transfer [13], discrete-continuous depth estimation [11], and pulling things out of perspective [18]. Our method outperforms these methods by a large margin un…
[Figure/table fragment: qualitative-comparison column labels ("Image", "Ground Truth") and δ-threshold headers, plus partial results rows reporting rel, log10, and rms under the C1 and C2 criteria:
(unlabelled method)    C2: 0.338  0.134  12.6
Regression only        C1: 0.283  0.094  7.01     C2: 0.281  0.102  10.74
Our method             C1: 0.278  0.092  7.188    C2: 0.279  0.102  10.27]
Figure 5: Demonstration of the generalization capability of our method, where we estimate depth maps for images that are in neither the NYU V2 nor the Make3D data set.

… trained a regression model for this surface normal estimation task. It is expected that following the idea of converting surface normal regression into classification (triangular …

Acknowledgements

B. Li's contribution was made while he was a visiting student at the University of Adelaide, sponsored by the Chinese Scholarship Council. This work was also in part supported by ARC Grants (FT120100969, DE140100180), the National Natural Science Foundation of China (61420106007), and the Data to Decisions Cooperative Research Centre, Australia.
References

[1] X. Ren, L. Bo, and D. Fox, "RGB-D scene labeling: Features and algorithms," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 2759–2766.
[2] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, "Real-time human pose recognition in parts from single depth images," Communications of the ACM, vol. 56, no. 1, pp. 116–124, 2013.
[3] S. R. Fanello, C. Keskin, S. Izadi, P. Kohli, D. Kim, D. Sweeney, A. Criminisi, J. Shotton, S. B. Kang, and T. Paek, "Learning to be a depth camera for close-range human capture and interaction," ACM T. Graphics, vol. 33, no. 4, p. 86, 2014.
[4] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in Proc. Eur. Conf. Comp. Vis., 2012, pp. 746–760.
[5] A. Gupta, A. Efros, and M. Hebert, "Blocks world revisited: Image understanding using qualitative geometry and mechanics," in Proc. Eur. Conf. Comp. Vis., 2010, pp. 482–496.
[6] D. Fouhey, A. Gupta, and M. Hebert, "Unfolding an indoor origami world," in Proc. Eur. Conf. Comp. Vis., 2014, pp. 687–702.
[7] R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah, "Shape-from-shading: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 8, pp. 690–706, 1999.
[8] C. Wu, J.-M. Frahm, and M. Pollefeys, "Repetition-based dense single-view reconstruction," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2011, pp. 3113–3120.
[9] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 824–840, 2009.
[10] A. Saxena, S. Chung, and A. Ng, "3-D depth reconstruction from a single still image," Int. J. Comp. Vis., vol. 76, no. 1, pp. 53–69, 2008.
[11] M. Liu, M. Salzmann, and X. He, "Discrete-continuous depth estimation from a single image," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2014, pp. 716–723.
[12] L. Ladicky, B. Zeisl, and M. Pollefeys, "Discriminatively trained dense surface normal estimation," in Proc. Eur. Conf. Comp. Vis., 2014, pp. 468–484.
[13] K. Karsch, C. Liu, and S. B. Kang, "Depth extraction from video using non-parametric sampling," in Proc. Eur. Conf. Comp. Vis., 2012, pp. 775–788.
[14] J. Konrad, M. Wang, and P. Ishwar, "2D-to-3D image conversion by learning depth from examples," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn. Workshops, 2012, pp. 16–22.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[16] A. Saxena, J. Schulte, and A. Y. Ng, "Depth estimation using monocular and stereo cues," in Proc. IEEE Int. Joint Conf. Artificial Intell., 2007, vol. 7.
[17] B. Liu, S. Gould, and D. Koller, "Single image depth estimation from predicted semantic labels," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2010, pp. 1253–1260.
[18] L. Ladicky, J. Shi, and M. Pollefeys, "Pulling things out of perspective," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2014, pp. 89–96.
[19] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Proc. Adv. Neural Inf. Process. Syst., 2014.
[20] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2014.
[21] M. Oquab, L. Bottou, I. Laptev, J. Sivic, et al., "Learning and transferring mid-level image representations using convolutional neural networks," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013.
[22] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learning Representations, 2015.
[23] T. Lindeberg, "Scale-space theory: A basic tool for analysing structures at different scales," J. Applied Statistics, vol. 21, no. 2, pp. 224–270, 1994.
[24] B. Fulkerson, A. Vedaldi, and S. Soatto, "Class segmentation and object localization with superpixel neighborhoods," in Proc. IEEE Int. Conf. Comp. Vis., 2009, pp. 670–677.
[25] A. Levin, D. Lischinski, and Y. Weiss, "Colorization using optimization," ACM T. Graphics, vol. 23, pp. 689–694, 2004.
[26] J. Diebel and S. Thrun, "An application of Markov random fields to range sensing," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 291–298.
[27] O. Mac Aodha, N. D. Campbell, A. Nair, and G. J. Brostow, "Patch based synthesis for single depth image super-resolution," in Proc. Eur. Conf. Comp. Vis., 2012, pp. 71–84.
[28] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, 2012.
[29] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[30] D. F. Fouhey, A. Gupta, and M. Hebert, "Data-driven 3D primitives for single image understanding," in Proc. IEEE Int. Conf. Comp. Vis., 2013.