Figure 1. Examples of challenging in-the-wild images and their surface normals predicted by our method.
Figure 2. Motivation. In this paper, we propose to utilize the per-pixel ray direction and estimate the surface normals by learning the
relative rotation between nearby pixels. (a) Ray direction serves as a useful cue for pixels near occluding boundaries as the normal should
be perpendicular to the ray. (b) It also gives us the range of normals that would be visible, effectively halving the output space. (c)
The surface normals of certain scene elements — in this case, the floor — may be difficult to estimate due to the lack of visual cues.
Nonetheless, we can infer their normals by learning the pairwise relationship between nearby normals (e.g. which surfaces should be
perpendicular). (d) Modeling the relative change in surface normals is not just useful for flat surfaces. In this example, the relative angle
between the normals of the yellow pixels can be inferred from that of the red pixels assuming circular symmetry.
Despite the aforementioned usefulness, state-of-the-art methods [2, 14] do not encode the ray direction and use CNNs with translational weight sharing, preventing the model from learning ray direction-aware inference. While recent transformer-based models [29, 56] have the capability of encoding the ray direction in the form of a learned positional embedding, it is not trivial to inter/extrapolate the positional embedding when testing the model on images taken with out-of-distribution intrinsics.

3.2. Modeling inter-pixel constraints

Consider a pixel i and its neighboring pixel j. Since their surface normals have unit length and share the same origin (the camera center), they are related by a 3D rotation matrix R ∈ SO(3). While there are different ways to parameterize R, we choose the axis-angle representation, θ = θe, where the unit vector e represents the axis of rotation and θ is the angle of rotation. The exponential map exp : so(3) → SO(3) — which is readily available in modern deep learning libraries [1, 43] — can then map θ back to R.
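To make this concrete, the exponential map for so(3) is Rodrigues' formula. Below is a minimal, self-contained PyTorch sketch (the function name is ours; PyTorch3D [43], for instance, ships an equivalent routine, so3_exp_map):

```python
import torch

def axis_angle_to_rotation(theta_vec: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Exponential map so(3) -> SO(3) via Rodrigues' formula.

    theta_vec: (..., 3) axis-angle vectors, theta = theta * e.
    Returns:   (..., 3, 3) rotation matrices.
    """
    theta = theta_vec.norm(dim=-1, keepdim=True).clamp(min=eps)  # rotation angle
    e = theta_vec / theta                                        # unit rotation axis
    ex, ey, ez = e.unbind(dim=-1)
    zero = torch.zeros_like(ex)
    # K is the skew-symmetric cross-product matrix of the axis e.
    K = torch.stack(
        [zero, -ez, ey, ez, zero, -ex, -ey, ex, zero], dim=-1
    ).reshape(*e.shape[:-1], 3, 3)
    eye = torch.eye(3, dtype=theta_vec.dtype, device=theta_vec.device)
    theta = theta[..., None]  # broadcast over the 3x3 matrices
    # Rodrigues: R = I + sin(theta) K + (1 - cos(theta)) K^2
    return eye + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)
```

Note that θ = 0 (flat surfaces) maps to the identity, which is why the axis-angle parameterization pairs naturally with the observation below.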
Within flat surfaces (which are prevalent in man-made scenes/objects), θ would be zero and R would simply be the identity. In a typical indoor scene, the surfaces of objects are often perpendicular or parallel to the ground plane, creating lines across which the normals should rotate by 90° (see Fig. 2-c). For a curved surface, the relative angle between the pixels can be inferred from the occluding boundaries by assuming a certain level of symmetry (see Fig. 2-d).

But why should we learn R instead of directly estimating the normals? Firstly, learning the relative rotation is much easier, as the angle between the normals, unlike the normals themselves, is independent of the viewing direction (it is also zero — or close to zero — for most pixel pairs). Finding the axis of rotation is also straightforward. When two (locally) flat surfaces intersect at a line, the normals rotate around that intersection. As the image intensity generally changes sharply near such intersections, the task can be as simple as edge detection.

Secondly, the estimated rotation can help improve the accuracy for surfaces with limited visual cues. For instance, while it is difficult to estimate the normal of a texture-less surface, the objects that are in contact with the surface can provide evidence for its normal (see Fig. 2-c).

Lastly, as long as the relative rotations between the normals are captured correctly, any misalignment between the prediction and the ground truth can be resolved via a single global rotation. For example, in the case of a flat surface, estimating inaccurate but constant normals is better than estimating accurate but noisy normals, as the orientation can easily be corrected (e.g. via sparse depth measurements or visual odometry). This is again analogous to depth estimation, where a relative depth map is easier to learn and can be aligned via a global scaling factor.

4. Our approach

From Sec. 4.1 to 4.3, we explain how a dense prediction network can be modified to encode the inductive biases discussed in Sec. 3. We then explain the network architecture and our training dataset in Sec. 4.4 and 4.5.

4.1. Ray direction encoding

To avoid having to learn ray direction-aware prediction, one can crop and zero-pad the images such that the principal point is at the center and the field of view is always θ°, where θ is set to some high value. Then, a pixel at (u, v) will always have the same ray direction, allowing the network to encode it, e.g., in the form of a positional embedding. However, such an approach wastes compute on the zero-padded pixels and, for images taken with a wider field of view, discards the content that falls outside the crop.
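The alternative we adopt (Fig. 3, right) is to compute the per-pixel ray direction directly from the known intrinsics and feed it to the network as extra input channels. A minimal sketch (the variable names are ours, not from the released code):

```python
import torch

def ray_direction_map(H: int, W: int, fx: float, fy: float,
                      cx: float, cy: float) -> torch.Tensor:
    """Focal length-normalized image coordinates for every pixel, i.e. the
    per-pixel ray direction in camera space (up to normalization)."""
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    ray = torch.stack(((u - cx) / fx, (v - cy) / fy, torch.ones_like(u)), dim=0)
    return ray  # (3, H, W), concatenated with the RGB image as extra channels
```

Because the encoding is computed in closed form, it inter/extrapolates naturally to out-of-distribution intrinsics, unlike a learned positional embedding.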
Figure 3. Encoding camera intrinsics. (left) To avoid having to learn camera intrinsics-aware prediction, one can zero-pad or crop the images (e.g., from an input FoV of 60° to a target FoV of 90°) such that they always have the same intrinsics. (right) Instead, we compute the focal length-normalized image coordinates, (u − c_u)/f_u and (v − c_v)/f_v, and provide them as additional input to the network.

Figure 4. Ray ReLU activation. An important constraint for surface normal estimation is that the predicted normal should be visible. We achieve this by zeroing out the component that is in the direction of the ray.
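The Ray ReLU of Fig. 4 can be sketched as follows. The exact formulation below is our reading of the caption (remove the along-ray component only when the normal points away from the camera), assuming the ray is unit length and points from the camera into the scene:

```python
import torch
import torch.nn.functional as F

def ray_relu(normal: torch.Tensor, ray: torch.Tensor) -> torch.Tensor:
    """Keep predicted normals visible by removing any component that points
    along the viewing ray.

    normal, ray: (B, 3, H, W) tensors.
    """
    dot = (normal * ray).sum(dim=1, keepdim=True)  # n . r per pixel
    normal = normal - F.relu(dot) * ray            # zero out the along-ray part
    return F.normalize(normal, dim=1)              # back to unit length
```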
Figure 5. Network architecture. A lightweight CNN extracts a low-resolution feature map, from which the initial normal, hidden state and context feature are obtained. The hidden state is then recurrently updated (×N) using a ConvGRU [9] unit, which can be replaced with a different RNN. From the updated hidden state, we estimate three quantities: the rotation angle and axis that define a pairwise rotation matrix for each neighboring pixel, and a set of weights that are used to fuse the rotated normals.
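For reference, a ConvGRU update of the kind used in Fig. 5 (following the style of [9] as popularized by RAFT [50]; a sketch, not the authors' exact layer) looks like:

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """Convolutional GRU cell: the hidden state h is updated from the
    context/input features x while preserving the spatial layout."""

    def __init__(self, hidden_dim: int, input_dim: int, ks: int = 3):
        super().__init__()
        pad = ks // 2
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, ks, padding=pad)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, ks, padding=pad)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, ks, padding=pad)

    def forward(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                         # update gate
        r = torch.sigmoid(self.convr(hx))                         # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))  # candidate state
        return (1.0 - z) * h + z * q                              # updated hidden state
```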
Table 2. Quantitative evaluation of the generalization capability of different methods. For each metric, the best results are colored in green. For evaluation on [6, 11, 20, 30, 47], we used the official code and model weights to generate predictions and measured their accuracy. For methods that assume a specific aspect ratio and resolution, the images were zero-padded and resized to match the requirements. Numbers in red mean that the method was trained on the same dataset; we excluded such methods from the ranking to ensure a fair comparison.
Figure 6. Comparison to Omnidata v2 [14] (DPT [42] model trained on 12 million images using 3D data augmentation [29] and
cross-task consistency [59]). Our method shows a stronger generalization capability for challenging in-the-wild objects. For texture-less
regions (e.g. sky in the fourth column), our model resolves any inconsistency in the prediction and outputs a flat surface, while preserving
sharp boundaries around other objects.
show a significantly higher level of detail.

One notable advantage of our method over ViT-based models (e.g. [29]) lies in the simplicity and efficiency of network training. For example, Omnidata v2 [29] was trained for 2 weeks on four NVIDIA V100 GPUs. A set of sophisticated 3D data augmentation functions [29] were used to improve the generalization performance, and cross-task consistency [59] was enforced by utilizing other ground truth labels. In contrast, our model can be trained in just 12 hours on a single NVIDIA 4090 GPU, does not require geometry-aware 3D augmentations, and does not require any additional supervisory signal. Our model also has 40% fewer parameters than [29] (72M vs. 123M).

5.3. Ablation study

We now run an ablation study to examine the effectiveness of the two proposed inductive biases — utilizing dense per-pixel ray direction and modeling the pairwise relative rotation between nearby pixels.
Method               | NYUv2 [47]            | ScanNet [11]          | iBims-1 [30]          | Sintel [6]            | Virtual KITTI [20]
                     | mean 11.25° 22.5° 30° | mean 11.25° 22.5° 30° | mean 11.25° 22.5° 30° | mean 11.25° 22.5° 30° | mean 11.25° 22.5° 30°
baseline             | 16.6 59.1  76.8  82.9 | 16.5 60.5  77.7  83.5 | 18.0 66.0  77.7  81.2 | 36.6 18.4  38.3  50.0 | 30.5 48.0  55.9  60.6
baseline + ray       | 16.6 59.2  77.0  82.9 | 16.4 61.2  77.9  83.6 | 17.6 66.4  78.1  81.4 | 36.0 19.8  38.8  50.4 | 29.4 50.6  58.4  62.4
baseline + ray + rot | 16.4 59.6  77.7  83.5 | 16.2 61.0  78.7  84.4 | 17.1 67.4  79.0  82.3 | 34.9 21.5  41.5  52.7 | 28.9 51.3  59.2  63.2
Table 3. Ablation study - quantitative results. Adding per-pixel ray direction as input (+ ray) and updating the initial prediction via
iterative rotation estimation (+ rot) both lead to an overall improvement in the metrics. The benefit of using ray direction encoding is
clearer for datasets with out-of-distribution camera intrinsics (Sintel and Virtual KITTI).
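For clarity, the metrics in Tabs. 2 and 3 are the standard ones for surface normal evaluation: the mean angular error in degrees (lower is better) and the percentage of pixels whose error is below 11.25°/22.5°/30° (higher is better). A sketch of how they are computed (our helper, assuming unit-length normals at valid pixels):

```python
import torch

def normal_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """pred, gt: (N, 3) unit surface normals at valid pixels."""
    cos = (pred * gt).sum(dim=1).clamp(-1.0, 1.0)   # per-pixel cosine similarity
    err = torch.rad2deg(torch.acos(cos))            # angular error in degrees
    return {
        "mean": err.mean().item(),
        "<11.25": 100.0 * (err < 11.25).float().mean().item(),
        "<22.5": 100.0 * (err < 22.5).float().mean().item(),
        "<30": 100.0 * (err < 30.0).float().mean().item(),
    }
```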
As can be seen from Tab. 3, encoding the ray direction helps improve the accuracy, especially for out-of-distribution camera intrinsics (Sintel and Virtual KITTI). This is in line with our observations in Tab. 2.

On the other hand, the improvement coming from the rotation estimation is not large. As the accuracy metrics are dominated by the pixels belonging to large planar surfaces, they do not convey the improvements near surface boundaries. The metrics also do not penalize inconsistencies within piece-wise smooth surfaces. The qualitative comparison in Fig. 7 clearly shows that the proposed refinement via rotation estimation improves the piece-wise consistency and the sharpness of the prediction near surface boundaries.

6. Conclusion

In this paper, we discussed the inductive biases needed for surface normal estimation and introduced how per-pixel ray direction and the pairwise relative rotation between nearby pixels can be encoded in and estimated by a dense prediction network.

7. Limitations

Surface normal estimation is an inherently ambiguous task when the camera intrinsics are not known. This was why we proposed to encode the camera intrinsics in the form of dense per-pixel ray direction. While this helped us push the limits of single-image surface normal estimation, it also means that the model requires prior knowledge about the camera.

Note, however, that most RGB-D datasets already provide pre-calibrated camera parameters, and that monocular cameras can be calibrated easily using patterns with known relative coordinates. For in-the-wild images, we demonstrated in Sec. 5.2 that the intrinsics can be approximated using the image metadata. If no information is available, we can attempt to estimate the camera intrinsics from a single image. For instance, vanishing points with known relative angles can be used to recover the camera parameters [7]. As our model is designed to learn the relative angle between surfaces, it can in turn be used for camera calibration. This will be explored in our future work.
8. Acknowledgement

Research presented in this paper was supported by Dyson Technology Ltd. The authors would like to thank Shikun Liu, Eric Dexheimer, Callum Rhodes, Aalok Patwardhan, Riku Murai, Hidenobu Matsuki, and members of the Dyson Robotics Lab for insightful feedback and discussions.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org. 3
[2] Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Estimating and exploiting the aleatoric uncertainty in surface normal estimation. In ICCV, pages 13137–13146, 2021. 1, 2, 3, 7
[3] Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Irondepth: Iterative refinement of single-view depth using surface normal and its uncertainty. arXiv preprint arXiv:2210.03676, 2022. 1, 5
[4] Aayush Bansal, Bryan Russell, and Abhinav Gupta. Marr revisited: 2d-3d alignment via surface normal prediction. In CVPR, pages 5965–5974, 2016. 2
[5] Oliver Boyne, Gwangbin Bae, James Charles, and Roberto Cipolla. Found: Foot optimization with uncertain normals for surface deformation using synthetic data. arXiv preprint arXiv:2310.18279, 2023. 1
[6] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In Proceedings of the European Conference on Computer Vision (ECCV), Part VI, pages 611–625, 2012. 6, 7, 8
[7] Bruno Caprile and Vincent Torre. Using vanishing points for camera calibration. International Journal of Computer Vision, 4(2):127–139, 1990. 8
[8] Weifeng Chen, Shengyi Qian, David Fan, Noriyuki Kojima, Max Hamilton, and Jia Deng. Oasis: A large-scale dataset for single image 3d in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 679–688, 2020. 6, 7, 3
[9] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014. 5
[10] James Coughlan and Alan L Yuille. The manhattan world assumption: Regularities in scene statistics which enable bayesian inference. In NeurIPS, pages 845–851, 2000. 2
[11] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017. 6, 7, 8
[12] Tien Do, Khiem Vuong, Stergios I Roumeliotis, and Hyun Soo Park. Surface normal estimation of tilted images via spatial rectifier. In Proceedings of the European Conference on Computer Vision (ECCV), Part IV, pages 265–280, 2020. 2
[13] Stéphane d'Ascoli, Hugo Touvron, Matthew L Leavitt, Ari S Morcos, Giulio Biroli, and Levent Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, pages 2286–2296. PMLR, 2021. 6
[14] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In ICCV, pages 10786–10796, 2021. 1, 2, 3, 6, 7
[15] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, pages 2650–2658, 2015. 2
[16] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, pages 2366–2374, 2014. 2
[17] Jose M Facil, Benjamin Ummenhofer, Huizhong Zhou, Luis Montesano, Thomas Brox, and Javier Civera. Cam-convs: Camera-aware multi-scale convolutions for single-view depth. In CVPR, pages 11826–11835, 2019. 4
[18] David F Fouhey, Abhinav Gupta, and Martial Hebert. Data-driven 3d primitives for single image understanding. In ICCV, pages 3392–3399, 2013. 2
[19] David Ford Fouhey, Abhinav Gupta, and Martial Hebert. Unfolding an indoor origami world. In Proceedings of the European Conference on Computer Vision (ECCV), Part VI, pages 687–702, 2014. 2
[20] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, pages 4340–4349, 2016. 6, 7, 8
[21] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003. 2
[22] Derek Hoiem, Alexei A Efros, and Martial Hebert. Automatic photo pop-up. In ACM SIGGRAPH, pages 577–584, 2005. 2
[23] Derek Hoiem, Alexei A Efros, and Martial Hebert. Recovering surface layout from an image. IJCV, 75:151–172, 2007. 2
[24] Hugues Hoppe, Tony DeRose, Tom Duchamp, Mark Halstead, Hubert Jin, John McDonald, Jean Schweitzer, and Werner Stuetzle. Piecewise smooth surface reconstruction. In Proceedings of the 21st annual conference on Computer graphics and interactive techniques, pages 295–302, 1994. 2
[25] Youichi Horry, Ken-Ichi Anjyo, and Kiyoshi Arai. Tour into the picture: using a spidery mesh interface to make animation from a single image. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 225–232, 1997. 2
[26] Yuan-Ting Hu, Jiahong Wang, Raymond A Yeh, and Alexander G Schwing. Sail-vos 3d: A synthetic dataset and baselines for object detection and 3d mesh reconstruction from video data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1418–1428, 2021. 6
[27] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 6
[28] Katsushi Ikeuchi and Berthold KP Horn. Numerical shape from shading and occluding boundaries. Artificial Intelligence, 17(1-3):141–184, 1981. 2
[29] Oğuzhan Fatih Kar, Teresa Yeo, Andrei Atanov, and Amir Zamir. 3d common corruptions and data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18963–18974, 2022. 1, 2, 3, 6, 7
[30] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of cnn-based single-image depth estimation methods. 2018. 6, 7, 8
[31] Jana Košecká and Wei Zhang. Extraction, matching, and pose recovery based on dominant rectangular structures. Computer Vision and Image Understanding, 100(3):274–293, 2005. 2
[32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, pages 1106–1114, 2012. 2
[33] Florian Langer, Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Sparc: Sparse render-and-compare for cad model alignment in a single rgb image. arXiv preprint arXiv:2210.01044, 2022. 1
[34] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 1
[35] David C Lee, Martial Hebert, and Takeo Kanade. Geometric reasoning for single image structure recovery. In 2009 IEEE conference on computer vision and pattern recognition, pages 2136–2143. IEEE, 2009. 2
[36] Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, and Anima Anandkumar. Prismer: A vision-language model with an ensemble of experts. arXiv preprint arXiv:2303.02506, 2023. 1
[37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 6
[38] David Marr. Analysis of occluding contour. Proceedings of the Royal Society of London. Series B. Biological Sciences, 197(1129):441–475, 1977. 2
[39] Simon Niklaus, Long Mai, Jimei Yang, and Feng Liu. 3d ken burns effect from a single image. ACM Transactions on Graphics (ToG), 38(6):1–15, 2019. 6
[40] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. 6, 1
[41] Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, Raquel Urtasun, and Jiaya Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In CVPR, pages 283–291, 2018. 1
[42] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, pages 12179–12188, 2021. 2, 7
[43] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501, 2020. 3
[44] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV), 2021. 6
[45] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), Part III, pages 234–241, 2015. 2
[46] Shreeyak Sajjan, Matthew Moore, Mike Pan, Ganesh Nagaraja, Johnny Lee, Andy Zeng, and Shuran Song. Clear grasp: 3d shape estimation of transparent objects for manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3634–3642. IEEE, 2020. 6
[47] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Proceedings of the European Conference on Computer Vision (ECCV), Part V, pages 746–760, 2012. 6, 7, 8
[48] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of residual networks using large learning rates. arXiv preprint arXiv:1708.07120, 2018. 6
[49] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019. 6
[50] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020. 5, 1
[51] Rui Wang, David Geraghty, Kevin Matzen, Richard Szeliski, and Jan-Michael Frahm. Vplnet: Deep single view normal estimation with vanishing points and lines. In CVPR, pages 689–698, 2020. 2
[52] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020. 6
[53] Xiaolong Wang, David Fouhey, and Abhinav Gupta. Designing deep networks for surface normal estimation. In CVPR, pages 539–547, 2015. 2
[54] Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, and Michael J
Black. Icon: Implicit clothed humans obtained from nor-
mals. In 2022 IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 13286–13296.
IEEE, 2022. 1
[55] Yuliang Xiu, Jinlong Yang, Xu Cao, Dimitrios Tzionas, and
Michael J Black. Econ: Explicit clothed humans optimized
via normal integration. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 512–523, 2023. 1
[56] Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, and
Elisa Ricci. Transformer-based attention networks for con-
tinuous pixel-wise prediction. In ICCV, pages 16269–16279,
2021. 1, 3
[57] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan
Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs:
A large-scale dataset for generalized multi-view stereo net-
works. Computer Vision and Pattern Recognition (CVPR),
2020. 6
[58] Amir R Zamir, Alexander Sax, William Shen, Leonidas J
Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy:
Disentangling task transfer learning. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 3712–3722, 2018. 6
[59] Amir R Zamir, Alexander Sax, Nikhil Cheerla, Rohan Suri,
Zhangjie Cao, Jitendra Malik, and Leonidas J Guibas. Ro-
bust learning through cross-task consistency. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 11197–11206, 2020. 2, 7
[60] Guangyao Zhai, Dianye Huang, Shun-Cheng Wu, HyunJun
Jung, Yan Di, Fabian Manhardt, Federico Tombari, Nassir
Navab, and Benjamin Busam. Monograspnet: 6-dof grasping
with a single rgb image. In 2023 IEEE International Confer-
ence on Robotics and Automation (ICRA), pages 1708–1714.
IEEE, 2023. 1
[61] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera,
Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learn-
ing of monocular depth estimation and visual odometry with
deep feature reconstruction. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages
340–349, 2018. 2
[62] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding
conditional control to text-to-image diffusion models. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 3836–3847, 2023. 1
[63] Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui,
Martin R Oswald, Andreas Geiger, and Marc Pollefeys.
Nicer-slam: Neural implicit scene encoding for rgb slam.
arXiv preprint arXiv:2302.03594, 2023. 1
Rethinking Inductive Biases for Surface Normal Estimation
Supplementary Material
9. Network architecture
Tab. 4 shows the architecture of the CNN used to extract the
initial surface normals, initial hidden state, and context fea-
ture. For the ConvGRU cell and convex upsampling layer,
we use the architecture of [3] and [50], respectively.
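For reference, the convex upsampling layer of [50] replaces bilinear upsampling of the output with a learned convex combination of each pixel's 3×3 low-resolution neighborhood. A minimal sketch under our naming (the upsampling factor k and the final renormalization for unit normals are our assumptions, not taken from the released code):

```python
import torch
import torch.nn.functional as F

def convex_upsample(field: torch.Tensor, mask: torch.Tensor, k: int = 8) -> torch.Tensor:
    """RAFT-style convex upsampling of a dense field.

    field: (B, C, H, W) low-resolution prediction (e.g. surface normals).
    mask:  (B, 9 * k * k, H, W) raw weights predicted by the network.
    Returns: (B, C, k*H, k*W).
    """
    B, C, H, W = field.shape
    mask = mask.view(B, 1, 9, k, k, H, W)
    mask = torch.softmax(mask, dim=2)               # convex weights over the 3x3 window
    up = F.unfold(field, kernel_size=3, padding=1)  # (B, C*9, H*W) neighborhoods
    up = up.view(B, C, 9, 1, 1, H, W)
    up = (mask * up).sum(dim=2)                     # (B, C, k, k, H, W)
    up = up.permute(0, 1, 4, 2, 5, 3)               # (B, C, H, k, W, k)
    up = up.reshape(B, C, k * H, k * W)
    return F.normalize(up, dim=1)                   # keep normals unit length
```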
Table 4. Network architecture. In each 2D convolutional layer, "ks" and C_out are the kernel size and the number of output channels, respectively. F_N represents the feature map of resolution H/N × W/N, and r_N is a dense map of per-pixel ray direction at the same resolution. X + Y means that the two tensors are concatenated, and up(·) is bilinear upsampling by a factor of 2.
Figure 8. Additional comparison to Omnidata v2 [14] on in-the-wild images from the OASIS dataset [8]. (Columns: input image, Omnidata v2, ours.)