
Avoiding Non-Manhattan Obstacles Based on Projection of Spatial Corners in Indoor Environment

Luping Wang and Hui Wei

Abstract—Monocular vision-based navigation is a valuable capability for a home mobile robot. However, due to diverse disturbances, helping robots avoid obstacles, especially non-Manhattan obstacles, remains a big challenge. In indoor environments, there are many spatial right-corners that are projected into two-dimensional projections with special geometric configurations. These projections, which consist of three lines, might enable us to estimate their position and orientation in 3D scenes. In this paper, we present a method for home robots to avoid non-Manhattan obstacles in indoor environments from a monocular camera. The approach first detects non-Manhattan obstacles. Through analyzing geometric features and constraints, it is possible to estimate posture differences between the orientation of the robot and the non-Manhattan obstacles. Finally, according to the convergence of posture differences, the robot can adjust its orientation to keep pace with the pose of the detected non-Manhattan obstacles, making it possible to avoid these obstacles by itself. Based on geometric inferences, the proposed approach requires no prior training or any knowledge of the camera's internal parameters, making it practical for robot navigation. Furthermore, the method is robust to errors in calibration and image noise. We compared the errors of corners of estimated non-Manhattan obstacles against the ground truth. Furthermore, we evaluated the validity of the convergence of differences between the robot orientation and the posture of non-Manhattan obstacles. The experimental results showed that our method is capable of avoiding non-Manhattan obstacles, meeting the requirements for indoor robot navigation.

Index Terms—Avoiding obstacle, monocular vision, navigation, non-Manhattan obstacle, spatial corner.

Manuscript received June 16, 2019; revised December 6, 2019, February 6, 2020; accepted February 21, 2020. This work was supported by the National Natural Science Foundation of China (61771146, 61375122), the National Thirteen 5-Year Plan for Science and Technology (2017YFC1703303), and in part by Shanghai Science and Technology Development Funds (13dz2260200, 13511504300). Recommended by Associate Editor Pu Wang. (Corresponding author: Luping Wang.)

Citation: L. P. Wang and H. Wei, “Avoiding non-Manhattan obstacles based on projection of spatial corners in indoor environment,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 4, pp. 1190–1200, Jul. 2020.

L. P. Wang is with the School of Mechanical Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China (e-mail: wangluping@usst.edu.cn).

H. Wei is with the Laboratory of Algorithms for Cognitive Models, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai 201203, China (e-mail: weihui@fudan.edu.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JAS.2020.1003117

I.  Introduction

With the aging population and the growing number of disabled people, the development of home service robots is becoming an increasingly urgent issue. Visual navigation in an indoor environment has considerable value for monitoring and mission planning. However, there exist a multitude of disturbances from clutter and occlusion in an indoor environment, resulting in the predicament of avoiding obstacles, especially non-Manhattan obstacles (e.g., shelves, sofas, chairs), which remains a difficult challenge for vision-based robots.

Instead of current methods (e.g., 3D laser scanners), visual navigation that uses a single low-cost camera draws more attention because it is advantageous in consumption and efficiency. Regarding the human visual system, Gibson described, via the “visual cliff” experiment, that perception of depth is inborn and does not require additional knowledge [1]. It used to be believed that humans recover three-dimensional structures using binocular parallax. However, it was indicated that the human ability to estimate the depth of isolated points is extremely weak, and that we are more likely to infer the relative depths of different surfaces from their jointed points [2]. This indicates that binocular features are not that important, and that it is possible to understand scenes using only monocular images. Meanwhile, it was reported that humans are sensitive to surfaces of different orientations, allowing us to extract surface and orientation information for understanding a scene [3]. Accordingly, it can be assumed that there are some simple rules that can be used to infer 3D structure over a short period of time. Methods were presented to understand indoor scenes based on projections of rectangles and right angles, but non-Manhattan obstacles remain an undiscussed issue [4], [5].

In this paper, we present a method which allows for understanding of non-Manhattan obstacles in an indoor environment from a single image, without prior training or internal calibration of a camera. First, straight lines are detected, and spatial corner projections consisting of three lines are extracted. Secondly, through geometric inferences, it is possible to understand the non-Manhattan obstacles. Finally, through convergence of differences in geometric features, it is possible to adjust the robot orientation to keep pace with the posture of non-Manhattan obstacles, allowing for the avoidance of such objects.

Instead of data-driven methods, such as those using deep learning, the proposed approach requires no prior training. With the use of simple geometric inferences, the proposed algorithm is robust to changes in illumination and color.


Despite disturbances, the method can understand non-Manhattan obstacles with neither knowledge of the camera's intrinsic parameters nor the relation between the camera and the world, making it practical and efficient for a navigating robot. Besides, without other external devices, the method has the advantage of a lower required investment.

For classic benchmarks, our algorithm is capable of describing details of non-Manhattan obstacles. We compared the corners estimated by the proposed approach against the corner ground truth, measuring the error through the percentage of pixels obtained from summing up all Euclidean distances between estimated corners and the associated ground-truth corners. Furthermore, the experimental results demonstrated that robots can understand non-Manhattan obstacles and avoid them via the convergence of the posture difference between the robot orientation and the non-Manhattan obstacle, meeting the requirements of indoor robot navigation.

II.  Related Work

There are previous works which have made impressive progress, including structure-from-motion [6]–[9] and visual SLAM [10]–[14]. Through a series of visual observations, they propose a scene model in the form of a 3D point cloud. One method showed that three-dimensional point clouds and image data can be combined for semantic segmentation [15]. Nevertheless, only a fraction of the information from the original images can be provided via point clouds and geometric cues, thus some aspects such as edge textures are sometimes lost.

Also, 3D structures can be reconstructed through inferring the relationship between connected superpixels. Saxena et al. assigned each pixel of an image to grass, trees, sky, or something else through heuristic knowledge [16]. But these methods hardly work in indoor settings with different levels of clutter and incomplete surfaces and coverage.

Furthermore, there are approaches that model geometric scene structures from a single image, including approaches for geometric label classification [17] and for finding vertical/ground fold-lines [18]. As to others [19], local image properties were linked to a classification system of local surface orientation, and walls were extracted based on jointed points with the floor. However, due to a great dependence on precise floor segmentation, these methods may fail in an indoor environment with clutter and covers. There has been renewed interest in 3D structures in restricted domains such as the Manhattan world [20], [21]. Based on vanishing points, a method detected rectangular surfaces aligned with major orientations [5]. But only dominant directions were discussed, and object surface information was not extracted.

Additionally, a top-down approach for understanding indoor scenes was presented by Pero et al. [22]. However, it was difficult to explain room box edges when there were no additional objects. Although Pero's algorithm [23] can understand the 3D geometry of indoor environments, it required objects and prior knowledge such as relative dimensions, sizes, and locations. Also, a comprehensive Bayesian generative model was proposed to understand indoor scenes [24], but it relied on more specific and detailed geometric models, and suffered greatly from hallucinating objects. Conversely, parameterized models of indoor environments were developed [25]. However, this method sampled possible spatial layout hypotheses without clutter, was prone to errors because of occlusions, and tended to fit rooms where walls coincided with object surfaces. Meanwhile, the relative depth-order of rectangular surfaces was inferred by considering their relationships [26], [27], but this just provided depth cues of partial rectangular regions in the image and not the entire scene.

Approaches that can estimate what part of the 3D space is free and what part is occupied by objects are modeled either in terms of clutter [28], [29] or bounding boxes [30], [22]. A significant work combined 3D geometry and semantics in the scope of outdoor scenes. Hedau proposed a method that identified beds by combining image appearances and 3D reasoning made possible by estimating the room layout [31].

As to Dasgupta's work [32], the indoor layout can be estimated by using a fully convolutional neural network in conjunction with an optimization algorithm. It evenly sampled a grid of a feasible region to generate candidates for vanishing points. Nevertheless, the vanishing point may not lie in the feasible region when the robot faces certain layout scenarios, such as a two-wall layout scenario. Additionally, because of the iterative refinement process, optimization took approximately 30 seconds per frame, with a step size of 4 pixels for sampling lines and a grid of 200 vanishing points. Hence, the efficiency of this method cannot meet the requirements of robot navigation in an indoor environment. Also, a method was presented to predict room layout from a panoramic image [33]. Meanwhile, other methods using convolutional neural networks were proposed to infer indoor scenes from a single image [34]–[38]. Since these methods have no regard for non-Manhattan structures, it is difficult for them to understand non-Manhattan obstacles.

Recently, a method was presented to detect horizontal vanishing points and the zenith vanishing point in man-made environments [39]. Also, another method was proposed to estimate the camera orientation and vanishing points through nonlinear Bayesian filtering in a non-Manhattan world [40]. However, it is difficult for these methods to understand non-Manhattan obstacles. In previous works, the proposed algorithm could estimate the layout of an indoor scene via projections of spatial rectangles, but there was difficulty in handling non-Manhattan structures [5]. Also, a method can provide understanding of indoor scenes that satisfy the Manhattan assumption [4]; however, it failed to understand non-Manhattan obstacles because structures that do not satisfy the Manhattan assumption were not discussed. Therefore, it is necessary to develop an algorithm that understands non-Manhattan obstacles for visual navigation using a single low-cost camera on robots. Furthermore, a method with low consumption and high efficiency meets the requirements of robot navigation.


[Fig. 1 flowchart: frame Ft → line extraction and LVP detection → corner projection → posture estimation of the non-Manhattan obstacle → difference Dt → forward if Dt = 0, otherwise turn → orientation updating → next frame Ft+1]

Fig. 1. Our system architecture.

III.  Inference

In an indoor environment, there are many rectangles and spatial right-angles in 3D indoor scenes (e.g., windows, doors, shelves, screens). The projections of spatial right-angles that satisfy the Manhattan world assumption enable us to estimate the layout and its details under that assumption [4], [5]. The Manhattan world assumption means that many surfaces are aligned with the three main world axes; in other words, many surfaces are parallel to the three principal ones. However, there are many surfaces (e.g., chairs and sofas in clutter) that are not aligned with the three main world axes, and these surfaces can be seen as non-Manhattan structures. Due to clutter, there are many obstacles consisting of non-Manhattan spatial corners, resulting in difficulty in navigation for home robots. In 2D images, the partial projections of these obstacles can be considered a composition of corner-pairs, which enables us to estimate their original positions in a 3D scene.

Fig. 1 shows our system architecture. $F_t$ is a monocular capture at time $t$. After preprocessing, we can detect non-Manhattan obstacles and the robot's orientation. By estimating the pose of non-Manhattan obstacles, it is possible to calculate the difference $D_t$, which helps determine the robot's action (forwarding or turning). The details are shown in the following sections.

A. Preprocessing

Firstly, the edges are detected and straight lines are found [41]. The lines are defined as follows:

$$L_i = [x_1^i, y_1^i, x_2^i, y_2^i], \quad i \in N \qquad (1)$$

where N is the number of straight lines.

Also, the layout of the scene, the layout-vanishing-points (LVPs), and the layout-main-vanishing-point (LMVP) can be detected [4], as shown in Fig. 2. Obviously, the lines that belong to these LVPs, including the LMVP, all satisfy the Manhattan world assumption.

Fig. 2. The LVPs (one at infinity) and the LMVP [4].

B. Non-Manhattan Obstacles

1) Corner Projection Extraction: For obstacles, their corners can be projected onto the image, resulting in all kinds of corner projections. Each of them can be seen as a composition of three lines. Hence, one corner can be defined as follows:

$$C = \{L_s; L_n; L_r\}. \qquad (2)$$

The integrity of a corner can be defined as follows:

$$\Lambda_{LP} = \left|d(p_c, p_s) - \frac{l_s}{2}\right| + \left|d(p_c, p_n) - \frac{l_n}{2}\right| \qquad (3)$$

$$\Lambda_C = \Lambda_{LP} + \left|d(p_c, p_r) - \frac{l_r}{2}\right| \qquad (4)$$

where $p_c$ is the intersection of the two lines $(L_s, L_n)$, and $d$ is a distance function. $p_s$, $p_n$ and $l_s$, $l_n$ are the midpoints and lengths of $L_s$, $L_n$, respectively. A smaller $\Lambda_{LP}$ represents better line-pair integrity. Similarly, $p_r$ and $l_r$ are the midpoint and length of $L_r$, respectively. A smaller $\Lambda_C$ represents better corner projection integrity. Therefore, through searching for corner projections with smaller $\Lambda$, we can find spatial corner projections of better integrity.
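To make the integrity ranking of (3) and (4) concrete, the sketch below scores a candidate corner built from three detected segments and keeps the best-ranked candidates. It is a minimal illustration under our own assumptions: the helper names and the homogeneous-coordinate intersection are ours, not the authors' implementation, and line detection [41] is assumed to have already produced the segments of (1).

```python
import numpy as np

def midpoint_and_length(line):
    # line = [x1, y1, x2, y2], as in (1)
    p1, p2 = np.asarray(line[:2], float), np.asarray(line[2:], float)
    return (p1 + p2) / 2.0, float(np.linalg.norm(p2 - p1))

def line_intersection(l1, l2):
    # Intersection p_c of the two supporting lines, via homogeneous coordinates.
    homog = lambda l: np.cross([l[0], l[1], 1.0], [l[2], l[3], 1.0])
    p = np.cross(homog(l1), homog(l2))
    return None if abs(p[2]) < 1e-9 else p[:2] / p[2]   # None: (nearly) parallel lines

def corner_integrity(Ls, Ln, Lr):
    """Lambda_C of the candidate corner C = {Ls, Ln, Lr}, eqs. (3)-(4)."""
    pc = line_intersection(Ls, Ln)
    if pc is None:
        return np.inf
    (ps, ls), (pn, ln_), (pr, lr) = map(midpoint_and_length, (Ls, Ln, Lr))
    lam_lp = abs(np.linalg.norm(pc - ps) - ls / 2) + abs(np.linalg.norm(pc - pn) - ln_ / 2)
    return lam_lp + abs(np.linalg.norm(pc - pr) - lr / 2)

def top_corners(candidates, k=100):
    """Rank candidate corners (Ls, Ln, Lr) by Lambda_C and keep the top k (e.g., top 100)."""
    return sorted(candidates, key=lambda c: corner_integrity(*c))[:k]
```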



Fig. 3. Extraction of spatial corner projections. (a) line segments; (b) projections of spatial corners (e.g., in red, blue, brown); (c) spatial corner projections of better integrity; (d) spatial corner projections satisfying the constraint.

Here, we ranked the corners according to their Λ values, and it is possible to select the top-ranked (e.g., top 100) corners via smaller ΛC. The projections of worse integrity (e.g., in brown) would be eliminated.

TABLE I
Lines in Corner Pairs Should Satisfy the Constraint. LV1, LV2, LV3 Represent That the Line Is Assigned to V1, V2, V3

Line/Corner:  $C^g$  |  $C^h$
$L_s$:  $L_s^g \notin \{L_{V_1} \cup L_{V_2}\}$  |  $L_s^h \notin \{L_{V_1} \cup L_{V_2}\}$
$L_n$:  $L_n^g \notin \{L_{V_1} \cup L_{V_2}\}$  |  $L_n^h \notin \{L_{V_1} \cup L_{V_2}\}$
$L_r$:  $L_r^g \in L_{V_3}$  |  $L_r^h \in L_{V_3}$

For each corner of an obstacle that does not satisfy the Manhattan world assumption, there are at least two lines that do not belong to the scene layout VPs. Since the obstacle to be discussed is placed on the floor, there is at least one line in the corner that belongs to the LVPs (V1 is the LMVP and V3 is at infinity). Here, lines in corner pairs should satisfy the constraint shown in Table I. Therefore, corners can be extracted as shown in Fig. 3.

2) Detection of Non-Manhattan Obstacles: Therefore, it is possible to determine the obstacle via corner pairs as follows:

$$G = \{C^g; C^h\}. \qquad (5)$$

Firstly, there exist two lines, which respectively come from the two corners, and these two lines should belong to the same line. This can be defined as follows:

$$\lambda_1 = L_r^g \,\Theta\, L_r^h \rightarrow 0 \qquad (6)$$

where $L_r^g$ represents the line which comes from $C^g$ and $L_r^h$ represents the other. Here Θ represents an operator which determines whether two lines are collinear. A smaller λ1 represents that these two lines are more likely to belong to the same line, as shown in Fig. 4.

Secondly, these two lines should also satisfy the following condition:

$$\lambda_2 = v_r^g \,\Theta\, v_r^h \rightarrow \pi \qquad (7)$$

where $v_r^g$ represents the vector from the center of $C^g$ to the midpoint of $L_r^g$, and $v_r^h$ represents the other. Here Θ is the operator that determines whether two vectors are collinear, and π indicates that they are in opposite directions. For a smaller λ2, for example, a smaller angle β represents that these two vectors are more likely to run in opposite directions, as shown in Fig. 4.

Thirdly, the corners of the obstacle share the same obstacle-vanishing-points (OVPs):

$$\lambda_3 = O^g \,\Theta\, O^h \,\Theta\, O^t \rightarrow 0 \qquad (8)$$

where the OVPs ($O^g$ are the vanishing points of $C^g$ and $O^h$ belong to $C^h$, and here, for example, are O1, O2, O3) of the obstacle can be computed from the corner pairs ($C^g$ and $C^h$). Since all the other corners of the obstacle would share the same OVPs ($O^t$), it is possible to determine the other corners via λ3. Here Θ stands for an operator which determines whether the corners share the same OVPs.
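The pairing tests (6) and (7) can be read as two small geometric residuals standing in for the operator Θ, as in the rough sketch below. The residual definitions, the tolerances, and the function names are our assumptions for illustration; they are not the authors' code, and the shared-OVP test (8) would additionally require the vanishing points computed from the corner pairs.

```python
import numpy as np

def _unit(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def _cross2(u, v):
    # z-component of the 2D cross product
    return u[0] * v[1] - u[1] * v[0]

def collinearity_residual(Lr_g, Lr_h):
    """lambda_1 of (6): small when the two L_r segments lie on one common line."""
    a, b = np.asarray(Lr_g[:2], float), np.asarray(Lr_g[2:], float)
    c, d = np.asarray(Lr_h[:2], float), np.asarray(Lr_h[2:], float)
    d1, d2 = _unit(b - a), _unit(d - c)
    # parallel directions, and an endpoint of L_r^h lying on the line through L_r^g
    return abs(_cross2(d1, d2)) + abs(_cross2(d1, _unit(c - a)))

def opposite_direction_residual(cg_center, pr_g, ch_center, pr_h):
    """lambda_2 of (7): the corner-to-midpoint vectors should be anti-parallel (angle -> pi)."""
    vg = _unit(np.asarray(pr_g, float) - np.asarray(cg_center, float))
    vh = _unit(np.asarray(pr_h, float) - np.asarray(ch_center, float))
    angle = np.arccos(np.clip(np.dot(vg, vh), -1.0, 1.0))
    return float(np.pi - angle)

def is_corner_pair(Lr_g, Lr_h, cg_center, pr_g, ch_center, pr_h,
                   tol_line=0.05, tol_angle=np.deg2rad(10)):
    """Accept the pair G = {C^g, C^h} of (5) when both residuals are small."""
    return (collinearity_residual(Lr_g, Lr_h) < tol_line and
            opposite_direction_residual(cg_center, pr_g, ch_center, pr_h) < tol_angle)
```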



Fig. 4. Non-Manhattan estimation. (a) Collinear condition; (b) Opposite directions; (c) Sharing the OVPs; (d) Estimation of non-Manhattan obstacles.

A smaller λ3 implies that the corners (e.g., in yellow) are more likely to share the same OVPs with the obstacle, as shown in Fig. 4. Therefore, it is possible to determine the planes of non-Manhattan obstacles via corner pairs.

C. Difference Between Robot and Non-Manhattan Obstacles

In facing non-Manhattan obstacles, robots should recognize the pose of non-Manhattan obstacles and turn the robot's orientation so as to avoid them. Firstly, M is defined as follows:

$$M = [M_x, M_y] \qquad (9)$$

where M is the VP (among the OVPs) that is closest to the center of the frame.

Since size and depth cannot be computed accurately from a single image, the relative relationship between the projected vertex and the VPs must be estimated. Since the camera coordinate system and the world coordinate system are not coincident, the optical center and the VP are non-coincident; thus, the estimated non-Manhattan obstacles can be considered as ones that are translated and rotated relative to the coordinates. Assuming that the maximum depth is D, the angles of rotation can be approximately estimated by solving the following:

$$\eta = \operatorname{atan}\left(\frac{M_x}{D}\right), \qquad \varphi = \operatorname{atan}\left(\frac{M_y}{\sqrt{D^2 + M_x^2}}\right) \qquad (10)$$

where η and ϕ represent the horizontal and vertical angles of rotation, respectively. For the frame Ft, the following can be approximated:

$$D_t = [\eta_t, \varphi_t] \qquad (11)$$

where Dt represents the difference between the orientation of the robot and the pose of the non-Manhattan obstacle.

D. Turning

Here Dt = [ηt, ϕt] represents the horizontal and vertical angles of rotation, respectively. As shown in Fig. 5, since the point here is to determine whether turning left or right is needed to avoid the obstacle, ηt, representing the horizontal angle of rotation, is key to the difference between the orientation of the robot and the pose of the non-Manhattan obstacle. In other words, the difference Dt can be approximately seen as ηt, which helps the robot determine whether to turn left or right.

According to Dt, it is possible for a vision-based robot to determine whether and how to turn. In order to avoid non-Manhattan obstacles, the robot orientation (RO) should be adjusted to keep pace with the pose of the non-Manhattan obstacle. Thus, the motion rules of a robot in the camera coordinate system are as shown in Table II.

As shown in Fig. 5, it was assumed that the robot path is A → B → C. When the robot was at point A, it first understood the indoor scene from the captured frame and calculated the geometric posture difference DtA between the obstacle (i.e., the non-Manhattan structure) and its orientation. According to Table II, it is possible for the robot to determine how to turn its orientation (e.g., turn right here). Then, the robot arrives at point B, and retries to determine whether or not and how it should turn so that it can avoid the obstacle, which can be considered a second loop. Similarly, DtB and DtC represented the posture differences between the obstacle and the robot orientation at points B and C, respectively.
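As a small numerical sketch of (9)–(11) and of the motion rule in Table II: given the closest OVP M = [Mx, My] and an assumed maximum depth D, the horizontal angle ηt is computed and used as Dt to choose the action. The function names, the tolerance, and the example value of D are our assumptions; only the formulas and the left/right convention come from the text.

```python
import math

def posture_angles(Mx, My, D):
    """Eq. (10): horizontal (eta) and vertical (phi) rotation angles, in radians."""
    eta = math.atan(Mx / D)
    phi = math.atan(My / math.sqrt(D ** 2 + Mx ** 2))
    return eta, phi

def decide_motion(Mx, My, D, eps=1e-3):
    """Table II, with Dt approximated by eta_t (Section III-D)."""
    eta, _ = posture_angles(Mx, My, D)
    if abs(eta) < eps:
        return "forward", 0.0
    return ("turn right" if eta > 0 else "turn left"), abs(eta)

# Hypothetical usage (D is not published; 200 is an arbitrary stand-in):
# action, angle = decide_motion(861.77, 0.0, 200.0)
```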


Fig. 5. An example of avoiding an obstacle of non-Manhattan structure.

TABLE II
Turn Motion Mode in the Camera Coordinate System

Difference    Motion        Angle
Dt < 0        turn left     abs(Dt)
Dt > 0        turn right    abs(Dt)
Dt = 0        forward       NA

Fig. 6. The decreasing posture difference between the obstacle and robot orientation.

Based on its understanding of the indoor scene, the robot can turn its orientation in order to keep pace with the posture of different structures (Manhattan or non-Manhattan). The turning of its orientation can be modeled as a convergence of the function Dt, as shown in Fig. 6. With a converging value for the posture difference, the robot can adjust its orientation step by step. As Dt → 0, the robot's orientation is in accordance with the posture of the obstacle, allowing it to avoid the obstacle by itself.

IV.  Experimental Results

We design experiments to evaluate the performance of the robot in avoiding non-Manhattan obstacles through the proposed approach. The focus of the experiments is to evaluate the algorithms underlying the execution of a real robot mounted with only one camera. The goals of the experiments are to evaluate not only the performance in detecting non-Manhattan obstacles in indoor settings, but also the ability to avoid such non-Manhattan obstacles by turning the robot's orientation via Dt.

A. Performance of Detecting Non-Manhattan Obstacles

For an input image that contains many occlusions and clutter, our method copes with the clutter without prior training. Based on geometric constraints of spatial corners, our approach not only detects the obstacles satisfying the Manhattan assumption, but also can estimate the pose of the obstacles, especially non-Manhattan obstacles.

We compare the obstacles estimated by our algorithm against the ground truth, measuring the corner error by summing up all Euclidean distances between the estimated corners and the associated ground-truth corners. The performance on the LSUN dataset [44] is compared in Table III. Although the errors of corners appear lower in the methods of [32], [34], they only measured the error of corners that belong to the layout of the indoor setting, without the competence of understanding and estimating corners of non-Manhattan obstacles. However, our method estimates the error of corners that belong to non-Manhattan obstacles, which plays an important role in the navigation of the robot, allowing the robot to avoid non-Manhattan obstacles in the indoor setting.

TABLE III
Performance on the LSUN Dataset

Method                  Corner error (%)
Hedau et al. [42]       15.48
Mallya et al. [43]      11.02
Dasgupta et al. [32]    8.2
Ren et al. [34]         7.95
Wei [5]                 10.86
Our method              9.98

Experimental comparisons were conducted between our method and Wei's method [4], as shown in Fig. 7. In Fig. 7, the image sizes (height and width) for Wei's scene understanding and for our non-Manhattan obstacle detection are the same. For example, the scene understanding (fifth row, fourth column) and the non-Manhattan obstacle detection (sixth row, fourth column) are from the same group of line segments, in which the numbers of line segments are the same. What is different is that some line segments which do not satisfy the constraint of angle projections in the scene understanding (fifth row, fourth column) are eliminated, resulting in fewer displayed lines. Since Wei's method only considers lines belonging to the vanishing points of the layout in indoor scenes, it is prone to failure in detecting non-Manhattan obstacles. However, our method can deal with clutter, and can efficiently detect details, especially non-Manhattan obstacles, without any prior training.

Experimental comparisons were also conducted between our method and Wang's method [5], as shown in Fig. 8. Since Wang's method only considers rectangular projections that belong to the vanishing points of the layout of indoor scenes, it is also difficult for it to detect non-Manhattan obstacles.
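For reference, one plausible way to compute the corner error reported in Table III is to sum the Euclidean distances between estimated and ground-truth corners and express the result as a percentage. The pairing of corners and the normalization by the image diagonal below are our assumptions; the paper only states that the error is a percentage of pixels obtained from the summed distances.

```python
import numpy as np

def corner_error_percent(est_corners, gt_corners, image_size):
    """Corner error as a percentage: summed Euclidean distances between matched
    estimated and ground-truth corners, normalized here by the image diagonal
    (the normalization is our assumption). image_size = (height, width)."""
    est = np.asarray(est_corners, float)
    gt = np.asarray(gt_corners, float)
    dists = np.linalg.norm(est - gt, axis=1)      # one distance per matched corner pair
    diag = float(np.hypot(*image_size))
    return 100.0 * dists.sum() / (len(dists) * diag)
```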
 



Fig. 7. Experimental comparisons. (a) input frames from UCB dataset [26]; (b) understanding of indoor scenes by Wei’s method [4]; (c) non-Manhattan
obstacles estimated by our method; (d) images from Hedau dataset [42]; (e) results estimated by Wei’s method [4]; (f) non-Manhattan structures estimated by
our method.

B. Avoiding Non-Manhattan Obstacles

Here, as shown in Fig. 9, an unmanned aerial vehicle with a two-megapixel fixed camera was used for capturing video. The vision information was transmitted to a computer with an Intel Core i7-6500 CPU at 2.50 GHz.
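The avoidance experiment in this subsection is essentially the understanding-turn loop of Fig. 1 run frame by frame. The sketch below shows that loop under our own assumptions: grab_frame, estimate_difference, turn, and forward are placeholders for the UAV's camera capture, the posture-difference estimation of Section III, and the platform's motion commands, and the maximum per-step turning angle is a hypothetical stand-in for the "prelimited" angle mentioned below.

```python
import math

MAX_TURN = math.radians(15.0)   # hypothetical prelimited turning angle per step
EPS = math.radians(1.0)         # convergence threshold on Dt

def avoid_obstacle_loop(grab_frame, estimate_difference, turn, forward, max_steps=50):
    """Repeat: capture frame F_t, estimate Dt, turn toward the obstacle pose
    (Table II), until Dt converges to ~0 and the robot can move forward."""
    for _ in range(max_steps):
        frame = grab_frame()                 # F_t from the monocular camera
        dt = estimate_difference(frame)      # Dt ~ eta_t (Sections III-C and III-D)
        if abs(dt) < EPS:
            forward()                        # orientation matches the obstacle pose
            return True
        step = math.copysign(min(abs(dt), MAX_TURN), dt)
        turn(step)                           # positive: turn right, negative: turn left (Table II)
    return False
```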



Fig. 8. Experimental comparisons. (a) input images from LSUN dataset [44]; (b) understanding of indoor scenes by Wang’s method [5]; (c) non-Manhattan
obstacles estimated by our method.

Fig. 9. An unmanned aerial vehicle with a two-megapixel fixed camera that was used for capturing video.

Then, our method can efficiently be applied to identify non-Manhattan obstacles in a scene, without any prior training. Take Ft1 as an example in Fig. 10 (first column); it is obvious that there exists a difference between the orientation of the robot and the pose of the non-Manhattan obstacle. Based on the equations above, Dt1 can be approximately estimated (ηt1 = 76.43). According to Table II, the robot understood the scene, identified the non-Manhattan obstacle, and turned right by a prelimited angle (the maximum turning angle prelimited in the robot controller). Then, the robot captured Ft2 (Fig. 10, second column) and entered the next understanding-turn loop. For the frames shown in Fig. 10, non-Manhattan obstacles are detected and the pose differences between the robot's orientation and the non-Manhattan obstacles are estimated in Table IV.

Through successive understanding of the non-Manhattan obstacles and orientation turning, the convergence of the differences (Mx and η) can be seen as a convergence process, as shown in Fig. 11. In the left image of Fig. 11, the horizontal axis indicates the index of the frame in Table IV, and the vertical axis represents the value Mx. Meanwhile, in the right image of Fig. 11, the horizontal axis indicates the index of the frame in Table IV, and the vertical axis represents the value η. With the decreasing values of Mx and η, the orientation of the robot can be adjusted to keep pace with the pose of the detected non-Manhattan obstacles, step by step, avoiding the non-Manhattan obstacles by itself.

Obviously, in facing the non-Manhattan obstacle, the pose difference between the robot orientation and the non-Manhattan obstacle can be approximately estimated so as to determine whether or not and how to change the robot orientation in order to avoid the obstacles.

V.  Conclusion

The current work presents an approach for home mobile robots to avoid non-Manhattan obstacles in indoor environments using a monocular camera. The method first detects projections of spatial right-corners and estimates their position and orientation in three-dimensional scenes. Accordingly, it is possible to model non-Manhattan obstacles via the projections of corners. Then, based on understanding such non-Manhattan obstacles, the difference between the robot orientation and the posture of the obstacles can be estimated via geometric features and constraints.



Fig. 10. Pose difference estimation. (a), (c) input frames; (b), (d) pose differences (Mx and η) between the robot orientation and the non-Manhattan obstacles.

Finally, according to the difference, it is possible for the robot to determine whether and how to turn its orientation so as to keep pace with the posture of the detected non-Manhattan obstacles, making it possible to avoid such obstacles. Instead of data-driven approaches, the proposed method requires no prior training. With the use of geometric inference, the presented method is robust against changes in illumination and color. Furthermore, without any knowledge of the camera's internal parameters, the algorithm is more practical for robotic application in navigation. In addition, using features from a monocular camera, the approach is robust to errors in calibration and image noise. Without other external devices, this method has the advantages of lower investment and energy efficiency. The experiments measure the error of corners by comparing the corners of non-Manhattan obstacles estimated by our algorithm against the ground truth. Moreover, we demonstrated the validity of avoiding obstacles via the convergence of the difference between the robot orientation and the non-Manhattan obstacle posture. The experimental results showed that our method can understand and avoid non-Manhattan obstacles, meeting the requirements of indoor robot navigation.



Fig. 11. Convergence. (a) convergence curve of Mx; (b) convergence curve of η.

TABLE IV
Difference Between the Robot Orientation and the Non-Manhattan Obstacles

Frame    Mx        η        Motion
1        861.77    76.43    turn right
2        768.64    72.65    turn right
3        693.24    70.90    turn right
4        516.5     65.07    turn right
5        279.43    49.34    turn right
6        257.28    46.99    turn right
7        233.17    44.17    turn right
8        164.82    34.48    turn right

References

[1] E. J. Gibson and R. D. Walk, “The visual cliff,” Sci. American, vol. 202, pp. 64–71, 1960.
[2] J. J. Koenderink, A. J. V. Doorn, and A. M. Kappers, “Pictorial surface attitude and local depth comparisons,” Percept. Psychophys., vol. 58, no. 2, pp. 163–173, 1996.
[3] Z. J. He and K. Nakayama, “Visual attention to surfaces in three-dimensional space,” Proc. Natl. Acad. Sci. USA, vol. 92, no. 24, pp. 11155–11159, 1995.
[4] H. Wei and L. P. Wang, “Visual navigation using projection of spatial right-angle in indoor environment,” IEEE Trans. Image Processing, vol. 27, no. 7, pp. 3164–3177, 2018.
[5] H. Wei and L. P. Wang, “Understanding of indoor scenes based on projection of spatial rectangles,” Pattern Recognition, vol. 81, pp. 497–514, 2018.
[6] L. Magerand and A. Del Bue, “Revisiting projective structure from motion: A robust and efficient incremental solution,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 430–443, 2020.
[7] H. Mohamed, K. Nadaoka, and T. Nakamura, “Towards benthic habitat 3d mapping using machine learning algorithms and structures from motion photogrammetry,” Remote Sensing, vol. 12, no. 1, pp. 127, 2020.
[8] Y. S. Hung and P. B. Zhang, “An articulated deformable structure approach to human motion segmentation and shape recovery from an image sequence,” IET Computer Vision, vol. 13, no. 3, pp. 267–276, 2018.
[9] K. Sun and W. B. Tao, “A center-driven image set partition algorithm for efficient structure from motion,” Inf. Sci., vol. 479, pp. 101–115, 2019.
[10] M. R. U. Saputra, A. Markham, and N. Trigoni, “Visual SLAM and structure from motion in dynamic environments: A survey,” ACM Comput. Surv., vol. 51, no. 2, pp. 37:1–37:36, 2018.
[11] S. Hong and J. Kim, “Selective image registration for efficient visual SLAM on planar surface structures in underwater environment,” Auton. Robots, vol. 43, no. 7, pp. 1665–1679, 2019.
[12] S. P. Li, T. Zhang, X. Gao, D. Wang, and Y. Xian, “Semi-direct monocular visual and visual-inertial SLAM with loop closure detection,” Robotics and Autonomous Systems, vol. 112, pp. 201–210, 2019.
[13] L. H. Xiao, J. G. Wang, X. S. Qiu, Z. Rong, and X. D. Zou, “Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment,” Robotics and Autonomous Systems, vol. 117, pp. 1–16, 2019.
[14] R. H. Li, S. Wang, and D. B. Gu, “Ongoing evolution of visual SLAM from geometry to deep learning: Challenges and opportunities,” Cognitive Computation, vol. 10, no. 6, pp. 875–889, 2018.
[15] Y. Wei, J. Yang, C. Gong, S. Chen, and J. J. Qian, “Obstacle detection by fusing point clouds and monocular image,” Neural Processing Letters, vol. 49, no. 3, pp. 1007–1019, 2019.
[16] A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Depth perception from a single still image,” AAAI, vol. 3, pp. 1571–1576, 2008.
[17] A. Saxena, M. Sun, and A. Y. Ng, “Learning 3-d scene structure from a single still image,” in Proc. 11th IEEE Int. Conf. Computer Vision. IEEE, pp. 1–8, 2007.
[18] E. Delage, L. Honglak, and A. Y. Ng, “A dynamic bayesian network model for autonomous 3d reconstruction from a single indoor image,” CVPR, vol. 2, pp. 2418–2428, 2006.
[19] B. Liu, S. Gould, and D. Koller, “Single image depth estimation from predicted semantic labels,” CVPR, vol. 119, no. 5, pp. 1253–1260, 2010.
[20] A. Shariati, B. Pfrommer, and C. J. Taylor, “Simultaneous localization and layout model selection in Manhattan worlds,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 950–957, 2019.
[21] J. Straub, O. Freifeld, G. Rosman, J. J. Leonard, and J. W. F. III, “The Manhattan frame model – Manhattan world inference in the space of surface normals,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 1, pp. 235–249, 2018.
[22] L. D. Pero, J. Y. Guan, E. Brau, J. Schlecht, and K. Barnard, “Sampling bedrooms,” CVPR, vol. 1, pp. 2009–2016, 2011.
[23] L. D. Pero, J. Bowdish, D. Fried, B. Kermgard, E. Hartley, and K. Barnard, “Bayesian geometric modeling of indoor scenes,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition. IEEE, pp. 2719–2726, 2012.
[24] L. D. Pero, J. Bowdish, B. Kermgard, E. Hartley, and K. Barnard, “Understanding Bayesian rooms using composite 3d object models,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition. IEEE, pp. 153–160, 2013.
[25] D. C. Lee, M. Hebert, and T. Kanade, “Geometric reasoning for single image structure recovery,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition. IEEE, pp. 2136–2143, 2009.
[26] S. X. Yu, H. Zhang, and J. Malik, “Inferring spatial layout from a single image via depth-ordered grouping,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops. IEEE, pp. 1–7, 2008.
[27] J. Li, C. Yuce, R. Klein, and A. Yao, “A two-streamed network for estimating fine-scaled depth maps from single RGB images,” Computer Vision and Image Understanding, vol. 186, pp. 25–36, 2019.
[28] S. H. Ding, Q. Zhai, Y. Li, J. D. Zhu, Y. F. Zheng, and D. Xuan, “Simultaneous body part and motion identification for human-following robots,” Pattern Recognition, vol. 50, pp. 118–130, 2016.
[29] Z. Y. Jia, A. Gallagher, A. Saxena, and T. Chen, “3d-based reasoning with blocks, support, and stability,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition. IEEE, pp. 1–8, 2013.
[30] D. Lee, A. Gupta, M. Hebert, and T. Kanade, “Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces,” NIPS, pp. 1288–1296, 2010.
[31] V. Hedau, D. Hoiem, and D. Forsyth, “Thinking inside the box: Using appearance models and context based on room geometry,” in Proc. European Conf. Computer Vision: Part VI. Berlin, Heidelberg, Germany: Springer, pp. 224–237, 2010.
[32] S. Dasgupta, K. Fang, K. Chen, and S. Savarese, “Delay: Robust spatial layout estimation for cluttered indoor scenes,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition. IEEE, pp. 616–624, 2016.
[33] C. H. Zou, A. Colburn, Q. Shan, and D. Hoiem, “Layoutnet: Reconstructing the 3d room layout from a single rgb image,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018, pp. 2051–2059.
[34] Y. Z. Ren, S. W. Li, C. Chen, and C.-C. J. Kuo, “A coarse-to-fine indoor layout estimation (CFILE) method,” in Proc. Asian Conf. Computer Vision, Springer, Cham, pp. 36–51, 2016.
[35] P. Miraldo, F. Eiras, and S. Ramalingam, “Analytical modeling of vanishing points and curves in catadioptric cameras,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018, pp. 2012–2021.
[36] H. Howard-Jenkins, S. Li, and V. Prisacariu, “Thinking outside the box: Generation of unconstrained 3d room layouts,” in Proc. Asian Conf. Computer Vision. Perth, Australia: Springer, 2018, pp. 432–448.
[37] X. T. Li, S. F. Liu, K. Kim, X. L. Wang, M. H. Yang, and J. Kautz, “Putting humans in a scene: Learning affordance in 3d indoor environments,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition. Long Beach, CA, USA: IEEE, 2019, pp. 12368–12376.
[38] A. Atapour-Abarghouei and T. P. Breckon, “Veritatem dies aperit – temporally consistent depth prediction enabled by a multi-task geometric and semantic scene understanding approach,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition. Long Beach, CA, USA: IEEE, 2019, pp. 3373–3384.
[39] M. H. Zhai, S. Workman, and N. Jacobs, “Detecting vanishing points using global image context in a non-Manhattan world,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 5657–5665.
[40] J. Lee and K. Yoon, “Joint estimation of camera orientation and vanishing points from an image sequence in a non-Manhattan world,” Int. J. Computer Vision, vol. 127, no. 10, pp. 1426–1442, 2019.
[41] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “From contours to regions: An empirical evaluation,” in Proc. Asian Conf. Computer Vision, Springer, Cham, pp. 2294–2301, 2009.
[42] V. Hedau, D. Hoiem, and D. Forsyth, “Recovering the spatial layout of cluttered rooms,” in Proc. 12th IEEE Int. Conf. Computer Vision, Kyoto, Japan: IEEE, pp. 1849–1856, 2009.
[43] A. Mallya and S. Lazebnik, “Learning informative edge maps for indoor scene layout prediction,” in Proc. IEEE Int. Conf. Computer Vision. Santiago, Chile: IEEE, pp. 936–944, 2015.
[44] Y. Zhang, F. Yu, S. Song, P. Xu, A. Seff, and J. Xiao, Large-scale Scene Understanding Challenge: Room Layout Estimation, 2016.

Luping Wang received the Ph.D. degree in the Department of Computer Science and Engineering, Fudan University in 2019. Since August 2019, he has been with the Department of Electrical Engineering, University of Shanghai for Science and Technology. His research interests include scene understanding, robotics, navigation, computer vision, pattern recognition, and artificial intelligence.

Hui Wei received the Ph.D. degree in the Department of Computer Science, Beijing University of Aeronautics and Astronautics in 1998. From 1998 to 2000, he was a Postdoctoral Fellow in the Department of Computer Science and the Institute of Artificial Intelligence, Zhejiang University. Since November 2000, he has been with the Department of Computer Science and Engineering, Fudan University. His research interests include artificial intelligence and cognitive science.
