
Autonomous Robots (2021) 45:493–504

https://doi.org/10.1007/s10514-021-09979-4

Semantic visual SLAM in dynamic environment


Shuhuan Wen 1,2 · Pengjiang Li 1,2 · Yongjie Zhao 1,2 · Hong Zhang 3 · Fuchun Sun 4 · Zhe Wang 1,2

Received: 2 November 2019 / Accepted: 26 February 2021 / Published online: 4 May 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021

Abstract
Human-computer interaction requires accurate localization and effective mapping, while dynamic objects can degrade the accuracy of localization and mapping. State-of-the-art SLAM algorithms assume that the environment is static. This paper proposes a new SLAM method that uses Mask R-CNN to detect dynamic objects in the environment and build a map containing semantic information. In our method, the reprojection error, photometric error and depth error are used to assign a robust weight to each keypoint. Thus, the dynamic points and the static points can be separated, and the geometric segmentation of the dynamic objects can be realized by using the dynamic keypoints. Each pixel is assigned a semantic label to rebuild a semantic map. Finally, our proposed method is tested on the TUM RGB-D dataset, and the experimental results show that the proposed method outperforms state-of-the-art SLAM algorithms in dynamic environments.

Keywords Reprojection error · Photometric error · Depth error · Dynamic target detection · Semantic SLAM

Corresponding author: Shuhuan Wen, swen@ysu.edu.cn
Pengjiang Li: b1848643747@163.com
Yongjie Zhao: smxshuimuxiao@163.com
Hong Zhang: hzhang@ualberta.ca
Fuchun Sun: fcsun@tsinghua.edu.cn

1 Engineering Research Center of the Ministry of Education for Intelligent Control System and Intelligent Equipment, Yanshan University, Qinhuangdao, People's Republic of China
2 Key Laboratory of Industrial Computer Control Engineering of Hebei Province, Yanshan University, Qinhuangdao, China
3 Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada
4 Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Department of Computer Science and Technology, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, People's Republic of China

1 Introduction

In recent years, SLAM has been widely used in many fields, such as service robots, warehouse robots, AR, and VR. SLAM technology can realize localization and mapping using information from various sensors, including laser, camera and IMU. Visual SLAM has attracted many researchers given its rich environmental information and low price.

Visual SLAM obtains environmental information through a camera, which may be a monocular camera, an RGB-D camera or a binocular camera. Early visual SLAM algorithms used a monocular camera to obtain environmental information and solve localization and mapping. Monocular cameras suffer from scale uncertainty, while binocular or RGB-D cameras effectively solve this problem. RGB-D cameras not only capture the RGB image of the environment but also directly measure the depth of an object, which effectively resolves the scale uncertainty.

Visual SLAM is mainly divided into feature-based and direct methods. A feature-based method estimates the camera pose by extracting and matching feature points, and it is robust in environments with rich texture, such as ORB-SLAM2 (Mur-Artal & Tardós, 2017). A direct method estimates the camera pose by minimizing the photometric error between the feature points and the projection points, and dense or semi-dense maps can also be obtained, e.g., LSD-SLAM (Engel et al., 2014), PTAM (Klein & Murray, 2007) and DSO (Engel et al., 2017). Most of the existing algorithms assume that the environment is static; however, dynamic objects exist in the real world, such as walking people, moving vehicles, etc. Dynamic objects in the environment will reduce the robustness, and it is difficult to perform


accurate localization and mapping. Therefore, how to obtain accurate localization and mapping in a dynamic environment remains a challenge.

The ability of a robot to understand the environment is the premise of better navigation and human-computer interaction. Traditional SLAM mapping only characterizes an environment geometrically and lacks semantic understanding of the environment. How to localize and build a semantic map accurately in a dynamic environment remains a challenging task. As deep learning has developed, it has made great achievements in target detection and environment segmentation. Therefore, it is worthwhile to combine SLAM with deep learning to realize accurate localization and establish semantic maps in dynamic environments.

In this paper, we propose a novel method to reduce the influence of dynamic objects on localization and mapping. We combine deep learning with the proposed geometric method to detect dynamic targets in the environment and achieve accurate localization and mapping. The main contributions of this paper are as follows:

(1) We propose a new framework that removes the mismatching points in feature matching. We employ Mask R-CNN to perform semantic segmentation of the image. The semantic segmentation information removes the dynamic points to obtain an accurate fundamental matrix based on static features.
(2) We improve the multi-view geometric method combined with Mask R-CNN, which unites the advantages of the two methods to detect dynamic targets better. We use depth error, photometric error and reprojection error to assign a robust weight to each point in order to separate static and dynamic points.
(3) We propose a novel semantic SLAM to remove the influence of dynamic objects on localization and mapping. We use Mask R-CNN with geometric constraints as a semantic segmentation network integrated with SLAM to build a semantic map (of 4 classes including background) by passing the labels to an octree.

The remainder of this paper is organized as follows. In Sect. 2, we discuss related work. In Sect. 3, we describe the proposed algorithms. In Sect. 4, we show the results of experiments and evaluation. In Sect. 5, we provide conclusions.

2 Related work

The image obtained by a camera is a projection of points from three-dimensional space onto a two-dimensional plane. The position of dynamic objects in space changes with time, which decreases the localization accuracy of a SLAM algorithm. Most of the existing SLAM algorithms assume that the environment is static, or some dynamic points are eliminated by the RANSAC (Fischler & Bolles, 1981) algorithm. However, this method can only remove a few dynamic points. In recent years, some algorithms have been proposed to deal with the problem of localization in dynamic environments.

2.1 Visual SLAM in dynamic environment

Kerl et al. (2013a) found through statistical analysis that the photometric error follows a t distribution. Based on this characteristic, a weighted least squares error function is constructed for accurate localization. Kerl et al. (2013b) extended this work and added a depth error to obtain a more robust result. Li and Lee (2017) proposed a static weighting method for keyframe edge points that measures the probability that a spatial point is static. Tan et al. (2013) proposed using the keyframe closest to the current frame as the reference frame. The spatial points corresponding to the feature points of the reference frame are projected into the current frame for comparison of structure and appearance, and the dynamic points are screened out. Wang et al. (2019) completed the depth image, then used the k-means clustering algorithm to segment the depth image, and used geometric constraints to determine dynamic objects. Cui and Wen (2019) determined whether the feature points of the reference frame are dynamic points by comparing their appearance with the image blocks centered on the projection points of the three-dimensional points in the current frame. Xu et al. (2018) proposed an improved ViBe algorithm to detect dynamic objects in the environment, and the robustness of the RGB-D SLAM algorithm is improved by eliminating dynamic points. Alcantarilla et al. (2012) used a dense scene flow representation to obtain the probability of object motion and used these probabilities to identify the moving objects in the scene. Finally, the features on the moving objects are deleted, and improved 3D reconstruction results can be obtained. Sun et al. (2017) used the frame difference method to detect dynamic targets in the scene and then used particle filters to track them to enhance moving target detection. Finally, the maximum a posteriori estimate is applied to the vector-quantized depth image to accurately determine the foreground. Bakkay et al. (2015) used a spatiotemporal filter to filter the acquired depth image and then obtained a dense scene flow from the input RGB image and the filtered depth image. Finally, the filtered depth image and the scene flow information are used to segment the dynamic objects in the scene. However, these algorithms don't take advantage of the rich semantic information in the scene.

2.2 Semantic SLAM in dynamic environment

Yu et al. (2018) detected some dynamic points and mismatched points by epipolar geometric constraints and then segmented possible dynamic targets using a neural network. If points that do not satisfy the epipolar line constraint are located on one of these objects, it is a dynamic object. However, this method mainly depends on the neural network, and only a few dynamic objects can be detected using epipolar constraints alone. In (Bescos et al., 2018), the authors used the semantic information provided by Mask R-CNN (He et al., 2017) to remove part of the dynamic feature points in the reference frame, then projected the remaining features into the current frame and filtered dynamic features using the difference between the real depth and the projected depth. The algorithm is affected by the accuracy of the semantic segmentation network and uses less feature point information. Zhang et al. (2019) use Mask R-CNN to perform semantic segmentation, which cannot segment a book carried in one's hand. Xu et al. (2019) can realize dense reconstruction of the static background, and the proposed system is robust in dynamic scenes, but a moving cup is also reconstructed and included in the map.

At present, most of the existing algorithms only realize the detection of dynamic objects but do not contain an understanding of environmental information. Alternatively, most algorithms just build a semantic map but do not deal with dynamic objects in the environment very well. Other methods for detecting dynamic objects mostly rely on the results of semantic segmentation.

In this paper, a new method combining semantic segmentation and geometric constraints is proposed to detect dynamic objects in the environment. The feature points on the dynamic objects are removed to obtain accurate localization, and a semantic map containing only the static environment is built. Our proposed method is evaluated on the TUM RGB-D dataset. The results show that the proposed method is more robust than other methods in dynamic environments.

3 Method description

As shown in Fig. 1, our proposed method uses an RGB-D camera to obtain RGB images and depth images. Then Mask R-CNN is used for semantic segmentation of the image to extract the potential dynamic objects and initialize an accurate camera pose. We use the LK optical flow pyramid (Bouguet, 2001) algorithm to track the feature points of the reference frame and obtain an accurate fundamental matrix according to the semantic segmentation results. We propose a new geometric method to detect dynamic points and combine the semantic segmentation results to remove environmental dynamic feature points. Finally, an octree map (Hornung et al., 2013) is built such that a semantic map with only the static environment is obtained by assigning a semantic tag to each point of the map. The proposed framework is introduced as follows.

Fig. 1 The framework of our proposed algorithm. Mask R-CNN is used to implement semantic segmentation and provide potential dynamic target information. First, combined with the potential dynamic target information provided by Mask R-CNN, a robust pose initialization is completed. The LK algorithm is used to track the matching points of the reference frame feature points in the current frame. Finally, the geometric method is used to assist Mask R-CNN in detecting the dynamic target, and a thread is added to establish the semantic map

3.1 Semantic segmentation and initialization of camera pose

In this paper, we use Mask R-CNN to realize image semantic segmentation. Mask R-CNN not only realizes pixel-level semantic classification but also detects dynamic objects, such as a walking person, a moving vehicle, a bicycle, etc. It has the same high detection accuracy for static objects, such as computers, chairs, books, etc.

We implement Mask R-CNN based on TensorFlow. An RGB image is used as the input for the neural network, and the images are classified at the pixel level by Mask R-CNN to realize semantic segmentation. Finally, we label the dynamic and static targets in the segmentation results to improve the localization accuracy. Mask R-CNN is trained on the COCO dataset, which covers a total of 80 kinds of targets, such as people, chairs, monitors, books, cars, horses and others. Considering that the algorithm is mainly for indoor scenes, this paper focuses on segmenting three types of targets — pedestrians, chairs, and displays — treating detected pedestrians as dynamic targets and chairs and displays as static targets, and adds their semantic information to the dense map. The Mask R-CNN weight parameters use the Matterport open-source pretrained weight values (https://github.com/matterport/Mask_RCNN).

State-of-the-art SLAM algorithms based on feature points find the matched point pairs of two frames when the pose is initialized. Then, some mismatched point pairs and dynamic point pairs are removed by the RANSAC algorithm. However, when there are many dynamic objects, the initial pose is not accurate. To initialize a relatively robust camera pose, we use the Mask R-CNN segmentation results. The feature points on the dynamic targets are removed, and only the static feature points are retained. Then, the camera pose is initialized by matching the remaining static feature points. In Fig. 2, each frame of the input RGB image (Fig. 2a) yields a classified image (Fig. 2b) containing semantic information, and then the feature points of the dynamic object are removed (Fig. 2d). A relatively robust camera pose is initialized.

In the pose initialization phase, we remove the feature points that Mask R-CNN detects as lying on pedestrian targets. Because the Mask R-CNN network has certain missed-detection problems, it cannot segment the complete dynamic target in some frames (such as the second and last columns in Fig. 2), so dynamic point culling at the initial stage may be incomplete; pedestrians can be well detected most of the time (such as the first and third columns in Fig. 2). The pose obtained at this stage is only the initial pose of the algorithm.
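As a concrete illustration of this initial culling step, below is a minimal sketch in Python with OpenCV. The function name and the representation of the masks (boolean arrays, one per detected pedestrian, e.g., as produced by the Matterport implementation) are our own assumptions, not code from the paper.

```python
import cv2
import numpy as np

def cull_dynamic_keypoints(gray, person_masks, n_features=1000):
    """Drop ORB keypoints that fall inside any detected 'person' mask.

    gray:         (H, W) uint8 grayscale frame
    person_masks: list of (H, W) boolean arrays, one per detected pedestrian
    Returns the static keypoints and their descriptors.
    """
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints = orb.detect(gray, None)

    # Union of all potential dynamic-object masks
    dynamic = np.zeros(gray.shape, dtype=bool)
    for m in person_masks:
        dynamic |= m

    static = [kp for kp in keypoints
              if not dynamic[int(kp.pt[1]), int(kp.pt[0])]]
    return orb.compute(gray, static)  # (keypoints, descriptors)
```

Pose initialization then proceeds on the returned static keypoints only.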


On the basis of the algorithm proposed in this paper, the complete dynamic target will be further detected, all dynamic features will be removed, and then the pose will be optimized. Figure 2b shows that the dynamic target detected using only Mask R-CNN is incomplete, so we propose a new geometric method to assist the neural network in detecting dynamic targets in the environment. Our approach is described in Sect. 3.2.

3.2 Geometric method

In this paper, Mask R-CNN is used to detect dynamic objects in the environment, such as walking people, moving vehicles, etc. However, if an object does not have an autonomous movement attribute, i.e., its movement is caused by the movement of other objects, then the moving object cannot be detected by Mask R-CNN — for example, a book or a cup carried by a walking person. A target detection algorithm based on a neural network trains the network on a large quantity of samples and constantly adjusts the parameters to make the network achieve high recognition accuracy. However, training samples with high clarity are generally required, and target detection is performed by extracting the features of the image target. When an object moves fast, its image appears blurry, and it is difficult to detect moving targets from blurred features using neural network methods. We propose a new geometric method based on (Kerl et al., 2013a) to solve these problems. We introduce the proposed geometric method as follows.

Because the image changes with time and the camera moves slowly, there is minimal difference between two adjacent frames. We take the first frame of the image as the reference keyframe. For each subsequent input frame, we judge whether the frame is a keyframe. If the current frame is a keyframe and there are 5 keyframes between the current keyframe and the previous reference keyframe (Tan et al., 2013), the keyframe is updated to be the reference keyframe.

Each pixel point $x = (u, v, 1)^T$ in the image is a projection of a three-dimensional space point, and the camera projection model is obtained as follows:

$$x = f(T, P) = KTP = K(RP + t) \quad (1)$$

where $T \in SE(3)$ is the camera pose, including the camera rotation matrix $R$ and translation vector $t$, and $K$ is the camera intrinsic parameter matrix. By using the inverse of the projection transformation combined with the depth value of the keypoints, the three-dimensional coordinates corresponding to the keypoints can be obtained as follows:

$$P = f^{-1}(T, x) = T^{-1} K^{-1} x \quad (2)$$

where $f_x$, $f_y$ and $c_x$, $c_y$ are the intrinsic parameters of the camera, which can be determined by camera calibration before the experiment.
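A minimal sketch of the projection and back-projection of Eqs. (1) and (2), assuming a standard pinhole model with the depth supplied by the RGB-D sensor (the function names and the explicit depth argument are our additions):

```python
import numpy as np

def project(K, R, t, P_world):
    """Eq. (1): x = K (R P + t); returns the pixel (u, v) and the depth."""
    P_cam = R @ P_world + t
    x = K @ P_cam
    return x[:2] / x[2], P_cam[2]

def backproject(K, R, t, uv, depth):
    """Eq. (2): recover the 3D point from a pixel and its measured depth."""
    x_h = np.array([uv[0], uv[1], 1.0])
    P_cam = depth * (np.linalg.inv(K) @ x_h)   # ray scaled by depth
    return R.T @ (P_cam - t)                   # back to world coordinates
```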


Fig. 2 Semantic segmentation and preliminary removal of dynamic points. a The original image of the input frame. b Representation of the semantic segmentation images. We mark people in red, computers in green, and chairs in blue. c All the extracted keypoints (static points and dynamic points). d The results of using the semantic information to initially remove some of the dynamic points

Using Eqs. (1) and (2), all keypoints (including dynamic and static feature points) in the reference frame can be back-projected into three-dimensional space to obtain the corresponding three-dimensional space points. These spatial points are projected into the current frame to obtain the projection points $x'_i$. If the camera pose estimate is accurate and the spatial point $P_i$ is a static point, the luminance residual between the keypoint $x_i$ and the projection point $x'_i$ should be 0. The luminance residual $E_{I_i}$ of keypoint $i$ is defined as follows:

$$E_{I_i} = I(x_i) - I(x'_i) \quad (3)$$

where $I(\cdot)$ is the pixel value. The luminance residuals of keypoints and projection points obey a t distribution (Kerl et al., 2013a), so a robust static weight value can be assigned to each keypoint as follows:

$$w(x_i) = \frac{v + 1}{v + \left( \frac{E_{I_i}}{\sigma} \right)^2} \quad (4)$$

where $v$ is the degree of freedom associated with the t distribution and $\sigma$ is the residual variance.
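In code, the weight of Eq. (4) is essentially a one-liner; this hedged sketch (our own function name; the residuals are assumed to be gathered elsewhere) shows how it would be applied to a vector of luminance residuals:

```python
import numpy as np

def photometric_weights(E_I, sigma, dof):
    """Eq. (4): robust t-distribution weight for each luminance residual.

    E_I:   array of residuals I(x_i) - I(x'_i)
    sigma: scale of the residuals, estimated from the data
    dof:   degrees of freedom v of the t-distribution
    """
    return (dof + 1.0) / (dof + (np.asarray(E_I) / sigma) ** 2)
```

Large residuals receive weights near zero, so likely-dynamic points barely influence the pose estimate.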


We add two new constraints based on (Kerl et al., 2013a): a depth error constraint and a reprojection error constraint. For each keypoint $x_i$ in the reference keyframe, we can find its matching point $x_{mi}$ in the current frame. If the spatial point corresponding to the keypoint is a static point, then the projection point $x'_i$ and the matching point $x_{mi}$ should be the same point. However, allowing for estimation error, we define the reprojection error $E_{C_i}$ as follows:

$$E_{C_i} = \| x'_i - x_{mi} \|_2 \quad (5)$$

If a static space point is projected into the current frame, its reprojected depth should be the same as the true depth of the matching point. Thus, we can define the depth error $E_{D_i}$ as follows:

$$E_{D_i} = (T_c P_i)_z - D(x_{mi}) \quad (6)$$

where $T_c \in SE(3)$ is the current frame camera pose, $(\cdot)_z$ returns the reprojected depth of the space point $P_i$ in the current frame, and the function $D(\cdot)$ is the actual depth value of the feature point. For static feature points, the errors $E_{I_i}$, $E_{C_i}$ and $E_{D_i}$ should all be close to 0. For dynamic feature points, the three values will be very large.

We also assume that the depth error and reprojection error follow the t distribution and integrate the three error terms to form a new error vector $E = (E_{I_i}, E_{C_i}, E_{D_i})$. For each keypoint, we redefine the static weight as follows:

$$w(x_i) = \frac{v + 1}{v + E^T \Sigma^{-1} E} \quad (7)$$

where $\Sigma$ is the covariance matrix. In this paper, the three error variables are independent of each other, and the degree of freedom associated with the t distribution is set to 11 (Kerl et al., 2013b).

According to the statistical analysis of the residual results, we set a corresponding threshold $\tau$ to distinguish the dynamic feature points from the static feature points. If the static weight value of a keypoint is less than $\tau$, it is a dynamic point; otherwise, it is a static point. If a keypoint in the reference keyframe is a dynamic feature point (or static feature point), its matching point in the current frame is also a dynamic feature point (or static feature point). The screened dynamic feature points are used as the growing points, and the region growing algorithm is used to segment the depth image of the current frame. Finally, the dynamic target can be detected. We combine the results of Mask R-CNN segmentation with the results of geometric segmentation to achieve dynamic target segmentation in the environment. As Fig. 3 shows, it is unreliable to rely only on Mask R-CNN to detect dynamic objects in the environment. The proposed geometric method can effectively detect fast moving objects in the environment, but it is less effective for slowly moving objects in the distance. In this paper, the combination of the proposed geometric method and Mask R-CNN can realize complete dynamic target detection in the environment and improve the robustness of the algorithm.

3.3 Feature point tracking and outlier removal

We adopt the LK optical flow pyramid algorithm to track the reference keyframe feature points (including dynamic points and static points) in the current frame. Due to lighting, etc., there will be some mismatches in the matching results that need to be eliminated. If the keypoint $x_i$ in the reference keyframe and the tracked matching point $x_{mi}$ in the current frame are correctly matched, they should satisfy the epipolar constraint as follows:

$$x_{mi}^T F x_i = 0 \quad (8)$$

where $F$ is the fundamental matrix and $F x_i$ is the corresponding epipolar line of the keypoint in the current frame.

We use Eq. (8) to determine whether a point lies on a straight line; the matching point should lie on the epipolar line corresponding to the keypoint of the reference keyframe. To recover an accurate fundamental matrix, Mask R-CNN is used to eliminate dynamic feature points so that a robust fundamental matrix is obtained. Then, the epipolar constraint error is obtained using Eq. (8). If the error is less than a threshold, the keypoints in the two frames are correctly matched; otherwise, they are incorrect. In this way, mismatched points and some dynamic matching points can be effectively removed by the proposed method.

3.4 Dynamic environment semantic map

There are many types of maps, such as metric maps, topological maps, and semantic maps. Different maps can be applied to different environments. At present, the sweeping robot uses a two-dimensional grid map, which can realize two-dimensional plane navigation and obstacle avoidance. An unmanned aerial vehicle (UAV) has a six-degree-of-freedom pose and needs to build a three-dimensional map. Most traditional SLAM algorithms choose to build a point cloud map. Although a point cloud map can reconstruct the environment, it has many shortcomings, such as occupying considerable computing resources and providing no object category information.

With the development of SLAM technology, it is necessary to provide a useful map for the subsequent navigation and obstacle avoidance of a robot. Semantic maps provide environmental semantic information and obstacle avoidance information for a robot.
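Before turning to the map itself, here is a hedged sketch of the dynamic-point classification of Sect. 3.2, stacking the errors of Eqs. (5) and (6) with the photometric error into the weight of Eq. (7). We read Eq. (7) as a Mahalanobis form $E^T \Sigma^{-1} E$; that reading, and the function and variable names, are our assumptions.

```python
import numpy as np

def combined_static_weight(E_I, E_C, E_D, Sigma, dof=11.0):
    """Eq. (7): robust static weight from the stacked error vector.

    E_I, E_C, E_D: photometric, reprojection and depth errors of one keypoint
    Sigma:         3x3 covariance of the three (independent) error terms
    dof:           t-distribution degrees of freedom (set to 11 in the paper)
    """
    E = np.array([E_I, E_C, E_D])
    return (dof + 1.0) / (dof + E @ np.linalg.inv(Sigma) @ E)

# A keypoint is classified as dynamic when its weight drops below tau:
# is_dynamic = combined_static_weight(E_I, E_C, E_D, Sigma) < tau
```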

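For the outlier test of Sect. 3.3, a sketch using OpenCV could look as follows: estimate $F$ from the static matches only (assuming enough of them) and threshold the point-to-epipolar-line distance derived from Eq. (8). The threshold value and function name are illustrative only.

```python
import cv2
import numpy as np

def epipolar_inliers(pts_ref, pts_cur, static_idx, thresh=1.0):
    """Keep matches whose epipolar error of Eq. (8) stays below a threshold.

    pts_ref, pts_cur: (N, 2) float32 matched pixel coordinates
    static_idx:       indices of points Mask R-CNN labelled as static
    """
    # Robust fundamental matrix from static correspondences only
    F, _ = cv2.findFundamentalMat(pts_ref[static_idx], pts_cur[static_idx],
                                  cv2.FM_RANSAC)
    ones = np.ones((len(pts_ref), 1))
    x1 = np.hstack([pts_ref, ones])        # homogeneous reference points
    x2 = np.hstack([pts_cur, ones])        # homogeneous current points
    lines = x1 @ F.T                       # epipolar lines F x_i, one per row
    # distance of each current point to its epipolar line
    err = np.abs(np.sum(x2 * lines, axis=1)) / np.hypot(lines[:, 0], lines[:, 1])
    return err < thresh
```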

Fig. 3 Geometric method to detect scene dynamic targets and integrate semantic segmentation results. The first column is the original image. The second column is the semantic image segmented by Mask R-CNN. The third column is the dynamic object detected by Mask R-CNN. The fourth column shows the dynamic feature points we traced using geometric methods. The fifth column is the dynamic target we separated using geometric methods. The sixth column is the result of our integration of neural networks and geometric methods
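The geometric separation shown in the fifth column of Fig. 3 comes from growing regions in the depth image around the screened dynamic points. A minimal breadth-first sketch follows; the tolerance value and the names are our own, not the paper's implementation.

```python
import numpy as np
from collections import deque

def grow_dynamic_region(depth, seeds, depth_tol=0.05):
    """Grow a dynamic-object mask from dynamic keypoints (the growing points).

    depth:     (H, W) depth image in meters
    seeds:     iterable of (row, col) dynamic keypoint locations
    depth_tol: maximum depth difference (m) to absorb a 4-neighbour
    """
    H, W = depth.shape
    mask = np.zeros((H, W), dtype=bool)
    queue = deque()
    for r, c in seeds:
        mask[r, c] = True
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < H and 0 <= nc < W and not mask[nr, nc]
                    and abs(depth[nr, nc] - depth[r, c]) < depth_tol):
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask
```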

In this paper, an octree semantic map is built. The octree map stores the map in the form of an octree, which can compress and update the map very well (Hornung et al., 2013). Due to the use of octree storage, the map is composed of many small cubes. Each of them is assigned a probability that the cube is occupied, and each small cube has three states: occupied, unoccupied, and indeterminate. Thus, different levels of navigation and obstacle avoidance can be achieved by adjusting the resolution.

The key to building the map is updating it. Due to the existence of dynamic objects or noise in the environment, the same block has different occupation probabilities at different times.

At $t = 1, \ldots, T$, the observed data are $z_1, \ldots, z_T$; the occupation probability of the $n$th leaf node is then:

$$P(n \mid z_{1:T}) = \left[ 1 + \frac{1 - P(n \mid z_T)}{P(n \mid z_T)} \cdot \frac{1 - P(n \mid z_{1:T-1})}{P(n \mid z_{1:T-1})} \cdot \frac{P(n)}{1 - P(n)} \right]^{-1} \quad (9)$$

A probability $p$ can be mapped to the real line using the logit transformation as follows:

$$\alpha = \mathrm{logit}(p) = \log\left( \frac{p}{1 - p} \right) \quad (10)$$

where $\alpha$ is the log-odds. The inverse of the logit is:

$$p = \mathrm{logit}^{-1}(\alpha) = \frac{1}{1 + \exp(-\alpha)} \quad (11)$$

In terms of log-odds, we can rewrite Eq. (9) as follows:

$$L(n \mid z_{1:T}) = L(n \mid z_{1:T-1}) + L(n \mid z_T) \quad (12)$$

where $L(n \mid z_{1:T})$ is the log-odds of the $n$th leaf node at time $T$, $L(n \mid z_{1:T-1})$ is the log-odds of the $n$th leaf node at time $T - 1$, and $L(n \mid z_T)$ is the log-odds of the $n$th leaf node given the observation $z_T$ at time $T$. The probability of node occupation is updated by fusing each new observation with the original data. The occupation probability of a parent node can be obtained from its child nodes, and a pixel semantic tag is added to each node to obtain the final semantic map. As shown in Fig. 4, the f3/walking_halfsphere sequence in the TUM RGB-D dataset is built into a semantic octree map. It can be seen that the dynamic objects in the environment have been removed, and other detected objects in the environment are semantically marked.
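The log-odds fusion of Eqs. (10)-(12) reduces the multiplicative update of Eq. (9) to an addition per observation. A minimal sketch of one leaf node (our own class, not OctoMap's API) carrying a semantic label:

```python
import numpy as np

def logit(p):
    """Eq. (10): map a probability to log-odds."""
    return np.log(p / (1.0 - p))

def inv_logit(alpha):
    """Eq. (11): map log-odds back to a probability."""
    return 1.0 / (1.0 + np.exp(-alpha))

class OccupancyLeaf:
    """Per-voxel log-odds fusion, Eq. (12), with a semantic label."""
    def __init__(self, prior=0.5):
        self.L = logit(prior)
        self.label = None            # semantic class of the voxel

    def update(self, p_hit, label=None):
        # Eq. (12): L(n|z_1:T) = L(n|z_1:T-1) + L(n|z_T)
        self.L += logit(p_hit)
        if label is not None:
            self.label = label

    @property
    def occupancy(self):
        return inv_logit(self.L)
```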


Fig. 4 Octree semantic map of the f3/walking_halfsphere sequence in the TUM RGB-D dataset. We build the semantic map and remove the dynamic objects from the environment. The green in the map represents the computer and the blue represents the chair

4 Experimental results

In this section, our method is tested on the TUM RGB-D dataset (Sturm et al., 2012). Numerous sequences in the TUM RGB-D dataset are used, including environments with highly dynamic objects and those with small moving objects. This paper uses the TUM RGB-D sequences containing dynamic targets to verify the effectiveness of the proposed algorithm. Due to the poor robustness of ORB-SLAM2 in dynamic scenes, the performance of ORB-SLAM2 is somewhat different from that reported in the original paper (Mur-Artal & Tardós, 2017).

The TUM RGB-D dataset is published by TUM's Computer Vision Lab and has multiple dynamic sequences, each of which contains RGB images and depth images. The dynamic sequences are mainly divided into "sitting" and "walking". In the "sitting" sequences, two people sit at a table and talk; these sequences are mainly used to test the robustness of a visual SLAM algorithm to slowly moving dynamic objects. In the "walking" sequences, two people traverse the office; these sequences are used to test the robustness of a visual SLAM algorithm to fast moving dynamic objects in the environment. In the sequence names, "xyz" means that the camera manually moves in the x, y, and z directions, and "rpy" means that the camera manually rotates along the main coordinate axes. In addition, "halfsphere" means that the camera is manually moved over a small semicircle that is one meter in diameter.

Our proposed method is compared with ORB-SLAM2 (Mur-Artal & Tardós, 2017), DynaSLAM (Bescos et al., 2018) and a Mask R-CNN-only method. In this paper, the absolute trajectory error (ATE) is used for the evaluation of the SLAM system, and the relative pose error (RPE) is used for the evaluation of the visual odometry. Due to differences in the computer hardware resources used, the same algorithm performs differently on different devices. To ensure the validity of the comparison, we reproduced ORB-SLAM2 and DynaSLAM on our device, so the data differ from the data in (Mur-Artal & Tardós, 2017) and (Bescos et al., 2018). To better demonstrate the performance of our proposed method, we use a performance value to quantify the advantage of our method (Yu et al., 2018):

$$\tau = \frac{\alpha - \beta}{\alpha} \times 100\% \quad (13)$$

where $\alpha$ is the error value of the other algorithm and $\beta$ is the error value of our method.

Since our system is mainly aimed at high dynamic environments, we mainly analyze the system performance in high dynamic environments. At the same time, because the algorithm is time-consuming, improving the real-time performance of the system will be our future work, so when comparing we choose some fast moving and some slowly moving dynamic sequences rather than all of them.

Figure 5 shows the ATE plots for a highly dynamic environment. The first row is the comparison of the estimated trajectory of ORB-SLAM2 with the real trajectory, and the second row is the comparison of the estimated trajectory of our proposed method with the real trajectory. It can be seen that our proposed method reduces the error significantly compared with ORB-SLAM2.

The comparison results between our proposed method and the ORB-SLAM2 algorithm are shown in Tables 1, 2 and 3. For high dynamic sequences, the improvements in RMSE and S.D. reach 96.61% and 97.00% in terms of ATE, respectively. This finding demonstrates that our proposed method has good performance in a high dynamic environment. For low dynamic sequences, the improvements in RMSE and S.D. are only 21.42% and 5.4%, respectively. The reason is that in the low dynamic sequences, most of the objects are static; the objects move slowly, and the moving objects occupy a small area in the environment. ORB-SLAM2 can obtain good results in static environments, so it is difficult to improve performance in low dynamic sequences.
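As a worked instance of Eq. (13): for the f3_w_rpy sequence in Table 1, the ATE RMSE of ORB-SLAM2 is $\alpha = 0.9331$ and that of our method is $\beta = 0.0316$, so

$$\tau = \frac{0.9331 - 0.0316}{0.9331} \times 100\% \approx 96.61\%,$$

which is the value reported in the Improvements column.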


Fig. 5 Plots of ATE for sequences f3/walking_halfsphere, f3/walking_xyz, f3/walking_rpy. The first row is the ATE plot from ORB-SLAM2. The second row is the ATE plot from our method
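For reference, the ATE RMSE reported in the tables below can be computed from time-associated, aligned trajectories roughly as in this sketch (the TUM benchmark tools additionally perform the time association and trajectory alignment, which we assume here as already done):

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of the absolute trajectory error.

    est_xyz, gt_xyz: (N, 3) arrays of associated, already-aligned positions
    """
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```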

Table 1 Comparison of the absolute trajectory error (ATE) of the ORB-SLAM2 algorithm and our algorithm on the RGB-D dataset
Sequences ORB-SLAM2 Our method Improvements

RMSE Mean Median S.D RMSE Mean Median S.D RMSE Mean Median S.D

f3_w_half 0.3409 0.2809 0.2448 0.1930 0.0267 0.0236 0.0210 0.0126 92.16% 91.59% 91.42% 93.47%
f3_w_xyz 0.5023 0.4021 0.3042 0.3010 0.0190 0.0168 0.0150 0.0090 96.21% 95.82% 95.06% 97.00%
f3_w_rpy 0.9331 0.7615 0.5457 0.5392 0.0316 0.0260 0.0223 0.0179 96.61% 96.58% 95.91% 96.68%
f3_w_static 0.1486 0.1340 0.1177 0.0641 0.0071 0.0063 0.0056 0.0032 95.22% 95.29% 95.24% 95.00%
f3_s_static 0.0084 0.0075 0.0069 0.0037 0.0066 0.0056 0.0049 0.0035 21.42% 25.33% 28.98% 5.4%
We compared the four parameters RMSE, mean, median and S.D., respectively, and present the degree of improvement

Tables 2 and 3 present the performance of the visual odometry: translational drift and rotational drift. These findings show that our proposed method outperforms the ORB-SLAM2 system in the high dynamic sequences, but the improvement in performance is not obvious in the low dynamic sequences for the same reason noted in the analysis of the ATE results.

In this paper, we also compare our proposed method with DynaSLAM and a SLAM algorithm using Mask R-CNN only. DynaSLAM can effectively deal with dynamic environmental problems. Our method combines the geometric constraint method with the Mask R-CNN algorithm, so the comparison against Mask R-CNN alone verifies the advantages of our algorithm. Table 4 shows the RMSE comparison results of ATE for our proposed method, DynaSLAM and the Mask R-CNN algorithm. "Mask" in Table 4 indicates that the SLAM system only uses Mask R-CNN as the dynamic target detection algorithm. The experimental results show that our proposed method can better handle the dynamic environment and achieve accurate localization and mapping in low dynamic sequences and high dynamic sequences compared with DynaSLAM and the SLAM algorithm using Mask R-CNN only. DynaSLAM chooses the five keyframes with the minimum distance in rotation and translation as the reference keyframes. The 3D points corresponding to the reference keyframes are projected to the current frame, and some feature points are removed using the 3D points and the parallax angle between the reference keyframe and the current frame.


Table 2 Comparison of the translational drift of the ORB-SLAM2 algorithm and our algorithm on the RGB-D dataset
Sequences ORB-SLAM2 Our method Improvements

RMSE Mean Median S.D RMSE Mean Median S.D RMSE Mean Median S.D

f3_w_half 0.4840 0.3888 0.3749 0.2883 0.0384 0.0342 0.0319 0.0174 92.06% 91.20% 91.49% 93.96%
f3_w_xyz 0.7703 0.5666 0.3998 0.5218 0.0284 0.0252 0.0234 0.0131 96.31% 95.55% 94.14% 97.48%
f3_w_rpy 1.4148 1.1363 0.9056 0.8429 0.0458 0.0387 0.0332 0.0243 96.76% 96.59% 96.33% 97.11%
f3_w_static 0.2261 0.1720 0.1177 0.1468 0.0111 0.0099 0.0092 0.0049 95.09% 94.24% 92.18% 96.66%
f3_s_static 0.0136 0.0121 0.0115 0.0062 0.0099 0.0086 0.0076 0.0050 27.20% 28.92% 33.91% 19.35%
We compared the four parameters of RMSE, mean, S.D. and median, respectively, and present the degree of improvement

Table 3 Comparison of the rotational drift of the ORB-SLAM2 algorithm and our algorithm on the RGB-D dataset
Sequences ORB-SLAM2 Our method Improvements

RMSE Mean Median S.D RMSE Mean Median S.D RMSE Mean Median S.D

f3_w_half 13.2290 10.9186 12.8808 7.4692 0.9013 0.8085 0.7512 0.3983 93.18% 92.59% 94.16% 94.66%
f3_w_xyz 15.1391 11.1715 8.8309 10.2172 0.7795 0.6344 0.5503 0.4529 94.85% 94.32% 93.76% 95.56%
f3_w_rpy 28.1709 22.0567 17.5395 17.5244 0.9995 0.8384 0.7238 0.5441 96.45% 96.19% 95.87% 96.89%
f3_w_static 4.2299 3.2593 2.1710 2.6960 0.2847 0.2568 0.2420 0.1228 93.26% 92.12% 88.85% 95.44%
f3_s_static 0.3913 0.3551 0.3398 0.1645 0.3355 0.2986 0.2771 0.1529 14.26% 15.91% 18.45% 7.05%
We compared the four parameters of RMSE, mean, S.D. and median, respectively, and present the degree of improvement

Table 4 ATE for each algorithm

Sequences    ORB-SLAM2  DynaSLAM  Mask    Our
f3_w_half    0.3409     0.0292    0.0270  0.0267
f3_w_xyz     0.5023     0.0202    0.0206  0.0190
f3_w_rpy     0.9331     0.0417    0.0409  0.0316
f3_w_static  0.1486     0.0077    0.0081  0.0071
f3_s_static  0.0084     0.0072    0.0070  0.0066

Table 5 Comparison (RMSE) of the translational drift of the RGB-D SLAM (Li & Lee, 2017) and our algorithm on the RGB-D dataset

Sequences    RGB-D SLAM (Li & Lee, 2017)  Our     Improvements
f3_w_half    0.0527                       0.0384  27.14%
f3_w_xyz     0.0651                       0.0284  56.38%
f3_w_rpy     0.2252                       0.0458  79.66%
f3_w_static  0.0327                       0.0111  66.06%
f3_s_static  0.0231                       0.0099  57.14%

However, if the shift between the reference keyframe and the current frame is small, the dynamic points of the reference keyframe will be retained as static points, which will cause missed detection of the dynamic points. So in this paper we choose the keyframe with a large change relative to the current frame as the reference keyframe, and accurate feature matching can be obtained by the proposed algorithm. Then we can obtain more constraint information, reduce the missed detection of dynamic points and get a better pose. However, the improvement is not large because DynaSLAM and Mask R-CNN are also robust in dynamic environments. In addition, due to different equipment and configuration environments, the results of DynaSLAM reimplemented in our environment are different from the original data in its article, so the data in Table 4 may differ from the original data in (Bescos et al., 2018).

As shown in Table 5, following (Li & Lee, 2017), we compare our algorithm with the RGB-D SLAM in (Li & Lee, 2017). Because the evaluation tools and source code are different, we only compare their RMSE of translational drift, and we can see that our method is greatly improved.

5 Conclusion

In this paper, a new framework is proposed to detect dynamic targets in the environment. The proposed system framework can achieve an accurate initial camera pose using a deep learning network. Furthermore, reprojection error, photometric error and depth error are used to assign a robust static point weight to each spatial point. Finally, dynamic objects are detected, and accurate localization is achieved by combining with the results of semantic segmentation. Our method is evaluated using the TUM RGB-D dataset. The results show that our method is robust to high dynamic environments,


but the performance improvement for low dynamic environments is not very obvious because the recognition rate of deep learning and the misdetection of the geometric method affect the performance of the proposed algorithm. According to the comparison in Table 5, we find that our algorithm is greatly improved compared with (Li & Lee, 2017), which confirms the rationality of this assumption and the usefulness of these two errors to a certain extent; in future work, we will use a better deep learning algorithm to further improve our work.

Funding The work is supported by the National Natural Science Foundation of China (Project No. 61773333) and the China Scholarship Council.

References

1. Alcantarilla, P. F., Yebes, J. J., Almazán, J., et al. (2012). On combining visual SLAM and dense scene flow to increase the robustness of localization and mapping in dynamic environments. In 2012 IEEE international conference on robotics and automation (pp. 1290–1297). IEEE.
2. Bakkay, M. C., Arafa, M., & Zagrouba, E. (2015). Dense 3D SLAM in dynamic scenes using Kinect. In Iberian conference on pattern recognition and image analysis (pp. 121–129). Cham: Springer.
3. Bescos, B., Fácil, J. M., Civera, J., et al. (2018). DynaSLAM: tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters, 3(4), 4076–4083.
4. Bouguet, J. Y. (2001). Pyramidal implementation of the affine Lucas-Kanade feature tracker: description of the algorithm. Intel Corporation, 5(1–10), 4.
5. Cui, L., & Wen, F. (2019). A monocular ORB-SLAM in dynamic environments. In Journal of Physics: Conference Series (Vol. 1168, p. 052037). Bristol: IOP Publishing.
6. Engel, J., Koltun, V., & Cremers, D. (2017). Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3), 611–625.
7. Engel, J., Schöps, T., & Cremers, D. (2014). LSD-SLAM: large-scale direct monocular SLAM. In European conference on computer vision (pp. 834–849). Cham: Springer.
8. Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381–395.
9. He, K., Gkioxari, G., Dollár, P., et al. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).
10. Hornung, A., Wurm, K. M., Bennewitz, M., Stachniss, C., & Burgard, W. (2013). OctoMap: an efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 34(3), 189–206.
11. Kerl, C., Sturm, J., & Cremers, D. (2013a). Robust odometry estimation for RGB-D cameras. In 2013 IEEE international conference on robotics and automation (pp. 3748–3754). IEEE.
12. Kerl, C., Sturm, J., & Cremers, D. (2013b). Dense visual SLAM for RGB-D cameras. In 2013 IEEE/RSJ international conference on intelligent robots and systems (pp. 2100–2106). IEEE.
13. Klein, G., & Murray, D. (2007). Parallel tracking and mapping for small AR workspaces. In Proceedings of the 2007 6th IEEE and ACM international symposium on mixed and augmented reality (pp. 1–10). IEEE Computer Society.
14. Li, S., & Lee, D. (2017). RGB-D SLAM in dynamic environments using static point weighting. IEEE Robotics and Automation Letters, 2(4), 2263–2270.
15. Mur-Artal, R., & Tardós, J. D. (2017). ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5), 1255–1262.
16. Sturm, J., Engelhard, N., Endres, F., et al. (2012). A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ international conference on intelligent robots and systems (pp. 573–580). IEEE.
17. Sun, Y., Liu, M., & Meng, M. Q. H. (2017). Improving RGB-D SLAM in dynamic environments: a motion removal approach. Robotics and Autonomous Systems, 89, 110–122.
18. Tan, W., Liu, H., Dong, Z., et al. (2013). Robust monocular SLAM in dynamic environments. In IEEE international symposium on mixed and augmented reality (ISMAR) (pp. 209–218). IEEE.
19. Wang, R., Wan, W., Wang, Y., et al. (2019). A new RGB-D SLAM method with moving object detection for dynamic indoor scenes. Remote Sensing, 11(10), 1143.
20. Xu, B., Li, W., Tzoumanikas, D., et al. (2019). MID-Fusion: octree-based object-level multi-instance dynamic SLAM. In 2019 international conference on robotics and automation (ICRA) (pp. 5231–5237). IEEE.
21. Xu, Y., Guo, Q., & Chen, J. (2018). Dynamic object detection using improved ViBe for RGB-D SLAM. In 2018 IEEE international conference on systems, man, and cybernetics (SMC) (pp. 1664–1669). IEEE.
22. Yu, C., Liu, Z., Liu, X. J., et al. (2018). DS-SLAM: a semantic visual SLAM towards dynamic environments. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1168–1174). IEEE.
23. Zhang, Z., Zhang, J., & Tang, Q. (2019). Mask R-CNN based semantic RGB-D SLAM for dynamic scenes. In 2019 IEEE/ASME international conference on advanced intelligent mechatronics (AIM) (pp. 1151–1156). IEEE.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Shuhuan Wen was born in Heilongjiang, China, on July 16, 1972. She received the Ph.D. degree in control theory and control engineering from Yanshan University, Qinhuangdao, China, in 2005. She is currently a Professor of automatic control in the Department of Electric Engineering, Yanshan University.


Pengjiang Li was born in Hebei, China, in November 1997. He received a Bachelor's degree in Automation from Yanshan University, Qinhuangdao, China, in 2020. His research interests are LIDAR, deep learning, and semantic segmentation.

Yongjie Zhao was born in Henan, China, in November 1994. He received a Bachelor's degree in Information Engineering from Zhengzhou Business University in 2017. His research interests are VSLAM and image processing.

Hong Zhang received the B.Sc. degree from Northeastern University, Boston, USA, in 1982, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA, in 1986, both in electrical engineering. He conducted post-doctoral research at the University of Pennsylvania from 1986 to 1987 before joining the Department of Computing Science, University of Alberta, Canada, where he is currently a Professor and Director of the Centre for Intelligent Mining Systems.

Fuchun Sun was born in Jiangsu Province, China, in 1964. He received the B.S. and M.S. degrees from the Naval Aeronautical Engineering Academy, Yantai, China, and the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 1998.

Zhe Wang received the Master's degree in Control Science and Engineering from Shenyang Jianzhu University in 2019. Currently, he is studying for a Ph.D. degree in control engineering at Yanshan University.

