Fig. 2. Visualization of the proposed approach. The letters a), b), C., and D. correspond to the identically named paragraphs and subsections in Sec. III. Starting from the previous and current sensor inputs S_{k−1} and S_k, two LiDAR range images I_{k−1}, I_k are created, which are then fed into the network. The output of the network is a geometric transformation, which is applied to the source scan and normals S_k, N_k. After finding target correspondences with the aid of a KD-tree, a geometric loss is computed, which is then back-propagated to the network during training.
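The first step of the pipeline in Fig. 2, converting a raw scan into a LiDAR range image, can be sketched as follows. This is a minimal NumPy sketch: the default image size, the full-sphere polar discretization, and the zero-filling of empty pixels are simplifying assumptions, not details taken from the paper.

```python
import numpy as np

def scan_to_range_image(points, H=16, W=720):
    """Map an (n, 3) LiDAR scan to a (4, H, W) image holding
    (x, y, z, range) per pixel; unobserved pixels stay zero."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(y, x)                                    # in (-pi, pi]
    polar = np.arccos(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    u = np.clip((polar / np.pi * H).astype(int), 0, H - 1)        # image row
    v = np.clip(((azimuth + np.pi) / (2 * np.pi) * W).astype(int), 0, W - 1)
    image = np.zeros((4, H, W))
    # write far-to-near so that the nearest point wins at each pixel
    for i in np.argsort(-r):
        image[:, u[i], v[i]] = (x[i], y[i], z[i], r[i])
    return image
```

In practice H would match the sensor's number of vertical scan-lines, and W is chosen smaller than the number of points per ring to keep the image dense, as described in Sec. III.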
scans can be described by the following unknown conditional probability density function:

p(T_{k−1,k} | S_{k−1}, S_k).   (1)

In this work, it is assumed that a unique deterministic map S_{k−1}, S_k ↦ T_{k−1,k} exists, of which an approximation T̃_{k−1,k}(θ, S_{k−1}, S_k) is modeled by a deep neural network. Here, θ ∈ R^P denotes the weights and biases of the network, with P being the number of trainable parameters. During training, the values of θ are obtained by optimizing a geometric loss function L, s.t. θ* = argmin_θ L(T̃_{k−1,k}(θ), S_{k−1}, S_k, N_{k−1}, N_k), which will be discussed in more detail in Sec. III-D.

B. Network Architecture and Data Flow

As this work focuses on general robotic applications, a priority in the approach's design is given to achieving real-time performance on hardware that is commonly deployed on robots. For this purpose, computationally expensive pre-processing operations, such as the calculation of normal vectors as e.g. done in [21], are avoided. Furthermore, during inference the proposed approach only requires raw sensor data for its operation. An overview of the proposed approach is presented in Figure 2, with letters providing references to the corresponding subsections and paragraphs.

a) Data Representation: There are three common techniques to perform neural network operations on point cloud data: i) mapping the point cloud to an image representation and applying 2D techniques and architectures [25, 26], ii) performing 3D convolutions on voxels [25, 27], and iii) performing operations on unordered point cloud scans [28, 29]. Due to PointNet's [28] invariance to rigid transformations and the high memory requirements of 3D voxels for sparse LiDAR scans, this work utilizes the 2D image representation of the scan as the input to the network, similar to DeepLO [24].

To obtain the image representation, a geometric mapping of the form φ : R^{n×3} → R^{4×H×W} is applied, where H and W denote the height and width of the image, respectively. Coordinates (u, v) of the image are calculated by discretizing the azimuth and polar angles in spherical coordinates, while making sure that only the nearest point is kept at each pixel location. A natural choice for H is the number of vertical scan-lines of the sensor, whereas W is typically chosen to be smaller than the number of points per ring, in order to obtain a dense image (cf. a) in Figure 2). In addition to the 3D point coordinates, range is also added, yielding (x, y, z, r)^⊤ for each valid pixel of the image, given as I = φ(S).

b) Network: In order to estimate T̃_{k−1,k}(θ, I_{k−1}, I_k), a network architecture consisting of a combination of convolutional, adaptive average pooling, and fully connected layers is deployed, which produces a fixed-size output independent of the input dimensions of the image. For this purpose, 8 ResNet [30]-like blocks constitute the core of the architecture, with each consisting of two layers. In total, the network employs approximately 10^7 trainable parameters. After generating a feature map of dimensions (N, 512, H/2, W/32), adaptive average pooling along the height and width of the feature map is performed to obtain a single value for each channel. The resulting feature vector is then fed into a single multi-layer perceptron (MLP), before splitting into two separate MLPs for predicting the translation t ∈ R^3 and the rotation in the form of a quaternion q ∈ R^4. Throughout all convolutional layers, circular padding is applied, in order to achieve the same behavior as for a true (imaginary) 360° circular image. After normalizing the quaternion, q̄ = q/|q|, the transformation matrix T̃_{k−1,k}(q̄(θ, S_{k−1}, S_k), t(θ, S_{k−1}, S_k)) is computed.

C. Normals Computation

Learning rotation and translation at once is a difficult task [20], since both impact the resulting loss independently and can potentially make the training unstable. However, recent works [21, 24] that have utilized normal vector estimates in their loss functions have demonstrated good estimation performance. Nevertheless, utilizing normal vectors for loss
calculation is not trivial, and due to the difficult integration of "direct optimization approaches into the learning process" [24], DeepLO computes its normal estimates with simple averaging methods by explicitly computing the cross product of vertex points in the image. In the proposed approach, no loss gradient needs to be back-propagated through the normal vector calculation (i.e. the eigen-decomposition), as normal vectors are calculated in advance. Instead, normal vectors computed offline are simply rotated using the rotational part of the computed transformation matrix, allowing for simple and fast gradient flow with arbitrary normal computation methods. Hence, in this work normal estimates are computed via a direct optimization method, namely principal component analysis (PCA) of the estimated covariance matrix of neighborhoods of points, as described in [31], allowing for more accurate normal vector predictions. Furthermore, normals are only computed for points that have a minimum number of valid neighbors, where the validity of neighbors depends on their depth difference from the point of interest x_i, i.e. |range(x_i) − range(x_nb)|_2 ≤ α, with α being a scene-specific tuning parameter.

D. Geometric Loss

In this work, a combination of geometric losses akin to the cost functions in model-based methods [2] is used, namely a point-to-plane and a plane-to-plane loss. The rigid body transformation T̃_{k−1,k} is applied to the source scan, s.t. S̃_{k−1} = T̃_{k−1,k} ⊙ S_k, and its rotational part to all source normal vectors, s.t. Ñ_{k−1} = rot(T̃_{k−1,k}) ⊙ N_k, where ⊙ denotes an element-wise matrix multiplication. The loss function then incentivizes the network to generate a T̃_{k−1,k} s.t. S̃_{k−1}, Ñ_{k−1} match S_{k−1}, N_{k−1} as closely as possible.

a) Correspondence Search: In contrast to [21, 24], where image pixel locations are used as correspondences, in this work a full correspondence search between the transformed source and the target is performed in 3D using a KD-tree [32]. This has two main advantages: First, as opposed to [24], there is no need for an additional field-of-view loss, since correspondences can also be found for points that are mapped to regions outside of the image boundaries. Second, this also allows for the handling of cases close to sharp edges, which, when using discretized pixel locations only [24], can lead to wrong correspondences for points with large depth deviations, e.g. points near a thin structure. Once point correspondences have been established, the following two loss functions can be computed.

b) Point-to-Plane Loss: For each point ŝ_b in the transformed source scan Ŝ_{k−1}, the distance to the associated point s_b in the target scan is computed and projected onto the target surface at that position, i.e.

L_p2n = (1/n_k) Σ_{b=1}^{n_k} |(ŝ_b − s_b) · n_b|^2,   (2)

where n_b is the target normal vector. If no normal exists either at the source or at the target point, the point is considered invalid and omitted from the loss calculation.

c) Plane-to-Plane Loss: In the second loss term, the surface orientation around the two points is compared. Let n̂_b and n_b be the normal vectors at the transformed source and target locations; then the loss is computed as follows:

L_n2n = (1/n_k) Σ_{b=1}^{n_k} ||n̂_b − n_b||_2^2.   (3)

Again, if no normal is present at either of the two scan locations, the point is disregarded from the loss computation. The final loss is then computed as L = L_p2n + L_n2n.

IV. EXPERIMENTAL RESULTS

To thoroughly evaluate the proposed approach, testing is performed on three robotic datasets using different robot types, different LiDAR sensors, and different sensor mounting orientations. First, using the quadrupedal robot ANYmal, the suitability of the proposed approach for real-world autonomous missions is demonstrated by integrating its pose estimates into a mapping pipeline and comparing against a state-of-the-art model-based approach [3]. Next, the reliability of the proposed approach is demonstrated by applying it to datasets from the DARPA SubT Challenge [11], collected using a tracked robot, and comparing the built map against the ground-truth map. Finally, to aid numerical comparison with existing work, an evaluation is conducted on the KITTI odometry benchmark [33].

The proposed approach is implemented using PyTorch [34], with the KD-tree search component utilized from SciPy. For testing, the model is embedded into a ROS [35] node. Both PyTorch and ROS implementations are made publicly available¹.

A. ANYmal: Autonomous Exploration Mission

To demonstrate the suitability for complex real-world applications, the proposed approach is tested on data collected during autonomous exploration and mapping missions conducted with the ANYmal quadrupedal robot [10]. In contrast to wheeled robots and cars, ANYmal with its learning-based controller [36] has more variability in roll and pitch angles during walking. Additionally, rapid large changes in yaw are introduced due to the robot's ability to turn on the spot. During these experiments, the robot was tasked to autonomously explore [37] and map [38] a previously unknown indoor environment and autonomously return to its start position. The experiments were conducted in the basement of the CLA building at ETH Zürich, containing long tunnel-like corridors, as shown in Figure 1, and during each mission ANYmal traversed an average distance of 250 meters.

For these missions ANYmal was equipped with a Velodyne VLP-16 Puck Lite LiDAR. In order to demonstrate the robustness of the proposed method, during the test mission the LiDAR was mounted in an upside-down orientation, while during training it was mounted in the normal upright orientation. To record the training set, two missions were conducted with the robot starting from the right side-entrance of the main course. For testing, the robot started its mission
Fig. 3. Comparison of maps created by using pose estimates from the proposed approach and LOAM2 implementation against ground-truth map, as
provided in the DARPA SubT Urban Circuit dataset. More consistent mapping results can be noted when comparing the proposed map with the ground-truth.
from the opposing end of the main course. During training and test missions, the robot never visited the starting locations of the other mission, as they were physically closed off. To demonstrate the utility of the proposed method for mapping applications, the estimated robot poses were combined with the mapping module of LOAM [3]. Figure 4 shows the created map, the robot path during test, as well as the starting locations for training and

[Figure: per-axis plots of x, y, z (m) and roll, pitch, yaw (deg) over the scan index.]
Segment length      5      10     25     40     60     100
t_rel [%]           0.345  0.212  0.151  0.160  0.178  0.128
r_rel [deg/10m]     0.484  0.274  0.150  0.103  0.069  0.046
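The relative errors reported in the table (t_rel in percent of the traveled distance, r_rel in degrees per 10 m) can be computed for a single trajectory segment roughly as below. This is a hedged sketch of the standard segment-wise metric, not the paper's evaluation code; the function name and interface are illustrative.

```python
import numpy as np

def segment_errors(T_gt_a, T_gt_b, T_est_a, T_est_b, seg_len):
    """Relative translational [%] and rotational [deg/10m] error for one
    trajectory segment between poses a and b (4x4 homogeneous matrices)."""
    rel_gt = np.linalg.inv(T_gt_a) @ T_gt_b      # ground-truth relative motion
    rel_est = np.linalg.inv(T_est_a) @ T_est_b   # estimated relative motion
    err = np.linalg.inv(rel_gt) @ rel_est        # residual pose error
    t_rel = np.linalg.norm(err[:3, 3]) / seg_len * 100.0
    # rotation angle of the residual, from the trace of its rotation part
    cos_angle = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    r_rel = np.degrees(np.arccos(cos_angle)) / seg_len * 10.0
    return t_rel, r_rel
```

A benchmark-style evaluation would average these values over many segments and several segment lengths.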
Fig. 6. Qualitative results of the proposed odometry, as well as the scan-to-map refined version of it. From left to right the following sequences are
shown: 01, 07 (training set), 09, 10 (validation set).
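The 3D correspondence search of Sec. III-D, together with the point-to-plane loss of Eq. (2), can be sketched with SciPy's KD-tree (the implementation section states that the KD-tree search component comes from SciPy). The `max_dist` validity gate here is a simplification standing in for the paper's normal-validity checks.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_to_plane_loss(source_pts, target_pts, target_normals, max_dist=1.0):
    """Eq. (2): mean squared point-to-plane distance between transformed
    source points and their nearest target neighbours, found in 3D."""
    tree = cKDTree(target_pts)            # KD-tree over the target scan
    dists, idx = tree.query(source_pts)   # nearest-neighbour correspondences
    valid = dists <= max_dist             # crude validity gate (assumption)
    diff = source_pts[valid] - target_pts[idx[valid]]
    # project the residual onto the target normal at each correspondence
    proj = np.sum(diff * target_normals[idx[valid]], axis=1)
    return np.mean(proj ** 2)
```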
yaw rotations, it can be noted that the LOAM map becomes inconsistent. In contrast, the proposed approach is not only able to generalize and operate in the new environment of the test set, but it also provides more reliable pose estimates and produces a more consistent map when compared to the DARPA-provided ground-truth map.

C. KITTI: Odometry Benchmark

To demonstrate real-world performance quantitatively and to aid the comparison to existing work, the proposed approach is evaluated on the KITTI odometry benchmark dataset [33]. The dataset is split into a training set (Sequences 00-08) and a test set (Sequences 09, 10), as also done in DeepLO [24] and most other learning-based works. The results of the proposed approach are presented in Table II, and are compared to model-based approaches [3, 13], supervised LiDAR odometry approaches [20, 21], and unsupervised visual odometry methods [8, 22, 23]. Only the 00-08 mean

with little drift, even on challenging sequences with dynamic objects (01) and sequences unobserved during training (09, 10). Nevertheless, especially for the test set, the scan-to-map refinement helps to achieve even better and more consistent results. Quantitatively, the proposed method achieves similar results to the only other self-supervised LiDAR odometry approach [24], and outperforms it when combined with mapping, while also outperforming all other unsupervised visual odometry methods [8, 22, 23]. Similarly, by integrating the scan-to-map refinement, results close to the overall state of the art [3, 13, 21] are achieved.

Furthermore, to understand the benefit of utilizing multiple geometric losses, two networks were trained from scratch on a different training/test split of the KITTI dataset. The results of this study are presented in Table III and demonstrate the benefit of combining the plane-to-plane (pl2pl) loss and the point-to-plane (p2pl) loss over using the latter alone, as done in [24].