
Article
Indoor 3D Reconstruction of Buildings via Azure Kinect
RGB-D Camera
Chaimaa Delasse 1,*, Hamza Lafkiri 1, Rafika Hajji 1, Ishraq Rached 1 and Tania Landes 2

1 College of Geomatic Sciences and Surveying Engineering, Institute of Agronomy and Veterinary Medicine,
Rabat 6202, Morocco
2 ICube Laboratory UMR 7357, Photogrammetry and Geomatics Group, National Institute of Applied
Sciences (INSA Strasbourg), 24, Boulevard de la Victoire, 67084 Strasbourg, France
* Correspondence: chaimaadelasse@gmail.com

Abstract: With the development of 3D vision techniques, RGB-D cameras are increasingly used
to allow easier and cheaper access to the third dimension. In this paper, we focus on testing the
potential of the Kinect Azure RGB-D camera in the 3D reconstruction of indoor scenes. First, a
series of investigations of the hardware was performed to evaluate its accuracy and precision. The
results show that the measurements made with the Azure could be exploited for close-range survey
applications. Second, we performed a methodological workflow for indoor reconstruction based
on the Open3D framework, which was applied to two different indoor scenes. Based on the results,
we can state that the quality of 3D reconstruction significantly depends on the architecture of the
captured scene. This was supported by a comparison of the point cloud from the Kinect Azure with
that from a terrestrial laser scanner and another from a mobile laser scanner. The results show that
the average differences do not exceed 8 mm, which confirms that the Kinect Azure can be considered
a 3D measurement system at least as reliable as a mobile laser scanner.

Keywords: Azure Kinect; RGB-D; TLS; MLS; 3D indoor reconstruction

1. Introduction

The reconstruction of an as-built 3D model requires the acquisition of the current state of the building [1]. Point clouds, considered major inputs for the 3D modeling of buildings, usually result from two main acquisition devices: terrestrial 3D scanners (TLS: terrestrial laser scanner) and dynamic 3D scanners (MLS: mobile laser scanner), the latter being based on SLAM (simultaneous localization and mapping) technology.

Although it allows the fast and accurate acquisition of a large volume of data, the approach based on the acquisition and segmentation of TLS point clouds is expensive, not very user-friendly for indoor use, and can be time consuming if a high level of detail is required. SLAM technology can overcome this lack of convenience, but at the expense of accuracy. Moreover, the prices of such systems remain prohibitive. Purely vision-based 3D reconstruction approaches can also be used for this purpose, but they are often computationally intensive and suffer from a lack of robustness [2].

RGB-D cameras represent a low-cost alternative that is easy to use in indoor environments. Interest in RGB-D sensors and their applications in areas such as 3D reconstruction [3] and BIM completion [4] has grown significantly since the launch of Microsoft's first Kinect sensor in 2010. The line of Kinect cameras has, ever since, seen notable improvements with every updated version. The latest addition, the Azure Kinect, differs from its predecessors in that it supports four depth sensing modes (wide field of view binned, narrow field of view binned, wide field of view unbinned, and narrow field of view unbinned), and its color camera offers a higher resolution (Microsoft, 2019). The precision (repeatability) of the Azure is also much better than that of the first Kinect versions [5].

Today, the evaluation of the potential of RGB-D sensors in BIM-oriented 3D reconstruction is a trending topic for the scientific community [3]. Indeed, the ability to success-
fully create complete and accurate models of indoor environments using a lightweight
and inexpensive device such as the Azure Kinect would be highly advantageous for the
BIM community.
Recently, several research studies have shown a real interest in the use of low-cost
RGB-D systems for 3D modeling and BIM reconstruction. Some studies have focused on the
evaluation of the indoor and outdoor performance of these sensors. In fact, Ref. [5] proposed a
methodology to investigate the Azure Kinect sensor, evaluating its warm-up time, accuracy,
precision, the effect of target reflectivity, and the multipath and flying-pixel phenomena.
The results confirm the values reported by Microsoft. However, due to its acquisition
technology, the device suffers from multipath interference, and its warm-up time remains
relatively long (about 40–50 min) [5]. Ref. [6] addressed the question of the Azure Kinect's
absolute accuracy by using a laser scanner as a benchmark, which extends the assessment
to the whole field of view of the sensor instead of only along the optical axis. Their results
were consistent with the findings of [5]. The Azure Kinect shows an improvement over its
predecessor in terms of random depth error as well as systematic error. This research, along
with other studies investigating various low-cost RGB-D sensors [5,7], has led to the
development of 3D modeling pipelines based on RGB-D data. Indeed, Ref. [3] recently proposed
an automatic generation framework that transforms the 3D point cloud produced by a
low-cost RGB-D sensor into an as-built BIM without any manual intervention. The plane
extraction was based on the RANSAC (RANdom SAmple Consensus) algorithm enhanced
by a new “add-drop” method. For semantic segmentation, a CRF (conditional random
field) layer and a COB (convolutional oriented boundaries) layer were added at the end of
the raw FCN (fully convolutional networks) architecture to refine the segmentation results.
The experimental results show that the proposed method has competitive robustness,
processing time, and accuracy compared with a workflow conducted by a TLS.
Other studies focus on combining data from low-cost RGB-D sensors with a TLS point
cloud to complement it and improve its level of detail (LoD). Ref. [4] integrated the point
clouds of small features (doors and windows) from the Kinect V2 sensor with a TLS model.
The authors found that the segmentation accuracy for the Kinect data is highly dependent
on the type of window frame or door jamb. Furthermore, the calibration of the two datasets
remains mostly manual.
Some authors have been interested in improving RGB-D SLAM systems for the 3D
reconstruction of buildings. Ref. [2] proposed a basic RGB-D mapping framework for
generating dense 3D models of indoor environments. For image alignment, their approach
introduced RGBD-ICP, an improved variant of the ICP (iterative closest point) algorithm
jointly optimizing appearance and shape matching. The loop closure issue was addressed
based on geometric constraints. RGB-D mapping also incorporates a “surfel” representation
to handle occlusions. Although the global alignment process of RGB-D mapping is still
limited, the system can successfully map indoor environments. The authors suggested
that it would be useful to apply a visualization technique such as PMVS (patch-based
multi-view stereo) to enrich the indoor model.
Due to the peculiar structure and scalability of indoor environments, the depth quality
produced by RGB-D cameras and the SLAM algorithm are significant issues for existing
RGB-D mapping systems [8]. Ref. [9] introduced an approach integrating SfM (structure
from motion) and visual RGB-D SLAM to broaden the measurement range and improve
the model details. To ensure the accuracy of RGB image poses, the authors introduced
a refined false feature rejection modeling method for 3D scene reconstruction. Finally, a
global optimization model was used to improve camera pose accuracy and reduce inconsis-
tencies between depth images during stitching. This framework was reviewed by [8] in a
comparative evaluation. The methodology proposed in [8] exploited all possible existing
constraints between 3D features to refine the reconstructed model. The presented fully
constrained RGB-D SLAM framework is centimeter accurate. Comparison of the proposed
work with the visual RGB-D SLAM systems of [9] and SensorFusion demonstrated its
usefulness and robustness.
The availability of RGB-D cameras to the public has provided the opportunity to
develop several 3D reconstruction workflows. However, the challenges associated with
this research area persist. It is necessary to distinguish between studies that deal with
the modeling of individual objects and those that focus on the reconstruction of complete
scenes. In the context of using RGB-D sensors to model objects, the authors of [10] proposed
an experimental method for the 3D reconstruction of a balustrade with the Kinect V2. A circular
network of frames from eight different perspectives was acquired around the object. As
previously suggested by the results of depth calibration, the sensor was placed
approximately 1 m from the object in order to minimize the global depth-related deformations.
The accuracy of the resulting mesh was assessed with regard to a reference point cloud
and mesh obtained by a measuring arm. While in both comparisons, the final error of
about 2 mm on a significant part of the model is acceptable for a low-cost device such as
the V2, the remaining deviations in the order of magnitude of one centimeter still represent
a substantial percentage of the size of the object.
While research concerning the modeling of individual objects is widely available,
studies related to the reconstruction of complete scenes remain limited. Constrained by
the rather limited field of view of low-cost RGB-D cameras, the creation of a complete
and reliable model of an indoor environment comes from several views acquired along
a trajectory, each corresponding to a fragment of the scene. While this procedure often
provides a global view of all surfaces, it suffers from significant odometry drift [11].
Frameworks developed in this sense can be classified into two categories: real-time
online reconstruction (i.e., KinectFusion ([12,13]); BADSLAM [14]; ORB-SLAM2 [15]) and
offline reconstruction (i.e., fully constrained RGB-D SLAM [8], visual RGB-D SLAM [9]),
which requires post-processing. In both cases, several workflows have been developed and
evaluated, and good accuracies have been achieved. However, the evaluation of the quality
of the indoor scene models was performed either on synthetic scenes or through a visual
evaluation of the results [11]. An actual evaluation with reference to a model of the same
scene of supposedly better accuracy, obtained, for example, through a TLS, has not yet been
performed, especially for the Azure Kinect.
The aim of this paper is to evaluate the potential of an RGB-D camera, precisely the
Azure Kinect, for BIM-oriented indoor 3D reconstruction. The main contributions of our
paper are threefold:
- An evaluation of the indoor performance of the Kinect Azure (Section 2.1).
- An evaluation of the quality of a 3D model reconstructed from this device (Section 2.2).
- A comparison with the clouds produced with both a terrestrial and a dynamic 3D
scanner (Section 2.3).
The remainder of the present paper is organized as follows. Section 2 presents the
methodology followed in order to investigate the performance and to assess the 3D recon-
struction quality of the Kinect Azure. Section 3 treats the results of our study, which are
further discussed in Section 4. Section 5 presents a conclusion and recommendations for
future works.

2. Method
The methodology followed in this work was composed of two main steps (Figure 1).
First, we started with an investigation of the Kinect Azure by evaluating several parameters:
the influence of the number of averaged images on the noise, the accuracy of the device
referring to the measurements of a TLS, the influence of the distance on the accuracy,
the precision as a function of the measurement distance, and, subsequently, a geometric
calibration of the sensor. After analyzing the results of the investigation tests, we performed
a methodological workflow test on Open3D in two different indoor scenes. Finally, we
present a comparison of the geometric quality of the reconstruction provided with the
RGB-D camera with respect to laser scanners, particularly a TLS and a MLS.
We note that other tests, such as warm-up time and the influence of target reflectivity, were also conducted but are not included in this paper, as similar experiments were conducted by fellow researchers in [5] and yielded the same results.

Figure 1. Overview of the general workflow.

2.1. Evaluation of the Kinect Azure

As reported in Figure 1, we conducted several experiments in order to assess the technical specifications as well as the performance of the camera with regard to several parameters. The adopted test methods are presented in the following subsections.

2.1.1. Influence of Image Averaging

Image averaging assumes that the noise comes from a random source. Thus, the random fluctuations above and below the real image data gradually balance out as an increasing number of images are averaged. This technique is useful for reducing the noise inherent to the sensor and its acquisition technology that is present in individual images [1].

Our experiment was performed in an office scene (Figure 2), where repeated measurements at a 900 mm range with respect to the wall were performed. This scene was chosen to represent a typical indoor environment, which is most often characterized by a set of occlusions. An analysis of noise variability as a function of the size of the averaged image set was performed with 20, 50, 100, and 200 successive depth maps. Using the acquired depth maps, the mean and standard deviation of the distance measurements were calculated for each pixel.

Figure 2. Indoor office scene used for testing image averaging.
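As an illustration of this per-pixel analysis, the following minimal NumPy sketch (the array shapes, the stored stack file, and the frame counts are assumptions for illustration, not the authors' code) computes the mean and standard deviation of each pixel over a stack of consecutive depth maps and summarizes how the noise evolves with the number of averaged frames.

```python
import numpy as np

def per_pixel_stats(depth_stack: np.ndarray):
    """depth_stack: (N, H, W) array of depth maps in millimetres; 0 = invalid pixel."""
    stack = depth_stack.astype(np.float64)
    stack[stack == 0] = np.nan                   # ignore invalid (zero-depth) pixels
    mean_map = np.nanmean(stack, axis=0)         # per-pixel mean distance
    std_map = np.nanstd(stack, axis=0, ddof=1)   # per-pixel standard deviation (Equation (1))
    return mean_map, std_map

# Compare the noise level for different numbers of averaged frames.
# 'frames' is assumed to be a (300, H, W) stack of consecutive depth maps saved beforehand.
frames = np.load("office_depth_stack.npy")       # hypothetical capture dump
for n in (20, 50, 100, 200, 300):
    _, std_map = per_pixel_stats(frames[:n])
    print(f"{n:3d} frames -> median per-pixel std: {np.nanmedian(std_map):.1f} mm")
```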

2.1.2. Accuracy

Accuracy reflects how close a measurement is to a reference value. To evaluate the accuracy of the Kinect Azure, we graphically represented the deviations of the measurements of our device with respect to those from a TLS, which is assumed to be of higher accuracy.

2.1.3. Influence of Distance on Accuracy

To study the influence of measurement distance on accuracy, we performed measurements from a static position of the Kinect by moving the target (a whiteboard) along a line perpendicular to the camera plane, following distance marks staked out by a total station (Figure 3). At each range, the position of the board was recorded by the total station to prevent it from changing orientation from one station to another. The differences between the distances from the Kinect (for each mode) and those from the total station were then calculated and analyzed.

Figure 3. Operating mode for testing the influence of different ranges on measurement accuracy.

2.1.4. Precision

Precision reflects how reproducible measurements are. The data needed to evaluate the repeatability could be derived from the accuracy experiment. The standard deviations (Equation (1)) of the repeated measurements at each station and for all modes were calculated for each distance:

\sigma = \sqrt{ \frac{ \sum_{i=1}^{n} (x_i - \bar{x})^2 }{ n - 1 } },  (1)

where \sigma is the standard deviation, x_i are the measurements, \bar{x} is the estimated mean value, and n is the number of observations.
2.1.5. Geometric Calibration
One of the most widely used approaches for geometric calibration is that developed
by [16]. It requires the camera to observe a planar target presented in a few different
orientations.
In our case, the determination of the intrinsic parameters was performed separately for
each camera onboard the Kinect Azure: a color camera and a depth camera. A board called
ChArUco (Figure 4) printed on cardboard paper in order to minimize wrinkles was used.
It is a flat board where uniquely encoded markers (ArUco) are placed inside the white tiles
of a checkerboard. The corners of the markers can be easily detected in the images using
the corner detection algorithms implemented in OpenCV (Open-Source Computer Vision
Library), an open-source computer vision and machine learning software library. The
checkerboard corners are then interpolated and refined. Finally, since the dimensions of the
tiles were well known, we obtained correspondences between the 2D image coordinates
and the 3D camera coordinates so that we could solve, in an iterative way, the camera
model we established.
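A minimal sketch of this intrinsic calibration, assuming the legacy cv2.aruco interface of opencv-contrib-python and a hypothetical 7 × 5 ChArUco board (the dictionary, board dimensions, and image folder are illustrative, not the exact configuration used here), could look as follows.

```python
import glob
import cv2

# Hypothetical board layout: 7 x 5 squares, 40 mm squares, 30 mm ArUco markers.
aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
board = cv2.aruco.CharucoBoard_create(7, 5, 0.04, 0.03, aruco_dict)

all_corners, all_ids, image_size = [], [], None
for path in glob.glob("calib_images/*.png"):          # >30 views of the board
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    image_size = gray.shape[::-1]
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
    if ids is None:
        continue
    # Interpolate and refine the checkerboard corners from the detected markers.
    n, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(corners, ids, gray, board)
    if n > 3:
        all_corners.append(ch_corners)
        all_ids.append(ch_ids)

# Iteratively solve for the camera matrix and Brown–Conrady distortion coefficients.
rms, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.aruco.calibrateCameraCharuco(
    all_corners, all_ids, board, image_size, None, None)
print("RMS reprojection error:", rms)
print("Camera matrix:\n", camera_matrix)
print("Distortion coefficients (k1, k2, p1, p2, k3):", dist_coeffs.ravel())
```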
Figure 4. Design and parameters of the ChArUco board (OpenCV).

• Intrinsic parameters and distortions

From a static position of the camera, more than 30 images were taken, each time moving the target to occupy different distances and viewpoints, in order to properly reproduce the distortions created by the camera lenses. These images were processed using OpenCV to find the matrix of each camera, gathering the intrinsic parameters as well as the radial and tangential distortion parameters according to the Brown–Conrady model (Figure 5).

Figure 5. Workflow of geometric calibration on OpenCV.

• Extrinsic parameters

This test consists of finding the parameters of the rotation–translation matrix that links the 3D frames of the two cameras. It requires at least a single image of the same target (the ChArUco board) taken by both cameras simultaneously. In our case, a set of five images was taken by moving the target to various positions. The parameters were then determined using a function available in the SDK (Software Development Kit).
2.2. 3D Reconstruction of Indoor Scenes
Open3D is a pipeline dedicated to 3D scene reconstruction from RGB-D data. It
is based on the work of [11] and improved by the ideas of [17]. This framework was
chosen for being an open-source project and for containing a set of carefully selected data
structures and algorithms designed for manipulating 3D data. A diagram presenting the
data preparation step and the processing part is shown in Figure 6.
The first step consists of collecting and preprocessing the RGB-D inputs. We used a
custom C++ program, built using functions from the SDK, to extract and transform the color
and depth images into a single coordinate system. The RGB-D input was then integrated
into the Open3D pipeline to generate a 3D mesh of the scene.
The experiment presented in this section had a double objective. The first one was to
evaluate the performance of the Kinect Azure camera in a real indoor environment, while
the second one was to evaluate the robustness of our methodological workflow in two
scenes with distinct characteristics. The first scene was a furnished room with a single
window. The second scene was a vacant office with several windows.
Two depth modes, WFOV (wide field-of-view) and NFOV (narrow field-of-view) unbinned,
were studied in combination with the 1536p resolution of the color camera,
which has a large field of view especially in the vertical direction. Furthermore, the quality
of the reconstruction was assessed both by keeping the windows uncovered and then by
covering them to block sunlight.
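The Open3D reconstruction system itself is driven by configuration files and performs fragment creation, registration, and integration; the sketch below only illustrates the final volumetric integration stage, assuming the color and depth frames have already been extracted to disk and their poses estimated (the intrinsics, file names, and voxel size are placeholders, not the values used in our experiments).

```python
import numpy as np
import open3d as o3d

# Placeholder colour-camera intrinsics; the real values come from Section 2.1.5.
intrinsic = o3d.camera.PinholeCameraIntrinsic(2048, 1536, 970.0, 970.0, 1024.0, 768.0)

volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.01, sdf_trunc=0.04,
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

poses = np.load("camera_poses.npy")              # (N, 4, 4) poses from the registration stage
for i, pose in enumerate(poses):
    color = o3d.io.read_image(f"frames/color_{i:05d}.jpg")
    depth = o3d.io.read_image(f"frames/depth_{i:05d}.png")
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, depth_trunc=3.0, convert_rgb_to_intensity=False)
    # TSDF integration expects the extrinsic as a world-to-camera transformation.
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))

mesh = volume.extract_triangle_mesh()
mesh.compute_vertex_normals()
o3d.io.write_triangle_mesh("scene_mesh.ply", mesh)
```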
Figure 6. Diagram of the reconstruction of a 3D point cloud from RGB-D data in Open3D.
2.3. Quality Assessment of the Reconstruction with Respect to a TLS and a MLS

In this section, we conduct a comparison of the geometric quality of the 3D scene reconstruction resulting from the Kinect Azure with 3D models obtained by a TLS (FARO FOCUS S + 150) and a MLS (GeoSLAM ZEB Horizon). A comparison of the technical specifications of the three devices is given in Table 1. The comparison with the GeoSLAM is interesting since its acquisition principle, as well as its registration process, is quite similar to that of the Azure Kinect: the user continuously moves through the room while scanning, and a rough registration is performed in real time through SLAM algorithms.

Table 1. Comparison of the technical specifications of the three devices (Microsoft Kinect Azure 2019, FARO 2019, GeoSLAM 2019). C: color camera; D: depth camera; NFOVU: narrow field of view unbinned.

                              FARO Focus S Plus 150   GeoSLAM ZEB Horizon   Kinect Azure (NFOVU)
Range (m)                     150                     100                   3.86
FOV (degrees)                 360 × 300               360 × 270             C: 90 × 74.3; D: 75 × 65
Weight (g)                    4200                    3700                  440
Scanning velocity (pts/s)     2,000,000               300,000               -
Relative precision (mm)       1                       10–30                 11
Raw data file size (MB/min)   40–50                   100–200               2000–3000

Before calculating the distances between the clouds, we first needed to align them. We chose the cloud from the TLS as the reference since it is the most accurate. The other clouds were first aligned manually; fine registration was then performed using the ICP algorithm integrated in CloudCompare, with a minimum RMS improvement of 1.0 × 10−5 between two consecutive iterations used as the criterion to validate the registration result.

The comparison is based on absolute cloud-to-cloud (C2C) distances. This metric was chosen to facilitate the interpretation of the results. For each point of the compared cloud, the distance, in absolute value, to the nearest neighbor belonging to the reference cloud was calculated (Figure 7).

Figure 7. Workflow of data acquisition and comparison of point clouds two by two based on absolute C2C distances.
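The alignment and C2C comparison were carried out in CloudCompare; the following sketch reproduces the same logic in Python with Open3D (the file names, the ICP correspondence threshold, and the initial transformation are illustrative assumptions).

```python
import numpy as np
import open3d as o3d

ref = o3d.io.read_point_cloud("tls_scan.ply")              # reference cloud (TLS)
cmp = o3d.io.read_point_cloud("kinect_reconstruction.ply")  # compared cloud (Kinect Azure)

# Fine registration with point-to-point ICP, starting from a manual coarse alignment.
init = np.eye(4)                                           # assume a rough manual alignment was applied
icp = o3d.pipelines.registration.registration_icp(
    cmp, ref, 0.05, init,
    o3d.pipelines.registration.TransformationEstimationPointToPoint(),
    o3d.pipelines.registration.ICPConvergenceCriteria(relative_rmse=1e-5))
cmp.transform(icp.transformation)

# Absolute C2C distances: for each point of the compared cloud,
# the distance to its nearest neighbour in the reference cloud.
d = np.asarray(cmp.compute_point_cloud_distance(ref))
print(f"mean C2C distance: {d.mean() * 1000:.1f} mm, std: {d.std() * 1000:.1f} mm")
print(f"points below 10 mm: {100.0 * (d < 0.01).mean():.1f} %")
```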
3. Results
In this section, we present the results of the performance tests as well as the 3D
In this section,
reconstruction. we present
The conclusions the from
drawn resultstheof thetests
first performance tests
allow a better as well as the
understanding of 3D re-
the behavior ofThe
construction. the device and confirmed
conclusions its potential
drawn from in terms
the first of 3D reconstruction.
tests allow a better understanding of
the behavior of the device and confirmed its potential in terms of 3D reconstruction.
3.1. Azure Kinect Performance Tests
This first section presents the results of the experiments that focused on the technical
3.1. Azure Kinect Performance Tests
specifications and the performance of the Azure Kinect camera.
This first section presents the results of the experiments that focused on the technical
3.1.1. Influence and
specifications of Image Averaging
the performance of the Azure Kinect camera.
Figure 8 presents the results obtained by calculating the standard deviations of the
depth measurements and varying the number of consecutive frames. We can see that
the standard deviations become higher on the edges of the objects as well as on the
reflective surfaces (here, the metal plate). This effect is observed on all depth maps and
does not necessarily disappear with increasing the number of frames. Furthermore, the
visual rendering of the results becomes increasingly smooth by increasing the number of
successive frames averaged (Figure 8). This aspect is particularly noticeable when going
from 20 to 100 depth maps.
In general, the standard deviations of any pixel (not belonging to the edges of the
objects) decrease by about 2 mm (from approx. 9.3 mm to 7.3 mm) when increasing the
number of averaged frames from 20 to 300 (Figure 9).
By going from 20 to 100 depth maps, the standard deviation decreases by 1.3 mm (from
approx. 9.3 mm to 8.0 mm), which corresponds to 80% of the expected improvement with
300 frames. At this stage, adding more frames does not bring any significant improvement
in the standard deviations. We conclude, therefore, that it is recommended to average
100 successive frames.
standard
surfacesdeviations become
(here, the metal higher
plate). Thison the edges
effect of theonobjects
is observed as well
all depth mapsasand
on does
the reflective
not
necessarily
surfaces disappear
(here, with
the metal increasing
plate). This the number
effect of frames.
is observed onFurthermore, the visual
all depth maps and ren-
does not
dering of disappear
necessarily the results becomes increasingly
with increasing thesmooth
number byofincreasing the number of successive
frames. Furthermore, the visual ren-
frames
dering averaged
of the (Figure
results 8). This
becomes aspect is particularly
increasingly smooth by noticeable
increasingwhen going from
the number 20 to
of successive
Sensors 2022, 22, 9222 100 depth maps.
frames averaged (Figure 8). This aspect is particularly noticeable when going from 9 of 19 20 to
100 depth maps.

Figure 8. Evolution of standard deviations [in mm] of measurements according to the number of
averaged frames.

In general, the standard deviations of any pixel (not belonging to the edges of the
Figure 8. 8.
Figure Evolution
objects) Evolutionof
decrease ofstandard
by standard deviations
deviations
about 2 mm [inmm]
(from[in mm]of9.3
approx. of measurements
measurements according
according
mm to 7.3 mm) thetonumber
the number
whentoincreasing of
the of
averaged frames.
averaged frames.
number of averaged frames from 20 to 300 (Figure 9).

In general, the standard deviations of any pixel (not belonging to the edges of the
objects) decrease by about 2 mm (from approx. 9.3 mm to 7.3 mm) when increasing the
number of averaged frames from 20 to 300 (Figure 9).

Figure 9.
Figure 9. Evolution
Evolutionofofthe
the standard
standard deviation
deviation of aof a pixel
pixel as a function
as a function of the number
of the number of averaged
of averaged frames.
frames.
3.1.2. Accuracy
Bypoint
A goingcloud
fromof 20the
to 100
same depth maps,
indoor scenethewas
standard deviation
extracted from thedecreases by 1.3 and
Kinect Azure mm
(from approx.
compared with9.3
themm to 8.0
point mm),
cloud which
from a 3Dcorresponds
scan performedto 80% of the
by the TLS.expected improvement
The average distance
with 300 frames.
calculated betweenAt thethis
twostage,
cloudsadding
reachesmore
8 mm, frames
while does not bring
the standard any significant
deviation is 8 mm
Figure 9. Evolution of the standard deviation of a pixel as a function of the number of averaged
(Figure 10). The histogram of the distances shows that most of the deviations are less
frames.
than 10 mm, which is acceptable for standard topographic surveys. It is important to note
that the most pronounced gaps are located on the contours of the details, including walls,
By going
doors, from
electrical box,20and
to at
100thedepth maps,
contours the
of the standard
intercom. deviation
A more decreases
detailed by 1.3 mm
demonstration
(from approx. 9.3 mm to 8.0 mm), which corresponds to 80% of the expected improvement
of the procedure, as well as an in-depth analysis of this comparison, will be addressed in a
with
later300
test.frames. At this stage, adding more frames does not bring any significant
mmthan 10 mm,
(Figure 10).which is acceptable
The histogram of the for standard
distances topographic
shows that most ofsurveys. It is important
the deviations are less to note
than
that10the
mm, which
most is acceptable
pronounced for are
gaps standard
locatedtopographic surveys.of
on the contours It the
is important to note
details, including walls,
that the most pronounced gaps are located on the contours of the details, including
doors, electrical box, and at the contours of the intercom. A more detailed demonstration walls,
doors,
of theelectrical box, as
procedure, andwell
at the
as contours of theanalysis
an in-depth intercom.
ofAthis
more detailed demonstration
comparison, will be addressed in
of the procedure, as well as an in-depth analysis of this comparison, will be addressed 10 inof 19
Sensors 2022, 22, 9222 a later test.
a later test.

Figure 10. Comparison of the point cloud extracted from the Kinect Azure to the cloud from the
TLS:Figure
(a) the10. Comparison
differences of thethe
between point
twocloud extracted
clouds from
[in m], (b) the Kinect
histogram ofAzure to the[in
the values cloud
m]. from the TLS:
Figure
(a) the10. Comparison
differences of the
between point
the two cloud
clouds [inextracted from the
m], (b) histogram Kinect
of the Azure
values [in m].to the cloud from the
TLS: (a) the differences between the two clouds [in m], (b) histogram of the values [in m].
3.1.3. Influence
3.1.3. of Distance
Influence on Accuracy
of Distance on Accuracy
First,
3.1.3. we notice
Influence
First, of that
we notice there
Distance
that is
onan
there is offset
Accuracy between
an offset betweenthe the
camera
cameramounting
mounting screw andand
screw the the
depth camera
depth lenslens
camera corresponding
corresponding to the origin of the
of camera coordinate system. Referring
to the First,
data we notice
sheet of the that there
sensor, this istoan
theoffset
offset is
origin
estimated
the camera
betweenat themm
50.4
coordinate
camera
(Figure
system.
mounting
11). This
Referring
screw and the
value
to the data sheet of the sensor, this offset is estimated at 50.4 mm (Figure 11). This value was
depth
wastaken camera
takeninto lens corresponding
into consideration
consideration toto correct to the origin of the camera coordinate system. Referring
correct measurements.
measurements.Using Usingitsitsmounting
mountingscrew,screw,the
thecam-
camera
towas
era the accurately
was data sheetplaced
accurately of theon
placed sensor,
on level this
aa level
offset isbase
topographic
topographic
estimated
base at 50.4on
and mounted
and mounted mm
on (Figure 11). This value
aa tripod.
tripod.
was taken into consideration to correct measurements. Using its mounting screw, the cam-
era was accurately placed on a level topographic base and mounted on a tripod.

Figure 11. Schema of the difference between the distance measured by the Kinect and the true
horizontal distance from a point on the ground.

We can notice that there is no direct and clear correlation between the evolution of the
accuracy and the distance of the measurements. However, we can say that the accuracy
of our device tends, generally, to decline with increasing distance, ranging from +4 mm
to −11 mm (Figure 12). This is not the case for large distances where the deviation seems
to oscillate around the same value. The graphs corresponding to the different modes
supported here show almost the same pattern with differences in values. In general, the
deviations do not exceed −11 mm except for the NFOV binned mode at 4500 mm. For
small distances, the two variants of the narrow mode seem to give the smallest deviations
compared with those of the wide mode.
seems
mm to oscillate
to −11 around 12).
mm (Figure the This
sameisvalue.
not theThe
casegraphs corresponding
for large to the
distances where thediffere
devi
modes
seems to oscillate around the same value. The graphs corresponding to theIndiff
supported here show almost the same pattern with differences in values. ge
eral, the deviations do not exceed –11 mm except for the NFOV binned mode
modes supported here show almost the same pattern with differences in values. In at 4500 mm
Foreral,
smallthedistances,
deviationsthedotwo
not variants
exceed –11of the
mmnarrow mode
except for theseem
NFOV tobinned
give themode
smallest dev
at 4500
ations compared with those of the wide mode.
For small distances, the two variants of the narrow mode seem to give the smallest d
ations compared with those of the wide mode.

Figure 12. Evolution of the deviations, in mm, from the “true” distances.
Figure 12. Evolution
Figureof
12.the deviations,
Evolution in mm,
of the from the
deviations, in“true” distances.
mm, from the “true” distances.
3.1.4. Influence of the Distance on Precision
3.1.4. Influence of the
In
3.1.4. Distance
general,
Influencethe on
thePrecision
ofstandard deviations
Distance for all modes tend to increase with distance b
on Precision
In general,
do not theexceed
standard deviations for all modes tend to increaseclose-range
with distance but
In general, the standard deviationsisfor
6 mm (Figure 13), which interesting
all modesfortend to increase applications. T
with distance
do not exceed 6 mm
narrow-mode (Figure 13),
variants which
show is interesting
better results for
than close-range
the wide applications.
mode. Similarly,The
for the sam
do not exceed 6 mm (Figure 13), which is interesting for close-range applications.
narrow-mode mode,variants showvariants
the binned better results than thedeviations
showbetter
smaller wide mode. Similarly,
compared withforthe
theunbinned.
same
narrow-mode variants show results than the wide mode. Similarly, for the s
mode, the binned variants show smaller deviations compared with the unbinned.
mode, the binned variants show smaller deviations compared with the unbinned.

Figure 13.ofEvolution
Figure 13. Evolution standard of standardofdeviations
deviations of measurements
measurements as a functionasofadistance
function[in
of mm].
distance [in mm].
Figure 13. Evolution of standard deviations of measurements as a function of distance [in mm
In summary, by examining the effect of frame averaging on the noise, it can be
concluded that the averaging of 100 successive maps is sufficient and gives the best results
(1.3 mm decrease in the standard deviation going from 20 to 100 depth maps). We also
examined the accuracy of our instrument by comparing it with a TLS. The mapping of the
differences between the two point clouds showed that they do not exceed 8 mm. Such
accuracy motivates the use of the Azure Kinect for accurate indoor mapping. The analysis
of the variations in the accuracy and precision as a function of the distance show that the
narrow-mode variants present the best results. The wide mode, on the other hand, has
the advantage of allowing a much wider shot. We conclude, therefore, that the choice of
mode should be made following the aspects that interest the user. Our results confirm that
the standard deviation of the Kinect Azure does not exceed 6 mm at a maximum range of
5100 mm.

3.2. 3D Reconstruction
Before presenting the result of the 3D reconstruction, it is interesting to assess the time
required for applying the workflow, from data acquisition to the final result, for the two
scenes (Table 2).

Table 2. Time spent during data acquisition and processing for 3D reconstruction.

Processing Time
Phase Operation
Room Office
Data acquisition Scanning with the Kinect 2 min 13 s 2 min 05 s
Image extraction and
Preprocessing alignment on the same 27 min 34 s 22 min 46 s
coordinate system
Fragment creation 3 h 33 min 26 s 2 h 31 min 25 s
Fragment registration 1 min 34 s 2 min 24 s
3D reconstruction
Fine registration 20 s 45 s
Integration 11 min 11 s 42 min 09 s
Total 4 h 16 min 18 s 3 h 41 min 34 s

The long processing time, in the case of the room, is due to the high number of frames,
which result from the use of an FPS (frames-per-second) rate of 30, in addition to scanning
the scene twice due to the use of the NFOV mode, which has a limited field of view in the
vertical direction.
The results of the reconstruction of the two scenes are shown in Figures 14 and 15.

Figure14.
Figure 14.Result
Resultofof
thethe indoor
indoor 3D 3D reconstruction
reconstruction of theof the room:
room: (a) view
(a) global global
ofview of the
the room room
from the from the
top;(b–d)
top; (b–d)focus
focus
onon corners
corners andand objects.
objects.
Figure 14. Result of the indoor 3D reconstruction of the room: (a) global view of the room from the
top; (b–d) focus on corners and objects.

Figure15.
Figure 15. Results
Resultsofofindoor
indoor3D
3Dreconstruction
reconstruction of the
of the office
office scene.
scene. (a) keeping
(a) by by keeping the windows un
the windows
covered; (b)
uncovered; (b)by
bycovering thewindows.
covering the windows.

3.3. Evaluation of the Quality of the 3D Reconstruction


Concerning the first scene (the room), in the absence of a reference model, we opted
for a visual evaluation. The reconstruction performed can be qualified as successful. The
for a visual which
loop closure, evaluation.
meansThe that reconstruction
a correspondence performed
was made can be qualified
between the detailsascaptured
successful. The
loop
at theclosure,
beginningwhich means
and the samethat a correspondence
details were captured wasat themade between
end of the details
the acquisition, wascaptured
at the beginning and the same details were captured at the end of
performed correctly, and odometry drift was not observed, even with the presence of anthe acquisition, was
performedwindow.
uncovered correctly,
Theand odometry
result drift
of the color was notwas
integration observed, evencorrect
also visually with the presence
(Figure 14). of an
Although
uncovered the NFOV
window. Themodes
resultenable
of theacolor
betterintegration
depth range, they
was arevisually
also not suitable for (Figure
correct
capturing
14). large scenes in an efficient way because their vertical field of view is rather
limited, which does not capture the full height of the walls, even when scanning from the
maximum range. On the contrary, the WFOV modes have a wider field of view (horizontally
and vertically), which allows for reducing the number of frames. It can be noticed that the
depth range offered by the WFOV mode is sufficient for a close-range modeling application.
It is important to note that the occlusion problem persists (Figure 14c,d). Due to the
nonvisibility of objects from all angles, there is a lack of depth data for some objects. Indeed,
in order to avoid problems during the registration part of the post-processing, we cannot
dwell too much on all the details of the scene. It is, therefore, judicious to try to simplify
the trajectory of the camera so that the solution on Open3D converges correctly.
Concerning the office scene, a first attempt was made by keeping the windows ex-
posed. As expected, the presence of shiny surfaces led to erroneous depth measure-
ments. This induced an incorrect reconstruction of the scene (Figure 15a). We observe
that the loop closure failed, as indicated by a staircase effect when the fragments were re-
aligned. To confirm this, a second attempt was made by covering the window with curtains
(Figure 15b). At first glance, the loop closure appeared successful, and the fragments were
aligned correctly. However, an odometry drift effect persisted. This was highlighted by
overlaying this result with the point cloud from the TLS (Figure 16). We note that this effect
began, roughly, halfway through the windows and continued to accumulate as we went
along. Although loop closure in this case mitigated these errors, it was apparent that the
reconstruction deviated significantly from reality at several locations in the scene, especially
when we got close to the windows.

Figure
Figure16.16.Odometry
Odometry drift effect highlighted
drift effect highlightedby
bysuperposing
superposing point
point clouds
clouds from
from TLSTLS (green)
(green) and
and point
point clouds from Kinect Azure (purple). (a) Top view of the point clouds, indicating the start
clouds from Kinect Azure (purple). (a) Top view of the point clouds, indicating the start point and point
and the trajectory. (b) Perspective view. (c) Zoom on a corner of the room to highlight the odometry
the trajectory. (b) Perspective view. (c) Zoom on a corner of the room to highlight the odometry drift.
drift.
We should note that in both cases, the acquisition was tested with both modes, NFOV
and WFOV, and that this did not significantly affect the final result. The use of the WFOV
mode allowed capturing the whole scene in a single turn by holding the camera regularly
in front of the surface to be scanned. However, corners and wall intersections should be
scanned carefully since they very often resulted in zero-value pixels due to the multipath
effect. Averaging several successive depth maps, in the form of fragments, helps to remedy
this problem somewhat. However, this still needs to be taken into consideration when
scanning the scene.
In conclusion, the performed experiments are useful for identifying, on the one hand,
the potential of the sensor in 3D scene reconstruction, and on the other hand, its short-
comings and limitations in order to propose possible improvement paths. Indeed, the
combination of the WFOV mode with a high resolution of the color camera allows efficiency
in terms of data acquisition time (by scanning the whole scene in a single turn), file size in
storage, and processing time. However, it was found that scenes containing a large area
of windows can be problematic due to erroneous depth values. This is not the case for
small windows found in typical indoor environments, which do not seem to affect the
final result since the resulting zero-pixel values constitute only a small part of the captured
depth maps.

3.4. Comparison of the Geometric Quality of Reconstruction (TLS–MLS)


The result of comparing the point clouds is presented in Figure 17. Table 3 summarizes
the means and standard deviations observed by performing the following comparisons
of point clouds: (1) the TLS with the MLS; (2) the TLS with the Kinect Azure, and (3) the
MLS with the Kinect Azure. The mean differences between the clouds are less than 10 mm,
which is very satisfying.

Figure 17. Means and standard deviations of C2C distances between the point clouds: (a) TLS vs.
Figure 17. Means and standard deviations of C2C distances between the point clouds: (a) TLS vs.
MLS, (b) TLS vs. Kinect Azure, (c) MLS vs. Kinect Azure.
MLS, (b) TLS vs. Kinect Azure, (c) MLS vs. Kinect Azure.
Table 3. Mean C2C distance [in mm] and standard deviations [in mm] between point clouds.
Table 3. Mean C2C distance [in mm] and standard deviations [in mm] between point clouds.
Average Distance (mm) Standard Deviation (mm)
TLS–MLS
Average Distance
6
(mm) Standard Deviation
4
(mm)
TLS–MLS
TLS–Kinect Azure 6 8 8 4
MLS–Kinect Azure
TLS–Kinect Azure 8 8 10 8

MLS–Kinect Azure 8 10
Visually, the 3D mesh obtained from the Kinect is smoothed at the corners as well
as at the edges of the walls. This effect explains a large part of the deviations compared
Visually, the 3D mesh obtained from the Kinect is smoothed at the corners as well as
with the other two clouds. The algorithm used encountered more difficulties in matching
at the edges of the walls. This effect explains a large part of the deviations compared with
the images of the second scene (office) than those of the first scene (room). This can be
the other two
explained clouds.
by the The
fact that algorithm
the used distinctive
room contains encountered more
points thatdifficulties intask
facilitate the matching
of the
images of the second scene (office) than those of the first scene (room).
image matching, unlike the office, which has only white walls. It would have been useful This can be ex-
plained byexample,
to put, for the fact that thestickers
colored room contains distinctive
on the office walls. points that facilitate the task of image
matching, unlike
For each of thethe
twooffice, which
clouds, has onlyawhite
we extracted part ofwalls. It would
the wall have beenauseful
and interpolated corre- to put,
sponding least squares plane. Then, the deviations—in
for example, colored stickers on the office walls. absolute values—were calculated
for these
For planes
each ofand thereported in Table
two clouds, we4. extracted
The TLS point cloud
a part of presents
the wallaandstandard deviationa corre-
interpolated
of 0.7 mm, while the MLS point cloud presents a deviation of 3.7 mm from the adjusted
sponding least squares plane. Then, the deviations—in absolute values—were calculated
plane. This confirms that the deviations between the clouds from the two devices are due
for these planes and reported in Table 4. The TLS point cloud presents a standard devia-
to the noise inherent to the measurements made with the MLS.
tion of 0.7 mm, while the MLS point cloud presents a deviation of 3.7 mm from the ad-
justed plane. This confirms that the deviations between the clouds from the two devices
are due to the noise inherent to the measurements made with the MLS.
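For reference, the least-squares plane interpolation and the point-to-plane deviations reported in Table 4 can be reproduced with a short SVD-based sketch (the input file name is hypothetical; this is an illustration, not the exact processing chain used here).

```python
import numpy as np
import open3d as o3d

def plane_deviation_stats(points: np.ndarray):
    """Fit a least-squares plane to (N, 3) points and return |deviation| statistics in mm."""
    centroid = points.mean(axis=0)
    # The plane normal is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1]
    dist = np.abs((points - centroid) @ normal)   # orthogonal distances to the fitted plane
    return 1000.0 * dist.mean(), 1000.0 * dist.std()

wall = o3d.io.read_point_cloud("wall_patch_tls.ply")   # hypothetical extracted wall segment
mean_mm, std_mm = plane_deviation_stats(np.asarray(wall.points))
print(f"mean |deviation|: {mean_mm:.1f} mm, std: {std_mm:.1f} mm")
```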



Table 4. Average values (in mm) and standard deviations (in mm) of the deviations between the
clouds—from TLS and MLS—from the interpolated planes.
Table 4. Average values (in mm) and standard deviations (in mm) of the deviations between the
Deviations
clouds—from TLS and MLS—from the interpolated planes. to the Interpolated Plane
Average Deviation (mm) Standard Deviation (mm)
Deviations to the Interpolated Plane
TLS 0.7 0.7
Average Deviation (mm) Standard Deviation (mm)
MLS 4.6 3.7
TLS 0.7 0.7
MLS 4.6 3.7
The differences in the point clouds resulting from the two scanners with the Kinect Azure are almost the same (about 8 mm). We also note that the distributions are left skewed (Figure 17), which means that most of the deviations are less than 10 mm (78% of the values for the TLS and 86% of the values for the MLS). However, these values alone are not sufficient to validate the quality of the geometric reconstruction.

Obviously, the deviations are higher in the salient elements located on the wall plane (electrical box, intercom, door handle), in addition to the edges of the wall and the door. Deviations are also observed in the intersections of the wall with the ground, as shown in Figure 18.
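A cloud-to-cloud check of this kind can be scripted with Open3D's nearest-neighbour distances. The following minimal sketch uses hypothetical file names, assumes both clouds are already registered in the same frame and expressed in meters, and is not necessarily the exact comparison tool used for Figures 17 and 18.

import numpy as np
import open3d as o3d

# Hypothetical, already co-registered clouds of the same scene.
kinect = o3d.io.read_point_cloud("scene_kinect_azure.ply")
reference = o3d.io.read_point_cloud("scene_tls.ply")   # or the MLS cloud

# For each Kinect point, the distance to its nearest neighbour in the reference.
d = np.asarray(kinect.compute_point_cloud_distance(reference))

print(f"mean deviation    : {1000 * d.mean():.1f} mm")
print(f"std of deviations : {1000 * d.std():.1f} mm")
print(f"share below 10 mm : {100 * (d < 0.010).mean():.0f} %")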
Figure 18. Comparison of the intersection of the wall with the ground. (a) TLS (white) vs. Kinect Azure (color), (b) MLS (white) vs. Kinect Azure (color).
The error on the edges of the details is probably due to the angle of the shot. Unfortunately, it is not practical to capture objects from all angles, especially small ones. This is the case for the box, the handle, and the intercom.
The error that affects the intersections is suspected to come from the multipath phenomenon. Indeed, these parts are often not captured at once by the depth sensor because of the angle of incidence of the infrared signal. The signal bounces off several surfaces, for example between the ends of the wall and the floor, which results in pixels of zero values (Figure 19). This is one of the main limitations of using this type of sensor for indoor reconstruction: even if a detail of the scene is visible, depth information may not be available for the entire frame in question. This effect can be alleviated by varying the angle of view and by taking several successive images, but the right ratio between acquisition time and the required level of detail must be defined in future works. Another potential solution would be to perform a depth calibration.
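In practice, these zero-valued pixels should at least be discarded before the depth frames are integrated, so that they do not generate spurious 3D points. The sketch below is a simple NumPy illustration, assuming the depth frame is available as a uint16 array in millimeters; the range limits are indicative values for the NFOV unbinned mode, not figures from this study.

import numpy as np

def valid_depth_mask(depth_mm: np.ndarray,
                     near_mm: int = 500, far_mm: int = 3860) -> np.ndarray:
    # Reject pixels returned as 0 (no measurement, e.g., multipath at wall/floor
    # intersections) as well as values outside the assumed working range.
    return (depth_mm > near_mm) & (depth_mm < far_mm)

# Example with a placeholder frame at the NFOV unbinned resolution (640 x 576).
depth = np.zeros((576, 640), dtype=np.uint16)
mask = valid_depth_mask(depth)
print(f"rejected pixels: {100 * (~mask).mean():.1f} %")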
Figure 19. Multipath phenomenon observed on the intersection of two walls at the upper corner of a room, as indicated by the red arrows. Pixels in black have zero depth values: depth image (left), passive infrared image (right). Source: Microsoft.

4. Discussion

The indirect ToF technology has proven its robustness in several applications, including indoor 3D reconstruction. The new version of the Kinect sensors brings even more improvements, mainly centimeter-level precision and accuracy and the introduction of several depth modes—each tailored to a different scenario—in addition to the integration of an inertial measurement unit.

As with the second Kinect version, the Azure camera needs to be warmed up for a relatively long time in order to stabilize the output. The order of magnitude of the improvement due to a 40–50 min warm-up is approximately 2 mm.

Overall, the performance investigation tests yielded interesting values with regard to accurate indoor modeling applications. The methodological workflow developed for 3D reconstruction was carried out using the open-source framework Open3D. It is particularly interesting for customizing the processing chain, allowing the implementation of algorithms, which was the case for this study.

It is also worth mentioning the effect of distortion in the intersections of surfaces (between walls and the floor, for example). This effect persists even after performing geometric calibration to correct for distortion effects caused by the camera lens. Probable causes have been discussed, such as the multipath phenomenon or the lack of depth calibration, which requires further testing in other environments.

In conclusion, the Kinect Azure has several advantages for 3D indoor reconstruction. Its small weight and miniaturization make it easy to handle. The large number of frames per second is useful for avoiding the blur effect when scanning the scene. Finally, its cost is exceptionally low compared with professional RGB-D sensors.

The negative points that we have highlighted concern the data acquisition solution available with the SDK. For instance, the SDK does not present real-time feedback for visualizing the output of the camera as we go through the scene. The need to connect two cables, one for the power supply and the other for the connection with the workstation, affects practicality since the device must remain connected to a power source to work. It would have been better to combine the two into one cable so that the device can be powered directly from the laptop. Finally, the camera also performs poorly with highly reflective materials, which are quite common in indoor scenes. Therefore, particular care must be taken during data acquisition.

5. Conclusions

The Azure camera is the latest addition to Microsoft's line of Kinect sensors. According to the manufacturer, it has notable performance improvements over its predecessors. In this paper, we performed a series of experiments to evaluate the potential of the new
Kinect in the context of 3D indoor reconstruction. The aim of this paper was to draw first
conclusions about this device and to see if it could potentially become a low-cost alternative
to close-range laser scanners for 3D enthusiasts. The main contributions of our work are
threefold: (a) the evaluation of the indoor performance of the Azure, (b) the construction of
a 3D model of real indoor scenes based on Kinect acquisition, and (c) the evaluation of its
geometric quality, referring to a more accurate reference as well as the comparison of the
resulting model with those obtained from both a terrestrial and a mobile 3D scanner.
Based on our experiments, we can say that in terms of accuracy and precision, this
sensor has a significant potential in 3D indoor modeling applications. It offers a variety
of modes that users can adapt to their needs, as well as competitive resolutions compared
with other low-cost sensors. However, the sensor still has some drawbacks, including
the long warm-up time. As with many other sensors based on ToF technology, the Azure
is also affected by the phenomenon of flying pixels and multipath interference. The 3D
scene reconstruction confirmed the initial conclusions concerning the potential of the Azure
in BIM applications. Nevertheless, it seems that the robustness of the solution depends,
to a large extent, on the architecture of the scene in question. This conclusion, however,
needs to be investigated further by experimenting with other indoor scene configurations.
A comparison of the point cloud from the Azure with that from a TLS and another from
an MLS showed that the average differences between the measurements from the laser
scanners and the measurements from the Azure Kinect do not exceed 8 mm.
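As an aside for readers who wish to script the acquisition itself, the depth mode, color resolution, and frame rate mentioned above are all selectable at capture time. The sketch below uses the open-source pyk4a bindings purely as an illustration (the acquisitions in this work relied on the official SDK tools), so the exact names should be treated as assumptions.

# Illustrative only: pyk4a is an open-source Python wrapper around the
# Azure Kinect Sensor SDK and was not part of this study's workflow.
from pyk4a import PyK4A, Config, ColorResolution, DepthMode, FPS

k4a = PyK4A(Config(color_resolution=ColorResolution.RES_720P,
                   depth_mode=DepthMode.NFOV_UNBINNED,   # one of the available depth modes
                   camera_fps=FPS.FPS_30,
                   synchronized_images_only=True))
k4a.start()
capture = k4a.get_capture()
depth_mm = capture.depth      # uint16 depth image in millimeters
color_bgra = capture.color    # color image from the RGB camera
k4a.stop()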
In conclusion, the Azure Kinect is a promising device. It surely has potential that
deserves to be exploited for a wide range of applications, mainly 3D interior reconstruction
and as-built BIM.
In the future, some improvements that we suggest concern the development of a
data acquisition application with a better interface for visualizing the data stream during
scanning. This is one of the challenging aspects of SLAM sensors, as it requires performing complex calculations and displaying large amounts of data simultaneously and in real time.
This could be partially overcome by mapping the computed trajectory of the sensor. Other
interesting perspectives would be implementing the integrated IMU (inertial measurement
unit) in the reconstruction workflow to mitigate the effect of odometry drift, extending
the experimentation to different environments such as adjoining rooms, and exploring the
contribution of depth calibration. Although we compared the computed Azure Kinect point
cloud with those produced by a TLS and an MLS, it should also be tested with segmentation
algorithms that integrate into scan-to-BIM workflows.

Author Contributions: Conceptualization, C.D., H.L., R.H. and T.L.; methodology, C.D., H.L., R.H.,
I.R. and T.L.; validation, C.D., H.L., R.H. and T.L.; writing—original draft preparation, C.D., H.L.,
R.H. and T.L.; writing—review and editing, C.D., H.L., R.H., I.R. and T.L.; supervision, R.H. and T.L.
All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data supporting reported results can be found at the following link:
https://mega.nz/folder/ctIFRSyL#p80SkY5t_BBoDTSm_9WsqA (accessed on 15 November 2022).
Acknowledgments: The authors would like to thank the Geoptima company for providing the
material used in this research.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Macher, H.; Landes, T.; Grussenmeyer, P. Point Clouds Segmentation as Base for As-Built BIM Creation. In ISPRS Annals of
the Photogrammetry, Remote Sensing and Spatial Information Sciences; Copernicus GmbH: Taipei, Taiwan, 2015; Volume II-5-W3,
pp. 191–197.
2. Henry, P.; Krainin, M.; Herbst, E.; Ren, X.; Fox, D. RGB-D Mapping: Using Kinect-Style Depth Cameras for Dense 3D Modeling of
Indoor Environments. Int. J. Robot. Res. 2012, 31, 647–663. [CrossRef]
3. Li, Y.; Li, W.; Tang, S.; Darwish, W.; Hu, Y.; Chen, W. Automatic Indoor As-Built Building Information Models Generation by
Using Low-Cost RGB-D Sensors. Sensors 2020, 20, 293. [CrossRef]
4. Lachat, E.; Landes, T.; Grussenmeyer, P. Combination Of Tls Point Clouds And 3d Data From Kinect V2 Sensor To Complete
Indoor Models. In International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences; Copernicus GmbH:
Prague, Czech Republic, 2016; Volume XLI-B5, pp. 659–666.
5. Tölgyessy, M.; Dekan, M.; Chovanec, L’.; Hubinský, P. Evaluation of the Azure Kinect and Its Comparison to Kinect V1 and Kinect
V2. Sensors 2021, 21, 413. [CrossRef]
6. Kurillo, G.; Hemingway, E.; Cheng, M.-L.; Cheng, L. Evaluating the Accuracy of the Azure Kinect and Kinect V2. Sensors 2022, 22,
2469. [CrossRef] [PubMed]
7. Weinmann, M.; Jäger, M.A.; Wursthorn, S.; Jutzi, B.; Weinmann, M.; Hübner, P. 3D Indoor Mapping With The Microsoft Hololens:
Qualitative And Quantitative Evaluation By Means Of Geometric Features. In ISPRS Annals of the Photogrammetry, Remote Sensing
and Spatial Information Sciences; Copernicus GmbH: Göttingen, Germany, 2020; Volume V-1-2020, pp. 165–172.
8. Darwish, W.; Li, W.; Tang, S.; Li, Y.; Chen, W. An RGB-D Data Processing Framework Based On Environment Constraints For
Mapping Indoor Environments. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences; Copernicus
GmbH: Enschede, The Netherlands, 2019; Volume IV-2-W5, pp. 263–270.
9. Tang, S.; Zhu, Q.; Chen, W.; Darwish, W.; Wu, B.; Hu, H.; Chen, M. Enhanced RGB-D Mapping Method for Detailed 3D Indoor
and Outdoor Modeling. Sensors 2016, 16, 1589. [CrossRef] [PubMed]
10. Lachat, E.; Macher, H.; Landes, T.; Grussenmeyer, P. Assessment and Calibration of a RGB-D Camera (Kinect v2 Sensor)- Towards
a Potential Use for Close-Range 3D Modeling. Remote Sens. 2015, 7, 13070–13097. [CrossRef]
11. Choi, S.; Zhou, Q.-Y.; Koltun, V. Robust Reconstruction of Indoor Scenes. In Proceedings of the 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5556–5565.
12. Newcombe, R.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A.
KinectFusion: Real-Time Dense Surface Mapping and Tracking. In Proceedings of the 2011 10th IEEE International Symposium
on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011; pp. 127–136.
13. Bylow, E.; Sturm, J.; Kerl, C.; Kahl, F.; Cremers, D. Real-Time Camera Tracking and 3D Reconstruction Using Signed Distance
Functions. Robot. Sci. Syst. 2013, 2, 2.
14. Schöps, T.; Sattler, T.; Pollefeys, M. BAD SLAM: Bundle Adjusted Direct RGB-D SLAM. In Proceedings of the 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 134–144.
15. Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras. IEEE Trans.
Robot. 2017, 33, 1255–1262. [CrossRef]
16. Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334.
[CrossRef]
17. Park, J.; Zhou, Q.-Y.; Koltun, V. Colored Point Cloud Registration Revisited. In Proceedings of the 2017 IEEE International
Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 143–152.
