Pierre Joubert
Abstract

The Microsoft Kinect sensor provides real-time colour and depth data via a colour and infrared camera. We present an approach to use this new technology for 3D reconstruction from uncontrolled motion. We find the intrinsic and extrinsic parameters of a Kinect by calibration, and then attempt to find a mapping from the raw depth data to real world metric coordinates by means of two methods. With these parameters we are able to create a 3D point cloud reconstruction, with colour data, of any static scene recorded by the Kinect. We also adapt a feature based approach for reconstructing an unknown object from different points of view. This enables us to create a dense metric reconstruction of a static scene. Unfortunately it is rather sensitive to the quality, and quantity, of the features found, as we demonstrate on different self-acquired sets. However, even if the recording yields poor or few features, it is shown that at least a partial reconstruction can be made.

1. Introduction

The Microsoft Kinect sensor (Kinect), as shown in Figure 1, is the world's first low cost, consumer grade 3D sensor capable of giving as output 11-bit depth data in real-time, at 30 frames per second, and at a resolution of 640×480 pixels [1]. This is achieved by projecting a fixed infrared (IR) pattern onto nearby objects, which a dedicated IR camera picks up. The distortion of the pattern, due to the uneven structure of the scene, reveals depth.

Some applications that may benefit immensely from the Kinect include:

• the estimation of body position, gestures and motion for human-computer interaction [2];

• map generation for robotic navigation [3];

The Kinect also provides 8-bit colour (RGB) data simultaneously with the depth data, at the same rate and resolution. Figure 2 gives an example. However, it is important to note that the depth and RGB data are given relative to the IR and RGB camera respectively. This makes simultaneous use of both depth and RGB data non-trivial. Figure 2 indicates this well, as it is clear that the IR image is more "zoomed-in" than the RGB image.

2. Problem statement and approach

The objective of this project is to firstly render a 3D cloud of points, with colours, from the depth and RGB data captured by the Kinect, in metric space. After achieving this we want to stitch consecutive point clouds (each from a slightly different angle) together to form a dense 3D reconstruction of a static scene. We will approach this by dividing the above tasks into two main parts, namely the Kinect calibration step and the application step.
Figure 2: Comparison of the IR, depth and RGB output images from the Kinect. The depth is calculated from the distortion
of the IR pattern due to uneven structures in the scene.
Kinect calibration

• The IR and RGB cameras need to be calibrated separately, in order to find intrinsic parameters such as focal lengths, centres of projection and radial distortion coefficients for each. This will facilitate the mapping of points (or pixels) between image and real world coordinates.

• Stereo calibration must be performed on the two cameras, to find a relative rotation and translation from one camera to the other. This will enable the triangulation of matching features found in both images, to obtain 3D world coordinates.

• A mapping from raw depth data to real world metric coordinates needs to be determined, which will aid in the application step. This will be approached by means of two methods, namely triangulation (as mentioned above) and an experiment.
Application

• After calibration, pixels in a given depth image can be mapped to real world metric coordinates. This will produce a metric point cloud.

• The point cloud can be projected to pixels in the RGB image, such that each pixel in the depth image is assigned a colour value. This will result in the desired colour point cloud.

• Corresponding features in successive RGB images, taken of a static scene from different viewpoints, can be matched to find the relative rotation and translation from one image to the other. This would assist in data stitching and result in a dense 3D reconstruction.

The following sections provide a more in-depth discussion of each of the steps listed.

Figure 3: The pinhole camera model where each ray passes through a single point. Image (a) shows a 3D representation, while (b) is a view on the XZ-plane.

In the pinhole camera model every ray passes through a single point, and for simplicity the image plane is placed in front of the camera centre. This means that any point (X, Y, Z)T in world coordinates is represented by a point (u, v)T on the image plane. The distance from the camera centre to the image plane is called the focal length f.

It is necessary to define homogeneous coordinates at this point, as it will enable us to write down this projection mathematically. For any point (u, v)T on the image plane the homogeneous coordinates are given by (ku, kv, k)T for any nonzero k. This is due to the fact that any point on the image plane is just a line through the camera centre intersecting the image plane, and thus any multiple of the coordinates will still lie on this line. We generally use k = 1.

In 2D this enables us to go from inhomogeneous to homogeneous coordinates:
\[
\begin{pmatrix} u \\ v \end{pmatrix} \mapsto \begin{pmatrix} u \\ v \\ 1 \end{pmatrix};
\]
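To make this concrete, here is a small Python sketch of the homogeneous conversion and the projection it enables; the focal length below is a placeholder, not a calibrated value.

```python
import numpy as np

def to_homogeneous(x):
    """(u, v) -> (u, v, 1), i.e. choose k = 1."""
    return np.append(x, 1.0)

def from_homogeneous(x):
    """(ku, kv, k) -> (u, v): divide through by the last entry."""
    return x[:-1] / x[-1]

# With the principal point at the origin, the pinhole model maps
# (X, Y, Z) to (fX/Z, fY/Z). In homogeneous coordinates this is linear:
f = 525.0                  # placeholder focal length in pixels
P = np.diag([f, f, 1.0])   # maps (X, Y, Z) to (fX, fY, Z)

X = np.array([0.5, 0.2, 2.0])   # a world point in camera coordinates
uv = from_homogeneous(P @ X)    # = (f*0.5/2.0, f*0.2/2.0)
```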
Figure 4: A rectangle representing an image produced by the camera model, with the principal point p close to the image centre.

The principal axis is the line through the camera centre that intersects the image plane perpendicularly. In Figure 3 it would be the Z-axis, and this point of intersection is called the principal point p. In (3) it is assumed that the principal point is at the origin of the image plane, but in practice it may not be. The principal point can have any coordinates, but usually lies close to the image centre.

In Figure 4 the rectangle represents an image produced by the camera model, with the origin at the bottom left. We need to incorporate the offset of the principal point from the origin into the projection, by adding the x and y coordinates of p to the projected point. The projection then changes to
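the standard pinhole form with the principal point offset included:
\[
(X, Y, Z)^T \mapsto \left( f\,\frac{X}{Z} + p_x,\; f\,\frac{Y}{Z} + p_y \right)^T,
\]
where $(p_x, p_y)^T$ are the coordinates of the principal point.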
Figure 5: Point X represented in 3D world coordinates. It is related to the camera coordinate system by a rotation and translation.

Figure 6: An example of the planar checkerboard pattern used for calibration, as seen by the RGB (left) and IR (right) cameras. These images are used in calibrating each camera separately, as well as in stereo calibration where they act as an image pair.
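A minimal OpenCV sketch of this per-camera checkerboard calibration (OpenCV's calibrateCamera implements Zhang's method [8]; the board size and filenames below are placeholders, and the report does not specify which calibration toolbox was used):

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)  # inner corners of the checkerboard (placeholder size)
# One grid of 3D object points per view; Z = 0 on the board plane.
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for fname in glob.glob("rgb_*.png"):   # placeholder filenames
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# Returns the intrinsic matrix K and the distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
```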
We want to map this point onto the image plane, as in (7), by converting X to a point in the camera's coordinate system by means of a rotation and translation.

Let us define inhomogeneous vectors X̃, X̃cam and c̃. X̃ is the coordinates of a point in the world coordinate frame, X̃cam represents the same point, but in the camera coordinate frame, and c̃ is the coordinates of the camera centre in the world coordinate frame. We can then write the conversion as
\[
\tilde{\mathbf{X}}_{\text{cam}} = R(\tilde{\mathbf{X}} - \tilde{\mathbf{c}}), \tag{9}
\]
which is represented in homogeneous coordinates as
\[
\mathbf{X}_{\text{cam}} =
\begin{bmatrix} R & -R\tilde{\mathbf{c}} \\ \mathbf{0}^T & 1 \end{bmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}. \tag{10}
\]
By combining (7) and (10) we get
\[
\mathbf{x} = KR\,[\,I \mid -\tilde{\mathbf{c}}\,]\,\mathbf{X}, \tag{11}
\]

(With homogeneous vectors, xi = PXi implies merely that xi and PXi point in the same direction.) Writing P in terms of its row vectors, we get
\[
P\mathbf{X}_i =
\begin{pmatrix} \mathbf{p}_1^T \mathbf{X}_i \\ \mathbf{p}_2^T \mathbf{X}_i \\ \mathbf{p}_3^T \mathbf{X}_i \end{pmatrix}, \tag{13}
\]
where pTj denotes the j th row. If we let xi = (ui, vi, wi)T, the cross product becomes
\[
\mathbf{x}_i \times P\mathbf{X}_i =
\begin{pmatrix}
v_i \mathbf{p}_3^T \mathbf{X}_i - w_i \mathbf{p}_2^T \mathbf{X}_i \\
w_i \mathbf{p}_1^T \mathbf{X}_i - u_i \mathbf{p}_3^T \mathbf{X}_i \\
u_i \mathbf{p}_2^T \mathbf{X}_i - v_i \mathbf{p}_1^T \mathbf{X}_i
\end{pmatrix}. \tag{14}
\]
Using the fact that pTj Xi = XTi pj, we can rewrite (14) as
\[
\begin{bmatrix}
\mathbf{0}^T & -w_i \mathbf{X}_i^T & v_i \mathbf{X}_i^T \\
w_i \mathbf{X}_i^T & \mathbf{0}^T & -u_i \mathbf{X}_i^T
\end{bmatrix}
\begin{pmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \mathbf{p}_3 \end{pmatrix} = \mathbf{0}, \tag{15}
\]
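Stacking the two rows of (15) for all n correspondences yields a 2n × 12 system Ap = 0, whose least-squares solution subject to ||p|| = 1 is the right singular vector of A with the smallest singular value. A minimal numpy sketch of this step (the function and variable names are our own, not from the text):

```python
import numpy as np

def dlt_projection_matrix(xs, Xs):
    """Estimate the 3x4 camera matrix P from n >= 6 correspondences.

    xs : (n, 3) array of homogeneous image points (u, v, w)
    Xs : (n, 4) array of homogeneous world points
    """
    rows = []
    for (u, v, w), X in zip(xs, Xs):
        zero = np.zeros(4)
        rows.append(np.concatenate([zero, -w * X, v * X]))  # first row of (15)
        rows.append(np.concatenate([w * X, zero, -u * X]))  # second row of (15)
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)  # p for the smallest singular value, up to scale
```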
Figure 8: The relative configuration of the IR and RGB cameras, related by the rotation R and translation c̃.
Figure 9: Images of raw depth data showing the lack of
metric perspective, visible when viewed from the side (right
image).
Figure 12: Image (a) shows the depth data plotted against the known distance in metres. Image (b) shows the inverse of the known distance plotted against the depth data, with the dashed line indicating an extrapolation of the plot.
Figure 13: Images showing the reconstruction from the depth data (top left). The top right image shows the reconstruction from raw depth data, whilst the bottom image shows the reconstruction after mapping the depth to metric coordinates.

With (26), (27) and (28) we can map any pixel (u, v) in the depth image to a point (X, Y, Z) in real world coordinates. Having finished calibrating the Kinect, we have all the preliminary results required to start applying it to datasets.
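As a sketch of this mapping (equations (26)–(28) are not repeated here; the code assumes the linear inverse-depth model suggested by Figure 12(b), and all coefficients and intrinsics are illustrative placeholders):

```python
import numpy as np

# Placeholder values; the real ones come from the calibration and the
# depth experiment described above.
fx, fy = 594.0, 591.0   # IR camera focal lengths (pixels)
cx, cy = 320.0, 240.0   # IR camera principal point
a, b = -0.0029, 3.33    # linear fit of 1/Z against raw depth (Figure 12(b))

def depth_pixel_to_metric(u, v, d):
    """Map a depth-image pixel (u, v) with raw 11-bit value d to a
    metric point (X, Y, Z) in the IR camera's coordinate system."""
    Z = 1.0 / (a * d + b)    # invert the fitted inverse-depth line
    X = (u - cx) * Z / fx    # back-project through the pinhole model
    Y = (v - cy) * Z / fy
    return np.array([X, Y, Z])
```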
Figure 14: Image (a) clearly showing radial distortion at the edges. This can be undistorted, as seen in (b), where the bent lines are restored to straight lines.

where (x̃, ỹ) is the ideal image position (which obeys linear projection), (xd, yd) is the actual image position after distortion, r̃ is the radial distance $\sqrt{\tilde{x}^2 + \tilde{y}^2}$ from the centre of distortion, and L(r̃) is a distortion factor, which is only a function of the radius r̃.

With (x, y) the measured coordinates, (x̂, ŷ) the corrected coordinates, (xc, yc) the centre of radial distortion and $r = \sqrt{(x - x_c)^2 + (y - y_c)^2}$, we write the correction to the distortion as
\[
\hat{x} = x_c + L(r)(x - x_c), \tag{30}
\]
\[
\hat{y} = y_c + L(r)(y - y_c). \tag{31}
\]
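The excerpt does not specify L(r); assuming the common polynomial model L(r) = 1 + k1 r² + k2 r⁴, with coefficients from the calibration, the correction is a short sketch:

```python
import numpy as np

def undistort_point(x, y, xc, yc, k1, k2):
    """Apply (30) and (31), assuming the polynomial distortion factor
    L(r) = 1 + k1*r**2 + k2*r**4 (an assumption; the report's exact
    form of L is not given in this excerpt)."""
    r = np.hypot(x - xc, y - yc)
    L = 1.0 + k1 * r**2 + k2 * r**4
    return xc + L * (x - xc), yc + L * (y - yc)   # (30), (31)
```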
Figure 15: Comparison between the nearest neighbour ap-
proach (left) and the bi-linear interpolation (right). The
bi-linear method results in a smoother transition between
colours, as can be seen around the sticker on the fridge.
This last phase of the project is influenced by the work of [10]. In it, Joubert uses a feature based approach to reconstruct an unknown object from a sequence of RGB images. The intrinsic parameters of the camera are known, but the motion (rotation and translation) from one image to the other is determined by interpreting the feature matches across the images. The feature detection and matching is done by the Scale Invariant Feature Transform (SIFT) [11], and a set of inliers is calculated from the SIFT matches by a robust estimator, the RANdom SAmple Consensus (RANSAC) [12]. A shortcoming in the work of [10] was that she could only recover the reconstruction up to an unknown scale.

We use SIFT features found in consecutive RGB images (see Figure 17), as in [10], to find the rotation and translation from one image to the other. It is important to note that the translation of the movement is retrievable only up to an unknown scale α, as indicated in Figure 18. We attempt to determine this unknown scale in Section 4.2.3.
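A sketch of this feature-matching step using modern OpenCV calls (not the implementation of [10]; the filenames and the intrinsic matrix K below are placeholders):

```python
import cv2
import numpy as np

# Placeholder RGB intrinsics from calibration.
K = np.array([[520.0, 0.0, 320.0],
              [0.0, 520.0, 240.0],
              [0.0, 0.0, 1.0]])

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Detect and match SIFT features between the consecutive images.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)

# Lowe's ratio test discards ambiguous matches.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# RANSAC rejects outliers while estimating the essential matrix;
# recoverPose yields R_m and a unit-length translation (scale unknown).
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                               prob=0.999, threshold=1.0)
_, R_m, t_unit, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
```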
4.2.2. Constructing 3D data

We apply the methods used in Section 4.1 to find the metric depth and colour mapping for each subrecording.
It is important to note that all the data given and determined so far was calculated with the IR camera at the origin of the stereo coordinate system (refer to Figure 8). In this section, however, we are primarily working with the RGB images and it would therefore be convenient to work with the RGB camera at the origin.

Figure 17: SIFT feature detection applied on consecutive RGB images ((a) and (b)). The SIFT matches are shown in (c) and the RANSAC inliers in (d), where every line segment indicates a match from image (a) to (b). We can infer the camera movement from these matches.

Figure 18: A diagram indicating the spatial relationship between consecutive RGB images. Camera 2 lies somewhere on the dashed line, as the exact scale α is yet to be calculated.

4.2.3. Transforming coordinate system and calculating the scale factor α

Transforming the 3D data, such that the RGB camera is at the origin of the coordinate system, can be achieved by
\[
\tilde{\mathbf{X}}_{\text{rgb}} = R^T\tilde{\mathbf{X}} + \tilde{\mathbf{c}}, \tag{34}
\]
where X̃ is the given depth coordinate expressed in IR camera coordinates, and R and c̃ are the extrinsic parameters found in Section 3.3.

With the RGB camera at the origin we attempt to calculate the unknown scale factor α of the translation, mentioned in Section 4.2.1, from one image to the next. This is achieved by the following procedure.

We calculate the 3D coordinates for all the colour pixels mapped to the depth points, as in Section 4.1, from two consecutive RGB images. Suppose x1 and x2 were found to be a feature correspondence in the two RGB images. Furthermore, let X̃1 be the inhomogeneous 3D coordinates of x1 in the first RGB camera's coordinate system, and X̃2 the 3D coordinates of x2 in the second RGB camera's coordinate system. We therefore have
\[
\tilde{\mathbf{X}}_1 = R_m^T\tilde{\mathbf{X}}_2 + \alpha\,\tilde{\mathbf{c}}_m, \tag{35}
\]
where Rm and c̃m describe the motion from the one image to the other.

Note that in (35) we have X̃1 and X̃2, determined by (34), as well as Rm and c̃m (found by the procedure from [10]). The only unknown is α, and rewriting (35) as
\[
\alpha\,\tilde{\mathbf{c}}_m = \tilde{\mathbf{X}}_1 - R_m^T\tilde{\mathbf{X}}_2 \tag{36}
\]
gives 3 equations for every feature match, from which α can be solved.

Ideally α should be the same for all these equations, but due to loss in accuracy for coordinates far from the camera, this is usually not the case. To approximate α and remove outliers, we take the median of all the α values. Finally we have an estimation for the translation from one image to the other, namely c̃′ = αc̃.
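A minimal sketch of this median-based estimate (array shapes and names are our own):

```python
import numpy as np

def estimate_scale(X1, X2, R_m, c_m):
    """Median-based estimate of alpha from matched 3D points.

    X1, X2 : (N, 3) corresponding points in the first and second
             RGB cameras' coordinate systems.
    R_m, c_m : relative rotation (3, 3) and unit-scale translation (3,).
    """
    # (36): alpha * c_m = X1 - R_m^T X2, i.e. 3 equations per match.
    rhs = X1 - X2 @ R_m              # row-vector form of (R_m^T X2^T)^T
    valid = np.abs(c_m) > 1e-9       # avoid dividing by tiny components
    alphas = (rhs[:, valid] / c_m[valid]).ravel()
    # The median suppresses the outlier estimates.
    return np.median(alphas)
```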
4.2.4. Applying total rotation and translation

It is necessary to calculate the total rotation Rtot and translation c̃tot for the current image, as the method up until now only describes the relative movement between image pairs. It can be shown that [10]
\[
R_{\text{tot}} \leftarrow R_m R_{\text{tot}}, \tag{37}
\]
\[
\tilde{\mathbf{c}}_{\text{tot}} \leftarrow R_{\text{tot}}^T\,\tilde{\mathbf{c}}'_m + \tilde{\mathbf{c}}_{\text{tot}}, \tag{38}
\]
where Rm and c̃′m are the rotation and correctly scaled translation of the current camera relative to the previous one. To align the current colour point cloud with the previous reconstructions, we apply Rtot and c̃tot to it.
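As a sketch, the accumulation might look as follows, applying (37) and then (38) in the order listed (the names are our own):

```python
import numpy as np

def accumulate(R_tot, c_tot, R_m, c_m_scaled):
    """Fold the relative motion (R_m, scaled c_m) of the latest image
    pair into the running totals, as in (37) and (38)."""
    R_tot = R_m @ R_tot                    # (37)
    c_tot = R_tot.T @ c_m_scaled + c_tot   # (38)
    return R_tot, c_tot

# Start at the first camera's frame and chain through the sequence.
R_tot, c_tot = np.eye(3), np.zeros(3)
```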
5. Results

The methods described in Sections 4.1 and 4.2 enable us to create a dense metric colour point cloud reconstruction of a static scene. Some example reconstructions are shown and discussed in this section.

It is difficult to quantitatively assess the accuracy of our method, since independently generated "ground truth" data is unavailable and beyond the scope of this project.
The method hinges on the quality of the features found between successive RGB images.

Some of the methods used in this project are rather basic and probably not as accurate as they could be, but they were sufficient for the purpose of this project.

Pre- and post-processing may be useful. As mentioned in Section 3.2, the use of a filter such as the adaptive mean filter could remove noise from the IR images, resulting in a more accurate calibration. Another possibility would be to perform smoothing on the final reconstruction, or to use our stitched reconstruction as initialization for a point cloud alignment procedure such as ICP [13]. The depth data is quite noisy around the edges, resulting in unwanted gaps in the reconstruction, and this is definitely an aspect to consider and improve upon.
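To illustrate the ICP suggestion, a hypothetical refinement step with Open3D (our choice of library, not the report's; the filenames and threshold are placeholders):

```python
import numpy as np
import open3d as o3d

# Refine the alignment of one stitched cloud against another.
source = o3d.io.read_point_cloud("cloud_current.ply")    # placeholder file
target = o3d.io.read_point_cloud("cloud_previous.ply")   # placeholder file

# Our stitched pose (R_tot, c_tot) would provide the initial 4x4 transform.
init = np.eye(4)

result = o3d.pipelines.registration.registration_icp(
    source, target,
    max_correspondence_distance=0.02,   # metres, tuned to the depth noise
    init=init,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
print(result.transformation)            # refined rigid transform
```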
[8] Z. Zhang, A flexible new technique for camera calibration, Technical Report MSR-TR-98-71, Microsoft Research, 1998.
[9] R.C. Gonzalez, R.E. Woods, Digital Image Process-
ing, Third Edition, Pearson Prentice Hall, 2008.
[10] D. Joubert, Motion estimation for multi-view ob-
ject reconstruction, Applied Mathematics Honours
Project, Stellenbosch University, 2010.
[11] D.G. Lowe, Object recognition from local scale-
invariant features, Proceedings of the International
Conference on Computer Vision, 2:1150-1157, 1999.
Figure 19: Partial reconstructions made with a subset of a few captured datasets. The reconstructions in (a) and (b) are shown from different angles. Some of the RGB images of each dataset are shown on the left, and a representation of the camera for each RGB image used is included in the reconstructions.