
Dense object reconstruction from uncontrolled motion

using the Microsoft Kinect sensor

Pierre Joubert

Applied Mathematics Honours Project 2011


Department of Mathematical Sciences, University of Stellenbosch
14926938@sun.ac.za

Project advisor: Dr Willie Brink

Abstract

The Microsoft Kinect sensor provides real-time colour and depth data via a colour and an infrared camera. We present an approach that uses this new technology for 3D reconstruction from uncontrolled motion. We find the intrinsic and extrinsic parameters of a Kinect by calibration, and then attempt to find a mapping from the raw depth data to real world metric coordinates by means of two methods. With these parameters we are able to create a 3D point cloud reconstruction, with colour data, of any static scene recorded by the Kinect. We also adapt a feature based approach for reconstructing an unknown object from different points of view. This enables us to create a dense metric reconstruction of a static scene. Unfortunately the approach is rather sensitive to the quality, and quantity, of the features found, as we demonstrate on different self-acquired datasets. However, even if a recording yields poor or few features, we show that at least a partial reconstruction can be made.

1. Introduction

The Microsoft Kinect sensor (Kinect), as shown in Figure 1, is the world's first low cost, consumer grade 3D sensor capable of giving as output 11-bit depth data in real time, at 30 frames per second, and at a resolution of 640×480 pixels [1]. This is achieved by projecting a fixed infrared (IR) pattern onto nearby objects, which a dedicated IR camera picks up. The distortion of the pattern, due to the uneven structure of the scene, reveals depth.

Some applications that may benefit immensely from the Kinect include:

• the estimation of body position, gestures and motion for human-computer interaction [2];

• map generation for robotic navigation [3];

• obstacle detection and avoidance [4];

• general 3D object reconstruction.

Figure 1: The Microsoft Kinect sensor with IR projector, RGB and IR cameras clearly visible from left to right.

The Kinect also provides 8-bit colour (RGB) data simultaneously with the depth data, at the same rate and resolution. Figure 2 gives an example. However, it is important to note that the depth and RGB data are given relative to the IR and RGB camera respectively. This makes simultaneous use of both depth and RGB data non-trivial. Figure 2 indicates this well, as it is clear that the IR image is more "zoomed-in" than the RGB image.

2. Problem statement and approach

The objective of this project is firstly to render a 3D cloud of points, with colours, from the depth and RGB data captured by the Kinect, in metric space. After achieving this we want to stitch consecutive point clouds (each from a slightly different angle) together to form a dense 3D reconstruction of a static scene. We approach this by dividing the above tasks into two main parts, namely the Kinect calibration step and the application step.

Kinect calibration

• The IR and RGB cameras need to be calibrated separately, in order to find intrinsic parameters such as focal lengths, centres of projection and radial distortion coefficients for each. This will facilitate the mapping of points (or pixels) between image and real world coordinates.
Figure 2: Comparison of the IR, depth and RGB output images from the Kinect. The depth is calculated from the distortion
of the IR pattern due to uneven structures in the scene.

• Stereo calibration must be performed on the two cameras, to find a relative rotation and translation from one camera to the other. This will enable the triangulation of matching features found in both images, to obtain 3D world coordinates.

• A mapping from raw depth data to real world metric coordinates needs to be determined, which will aid in the application step. This will be approached by means of two methods, namely triangulation (as mentioned above) and an experiment.

Application

• After calibration, pixels in a given depth image can be mapped to real world metric coordinates. This will produce a metric point cloud.

• The point cloud can be projected to pixels in the RGB image, such that each pixel in the depth image is assigned a colour value. This will result in the desired colour point cloud.

• Corresponding features in successive RGB images, taken of a static scene from different viewpoints, can be matched to find the relative rotation and translation from one image to the other. This will assist in data stitching and result in a dense 3D reconstruction.

The following sections provide a more in-depth discussion of each of the steps listed.

3. Kinect calibration

3.1. Single camera geometry

We will be using the pinhole camera model [5, 6], illustrated in Figure 3, which assumes that all rays pass through a single point, called the camera centre c. A scene is projected onto the image plane, which is modelled in front of the camera centre for simplicity. This means that any point (X, Y, Z)^T in world coordinates is represented by a point (u, v)^T on the image plane. The distance from the camera centre to the image plane is called the focal length f.

Figure 3: The pinhole camera model where each ray passes through a single point. Image (a) shows a 3D representation, while (b) is a view on the XZ-plane.

It is necessary to define homogeneous coordinates at this point, as they will enable us to write down this projection mathematically. For any point (u, v)^T on the image plane the homogeneous coordinates are given by (ku, kv, k)^T for any nonzero k. This is due to the fact that any point on the image plane is just a line through the camera centre intersecting the image plane, and thus any multiple of the coordinates will still lie on this line. We generally use k = 1.

In 2D this enables us to go from inhomogeneous to homogeneous coordinates:
\[
\begin{pmatrix} u \\ v \end{pmatrix} \mapsto \begin{pmatrix} u \\ v \\ 1 \end{pmatrix};
\]
and from homogeneous to inhomogeneous coordinates:
\[
\begin{pmatrix} u \\ v \\ w \end{pmatrix} \mapsto \frac{1}{w}\begin{pmatrix} u \\ v \end{pmatrix}.
\]
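To make these conversions concrete, here is a minimal Python/NumPy sketch (illustrative only, not part of the project's Matlab code) of the two mappings above:

```python
import numpy as np

def to_homogeneous(x):
    """Append a 1 to an inhomogeneous point, e.g. (u, v) -> (u, v, 1)."""
    return np.append(np.asarray(x, dtype=float), 1.0)

def to_inhomogeneous(x):
    """Divide by the last coordinate and drop it, e.g. (u, v, w) -> (u/w, v/w)."""
    x = np.asarray(x, dtype=float)
    return x[:-1] / x[-1]

u = to_homogeneous([3.0, 4.0])         # array([3., 4., 1.])
v = to_inhomogeneous([6.0, 8.0, 2.0])  # array([3., 4.])
```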
The same applies to 3D points:
\[
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \mapsto \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix};
\qquad
\begin{pmatrix} X \\ Y \\ Z \\ W \end{pmatrix} \mapsto \frac{1}{W}\begin{pmatrix} X \\ Y \\ Z \end{pmatrix}.
\]

Figure 4: A rectangle representing an image produced by the camera model, with the principal point p close to the image centre.

We now return to the camera geometry. From Figure 3(b) it is clear, using similar triangles, that
\[
\frac{u}{f} = \frac{X}{Z} \quad\therefore\quad u = f\frac{X}{Z},
\]
and v = fY/Z. This means that any point (X, Y, Z)^T projects to (fX/Z, fY/Z, f)^T on the image plane. It is clear that all the projected points on the image plane would have the same third coordinate, namely f, and it can therefore be ignored. This lets us simplify the projection to
\[
(X, Y, Z)^T \mapsto (fX/Z,\ fY/Z)^T, \tag{1}
\]
which is a mapping from Euclidean 3-space (R^3) to Euclidean 2-space (R^2).

Since we are now working in homogeneous coordinates, the projection is expressed simply as a linear mapping between homogeneous coordinates:
\[
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \mapsto \begin{pmatrix} fX \\ fY \\ Z \end{pmatrix}
= \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}. \tag{2}
\]
This can be written compactly as
\[
x = PX, \tag{3}
\]
where X is the point in world coordinates, x is the mapping of X on the image plane, as indicated in Figure 3, and the 3×4 matrix P is called the camera projection matrix [5].

The principal axis is the line through the camera centre that intersects the image plane perpendicularly. In Figure 3 it would be the Z-axis, and this point of intersection is called the principal point p. In (3) it is assumed that the principal point is at the origin of the image plane, but in practice it may not be. The principal point can have any coordinates, but usually lies close to the image centre.

In Figure 4 the rectangle represents an image produced by the camera model, with the origin at the bottom left. We need to incorporate the offset of the principal point to the origin in the projection by adding the x and y coordinates of p to the projected point. The projection then changes to
\[
(X, Y, Z)^T \mapsto (fX/Z + p_x,\ fY/Z + p_y)^T, \tag{4}
\]
and the matrix representation becomes
\[
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \mapsto \begin{pmatrix} fX + Zp_x \\ fY + Zp_y \\ Z \end{pmatrix}
= \begin{bmatrix} f & 0 & p_x & 0 \\ 0 & f & p_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}. \tag{5}
\]
Now, writing
\[
K = \begin{bmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix}, \tag{6}
\]
we can express x = PX as
\[
x = K\,[\,I \mid 0\,]\,X, \tag{7}
\]
in the camera coordinate system. The matrix K is called the intrinsic calibration matrix, and contains information describing the inner workings of the camera.

At this stage we have assumed that the projected coordinates have the same scale factors in both the x and y directions, implying that the camera model has square pixels, which is often not the case. Introducing pixel scale factors m_x and m_y in the x and y directions respectively, we get m_x f = α_x, m_y f = α_y, m_x p_x = x_0 and m_y p_y = y_0, where α_x and α_y are the focal lengths in pixels, and x_0 and y_0 give the image centre in pixels. Finally, for generality, we add a skew parameter s. It should be close to zero for most normal cameras. The calibration matrix can therefore be written as
\[
K = \begin{bmatrix} \alpha_x & s & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{8}
\]
We have now described the intrinsic camera parameters in K, and we discuss the extrinsic parameters, namely rotation and translation, next.
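As an illustration of (7) and (8), the following sketch builds a calibration matrix K and projects a point given in camera coordinates onto the image plane; the numerical values of α_x, α_y, x_0, y_0 and s are placeholders, not the calibrated values reported later.

```python
import numpy as np

# Intrinsic calibration matrix K as in (8); placeholder values, with zero skew.
alpha_x, alpha_y, x0, y0, s = 600.0, 600.0, 320.0, 240.0, 0.0
K = np.array([[alpha_x, s,       x0],
              [0.0,     alpha_y, y0],
              [0.0,     0.0,     1.0]])

# Projection x = K [I | 0] X for a point given in camera coordinates, as in (7).
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # 3x4 camera matrix
X = np.array([0.1, -0.2, 2.0, 1.0])                # homogeneous 3D point (X, Y, Z, 1)
x = P @ X                                          # homogeneous image point
u, v = x[:2] / x[2]                                # pixel coordinates, cf. (1) and (4)
```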
In Figure 5, X = (X, Y, Z)^T now represents a point in world coordinates, rather than camera coordinates. We want to map this point onto the image plane, as in (7), by converting X to a point in the camera's coordinate system by means of a rotation and translation.

Figure 5: Point X represented in 3D world coordinates. It is related to the camera coordinate system by a rotation and translation.

Let us define inhomogeneous vectors X̃, X̃_cam and c̃. X̃ gives the coordinates of a point in the world coordinate frame, X̃_cam represents the same point, but in the camera coordinate frame, and c̃ gives the coordinates of the camera centre in the world coordinate frame. We can then write the conversion as
\[
\tilde{X}_{cam} = R(\tilde{X} - \tilde{c}), \tag{9}
\]
which is represented in homogeneous coordinates as
\[
X_{cam} = \begin{bmatrix} R & -R\tilde{c} \\ 0^T & 1 \end{bmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}. \tag{10}
\]
By combining (7) and (10) we get
\[
x = KR\,[\,I \mid -\tilde{c}\,]\,X, \tag{11}
\]
where X is now in the world coordinate frame. This is the general mapping given by a pinhole camera, and it is clear from x = PX that
\[
P = KR\,[\,I \mid -\tilde{c}\,]. \tag{12}
\]
It is important to note that the camera matrix just derived has 11 degrees of freedom: 5 for K (elements α_x, α_y, x_0, y_0 and s), 3 for R (rotation about three axes) and 3 for c̃. Next we discuss how P can be determined for a real camera.

3.2. Camera calibration

To obtain the camera matrix P we need a set of point correspondences x_i ↔ X_i such that x_i = PX_i for all i. We can use these correspondences to build a set of linear equations and calculate the elements of P. Since P has 11 degrees of freedom we need at least 11 independent linear equations to calculate them uniquely.

Figure 6: An example of the planar checkerboard pattern used for calibration, as seen by the RGB (left) and IR (right) cameras. These images are used in calibrating each camera separately, as well as in stereo calibration where they act as an image pair.

To obtain the set of linear equations we write x_i = PX_i as the cross product x_i × PX_i = 0 (since we are working with homogeneous vectors, x_i = PX_i implies merely that x_i and PX_i point in the same direction). Writing P in terms of its row vectors, we get
\[
PX_i = \begin{pmatrix} p_1^T X_i \\ p_2^T X_i \\ p_3^T X_i \end{pmatrix}, \tag{13}
\]
where p_j^T denotes the j-th row. If we let x_i = (u_i, v_i, w_i)^T, the cross product becomes
\[
x_i \times PX_i = \begin{pmatrix} v_i\, p_3^T X_i - w_i\, p_2^T X_i \\ w_i\, p_1^T X_i - u_i\, p_3^T X_i \\ u_i\, p_2^T X_i - v_i\, p_1^T X_i \end{pmatrix}. \tag{14}
\]
Using the fact that p_j^T X_i = X_i^T p_j, we can rewrite (14) as
\[
\begin{bmatrix} 0^T & -w_i X_i^T & v_i X_i^T \\ w_i X_i^T & 0^T & -u_i X_i^T \\ -v_i X_i^T & u_i X_i^T & 0^T \end{bmatrix}
\begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix} = 0, \tag{15}
\]
which is a set of linear equations of the form A_i p = 0. In (15) only the first two rows are linearly independent, as the third row can be written as the sum of −u_i/w_i times the first row and −v_i/w_i times the second row. This means that (15) can be simplified to
\[
\begin{bmatrix} 0^T & -w_i X_i^T & v_i X_i^T \\ w_i X_i^T & 0^T & -u_i X_i^T \end{bmatrix}
\begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix} = 0. \tag{16}
\]
From (16) it is clear that each point correspondence produces two unique equations, and since P has 11 degrees of freedom, we need at least ⌈11/2⌉ = 6 correspondences. We approximate the solution of p in Ap = 0 by creating an overdetermined system and solving for p in a least-squares sense using singular value decomposition (SVD). This is done because some point correspondences might contain measurement errors. Remember that p is the row representation of P, so in solving for p we also calculate the camera matrix P.
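The following sketch outlines this estimation procedure: for each correspondence the two rows of (16) are stacked into a matrix A, and Ap = 0 is solved in a least-squares sense with the SVD. The function name and array layout are illustrative assumptions.

```python
import numpy as np

def estimate_camera_matrix(x_img, X_world):
    """Estimate the 3x4 camera matrix P from n >= 6 correspondences x_i <-> X_i
    by stacking the two equations of (16) per point and solving Ap = 0 via SVD.
    x_img:   n x 2 array of image points (u_i, v_i)
    X_world: n x 3 array of world points (X_i, Y_i, Z_i)"""
    n = x_img.shape[0]
    A = np.zeros((2 * n, 12))
    for i in range(n):
        X = np.append(X_world[i], 1.0)      # homogeneous world point
        u, v, w = x_img[i, 0], x_img[i, 1], 1.0
        A[2 * i,     4:8]  = -w * X         # row [ 0^T   -w X^T   v X^T ]
        A[2 * i,     8:12] =  v * X
        A[2 * i + 1, 0:4]  =  w * X         # row [ w X^T   0^T   -u X^T ]
        A[2 * i + 1, 8:12] = -u * X
    _, _, Vt = np.linalg.svd(A)
    p = Vt[-1]                              # singular vector of the smallest singular value
    return p.reshape(3, 4)                  # rows p1, p2, p3 of P
```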
The acquisition of point correspondences can be done by using a planar checkerboard, with known dimensions, as calibration object. Figure 6 gives an example. The advantage of using a checkerboard is that the corners can be detected easily in an image.

We used Bouguet's camera calibration toolbox for Matlab [7], which is based on the work of [8]. It provides functions to read in the images and perform corner extraction and refinement up to sub-pixel accuracy, as illustrated in Figure 7. It then solves an optimization problem in order to determine the various intrinsic parameters.

Figure 7: An example of the reprojection used in Bouguet's camera calibration toolbox [7] to optimize and re-estimate the corners. Here we see the image points found by corner extraction, indicated by a plus, and the reprojections indicated by a circle. The error in reprojection is indicated with an arrow, where the length of the arrow is indicative of the size of the error made.

It should be noted that the IR images are quite heavily corrupted by the infrared pattern used to infer depth (refer to Figure 6), which may cause problems for the corner extractor. A filter such as the adaptive median filter [9] can be useful in alleviating this problem.

We calibrated the IR and RGB cameras of our Kinect to determine their intrinsic calibration matrices K_ir and K_rgb respectively, as well as their distortion coefficients:
\[
K_{ir} = \begin{bmatrix} 593.73 & 0 & 315.19 \\ 0 & 591.75 & 219.72 \\ 0 & 0 & 1 \end{bmatrix};
\quad
k_{ir} = \begin{pmatrix} -0.058 \\ 0.145 \\ -0.010 \\ -0.002 \\ 0.000 \end{pmatrix}; \tag{17}
\]
\[
K_{rgb} = \begin{bmatrix} 522.97 & 0 & 335.67 \\ 0 & 521.16 & 243.42 \\ 0 & 0 & 1 \end{bmatrix};
\quad
k_{rgb} = \begin{pmatrix} 0.222 \\ -0.518 \\ -0.005 \\ 0.005 \\ 0.000 \end{pmatrix}. \tag{18}
\]
The distortion coefficients in k_ir and k_rgb describe the amount of distortion present in each image. They will aid in the undistortion of images, as described in Section 4.1. It is important to state that these parameters, as well as the extrinsic ones given below, will differ slightly from one Kinect to another, due to small mechanical inaccuracies in the manufacturing process.

3.3. Stereo camera calibration

Stereo calibration is necessary when a system consists of two cameras, as it is necessary to know the exact rotation and translation from the one camera to the other. It follows the same methodology as single camera calibration, except that we now take two pictures of the calibration object simultaneously (an image pair is shown in Figure 6) from the two cameras. This way we know that the object will have exactly the same world coordinates in both images, and this allows us to calculate the spatial relationship from one camera to the other.

Figure 8: Representation of the stereo coordinate system with the IR camera at the origin and the RGB camera related to it by a rotation R and translation c̃.

For stereo calibration we used the results from the single camera calibration as input to Bouguet's toolbox. It optimizes the intrinsic parameters and gives as output the external parameters, i.e. the rotation R and translation c̃ between the two cameras. For our Kinect we obtained
\[
R = \begin{bmatrix} 0.99 & 0.01 & -0.05 \\ -0.01 & 0.99 & 0.00 \\ 0.05 & 0.00 & 0.99 \end{bmatrix};
\quad
\tilde{c} = \begin{pmatrix} -23.01 \\ 3.14 \\ 1.74 \end{pmatrix}, \tag{19}
\]
rounded to two decimals for display purposes. The full accuracy is still used in further calculations. Here the translation is given in millimetres. It is interesting to note that the rotation matrix is close to the 3×3 identity matrix.
Interpretation of these extrinsic parameters is satisfactory, as it shows that the two cameras are about 23 mm horizontally apart (first coordinate of c̃) and almost perfectly aligned (R ≈ I), as is apparent from the Kinect itself. A visual representation can be seen in Figure 8, which indicates the IR camera at the origin of the stereo coordinate system.

These results allow us to write down the two camera matrices as
\[
P_{ir} = K_{ir}\,[\,I \mid 0\,], \tag{20}
\]
\[
P_{rgb} = K_{rgb}R\,[\,I \mid -\tilde{c}\,]. \tag{21}
\]
Having all the intrinsic and extrinsic parameters for the Kinect enables us to map colours (from the RGB image) to corresponding depth coordinates (from the IR camera) to create the desired colour point cloud. The only thing remaining, before that is possible, is to infer a mapping from the raw depth data to real world coordinates, as is discussed next.

3.4. Raw depth mapping

The Kinect gives as output 11-bit raw depth data (a 640×480 matrix of values between 0 and 2047). It is important to note that the depth data gives an indication of which surface is closer to, or further from, the camera relative to its surrounding surfaces, as is clear from Figure 9. That is why it is necessary to map the raw depth data to metric coordinates. In this section we approach this problem by means of two methods: triangulation and an experiment.

Figure 9: Images of raw depth data showing the lack of metric perspective, visible when viewed from the side (right image).

3.4.1. Triangulation

A feature X visible as x_ir in the IR camera and x_rgb in the RGB camera can be triangulated by writing x = PX as a cross product for each camera:
\[
x_{ir} \times P_{ir}X = 0, \tag{22}
\]
\[
x_{rgb} \times P_{rgb}X = 0. \tag{23}
\]
Writing P_ir in terms of its row vectors, with x_ir = (u_1, v_1, 1)^T, enables us to express (22) as
\[
\begin{bmatrix} v_1 p_3^T - p_2^T \\ p_1^T - u_1 p_3^T \\ u_1 p_2^T - v_1 p_1^T \end{bmatrix} X = 0, \tag{24}
\]
where p_j^T denotes the j-th row of P_ir. In (24) only the first two rows are linearly independent, as the last row can be written as the sum of −u_1 times the first row and −v_1 times the second row, and can therefore be ignored.

In the same manner it can be shown that, for q_j^T the j-th row of P_rgb and x_rgb = (u_2, v_2, 1)^T, (23) can be written as
\[
\begin{bmatrix} v_2 q_3^T - q_2^T \\ q_1^T - u_2 q_3^T \end{bmatrix} X = 0. \tag{25}
\]
From (24) and (25) it is clear that we have a problem of the form AX = 0, with A known. We can then solve for X in a least squares sense using SVD. Having solved for X = (X, Y, Z)^T, we now have world coordinates for a specific feature visible in both the IR and RGB cameras, and we can attempt to infer a mapping between the raw depth data and metric depth.

The known corners of the checkerboard pattern can be used as image features for triangulation, since they were already calculated during the calibration process.

Figure 10: (Top left) The checkerboard pattern's corners used for triangulation. The first and last corners are marked by a yellow and a green dot. The views from directly behind (top right), and from the top (bottom), of the Kinect show that the triangulation does not match the corner sequence, nor does it deliver a planar surface.
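A minimal sketch of this linear triangulation follows, stacking the two independent rows from each of (24) and (25) and solving AX = 0 with the SVD; the camera matrices and matched image coordinates are assumed to be available from the calibration above.

```python
import numpy as np

def triangulate(x_ir, x_rgb, P_ir, P_rgb):
    """Triangulate one feature seen at x_ir = (u1, v1) in the IR image and
    x_rgb = (u2, v2) in the RGB image, using the rows of (24) and (25)."""
    u1, v1 = x_ir
    u2, v2 = x_rgb
    A = np.vstack([v1 * P_ir[2]  - P_ir[1],      # v1 p3^T - p2^T
                   P_ir[0]  - u1 * P_ir[2],      # p1^T - u1 p3^T
                   v2 * P_rgb[2] - P_rgb[1],     # v2 q3^T - q2^T
                   P_rgb[0] - u2 * P_rgb[2]])    # q1^T - u2 q3^T
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                                   # least-squares solution of AX = 0
    return X[:3] / X[3]                          # inhomogeneous world point (X, Y, Z)
```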
The results were unsatisfactory, as seen in Figure 10, where it is clear that the triangulation is very inaccurate. This is due to the small distance between the cameras, as indicated in Figure 11. If the cameras were further apart, a small estimation error on the image plane would result in a relatively small triangulation error. The same small error results in a significantly larger triangulation error when the cameras are closer together, due to the feature rays becoming nearly parallel.

Figure 11: Comparison between triangulation in two different stereo systems. It is clear that the same error on the image plane (indicated by dashed lines) causes a much more significant triangulation error if the cameras are closer together.

With triangulation too unreliable, we decided to rather infer the mapping more directly by means of an experiment, which is discussed next.

3.4.2. Experiment

It is known from [1] that the effective range of the Kinect is between about 0.8 and 3.5 metres. We set up an experiment by moving a planar surface orthogonally to the Kinect's field of view, at known distances. We record the depth data at 0.25 metre increments from half a metre to 3.25 metres. The depth measurements within the planar surface are all found to be very similar, and we average those values to come up with a final measurement for each distance. This enables us to plot the depth data against the known distance, shown in Figure 12(a).

Figure 12: Image (a) shows the depth data plotted against the known distance in metres. Image (b) shows the inverse of the known distance plotted against the depth data, with the dashed line indicating an extrapolation of the plot.

It is clear from Figure 12(a) that our measurement at 0.5 metres is inconsistent with the other measurements. It also falls outside the effective range, as mentioned above, and will thus be omitted from the set in trying to infer a general mapping.

From Figure 12(b) it is clear that the raw depth data is inversely proportional to the actual metric distance. This allows us to infer the mapping for the depth values, using a least squares fitting, as
\[
Z = \bigl(-0.0028\,(\text{raw depth}) + 3.1023\bigr)^{-1}, \tag{26}
\]
where the values are rounded for display purposes. The full accuracy is still used in the calculation of the mapped depth.

The X and Y mappings can be derived in the following manner. Given the homogeneous depth image coordinates (u, v, 1)^T, the IR camera matrix P_ir and the homogeneous world coordinates (X, Y, Z, 1)^T of a point, we can write
\[
\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = P_{ir} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}.
\]
It then follows from (20) that
\[
\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
= \begin{bmatrix} \alpha_x & 0 & x_0 & 0 \\ 0 & \alpha_y & y_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
= \begin{pmatrix} \alpha_x X + x_0 Z \\ \alpha_y Y + y_0 Z \\ Z \end{pmatrix}
\quad\therefore\quad
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} \alpha_x X/Z + x_0 \\ \alpha_y Y/Z + y_0 \\ 1 \end{pmatrix},
\]
which can be written as u = α_x X/Z + x_0 and v = α_y Y/Z + y_0, and which finally gives the mappings as
\[
X = (u - x_0)Z/\alpha_x, \tag{27}
\]
\[
Y = (v - y_0)Z/\alpha_y. \tag{28}
\]
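The three mappings (26), (27) and (28) can be applied to an entire raw depth frame at once. The sketch below assumes a 480×640 array of raw depth values and uses the IR intrinsic parameters from (17); it is illustrative rather than the project's actual implementation.

```python
import numpy as np

def depth_to_points(raw_depth, alpha_x=593.73, alpha_y=591.75, x0=315.19, y0=219.72):
    """Map a 480x640 array of raw Kinect depth values to metric (X, Y, Z)
    coordinates using (26), (27) and (28). Returns a 480x640x3 array (metres)."""
    d = raw_depth.astype(float)
    Z = 1.0 / (-0.0028 * d + 3.1023)              # metric depth, equation (26)
    v, u = np.mgrid[0:d.shape[0], 0:d.shape[1]]   # pixel indices (v = row, u = column)
    X = (u - x0) * Z / alpha_x                    # equation (27)
    Y = (v - y0) * Z / alpha_y                    # equation (28)
    return np.dstack([X, Y, Z])
```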
Figure 13: Images showing the reconstruction from the depth data (top left). The top right image shows the reconstruction from raw depth data, whilst the bottom image shows the reconstruction after mapping the depth to metric coordinates.

With (26), (27) and (28) we can map any pixel (u, v) in the depth image to a point (X, Y, Z) in real world coordinates. Having finished calibrating the Kinect, we have all the preliminary results required to start applying it to datasets.

4. Application

4.1. Colour point cloud

As mentioned in Section 1, we take a recording of a static scene we wish to reconstruct, and obtain the depth and colour datasets from the Kinect. Having this data enables us to map the depth data to metric coordinates by using (26), (27) and (28). An example of the results of this mapping can be seen in Figure 13.

The next step is to obtain the corresponding colour of each depth point from the RGB image, but before we can do that we have to undistort the RGB image with respect to radial lens distortion.

4.1.1. Radial distortion

Consumer grade camera lenses cause radial distortion around the edges of an image, as exemplified in Figure 14. This causes straight lines to bend, and the degree to which a line bends is proportional to its distance from the so-called centre of distortion.

Figure 14: Image (a) clearly shows radial distortion at the edges. This can be undistorted, as seen in (b), where the bent lines are restored to straight lines.

We can model the radial distortion as [5]
\[
\begin{pmatrix} x_d \\ y_d \end{pmatrix} = L(\tilde{r}) \begin{pmatrix} \tilde{x} \\ \tilde{y} \end{pmatrix}, \tag{29}
\]
where (x̃, ỹ) is the ideal image position (which obeys linear projection), (x_d, y_d) is the actual image position after distortion, r̃ is the radial distance \(\sqrt{\tilde{x}^2 + \tilde{y}^2}\) from the centre of distortion, and L(r̃) is a distortion factor, which is only a function of the radius r̃.

With (x, y) the measured coordinates, (x̂, ŷ) the corrected coordinates, (x_c, y_c) the centre of radial distortion and \(r = \sqrt{(x - x_c)^2 + (y - y_c)^2}\), we write the correction to the distortion as
\[
\hat{x} = x_c + L(r)(x - x_c), \tag{30}
\]
\[
\hat{y} = y_c + L(r)(y - y_c). \tag{31}
\]
The RGB image can thus be undistorted by computing L(r). During calibration we calculated the distortion coefficients for each camera of our Kinect. These are essentially coefficients in a Taylor expansion of L(r), and can be used to undistort the image. It is important to point out at this stage that we did not undistort the depth image, as its distortion coefficients are almost a factor of ten smaller than the RGB image's, as is clear from (17) and (18).

4.1.2. Mapping colours to the point cloud

With the image undistorted we can obtain the corresponding colour of each depth point from the RGB image. Given X in world coordinates, x in RGB image coordinates and P_rgb, as discussed in Sections 3.1 and 3.3, it then follows that
\[
x = P_{rgb}X, \tag{32}
\]
where x = (u, v, 1)^T now gives the RGB image coordinates corresponding to the point X. The image coordinates found are at sub-pixel accuracy, whilst pixels are represented by integers. The problem now is which pixel's colour to assign to X (nearest neighbour), or which fraction of each surrounding pixel to use to colour X (bi-linear interpolation).
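As a sketch of the colour lookup index in (32): each metric point (expressed in the IR camera's frame, as before) is projected with P_rgb from (21) into the undistorted RGB image, yielding sub-pixel coordinates. The function below is an illustrative assumption about how this could be organized, not the project's own code.

```python
import numpy as np

def project_to_rgb(points, K_rgb, R, c):
    """Project metric points (n x 3, given in IR camera coordinates) into the RGB
    image with P_rgb = K_rgb R [I | -c], as in (21) and (32). Returns n x 2
    sub-pixel image coordinates (u, v).
    Note: R and c come from (19); c there is given in millimetres, so its units
    must be made consistent with the point coordinates."""
    P_rgb = K_rgb @ R @ np.hstack([np.eye(3), -c.reshape(3, 1)])
    X_h = np.hstack([points, np.ones((points.shape[0], 1))])   # homogeneous points
    x = (P_rgb @ X_h.T).T                                      # x = P_rgb X, equation (32)
    return x[:, :2] / x[:, 2:3]                                # divide by the third coordinate
```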
Figure 15: Comparison between the nearest neighbour approach (left) and bi-linear interpolation (right). The bi-linear method results in a smoother transition between colours, as can be seen around the sticker on the fridge.

The nearest neighbour method works by rounding the image coordinate x to the nearest integer pair and choosing that coordinate's pixel as the depth point's colour, but it produces a rather jagged colour mapping. Bi-linear interpolation, on the other hand, gives a smoother transition between colours, as seen in Figure 15. It weighs each of the four pixels surrounding x by the distance between x and the pixel in question, and adds them together, to form a blend of all four colours.

Defining the first and second components of x rounded down as u_f and v_f, and u_c and v_c as the components rounded up, allows us to express the bi-linear interpolation mathematically as
\[
\begin{aligned}
C(i, j) = {} & A(u_f, v_f)(u_c - u)(v_c - v) \\
        {} + {} & A(u_c, v_f)(u - u_f)(v_c - v) \\
        {} + {} & A(u_f, v_c)(u_c - u)(v - v_f) \\
        {} + {} & A(u_c, v_c)(u - u_f)(v - v_f),
\end{aligned} \tag{33}
\]
where A(i, j) is the colour of the undistorted RGB image at pixel (i, j) and C is the 640 × 480 matrix of interpolated colours.
This enables us to create a metric point cloud reconstruction of the recorded scene, with colour. Figure 16 gives an example. The next and final step is to stitch different point clouds together to form a more complete 3D reconstruction.

Figure 16: After finding the corresponding colour pixels for each depth point in the scene from Figure 13, a colour point cloud reconstruction is possible.

4.2. Data stitching

This last phase of the project is influenced by the work of [10]. In it, Joubert uses a feature based approach to reconstruct an unknown object from a sequence of RGB images. The intrinsic parameters of the camera are known, but the motion (rotation and translation) from one image to the other is determined by interpreting the feature matches across the images. The feature detection and matching is done by the Scale Invariant Feature Transform (SIFT) [11], and a set of inliers is calculated from the SIFT matches by a robust estimator, the RANdom SAmple Consensus (RANSAC) [12].

A shortcoming in the work of [10] was that she could use only the features found in two consecutive images to triangulate points, resulting in a rather sparse reconstruction. We, on the other hand, effectively have a 640 × 480 grid of depth points we can use for reconstruction.

We approach the data stitching by making a complete recording of a static scene and dividing it up into subrecordings. Each subrecording then acts as an individual static scene, delivering an RGB image and corresponding point cloud (as discussed in Section 4.1). This enables us to use the consecutive RGB images, as [10] did, to determine the movement from one image to the other, and in so doing also the movement of the Kinect from one captured scene to the other. We then use our own metric depth data for the reconstruction, since we know the spatial relationship between the IR and RGB cameras, as determined in Section 3.3. We approach this procedure by means of the steps discussed next.

4.2.1. Estimating motion

We use SIFT features found in consecutive RGB images, see Figure 17, as in [10], to find the rotation and translation from one image to the other. It is important to note that the translation of the movement is retrievable only up to an unknown scale α, as indicated in Figure 18. We attempt to determine this unknown scale in Section 4.2.3.

4.2.2. Constructing 3D data

We apply the methods used in Section 4.1 to find the metric depth and colour mapping for each subrecording. It is important to note that all the data given and determined so far was calculated with the IR camera at the origin of the stereo coordinate system (refer to Figure 8). In this section, however, we are primarily working with the RGB images and it would therefore be convenient to work with the RGB camera at the origin.
Figure 17: SIFT feature detection applied to consecutive RGB images ((a) and (b)). The SIFT matches are shown in (c) and the RANSAC inliers in (d), where every line segment indicates a match from image (a) to (b). We can infer the camera movement from these matches.

Figure 18: A diagram indicating the spatial relationship between consecutive RGB images. Camera 2 lies somewhere on the dashed line, as the exact scale α is yet to be calculated.

4.2.3. Transforming the coordinate system and calculating the scale factor α

Transforming the 3D data, such that the RGB camera is at the origin of the coordinate system, can be achieved by
\[
\tilde{X}_{rgb} = R^T\tilde{X} + \tilde{c}, \tag{34}
\]
where X̃ is the given depth coordinate expressed in IR camera coordinates, and R and c̃ are the extrinsic parameters found in Section 3.3.

With the RGB camera at the origin we attempt to calculate the unknown scale factor α of the translation, mentioned in Section 4.2.1, from one image to the next. This is achieved by the following procedure. We calculate the 3D coordinates for all the colour pixels mapped to the depth points, as in Section 4.1, from two consecutive RGB images. Suppose x_1 and x_2 were found to be a feature correspondence in the two RGB images. Furthermore, let X̃_1 be the inhomogeneous 3D coordinates of x_1 in the first RGB camera's coordinate system, and X̃_2 the 3D coordinates of x_2 in the second RGB camera's coordinate system. We therefore have
\[
\tilde{X}_1 = R_m^T\tilde{X}_2 + \alpha\tilde{c}_m, \tag{35}
\]
where R_m and c̃_m describe the motion from the one image to the other.

Note that in (35) we have X̃_1 and X̃_2, determined by (34), as well as R_m and c̃_m (found by the procedure from [10]). The only unknown is α, and rewriting (35) as
\[
\alpha\tilde{c}_m = \tilde{X}_1 - R_m^T\tilde{X}_2 \tag{36}
\]
gives three equations for every feature match, from which α can be solved.

Ideally α should be the same for all these equations, but due to a loss in accuracy for coordinates far from the camera, this is usually not the case. To approximate α and remove outliers, we take the median of all the α values. Finally we have an estimation for the translation from one image to the other, namely c̃′ = αc̃_m.
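A sketch of this scale estimation: (36) is evaluated for every inlier match and the median of the resulting component-wise α values is taken. The inputs (matched 3D points in the two RGB camera frames, and the unit-scale motion from [10]) are assumed to be available.

```python
import numpy as np

def estimate_scale(X1, X2, R_m, c_m):
    """Estimate the unknown scale alpha in (35) by evaluating (36),
    alpha * c_m = X1 - R_m^T X2, for every feature match and taking the
    median of all the resulting component-wise alpha estimates.
    X1, X2: n x 3 matched 3D points in the first/second RGB camera frame."""
    diff = X1 - (R_m.T @ X2.T).T     # right-hand side of (36), one row per match
    alphas = diff / c_m              # three alpha estimates per match (broadcast over rows)
    # components of c_m very close to zero could be excluded here for robustness
    return np.median(alphas)

# the correctly scaled translation is then  c_prime = estimate_scale(X1, X2, R_m, c_m) * c_m
```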
4.2.4. Applying total rotation and translation

It is necessary to calculate the total rotation R_tot and translation c̃_tot for the current image, as the method up until now only describes the relative movement between image pairs. It can be shown that [10]
\[
R_{tot} \leftarrow R_m R_{tot}, \tag{37}
\]
\[
\tilde{c}_{tot} \leftarrow R_{tot}^T\tilde{c}'_m + \tilde{c}_{tot}, \tag{38}
\]
where R_m and c̃′_m are the rotation and correctly scaled translation of the current camera relative to the previous one. To align the current colour point cloud with the previous reconstructions, we apply R_tot and c̃_tot to it.
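Rather than committing to a particular implementation of the update equations (37) and (38), the sketch below aligns each subrecording's point cloud by walking back through the chain of pairwise relations (35), which accumulates the same total motion in effect; the list of correctly scaled pairwise motions from the previous step is an assumed input.

```python
import numpy as np

def align_clouds(clouds, motions):
    """Map each point cloud into the first RGB camera's frame by repeatedly
    applying the pairwise relation (35): X_prev = R_m^T X_curr + c_m_scaled.
    clouds:  list of n_k x 3 arrays, one per subrecording (RGB camera frame)
    motions: list of (R_m, c_m_scaled) pairs; motions[k] relates cloud k+1 to cloud k."""
    aligned = [clouds[0]]                      # the first cloud defines the reference frame
    for k, cloud in enumerate(clouds[1:]):
        X = cloud
        # walk back through the chain of relative motions, most recent first
        for R_m, c_m in reversed(motions[:k + 1]):
            X = (R_m.T @ X.T).T + c_m          # equation (35) applied to every point
        aligned.append(X)
    return aligned
```

Maintaining R_tot and c̃_tot incrementally, as in (37) and (38), avoids re-walking the whole chain for every new subrecording.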
5. Results

The methods described in Sections 4.1 and 4.2 enable us to create a dense metric colour point cloud reconstruction of a static scene. Some example reconstructions are shown and discussed in this section.

It is difficult to quantitatively assess the accuracy of our method, since independently generated "ground truth" data is unavailable and beyond the scope of this project. We therefore evaluate the success of our experiments purely by visual inspection.

Figure 19 (a) and (b) (on the last page of this report) show a partial reconstruction of a man in a chair, and Figure 19(c) a partial reconstruction of the inside of an office.

We note that the success of our method hinges on the accuracy of the feature matches. Figure 20 shows another part of the recording used to create the reconstructions in Figures 19 (a) and (b), where inaccurate feature matching caused the stitching to fail. The moment a consecutive pair of RGB images delivers a small or partially incorrect set of matches, the calculated motion of the Kinect becomes unreliable and destroys the rest of the reconstruction process. The fact that we take a pairwise motion estimation approach causes a single mistake to propagate through the solution.

Figure 20: This reconstruction shows the result of inaccurate feature matching in the stitching process.

Nevertheless, if good feature matching is achieved between consecutive RGB images, we can build impressive 3D reconstructions.

6. Conclusions and future work

In this project we have successfully calibrated the IR and RGB cameras of our Kinect, inferred a general mapping from its raw depth data to metric world coordinates and mapped colours from the RGB image to the point cloud. This enabled us to create a metric coloured point cloud reconstruction for a single IR and RGB image pair. Adapting the feature based approach of [10] enabled us to stitch consecutive coloured point clouds together to form a more complete dense reconstruction of a static scene. Furthermore, we argued that at worst a partial reconstruction is achievable, as seen in Figure 19, as the success of the method hinges on the quality of the features found between successive RGB images.

Some of the methods used in this project are rather basic and probably not as accurate as they could be, but they were sufficient for the purpose of this project.

Pre- and post-processing may be useful. As mentioned in Section 3.2, the use of a filter such as the adaptive median filter could remove noise from the IR images, resulting in a more accurate calibration. Another possibility would be to perform smoothing on the final reconstruction, or to use our stitched reconstruction as initialization for a point cloud alignment procedure such as ICP [13]. The depth data is quite noisy around the edges, resulting in unwanted gaps in the reconstruction, and this is definitely an aspect to consider and improve upon.

7. Acknowledgements

The author thanks Simon Muller for his technical assistance in capturing datasets, Francois Singels for clarifying the theory behind camera geometry, Daniek Joubert for sharing her motion estimation code, and Dr Willie Brink for his assistance and support with this project.

8. References

[1] The OpenKinect Project, http://openkinect.org/wiki/Main_Page.

[2] V.I. Pavlovic, R. Sharma, T.S. Huang, Visual interpretation of hand gestures for human-computer interaction: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence, 7:677–695, 1997.

[3] J. Miura, Y. Negishi, Y. Shirai, Mobile robot map generation by integrating omnidirectional stereo and laser range finder, IEEE/RSJ International Conference on Intelligent Robots and Systems, 1:250–255, 2002.

[4] L.M. Lorigo, R.A. Brooks, W.E.L. Grimson, Visually-guided obstacle avoidance in unstructured environments, IEEE/RSJ International Conference on Intelligent Robots and Systems, 1:373–379, 1997.

[5] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Second Edition, Cambridge University Press, 2003.

[6] F. Singels, Real-time stereo reconstruction using hierarchical DP and LULU filtering, MSc thesis, Stellenbosch University, 2010.

[7] J. Bouguet, Camera Calibration Toolbox for Matlab, http://www.vision.caltech.edu/bouguetj/calib_doc/.
[8] Z. Zhang, A flexible new technique for camera calibration, Technical Report MSR-TR-98-71, Microsoft Research, 1998.

[9] R.C. Gonzalez, R.E. Woods, Digital Image Processing, Third Edition, Pearson Prentice Hall, 2008.

[10] D. Joubert, Motion estimation for multi-view object reconstruction, Applied Mathematics Honours Project, Stellenbosch University, 2010.

[11] D.G. Lowe, Object recognition from local scale-invariant features, Proceedings of the International Conference on Computer Vision, 2:1150–1157, 1999.

[12] M. Fischler, R. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM, 24(6):381–395, 1981.

[13] P.J. Besl, N.D. McKay, A method for registration of 3-D shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992.
Figure 19: Partial reconstructions made with a subset of a few captured datasets. The reconstructions in (a) and (b) are shown from different angles. Some of the RGB images of each dataset are shown on the left, and a representation of the cameras for each RGB image used is included in the reconstructions.
