I hereby declare that, except where specifically indicated, the work submitted herein is my own original work. The length of this report is no less than 10000 words and no more than 15000.
Acknowledgements
Finally, this year’s research was made possible through the support from
the Internal Graduate Studentship award of Trinity College, Cambridge.
Contents

1 Introduction
  1.1 About Motion Capture
  1.2 Applications
2 Formulation
3 Literature Review
  3.1 Markerless Systems
  3.2 Marker Technologies
4 Current Work
  4.1 The Measurement Prediction Λ(X)
    4.1.1 Model Definition
    4.1.2 Model Adaptation to State
    4.1.3 The State
    4.1.4 The Rendering Function
  4.2 The Jacobian JΛ(X)
  4.3 The Measurement Dk
5 Results
  5.1 Simulated Data
    5.1.1 Single Cylinder
    5.1.2 Upper Body
  5.2 Real Data
6 Conclusions and Future Work
A Appendix
Bibliography
List of Figures

4.1 Top Row: Two views of the model when it assumes state X. Bottom Row: The set of chains c defined by the model: a chain is used for each arm, the head and for each lateral part of the torso.
4.2 Left: Plot of $\lambda_p(X)$ vs $d_p^2(X)$. Right: Illustration of distance measure $d_p$.
4.3 Left: The smallest line-to-line distance $d_p$ occurs at a location outside the limb's length. Right: Example of a ray that produces a $\bar{d}_p$ that misses the length of the limb. In such a case detection of an intersection between ray and cylinder is facilitated by calculation of $\hat{d}_p$ and $\check{d}_p$ on either side of the limb.
4.4 Rendered cylinder under various states X. Notice the discontinuities in pixel intensity along the cylinder's axis in the image on the top-right.
4.5 Left: Illustration of distance measures $\bar{d}_p$, $\hat{d}_p$ and $\check{d}_p$, used to render a cylinder with hemispherical ends. Right: Distances $\{d_{n,p}\}_{n=1}^{N_s}$, used to render the limb as a sphere tree.
4.6 The resulting measurement predictions for one limb using the Cylinder with Hemispherical Ends method.
4.7 Demonstration of the Sphere-Tree algorithm, an algorithm that represents objects of arbitrary shapes as groups of spheres. The number of spheres used depends on the desired accuracy. This image is by courtesy of G. Bradshaw and C. O'Sullivan [1].
4.8 Resulting measurement predictions for various X of one limb when rendered as a sphere tree.
4.9 The pre-processing stage of the system. 1st Row: View from camera 1. 2nd Row: View from camera 2. Raw framesets $I_k$ are grabbed (column 2), and compared against the mean background frameset $B_\mu$ (column 1) to produce a noisy silhouette image $E_k$ (column 3). The frameset is then morphologically filtered and smoothed to produce the final measurement $D_k$ (column 4).
characters. On the other hand, markerless methods have seen only a decade's worth of research. Even though promising results have been reported, these systems have not yet provided the same robustness or precision as their marker-based competitors.
1.2 Applications
$$X_k = f(X_{k-1}) + w_k, \qquad (2.1)$$

and is known as the process equation. The term $f(X_{k-1})$ is the state prediction (also referred to as the a priori state estimate) and $w_k$ is the process noise. Assuming a suitable model, the model's projection onto the image planes of the cameras $\Lambda(X_k)$ adequately describes the projection of the subject $D_k$, as indicated by the measurement equation:

$$D_k = \Lambda(X_k) + N_k \qquad (2.2)$$

where $D_k$ is the measurement (also referred to as the data) and $\Lambda$ the non-linear measurement function. $\Lambda(X_k)$ is the measurement prediction and $N_k$ the measurement noise.

Both process and measurement noise are assumed to follow a zero-mean Gaussian distribution, i.e. $w_k \sim \mathcal{N}(0, Q)$ and $N_k \sim \mathcal{N}(0, R)$.
the effect of the posterior distribution's local modes less severe. In several applications only a restricted number of types of motion needs to be tracked, and in such cases motion models are introduced. State predictions can then be made according to the motion model, helping the tracker follow greater changes in state. Although this improves results when the a priori expected types of motion are observed, motion models limit the system's ability to track other types of motion. Recently, systems in this category have been designed that adapt the parameters of the temporal model over time, allowing the tracker to lock onto a different motion. Although this greatly enhances the capabilities of such systems, motion models are generally only applied when there is no need to track arbitrary motion.

Because of the various design issues that each motion capture system must address, several reviews have been published on the subject, dividing systems into several taxonomies [21, 22]. In the following sections we review a number of motion capture systems.
subjects using chamfer distance maps applied to edge data. In their system they assume that subjects wear tight-fitting coloured clothes to facilitate the edge-extraction algorithm. The state-space for each individual tracked is searched by first making a prediction for the individual's state that is as accurate as possible. The state variables are then allowed to change in order to improve the cost function, but this is done in a hierarchical fashion: the head and torso are searched first while the rest of the body is frozen in its predicted state, then the arms are searched, and finally the legs. At a lower level of the state-space hierarchy, each arm (or leg) is searched by fitting the upper arm (thigh) first, while the forearm (calf) is frozen in its state prediction. Finally the state of the forearm (calf), which lies at the bottom of the state-space hierarchy, is allowed to change in order to improve the fitness function. This hierarchical search decomposes the state-space into a set of smaller spaces, making the search easier. They claim that this decomposition is paramount, since several individuals are tracked simultaneously, which would otherwise lead to a very large state-space. The search method itself is a best-first search, i.e. the state-space is searched in the vicinity of the state that yields the best evaluation of the cost function. Then, given the best state estimate so far, the new state estimate is found by incrementing each of the state variables by pre-set values, as dictated by the state-space decomposition described above. Since best-first search implies that the final converged state is very close to the state prediction, their system relies heavily on an accurate state prediction, which is calculated by considering a 2nd order kinematic model and the series of states previously attained by the subject.
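To make the search strategy concrete, the following is a minimal sketch of a hierarchical, frozen-groups best-first search of this general kind. The variable grouping, step size and cost function are illustrative placeholders, not details taken from the system described above.

```python
import numpy as np

# Hypothetical grouping of state variables into a hierarchy of the kind
# described above: torso/head first, then arms, then legs.
HIERARCHY = [
    ("torso_head", [0, 1, 2, 3]),
    ("arms",       [4, 5, 6, 7]),
    ("legs",       [8, 9, 10, 11]),
]

def hierarchical_best_first(x0, cost, step=0.05, sweeps=3):
    """Refine the predicted state x0 one body-part group at a time.

    Within each group, every variable is nudged by +/- step and any move
    that lowers the cost is kept (a crude best-first coordinate search);
    all other groups stay frozen at their current values.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(sweeps):
        for _name, idx in HIERARCHY:        # search the hierarchy top-down
            for i in idx:
                best = cost(x)
                for delta in (+step, -step):
                    trial = x.copy()
                    trial[i] += delta
                    c = cost(trial)
                    if c < best:            # keep the improving move
                        best, x = c, trial
    return x
```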
Kakadiaris and Metaxas [26] use physics-based tracking techniques to facilitate tracking. They use a contour representation of the subject, extracted by three orthogonally-positioned cameras. A kinematic model is used to predict the state in the current frame; 2D forces of attraction are then applied on the model from each point of the occluding contour, making the model's projection move towards the data. They use principles from projective geometry to associate each point of the occluding contour with a point of the model, given the model's predicted position. The set of forces is used as the measurement for an Extended Kalman Filter (EKF) [27, 28, 29], which adapts the state in order to minimise the net force applied on the model. One of the contributions of their work is a formal method to decide which of the cameras used by their system will be active while processing each frame. To make this decision they consider the visibility of each limb of the model and the observability of its predicted motion as viewed by each camera.
times, each time corresponding to one annealing run. The particles are allowed to move with decreasing freedom as the level of the annealing run increases. Greater freedom allows the particles to jump from one local optimum to another, while at the final layer they are effectively trapped in one of the maxima, which is hoped to be the global optimum. Their results are very good, but the processing cost is extremely high compared with other systems (an hour's processing time for 5 seconds of footage was quoted). Even though a method was proposed in [19] to partition the state-space in order to improve the speed, their results remain precise yet slow to obtain.
In the past two decades many systems have been developed to achieve motion capture using marker technology, which imposes restrictions on the clothing of the subject. In most cases lycra clothing has to be worn, and IR emitters/reflectors have to be attached all over the subject's body, usually close to the joints. Even though these methods have limited portability, with filming taking place in dedicated studios, they have proven to yield accurate results and are widely used in the film industry for precise animation of digital characters.

The design of marker-based systems for motion capture follows very similar principles to that of their markerless counterparts: (i) there is still a need to collect data, which in this case is the projection of the markers' positions in space, (ii) an underlying model of the subject still needs to be used, and (iii) the state-space of the model still needs to be searched for the state that gives the optimal fit between the model's projection and the data.

In addition to the above, since the number of markers used by such systems is finite, the cost of one of these markers being occluded is high and could seriously confuse the system. Marker systems therefore face the problem of data association, which was not explicitly tackled by markerless approaches. Each marker data point now needs to be associated with the projection of the correct marker of the model, thereby identifying which markers have been occluded. For completeness, a few systems are outlined below.
Ringer and Lasenby [41, 42] use Multiple Hypothesis Tracking (MHT) to solve the marker association problem. For each frame grabbed they maintain several association hypotheses, whose validity is revealed on-line. As the
$$D_k = \Lambda(X^*) + N_k, \qquad (4.1)$$

where $X^*$ is the true pose of the subject. In our system's case the measurement $D_k$ is a silhouette of the subject extracted from the set of $N$ frames captured by the cameras at time $t_k$, and $\Lambda(X)$ is the silhouette produced by rendering $M(X)$, the 3D model under state $X$. $N_k$ describes all the discrepancies between the model's silhouette and the data. Since $N$ cameras of pixel resolution $U \times V$ are used, $\Lambda(X)$, $D_k$ and $N_k$ are all column vectors of size $NUV$, e.g.,

$$\Lambda(X) = \begin{bmatrix} \lambda_1(X) & \lambda_2(X) & \ldots & \lambda_p(X) & \ldots & \lambda_{NUV}(X) \end{bmatrix}^T \qquad (4.2)$$

$$P(X|D_k) = \frac{P(D_k|X)\,P(X)}{P(D_k)} \qquad (4.4)$$
with $P(D_k|X)$ being the likelihood, $P(X)$ the prior distribution, and $P(D_k)$ the evidence.

Regarding the evidence as a normalising constant and assuming no prior knowledge about the position of the subject's body in the $k$th dataset, the prior distribution becomes flat, and we therefore need to maximise the likelihood, using the maximum likelihood estimator:

$$\hat{X}_{ML} = \arg\max_X P(D_k|X). \qquad (4.5)$$

$$R = \sigma_n^2 I_{NUV}, \qquad (4.6)$$

where $I_{NUV}$ is the identity matrix of size $NUV$ and $\sigma_n^2$ is a constant. This gives,

$$P(D_k|X) = \left(2\pi\sigma_n^2\right)^{-\frac{NUV}{2}} \exp\left\{-\frac{1}{2\sigma_n^2}\left(D_k - \Lambda(X)\right)^T\left(D_k - \Lambda(X)\right)\right\} \qquad (4.7)$$
$$\hat{X}^{(i)} = \hat{X}^{(i-1)} + K\left(D_k - \Lambda(\hat{X}^{(i-1)})\right) \qquad (4.9)$$

where

$$K = \left(J_\Lambda(\hat{X}^{(i-1)})^T J_\Lambda(\hat{X}^{(i-1)})\right)^{-1} J_\Lambda(\hat{X}^{(i-1)})^T \qquad (4.10)$$

and $J_\Lambda(X) = \frac{\partial}{\partial X}\Lambda(X)$ is the Jacobian of the model's silhouette. Equations (4.9) and (4.10) show that the tracker needs a new measurement for every frame $k$ grabbed by the camera system, and that the model is rendered and its Jacobian calculated at each iteration $i$. The remainder of this chapter gives details of how the model is constructed and rendered in order to give the measurement prediction, how its Jacobian is calculated, and how we extract the measurement from the cameras' framesets.
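For concreteness, here is a minimal numpy sketch of the iteration (4.9)–(4.10); `render` and `jacobian` are stand-ins for the measurement-prediction and Jacobian machinery developed in the remainder of this chapter.

```python
import numpy as np

def gauss_newton_step(x_prev, D_k, render, jacobian):
    """One iteration of (4.9)-(4.10).

    render(x)   -> measurement prediction Lambda(x), shape (N*U*V,)
    jacobian(x) -> J_Lambda(x), shape (N*U*V, S)
    """
    J = jacobian(x_prev)
    r = D_k - render(x_prev)             # silhouette residual
    # K r = (J^T J)^{-1} J^T r, applied via a linear solve for stability
    return x_prev + np.linalg.solve(J.T @ J, J.T @ r)

def fit_state(x0, D_k, render, jacobian, iters=30):
    """Iterate (4.9) for a fixed iteration budget."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = gauss_newton_step(x, D_k, render, jacobian)
    return x
```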
Figure 4.1: Top Row: Two views of the model when it assumes state X.
Bottom Row: The set of chains c defined by the model: a chain is used
for each arm, the head and for each lateral part of the torso.
For each chain $c$ and for each limb $l$ we also define a limb radius $r_{c,l}$, limb length $s_{c,l}$, and initial ($X = 0$) limb orientation, expressed as a unit vector $\hat{v}_{c,l}$. Note that chains 3 and 4 are rigidly attached to, and therefore constrained to always have the same orientation as, limb 1 of chains 1, 2, and 3.

Put in a more formal context, the definition of the model $M$ includes definitions for the following:

$\{L_c\}_{c=1}^{C}$, the length of each chain $c$,

$\left\{\{\hat{v}_{c,l}\}_{l=1}^{L_c}\right\}_{c=1}^{C}$, the initial orientation of each limb $l$ in each chain $c$,

$\left\{\{r_{c,l}\}_{l=1}^{L_c}\right\}_{c=1}^{C}$, the radius of each limb $l$ in each chain $c$, and

$\left\{\{s_{c,l}\}_{l=1}^{L_c}\right\}_{c=1}^{C}$, the length of each limb $l$ in each chain $c$.
4.1. The Measurement Prediction Λ(X) 31
$$\hat{v}^*_{c,l}(X) = R_{c,1}(X) R_{c,2}(X) \cdots R_{c,l}(X)\,\hat{v}_{c,l} = \left(\prod_{m=1}^{l} R_{c,m}(X)\right) \hat{v}_{c,l} \qquad (4.11)$$

$$p_{c,l}(X) = x_{sp}(X) + s_{c,1}\hat{v}^*_{c,1}(X) + s_{c,2}\hat{v}^*_{c,2}(X) + \ldots + s_{c,l}\hat{v}^*_{c,l}(X) = x_{sp}(X) + \sum_{m=1}^{l} s_{c,m}\hat{v}^*_{c,m}(X)$$

$$p_{c,l}(X) = x_{sp}(X) + \sum_{m=1}^{l} s_{c,m} \left(\prod_{n=1}^{m} R_{c,n}(X)\right) \hat{v}_{c,m} \qquad (4.12)$$
Since the root joint of every limb $l$ is the leaf joint of limb $l-1$ in the same chain, $p_{c,l}$ and $p_{c,l-1}$ are the end-point positions of limb $l$. The set of 3D positions of all the leaf joints $\left\{\{p_{c,l}\}_{l=1}^{L_c}\right\}_{c=1}^{C}$ therefore contains all the necessary information for representing $M(X)$ under any arbitrary state $X$.
$$X = \begin{bmatrix} x_{sp}^T & q_{1,1}^T \ldots q_{1,L_1}^T & q_{2,1}^T \ldots q_{2,L_2}^T & \ldots & q_{c,l}^T & \ldots & q_{C,L_C}^T \end{bmatrix}^T \qquad (4.13)$$

where

$$x_{sp} = \begin{bmatrix} x \\ y \\ z \end{bmatrix}, \qquad q_{c,l} = \begin{bmatrix} q_0 \\ q_x \\ q_y \\ q_z \end{bmatrix} = \begin{bmatrix} \cos\frac{\theta}{2} \\ u_x \sin\frac{\theta}{2} \\ u_y \sin\frac{\theta}{2} \\ u_z \sin\frac{\theta}{2} \end{bmatrix}. \qquad (4.14)$$

In (4.14) above, $q_0$ etc. is used for brevity instead of $q_{0,c,l}$, and likewise for $\theta$, $u_x$ etc.
Each quartet $q_{c,l}$ of state rotation coefficients as defined in (4.14) constitutes a quaternion [46] and parameterises the rotation of limb $l$ in chain $c$ about a unit axis $\hat{u} = [u_x\ u_y\ u_z]^T$ and through an angle $\theta$. Even though each rotation is now described with 4 parameters, these are constrained such that $|q_{c,l}| = 1$. The new estimate for the state is calculated using (4.9) and (4.10), and subsequently all the quaternions are normalised, effectively resulting in 3 degrees of freedom per limb rotation.

Euler angles ($q_{c,l} = [\phi_x\ \phi_y\ \phi_z]^T$) could have been used to parameterise the limbs' rotations, resulting in a smaller state vector $X$. Such a parameterisation, however, suffers from discontinuities in $q_{c,l}$ even during small variations of the limb's orientation [47]. Using quaternions, on the other hand, smooths the state-space for $X$, thus eliminating local minima and singularities.
The state-dependent limb rotations $\left\{\{R_{c,l}(X)\}_{l=1}^{L_c}\right\}_{c=1}^{C}$ of equations (4.11) and (4.12) are $3 \times 3$ rotation matrices resulting from the quaternions $\left\{\{q_{c,l}\}_{l=1}^{L_c}\right\}_{c=1}^{C}$ stored in the state. For the quaternion as defined in (4.14),
Figure 4.2: Left: Plot of $\lambda_p(X)$ vs $d_p^2(X)$. Right: Illustration of distance measure $d_p$.
$$R_{c,l}(X) = \begin{bmatrix} q_0^2 + q_x^2 - q_y^2 - q_z^2 & 2(q_x q_y - q_0 q_z) & 2(q_x q_z + q_0 q_y) \\ 2(q_x q_y + q_0 q_z) & q_0^2 - q_x^2 + q_y^2 - q_z^2 & 2(q_y q_z - q_0 q_x) \\ 2(q_x q_z - q_0 q_y) & 2(q_y q_z + q_0 q_x) & q_0^2 - q_x^2 - q_y^2 + q_z^2 \end{bmatrix}. \qquad (4.15)$$
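A small numpy sketch of (4.15), including the renormalisation step described above, may be useful as a reference; the function name and the example rotation are illustrative.

```python
import numpy as np

def quat_to_rotation(q):
    """Build the 3x3 matrix of (4.15) from q = [q0, qx, qy, qz].

    The quaternion is renormalised first, mirroring the per-iteration
    normalisation applied after each state update.
    """
    q0, qx, qy, qz = q / np.linalg.norm(q)
    return np.array([
        [q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz),             2*(qx*qz + q0*qy)],
        [2*(qx*qy + q0*qz),             q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
        [2*(qx*qz - q0*qy),             2*(qy*qz + q0*qx),             q0*q0 - qx*qx - qy*qy + qz*qz],
    ])

# Example: a rotation of theta about the z-axis reproduces the familiar
# planar rotation matrix.
theta = np.pi / 3
q = np.array([np.cos(theta / 2), 0.0, 0.0, np.sin(theta / 2)])
R = quat_to_rotation(q)
```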
$$\lambda_p(X) = \begin{cases} 255 & \text{if } d_p(X) > \frac{3}{2} r_{c,l} \\[4pt] \dfrac{255}{2 r_{c,l}^2}\, d_p^2(X) - \dfrac{255}{8} & \text{if } \frac{1}{2} r_{c,l} \le d_p(X) \le \frac{3}{2} r_{c,l} \\[4pt] 0 & \text{if } d_p(X) < \frac{1}{2} r_{c,l} \end{cases} \qquad (4.16)$$
where $d_p$ is the shortest distance between the axis of the limb and the current ray, as shown in Fig. 4.2 on the right. We have chosen $\lambda_p = f(d_p^2)$ so that the rendering function is differentiable, as will be seen in Sect. 4.2 and Appendix A.
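As a sketch of how one pixel would be shaded, the ramp of (4.16) as reconstructed above can be written compactly with a clip, since clipping to [0, 255] reproduces the three branches; the function signature is illustrative.

```python
import numpy as np

def lambda_p(d2, r):
    """Pixel intensity of (4.16) as a function of the squared
    line-to-line distance d2 = d_p^2 and the limb radius r."""
    d2 = np.asarray(d2, dtype=float)
    ramp = 255.0 / (2.0 * r * r) * d2 - 255.0 / 8.0
    # 0 on the limb (d_p < r/2), 255 in the background (d_p > 3r/2)
    return np.clip(ramp, 0.0, 255.0)
```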
We render the model by firing rays from each camera's centre of projection through each pixel of its image plane. For each pixel $p$, and for each limb $l$ in chain $c$ of the model, we calculate $d_p$ using the line-to-line distance formula, which in our case becomes
Figure 4.4: Rendered cylinder under various states X. Notice the discontinuities in pixel intensity along the cylinder's axis in the image on the top-right.
Cylinder
Figure 4.5: Left: Illustration of distance measures $\bar{d}_p$, $\hat{d}_p$ and $\check{d}_p$, used to render a cylinder with hemispherical ends. Right: Distances $\{d_{n,p}\}_{n=1}^{N_s}$, used to render the limb as a sphere tree.
Figure 4.6: The resulting measurement predictions for one limb using the
Cylinder with Hemispherical Ends method.
of the limb, with its radius being equal to the radius of the cylinder. As seen
in Fig. 4.6, the resulting measurement prediction now shows finite gradients
from all possible states.
Sphere Tree
The performance of the system when the limbs are rendered as cylinders with hemispherical ends has proven very promising. The limbs now produce a differentiable measurement prediction under any X, thanks to the point-to-line distance measures introduced in the previous subsection. For this reason, and because the project ultimately aims to render far more realistic models, we have decided to implement a rendering method that represents limbs as a nested group of spheres, or sphere tree.

Sphere trees have been studied by G. Bradshaw and C. O'Sullivan in [1], and efficient algorithms exist that can represent objects of arbitrary shapes as groups of spheres, the number of spheres increasing with the intended accuracy, as illustrated in Fig. 4.7. In our case sphere trees could eventually form a very realistic representation of the subject's limbs, resulting in measurement predictions Λ(X) of superior quality. Apart from the future
Figure 4.8 shows a limb rendered as a group of equally sized spheres. Note that the sphere tree rendering method is not constrained to a constant sphere radius, allowing for the natural curvature of the subject's limbs.
4.2 The Jacobian JΛ(X)
Equations (4.9) and (4.10) indicate that for every new estimate $\hat{X}^{(i)}$, the Jacobian $J_\Lambda(\hat{X}^{(i-1)})$ of the measurement prediction has to be calculated. $J_\Lambda$ is a matrix of size $NUV \times S$, where $S$ is the size of the state vector $X$. For each pixel of $\Lambda(X)$ we calculate the derivative $\frac{\partial \lambda_p(X)}{\partial X}$, which is a $1 \times S$ vector. All these vectors are subsequently stacked to form $J_\Lambda$.
By differentiating (4.16) we have,

$$\frac{\partial \lambda_p(X)}{\partial X} = \begin{cases} \dfrac{255}{2 r_{c,l}^2} \dfrac{\partial}{\partial X}\left(d_p^2(X)\right) & \text{if } \frac{1}{2} r_{c,l} \le d_p(X) \le \frac{3}{2} r_{c,l} \\[4pt] 0 & \text{otherwise.} \end{cases} \qquad (4.20)$$
Of course, the value of $\frac{\partial}{\partial X} d_p^2(X)$ depends on the distance measure used to render the pixel. As an example of how the distance derivatives are calculated, the derivative of $\bar{d}_p^2$ is derived in Appendix A. In general, for every distance measure used, $d_p^2 = f(p_{c,l}, p_{c,l-1})$, and so the derivative of every joint position $p_{c,l}$ of every chain $c$ has to be calculated. Differentiating (4.12) w.r.t. the state gives,
$$\frac{\partial p_{c,l}}{\partial X} = \frac{\partial x_{sp}}{\partial X} + \frac{\partial}{\partial X}\left\{\sum_{m=1}^{l} s_{c,m} \left(\prod_{n=1}^{m} R_{c,n}(X)\right) \hat{v}_{c,m}\right\} \qquad (4.21)$$
which is a $3 \times S$ matrix whose columns are $\frac{\partial p_{c,l}}{\partial x_s}$, $x_s$ being the $s$th state variable. For the first 3 state variables (i.e. the spatial components of the state) equation (4.21) becomes,

$$\frac{\partial p_{c,l}}{\partial x_{sp}} = \frac{\partial x_{sp}}{\partial x_{sp}} = I_3 \qquad (4.22)$$
where I3 is the 3×3 identity matrix. For the rest of the state components,
which correspond to rotations, (4.21) becomes
$$\frac{\partial p_{c,l}}{\partial q_{\xi,c,\ell}} = \sum_{m=1}^{l} s_{c,m} \left\{\left(\prod_{n=1}^{\ell-1} R_{c,n}\right) \frac{\partial R_{c,\ell}}{\partial q_{\xi,c,\ell}} \left(\prod_{n=\ell+1}^{m} R_{c,n}\right)\right\} \hat{v}_{c,m} \qquad (4.23)$$
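Because the analytic chain (4.20)–(4.23) is intricate, a central-difference check of $J_\Lambda$ is a useful safeguard when implementing it; the sketch below is a verification aid under that assumption, not the method the system itself uses.

```python
import numpy as np

def numerical_jacobian(render, x, eps=1e-5):
    """Central-difference approximation of J_Lambda, for checking the
    analytic derivatives of (4.20)-(4.23).

    render(x) returns Lambda(x) as a flat array of shape (N*U*V,).
    """
    x = np.asarray(x, dtype=float)
    f0 = render(x)
    J = np.zeros((f0.size, x.size))
    for s in range(x.size):
        dx = np.zeros_like(x)
        dx[s] = eps
        J[:, s] = (render(x + dx) - render(x - dx)) / (2.0 * eps)
    return J
```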
Figure 4.9: The pre-processing stage of the system. 1st Row: View from camera 1. 2nd Row: View from camera 2. Raw framesets $I_k$ are grabbed (column 2), and compared against the mean background frameset $B_\mu$ (column 1) to produce a noisy silhouette image $E_k$ (column 3). The frameset is then morphologically filtered and smoothed to produce the final measurement $D_k$ (column 4).
[Figure 5.1: plot of convergence against no. of iterations for the 'Cylinder', 'Hemispherical Ends' and 'Sphere Tree' rendering methods.]
Even though the system generally performs well with any of the rendering methodologies described above and converges to the correct state X∗, problems with the 'Cylinder' method arise when we consider an initial state X̂(0) in which the limb is viewed sideways by the camera: the gradient in Λ(X̂(0)) along the limb's axis is then undefined, and the Jacobian is therefore unable to provide gradient information to direct the state correctly, as mentioned in Sect. 4.1.4 of Chapt. 4. This is shown in Fig. 5.1, where the model using the 'Cylinder' method fails to converge. We see that neither of the two alternative rendering methods suffers from this problem, and therefore the rest of the testing of the system was done using the 'Hemispherical Ended Cylinder'. The 'Sphere Tree' method is left for future use, when we implement the Sphere-Tree algorithm for creating realistic models, but it is important to note that its performance is equivalent to that of the 'Hemispherical Ended Cylinder'.
Single Camera
In Fig. 5.2 we display three snapshots from the silhouette matching process for the case of a single-camera system. Despite the ambiguity problem when working with pure silhouettes, we see that the algorithm can converge to the global optimum, provided that X̂(0) lies in the vicinity of X̂ML.
[Figure: plots of the cost $(D - \Lambda(X))^T(D - \Lambda(X))$ ($\times 10^7$) against iterations for the Single Camera Experiment and the Double Camera Experiment.]
Stereo Vision
Figure 5.6: Tracking with low resolution data. Top Row: Dk for k = 1,
9, 17, 25 and 33. Bottom Row: The model’s converged state after 25
iterations for each corresponding Dk of the top row from the same view.
The model was rendered with a higher resolution for clarity. Only views
from camera 1 are shown in this illustration.
lower resolutions, with lower yet acceptable accuracy, as shown in Fig. 5.6
for the case of 36 × 48 pixels in each camera.
6 Conclusions and Future Work
The current system’s robustness hasn’t been completely tested yet. So far
we have successfully tracked a sequence showing a simple movement with
the subject moving at a relatively slow pace. The movement is tracked
successfully even at medium to low resolutions, down to 36 × 48 pixels in
each camera. We will however need to test the system on additional gestures
of increasing complexity.
Special attention will be given to gestures that involve occlusions. It is
expected that the system will cope with sequences involving a low degree
of occlusion, e.g. sequences where one of the two cameras sees an occluded
silhouette while the other does not. It is also believed that higher degrees
of occlusions will be handled when introducing more cameras, but unfor-
tunately our camera-capturing equipment as it stands now can only store
datasets produced by 2 cameras. Nevertheless, there are many issues to be
improved in order to increase the system’s robustness.
$$x_k = \hat{x}_k + K_k\left(z_k - h(\hat{x}_k)\right) \qquad (6.1)$$

where

$$P_k' = J_f P_{k-1} J_f^T + Q \qquad (6.4)$$

$$P_k = \left(I - K_k J_h\right) P_k'. \qquad (6.5)$$

$J_f = \frac{\partial}{\partial x} f(x_{k-1})$ and $J_h = \frac{\partial}{\partial x} h(\hat{x}_k)$ are the Jacobians of the process and measurement functions respectively.

Considering the special case where the process function $f$ is the identity, and where the measurement covariance matrix $R$ is diagonal with entries much smaller than those of the process covariance matrix $Q$, then $J_f = I$ and the EKF equations shown above become,

$$\hat{x}_k = x_{k-1}$$
$$P_k' = P_{k-1} + Q$$
$$\begin{aligned}
K_k &= (P_{k-1} + Q)\, J_h^T \left(J_h (P_{k-1} + Q) J_h^T + R\right)^{-1} \\
    &\approx (P_{k-1} + Q)\, J_h^T \left(J_h (P_{k-1} + Q) J_h^T\right)^{-1} \\
    &= (P_{k-1} + Q)\, J_h^T (J_h^T)^{+} (P_{k-1} + Q)^{-1} J_h^{+} \\
    &= J_h^{+} \\
    &= \left(J_h^T J_h\right)^{-1} J_h^T
\end{aligned}$$
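For reference, a generic EKF cycle implementing (6.1), (6.4) and (6.5), with the gain expression that appears in the derivation above, might look as follows; `f`, `h` and their Jacobians are placeholders for the process and measurement models.

```python
import numpy as np

def ekf_step(x, P, z, f, h, Jf, Jh, Q, R):
    """One EKF cycle implementing (6.1), (6.4) and (6.5).

    f, h   : process and measurement functions
    Jf, Jh : their Jacobians, evaluated as in the text
    Q, R   : process and measurement noise covariances
    """
    # Predict
    x_pred = f(x)
    F = Jf(x)
    P_pred = F @ P @ F.T + Q                       # (6.4)
    # Update
    H = Jh(x_pred)
    S = H @ P_pred @ H.T + R                       # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)            # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))           # (6.1)
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred  # (6.5)
    return x_new, P_new
```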
$$x_k = f(x_{k-1}) + w_k \qquad (6.8)$$

and

$$z_k = h(x_k) + v_k. \qquad (6.9)$$

This modification will enable the system to predict the pose of the subject in the new frame from the pose and its rate of change in the previous frame, improving the state prediction. Taking the time interval between successive frames as unit time, the pose state parameter $s$ and its rate of change are given by,
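The equation itself is truncated in the extraction at this point. A standard first-order, constant-velocity form consistent with the surrounding description would be the following; this is an assumption on our part, not the author's stated model:

$$\begin{bmatrix} s_k \\ \dot{s}_k \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} s_{k-1} \\ \dot{s}_{k-1} \end{bmatrix} + w_k, \qquad \dot{s}_{k-1} = s_{k-1} - s_{k-2}.$$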
6.3 Extension of the Measurement

Work can also be done on upgrading the quality of the silhouettes that are used as the measurement for the EKF. Currently the measurement is created by classifying the pixels of the current raw frame Ik into two classes: foreground and background. The pixels are classified as foreground if they fall outside a β% confidence interval of the mean background frame Bµ, as discussed in Sect. 4.3 of Chapt. 4. This algorithm works well when the subject's presence in the scene does not interfere with the background. However, in many cases the subject is located very close to parts of the background, casting shadows or reflecting ambient light. This produces a number of pixels in the measurement that are erroneously assigned to the foreground. Sparsely distributed erroneous pixels can easily be removed using morphological filters, but patches of erroneous pixels act as attractors for the model, confusing the tracker. The current algorithm shows no robustness against this, as shown in the 3rd image of the 2nd row in Fig. 4.9, where shadows produced by the subject have resulted in the pixels representing the keyboard being classified as part of the foreground.
For these reasons it is proposed that, in addition to Ik, Bµ and the background standard deviation frame Bσ, derivative framesets |∇Ik|, |∇B|µ and |∇B|σ could also be used to facilitate shadow detection, so that pixels may be classified into three classes, not two. Under this scenario, equation (4.25) makes a rough classification of the pixels that belong to the foreground, and subsequently their corresponding values in |∇Ik| and |∇B|µ are compared. If these values are found to be similar, the pixel is classified as a shadow pixel; otherwise it remains in the foreground class. The level of similarity tolerated will clearly be dictated by the corresponding value of |∇B|σ, as before. All this changes equation (4.25) to,
$$e_p = \begin{cases} 0 & \text{if } |i_p - \mu_p| > \beta\sigma_p \text{ and } |\delta i_p - \delta\mu_p| > \beta\,\delta\sigma_p \\ 255 & \text{otherwise} \end{cases} \qquad (6.12)$$

where $\delta i_p$, $\delta\mu_p$ and $\delta\sigma_p$ are elements of $|\nabla I_k|$, $|\nabla B|_\mu$ and $|\nabla B|_\sigma$ respectively.
It is hoped that this extension will reduce the number of pixels mistakenly assigned to the silhouette of the subject, thereby reducing the number of erroneous pixels that appear as patches and helping the tracker converge to the correct state. The increase in the system's computational complexity is affordable, since |∇B|µ and |∇B|σ are only calculated once, while |∇Ik| has to be calculated for each new frame grabbed.
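A minimal sketch of the three-class test in (6.12), vectorised over a whole frame, is given below; the array names are illustrative.

```python
import numpy as np

def classify_pixels(I, grad_I, B_mu, B_sigma, gB_mu, gB_sigma, beta):
    """Per-pixel classification following (6.12): a pixel joins the
    silhouette (value 0) only if both its intensity and its gradient
    magnitude deviate from the background statistics; shadow pixels,
    whose gradients match the background, stay at 255.

    All arrays share the same image shape.
    """
    fg_intensity = np.abs(I - B_mu) > beta * B_sigma
    fg_gradient = np.abs(grad_I - gB_mu) > beta * gB_sigma
    return np.where(fg_intensity & fg_gradient, 0, 255).astype(np.uint8)
```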
Several things can also be said about the initialisation of the model: to date this has not been given much attention. So far, in all of our experiments, the state of the subject's model has been initialised manually at the first frame of the sequence. Even though this does not otherwise degrade the performance of the system, we will try several methods for estimating the subject's initial state at the 1st frame. Particle Filtering (PF) is one of the methods under consideration, in which the whole of the state-space is sampled to investigate whether a high likelihood P(zk|xk) is observed. After a sufficient number of particles (samples) has been drawn (depending on the size of the state-space), the initial state becomes the state that maximises the observed likelihood. The big disadvantage of using a particle-based method to track or initialise the state of the subject in motion capture systems is that the state-space is very large, implying that an enormous number of particles will need to be drawn. Since $P(z_k|x_k) = f(h(x_k))$ and since the evaluation of the measurement prediction is one of the most computationally expensive stages of our system, an initialisation step involving Particle Filtering would be a time-consuming process. For this reason Particle Filters will only be considered with emphasis on finding ways to reduce the state-space, e.g. by partitioning it into a group of smaller spaces. Assuming independence between the different chains c of the model, each chain could then have its own state, and a separate particle filter could be applied to it.
An alternative to particle filtering for the initialisation of the state could be a slight modification of equation (4.16) to have a smaller gradient, such that the linearly-dependent part of the equation extends to greater distances. This decreases the accuracy of the tracker, but at the same time enables it to converge from states which are further away. As the model gets closer to the spatial position of the subject, the gradient of the rendering function in (4.16) is allowed to increase (thereby increasing the system's accuracy) and more estimates for the true state are calculated. Eventually the model's spatial position will coincide with the spatial position of the subject, and the gradient in the rendering function will be the same as in (4.16). By then it is hoped that the model's state estimate will be in the vicinity of the true state, and therefore convergence to the true initial state is achieved.
A third way to initialise the state is to search in 3D space and locate the position of various parts of the subject using templates. To begin with, we search for the position of the head by moving a template of the model's head in 3D space and projecting it onto the image planes of the cameras. The 3D position that projects onto areas of the measurement containing the maximum number of edge foreground pixels (defined as foreground pixels that have at least 1 background pixel neighbour) is chosen as the most likely position for the head. Once the head is located, constraints on the model's structure are taken into consideration to help locate the position of the rest of the body. As an example, the torso is constrained to be attached to the base of the head and is likely to extend vertically downwards. The torso template is therefore initialised vertically below the head, the foreground pixels covered by the projected template as seen from the cameras are counted, and their centroid is located. The torso template is then rotated so that its centroid coincides with the centroid of the pixels within it. A new pixel centroid is calculated and the template is rotated again, until it converges to an orientation that no longer changes. When this happens, a very similar technique is used for the initialisation of the model's limb orientations, again using the constraints imposed by the model.
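A schematic sketch of this centroid-alignment loop is given below; the projection function, pivot and convergence tolerance are illustrative placeholders, not details taken from the text.

```python
import numpy as np

def align_template(project, pivot, angle0, foreground, iters=20, tol=1e-3):
    """Iteratively rotate a torso template about a pivot (e.g. the base of
    the head, as (x, y) pixel coordinates) until its centroid coincides with
    the centroid of the foreground pixels it covers.

    project(angle) -> boolean mask of the projected template; it stands in
    for the camera-projection machinery of Chapter 4.
    foreground     -> boolean silhouette image.
    """
    angle = angle0
    for _ in range(iters):
        mask = project(angle)
        ys, xs = np.nonzero(mask & foreground)   # covered silhouette pixels
        if xs.size == 0:
            break
        tys, txs = np.nonzero(mask)              # all template pixels
        # angles of the two centroids as seen from the pivot
        a_pix = np.arctan2(ys.mean() - pivot[1], xs.mean() - pivot[0])
        a_tmp = np.arctan2(tys.mean() - pivot[1], txs.mean() - pivot[0])
        delta = a_pix - a_tmp
        if abs(delta) < tol:                     # orientation has converged
            break
        angle += delta
    return angle
```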
6.5 Model Acquisition

The technique described in the previous section for the initialisation of the state is due to I. Mikić et al. and has already been used with 3D voxel data in [2]. An extension of this technique allows for model acquisition, during which the size of the model's limbs is defined based on the first few measurements, allowing any subject to be tracked. After converging to the orientation of each body part as explained above, the corresponding template is reduced in size so that its projection only includes foreground pixels. The template is then allowed to grow in all directions, and its orientation is continuously adjusted to make sure that it grows in the right direction. Growing along a certain direction continues until the
$$\bar{d}_p^2 = \left(\frac{\left(\hat{u}_r \times (p_{c,l} - p_{c,l-1})\right) \cdot (c - p_{c,l-1})}{\left|\hat{u}_r \times (p_{c,l} - p_{c,l-1})\right|}\right)^2,$$

where $\hat{u}_r$ is the unit vector in the direction of the ray passing through pixel $p$, and $c$ is the position of the current camera's centre of projection. Equation (4.17) may be re-written as,

$$\bar{d}_p^2 = \frac{\left(n^T (c - p_{c,l-1})\right)^2}{n^T n}, \qquad (A.1)$$
where the normal vector $n$ is given by

$$n = \hat{u}_r \times (p_{c,l} - p_{c,l-1}) = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ u_x & u_y & u_z \\ p_{c,l}^{(x)} - p_{c,l-1}^{(x)} & p_{c,l}^{(y)} - p_{c,l-1}^{(y)} & p_{c,l}^{(z)} - p_{c,l-1}^{(z)} \end{vmatrix}$$

$$= \mathbf{i}\left[u_y\left(p_{c,l}^{(z)} - p_{c,l-1}^{(z)}\right) - u_z\left(p_{c,l}^{(y)} - p_{c,l-1}^{(y)}\right)\right] - \mathbf{j}\left[u_x\left(p_{c,l}^{(z)} - p_{c,l-1}^{(z)}\right) - u_z\left(p_{c,l}^{(x)} - p_{c,l-1}^{(x)}\right)\right] + \mathbf{k}\left[u_x\left(p_{c,l}^{(y)} - p_{c,l-1}^{(y)}\right) - u_y\left(p_{c,l}^{(x)} - p_{c,l-1}^{(x)}\right)\right].$$
That is,

$$n = \begin{bmatrix} n_x \\ n_y \\ n_z \end{bmatrix} = \begin{bmatrix} u_y\left(p_{c,l}^{(z)} - p_{c,l-1}^{(z)}\right) - u_z\left(p_{c,l}^{(y)} - p_{c,l-1}^{(y)}\right) \\ u_z\left(p_{c,l}^{(x)} - p_{c,l-1}^{(x)}\right) - u_x\left(p_{c,l}^{(z)} - p_{c,l-1}^{(z)}\right) \\ u_x\left(p_{c,l}^{(y)} - p_{c,l-1}^{(y)}\right) - u_y\left(p_{c,l}^{(x)} - p_{c,l-1}^{(x)}\right) \end{bmatrix}. \qquad (A.2)$$
$$\frac{\partial \bar{d}_p^2}{\partial X} = \frac{2\left(n^T(c - p_{c,l-1})\right)\left((c - p_{c,l-1})^T \frac{\partial n}{\partial X} - n^T \frac{\partial p_{c,l-1}}{\partial X}\right)}{n^T n} - \frac{2\, n^T \frac{\partial n}{\partial X} \left(n^T(c - p_{c,l-1})\right)^2}{\left(n^T n\right)^2} \qquad (A.3)$$

where $\frac{\partial n}{\partial X}$ is found by differentiating (A.2):
$$\frac{\partial n}{\partial X} = \begin{bmatrix} \frac{\partial n_x}{\partial X} \\[4pt] \frac{\partial n_y}{\partial X} \\[4pt] \frac{\partial n_z}{\partial X} \end{bmatrix} = \begin{bmatrix} u_y\left(\frac{\partial p_{c,l}^{(z)}}{\partial X} - \frac{\partial p_{c,l-1}^{(z)}}{\partial X}\right) - u_z\left(\frac{\partial p_{c,l}^{(y)}}{\partial X} - \frac{\partial p_{c,l-1}^{(y)}}{\partial X}\right) \\[6pt] u_z\left(\frac{\partial p_{c,l}^{(x)}}{\partial X} - \frac{\partial p_{c,l-1}^{(x)}}{\partial X}\right) - u_x\left(\frac{\partial p_{c,l}^{(z)}}{\partial X} - \frac{\partial p_{c,l-1}^{(z)}}{\partial X}\right) \\[6pt] u_x\left(\frac{\partial p_{c,l}^{(y)}}{\partial X} - \frac{\partial p_{c,l-1}^{(y)}}{\partial X}\right) - u_y\left(\frac{\partial p_{c,l}^{(x)}}{\partial X} - \frac{\partial p_{c,l-1}^{(x)}}{\partial X}\right) \end{bmatrix}.$$
We see that $\frac{\partial}{\partial X}\left(\bar{d}_p^2\right) = f(p_{c,l-1}, p_{c,l})$, i.e. that the derivative of the distance measure is a function of the derivatives of each limb's two end-points. Derivatives for these end-points are calculated as per Sect. 4.2 of Chapt. 4.
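For completeness, a direct numpy transcription of (A.1)–(A.2) is given below, against which the analytic derivative (A.3) can be checked numerically; the function name is illustrative.

```python
import numpy as np

def dbar_sq(u_hat, c, p_root, p_leaf):
    """Squared line-to-line distance of (A.1): the ray (c, u_hat) against
    the limb axis through p_root = p_{c,l-1} and p_leaf = p_{c,l}."""
    n = np.cross(u_hat, p_leaf - p_root)   # normal to both lines, as in (A.2)
    g = n @ (c - p_root)
    return (g * g) / (n @ n)
```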
Bibliography
[11] A. Ude, "Robust estimation of human body kinematics from video," in Proc. IEEE/RSJ Conf. Intelligent Robots and Systems, Kyongju, Korea, October 1999, pp. 1489–1494.
[12] A. Ude, "Robust human motion estimation using detailed shape and texture models," Journal of the Robotics Society of Japan, vol. 19, pp. 15–18, July 2001.
[13] R. Urtasun and P. Fua, "3D tracking for gait characterization and recognition," in Proc. IEEE International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, May 2004.
[14] I. Mikić, M. Trivedi, E. Hunter, and P. Cosman, "Articulated body posture estimation from multi-camera voxel data," in Proc. IEEE International Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, December 2001.
[15] R. Plankers and P. Fua, "Articulated soft objects for video-based body modeling," in Proc. International Conference on Computer Vision, Vancouver, Canada, July 2001, pp. 394–401.
[16] M. Isard and A. Blake, "Condensation - conditional density propagation for visual tracking," International Journal of Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.
[17] A. Doucet, N. G. de Freitas, and N. J. Gordon, Eds., Sequential Monte Carlo Methods in Practice, Springer-Verlag, New York, USA, 2001, Statistics for Engineering and Information Science Series.
[18] J. Deutscher, A. Blake, and I. Reid, "Articulated body motion capture by annealed particle filtering," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2000.
[19] J. Deutscher, A. J. Davison, and I. D. Reid, "Automatic partitioning of high dimensional search spaces associated with articulated body motion capture," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, December 2001.
[20] I. A. Kakadiaris and D. Metaxas, "Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, June 1996, pp. 81–87.
[21] D. M. Gavrila, "The visual analysis of human movement: A survey," International Journal of Computer Vision and Image Understanding, vol. 73, pp. 82–98, January 1999.
[22] T. B. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," International Journal of Computer Vision and Image Understanding, vol. 81, pp. 231–268, March 2001.