
Markerless Motion Capture

Paris Kaimakis August 31, 2004

Signal Processing Laboratory


Department of Engineering
University of Cambridge
Declaration

I hereby declare that, except where specifically indicated, the work submit-
ted herein is my own original work. The length of this report is no less than
10000 words and no more than 15000.
Acknowledgements

My sincerest gratitude goes to my supervisor, Dr Joan Lasenby, for all the


support and encouragement that she has provided me with during this year.

This work is dedicated to my parents Mary and Costas, and to my


brother Alkis, for all their care throughout my life.

Finally, this year’s research was made possible through the support from
the Internal Graduate Studentship award of Trinity College, Cambridge.
Contents

1 Introduction
  1.1 About Motion Capture
  1.2 Applications

2 Formulation

3 Literature Review
  3.1 Markerless Systems
  3.2 Marker Technologies

4 Current Work
  4.1 The Measurement Prediction Λ(X)
    4.1.1 Model Definition
    4.1.2 Model Adaptation to State
    4.1.3 The State
    4.1.4 The Rendering Function
  4.2 The Jacobian JΛ(X)
  4.3 The Measurement Dk

5 Results
  5.1 Simulated Data
    5.1.1 Single Cylinder
    5.1.2 Upper Body
  5.2 Real Data

6 Conclusions and Future Work
  6.1 Extended Kalman Filter
  6.2 Shadow Detection
  6.3 Extension of the Measurement
  6.4 State Initialisation
  6.5 Model Acquisition
  6.6 Pattern Recognition
  6.7 Time Management

A Appendix

Bibliography
List of Figures

4.1 Top Row: Two views of the model when it assumes state X. Bottom Row: The set of
chains c defined by the model: a chain is used for each arm, the head and for each lateral
part of the torso.

4.2 Left: Plot of λp(X) vs dp²(X). Right: Illustration of distance measure dp.

4.3 Left: The smallest line-to-line distance dp takes place at a location outside the limb's
length. Right: Example of a ray that produces a d̄p that misses the length of the limb.
In such a case detection of an intersection between ray and cylinder is facilitated by
calculation of d̂p and ďp on either sides of the limb.

4.4 Rendered cylinder under various states X. Notice the discontinuities in pixel intensity
along the cylinder's axis in the image on the top-right.

4.5 Left: Illustration of distance measures d̄p, d̂p and ďp, used to render a cylinder with
hemispherical ends. Right: Distances {dn,p}, n = 1, . . . , Ns, used to render the limb as a
sphere tree.

4.6 The resulting measurement predictions for one limb using the Cylinder with
Hemispherical Ends method.

4.7 Demonstration of the Sphere-Tree algorithm, an algorithm that represents objects of
arbitrary shapes as groups of spheres. The number of spheres used depends on the desired
accuracy. This image is by courtesy of G. Bradshaw and C. O'Sullivan [1].

4.8 Resulting measurement predictions for various X of one limb when rendered as a
sphere tree.

4.9 The pre-processing stage of the system. 1st Row: View from camera 1. 2nd Row: View
from camera 2. Raw framesets Ik are grabbed (column 2), and compared against the mean
background frameset Bµ (column 1) to produce a noisy silhouette image Ek (column 3).
The frameset is then morphologically filtered and smoothed to produce the final
measurement Dk (column 4).

5.1 Silhouette matching on a single limb using a single simulated measurement D and 1
camera. Three rendering methods are compared: 'Cylinder', 'Cylinder with Hemispherical
Ends' and 'Sphere Tree'. Top Left: Λ(X̂(0)) (white) superposed on D (black) using the
'Cylinder' method to render both silhouettes. Bottom Left: Same using the 'Cylinder with
Hemispherical Ends'. Right: Performance of the three rendering methods. The 'Cylinder'
method fails to converge as there is not sufficient gradient information to move the model
along the x-axis.

5.2 Silhouette matching on a simulated dataset using 1 camera. Λ(X̂(i)) superposed on D
for iterations i = 0 (left), i = 7 (middle), i = 120 (right).

5.3 Variation of the negative log-likelihood w.r.t. number of iterations i during silhouette
matching with simulated data on dataset D for the single camera (left) and double camera
(right) experiments.

5.4 Silhouette matching on a real dataset using 2 cameras. 1st Row: View from camera 1.
2nd Row: View from camera 2. Column 1 shows real data D1; columns 2 to 5 show
Λ(X̂1^(i)) superposed on D1 for iterations i = 0, i = 6, i = 10 and i = 25 respectively.

5.5 Motion recovery after successful silhouette matching on a video sequence of real
datasets D1:50 using 2 cameras. 1st Row: Dk as viewed from camera 1; from left to right:
D1, D9, D17, D25 and D33. 2nd Row: Λ(X̂k^(25)) for each corresponding Dk of row 1 from
the same view. 3rd Row: Same Dk as viewed from camera 2. 4th Row: Λ(X̂k^(25)) for each
corresponding Dk of row 3 from the same view.

5.6 Tracking with low resolution data. Top Row: Dk for k = 1, 9, 17, 25 and 33. Bottom
Row: The model's converged state after 25 iterations for each corresponding Dk of the top
row from the same view. The model was rendered with a higher resolution for clarity.
Only views from camera 1 are shown in this illustration.

6.1 Illustration of the model acquisition method using templates. (a) The position and
orientation of the body-part (here the torso) is found. (b) The template shrinks until its
projection contains no background pixels. (c) It is then allowed to grow in all directions,
and its orientation is re-adjusted to make sure that it does not go to the wrong direction.
(d) Growing stops when background pixels start entering the template. This image is by
courtesy of I. Mikić, M. Trivedi, E. Hunter and P. Cosman [2].
1 Introduction
1.1 About Motion Capture

Motion Capture is the domain dedicated to measuring and quantifying the


posture variation of articulated bodies, usually humans. The extraction
of position and orientation parameter values, also known as the state, and
their temporal variation is desirable for several applications in many different
fields, such as the film industry, surveillance, etc.
There are several ways in which an articulated body’s state may be
tracked over time. Some methods use electro-mechanical sensors attached
to the body. The stress/strain on each sensor is measured and then the
measured forces and their resulting bending moments are applied to an
articulated model that approximates the structural dimensions and inertia of
the subject. Despite their accurate results, such methods require expensive
equipment, and are very intrusive since each sensor is of significant size and
has associated wiring.
Due to minimal intrusion on the subject, as well as the minimal presence
of dedicated machinery necessary for its operation, attention has shifted to-
wards Optical Motion Capture, with which articulated body parameters are
estimated using video sequences of the subject. Two main ideologies pre-
vail: marker-based technologies and markerless systems. While the former
introduce markers (IR emitters/detectors in the most advanced systems)
on various positions of the subject’s body with, in the majority of cases, no
wiring involved, the latter completely eliminate any intrusion on the subject
and its freedom to move.
The past two decades have seen great progress in marker-based methods,
resulting in robust systems that can produce accurate results. To date,
the film industry uses marker-based systems to create digitally animated
characters. On the other hand, markerless methods have only seen a decade’s
worth of research. Even though promising results have been reported, these
systems have not provided the same robustness or precision as their marker-
based competitors.

1.2 Applications

Historically, the reason why motion capture was introduced as a research


area in the first place was for the investigation and simulation of human
vision and the way that we make decisions based on what we see. There has
long been an interest in building robots that move, talk and generally
behave like humans. A robust motion capture system would therefore form
an integral component in robotics, improving the way in which a robot
interacts with the world around it.
Even though motion capture systems have seldom been used directly
in robotics, they have found several applications standing on their own,
starting from action recognition. Under this perspective, Motion Capture
provides useful tools for extracting the subject’s way of moving as a tempo-
ral variation of body pose parameters. The output from the motion capture
system is subsequently fed to a higher-level process for the purposes of recog-
nition. Central in this application domain is surveillance, in which people
are tracked in public places like museums, supermarkets and parking lots.
Suspicious or offensive behaviour is automatically detected, and the security
personnel immediately alerted. Action recognition could find applications
in medicine when specialist opinion is not readily available, e.g. for auto-
mated detection of behavioral abnormalities such as autism, recovery from
orthopaedic conditions, or other psycho-motor disabilities.
Another major class of motion capture applications is found in the film
industry, with emphasis given in digital animation. Not even the best an-
imators can create an animated character, whose patterns of motion are
realistic enough to be included in movies, by solely using computer graphics
and drawing tools. For the viewer to be convinced that the animation forms
a realistic part of the movie, an actor has to perform the animated charac-
ter’s desired pattern of motion. The actor’s movements are captured using
a high-precision motion capture system, and are subsequently repeated by
a computer graphics model, which is digitally fleshed out e.g. with green
scaled skin to create Star Wars' Jar Jar Binks. Also, the emerging appli-
cation of digital directing classifies the interesting parts of tens of hours of


video data (e.g. tennis match) and displays them by order of content. Yet
another application employed in the film industry is content-based coding, an
encoding method designed to achieve higher compression rates by detecting
the state of the subject’s body and therefore dedicating higher resolution
to the corresponding pixels of the image, while coarser resolution could be
used for the background.
Finally Optical Motion Capture is used in human-machine interaction
[3] where the state of the body or the hand could be used as a tool for
operating equipment such as PCs, intelligent home devices etc. Taken a step
further, virtual reality can give a higher level of interaction between user and
computer. The user’s temporal state is extracted and virtual forces resulting
from his body’s dynamics are calculated. These forces are then included in
the virtual environment stored by the computer, this way creating human
interaction with the virtual world.
Even though marker-based methods might be used to give satisfactory
and accurate results for the purposes of some of the applications outlined
above (e.g. in medicine, virtual reality), the process of wearing special cloth-
ing and/or markers is generally unpleasant and time consuming, thus favour-
ing the use of markerless systems. Additionally, most of the applications
discussed in this section demand that there is no intrusion to the subject’s
body whatsoever (e.g. surveillance, automated directing, content-based cod-
ing etc.). It is therefore clear why development of robust markerless tracking
algorithms is desirable and how applications for such systems could be a su-
perset of the applications for marker-based systems.
2 Formulation
This chapter will formally outline the fundamentals of model-based track-
ing and what adaptations need to be made for the application of Motion
Capture. Alongside, we present an introduction into how our system is
formulated.
The state vector X_k = \begin{bmatrix} x_1^{(k)} & x_2^{(k)} & \ldots & x_s^{(k)} & \ldots & x_S^{(k)} \end{bmatrix}^T denotes any
possible configuration that the subject’s body may attain at any given time
tk . Let M (Xk ) be a 3D model under some arbitrary state Xk , and whose
appearance closely resembles that of the subject. A projection of the model
onto the image planes of the cameras may be expressed as Λ (M (Xk )).
Since only one model is generally considered, the notation Λ (Xk ) will be
used for the model’s projection.
The current state Xk of the subject’s body depends on all the previous
states that the subject has attained so far, meaning that the current position
and orientation of the subject’s body always depends on the way that the
body has behaved in the past (e.g. how fast it moved, with what acceleration
and so on). On the other hand, the subject’s state also depends on any
external inputs u that might be present, e.g. decisions that the subject has
made in the past (stop running, start jumping etc.). This could be expressed
as

Xk = f (Xk−1 , . . . , Xk−∞ , uk−1 , . . . , uk−∞ ) ,

where f is the process function, a non-linear function relating the previ-


ous states and external inputs to the current state. A fair approximation to
the above is to express the state as a Markov chain with no external input,
i.e.

Xk = f (Xk−1 ) + wk , (2.1)

which is known as the process equation. The term f (Xk−1 ) is the state
prediction (also referred to as the a priori state estimate) and wk is the
process noise. Assuming a suitable model, the model’s projection onto the
image planes of the cameras Λ(Xk ) adequately describes the projection of
the subject Dk , as indicated by the measurement equation:

Dk = Λ(Xk ) + Nk (2.2)
where Dk is the measurement (also referred to as the data) and Λ the
non-linear measurement function. Λ(Xk ) is the measurement prediction and
Nk the measurement noise.
Both process and measurement noise are assumed to follow a zero-mean
Gaussian distribution, i.e.

wk ∼ N (0, Q) and Nk ∼ N (0, R) (2.3)

where Q and R are the process and measurement covariance matrices


respectively.
The state of the model for the current frame is changed according to the
process equation by using an approximation to the process function. Then,
the state is further adjusted using feedback from the measurement equation.
Once the state-space is searched adequately, both equations are satisfied and
the optimal state is extracted.
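
As a purely illustrative sketch (in Python, not part of the system described in this
report), the interplay of the two equations can be viewed as a generic predict-correct
loop; the routines f, render and correct below are hypothetical placeholders for the
process function, the measurement function, and whichever state-correction scheme a
particular system adopts:

    def track(D_frames, X0, f, render, correct):
        """Generic model-based tracking loop, sketched from equations (2.1) and (2.2).

        D_frames : iterable of measurements D_k
        X0       : initial state estimate
        f        : process function, X_k = f(X_{k-1}) + w_k            (2.1)
        render   : measurement function Lambda, D_k = Lambda(X_k) + N_k (2.2)
        correct  : routine refining the predicted state from the innovation
        """
        X = X0
        trajectory = []
        for D_k in D_frames:
            X_pred = f(X)                       # a priori state estimate
            innovation = D_k - render(X_pred)   # feedback from the measurement equation
            X = correct(X_pred, innovation)     # a posteriori state estimate
            trajectory.append(X)
        return trajectory
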
According to the measurement equation, the robustness of a motion cap-
ture system depends on the ability of the measurement prediction Λ(Xk ) (i.e.
the model’s projection to the cameras) to represent the measurement Dk .
On the other hand, the robustness also depends on the choice of process
function f assumed by the system, as indicated by the process equation.
In general, the more sophisticated these are, the greater the tracking abil-
ity of the system will be, at the expense of implementation difficulty and
computational complexity. Hence, according to their intended application,
different systems use different features from the video sequences to extract
the measurement, and different ways for projecting the model and predicting
the state.
3 Literature Review
The great majority of motion capture systems use an underlying 3D skeletal
model of the subject in order to track the data, while a few introduce 2D
models. In general, 3D models have to simulate the nature of the subject
and be as close to it in appearance as possible, while 2D models need to
make a good approximation of how the subject projects to the cameras.
The deformations that the subject’s projection suffers under perspective
projection are complex (limbs appear to change lengths, parts of the body
are occluded etc.) meaning that any 2D models are not capable of tracking
arbitrary movements. Systems that use 2D models [4, 5] usually make as-
sumptions of the type of motion observed but are generally unreliable. On
the other hand, introduction of 3D models is a very popular technique, as
it naturally follows the process and measurement equations, yielding better
results.
The model's state parameters are allowed to change and the model is
continuously observed through the measurement function and compared to
the measurement. The state-space is searched until the measurement pre-
diction’s deviation from the measurement (data) is small. The quality of
the fit of course depends on several factors, among which is the complexity
of the model and how close it approximates the subject, the measurement
function under which the model is observed, and the quality and nature of
the measurement.
Very accurate models of the human body can be built using computer
graphics techniques and superior rendering methodologies [6] to assure ac-
curate results, but such systems can be computationally very expensive,
and tracking of a few seconds of footage could take days to process. There
is therefore a big trade-off between high model complexity and quality of
the data used, versus low processing time and cheap equipment.
Most systems avoid using the sequence of raw frames, as they are too
large to store and carry too much useless information. For this reason there
always is a pre-processing stage aiming to reduce the size of the data while
preserving its information content as much as possible. Many systems use


edges extracted from the raw footage [7], or transformations of these such
as the chamfer distance [8] etc. Other systems use monochrome silhouettes
[9], while many use optical flow data [10]. The textured silhouette of the
subject is also used by systems that model the subject’s clothes, as well as its
pose [11, 12]. Instead of creating measurement predictions by projecting the
model onto the cameras and therefore making comparisons between model
and data in 2D, some systems extract a 3D measurement such as correlation
[13] or voxel [14] data making the comparison in 3D. Finally, some systems
use a combination of features from the raw images to create the measurement
[15]. Further than just the nature of the data used, the system’s accuracy
in tracking depends on how precise the data is, which is affected by the
precision of the cameras, or the technique used to extract the data from the
raw footage.
Another issue that is very important in the design of motion capture
systems is the way in which the state-space is searched for the global opti-
mum which will bring the measurement prediction (i.e. the appearance of
the model) as close to the data as possible. The posterior distribution (or
a fitness function that approximates, or derives from it) is evaluated to as-
sess the goodness of each state estimate. The problem is that the posterior
distribution is usually multimodal because of the articulated nature of the
model, and often involves narrow peaks depending on the nature of the
data used. Furthermore, for highly articulated bodies such as humans, the
posterior distribution occupies a very large state space.
While some systems rely on stochastic approaches such as the Particle
Filter (PF) [16] or Monte Carlo Sampling (MCS) [17] to handle the multi-
modality of the posterior distribution, such systems are incapable of running
in real time given the processing capabilities of today’s computers. To cir-
cumvent this many variations of the traditional PF have been developed
[18] and significant effort has been expended in an attempt to decrease the
state-space [19], without too much success. On the other hand, most sys-
tems use a deterministic approach, which is usually a gradient-based search
in the state-space. Such systems usually approximate the posterior with a
unimodal function and try to converge to its maximum. These algorithms
tend to be faster, but show no robustness when the state lies in the vicinity
of a non-optimal local mode of the posterior distribution. A smaller class of
systems [20] makes associations between points lying on the model’s surface
and the data in order to apply forces of attraction between the two, this
way making the model's projection move towards the data.
Some systems embed prior distributions for the state in order to make
the effect of the posterior distribution’s local modes less severe. In several
applications a restricted number of types of motion need to be tracked, and
in such cases motion models are introduced. State predictions can therefore
be made according to the motion model, this way helping the tracker
follow a greater change in state. Despite the fact that we now have better
results when the a priori expected types of motion are observed, motion
models limit the system's ability to track other types of
motion. Recently, systems falling into this category have been designed
that adapt the parameters of the temporal model with time, allowing it to
lock onto a different motion. Although this greatly enhances the capabilities
of such systems, motion models are generally only applied when there is no
need to track arbitrary motion.
Because of the various design issues that each motion capture system
deals with, several reviews have been published on the subject, dividing
systems into several taxonomies [21, 22]. In the following sections we present
a review of several systems on motion capture.

3.1 Markerless Systems

Among the first researchers to introduce an underlying 3D model to recover


the state of the subject was Hogg [23]. He used image subtraction to make a
rough localisation of the subject and extracted the edges within the subject’s
vicinity. These were then compared with the edges of the model’s projection,
and the deviation between the two was minimised in an effort to fit the model
to the data. Even though the people tracked by his system are restricted to
walking in a direction parallel to the camera (this way dramatically reducing
the dimensionality of the state-space) he successfully showed the effect of
the use of a 3D model for articulated body tracking.
In a similar fashion Rehg and Kanade [24] used a model of a human
hand to achieve hand tracking. Their model consists of truncated cylinders
with hemispherical ends to simulate the appearance of the phalanges and
fingertips. They use raw video footage to extract edges lying close to the
edges of their model’s projection, and use a least squares approach to fit
the model to the data. With their highly articulated model (27 degrees of
freedom were quoted) they manage to track complex hand deformations in
real-time, but dealing with self-occlusions is treated separately in [25].
Gavrila and Davis [8] proposed a system to simultaneously track multiple
subjects using chamfer distance maps applied to edge data. In their system
they make assumptions of subjects wearing tight-fitted coloured clothes to
facilitate the edge-extraction algorithm. The state-space for each individual
tracked is searched by first making a prediction for the individual’s state that
is as accurate as possible. Then the state variables are allowed to change
in order to improve the cost function, but this is done in a hierarchical
fashion: the head and torso are searched first while the rest of the body
is frozen in its predicted state, then the arms are searched, and finally the
legs. Further down their state-space hierarchy, each arm (or leg) is
searched by fitting the upper arm (thigh) first, while the forearm (calf) is
frozen in its state prediction. Finally the state of the forearm (calf), which
stands at the lower end of the state-space hierarchy, is allowed to change
in order to improve the fitness function. This hierarchy in search results in
the decomposition of the state-space into a set of smaller spaces, this way
making the search easier. They claim that this decomposition is paramount
since several individuals are tracked simultaneously, leading to an otherwise
very large state-space. The search method itself is a best-first search, i.e.
the state-space is searched in the vicinity of the state that yields the
best evaluation of the cost function. Then, given the best state estimate so
far, the new state estimate is found by incrementing pre-set values on each
of the state-variables, as dictated by the state-space decomposition method
explained above. Since best-first search implies that the final converged
state is very close to the state prediction, their system relies heavily on an
accurate state prediction, which is calculated by considering a 2nd order
kinematic model and the series of states previously attained by the subject.
Kakadiaris and Metaxas [26] use physics-based tracking techniques to fa-
cilitate tracking. They use a contour representation of the subject, extracted
by three orthogonally-positioned cameras. A kinematic model is used to pre-
dict the state in the current frame, then 2D forces of attraction are applied
on the model from each point of the occluding contour, making the model’s
projection move towards the data. They use principles from projective ge-
ometry to associate each point of the occluding contour with a point of the
model, given the model’s predicted position. The set of forces are used as
the measurement for an Extended Kalman Filter (EKF) [27, 28, 29], which
adapts the state in order to minimise the net force applied on the model.
One of the contributions of their work is that they also propose a formal
method to decide which cameras used by their system will be active while
processing each frame. To make this decision they consider the visibility
of each limb of the model and the observability of its predicted motion as
viewed by each camera.
Delamarre and Faugeras [30] use 3D reconstructed data obtained by cor-


relation between a pair of cameras. Correlation data is an option in systems
where the cameras are close to each other and have similar orientations with
respect to the subject so that the raw images captured are similar. The re-
sulting measurement is a noisy set of points resting on the surface of the
subject that is visible by the cameras. Like [26], they propose a method
that applies physical forces to move the model towards the data, but they
associate each correlation point to the point on the model that lies closest to
the sample, making the data association algorithm less rigorous. A force of
attraction proportional to the distance between them is applied between the
two points. For each limb of the model a resultant force is calculated and
then the set of dynamical equations for all the limbs is solved (always obeying
the model’s constraints of articulation), in order to give a resultant force for
the whole model and a bending moment for each of its limbs. This allows
the model to move towards the reconstruction data, making the system able
to track big variations in the subject’s state, such as when the subject is
running. In [31] they relax the constraint of the cameras having to be close
to each other and modify their algorithm to use 2D silhouettes instead of 3D
correlation points to track people using the same physical force framework.
Urtasun and Fua [32] use 3D correlation data which are automatically
extracted using a DigiclopsTM system. Each of the limbs of the model is an
ellipsoid, which they represent implicitly as a quadric. Around each limb
they apply a field (that they call a metaball ), which is a Gaussian function
with respect to the ellipsoidal distance from the limb. Their model’s skin
is therefore a contour of constant metaball value for all the limbs in the
model. The value of the metaball for each position of the correlation point-
cloud is evaluated and gradient descent is used to fit the model's skin on
the data. Their system uses temporal models to track pre-set motions. By
allowing the motion model to adapt during tracking, their system can also
track transitions from one motion class to another. Their motion model
is trained with a motion database containing 4 samples (subjects performing
the motion) per motion class.
Plänkers and Fua [33] also use correlation data, but this time they use
a far more complex and realistic model consisting of several metaballs per
limb, this way simulating each of the subject’s muscle groups. Unlike [32],
they are interested in tracking arbitrary movements and therefore use no
temporal models. Apart from tracking the subject, their system allows
adaptation of the size and shape of each metaball in order to track muscle
contractions or even to allow for subjects of varying body sizes and shapes.
With their results they claim that determining the size of each metaball
as well as its position and orientation is an under-constrained problem if


only the noisy correlation point-cloud is to be used. They show that such a
system would erroneously fit the model further away from the subject’s true
spatial position, while the model’s metaballs are enlarged to compensate.
To avoid this they combine their data with information extracted from the
subject’s silhouette.
Stenger et al. [34] also propose a very realistic model of a human hand
that consists, this time, of a set of truncated quadrics. They aim to ap-
ply hand tracking for the purposes of human-computer interaction, thereby
allowing them to limit the dimensionality of their model’s state to just 7 de-
grees of freedom. They use projective geometry to find analytical solutions
for the occluding contours of the model’s projection to the cameras [35]. For
tracking they use an Unscented Kalman Filter (UKF) [36], which they claim
provides better accuracy than the EKF while being simpler to implement.
As a measurement for the UKF they use edges lying in the vicinity of their
model’s occluding contour. Under the constraints they have set concerning
the subject’s state, their results are precise and close to real-time.
Mikić et al. [2] have obtained impressive results by using a relatively
simple model on 3D voxel data. They use a set of four cameras to ex-
tract precise silhouette data by employing a reliable algorithm [37] that
incorporates shadow detection, and then obtain voxel data using the octree
algorithm [38]. Extracting voxel data makes it possible for the system to
make an association between each voxel and each body part. Once all the
voxels have been assigned to body parts, a subset of the voxels is selected
to be the measurement of the EKF filter. The measurement is therefore a
small collection of 23 voxel 3D positions and the EKF uses several iterations
to converge to the correct state. Apart from achieving very good tracking
results, their system also consists of a state initialisation step that combines
model acquisition.
Bregler and Malik [10] used optical flow to achieve tracking in a least-
squares minimisation framework. Their model uses the twists and products
of exponential maps framework for kinematic chains, that was originally for-
mulated in [39]. The same framework was used by Drummond and Cipolla
[40], this time using edge data, and by Mikić et al. [2] using silhouettes.
Deutscher et al. [18] use a particle-based algorithm as a way to avoid
the state getting trapped in a local mode of the (multimodal) posterior dis-
tribution. In an effort to reduce the number of samples drawn (and hence
to increase the speed of the system), which is dictated by the dimensionality
of the state-space, they propose the Annealed Particle Filter (APF). Their
method uses particles in order to optimise the posterior distribution several
times, each time corresponding to one annealing run. The particles are al-
lowed to move with decreasing freedom as the level of the annealing run
increases. Greater freedom allows the particles to jump from one local opti-
mum to another, while at the final layer they are effectively trapped in one
of the maxima, which is hoped to be the global optimum. Their results are
very good, but the processing cost is extremely high compared with other
systems (an hour’s processing time for 5 seconds of footage was quoted).
Even though a method was proposed in [19] to partition the state-space in
order to improve the speed, their results remain precise yet slow to obtain.

3.2 Marker Technologies

In the past two decades many systems have been developed for achieving
motion capture making use of marker technology, which imposes restrictions
on the clothing of the subject. In most cases lycra clothing has to be worn,
and IR emitters/reflectors have to be attached all over the subject’s body,
usually close to the joints. Even though these methods have limited porta-
bility, with filming taking place in dedicated studios, they have proven to
yield accurate results and are widely used in the film industry for precise
animation of digital characters.
Design of marker technology systems for motion capture considers very
similar principles as their markerless counterparts: (i) there is still a need
to collect data, which in this case is the projection of markers’ position in
space, (ii) an underlying model for the subject still needs to be used, and
(iii) the state-space of the model still needs to be searched for the state that
gives optimal fit between the model’s projection and the data.
In addition to the above, since the number of markers used by such
systems is finite, the expense of one of these markers being occluded is high
and could seriously confuse the system. Marker systems therefore face the
problem of data association that was not explicitly tackled by markerless
approaches. Each marker data point now needs to be associated to the
projection of the correct marker of the model, this way identifying which
markers have been occluded. For completeness, a few systems are outlined
below.
Ringer and Lasenby [41, 42] use Multiple Hypothesis Tracking (MHT) to
solve the marker association problem. For each frame grabbed they main-
tain several association hypotheses, whose validity is revealed on-line. As the
subject moves to different states, some association hypotheses are scrapped


and the most likely one is taken to be the true association based on which
tracking takes place. This robust association method helps the system iden-
tify occlusions and recover from them once the occluded markers appear
again. The system uses a kinematic model to predict the pose of the subject
in each frame, and an EKF to fit the state to the measurement.
Choo and Fleet [43] also fit a 3D model onto the 2D marker data, but this
time their tracking technique is a stochastic one. They use Hybrid Monte
Carlo (HMC) to obtain samples from the posterior distribution. To circum-
vent problems arising from the number of samples that need to be taken,
they use multiple Markov chains to explore the state-space. An improve-
ment factor of several orders of magnitude over traditional particle filtering
methods is quoted.
While the majority of marker-based systems use passive markers as
their measurement, newer techniques such as PhaseSpace [44] introduce ac-
tive ones. This way the need to correctly associate the markers to the data is
avoided, at the expense of more sophisticated marker equipment, since each
marker now emits modulated light containing the marker’s identity. In gen-
eral, marker-based motion capture has shown great potential for precision,
and as a result several systems are now commercially available [44, 45].
4 Current Work
The formulation for our system has been introduced in Chapt. 2, with
attention given to the process and measurement equations (2.1) and (2.2).
In our work so far we have assumed that the rate of frame extraction by
the cameraset is high enough, meaning that the subject’s state in successive
frames does not change much. To date, our system makes no investigation
of the way the subject has moved in the past, and we will therefore not
consider the process equation further in this chapter.
Since no previous states are considered, the state at time tk is denoted by
X for simplicity, and (2.2) becomes,

Dk = Λ(X∗ ) + Nk , (4.1)
where X∗ is the true pose of the subject. For our system’s case the
measurement Dk is a silhouette of the subject extracted from the set of
N frames captured by the cameras at time tk , and Λ(X) is the silhouette
produced by rendering M(X), the 3D model under state X. Nk describes
all the discrepancies between the model’s silhouette and the data. Since N
cameras of pixel resolution U × V are used, Λ(X), Dk and Nk are all column
vectors of size N U V , e.g.,

\Lambda(X) = \begin{bmatrix} \lambda_1(X) & \lambda_2(X) & \ldots & \lambda_p(X) & \ldots & \lambda_{NUV}(X) \end{bmatrix}^T \qquad (4.2)

where λp is the intensity of the pth pixel.


Our task is to find the best estimate for the true state X∗ of the sub-
ject. Adopting a Bayesian framework, we choose X̂M AP , the maximum a
posteriori estimate, i.e. the state that maximises the posterior probability
distribution:
\hat{X}_{MAP} = \arg\max_X P(X \mid D_k) \qquad (4.3)
where, using Bayes’ theorem,
P(X \mid D_k) = \frac{P(D_k \mid X)\, P(X)}{P(D_k)} \qquad (4.4)

with P(Dk |X) being the likelihood, P(X) the prior distribution, and
P (Dk ) the evidence.
Regarding the evidence as a normalising constant and assuming no prior
knowledge about the position of the subject’s body in the k th dataset, the
prior distribution becomes flat, and we therefore need to maximise the like-
lihood, this way using the maximum likelihood estimator:
\hat{X}_{ML} = \arg\max_X P(D_k \mid X). \qquad (4.5)

Furthermore we assume that the intensity of the noise on each pixel in


the measurement is independent of and identically distributed with the noise
on neighbouring pixels, giving a diagonal measurement covariance matrix R
for Nk given by

R = σn2 IN U V , (4.6)
where IN U V is the identity matrix of size N U V and σn2 is a constant.
This gives,

P(D_k \mid X) = \left(2\pi\sigma_n^2\right)^{-\frac{NUV}{2}} \exp\left\{ -\frac{1}{2\sigma_n^2} \left(D_k - \Lambda(X)\right)^T \left(D_k - \Lambda(X)\right) \right\} \qquad (4.7)

and we therefore need to minimise the negative log-likelihood function,


i.e. the (squared) distance between the measurement and its prediction:
\hat{X}_{ML} = \arg\min_X \left(D_k - \Lambda(X)\right)^T \left(D_k - \Lambda(X)\right) \qquad (4.8)

To converge to the maximum likelihood estimator X̂M L that minimises


(4.8) we make iterative use of the Gauss-Newton method for Non-Linear
Least Squares. For each iteration i, the maximum likelihood estimate for
the new state in dataset k is given by

\hat{X}^{(i)} = \hat{X}^{(i-1)} + K \left( D_k - \Lambda(\hat{X}^{(i-1)}) \right) \qquad (4.9)

where

K = \left( J_\Lambda(\hat{X}^{(i-1)})^T J_\Lambda(\hat{X}^{(i-1)}) \right)^{-1} J_\Lambda(\hat{X}^{(i-1)})^T \qquad (4.10)
and JΛ(X) = ∂Λ(X)/∂X is the Jacobian of the model's silhouette.
Equations (4.9) and (4.10) show that the tracker needs a new measurement
for every frame k grabbed by the camera system, and that the model is
rendered and its Jacobian calculated for each iteration i. The remainder
of this chapter will give details about how the model is constructed and
rendered in order to give the measurement prediction, how its Jacobian is
calculated, and how we extract the measurement from the camerasets.
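
For illustration only, the update of (4.9) and (4.10) can be sketched in Python as
follows, assuming hypothetical routines render(X) and jacobian(X) that return Λ(X) as a
length-NUV vector and JΛ(X) as an NUV × S matrix; solving a linear least-squares
problem avoids forming the inverse in (4.10) explicitly:

    import numpy as np

    def gauss_newton_fit(D, render, jacobian, X0, n_iters=25):
        """Iteratively minimise ||D - Lambda(X)||^2, as in (4.8)-(4.10).

        D        : measurement, flattened to a vector of length N*U*V
        render   : function returning the measurement prediction Lambda(X)
        jacobian : function returning J_Lambda(X), of shape (N*U*V, S)
        X0       : initial state estimate, length S
        """
        X = X0.astype(float)
        for _ in range(n_iters):
            residual = D - render(X)           # D_k - Lambda(X^(i-1))
            J = jacobian(X)
            # delta = (J^T J)^{-1} J^T residual, computed via a least-squares solve
            delta, *_ = np.linalg.lstsq(J, residual, rcond=None)
            X = X + delta                      # equation (4.9)
        return X
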

4.1 The Measurement Prediction Λ(X)


Equation (4.1) implies that the quality of the measurement prediction Λ(X)
determines the level to which the measurement Dk is described at each
frame, and therefore the robustness of the tracker. The measurement pre-
diction itself depends on the 3D structure of the model M, the state X that
it attains and the measurement function Λ (also referred to as the render-
ing method ) that projects the model to the image planes of the cameras.
In other words, X determines the position and orientation of M to give
M(X), which is then rendered to give the final measurement prediction,
Λ (M (X)). For brevity we will be using Λ(X) for the remainder of this
report to represent the measurement prediction.
In this section we will formally define the structure of the model, explain
how the state encodes information about the model’s position and orienta-
tion, describe how the model configures itself according to the state, and
discuss issues concerning how to project the model to the cameras in order
to get the final measurement prediction.

4.1.1 Model Definition


The model M is a 3D structure defined as a set of C articulated chains that
simulate the articulated nature of the human body, and can assume any
position and orientation encoded by the state X. In our current model of
the upper body C = 5, and there is one chain per arm, one for the model’s
head and one for each lateral side of the torso, as shown in Fig. 4.1. All
chains grow from the same root joint, which in our case is the centre of the
torso’s base.
Each chain c of the model consists of Lc inextensible limbs, e.g. L1 = 4
for the model’s right arm. Numbering starts from the root of the chain going
upwards, e.g. limb 2 identifies the right shoulder and so on. For each chain
Figure 4.1: Top Row: Two views of the model when it assumes state X.
Bottom Row: The set of chains c defined by the model: a chain is used
for each arm, the head and for each lateral part of the torso.

c and for each limb l we also define a limb radius rc,l , limb length sc,l , and
initial (X = 0) limb orientation, expressed as a unit vector v̂c,l . Note that
chains 3 and 4 are rigidly attached to, and therefore constrained to always
have the same orientation as, limb 1 of chains 1, 2, and 3.
Put in a more formal context, the definition of the model M includes
definitions for the following:

C                                             the number of chains included in the model,

{Lc}, c = 1, . . . , C                        the length of each chain c,

{v̂c,l}, l = 1, . . . , Lc , c = 1, . . . , C   the initial orientation of each limb l in each chain c,

{rc,l}, l = 1, . . . , Lc , c = 1, . . . , C   the radius of each limb l in each chain c, and

{sc,l}, l = 1, . . . , Lc , c = 1, . . . , C   the length of each limb l in each chain c.
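
As an illustration only, these definitions could be collected in a simple container;
the Python classes and field names below are hypothetical and are not part of the
system's actual implementation:

    import numpy as np
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Limb:
        radius: float        # r_{c,l}
        length: float        # s_{c,l}
        v_init: np.ndarray   # initial (X = 0) limb orientation, a unit 3-vector

    @dataclass
    class Chain:
        limbs: List[Limb] = field(default_factory=list)    # L_c = len(limbs)

    @dataclass
    class Model:
        chains: List[Chain] = field(default_factory=list)  # C = len(chains)

    # A toy single-chain example: two limbs, both 5 cm in radius.
    arm = Chain([Limb(radius=0.05, length=0.30, v_init=np.array([0.0, 0.0, 1.0])),
                 Limb(radius=0.05, length=0.25, v_init=np.array([0.0, 0.0, 1.0]))])
    model = Model(chains=[arm])
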
4.1.2 Model Adaptation to State

When the model assumes a state X, the state-dependent orientation of limb


l in chain c depends not only on its own rotation as dictated by the state,
but also on the rotation of all the limbs m ≤ l of the same articulated chain
down to the trunk (torso). To put this formally,

à l
!
Y

v̂c,l (X) = Rc,1 (X)Rc,2 (X) . . . Rc,l (X)v̂c,l = Rc,m (X) v̂c,l (4.11)
m=1

where v̂∗c,l(X) is the unit vector representing the orientation of the limb l
of chain c under state X, and Rc,m (X) are rotation matrices parameterised
by the state, which are formally defined by (4.15) in Sect. 4.1.3. It should be
noted that the ordering of rotations Rc,m (X) is important and that the Π-
notation used above preserves their ascending order w.r.t. their limb index
m.
The 3D position pc,l (X) of the leaf joint (i.e. the joint further away from
the chain’s root) of each limb l in chain c now depends on the orientations
v̂∗c,m(X) of all the limbs m ≤ l, but also on the length sc,m of each of these
limbs, and the 3D position of the global root joint, xsp (X), which is directly
defined by the state in (4.13) and (4.14) of Sect. 4.1.3.

p_{c,l}(X) = x_{sp}(X) + s_{c,1} \hat{v}^*_{c,1}(X) + s_{c,2} \hat{v}^*_{c,2}(X) + \ldots + s_{c,l} \hat{v}^*_{c,l}(X) = x_{sp}(X) + \sum_{m=1}^{l} s_{c,m} \hat{v}^*_{c,m}(X)

or, substituting from (4.11),

p_{c,l}(X) = x_{sp}(X) + \sum_{m=1}^{l} s_{c,m} \left( \prod_{n=1}^{m} R_{c,n}(X) \right) \hat{v}_{c,m} \qquad (4.12)

Since the root joint of every limb l is the leaf joint of limb l − 1 in the
same chain, pc,l and pc,l−1 are the end-point positions of limb l. The set of
3D positions of all the leaf joints {pc,l}, l = 1, . . . , Lc , c = 1, . . . , C, therefore contains all the
necessary information for representing M(X) under any arbitrary state X.
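
A minimal sketch of how (4.11) and (4.12) might be evaluated for one chain, assuming
that the per-limb rotation matrices Rc,l have already been computed from the state (the
function and argument names below are hypothetical):

    import numpy as np

    def chain_joint_positions(x_sp, rotations, lengths, v_inits):
        """Leaf-joint positions p_{c,l} of one chain, following (4.11)-(4.12).

        x_sp      : 3-vector, position of the model's root joint
        rotations : list of 3x3 matrices R_{c,1} ... R_{c,Lc} for this chain
        lengths   : list of limb lengths s_{c,l}
        v_inits   : list of initial unit orientations v_{c,l}
        """
        positions = []
        R_chain = np.eye(3)            # running product R_{c,1} ... R_{c,m}
        p = np.asarray(x_sp, dtype=float)
        for R, s, v in zip(rotations, lengths, v_inits):
            R_chain = R_chain @ R      # preserves the ascending order of (4.11)
            v_star = R_chain @ v       # state-dependent limb orientation
            p = p + s * v_star         # accumulate as in (4.12)
            positions.append(p.copy())
        return positions
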

4.1.3 The State


The model is assumed to be an articulated chain of inextensible limbs. Suf-
ficient information about the model’s position and orientation is therefore
given by a parameterisation for (i) the spatial offset for the model’s root,
and (ii) the rotation for every limb.
In our state vector X, the first three coefficients form the spatial offset
vector xsp of the body’s root or centre of mass (in our case the base of the
torso). The rest of the coefficients in X group up in fours and each of these
groups describes the rotation of a single limb. In other words,

X = \begin{bmatrix} x_{sp}^T & q_{1,1}^T & \ldots & q_{1,L_1}^T & q_{2,1}^T & \ldots & q_{2,L_2}^T & \ldots & q_{c,l}^T & \ldots & q_{C,L_C}^T \end{bmatrix}^T \qquad (4.13)

where

   
x_{sp} = \begin{bmatrix} x \\ y \\ z \end{bmatrix}, \quad \text{and} \quad q_{c,l} = \begin{bmatrix} q_0 \\ q_x \\ q_y \\ q_z \end{bmatrix} = \begin{bmatrix} \cos\frac{\theta}{2} \\ u_x \sin\frac{\theta}{2} \\ u_y \sin\frac{\theta}{2} \\ u_z \sin\frac{\theta}{2} \end{bmatrix}. \qquad (4.14)
In (4.14) above q0 etc. is used for brevity instead of q0,c,l , and likewise
for θ, ux etc.
Each quartet qc,l of state rotation coefficients as defined in (4.14) con-
stitutes a quaternion [46] and parameterises the rotation of limb l in chain
c about a unit axis û = [ux uy uz ]T and through an angle θ. Even though
each rotation is now described with 4 parameters, these are constrained in
that |qc,l | = 1. The new estimate for the state is calculated using (4.9)
and (4.10), and subsequently all the quaternions are normalised, effectively
resulting in 3 degrees of freedom per limb rotation.
Euler angles ( qc,l = [φx φy φz ]T ) could have been used to parameterise
the limbs’ rotations, resulting in a smaller size for the state vector X. Such
a parameterisation, however, suffers from discontinuities in qc,l even during
small variations of the limb’s orientation [47]. Using quaternions, on the
other hand, smooths the state-space for X thus eliminating local minima
and singularities.
The state-dependent limb rotations {Rc,l(X)}, l = 1, . . . , Lc , c = 1, . . . , C, of equations
(4.11) and (4.12) are 3 × 3 rotation matrices resulting from the quaternions
{qc,l}, l = 1, . . . , Lc , c = 1, . . . , C, stored in the state. For the quaternion as defined in (4.14)

the resulting rotation matrix is given by,



Figure 4.2: Left: Plot of λp (X) vs d2p (X). Right: Illustration of distance
measure dp .

 
R_{c,l}(X) = \begin{bmatrix} q_0^2 + q_x^2 - q_y^2 - q_z^2 & 2(q_x q_y - q_0 q_z) & 2(q_x q_z + q_0 q_y) \\ 2(q_x q_y + q_0 q_z) & q_0^2 - q_x^2 + q_y^2 - q_z^2 & 2(q_y q_z - q_0 q_x) \\ 2(q_x q_z - q_0 q_y) & 2(q_y q_z + q_0 q_x) & q_0^2 - q_x^2 - q_y^2 + q_z^2 \end{bmatrix}. \qquad (4.15)
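
For illustration, a sketch of (4.15) together with the quaternion renormalisation
mentioned above; the function name is hypothetical, and the example at the end simply
rotates by 90° about the z-axis:

    import numpy as np

    def quat_to_rotation(q):
        """Rotation matrix of equation (4.15) from quaternion q = [q0, qx, qy, qz]."""
        q0, qx, qy, qz = q / np.linalg.norm(q)   # enforce |q| = 1 before use
        return np.array([
            [q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz),             2*(qx*qz + q0*qy)],
            [2*(qx*qy + q0*qz),             q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
            [2*(qx*qz - q0*qy),             2*(qy*qz + q0*qx),             q0*q0 - qx*qx - qy*qy + qz*qz]])

    # A rotation through theta = 90 degrees about the unit axis u = [0, 0, 1]:
    theta = np.pi / 2
    q = np.array([np.cos(theta / 2), 0.0, 0.0, np.sin(theta / 2)])
    R = quat_to_rotation(q)    # maps [1, 0, 0] to approximately [0, 1, 0]
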

4.1.4 The Rendering Function


Knowing {pc,l}, l = 1, . . . , Lc , c = 1, . . . , C, we now need to create a measurement prediction
Λ(X) that is to be compared with the measurement Dk . Since every pc,l ∈
R3 , and since Dk ∈ R2 , the rendering function will clearly have to be a
projection process Λ : R3 → R2 , simulating the perspective nature of the
system’s real cameras and the maintaining as much of the information about
the state as possible.
Equation (4.10) suggests that Λ(X) has to be differentiable, and for this
reason we have chosen to render each limb of the model as a cylinder whose
colour is a function of the distance from the limb axis, thus allowing for a
finite gradient to be calculated. The intensity of each pixel belonging to
Λ(X) is therefore given by


\lambda_p(X) = \begin{cases} 255 & \text{if } d_p(X) > \frac{3}{2} r_{c,l} \\ \frac{255}{r_{c,l}^2}\, d_p^2(X) - \frac{255}{2} & \text{if } \frac{1}{2} r_{c,l} \le d_p(X) \le \frac{3}{2} r_{c,l} \\ 0 & \text{if } d_p(X) < \frac{1}{2} r_{c,l} \end{cases} \qquad (4.16)
Figure 4.3: Left: The smallest line-to-line distance dp takes place at a


location outside the limb’s length. Right: Example of a ray that produces
a d¯p that misses the length of the limb. In such a case detection of an
intersection between ray and cylinder is facilitated by calculation of dˆp and
dˇp on either sides of the limb.

where dp is the shortest distance between the axis of the limb and the
current ray, as shown in Fig. 4.2 on the right. We have chosen λp = f (d2p )
so that the rendering function is differentiable as will be seen in Sect. 4.2
and Appendix A.
We render the model by firing rays from the camera’s centre of projec-
tion, passing through each pixel from the set of cameras. For each pixel
p, and for each limb l in chain c of the model, we calculate dp using the
line-to-line distance formula, which in our case becomes

d_p(X) = \frac{\left( \hat{u}_r \times (p_{c,l} - p_{c,l-1}) \right) \cdot (c - p_{c,l-1})}{\left| \hat{u}_r \times (p_{c,l} - p_{c,l-1}) \right|}, \qquad (4.17)
where ûr is the unit vector in the direction of the ray passing through
pixel p, and c is the position of the current camera’s centre of projection.
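
A sketch of how (4.16) and (4.17) might be evaluated for a single pixel and limb, for
illustration only; taking the absolute value of the numerator, and ignoring the
degenerate case where the ray is parallel to the limb axis, are simplifying assumptions
of this sketch:

    import numpy as np

    def line_to_line_distance(cam_centre, ray_dir, p_root, p_leaf):
        """Shortest distance d_p between the pixel's ray and the limb axis, eq. (4.17)."""
        axis = p_leaf - p_root
        n = np.cross(ray_dir, axis)                 # undefined if ray is parallel to the axis
        return abs(np.dot(n, cam_centre - p_root)) / np.linalg.norm(n)

    def pixel_intensity(d_p, r):
        """Intensity lambda_p of one pixel given d_p and limb radius r, eq. (4.16)."""
        if d_p > 1.5 * r:
            return 255.0
        if d_p < 0.5 * r:
            return 0.0
        return 255.0 / r ** 2 * d_p ** 2 - 255.0 / 2.0
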
The rendering method described so far results in cylinders of infinite
length, and is therefore not sufficient to use as is for our measurement
prediction Λ(X). The reason for this is that dp can be smaller than rc,l
even at times when the ray misses the axis’ length as shown in Fig. 4.3 on
the left. In such cases, and with the method described above, the pixel is
rendered erroneously. For this reason it is essential to introduce additional
distance measures that parameterise how close the ray is to the limb. Vari-
ous resulting rendering methods are discussed in the following subsections.
Figure 4.4: Rendered cylinder under various states X. Notice the dis-
continuities in pixel intensity along the cylinder’s axis in the image on the
top-right.

Cylinder

It was initially decided that the limbs should be rendered as cylinders.


To allow for the finite lengths sc,l of the limbs, we decided to paint each
pixel according to (4.16) only if the smallest
distance between the limb axis and the ray (labelled d¯p , from now on) occurs
somewhere along the length of the limb. This however would neglect all the
cases when the ray crosses either of the flat circular surfaces of the cylinder,
as shown in Fig. 4.3 on the right. All the pixels corresponding to rays
crossing the flat surfaces of the cylinder will mistakenly be painted white,
implying that there is no intersection between ray and limb.
To compensate for this we have introduced another two distance mea-
sures dˆp and dˇp , that represent the distance between the centre of either flat
surface of the cylinder and the point where the ray crosses the flat surface,
as shown in Fig. 4.3 on the right. For each pixel’s ray, all three distance
measures (d¯p , dˆp and dˇp ) are now calculated, and the smallest one of the
three (i.e. the one that implies that the ray is more ‘immersed’ in the limb)
is selected as the distance measure to be used with (4.16). In other words,
Figure 4.5: Left: Illustration of distance measures d¯p , dˆp and dˇp , used
to render a cylinder with hemispherical ends. Right: Distances {dn,p}, n = 1, . . . , Ns,
used to render the limb as a sphere tree.

dp = min{d¯p , dˆp , dˇp } (4.18)

The resulting measurement prediction for the case of a single cylinder is


shown in Fig. 4.4. Even though the cylinder forms a good approximation
for a human limb, it does not result in a smooth variation of pixel intensity
under all possible states. In fact, when the state is such that the camera
gets an exact side view of the cylinder, we see that there is a discontinuity
in pixel intensity along the limb’s axis, as seen at the top-right image of Fig.
4.4. As a result the axial gradient corresponding to the pixels defining the
sharp edge of the cylinder is undefined, meaning that the tracker is unable
to move the limb along its axis, as will be further discussed in Chapt. 5.

Cylinder with Hemispherical Ends


To guarantee that the measurement prediction has a finite gradient under
all the possible states of the model, we have rendered the limbs as cylinders
with hemispherical ends. We have done this by re-defining the role of dˆp
and dˇp . Under this rendering method, these distance measures are now
the point-to-line distances between the current ray and either end of the
cylinder, as shown in Fig. 4.5 on the left. Like in (4.18), the minimum of
the three distance measures is used with (4.16) to get each pixel’s intensity.
Distance measure d¯p still contributes towards the lateral part of the cylinder,
while dˆp and dˇp now contribute towards a sphere centred at either end-point
Figure 4.6: The resulting measurement predictions for one limb using the
Cylinder with Hemispherical Ends method.

of the limb, with its radius being equal to the radius of the cylinder. As seen
in Fig. 4.6, the resulting measurement prediction now shows finite gradients
from all possible states.
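
The selection of (4.18) can be sketched as follows (illustrative Python only, assuming
that d̂p and ďp are taken as point-to-line distances from the two limb end-points to the
ray, and that d̄p is kept only when the closest approach falls within the limb's length):

    import numpy as np

    def point_to_ray_distance(point, cam_centre, ray_dir):
        """Distance from a 3D point to the ray from cam_centre along ray_dir."""
        w = point - cam_centre
        return np.linalg.norm(np.cross(w, ray_dir)) / np.linalg.norm(ray_dir)

    def capsule_distance(cam_centre, ray_dir, p_root, p_leaf):
        """d_p = min{d_bar, d_hat, d_check} of equation (4.18)."""
        axis = p_leaf - p_root
        w0 = cam_centre - p_root
        a, b, c = np.dot(ray_dir, ray_dir), np.dot(ray_dir, axis), np.dot(axis, axis)
        d, e = np.dot(ray_dir, w0), np.dot(axis, w0)
        denom = a * c - b * b
        d_bar = np.inf
        if denom > 1e-12:                     # ray is not parallel to the limb axis
            t = (a * e - b * d) / denom       # closest-approach parameter along the axis
            if 0.0 <= t <= 1.0:               # closest approach lies within the limb's length
                n = np.cross(ray_dir, axis)
                d_bar = abs(np.dot(n, w0)) / np.linalg.norm(n)
        d_hat = point_to_ray_distance(p_root, cam_centre, ray_dir)
        d_check = point_to_ray_distance(p_leaf, cam_centre, ray_dir)
        return min(d_bar, d_hat, d_check)
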

Sphere Tree

The performance of the system when the limbs are rendered as cylinders with
hemispherical ends has proven to be very promising. The limbs now produce
a differentiable measurement prediction under any X thanks to the point-
to-line distance measures introduced in the previous subsection. For this
reason, and because of future aspirations of the project to render models
that are a lot more realistic, we have decided to implement a rendering
method that represents the limbs as nested groups of spheres, or sphere trees.
Sphere trees have been studied by G. Bradshaw and C. O'Sullivan in
[1] and efficient algorithms exist that can render objects of arbitrary shapes
as groups of spheres, the number of spheres increasing with the intended
accuracy, as illustrated in Fig. 4.7. For our case sphere trees could eventu-
ally form a very realistic representation of the subject’s limbs, resulting in
measurement predictions Λ(X) of superior quality. Apart from the future
Figure 4.7: Demonstration of the Sphere-Tree algorithm, an algorithm that


represents objects of arbitrary shapes as groups of spheres. The number of
spheres used depends on the desired accuracy. This image is by courtesy of
G. Bradshaw and C. O’Sullivan [1].

advantages of rendering limbs as sphere trees, we have investigated such


rendering methods because calculating the derivative of the corresponding
distance measures proved to be relatively easy. Furthermore, the fact that
spheres always project into circles under perspective projection means that
simpler and faster rendering methods may be implemented in the future.
Under this rendering scenario we have chosen a level of accuracy that
determines Ns , the total number of spheres needed to represent the limb.
Then, the centre for each sphere is located and the set of distances {dn,p}, n = 1, . . . , Ns,
of the ray from each of these centres is calculated, as shown in Fig. 4.5 on
the right. The candidate distance measures are used in (4.16) to get a set
of candidate pixel intensities {λn,p(X)}, n = 1, . . . , Ns, and of course, the current pixel
is given the lowest (darkest) available intensity:

\lambda_p(X) = \min_{n} \left\{ \lambda_{n,p}(X) \right\}_{n=1}^{N_s} \qquad (4.19)

Figure 4.8 shows a limb that was rendered as a group of equally sized
spheres. Note that the sphere tree rendering method is not constrained to
a constant sphere radius, thus allowing for the natural curvature of the
subject’s limbs.
Figure 4.8: Resulting measurement predictions for various X of one limb


when rendered as a sphere tree.

4.2 The Jacobian JΛ (X)

Equations (4.9) and (4.10) indicate that for every new estimate X̂(i) , the
Jacobian JΛ(X̂(i−1)) of the measurement prediction has to be calculated. JΛ
is a matrix of size N U V × S where S is the size of the state vector X. For
each pixel of Λ(X) we calculate the derivative ∂λp(X)/∂X, which is a 1 × S vector.
All these vectors are subsequently stacked to form JΛ .
By differentiating (4.16) we have,

 255 ∂

2
rc,l ∂X (dp (X)) if 21 rc,l ≤ dp (X) ≤ 32 rc,l
∂λp (X)
= (4.20)
∂X 
0 otherwise .
Of course, the value of ∂(d_p^2(X))/∂X depends on the distance measure used to
render the pixel. As an example of how the distance derivatives are
calculated, the derivative of d̄_p^2 is calculated in Appendix A. In general,
for every distance measure used, d_p^2 = f(p_{c,l}, p_{c,l−1}), and so the
derivative of every joint position p_{c,l} of every chain c has to be
calculated. Differentiating (4.12) w.r.t. the state gives,

\[
\frac{\partial p_{c,l}}{\partial X} = \frac{\partial x_{sp}}{\partial X}
+ \frac{\partial}{\partial X}\left\{ \sum_{m=1}^{l} s_{c,m}
\left( \prod_{n=1}^{m} R_{c,n}(X) \right) \hat{v}_{c,m} \right\} \qquad (4.21)
\]

which is a 3 × S matrix whose columns are ∂p_{c,l}/∂x_s, x_s being the s-th
state variable. For the first 3 state variables (i.e. the spatial components
of the state) equation (4.21) becomes,

\[
\frac{\partial p_{c,l}}{\partial x_{sp}} = \frac{\partial x_{sp}}{\partial x_{sp}} = I_3 \qquad (4.22)
\]
where I3 is the 3×3 identity matrix. For the rest of the state components,
which correspond to rotations, (4.21) becomes

\[
\frac{\partial p_{c,l}}{\partial q_{\xi,c,\ell}} = \sum_{m=1}^{l} s_{c,m}
\left\{ \left( \prod_{n=1}^{\ell-1} R_{c,n} \right)
\frac{\partial R_{c,\ell}}{\partial q_{\xi,c,\ell}}
\left( \prod_{n=\ell+1}^{l} R_{c,n} \right) \right\} \hat{v}_{c,m} \qquad (4.23)
\]

where q_{ξ,c,ℓ} is the ξ-th quaternion coefficient of limb ℓ in chain c, and
ξ ∈ {0, x, y, z}. Note that ∂p_{c,l}/∂q_{ξ,c,ℓ} is non-zero only if p_{c,l}
and q_{ξ,c,ℓ} refer to the same chain c, and if ℓ ≤ l. Calculation of
∂R_{c,ℓ}/∂q_{ξ,c,ℓ} is done by differentiating (4.15).
Equations (4.23) and (4.22) are used to complete the 3 × S matrices
∂p_{c,l}/∂X, which are used to determine a 1 × S vector for ∂d_p^2(X)/∂X.
Equation (4.20) is then used to calculate a 1 × S vector for the gradient of
every pixel. The resulting vectors ∂λ_p(X)/∂X of all the NUV pixels are
finally stacked on top of each other to form the Jacobian, whose size is
NUV × S.
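
To make the use of the stacked Jacobian explicit, the sketch below
(Python/NumPy; render and jacobian are hypothetical stand-ins for the routines
of Sects. 4.1 and 4.2, and the update is written as a generic pseudo-inverse
Gauss-Newton step rather than the exact implementation of (4.9) and (4.10))
stacks the per-pixel gradients and solves the resulting linear least-squares
problem for the state increment.

```python
import numpy as np

def assemble_jacobian(pixel_gradients):
    """Stack the 1 x S gradients of all NUV pixels into an (NUV x S) matrix."""
    return np.vstack(pixel_gradients)

def silhouette_matching_step(x, D, render, jacobian):
    """One iteration of silhouette matching.  x is the S-vector state, D the
    stacked measurement, render(x) the stacked prediction Lambda(x), and
    jacobian(x) its (NUV x S) Jacobian.  The update solves the linearised
    least-squares problem, i.e. a pseudo-inverse step."""
    residual = D - render(x)        # difference between data and prediction
    J = jacobian(x)
    dx, *_ = np.linalg.lstsq(J, residual, rcond=None)
    return x + dx
```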

4.3 The Measurement Dk


The measurement or dataset Dk is the monochrome silhouette of the subject
as captured by the cameras. To get a good silhouette of the subject we first
take a short sequence (of about 50 framesets or so) B1:50 of the scene’s
background. We assume that the framesets in the background sequence
follow a Gaussian distribution:

Bk ∼ N (Bµ , Bσ2 ) (4.24)



Figure 4.9: The pre-processing stage of the system. 1st Row: View from
camera 1. 2nd Row: View from camera 2. Raw framesets Ik are grabbed
(column 2), and compared against the mean background frameset Bµ (col-
umn 1) to produce a noisy silhouette image Ek (column 3). The frameset
is then morphologically filtered and smoothed to produce the final measure-
ment Dk (column 4).

where Bµ = [µ_1 . . . µ_p . . . µ_{NUV}]^T and Bσ = [σ_1 . . . σ_p . . . σ_{NUV}]^T
are the background mean and standard deviation frames respectively. Since an
adequate sample is captured, Bµ and Bσ are easily calculated.
Assuming that the subject’s presence in the scene does not change the
background’s statistics, and by comparing the current raw frameset
I_k = [i_1 . . . i_p . . . i_{NUV}]^T with framesets Bµ and Bσ, we can decide
whether each pixel of the raw frameset belongs to the background or not. For
each pixel p, the pixel intensity of the background-extracted frameset E_k is
given by,

\[
e_p =
\begin{cases}
0 & \text{if } |i_p - \mu_p| > \beta\sigma_p \\
255 & \text{otherwise}
\end{cases} \qquad (4.25)
\]

i.e. the pixel is painted black, indicating that it represents the subject’s
silhouette, only if i_p falls outside a confidence interval of ±βσ_p around
the background mean. By setting an appropriate value for β, we can minimise
the number of background pixels that are mistakenly taken to be part of the
foreground. For this experimental setup β = 3.891, which leads to an error
rate of 10^{-4}, i.e. 1 erroneous pixel in each 100 × 100 block of pixels,
provided that the foreground does not interfere with the background.

After extracting E_k, the framesets are morphologically filtered in order to
decrease the level of noise, and subsequently smoothed so that they can be
compared with the (equally smooth) measurement prediction Λ(X). Fig. 4.9
shows all the stages discussed in this section leading to the final
measurement D_k.
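
For illustration, the pre-processing stage of Fig. 4.9 could be sketched as
follows (Python/SciPy); the use of a binary opening and a Gaussian smoothing
kernel, and their parameters, are assumptions made for the example rather than
the exact filters used in this work.

```python
import numpy as np
from scipy import ndimage

def background_statistics(background_frames):
    """Per-pixel mean and standard deviation over the background sequence B_1:50."""
    B = np.stack(background_frames).astype(float)
    return B.mean(axis=0), B.std(axis=0)

def extract_measurement(I_k, B_mu, B_sigma, beta=3.891, smooth_sigma=1.0):
    """Threshold as in (4.25), then morphologically filter and smooth (Fig. 4.9)."""
    foreground = np.abs(I_k - B_mu) > beta * B_sigma            # noisy silhouette E_k
    cleaned = ndimage.binary_opening(foreground, iterations=1)  # remove speckle noise
    D_k = np.where(cleaned, 0.0, 255.0)                         # black = subject
    return ndimage.gaussian_filter(D_k, sigma=smooth_sigma)     # smoothed measurement
```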
5 Results
This chapter will present the ability of the current system to track the data.
We initially display results concerning the case of a model of a single cylinder,
based on simulated data. The tracking performance of different rendering
methods is compared. The most promising rendering method is chosen to
test how well the system tracks the model of the upper human body, again
using simulated data. Finally, we test our method on some real data.
In all the examples that follow, the measurement prediction, its Jacobian
and the measurement (for the case of experiments using real data) were
produced as described in the previous chapter. For reasons of clarity the
measurement prediction Λ(X) is rendered white and is superposed on the
measurement which is painted black.

5.1 Simulated Data


In the following examples a single measurement D is produced by rendering the
model itself under a state X∗, and then ‘forgetting’ about the state. We then
initialise our model’s state at X̂(0), and equations (4.9) and (4.10) are
employed iteratively to converge to X̂_ML (which in this case is identical to
X∗). Under this scenario the additive noise process is N = 0, and the
measurement equation implies that the model is perfect, as it can describe
the data with no error.

5.1.1 Single Cylinder


This part is included in this report in order to illustrate the importance of
adopting the appropriate rendering method in the system. Three rendering
methods have been tested for this simple case: (i) the ‘Cylinder’, (ii) the
‘Cylinder with Hemispherical Ends’ and (iii) the ‘Sphere Tree’. For this
simulation the model only consisted of a single limb.

[Figure 5.1 plot: ‘% of Negative Log Likelihood’ against ‘No. of Iterations’
for the ‘Cylinder’, ‘Hemispherical Ends’ and ‘Sphere Tree’ rendering methods.]

Figure 5.1: Silhouette matching on a single limb using a single simulated
measurement D and 1 camera. Three rendering methods are compared: ‘Cylinder’,
‘Cylinder with Hemispherical Ends’ and ‘Sphere Tree’. Top Left: Λ(X̂(0))
(white) superposed on D (black) using the ‘Cylinder’ method to render both
silhouettes. Bottom Left: Same using the ‘Cylinder with Hemispherical Ends’.
Right: Performance of the three rendering methods. The ‘Cylinder’ method
fails to converge as there is not sufficient gradient information to move the
model along the x-axis.

Even though the system generally performs well and converges to the correct
state X∗ under any of the rendering methodologies described above, problems
with the ‘Cylinder’ method arise when we consider an initial state X̂(0) where
the limb is viewed sideways by the camera. In that case the gradient in
Λ(X̂(0)) along the limb’s axis is undefined, and the Jacobian is therefore
unable to provide gradient information to direct the state correctly, as
mentioned in Sect. 4.1.4 of Chapt. 4. This is shown in Fig. 5.1, where the
model using the ‘Cylinder’ method fails to converge. We see that neither of
the two alternative rendering methods suffers from this problem, and
therefore the rest of the testing of the system was done using the
‘Hemispherical Ended Cylinder’. The ‘Sphere Tree’ method is left for future
use, when we implement the Sphere-Tree algorithm for creating realistic
models, but it is important to note that its performance is equivalent to
that of the ‘Hemispherical Ended Cylinder’.

Figure 5.2: Silhouette matching on a simulated dataset using 1 camera.
Λ(X̂(i)) superposed on D for iterations i = 0 (left), i = 7 (middle), i = 120
(right).

5.1.2 Upper Body

Single Camera

In Fig. 5.2 we display three snapshots from the silhouette matching process
for the case of a single-camera system. Despite the ambiguity problem when
working with pure silhouettes, we see that the algorithm can converge to the
global optimum, provided that X̂(0) lies in the vicinity of X̂_ML.

[Figure 5.3 plots: (D − Λ(X))^T (D − Λ(X)) against iterations for the Single
Camera Experiment (left) and the Double Camera Experiment (right).]

Figure 5.3: Variation of the negative log-likelihood w.r.t. number of
iterations i during silhouette matching with simulated data on dataset D for
the single camera (left) and double camera (right) experiments.

Figure 5.4: Silhouette matching on a real dataset using 2 cameras. 1st Row:
View from camera 1. 2nd Row: View from camera 2. Column 1 shows real data
D_1; columns 2 to 5 show Λ(X̂_1^{(i)}) superposed on D_1 for iterations i = 0,
i = 6, i = 10 and i = 25 respectively.

Stereo Vision

Double camera configurations helped the algorithm to converge towards the
global minimum with a smaller number of iterations due to the additional
constraints of the extra data, as shown in Fig. 5.3: the stereo vision
configuration (right) has converged correctly within only 30 iterations,
while the single camera configuration (left) needed 120, even though
identical initial state approximations (X̂(0)) and true states (X∗) have been
used. Put a different way, we are now able to choose the initial state X̂(0)
for silhouette matching with greater freedom, and still converge to the
correct state X∗. The maximum allowable distance ε = |X̂(0) − X∗| that
guarantees correct state recovery is greater for the double camera
configuration than for the single camera, implying that the camera frame
capture rate may now be reduced. We can generalise this by noting that data
captured by additional cameras will make the error surface over the
state-space more convex, and the tracker consequently more robust.

5.2 Real Data


Unfortunately, datasets Dk are far from ideal, as they suffer from two types
of distortion: (i) the model and its rendering function are not currently
complex enough to describe the natural curvature of the human body and
the clothing around it, and (ii) Dk has been further corrupted during the
process of background removal. All these contribute towards N_k, the random
process described in (2.3) and (4.6). The performance of our algorithm under
these conditions for the case of the 1st dataset D_1 is shown in Fig. 5.4.

Figure 5.5: Motion recovery after successful silhouette matching on a video
sequence of real datasets D_{1:50} using 2 cameras. 1st Row: D_k as viewed
from camera 1; from left to right: D_1, D_9, D_17, D_25 and D_33. 2nd Row:
Λ(X̂_k^{(25)}) for each corresponding D_k of row 1 from the same view. 3rd
Row: Same D_k as viewed from camera 2. 4th Row: Λ(X̂_k^{(25)}) for each
corresponding D_k of row 3 from the same view.
When silhouette matching is achieved repeatedly over the continuous flow of
datasets we achieve tracking. For each new dataset D_k, a new optimisation
process of 25 iterations takes place, and by default the initial
approximation for the current state, X̂_k^{(0)}, is taken to be the converged
state of the previous frame, X̂_{k−1}^{(25)}.
Fig. 5.5 shows the real datasets D_k as well as the converged model
silhouettes Λ(X̂_k^{(25)}) as observed by the double-camera system for
different time instants t_k, using 144 × 192 pixel resolution. We see that
despite the low quality of the measurement, the system recovers the subject’s
state at each frame to a considerable degree of accuracy. Our system also
works at lower resolutions, with lower yet acceptable accuracy, as shown in
Fig. 5.6 for the case of 36 × 48 pixels in each camera.

Figure 5.6: Tracking with low resolution data. Top Row: D_k for k = 1, 9, 17,
25 and 33. Bottom Row: The model’s converged state after 25 iterations for
each corresponding D_k of the top row from the same view. The model was
rendered with a higher resolution for clarity. Only views from camera 1 are
shown in this illustration.
6 Conclusions and Future Work
The current system’s robustness has not been completely tested yet. So far
we have successfully tracked a sequence showing a simple movement, with the
subject moving at a relatively slow pace. The movement is tracked
successfully even at medium to low resolutions, down to 36 × 48 pixels in
each camera. We will however need to test the system on additional gestures
of increasing complexity.
Special attention will be given to gestures that involve occlusions. It is
expected that the system will cope with sequences involving a low degree
of occlusion, e.g. sequences where one of the two cameras sees an occluded
silhouette while the other does not. It is also believed that higher degrees
of occlusions will be handled when introducing more cameras, but unfor-
tunately our camera-capturing equipment as it stands now can only store
datasets produced by 2 cameras. Nevertheless, there are many issues to be
improved in order to increase the system’s robustness.

6.1 Extended Kalman Filter

Limitations on the current system’s performance under occlusions imply that
the tracker’s robustness will have to be upgraded. For the remainder of this
project we propose the adoption of the Extended Kalman Filter (EKF) framework
[27, 28, 29], which is perhaps the most widely used tracker today.
The EKF equations are shown below, using standard notation:
The EKF equations are shown below, using standard notation:

\[
x_k = \hat{x}_k + K_k \left( z_k - h(\hat{x}_k) \right) \qquad (6.1)
\]

where x_k and z_k are the current state and measurement respectively, h is
the measurement function, x̂_k is the state prediction and K_k the Kalman
gain. K_k and x̂_k are given by,

\[
\hat{x}_k = f(x_{k-1}) \qquad (6.2)
\]
\[
K_k = P'_k J_h^T \left( J_h P'_k J_h^T + R \right)^{-1} \qquad (6.3)
\]

where

\[
P'_k = J_f P_{k-1} J_f^T + Q \qquad (6.4)
\]
\[
P_k = (I - K_k J_h) P'_k \,. \qquad (6.5)
\]

J_f = ∂f(x_{k−1})/∂x and J_h = ∂h(x̂_k)/∂x are the Jacobians of the process and
measurement functions respectively.
Considering the special case when the process function f is the identity
mapping, and when the measurement covariance matrix R is diagonal with
entries much smaller than those of the process covariance matrix Q, then
Jf = I and the EKF equations shown above become,

\[
\begin{aligned}
\hat{x}_k &= x_{k-1} \\
P'_k &= P_{k-1} + Q \\
K_k &= (P_{k-1} + Q)\, J_h^T \left( J_h (P_{k-1} + Q) J_h^T + R \right)^{-1} \\
    &= (P_{k-1} + Q)\, J_h^T \left( J_h (P_{k-1} + Q) J_h^T \right)^{-1} \\
    &= (P_{k-1} + Q)\, J_h^T (J_h^T)^{+} (P_{k-1} + Q)^{-1} J_h^{+} \\
    &= J_h^{+} \\
    &= \left( J_h^T J_h \right)^{-1} J_h^T
\end{aligned}
\]

where A+ is the Moore-Penrose matrix inverse for non-square matrix A.


Substituting into (6.1) gives,
\[
x_k = x_{k-1} + \left( J_h^T J_h \right)^{-1} J_h^T \left( z_k - h(x_{k-1}) \right) \qquad (6.6)
\]
which, if used over several iterations for each k, is identical to (4.9)
and (4.10). We see that the current system forms a special case of the
EKF and that its formulation allows for it to be upgraded into a more
sophisticated one, making use of full covariance matrices R and Q, and
further investigation of the process function. We can therefore relax the
assumption of i.i.d. pixel noise stated in (4.6).
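
The reduction of the Kalman gain to the pseudo-inverse of J_h when the entries
of R become negligible can also be checked numerically; the snippet below is
purely illustrative, using an arbitrary Jacobian and covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
S, M = 6, 40                        # state and measurement sizes (illustrative)
J_h = rng.standard_normal((M, S))   # an arbitrary measurement Jacobian
P = np.eye(S)                       # stands for P_{k-1} + Q
R = 1e-3 * np.eye(M)                # measurement noise, small next to Q

K = P @ J_h.T @ np.linalg.inv(J_h @ P @ J_h.T + R)   # Kalman gain, as in (6.3)
K_limit = np.linalg.pinv(J_h)                         # (J_h^T J_h)^{-1} J_h^T

# The difference shrinks as the entries of R shrink relative to those of Q.
print(np.abs(K - K_limit).max())
```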
The EKF used in the current system as stated in (6.6) gives a state
prediction x̂k which is the same as the previous state xk−1 . The system uses
no information about the model’s dynamics from previous frames in order to
predict the current state, implying that the model is static.
Furthermore, the EKF of equation (6.6) shows faith that the measurement z_k
is an accurate one, since its noise covariance R is small, as assumed earlier
on. On the other hand, since the entries of the process noise covariance Q
are a lot greater than those of R, the EKF gives less credit to the state
prediction. All this is encoded in the Kalman gain K_k, which holds its
maximum possible value when R → 0, as indicated by (6.3). As a result the
EKF now gives more weight to the measurement residual (z_k − h(x_{k−1})), and
the estimate for the new state x_k is more free to move away from the
(imprecise) state prediction x̂_k.
In order to improve the filter’s faith in x̂_k, and therefore to relax the
assumption that the entries of R are negligible compared to those of Q, it is
proposed that the tracker is upgraded into an EKF that uses a 1st order
kinematic model, where the state now holds the angular velocities of the
limbs, as well as their position in 3D space. That is,

\[
x_k = \begin{bmatrix} x_k \\ \dot{x}_k \end{bmatrix} \qquad (6.7)
\]

where x_k = [x_k^{(1)} . . . x_k^{(S)}]^T is now the part of the state that
describes the pose (position and orientation) of the body, and
ẋ_k = [ẋ_k^{(1)} . . . ẋ_k^{(S)}]^T the part of the state describing its 1st
order dynamics. Under this extension the process and measurement equations
become,

xk = f (xk−1 ) + wk (6.8)

and

zk = h (xk ) + vk (6.9)
This modification will make the system able to predict the pose of the
subject in the new frame based on the pose and its rate of change in the
previous frame, making the state prediction better. Taking the time interval
between successive frames as unit time, the pose state parameter s and its
rate of change are given by,

\[
x_k^{(s)} = x_{k-1}^{(s)} + \dot{x}_{k-1}^{(s)} \qquad (6.10)
\]
\[
\dot{x}_k^{(s)} = x_k^{(s)} - x_{k-1}^{(s)} \qquad (6.11)
\]

Evaluation of (6.10) and (6.11) for all s ≤ S leads to a linearised process
function f, while taking their state-derivatives leads to the evaluation of
J_f. With consideration of the process and measurement noise covariances, the
EKF equations may be incorporated into the system.
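
As a minimal sketch of this constant-velocity model (illustrative names; the
pose and its rate of change are simply stacked into one vector), the process
function implied by (6.10) and (6.11) and its constant Jacobian J_f could be
written as:

```python
import numpy as np

def process_function(x_aug, S):
    """Constant-velocity prediction of (6.10)-(6.11): x_aug stacks the pose x
    (first S entries) and its rate of change x_dot (last S entries)."""
    x, x_dot = x_aug[:S], x_aug[S:]
    return np.concatenate([x + x_dot, x_dot])

def process_jacobian(S):
    """Jacobian J_f of the (linear) process function: [[I, I], [0, I]]."""
    I, O = np.eye(S), np.zeros((S, S))
    return np.block([[I, I], [O, I]])
```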
Even though the size of the state is now 2S, the system will gain in
robustness, and ε, the allowable change in state between successive frames
that guarantees convergence, will increase. The tracker will be able to handle
faster movements, or, alternatively, the frame-capturing frequency may be
reduced to save computational power.
Further upgrade of the tracker’s performance could be achieved by intro-
ducing prior distributions about the current state in order to narrow down
the state search. This however means that the system will need to be trained
with sufficient datasets, and may eventually reduce the system’s ability to
track unknown movements, such as dancing. For these reasons prior distri-
butions will only be considered after sufficient testing of the tracker under
the extensions discussed above.

6.2 Shadow Detection

Work can also be done in upgrading the quality of the silhouettes that are
used as the measurement for the EKF. Currently the measurement is created
by classifying pixels of the current raw frame Ik into two classes: foreground
and background. The pixels are classified as foreground if they fall outside
a ±βσ_p confidence interval around the mean background frame Bµ, as discussed
in Sect. 4.3 of Chapt. 4. This algorithm works well when the subject’s
presence in the scene does not interfere with the background. However, in
many cases we have a subject located very close to parts of the background,
casting shadows or reflecting ambient light. This produces a number of
pixels in the measurement that are erroneously assigned to the foreground.
Sparsely distributed erroneous pixels can easily be removed using morpho-
logical filters, but patches of erroneous pixels act as attractors for the model
causing confusion to the tracker. The current algorithm shows no robustness
against this, as shown in the 3rd image of the 2nd row in Fig. 4.9, where
shadows produced by the subject have resulted in the classification of the
pixels representing the keyboard as part of the foreground.
For these reasons it is proposed that in addition to Ik , Bµ and the back-
ground standard deviation frame Bσ , derivative framesets |∇Ik |, |∇B|µ and
|∇B|σ could also be used to facilitate shadow detection, so that pixels may
be classified into three classes, not two. Under this scenario, equation (4.25)
makes a rough classification of the pixels that belong to the foreground, and
subsequently their corresponding values in |∇Ik | and |∇B|µ are compared.
If these values are found to be similar, the pixel is classified as a shadow
pixel, otherwise it remains in the foreground class. The level of similarity
tolerated will clearly be dictated by the corresponding value of |∇B|σ , like
before. All this changes equation (4.25) to,

\[
e_p =
\begin{cases}
0 & \text{if } |i_p - \mu_p| > \beta\sigma_p \ \text{and} \ |\delta i_p - \delta\mu_p| > \beta\,\delta\sigma_p \\
255 & \text{otherwise}
\end{cases} \qquad (6.12)
\]

where δi_p, δµ_p and δσ_p are elements of |∇I_k|, |∇B|_µ and |∇B|_σ
respectively.
It is hoped that with this extension we will be able to reduce the number
of pixels that are mistakenly assigned to the silhouette of the subject, this
way reducing the number of erroneous pixels that appear as patches and
therefore helping the tracker converge to the correct state. The increase
in computational complexity of the system is affordable, since |∇B|µ and
|∇B|σ are only calculated once, while |∇Ik | will have to be calculated for
each new frame grabbed.
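
A possible realisation of the three-class rule (6.12) is sketched below
(Python/SciPy); the Sobel approximation of the gradient-magnitude frames and
the re-use of the same β for both tests are assumptions made purely for
illustration.

```python
import numpy as np
from scipy import ndimage

def gradient_magnitude(frame):
    """|grad I|: gradient-magnitude image (Sobel approximation, for illustration)."""
    gx, gy = ndimage.sobel(frame, axis=0), ndimage.sobel(frame, axis=1)
    return np.hypot(gx, gy)

def classify_pixels(I_k, B_mu, B_sigma, gB_mu, gB_sigma, beta=3.891):
    """Foreground / shadow / background classification following (6.12)."""
    gI_k = gradient_magnitude(I_k)
    intensity_differs = np.abs(I_k - B_mu) > beta * B_sigma
    gradient_differs = np.abs(gI_k - gB_mu) > beta * gB_sigma
    foreground = intensity_differs & gradient_differs   # e_p = 0 in (6.12)
    shadow = intensity_differs & ~gradient_differs      # brightness changed, texture did not
    return foreground, shadow
```

Pixels flagged as shadow would then be painted white (background) when
forming E_k.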

6.3 Extension of the Measurement


Additional extensions to the measurement will also be considered, since by
definition of the system as one that uses measurements comprising solely of
silhouettes, there is an inherent inability to track gestures involving several
occlusions (e.g. when the subject ties his arms). For this reason we propose
expansion of the measurement to utilise additional features from the raw
footages, to be used only in cases of a detected occlusion, this way keeping
the computation complexity low and the time performance high.
Such features could simply be the edges extracted from the framesets, optical
flow, or correlation data. Of these features, edge detection is the simplest
to implement, but any system relying on it would eventually fail, as creases
on the subject’s clothes tend to mislead the tracker towards erroneous state
estimates. On the other hand, correlation data need more computational power
to be extracted and may even
need dedicated hardware to be grabbed in real-time. However, correlation will
produce clouds of 3D points sitting on the subject’s surface. Even though the
point cloud will clearly suffer from noise, it will alert the system when the
silhouette tracker converges to an erroneous state, and hopefully help the
tracker maintain the correct state even in cases of occlusion.

6.4 State Initialisation

Several things can also be said about the initialisation of the model: to date
this has not been given much attention. So far in all of our experiments the
state of the subject’s model is started up manually at the first frame of the
sequence. Even though this does not decrease the performance of the system
in any other way, we will try several methods for finding out more about
the subject’s initial state at the 1st frame. Particle Filtering (PF) is one of
the methods under consideration, in which the whole of the state-space is
sampled to investigate whether a high likelihood P (zk |xk ) is observed. After
a sufficient number of particles (samples) has been drawn (depending on the
size of the state-space), the initial state becomes the state that maximises the
observed likelihood. The big disadvantage of using a particle-based method
to track or initialise the state of the subject in motion capture systems is that
the state-space is very large, implying that an enormous number of particles
will need to be drawn. Since P(z_k|x_k) = f(h(x_k)), and since the evaluation
of the measurement prediction is one of the most computationally expensive
stages of our system, an initialisation step involving Particle Filtering will
be a time-consuming process. For this reason Particle Filters will only be
considered with emphasis given to finding ways to reduce the state-space,
e.g. by partitioning the state-space into a group of smaller ones. Assuming
independence between the different chains c of the model, each chain could
now have its own state and a separate particle filter could be applied to it.
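
For illustration, a brute-force particle search over the full state-space
might look like the sketch below; sample_state and log_likelihood are
hypothetical helpers, and in practice the search would be applied per chain
after partitioning the state-space as discussed above.

```python
import numpy as np

def initialise_state(sample_state, log_likelihood, n_particles=10000, seed=0):
    """Draw candidate states and keep the one with the highest observation
    likelihood P(z_k | x_k).  sample_state(rng) returns a candidate S-vector;
    log_likelihood(x) scores it against the first measurement."""
    rng = np.random.default_rng(seed)
    best_x, best_score = None, -np.inf
    for _ in range(n_particles):
        x = sample_state(rng)
        score = log_likelihood(x)
        if score > best_score:
            best_x, best_score = x, score
    return best_x
```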
An alternative method to particle filtering for the initialisation of the
state could be the slight modification of equation (4.16) to have a smaller
gradient, such that the linearly-dependent part of the equation extends to
greater distances. This has the effect of decreasing the accuracy of the
tracker, but at the same time making the tracker able to converge to states
which are further away. As the model gets closer to the spatial position
of the subject, the gradient of the rendering function in (4.16) is allowed
to increase (therefore increasing the system’s accuracy) and more estimates
for the true state are calculated. Eventually the model’s spatial position
will coincide with the spatial position of the subject, and the gradient in
the rendering function will be the same as in (4.16). By then it is hoped
that the model’s state estimate will be in the vicinity of the true state, and
therefore convergence to the true initial state is achieved.
A third way to initialise the state is to search in 3D space and locate
the position of various parts of the subject using templates. To begin with,
we search for the position of the head by moving a template of the model’s
head in 3D space and projecting it onto the image planes of the cameras.
The 3D position that projects to areas of the measurement containing the
maximum number of edge foreground pixels (defined as foreground pixels
that have at least 1 background pixel neighbour) is chosen as the most
likely position for the head. Once the head is located, constraints about the
model’s structure are taken into consideration to help locate the position of
the rest of the body. As an example, the torso is constrained to be attached
on the base of the head and it is likely to extend vertically downwards.
The torso template is therefore initialised vertically below the head and
the number of pixels in the projected template as seen from the cameras is
counted and their centroid located. The torso template is then rotated for
its centroid to coincide with the centroid of the pixels in it. A new pixel
centroid is calculated and the template is rotated again until it converges
to an orientation that no longer changes. When this happens a very similar
technique is used for the initialisation of the model’s limb orientations again
using the constraints imposed by the model.

6.5 Model Acquisition

The technique described in the previous section for the initialisation of the
state is due to I. Mikić et al. and it has already been used with 3D voxel
data in [2]. An extension of this technique allows for model acquisition,
during which the size of the model’s limbs is defined based on the first
few measurements, this way allowing for any subject to be tracked. After
converging to the orientation of each body-part as explained above, the
corresponding template is reduced in size so that its projection only includes
foreground pixels. The template is then allowed to grow in all directions,
and its orientation is continuously adjusted to make sure that it grows in
the right direction. Growing along a certain direction continues until the
template starts to include non-foreground pixels. Notice that the length-
to-radius ratios of the model’s limbs are allowed to change in this way. All
this is illustrated in Fig. 6.1. This method assumes measurements D_k whose
silhouettes are ‘full’, i.e. that there are no (erroneous) background pixels
within the area of the foreground. This may be achieved by further
morphological filtering of the measurement.

Figure 6.1: Illustration of the model acquisition method using templates.
(a) The position and orientation of the body-part (here the torso) is found.
(b) The template shrinks until its projection contains no background pixels.
(c) It is then allowed to grow in all directions, and its orientation is
re-adjusted to make sure that it does not grow in the wrong direction. (d)
Growing stops when background pixels start entering the template. This image
is by courtesy of I. Mikić, M. Trivedi, E. Hunter and P. Cosman [2].

An alternative way to obtain the dimensions of the model is to implement the
Sphere-Tree algorithm, which will produce sphere-tree representations of the
subject with varying levels of precision. This method naturally goes together
with the ‘sphere tree’ rendering method discussed in Sect. 4.1.4 of Chapt. 4.

Year   Term    Task
-----------------------------------------------------------------------
2      1       Further testing of the current system. Calculation of
               measurement covariance R, process function f and Jacobian
               Jf. Upgrade of the EKF.
2      2       Enhancement of the background removal algorithm to obtain
               measurement with less distortion. Robustness upgrades using
               extra features from the raw footages.
2      3 & 4   State Initialisation and Model Acquisition. Use of the
               Sphere-tree algorithm to obtain the best model for the
               measurement. Testing of the tracker in its final form.
3      1 & 2   Consideration of PCA, HMMs and SVM for segmentation of the
               motions extracted by the tracker. Creation of a small motion
               classifier.
3      3       Extension of the classifier to include further motion
               classes.
3      4       Thesis write up.

Table 6.1: Timing of Future Work

6.6 Pattern Recognition


Depending on the quality of the results obtained in the next year of study,
we propose investigation of ways to use the sequence of recovered states xk
extracted by the tracker in order to classify the subject’s motion for action
recognition. This field of study has seen very limited research, primarily
because of the lack of robust tracking systems. Many systems developed
so far use PCA or HMMs and SVMs, which were primarily considered for
other aspects of pattern recognition, such as object and speech recognition
etc. We will try to adapt and apply similar ideas in order to classify patterns
of motion observed by the tracker. The number of classes of motion to be
recognised will initially be small and the movements will be fairly distinct to
help the early stages of classification. Classes studied could include walking,
running, jumping etc.

6.7 Time Management


During the two remaining years of the PhD course, some of the ideas dis-
cussed above will be implemented and embedded in the current system.
Action Recognition will be investigated towards the end of the project, but
since this is an emerging field of study, we will try to bring it forward in
the schedule so that a longer period of focused study may be dedicated to it.
The plan is to have a robust tracker implemented by the end of
Year 2. Ideally by this time the tracker will make use of a more sophisti-
cated Extended Kalman Filter that uses covariance matrices R and Q, and
a process function f derived by the kinematic model. It will also utilise
algorithms that allow shadow detection leading to better measurements Dk ,
and be able to handle occlusions by considering more features from the raw
footages captured by the cameras. Finally the problem of state initialisation
will be addressed, and if time permits we will investigate model acquisition
using the sphere-tree algorithm.
Finishing off with the tracker at the end of Year 2 will leave adequate
time to study and investigate the recognition part of the system. If time
and results permit, more classes will be considered, according to the gesture
database as defined by Decision Stream Ltd. Table 6.1 outlines the timing
of the project’s progress through the remainder of this PhD course.
Appendix A
Convention
Vectors in 3-space are represented as 3 × 1 column vectors:

\[
v = \begin{bmatrix} v_x \\ v_y \\ v_z \end{bmatrix} \in \mathbb{R}^{3 \times 1}
\]

Derivatives of scalars w.r.t. the state are represented as 1 × S row vectors:

\[
\alpha \in \mathbb{R} \;\Rightarrow\; \frac{\partial \alpha}{\partial X} =
\left[ \frac{\partial \alpha}{\partial x_1} \; \ldots \; \frac{\partial \alpha}{\partial x_S} \right]
\in \mathbb{R}^{1 \times S}
\]
Derivatives of vectors w.r.t. the state are represented as 3 × S matrices:

\[
v \in \mathbb{R}^{3 \times 1} \;\Rightarrow\;
\frac{\partial v}{\partial X} =
\left[ \frac{\partial v}{\partial x_1} \; \ldots \; \frac{\partial v}{\partial x_S} \right] =
\begin{bmatrix} \partial v_x / \partial X \\ \partial v_y / \partial X \\ \partial v_z / \partial X \end{bmatrix} =
\begin{bmatrix}
\frac{\partial v_x}{\partial x_1} & \ldots & \frac{\partial v_x}{\partial x_S} \\
\vdots & & \vdots \\
\frac{\partial v_z}{\partial x_1} & \ldots & \frac{\partial v_z}{\partial x_S}
\end{bmatrix}
\in \mathbb{R}^{3 \times S}
\]

Derivative of d¯2p (X)


Given a limb l from chain c with root and leaf joints pc,l−1 and pc,l respec-
tively, the smallest (squared) distance between the ray passing through pixel
p and the limb’s axis is given by (4.17):

\[
\bar{d}_p^2 = \left( \frac{ \left( \hat{u}_r \times (p_{c,l} - p_{c,l-1}) \right) \cdot (c - p_{c,l-1}) }{ \left| \hat{u}_r \times (p_{c,l} - p_{c,l-1}) \right| } \right)^{2} ,
\]

where ûr is the unit vector in the direction of the ray passing through
pixel p, and c is the position of the current camera’s centre of projection.
Equation (4.17) may be re-written as,
\[
\bar{d}_p^2 = \frac{ \left( n^T (c - p_{c,l-1}) \right)^2 }{ n^T n } \, , \qquad (A.1)
\]
where the normal vector n is given by

\[
n = \hat{u}_r \times (p_{c,l} - p_{c,l-1}) =
\begin{vmatrix}
\mathbf{i} & \mathbf{j} & \mathbf{k} \\
u_x & u_y & u_z \\
p_{c,l}^{(x)} - p_{c,l-1}^{(x)} & p_{c,l}^{(y)} - p_{c,l-1}^{(y)} & p_{c,l}^{(z)} - p_{c,l-1}^{(z)}
\end{vmatrix}
\]
\[
= \mathbf{i} \left[ u_y (p_{c,l}^{(z)} - p_{c,l-1}^{(z)}) - u_z (p_{c,l}^{(y)} - p_{c,l-1}^{(y)}) \right]
- \mathbf{j} \left[ u_x (p_{c,l}^{(z)} - p_{c,l-1}^{(z)}) - u_z (p_{c,l}^{(x)} - p_{c,l-1}^{(x)}) \right]
+ \mathbf{k} \left[ u_x (p_{c,l}^{(y)} - p_{c,l-1}^{(y)}) - u_y (p_{c,l}^{(x)} - p_{c,l-1}^{(x)}) \right] .
\]

That is,

\[
n = \begin{bmatrix} n_x \\ n_y \\ n_z \end{bmatrix} =
\begin{bmatrix}
u_y \left( p_{c,l}^{(z)} - p_{c,l-1}^{(z)} \right) - u_z \left( p_{c,l}^{(y)} - p_{c,l-1}^{(y)} \right) \\
u_z \left( p_{c,l}^{(x)} - p_{c,l-1}^{(x)} \right) - u_x \left( p_{c,l}^{(z)} - p_{c,l-1}^{(z)} \right) \\
u_x \left( p_{c,l}^{(y)} - p_{c,l-1}^{(y)} \right) - u_y \left( p_{c,l}^{(x)} - p_{c,l-1}^{(x)} \right)
\end{bmatrix} . \qquad (A.2)
\]

By differentiating (A.1) we get,

\[
\frac{\partial \bar{d}_p^2}{\partial X} =
\frac{ 2 \left( n^T (c - p_{c,l-1}) \right)
       \left( (c - p_{c,l-1})^T \frac{\partial n}{\partial X} - n^T \frac{\partial p_{c,l-1}}{\partial X} \right) }
     { n^T n }
- \frac{ 2\, n^T \frac{\partial n}{\partial X} \left( n^T (c - p_{c,l-1}) \right)^2 }{ (n^T n)^2 } \qquad (A.3)
\]

where ∂n/∂X is found by differentiating (A.2):
\[
\frac{\partial n}{\partial X} =
\begin{bmatrix} \partial n_x / \partial X \\ \partial n_y / \partial X \\ \partial n_z / \partial X \end{bmatrix} =
\begin{bmatrix}
u_y \left( \frac{\partial p_{c,l}^{(z)}}{\partial X} - \frac{\partial p_{c,l-1}^{(z)}}{\partial X} \right)
- u_z \left( \frac{\partial p_{c,l}^{(y)}}{\partial X} - \frac{\partial p_{c,l-1}^{(y)}}{\partial X} \right) \\
u_z \left( \frac{\partial p_{c,l}^{(x)}}{\partial X} - \frac{\partial p_{c,l-1}^{(x)}}{\partial X} \right)
- u_x \left( \frac{\partial p_{c,l}^{(z)}}{\partial X} - \frac{\partial p_{c,l-1}^{(z)}}{\partial X} \right) \\
u_x \left( \frac{\partial p_{c,l}^{(y)}}{\partial X} - \frac{\partial p_{c,l-1}^{(y)}}{\partial X} \right)
- u_y \left( \frac{\partial p_{c,l}^{(x)}}{\partial X} - \frac{\partial p_{c,l-1}^{(x)}}{\partial X} \right)
\end{bmatrix} .
\]


We see that ∂(d_p^2)/∂X = f(p_{c,l−1}, p_{c,l}), i.e. that the derivative of
the distance measure is a function of the derivatives of each limb’s two
end-points. Derivatives for these end-points are calculated as per Sect. 4.2
of Chapt. 4.
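
The analytic gradient above can be checked against finite differences. The
sketch below (Python/NumPy) does this for a toy state consisting only of the
two end-points stacked together, so that ∂p_{c,l−1}/∂X = [I 0] and
∂p_{c,l}/∂X = [0 I]; this is an illustrative simplification of the full
chain-rule computation of Sect. 4.2.

```python
import numpy as np

def dbar_sq(p0, p1, c, u_hat):
    """Squared ray-to-axis distance, as in (A.1), for end-points p0, p1."""
    n = np.cross(u_hat, p1 - p0)
    return np.dot(n, c - p0) ** 2 / np.dot(n, n)

def dbar_sq_gradient(p0, p1, c, u_hat):
    """Analytic gradient of (A.1) w.r.t. the toy state X = [p0, p1] (6-vector),
    following (A.3) with dp0/dX = [I 0] and dp1/dX = [0 I]."""
    n = np.cross(u_hat, p1 - p0)
    w = c - p0
    dp0 = np.hstack([np.eye(3), np.zeros((3, 3))])    # 3 x 6
    dp1 = np.hstack([np.zeros((3, 3)), np.eye(3)])    # 3 x 6
    dn = np.cross(u_hat[None, :], (dp1 - dp0).T).T    # column s is u_hat x d(p1 - p0)/dx_s
    a, nn = np.dot(n, w), np.dot(n, n)
    return 2 * a * (w @ dn - n @ dp0) / nn - 2 * (n @ dn) * a ** 2 / nn ** 2

rng = np.random.default_rng(1)
p0, p1, c = rng.standard_normal((3, 3))
u_hat = rng.standard_normal(3)
u_hat /= np.linalg.norm(u_hat)

X = np.concatenate([p0, p1])
analytic = dbar_sq_gradient(p0, p1, c, u_hat)
eps = 1e-6
numeric = np.array([(dbar_sq((X + eps * e)[:3], (X + eps * e)[3:], c, u_hat)
                     - dbar_sq((X - eps * e)[:3], (X - eps * e)[3:], c, u_hat)) / (2 * eps)
                    for e in np.eye(6)])
print(np.abs(analytic - numeric).max())   # small: the two gradients agree
```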
Bibliography

[1] G. Bradshaw and C. O’Sullivan, “Sphere-tree construction using dynamic
medial axis approximation,” in Proceedings of the ACM SIGGRAPH Symposium
on Computer Animation SCA, 2002, pp. 33–40.
[2] I. Mikić, M. Trivedi, E. Hunter, and P. Cosman, “Human body model acqui-
sition and tracking using voxel data,” in International Journal of Computer
Vision, July/August 2003, vol. 53.
[3] R. Cipolla and A. Pentland, Eds., Computer Vision for Human-Machine In-
teraction, Cambridge University Press, 1998.
[4] R. Polana and R. Nelson, “Low level recognition of human motion,” in Proc.
IEEE International Workshop on Motion of Non-Rigid and Articulated Ob-
jects, Austin, TX, USA, 1994.
[5] Y. Guo, G. Xu, and S. Tsuji, “Understanding human motion patterns,” in
Proc. International Conference on Pattern Recognition, 1994.
[6] N. I. Badler, C. B. Phillips, and B. L. Webber, Simulating Humans, Ox-
ford University Press, Oxford, UK, 1993, Computer Graphics Animation and
Control.
[7] B. Stenger, P. R. S. Mendoça, and R. Cipolla, “Model-based 3d tracking of
an articulated hand,” in Proc. IEEE International Conference on Computer
Vision and Pattern Recognition, Kauai, USA, December 2001, vol. 2, pp. 310–
315.
[8] D. M. Gavrila and L. S. Davis, “3-d model-based tracking of humans in action:
a multi-view approach,” in Proc. IEEE International Conference on Computer
Vision and Pattern Recognition, San Francisco, U.S.A., 1996, pp. 73–80.
[9] Q. Delamarre and O. Faugeras, “3d articulated models and multi-view tracking
with physical forces,” in Journal on Computer Vision and Image Understand-
ing, 2001, vol. 81, pp. 328–357.
[10] C. Bregler and J. Malik, “Tracking people with twists and exponential maps,”
in Proc. IEEE International Conference on Computer Vision and Pattern
Recognition, 1998, pp. 8–15.

[11] A. Ude, “Robust estimation of human body kinematics from video,” in Proc.
IEEE/RSJ Conf. Intelligent Robots and Systems, Kyongju, Korea, October
1999, pp. 1489–1494.
[12] A. Ude, “Robust human motion estimation using detailed shape and texture
models,” in Journal of the Robotics Society of Japan, July 2001, vol. 19, pp.
15–18.
[13] R. Urtasun and P. Fua, “3d tracking for gait characterization and recogni-
tion,” in Proc. IEEE International Conference on Automatic Face and Gesture
Recognition, Seoul, Korea, May 2004.
[14] I. Mikić, M. Trivedi, E. Hunter, and P. Cosman, “Articulated body pos-
ture estimation from multi-camera voxel data,” in Proc. IEEE International
Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, De-
cember 2001.
[15] R. Plankers and P. Fua, “Articulated soft objects for video-based body mod-
eling,” in Proc. International Conference on Computer Vision, Vancouver,
Canada, July 2001, pp. 394–401.
[16] M. Isard and A. Blake, “Condensation - conditional density propagation for
visual tracking,” International Journal of Computer Vision, vol. 29, no. 1, pp.
5–28, 1998.
[17] A. Doucet, N. G. de Freitas, and N. J. Gordon, Eds., Sequential Monte Carlo
Methods in Practice, Springer-Verlag, New York, USA, 2001, Statistics for
Engineering and Information Science Series.
[18] J. Deutscher, A. Blake, and I. Reid, “Articulated body motion capture by
annealed particle filtering,” in Proc. IEEE International Computer Vision
and Pattern Recognition (CVPR), 2000.
[19] J. Deutscher, A. J. Davison, and I. D. Reid, “Automatic partitioning of high di-
mensional search spaces associated with articulated body motion capture,” in
Proc. IEEE Conference on Computer Vision and Pattern Recognition, Kauai,
Hawaii, December 2001.
[20] I. A. Kakadiaris and D. Metaxas, “Model-based estimation of 3d human motion
with occlusion based on active multi-viewpoint selection,” in Proc. IEEE
Conference on Computer Vision and Pattern Recognition, San Francisco, CA,
USA, June 1996, pp. 81–87.
[21] D. M. Gavrila, “The visual analysis of human movement: A survey,” in
International Journal of Computer Vision and Image Understanding, January
1999, vol. 73, pp. 82–98.
[22] T. B. Moeslund and E. Granum, “A survey of computer vision-based human
motion capture,” in International Journal of Computer Vision and Image
Understanding, March 2001, vol. 81, pp. 231–268.
[23] D. Hogg, “Model-based vision: a program to see a walking person,” in Image
and Vision Computing, 1983, vol. 1, pp. 5–20.
[24] J. M. Rehg and T. Kanade, “Visual tracking of high dof articulated structures:
An application to human hand tracking,” in Proc. European Conference on
Computer Vision, May 1994, vol. 2, pp. 35–46.
[25] J. M. Rehg and T. Kanade, “Model-based tracking of self-occluding articulated
objects,” in Proc. International Conference on Computer Vision, Cambridge,
MA, USA, June 1995, pp. 612–617.
[26] I. A. Kakadiaris and D. Metaxas, “Model-based estimation of 3d human mo-
tion,” in Proc. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, December 2000, vol. 22, pp. 1453–1459.
[27] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association, Aca-
demic Press, Boston, USA, 1988, Number 179 in Mathematics in science and
engineering.
[28] A. H. Jazwinski, Stochastic Processes and Filtering Theory, Academic Press,
New York, USA, 1970.
[29] G. Welch and G. Bishop, An Introduction to the Kalman Filter, ACM Press,
Addison-Wesley, Los Angeles, CA, USA, August 2001, SIGGRAPH 2001
course pack edition.
[30] Q. Delamarre and O. Faugeras, “Finding pose of hand in video images: a
stereo-based approach,” in Proc. IEEE International Conference on Automatic
Face and Gesture Recognition, Nara, Japan, April 14-16 1998.
[31] Quentin Delamarre and Olivier D. Faugeras, “3d articulated models and multi-
view tracking with silhouettes,” in Proc. IEEE International Conference on
Computer Vision, Kerkyra, Greece, 1999, pp. 716–721.
[32] R. Urtasun and P. Fua, “3d human body tracking using deterministic temporal
motion models,” in Proc. European Conference on Computer Vision (ECCV),
May 2004, vol. III, pp. 92–106.
[33] R. Plankers and P. Fua, “Tracking and modeling people in video sequences,” in
International Journal of Computer Vision and Image Understanding, March
2001, vol. 81, pp. 285–302.
[34] B. Stenger, P. R. S. Mendoça, and R. Cipolla, “Model-based hand tracking
using an unscented kalman filter,” in Proc. British Machine Vision Conference,
Manchester, UK, September 2001, vol. 1, pp. 63–72.
[35] R. Cipolla and P. J. Giblin, Visual Motion of Curves and Surfaces, Cambridge
University Press, Cambridge, UK, 1999.
[36] S. J. Julier, J. K. Uhlmann, and H. F. Durrant-Whyte., “A new approach for
filtering nonlinear systems,” in Proc. American Control Conference, Seattle,
USA, June 1995, pp. 1628–1632.
[37] T. Horprasert, D. Harwood, and L.S. Davis, “A statistical approach for
real-time robust background subtraction and shadow detection,” in Proc.
IEEE International Conference on Computer Vision FRAME-RATE Work-
shop, Kerkyra, Greece, September 1999.
[38] R. Szeliski, “Rapid octree construction from image sequences,” in Computer
Vision Graphical Models and Image Processing: Image Understanding, July
1993, vol. 58, pp. 23–32.
[39] R. M. Murray, Z. Li, and S. S. Sastry, A Mathematical Introduction to Robotic
Manipulation, CRC Press, 1994.
[40] T.W. Drummond and R. Cipolla, “Real-time tracking of highly articulated
structures in the presence of noisy measurements,” in Proc. IEEE International
Conference on Computer Vision, Vancouver, Canada, July 2001.
[41] Maurice Ringer and Joan Lasenby, “Modelling and tracking articulated motion
from multiple camera views,” in Proc. British Machine Vision Conference,
Bristol, UK, 2000.
[42] Maurice Ringer and Joan Lasenby, “Multiple hypothesis tracking for automatic
optical motion capture,” in Proc. European Conference on Computer Vision,
2002, vol. 1, pp. 524–536.
[43] K. Choo and D. J. Fleet, “People tracking using hybrid monte carlo filtering,”
in Proc. International Conference on Computer Vision (ICCV), 2001.
[44] “Phasespace optical motion capture systems,” Online Material,
http://www.phasespace.com.
[45] “Vicon motion systems ltd,” Online Material, http://www.vicon.com.
[46] S.L. Altmann, Rotations, Quaternions, and Double Groups, Clarendon Press,
Oxford, UK, 1986.
[47] T. Y. Zhao, “Articulated models for motion tracking,” MEng Thesis, Cam-
bridge University Engineering Department, June 2001.
