
Markerless Monocular Motion Capture Using Image Features and Physical Constraints
Yisheng Chen∗ Jinho Lee† Rick Parent‡ Raghu Machiraju§
The Ohio State University MERL The Ohio State University The Ohio State University

∗e-mail: chenyis@cse.ohio-state.edu
†e-mail: leejh@merl.com
‡e-mail: parent@cse.ohio-state.edu
§e-mail: raghu@cse.ohio-state.edu

ABSTRACT

We present a technique to extract motion parameters of a human figure from a single video stream. Our goal is to prototype motion synthesis rapidly for game design and animation applications. For example, our approach is especially useful in situations where motion capture systems are restricted in their usefulness given the various required instrumentation. Similarly, our approach can be used to synthesize motion from archival footage. By extracting the silhouette of the foreground figure and using a model-based approach, the problem is re-formulated as a local, optimized search of the pose space. The pose space consists of 6 rigid body transformation parameters plus the internal joint angles of the figure. The silhouette of the figure from the captured video is compared against the silhouette of a synthetic figure using a pixel-by-pixel, distance-based cost function to evaluate goodness-of-fit. For a single video stream, this is not without problems. Occlusion and ambiguities arising from the use of a single view often cause spurious reconstruction of the captured motion. By using temporal coherence, physical constraints, and knowledge of the anatomy, a viable pose sequence can be reconstructed for many live-action sequences.

CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Animation; I.4.9 [Image Processing and Computer Vision]: Applications

Keywords: computer animation, model-based reconstruction

1 INTRODUCTION

Motion Capture (mocap) has become a mainstay of many computer-animated productions and is often used to capture human motion. Commercial mocap systems capture motion by fixing instrumentation on the target figure, such as optical, magnetic, or mechanical sensors, passive optical reflectors, or active optical emitters. However, various constraints imposed by the instrumentation tend to limit the usefulness of mocap, either by restricting the physical space of the movement, restricting the environment in which motion can be captured, or restricting the movement itself (not to mention restrictions due to cost).

While motion libraries and motion retargeting techniques extend the usefulness of mocap, there is a need to develop algorithms and methods for commodity sensors. For example, there is often a need to synthesize the motion of a figure in a video clip obtained from the Web or from a surveillance camera. Conversely, one would like to capture human motion with the least amount of equipment possible, using a single consumer-grade camera. Additionally, we wish to produce results at rates that would make such a system useful in an interactive environment and therefore suitable for prototyping animated sequences and exploring initial character motions.

However, creating a single-camera, interactive, markerless system to capture human motion is challenging [6]. On the other hand, there has been a concerted effort by several researchers to develop methods that will reconstruct motion recorded by either a limited number of cameras or just a single camera.

Generally speaking, capturing human motion from video involves extracting features from video sequences and matching those features to some model or representation of the motion to be reconstructed. Data-driven approaches exploit a motion database for reconstructing the target motion (e.g., [11]). Such strategies can provide real-time, albeit limited, reconstruction of motion. Motion outside the confines of the database is reconstructed with lesser fidelity. Other approaches key on a specific type of motion, such as cyclic, sagittally symmetric motion, and look for specific indicators of motion phases (e.g., [2]). In a more general vein, motion templates are used to recognize a variety of motions (e.g., [1]). More general still are systems that use a 3D model of the human figure to recreate a figure's pose for each frame. Rehg and his associates [5], and Sminchisescu [19, 21, 20], have both used 3D models of humans to emulate and synthesize motion as depicted in a single video stream. Our approach also falls into this last category. We now elaborate on our model-based approach.

1.1 3D Model-based Approach

Model-based approaches use a 3D articulated model of the human body to estimate the pose and shape such that the model's 2D projection closely fits the captured person in the image. Features such as intensities, edges, and silhouettes are widely used. Processing color and texture information is computationally expensive, and changes in illumination present significant problems. It should be noted that much of the scene and camera information is unknown. As a result, many of the efforts to track human figures in video, including ours, use silhouettes. Silhouettes are less sensitive to noise than edges, but fine details might be lost in their extraction.

The use of silhouettes does create problems. Extracting pose from a silhouette can produce multiple candidate solutions. The high dimensionality of the articulated model parameter space requires efficient yet robust search algorithms. A suitable initialization of model parameters is also required by many algorithms for motion tracking. Constraints like joint angle limits on parameters can be used, as well as knowledge about physics and the anatomy.

By extracting the silhouette of the foreground figure and using a model-based approach, the problem is re-formulated as a local, optimized search of the pose space. The pose space, in turn, consists of 6 rigid body transformation parameters plus the internal joint angles of the figure.

Occlusion and ambiguities arising from the use of a single view often cause spurious reconstruction of the captured motion. Therefore, the reconstruction is successful when either 2D correspondence is established for each frame of the sequence manually [5], or when the motion under scrutiny is limited [19, 21]. There is a paucity of methods that allow for general motion and require little manual intervention. Additionally, these methods should be efficient.
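As a concrete illustration of this parameterization, the pose can be packed into a single flat vector that an optimizer can search over. The sketch below is hypothetical (the helper names make_pose and unpack_pose are ours, not from the paper); the count of 24 internal joint angles follows the full model described later in Section 3.

```python
import numpy as np

# 6 rigid-body parameters (global translation + global rotation) plus the
# internal joint angles form one flat motion vector for the optimizer.
N_RIGID = 6          # global translation (3) + global rotation (3)
N_JOINT_ANGLES = 24  # internal joint angles of the full model (Section 3)

def make_pose(translation, rotation, joint_angles):
    """Pack pose parameters into one flat vector beta (length 30)."""
    beta = np.concatenate([translation, rotation, joint_angles])
    assert beta.shape == (N_RIGID + N_JOINT_ANGLES,)
    return beta

def unpack_pose(beta):
    """Split a flat pose vector back into its three components."""
    return beta[:3], beta[3:6], beta[6:]

beta = make_pose(np.zeros(3), np.zeros(3), np.zeros(24))
print(beta.shape)  # (30,)
```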
We describe herein a method to reconstruct arbitrary motion sequences that is model-based, operates on images, and exploits knowledge about the anatomy. Consequently, our method is simple, efficient, and requires limited manual intervention. Will our method reconstruct all motion sequences successfully? The answer is no. Occlusion of limbs by larger parts of the anatomy cannot always be resolved through the use of image silhouettes. On the other hand, we wish to explore the limits of efficient monocular motion reconstruction. Our results show that we can reconstruct increasingly complex motion when we include a larger set of anatomical features and constraints and employ robust image comparison metrics. We now provide an overview of our approach.

1.2 Overview of Our Approach

In this paper, we explore inverse methods which reconstruct motion from a single camera. We examine techniques which use silhouettes and edges rather than texture. Since motion is our primary interest and not identification, silhouettes and edges provide ample grounds for developing robust methods. Inverse design methods like ours employ optimization algorithms to minimize an objective function. Resolving occlusion between various parts of the human body is certainly an ominous challenge given the difficulty of matching an imperfect, highly flexible, self-occluding model to cluttered image features. Viable human models have at least 20 joint parameters subject to highly nonlinear physical constraints. Also, a significant number of the possible degrees of freedom afforded by the various joints are not uniquely determined by any given image. Thus, monocular motion capture is ill-posed and non-linear. Methods reported in the literature are either complex and expensive for computer graphics applications or impose severe restrictions on the type of motion that can be captured.

The work described herein is an exploration of simple yet robust cost functions and incremental search strategies. Our focus on simpler cost functions will eventually allow for the realization of the near real-time reconstruction often needed for computer graphics applications, while at the same time making few assumptions about the motion being tracked.

The starting point of our method is a model of a human actor with multiple quadrics assembled at various joints. After an initial pose is established either automatically or with the aid of the user, a frame-to-frame tracking procedure ensues in which the solution of the last frame is the initial guess for the next frame. In addition to silhouettes and edges, our objective function uses anatomical and physical constraints to aid in disambiguating the view.

The main contributions of our work include:

• development of a new core-weighted XOR metric for model localization in an image

• robust detection and tracking of body parts such as the head, arms, feet, and the V between the legs in the image

• inclusion of image features and anatomical constraints to reduce the size of the optimal search space

Our new method is shown to be capable of reconstructing full articulated body motion efficiently. Additionally, our results include video sequences of varying length and arbitrary illumination, and the reconstruction is quite faithful to the original sequence.

The paper is organized as follows. In Section 2 we describe pertinent previous work in motion reconstruction. Our human model is briefly discussed in Section 3. We describe in detail our method for motion reconstruction in Section 4. Implementation considerations are given in Section 5. Section 6 includes results which demonstrate the effectiveness of our technique. Concluding remarks and pointers to future work are described in Section 7.

2 PREVIOUS WORK

We describe work that is closely related to our own. Model-based approaches use a 3D articulated model of the human body to estimate the pose and shape such that the model's 2D projection closely fits the captured person in the image. Bregler and Malik [2] propose a framework to estimate 3D poses of each body segment in a kinematic chain using a twist representation of general rigid-body motion. However, their reconstruction assumes lateral symmetry in the tracked motion. Pavlovic et al. [16] present a system to track fronto-parallel motion using a dynamic Bayesian network approach. Their work focuses on the dynamics of human behavior as described by bio-mechanical models.

Deutscher et al. [4] use a human model to build a framework of a kinematic chain using limbs of conical sections for computational simplicity and high-level interpretation of output. They use edges and silhouettes in their cost function to estimate pose from multiple camera views. A condensation algorithm is employed to search the high-dimensional space without restrictions. Carranza et al. [3] use silhouettes from multi-view synchronized video footage to reconstruct the motion of a 3D human body model, and then re-render the model's appearance and motion interactively from any novel viewpoint. Kakadiaris et al. [10] use a spatio-temporal analysis to track upper body motions from multiple cameras.

Monocular markerless motion capture has been studied by a few researchers. Sminchisescu and Triggs [20] achieve successful motion synthesis based on the propagation of a mixture of Gaussian density functions, each representing probable 3D configurations of the human body over time. DiFranco et al. [5] propose a batch framework to reconstruct poses from a single viewpoint based on 2D correspondences subject to a number of constraints. Their methods are shown to be successful when deployed on moderately difficult sequences that include athletic and dance movements.

Alternative approaches to model-based tracking of human bodies have also been used. Bobick and Davis [1] construct temporal templates from a sequence of silhouette images and present a method to match the temporal template against stored views of known actions. Wren et al. [22] use 2D blobs for tracking motion in a video image. Leventon and Freeman [14] take a statistical approach and use a set of motion examples to build a probability model for the purpose of recovering 3D joint angles for a new input sequence.

Lee et al. [11] present a vision-based interface to control avatars in virtual environments. They extract visual features from the input silhouette images and search for the best motion matching the visual features obtained from rendered versions of actions in a motion database. Ren et al. [18] use the AdaBoost algorithm to select the few best local features from silhouette images to estimate yaw and body configurations. Finally, a survey of computer vision-based human motion capture techniques is presented by Moeslund and Granum [15].

We reconstruct general 3D motion from the silhouettes extracted from a single view without relying on a motion database. As a consequence, our work is closest to that of Sminchisescu. However, we demonstrate the tracking of motions that are more complex than those presented by Sminchisescu, and we track at speeds that are equal to or faster than those reported.
3 OUR HUMAN MODEL

Our human model is a combination of spheres and cylinders with an anisotropic scale and a rigid transformation defining an object coordinate system for each part. The model is shown in Figure 8 and Figure 9. The parameters that describe our human model consist of two components: shape and motion parameters. We define twelve shape parameters which describe the scale factors to be multiplied with a predefined 'standard' size of each body part. Motion parameters are composed of 6 global transformation parameters and a total of 24 joint angles. Each joint has anywhere from one to three degrees of freedom. Given N frames from a video sequence, our human body model at a specific frame i is represented by M^(i) = M(α, β^(i)), where α = {α_1, α_2, ..., α_m} (m = 12) is the shape parameter vector and β^(i) = {β_1^(i), β_2^(i), ..., β_n^(i)} (n = 30) is the motion parameter vector.

4 THE OPTIMIZATION STRATEGY

The use of a single view makes the reconstruction problem ill-posed, as stated earlier. We now describe our optimization strategy to fit the motion parameters to a single stream of silhouette images. First, we describe how we extract silhouettes from a video sequence. Then, we present an objective function based on the difference of area between the model-generated silhouette and the input silhouette. Next, we discuss how we improve the performance of our optimization algorithm. We achieve this by incorporating edge information along with both physical and anatomical constraints into the objective function. Finally, we discuss a non-linear multidimensional optimization algorithm we employ to minimize the proposed objective function.

4.1 Silhouette Extraction

The input to our motion synthesis system is a sequence of silhouette images that describe the gross motion of the human body. We avoid using coloration or texture information in order to minimize the effects of variable viewing conditions. In addition to using silhouettes, we also employ high-contrast edges to further our optimization strategy, as explained below. Given video footage, there exist several methods to obtain silhouette images. Although silhouettes are less sensitive to noise than edges, fine details of body structure and motion are often lost in their extraction. We employed the following semi-automatic methods to extract the foreground human figures from the background.

First, we identify frames wherein the human figure is absent and construct a statistical model of the background. The mean and standard deviation at all pixels over all frames (and for every RGB color channel) are computed. If a pixel differs too much from the background on any color channel, the pixel is treated as a foreground pixel, and thus the silhouette is recovered in a pixel-wise manner. In case there exists no image with just the background, we identify frames that describe motion of figures moving through the scene in a predictable fashion. We then compute a median image over the whole sequence, followed by building a weighted-mean/variance background model to extract the silhouette, provided that over time there is more non-motion at a pixel than there is motion [9]. Otherwise, we select several frames with little overlap of the human figures, extract the figures manually, and then combine the resulting images to obtain a composite background image. If there exist background areas that cannot be recovered, they are considered as foreground anyway.

Once we derive a background model, we can subtract the background from all the images in the sequence. Often, we do not have a perfect background model, and therefore the silhouette images may suffer from the presence of noise. The silhouette quality can be improved using morphological operations, such as dilation and size-based object filtering.

4.2 Core-Weighted XOR

Our goal is to find the motion parameter set β that minimizes the total penalty

E(β^(i)) = f(S_input^(i), S_model(β^(i)))    (1)

for a suitable cost function f, where S_input^(i) and S_model(β^(i)) are the ith input silhouette image and a silhouette image generated by M^(i), respectively. For the sake of clarity, we instead denote S_input^(i) and S_model(β^(i)) simply as S_input and S_model, respectively.

How does one design a viable and robust cost function f as described in Eq. (1)? The easiest way to measure the difference between two binary images is to count the 'on' pixels after a pixel-wise XOR operation is applied to the two images [13]. In this case,

f(S_input, S_model) = Σ_{i=1}^{H} Σ_{j=1}^{W} c(i, j)    (2)

c(i, j) = 0 if S_input(i, j) = S_model(i, j), and 1 otherwise,

where H and W are the image height and width, so the double summation iterates over all pixel locations. If our goal requires that f = 0, that is, if the two silhouettes overlap exactly, the optimal solution will be unique in terms of S_model. However, if our objective function f cannot be reduced to zero given inherent characteristics of the problem, it is likely that there are multiple optimal solutions. Any preference among those multiple optimal solutions should be incorporated into the cost function.

Since limbs are features of particular importance in any articulated figure, we do not want to lose track of those features. Limbs in a human body are often well characterized by their skeleton or medial axis derived from the silhouette image. Therefore, whenever ambiguity occurs, it is better to choose the direction in parametric shape space such that the model-generated silhouette covers the core area of the input silhouette. The core area includes the silhouette pixels close to the skeleton. This requirement can be incorporated in the cost function by imposing a higher penalty if S_model(β^(i)) does not overlap the region near the core area of the input silhouette S_input^(i).

Our new cost function replaces c(i, j) in Eq. (2) with

c(i, j) = 0 if S_input(i, j) = S_model(i, j), and d(i, j) otherwise,    (3)

d(i, j) = D(S_input)(i, j) + w · D(S̃_input)(i, j),

where D(S) is the Euclidean distance transform of binary image S, S̃ is the inverse image of S, and w is a weighting factor that controls the importance of coverage of the core area relative to the mismatch in the region outside of the silhouette area.

Note that the image d represents a distance map from the silhouette contour and can be computed once in a preprocessing step. Figure 1 depicts the coremap image d with w = 5.0. We call Eq. 2 along with Eq. 3 the core-weighted XOR.
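The core-weighted XOR can be prototyped in a few lines with an off-the-shelf Euclidean distance transform. The sketch below is ours, not the authors' code, and it assumes D(S) denotes the distance from each pixel to the nearest 'on' pixel of S; with that convention, D(S_input) is computed by applying scipy's distance transform to the complement of the silhouette, and D(S̃_input) by applying it to the silhouette itself.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def coremap(s_input, w=5.0):
    """Precompute d(i,j) = D(S_input) + w * D(inverse(S_input)) (Eq. 3).

    distance_transform_edt gives, at each nonzero pixel, the distance to
    the nearest zero pixel.
    """
    s = s_input.astype(bool)
    # Distance from each background pixel to the silhouette contour:
    # penalizes model pixels that fall far outside the input silhouette.
    d_out = distance_transform_edt(~s)
    # Distance from each silhouette pixel to the background: largest near
    # the medial axis, so failing to cover the 'core' is penalized most.
    d_in = distance_transform_edt(s)
    return d_out + w * d_in

def core_weighted_xor(s_input, s_model, d):
    """Sum d(i,j) over pixels where the two silhouettes disagree (Eq. 2 + 3)."""
    mismatch = s_input.astype(bool) ^ s_model.astype(bool)
    return float(d[mismatch].sum())
```

For example, with a 3x3 square silhouette inside a 5x5 image, a model silhouette that misses the center (core) pixel incurs a larger penalty than one that misses a corner pixel, which is exactly the preference the core weighting is meant to encode.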
Figure 1: (left) An input silhouette image. (right) The coremap image used to compute the core-weighted XOR.

4.3 Using Image Features and Physical Constraints

The core-weighted XOR objective function is sufficient in many cases. However, there are also cases in which it falls short of our performance expectations. Simply adjusting the core-weighted XOR parameters is insufficient to increase the generality of this method. In particular, the problems encountered include: appendages are sometimes not extended to completely fill the silhouette, feet are able to penetrate the ground plane, and joints are allowed to bend backward as well as forward. These sub-optimal configurations are found because they represent local minima of the core-weighted XOR function. To remove these spurious local minima and improve our performance, additional terms are included in the basic objective function.

In order to stretch the limbs to cover silhouette edges, we augment our objective function with a term that emphasizes matching edges, Eq. (4):

e(i, j) = −w_e if T_input(i, j) = S_model(i, j) = 1, and 0 otherwise,    (4)

where T is the image after edges have been extracted from the foreground image using a contrast threshold. This term is summed over all pixel positions along with the core-weighted XOR function.

In addition to the matching-edges term, semantic features are used for the limbs to fully cover the silhouette. We detect the limbs (hands and feet) along the contours of the foreground human body in some frames [12]. Along the contour of the silhouette, we calculate the curvature at each point on the contour. The curvature at those points is compared against a specified threshold, and points of high positive curvature are extracted. Using an estimate of the position of the hands from several previous frames, unlikely hand/foot positions are further eliminated as determined by the human body structure.

Moreover, the outlines of the arms and the V shape between the two legs can be detected based on the concave points along the edge contours; the neck and the crotch are usually the concave points along the edge contours, and the arms begin from the neck while the V shape between the legs is formed by the feet and the crotch. However, since we are only using a single camera, it is hard to distinguish between the left and right limbs. Therefore, we do not require an exact matching of the limbs from the model and the silhouettes. The hand/foot constraint term only focuses on reducing the distance between the model's hands/feet and the nearest detected ones:

l = −w_l √((i_m − i_d)² + (j_m − j_d)²)    (5)

where (i_m, j_m) is the limb position on the articulated model, (i_d, j_d) is the detected foot/hand position, and w_l is the weight factor.

Arms and the V shape are treated in the same way as the edge term, which means we try to get a maximum matching between the detected arms/legs and those of the model:

al(i, j) = −w_al if R_input(i, j) = S_model(i, j) = 1, and 0 otherwise,    (6)

where R is the image with arms/legs detected. This term is summed over all pixel positions along with the core-weighted XOR function. We use those two terms to synthesize the motions in Figure 8 and Figure 9.

It should be noted that sometimes there is no proper candidate (for either a hand or a foot), and in those cases we do not require that the hand/foot constraints be satisfied for those particular frames. Moreover, the position of the head can be detected along the contour using the horizontal projection of the silhouette [7] and temporal coherence. The head always lies at a locally highest position along the silhouette outline and near the maximum of the horizontal projection histogram. Similarly, a head constraint term can be used in the objective function.

Figure 2: Three examples of hands/feet, arms, legs and head detections along silhouette outlines, where the head is marked in red, body tips are white, arms are in green and legs are in red.

To disallow configurations that include ground plane penetration, we define a constraint on the limbs of the figure that essentially evaluates to infinity whenever penetration occurs. We also define constraints for each joint angle such that C_j^l ≤ β_j ≤ C_j^h, j = 7..30, and penalize configurations that violate the constraints. These configuration penalties are added to the objective function. These modifications help guide the silhouette-model matching process to a physically meaningful parameter set with less ambiguity.

Despite the application of physical constraints, temporal coherence, and knowledge of the anatomy, the 3D reconstruction is still under-constrained due to self-occlusions and ambiguities.

Instead of exploring complex yet unreliable solutions like texture detection, we allow users to specify keyframes interactively to obtain viable results. Similar solutions to this problem are reported in [5]. Every keyframe has an impact range, and for any frame inside this range, instead of using the configuration of the last frame as the starting body pose, one uses the interpolation of the keyframe and the previous frame to start the optimization. Enough keyframes can always guarantee a good reconstruction, but specifying keyframes is tedious. Our suggestion is the following. When body features like feet and hands are not discernible for several consecutive frames during the pre-processing stage, it is because occlusions or ambiguities occur. Users should add two keyframes, one before the occlusions happen and one after the occlusions disappear. Users do not necessarily specify all 30 parameters; they specify a subset of the parameters, usually those describing the position of the limbs or the twisted torso. Figure 3 shows a sequence that requires keyframing to be reconstructed correctly.
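The keyframe-seeded initial guess described above can be sketched as a per-parameter interpolation between the previous frame's solution and the (possibly partial) keyframe pose. This is a minimal illustration under our own naming, not the authors' code; it uses plain linear interpolation, and joint angles near a wrap-around would need angular interpolation instead.

```python
import numpy as np

def starting_pose(prev_solution, keyframe_pose, keyframe_mask, blend):
    """Initial guess for the optimizer inside a keyframe's impact range.

    prev_solution : pose solved for the previous frame (length-30 vector)
    keyframe_pose : user-specified pose at the keyframe (length-30 vector)
    keyframe_mask : boolean vector, True where the user actually specified
                    a parameter (keyframes may be partial)
    blend         : interpolation weight in [0, 1], e.g. the proximity of
                    the current frame to the keyframe within its range
    """
    guess = prev_solution.copy()
    lerp = (1.0 - blend) * prev_solution + blend * keyframe_pose
    # Only parameters the user actually keyed are pulled toward the
    # keyframe; the rest keep the previous frame's solution.
    guess[keyframe_mask] = lerp[keyframe_mask]
    return guess
```

For example, if a user keys only four joint angles to 90 degrees and the previous solution is zero everywhere, a blend of 0.5 moves just those four parameters to 45 degrees and leaves the rest untouched.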
Figure 3: Keyframes are necessary to reconstruct this sequence containing ambiguities and self-occlusions. Figure 10 shows an example where this keyframe technique is employed. We specify the left and right poses as the keyframe poses.

4.4 Biased Downhill Simplex Method

Extracting pose from a silhouette can produce multiple candidate solutions. The high dimensionality of the articulated model parameter space requires efficient local and global search algorithms. We chose local algorithms given their lower cost. A suitable initialization of model parameters is also required by many algorithms for motion tracking. Constraints like joint angle limits on parameters can be used. The movement of the subject can be restricted at the cost of losing generality; for example, movement symmetric (but out of phase) about the central sagittal plane is often assumed.

The objective function disallows the computation of analytic derivatives in terms of the motion parameters. The downhill simplex method [17] serves very well to minimize the proposed cost function, since it requires only function evaluations at specific multi-dimensional optimization parameters.

The simplex method can be easily adapted to our multi-dimensional human body model. The initial simplex of n dimensions consists of n + 1 vertices. Though we use this method for alignment and shape fitting as well, we explain its workings by describing only the reconstruction of the motion. Let the coefficients β^(i) = {β_1^(i−1), · · · , β_n^(i−1)} (the solution of the previous frame) be one of the initial points p_0 of the simplex. We choose the other remaining n points to be

p_i = p_0 + μ_i e_i,   i = 1..n,

where the e_i are n unit vectors and the μ_i are defined by the characteristic length scale of each component of β. The movement of this n-dimensional simplex is confined to the region of motion space close to the configuration of the current frame, and there is no need to perform exhaustive searches beyond certain ranges of movement between two consecutive frames. To further target the most relevant parameter space to search, parameter velocity is used to bias the simplex location. This bias and the size of the simplex are determined by limits on parametric acceleration that arise from principles of physics and basic anatomy.

Due to the inherent hierarchical properties of the human model, a hierarchical optimization algorithm is employed. The complete 30 configuration parameters are divided into 6 sub-groups, as dictated by anatomical considerations. Starting from the configuration of the previous frame, the model's global translation and rotation transformations are first computed using the downhill simplex method. Then the torso is appropriately rotated to minimize the error. The arm and leg positions are computed last, one by one.

To avoid accumulating errors, another simple technique that is often beneficial is to restart the optimization routine from the solution found in the current optimization stage. In the following section, we explore how these strategies are exploited toward finding the correct motion configuration from silhouette image sequences.

5 IMPLEMENTATION

Our reconstruction method depends on an optimization process executed in a motion parameter space. Toward this purpose, we first perform a shape fitting exercise on the very first frame so that our model fully explains the silhouette area from a specific view. The resulting shape parameters are then fixed for the remainder of the motion tracking process. It should be noted that we use the same objective function for initial alignment, shape fitting, and motion tracking.

5.1 Automatic Initialization

As mentioned before, the only input we exploit to reconstruct the 3D motion parameters is the silhouette images extracted from a video sequence. We use the area-based metric to compute the likelihood of our model for a given input silhouette image. Therefore, the projected shape of the 3D model needs to be as close as possible to the input silhouette. A suitable frame with no self-occlusions is used to adapt the shape of our model to the actor being observed. The optimization procedure is employed for this frame in order to fix the model's shape parameters and initialize the pose of the model. Lacking a suitable frame, the shape parameters and initial pose of the model can be set by hand.

The optimization parameters we determine in the initialization stage are a subset of the parameter set α ∪ β. Let t = {β_1^(1), β_2^(1), β_3^(1)} be a translation vector and let s = {β_larm^(1), β_rarm^(1), β_lleg^(1), β_rleg^(1)} be a subset of motion parameters that define the rotation angles around the z-axis of the upper arms and upper legs, respectively. We divide the initialization task into two consecutive optimization steps:

1. Alignment: find t and ŝ such that Eq. 1 is minimized.

2. Shape adjustment: find α and s so that Eq. 1 is minimized, given the optimal {t, ŝ} found at step 1.

The middle and right images in Figure 4 depict the results of the two steps described above. Note that our core-weighted XOR metric helps align the shape of the initial model at the center of the silhouette image. If the initial model were not aligned at the center of the silhouette, the subsequent shape fitting process would fail.

5.2 Motion Tracking

After the initialization described in the previous section, actual motion tracking is performed by searching for a motion configuration β_min^(i) that minimizes the objective function, as described in Section 4, for each frame i. The optimal parameter vector β_min^(i) found at frame i is used as the initial guess for the optimization routine of frame i + 1.
Figure 4: Automatic initialization step: (left) Initial status. (middle)
After alignment. (right) After shape fitting and ready for motion
fitting. Figure 5: Effectiveness of simple repetition of the same optimization
process.

(DOF) of the joint associated with a body part. A Euler angle is


used to describe the rotation of a joint. For example, an elbow joint
has one DOF with range of (0◦ , 160◦ ). The DOFs are included in
the choice of β ; however, the angle constraints need to be incorpo-
rated during optimization. Considering all terms mentioned in 4.3,
we change the cost function described in Eq. 2 to
 H W
 ∑i ∑ j (c(i, j) + e(i, j) + l + al(i, j))
f (β ) = if Ckl ≤ βk ≤ Ckh for k = 7...30 (7)
∞ otherwise.
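The hard joint-angle bounds in Eq. 7 can be sketched in code. This is a minimal illustration rather than the paper's implementation: `pixel_cost` is a hypothetical stand-in for the per-pixel terms of the objective, and the bound arrays correspond to C_k^l and C_k^h for the joint-angle entries of β.

```python
import numpy as np

def constrained_cost(beta, pixel_cost, lower, upper):
    """Sketch of an Eq. 7 style objective: sum a per-pixel cost over the
    H x W image when every joint angle lies inside its anatomical bounds,
    and return infinity otherwise so the simplex never steps there.

    beta       : full parameter vector (6 rigid-body entries + joint angles)
    pixel_cost : hypothetical callable mapping beta to an H x W cost image
    lower/upper: bounds for the joint-angle entries beta[6:] (k = 7...30)
    """
    beta = np.asarray(beta, dtype=float)
    angles = beta[6:]                       # entries 7..30 are joint angles
    if np.any(angles < lower) or np.any(angles > upper):
        return np.inf                       # reject anatomically impossible poses
    return float(np.sum(pixel_cost(beta)))  # sum of per-pixel terms over H x W
```

Returning infinity outside the bounds keeps the downhill simplex inside the feasible region without modifying the optimizer itself.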
where c, e, l and al are given by Eq. 3, Eq. 4, Eq. 5 and Eq. 6, and C_k^l and C_k^h are the lower and upper bounds of the joint angle β_k, respectively. Assigning constraints to each joint angle is very beneficial in removing ambiguity when deciding on the next step of the downhill search direction.

In using the downhill simplex method, restarting the minimization routine always helps to escape shallow local minima that are proximal to deeper local minima. Figure 5 shows how effective this simple repetition is. The example sequence consists of four main actions over a total of 250 frames: the frame range (0, 100) depicts the bending of the left arm, (100, 150) the bending of the right arm, (150, 200) the bending of the left leg, and (200, 250) the same actor bending his right leg. The runs using no repetition and 2-times repetition show relatively high cost values, and incorrect results are also obtained; two such incorrect results from these two tracking experiments at a specific frame (224) are shown in Figure 6. Tracking with 3-times repetition produced visually correct results for all four action sequences. In this case, the area with high cost in Figure 6 is mainly caused by the deformation of the shape of the subject. We note this by displaying the two frames with corresponding peaks in the graph. The sources of error are marked illustratively in Figure 7: regions A, D and E come from the deformation of the clothes, region B is caused by the silhouette being expanded by motion blur, and region C is a region with an inherent difference between the model and the real silhouette area.

Figure 6: First: input image; second: a correct result (with cost value 1231); third and fourth: the two incorrect results (with cost values 2167 and 2130, respectively).

5.3 Motion Refinement

The final tracking result cannot be perfect because the shape of the articulated model does not match the actor in the video exactly, there are occlusions and ambiguities, and there are many more DOFs for each articulated joint. However, our system generates reasonably good motion data and allows users to refine the generated motion in a post-processing stage. The user can also add new constraints to reduce the search space for particular parameters to get better tracking results. In addition, given our motion data, the user can modify the motion with any 3D character animation tool, such as Poser® from Curious Labs.

6 Results

We illustrate our method using several full-body tracking sequences. All the video clips are sampled at a rate of 30 Hz, and some of them are from the CMU Graphics Lab Motion Capture Database [8]. Even though we use only one camera and do not employ any markers, we still obtain desirable results. Figure 8, Figure 9 and Figure 10 are examples of tracking artists and athletes. The first, a dance sequence (Figure 8), is relatively straightforward to track because there are no occlusions and the motion is mainly planar; our tracker manages to track these fronto-parallel motions. The second, a badminton sequence (Figure 9), is more complex because of the multiple occlusions occurring during the action. However, we can reconstruct the motion when we assume that both legs move with constant velocity. In addition, we successfully capture the in-depth (out-of-plane) motions of the player. These sequences could all be tracked without user intervention. The last sequence (Figure 10) contains many more challenges, given its complex torso rotations and arm swings. After introducing keyframing, our tracker estimates the uncertain positions of the arms and torso, and recovers the motions of the arms even though they are totally occluded in the silhouettes. For this sequence, six keyframes were used. More reconstruction results can be found at http://www.cse.ohio-state.edu/~chenyis/research/motion/index.htm.

The time expended for analysis is about 3 sec. per frame for simple sequences and about 5 sec. per frame for difficult ones, when executed on a Pentium IV PC with a 2 GHz CPU and 512 MB of memory. It should be noted that [3] reported times of 7 to 12 sec. per
frame for similar operations. We use w = 1.0 from Eq. 3, w_e = 20 from Eq. 4, w_l = 20 from Eq. 5 and w_al = 20 from Eq. 6 in the cost function for all motion synthesis.

Figure 7: Analyzing the sources of error that appear at the two peaks in the third plot of Figure 6: A, D and E are from deformation of the clothes, B is from the motion blur effect, and C is from the inherent difference between the model and the real image.

7 Conclusion and Future Work

We presented a simple and robust technique to reconstruct the 3D motion parameters of a human model using image silhouettes. A novel cost metric called core-weighted XOR was introduced and used consistently for automatic alignment, shape fitting and motion tracking. The computation time of the cost function is directly related to the overall performance of our method. Currently, our implementation does not exploit any hardware acceleration; in the future, we intend to accelerate the weighted-XOR computation using the features of modern graphics hardware. Our work is closest to that of Sminchisescu [20] and that of DiFranco [5], and we compare favorably in terms of generality to the results they report. By detecting more human features, such as the head, arms, hands, legs and feet, we can further improve the correctness of registering our model with the images and recovering the 3D poses.

One of the biggest challenges in 3D motion reconstruction from a single silhouette image is the inherent ambiguity caused by the self-occlusion of different body parts. Usually, internal edge information can be used to resolve the ambiguity to a certain degree; however, spurious edges caused by shadows and, above all, by the patterns of clothes can result in incorrect reconstruction. In future work, we would like to study how we can exploit various spatial and temporal features of the silhouette image sequence to infer the correct motion of self-occluded parts.

Acknowledgements

The authors would like to thank the Advanced Computing Center for Art and Design at Ohio State University for their support, specifically for the use of their motion capture lab and software environment. The authors would also like to thank the folks at CMU for making their motion capture database available. The database was created with funding from NSF EIA-0196217. This work was supported, in part, by U.S. National Science Foundation Grant ITR-IIS-0428249 and by the Secure Knowledge Management Program, Air Force Research Laboratory, Information Directorate, Wright-Patterson AFB.

References

[1] A.F. Bobick and J.W. Davis. The Representation and Recognition of Action Using Temporal Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–267, 2001.
[2] C. Bregler and J. Malik. Tracking People with Twists and Exponential Maps. In Proceedings of IEEE CVPR, pages 8–15, 1998.
[3] J. Carranza, C. Theobalt, M. Magnor, and H.-P. Seidel. Free-Viewpoint Video of Human Actors. In Proceedings of SIGGRAPH 2003, volume 22, no. 3, pages 569–577, 2003.
[4] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In Proceedings of IEEE CVPR, volume 2, pages 126–133, 2000.
[5] D.E. DiFranco, T. Cham, and J.M. Rehg. Reconstruction of 3-D Figure Motion from 2-D Correspondences. In Proc. Conf. Computer Vision and Pattern Recognition, pages 307–314, 2001.
[6] M. Gleicher and N. Ferrier. Evaluating Video-Based Motion Capture. In Proceedings of Computer Animation, pages 75–80, 2002.
[7] I. Haritaoglu, D. Harwood, and L. Davis. Ghost: A Human Body Part Labeling System Using Silhouettes. In International Conference on Pattern Recognition, volume 1, pages 77–82, 1998.
[8] J.K. Hodgins. Carnegie Mellon University Graphics Lab Motion Capture Database, http://mocap.cs.cmu.edu/.
[9] T. Horprasert, D. Harwood, and L.S. Davis. A Statistical Approach for Real-Time Robust Background Subtraction and Shadow Detection. In IEEE Frame-Rate Workshop, 1999.
[10] I. Kakadiaris and D. Metaxas. Model-Based Estimation of 3D Human Motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 22, December 2000.
[11] J. Lee, J. Chai, P.S.A. Reitsma, J.K. Hodgins, and N.S. Pollard. Interactive Control of Avatars Animated with Human Motion Data. In Proceedings of SIGGRAPH 2002, pages 491–500. ACM Press, 2002.
[12] M.W. Lee, I. Cohen, and S.K. Jung. Particle Filter with Analytical Inference for Human Body Tracking. In IEEE Workshop on Motion and Video Computing, 2002.
[13] H.P.A. Lensch, W. Heidrich, and H. Seidel. Automated Texture Registration and Stitching for Real World Models. In Proceedings of Pacific Graphics 2000, 2000.
[14] M. Leventon and W. Freeman. Bayesian Estimation of 3-D Human Motion from an Image Sequence. Technical Report TR-98-06, Mitsubishi Electric Research Laboratory, Cambridge, MA, 1998.
[15] T.B. Moeslund and E. Granum. A Survey of Computer Vision-Based Human Motion Capture. Computer Vision and Image Understanding, 81(3):231–268, 2001.
[16] V. Pavlovic, J.M. Rehg, T. Cham, and K.P. Murphy. A Dynamic Bayesian Network Approach to Figure Tracking Using Learned Dynamic Models. In Intl. Conf. on Computer Vision, pages 94–101, 1999.
[17] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, 1988.
[18] L. Ren, G. Shakhnarovich, J. Hodgins, H. Pfister, and P. Viola. Learning Silhouette Features for Control of Human Motion. In Proceedings of the SIGGRAPH 2004 Conference on Sketches & Applications, August 2004.
[19] C. Sminchisescu. Consistency and Coupling in Human Model Likelihoods. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 27–32, 2002.
[20] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D Human Tracking. In Proc. Conf. Computer Vision and Pattern Recognition, pages 69–76, 2003.
[21] C. Sminchisescu and B. Triggs. Estimating Articulated Human Motion with Covariance Scaled Sampling. International Journal of Robotics Research, 2003.
[22] C.R. Wren, A. Azarbayejani, T. Darrell, and A.P. Pentland. Pfinder: Real-Time Tracking of the Human Body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 1997.
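The velocity-biased simplex construction used by the downhill simplex search in Section 5 (a base vertex advanced along the parameter velocity, plus n vertices each offset by a characteristic length μ_i along the unit coordinate vectors e_i) can be sketched as follows. The argument names and the 30 Hz frame interval are illustrative assumptions, not the authors' code.

```python
import numpy as np

def initial_simplex(beta_prev, velocity, mu, dt=1.0 / 30.0):
    """Sketch of the biased starting simplex for one frame: the base
    vertex is the previous frame's optimum advanced along the parameter
    velocity, and the other n vertices each offset one component by its
    characteristic length scale mu[i] (the e_i are unit coordinate vectors)."""
    beta0 = np.asarray(beta_prev, dtype=float) + dt * np.asarray(velocity)
    vertices = [beta0]
    for i, m in enumerate(mu):            # one extra vertex per dimension
        e_i = np.zeros_like(beta0)
        e_i[i] = 1.0
        vertices.append(beta0 + m * e_i)  # beta0 + mu_i * e_i
    return np.array(vertices)             # shape (n + 1, n)
```

Confining the simplex this way restricts the search to configurations reachable from the previous frame within plausible acceleration limits, so no exhaustive search over the full pose space is needed.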
Figure 8: Tracking a dancer: representative frames. The top row shows the footage, the middle row shows the match between the silhouette (white area) and our model (colored area), and the bottom row shows the animation after rendering. The limb-stretching term is used here.
Figure 9: Tracking a badminton player: representative frames. The limb-stretching term for the legs is used here.
Figure 10: Tracking an actor: representative frames. We employ the keyframing technique on the arms to reconstruct this sequence.
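The frame-to-frame tracking loop of Section 5.2 — seeding each frame with the previous frame's optimum and restarting the downhill simplex from its own minimum to escape shallow local minima — can be sketched with SciPy's Nelder-Mead implementation. The objective functions and the restart count are placeholders standing in for the paper's silhouette cost, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def track_frame(f, beta_prev, restarts=3):
    """Minimize the cost f for one frame, starting from the previous
    frame's optimum and restarting the downhill simplex from each result
    (Section 5.2 reports 3 repetitions sufficed for all four actions)."""
    beta = np.asarray(beta_prev, dtype=float)
    for _ in range(restarts):
        res = minimize(f, beta, method="Nelder-Mead")  # downhill simplex
        beta = res.x                     # restart from the current minimum
    return beta

def track_sequence(f_per_frame, beta_init):
    """Track a sequence: each frame's optimum seeds the next frame."""
    betas = []
    beta = np.asarray(beta_init, dtype=float)
    for f in f_per_frame:                # one objective per video frame
        beta = track_frame(f, beta)
        betas.append(beta)
    return betas
```

Because each restart rebuilds a fresh simplex around the current minimum, the optimizer can step out of a shallow basin that the first run settled into, at the cost of a few extra function evaluations per frame.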