Markerless Monocular Motion Capture Using Image Features and Physical Constraints
Yisheng Chen∗ (The Ohio State University), Jinho Lee† (MERL), Rick Parent‡ (The Ohio State University), Raghu Machiraju§ (The Ohio State University)
Keywords: computer animation, model-based reconstruction

∗ e-mail: chenyis@cse.ohio-state.edu
† e-mail: leejh@merl.com
‡ e-mail: parent@cse.ohio-state.edu
§ e-mail: raghu@cse.ohio-state.edu

1 INTRODUCTION

Motion Capture (mocap) has become a mainstay of many computer animated productions and is often used to capture human motion. Commercial mocap systems capture motion by fixing instrumentation on the target figure, such as optical, magnetic, or mechanical sensors, passive optical reflectors, or active optical emitters. However, various constraints imposed by the instrumentation tend to limit the usefulness of mocap, either by restricting the physical space of the movement, restricting the environment in which motion can be captured, or restricting the movement itself (not to mention restrictions due to cost).

While motion libraries and motion retargetting techniques extend the usefulness of mocap, there is a need to develop algorithms and methods for commodity sensors. For example, there is often a need to synthesize the motion of a figure in a video clip obtained from the Web or from a surveillance camera. Conversely, one would like to capture human motion with the least amount of equipment possible - using a single consumer-grade camera. Additionally, we wish to produce results at rates that would make such a system useful in an interactive environment and therefore suitable for prototyping animated sequences and exploring initial character motions.

Model-based approaches use a 3D articulated model of the human body to estimate the pose and shape such that the model's 2D projection closely fits the captured person in the image. Features such as intensities, edges, and silhouettes are widely used. Processing color and texture information is computationally expensive, and changes in illumination present significant problems. It should be noted that much of the scene and camera information is unknown. As a result, many of the efforts to track human figures in video, including ours, use silhouettes. Silhouettes are less sensitive to noise than edges, but fine details might be lost in their extraction.

The use of silhouettes does create problems. Extracting pose from a silhouette can produce multiple candidate solutions. The high dimensionality of the articulated model parameter space requires efficient yet robust search algorithms. A suitable initialization of model parameters is also required by many algorithms for motion tracking. Constraints like joint angle limits on parameters can be used, as well as knowledge about physics and the anatomy.

By extracting the silhouette of the foreground figure and using a model-based approach, the problem is re-formulated as a local, optimized search of the pose space. The pose space, in turn, consists of 6 rigid body transformation parameters plus the internal joint angles of the figure.

Occlusion and ambiguities arising from the use of a single view often cause spurious reconstruction of the captured motion. Therefore, reconstruction is successful only when 2D correspondence is established manually for each frame of the sequence [5], or when the motion under scrutiny is limited [19, 21]. There is a paucity of methods that allow for general motion and require little manual intervention. Additionally, these methods should be efficient.
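The pose space introduced above - 6 rigid-body transformation parameters plus the figure's internal joint angles - maps naturally onto a flat parameter vector for an optimizer. The sketch below is illustrative only: the 6 + 24 = 30 split follows the model described later in Section 3, but the packing order and names are our assumptions, not the paper's exact representation.

```python
import numpy as np

# Illustrative flat parameter vector for a pose: 6 global rigid-body
# parameters (translation + rotation) followed by internal joint angles.
# The 6 + 24 = 30 layout follows Section 3; the ordering is our assumption.
N_GLOBAL = 6
N_JOINTS = 24

def pack_pose(translation, rotation, joint_angles):
    """Pack the global transform and joint angles into one vector beta."""
    beta = np.empty(N_GLOBAL + N_JOINTS)
    beta[0:3] = translation    # global position (x, y, z)
    beta[3:6] = rotation       # global orientation (e.g., Euler angles)
    beta[6:] = joint_angles    # one entry per joint degree of freedom
    return beta

beta = pack_pose([0.0, 0.9, 3.0], [0.0, np.pi, 0.0], np.zeros(N_JOINTS))
```

A single vector like this is what a general-purpose optimizer perturbs when searching the pose space.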
We describe herein a method to reconstruct arbitrary motion sequences that is model-based, operates on images, and exploits knowledge about the anatomy. Consequently, our method is simple, efficient, and requires limited manual intervention. Will our method reconstruct all motion sequences successfully? The answer is no. Occlusion of limbs by larger parts of the anatomy cannot always be resolved through the use of image silhouettes. On the other hand, we wish to explore the limits of efficient monocular motion reconstruction. Our results show that we can reconstruct increasingly complex motion when we include a larger set of anatomical features and constraints and employ robust image comparison metrics. We now provide an overview of our approach.

1.2 Overview of Our Approach

In this paper, we explore inverse methods which reconstruct motion from a single camera. We examine techniques which use silhouettes and edges rather than texture. Since motion is our primary interest and not identification, silhouettes and edges provide ample grounds for developing robust methods. Inverse design methods like ours employ optimization algorithms to minimize an objective function. Resolving occlusion between various parts of the human body is certainly an ominous challenge, given the difficulty of matching an imperfect, highly flexible, self-occluding model to cluttered image features. Viable human models have at least 20 joint parameters subject to highly nonlinear physical constraints. Also, a significant number of the possible degrees of freedom afforded by the various joints are not uniquely determined by any given image. Thus, monocular motion capture is ill-posed and non-linear. Methods reported in the literature are either complex and expensive for computer graphics applications or impose severe restrictions on the type of motion that can be captured.

The work described herein is an exploration of simple yet robust cost functions and incremental search strategies. Our focus on simpler cost functions will eventually allow for the realization of near real-time reconstruction often needed for computer graphics applications, while at the same time making few assumptions about the motion being tracked.

The starting point of our method is a model of a human actor with multiple quadrics assembled at various joints. After an initial pose is established, either automatically or with the aid of the user, a frame-to-frame tracking procedure ensues in which the solution of the last frame is the initial guess for the next frame. In addition to silhouettes and edges, our objective function uses anatomical and physical constraints to aid in disambiguating the view.

The main contributions of our work include:

• development of a new core-weighted XOR metric for model localization in an image

• robust detection and tracking of body parts such as the head, arms, feet, and the V between the legs in the image

• inclusion of image features and anatomical constraints to reduce the size of the optimal search space

Our new method is shown to be capable of reconstructing full articulated body motion efficiently. Additionally, our results include video sequences of varying length and arbitrary illumination, and the reconstruction is quite faithful to the original sequence.

The paper is organized as follows. In Section 2 we describe pertinent previous work in motion reconstruction. Our human model is briefly discussed in Section 3. We describe in detail our method for motion reconstruction in Section 4. Implementation considerations are given in Section 5. Section 6 includes results which demonstrate the effectiveness of our technique. Concluding remarks and pointers to future work are described in Section 7.

2 PREVIOUS WORK

We describe work that is closely related to our own. Model-based approaches use a 3D articulated model of the human body to estimate the pose and shape such that the model's 2D projection closely fits the captured person in the image. Bregler and Malik [2] propose a framework to estimate 3D poses of each body segment in a kinematic chain using a twist representation of general rigid-body motion. However, their reconstruction assumes lateral symmetry in the tracked motion. Pavlovic et al. [16] present a system to track fronto-parallel motion using a dynamic Bayesian network approach. Their work focuses on the dynamics of human behavior as described by bio-mechanical models.

Deutscher et al. [4] use a human model to build a framework of a kinematic chain using limbs of conical sections for computational simplicity and high-level interpretation of output. They use edges and silhouettes in their cost function to estimate pose from multiple camera views. A condensation algorithm is employed to search the high-dimensional space without restrictions. Carranza et al. [3] use silhouettes from multi-view synchronized video footage to reconstruct the motion of a 3D human body model, and then re-render the model's appearance and motion interactively from any novel view point. Kakadiaris et al. [10] use a spatio-temporal analysis to track upper body motions from multiple cameras.

Monocular markerless motion capture has been studied by a few researchers. Sminchisescu and Triggs [20] achieve successful motion synthesis based on the propagation of a mixture of Gaussian density functions, each representing probable 3D configurations of the human body over time. DiFranco et al. [5] propose a batch framework to reconstruct poses from a single view point based on 2D correspondences subject to a number of constraints. Their methods are shown to be successful when deployed on moderately difficult sequences that include athletic and dance movements.

Alternative approaches to model-based tracking of human bodies have also been used. Bobick and Davis [1] construct temporal templates from a sequence of silhouette images and present a method to match the temporal template against stored views of the known actions. Wren et al. [22] use 2D blobs for tracking motion in a video image. Leventon and Freeman [14] take a statistical approach and use a set of motion examples to build a probability model for the purpose of recovering 3D joint angles for a new input sequence.

Lee et al. [11] present a vision-based interface to control avatars in virtual environments. They extract visual features from the input silhouette images and search for the best motion matching the visual features obtained from rendered versions of actions in a motion database. Ren et al. [18] use the AdaBoost algorithm to select a few best local features from silhouette images to estimate yaw and body configurations. Finally, a survey of computer vision-based human motion capture techniques is presented by Moeslund and Granum [15].

We reconstruct general 3D motion from the silhouettes extracted from a single view without relying on a motion database. As a consequence, our work is closest to that of Sminchisescu. However, we demonstrate the tracking of motions that are more complex than those presented by Sminchisescu, and we track at speeds that are equal to or faster than those reported.
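The frame-to-frame strategy outlined in the overview - solve for the pose in one frame, then use that solution as the initial guess for the next - can be sketched as follows. This is a minimal stand-in, not the paper's implementation: the cost function is left abstract, and SciPy's Nelder-Mead simplex substitutes for whatever optimizer the method actually uses.

```python
import numpy as np
from scipy.optimize import minimize

def track_sequence(silhouettes, cost, beta0):
    """Estimate one pose vector per frame by local optimization.

    silhouettes: per-frame observations (e.g., binary silhouette images)
    cost: callable cost(beta, observation) -> scalar penalty
    beta0: initial pose for the first frame (automatic or user-assisted)
    """
    poses = []
    beta = beta0
    for obs in silhouettes:
        # Seed the search with the previous frame's solution.
        result = minimize(cost, beta, args=(obs,), method="Nelder-Mead")
        beta = result.x
        poses.append(beta)
    return poses
```

Because consecutive frames differ only slightly, a local search seeded this way tends to stay in the correct basin; joint-limit and anatomical constraints would be folded into the cost term.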
3 OUR HUMAN MODEL

Our human model is a combination of spheres and cylinders, with an anisotropic scale and rigid transformation defining an object coordinate system for each part. The model is shown in Figure 8 and Figure 9. The parameters that describe our human model consist of two components - shape and motion parameters. We define twelve shape parameters which describe the scale factors to be multiplied by a predefined 'standard' size of each body part. Motion parameters are composed of 6 global transformation parameters and a total of 24 joint angles. Each joint has anywhere from one to three degrees of freedom. Given N frames from a video sequence, our human body model at a specific frame i is represented by M^(i) = M(α, β^(i)), where α = {α_1, α_2, ..., α_m} (m = 12) is the shape parameter vector and β^(i) = {β_1^(i), β_2^(i), ..., β_n^(i)} (n = 30) is the motion parameter vector.

4 THE OPTIMIZATION STRATEGY

The use of a single view makes the reconstruction problem ill-posed, as stated earlier. We now describe our optimization strategy to fit the motion parameters to a single stream of silhouette images. First, we describe how we extract silhouettes from a video sequence. Then, we present an objective function based on the difference of area between the model-generated silhouette and the input silhouette. Next, we discuss how we improve the performance of our optimization algorithm. We achieve this by incorporating edge information along with both physical and anatomical constraints into the objective function. Finally, we discuss a non-linear multidimensional optimization algorithm we employ to minimize the proposed objective function.

4.1 Silhouette Extraction

The input to our motion synthesis system is a sequence of silhouette images that describe the gross motion of the human body. We avoid using coloration or texture information in order to minimize the effects of variable viewing conditions. In addition to using silhouettes, we also employ high-contrast edges to further our optimization strategy, as explained below. Given video footage, there exist several methods to obtain silhouette images. Although silhouettes are less sensitive to noise than edges, fine details of body structure and motion are often lost in their extraction. We employed the following semi-automatic methods to extract the foreground human figures from the background.

First, we identify frames wherein the human figure is absent and construct a statistical model of the background. The mean and standard deviation at all pixels over all frames (and for every RGB color channel) are computed. If a pixel differs too much from the background on any color channel, the pixel is treated as a foreground pixel, and thus the silhouette is recovered in a pixel-wise manner. In case there exists no image with just the background, we identify frames that describe motion of figures moving through the scene in a predictable fashion. We then compute a median image over the whole sequence, followed by building a weighted-mean/variance background model to extract the silhouette, provided that over time there is more non-motion at a pixel than there is motion [9]. Otherwise, we select several frames with little overlap of the human figures, extract the figures manually, and then combine the resulting images to obtain a composite background image. If there exist background areas that cannot be recovered, they are considered foreground anyway.

Once we derive a background model, we can subtract the background from all the images in the sequence. Often, we do not have a perfect background model, and therefore the silhouette images may suffer from the presence of noise. The silhouette quality can be improved using morphological operations, such as dilation and size-based object filtering.

4.2 Core-Weighted XOR

Our goal is to find the motion parameter set β^(i) that minimizes the total penalty

E(β^(i)) = f(S_input^(i), S_model(β^(i)))    (1)

for a suitable cost function f, where S_input^(i) and S_model(β^(i)) are the ith input silhouette image and a silhouette image generated by M^(i), respectively. For the sake of clarity, we denote S_input^(i) and S_model(β^(i)) simply as S_input and S_model, respectively.

How does one design a viable and robust cost function f as described in Eq. (1)? The easiest way to measure the difference of two binary images is to count the number of 'on' pixels when a pixel-wise XOR operation is applied to the two images [13]. In this case,

f(S_input, S_model) = Σ_{i=1..H} Σ_{j=1..W} c(i, j)    (2)

c(i, j) = 0 if S_input(i, j) = S_model(i, j), and 1 otherwise,

where the double summation iterates over all pixel locations. If our goal requires that f = 0, that is, if the two silhouettes overlap exactly, the optimal solution will be unique in terms of S_model. However, if our objective function f cannot be reduced to zero given inherent characteristics of the problem, it is likely that there are multiple optimal solutions. Any preference among those multiple optimal solutions should be incorporated into the cost function.

Since limbs are features of particular importance in any articulated figure, we do not want to lose track of those features. Limbs in a human body are often well characterized by their skeleton or medial axis derived from the silhouette image. Therefore, whenever ambiguity occurs, it would be better to choose the direction in parametric shape space such that the model-generated silhouette covers the core area of the input silhouette. The core area includes the silhouette pixels close to the skeleton. This requirement can be incorporated in the cost function by imposing a higher penalty if S_model(β^(i)) does not overlap any region near the core area of the input silhouette S_input^(i).

Our new cost function replaces c(i, j) in Eq. (2) with

c(i, j) = 0 if S_input(i, j) = S_model(i, j), and d(i, j) otherwise,    (3)

d(i, j) = D(S_input)(i, j) + w D(S̃_input)(i, j),

where D(S) is the Euclidean distance transform of a binary image S, S̃ is the inverse image of S, and w is a weighting factor that controls the importance of coverage of the core area relative to the mismatch in the region outside of the silhouette area.

Note that the image d represents a distance map from the silhouette contour and can be computed once in a preprocessing step. Figure 1 depicts the coremap image d with w = 5.0. We call Eq. 2 along with Eq. 3 core-weighted XOR.
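As a concrete sketch, the core-weighted XOR of Eqs. (2) and (3) takes only a few lines of NumPy/SciPy: the coremap d is precomputed once per input frame from its silhouette, and every mismatched pixel is charged d(i, j). Function and variable names here are ours, not the paper's.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def coremap(s_input, w=5.0):
    """d = D(S_input) + w * D(inverse(S_input)), Euclidean distance maps."""
    return distance_transform_edt(s_input) + w * distance_transform_edt(~s_input)

def core_weighted_xor(s_input, s_model, w=5.0):
    """Total penalty: sum of d(i, j) over pixels where the silhouettes differ."""
    d = coremap(s_input, w)           # precompute once per input frame
    mismatch = s_input ^ s_model      # pixel-wise XOR of binary images
    return float(np.sum(d[mismatch]))
```

Inside the input silhouette, D(S_input) is largest along the medial axis, so failing to cover the core costs the most; outside it, the w-scaled term penalizes the model silhouette in proportion to how far it spills from the contour.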
The arms and the V shape are treated the same way as the edge term, which means we try to obtain a maximal matching between the detected arms/legs and those of the model.
5 IMPLEMENTATION

frame for similar operations. We use w = 1.0 from Eq. 3, w_e = 20 from Eq. 4, w_l = 20 from Eq. 5, and w_al = 20 from Eq. 6 in the cost function for all motion synthesis.

Figure 7: Analyzing the sources of error that appear at two peaks in the third plot of Figure 6: A, D, and E are due to deformation of clothes, B is due to motion blur, and C is due to the inherent difference between the model and the real image.

7 CONCLUSION AND FUTURE WORK

We presented a simple and robust technique to reconstruct the 3D motion parameters of a human model using image silhouettes. A novel cost metric called core-weighted XOR was introduced and consistently used for automatic alignment, shape fitting, and motion tracking. The computation time of the cost function is directly related to the overall performance of our method. Currently our implementation does not exploit any hardware acceleration. In the future, we intend to accelerate the weighted-XOR computation using features of modern graphics hardware. Our work is closest to that of Sminchisescu [20] and that of DiFranco [5]. We compare favorably in terms of generality to the results they report. By detecting more human features like the head, arms, hands, legs, and feet, we can further improve the correctness of registering our model with all the images and recovering the 3D poses.

One of the biggest challenges in 3D motion reconstruction from a single silhouette image is the inherent ambiguity caused by self-occlusion of different body parts. Usually, internal edge information can be used to resolve the ambiguity to a certain degree. However, spurious edges caused by shadows and, more often, by the patterns of clothes can result in incorrect reconstruction. Finally, in future work we would like to study how we can exploit various spatial and temporal features of the silhouette image sequence to infer the correct motion of self-occluded parts.

ACKNOWLEDGEMENTS

The authors would like to thank the Advanced Computing Center for Art and Design at Ohio State University for their support, specifically for the use of their motion capture lab and software environment. The authors would also like to thank the folks at CMU for making their motion capture database available. The database was created with funding from NSF EIA-0196217. This work was supported, in part, by U.S. National Science Foundation Grant ITR-IIS-0428249 and by the Secure Knowledge Management Program, Air Force Research Laboratory, Information Directorate, Wright-Patterson AFB.

REFERENCES

[1] A.F. Bobick and J.W. Davis. The Representation and Recognition of Action using Temporal Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–267, 2001.
[2] C. Bregler and J. Malik. Tracking People with Twists and Exponential Maps. In Proceedings of IEEE CVPR, pages 8–15, 1998.
[3] J. Carranza, C. Theobalt, M. Magnor, and H.-P. Seidel. Free-Viewpoint Video of Human Actors. In Proceedings of SIGGRAPH 2003, volume 22(3), pages 569–577, 2003.
[4] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In Proceedings of IEEE CVPR, volume 2, pages 126–133, 2000.
[5] D.E. DiFranco, T. Cham, and J.M. Rehg. Reconstruction of 3-D Figure Motion from 2-D Correspondences. In Proc. Conf. Computer Vision and Pattern Recognition, pages 307–314, 2001.
[6] M. Gleicher and N. Ferrier. Evaluating Video-Based Motion Capture. In Proceedings of Computer Animation, pages 75–80, 2002.
[7] I. Haritaoglu, D. Harwood, and L. Davis. Ghost: A Human Body Part Labeling System Using Silhouettes. In International Conference on Pattern Recognition, volume 1, pages 77–82, 1998.
[8] J.K. Hodgins. Carnegie Mellon University Graphics Lab Motion Capture Database, http://mocap.cs.cmu.edu/.
[9] T. Horprasert, D. Harwood, and L.S. Davis. A Statistical Approach for Real-Time Robust Background Subtraction and Shadow Detection. In IEEE Frame-Rate Workshop, 1999.
[10] I. Kakadiaris and D. Metaxas. Model-Based Estimation of 3D Human Motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, December 2000.
[11] J. Lee, J. Chai, P.S.A. Reitsma, J.K. Hodgins, and N.S. Pollard. Interactive Control of Avatars Animated with Human Motion Data. In Proceedings of SIGGRAPH 2002, pages 491–500. ACM Press, 2002.
[12] M.W. Lee, I. Cohen, and S.K. Jung. Particle Filter with Analytical Inference for Human Body Tracking. In IEEE Workshop on Motion and Video Computing, 2002.
[13] H.P.A. Lensch, W. Heidrich, and H. Seidel. Automated Texture Registration and Stitching for Real World Models. In Proceedings of Pacific Graphics 2000, 2000.
[14] M. Leventon and W. Freeman. Bayesian Estimation of 3-D Human Motion from an Image Sequence. Technical Report TR-98-06, Mitsubishi Electric Research Laboratory, Cambridge, MA, 1998.
[15] T.B. Moeslund and E. Granum. A Survey of Computer Vision-Based Human Motion Capture. Computer Vision and Image Understanding, 81(3):231–268, 2001.
[16] V. Pavlovic, J.M. Rehg, T. Cham, and K.P. Murphy. A Dynamic Bayesian Network Approach to Figure Tracking Using Learned Dynamic Models. In Intl. Conf. on Computer Vision, pages 94–101, 1999.
[17] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, 1988.
[18] L. Ren, G. Shakhnarovich, J. Hodgins, H. Pfister, and P. Viola. Learning Silhouette Features for Control of Human Motion. In Proceedings of the SIGGRAPH 2004 Conference on Sketches & Applications, August 2004.
[19] C. Sminchisescu. Consistency and Coupling in Human Model Likelihoods. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 27–32, 2002.
[20] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D Human Tracking. In Proc. Conf. Computer Vision and Pattern Recognition, pages 69–76, 2003.
[21] C. Sminchisescu and B. Triggs. Estimating Articulated Human Motion with Covariance Scaled Sampling. International Journal of Robotics Research, 2003.
[22] C.R. Wren, A. Azarbayejani, T. Darrell, and A.P. Pentland. Pfinder: Real-Time Tracking of the Human Body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 1997.
Figure 8: Tracking a dancer: displaying representative frames. The top row shows the footage, the middle row is the matching between
silhouette (white area) and our model (colored area), and the bottom row is the animation after rendering. The limb stretching term is used
here.
Figure 9: Tracking a badminton player: displaying representative frames. The limb stretching term for legs is used here.
Figure 10: Tracking an actor: displaying representative frames. We employ the keyframe technique on the arms to reconstruct this sequence.