Q. Cai, Human Motion Analysis: A Review
1. INTRODUCTION
Human motion analysis is receiving increasing attention from
computer vision researchers. This interest is motivated by applications over a wide spectrum of topics. For example, segmenting the parts of the human body in an image, tracking the
movement of joints over an image sequence, and recovering the
underlying 3D body structure are particularly useful for analysis of athletic performance, as well as medical diagnostics. The
capability to automatically monitor human activities using computers in security-sensitive areas such as airports, borders, and
building lobbies is of great interest to the police and military.
With the development of digital libraries, the ability to automatically interpret video sequences will save tremendous human
effort in sorting and retrieving images or video sequences using content-based queries. Other applications include building
man-machine user interfaces, video conferencing, etc. This paper gives an overview of recent developments in human motion
analysis from image sequences using a hierarchical approach.
In contrast to our previous review of motion estimation of a
rigid body [1], this survey concentrates on motion analysis of the
Copyright © 1999 by Academic Press. All rights of reproduction in any form reserved. 1077-3142/99 $30.00
FIG. 1. Relationship among the three areas of human motion analysis addressed in the paper.
FIG. 3. Past research on tracking of human motion without using body parts.
FIG. 5. A stick-figure human model (based on Chen and Lee's work [11]).
of joints between successive frames. These assumptions impose constraints on feature correspondence, decrease the search
space, and eventually, result in a unique match.
The simplest representation of a human body is the stick figure, which consists of line segments linked by joints. The motion
of joints provides the key to motion estimation and recognition
of the whole figure. This concept was initially considered by
Johansson [9], who marked joints as moving light displays
(MLD). Along this vein, Rashid [10] attempted to recover a connected human structure with projected MLD by assuming that
points belonging to the same object have higher correlations in
FIG. 6. A 2D contour human model (similar to Leung and Yang's model [26]).
multiple new models are produced to replace the old ones, with
each of them representing an emerging subpart. Joints are determined based on the relative motion and shape of two moving
subparts. These methods usually require small image motion between successive frames, which becomes less of a concern as the video sampling rate increases. Our main concern remains segmentation under normal circumstances. Among the methods addressed, Shio and Sklansky [15] relied on motion as the cue for segmentation, while the latter two approaches are based on intensity or texture. To the best of our knowledge, robust segmentation techniques are yet to be developed. Since this problem is the major obstacle to most work involving low-level processing, we will not mention it repeatedly in later discussions.
Finally, we want to address the recent work by Rowley and Rehg [20], which focuses on the segmentation of optical flow fields of articulated objects. It is an extension of motion analysis of rigid objects using the expectation-maximization (EM) algorithm [21]. Compared to [21], the major contribution of [20] is the addition of kinematic motion constraints on the data at each pixel. The
strength of this work lies in its combination of motion segmentation and estimation in EM computation, i.e., segmentation is
accomplished in the E-step, and motion analysis in the M-step.
These two steps are computed iteratively in a forward-backward
manner to minimize the overall energy function of the whole image. The motion addressed in the paper is restricted to 2D affine
transforms. We would expect to see its extension to 3D cases
under perspective projections.
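As a rough illustration of this E-step/M-step interplay, the following sketch segments a synthetic flow field into two 2D affine motion layers: the E-step softly assigns each pixel to a layer from its flow residual, and the M-step refits each layer's affine parameters by weighted least squares. This is a minimal toy version of the idea, not the formulation of [20] or [21]; all data, parameter values, and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic flow field: two affine motion layers over random pixel positions.
n = 200
pts = rng.uniform(0, 100, size=(n, 2))
A_true = [np.array([1.0, 0.02, 0.0, -1.0, 0.0, 0.02]),   # layer 0: (u, v) params
          np.array([-2.0, 0.0, 0.01, 2.0, 0.01, 0.0])]   # layer 1
labels_true = (pts[:, 0] > 50).astype(int)               # left/right halves

def affine_flow(a, p):
    """Flow (u, v) of the affine motion a = (a0..a5) at points p."""
    x, y = p[:, 0], p[:, 1]
    return np.stack([a[0] + a[1] * x + a[2] * y,
                     a[3] + a[4] * x + a[5] * y], axis=1)

flow = np.where(labels_true[:, None] == 0,
                affine_flow(A_true[0], pts), affine_flow(A_true[1], pts))
flow += rng.normal(0, 0.05, size=flow.shape)

# Coarse initial split on the horizontal flow component.
w = np.zeros((n, 2))
init = (flow[:, 0] > np.median(flow[:, 0])).astype(int)
w[np.arange(n), init] = 1.0

sigma2 = 0.1
X = np.column_stack([np.ones(n), pts[:, 0], pts[:, 1]])
params = [np.zeros(6), np.zeros(6)]
for _ in range(20):
    # M-step: weighted least-squares affine fit per layer.
    for k in range(2):
        XtWX = X.T @ (X * w[:, k:k + 1]) + 1e-9 * np.eye(3)
        au = np.linalg.solve(XtWX, X.T @ (w[:, k] * flow[:, 0]))
        av = np.linalg.solve(XtWX, X.T @ (w[:, k] * flow[:, 1]))
        params[k] = np.concatenate([au, av])
    # E-step: soft layer assignment from the flow residuals.
    resid = np.stack([np.sum((flow - affine_flow(a, pts)) ** 2, axis=1)
                      for a in params], axis=1)
    resid -= resid.min(axis=1, keepdims=True)
    w = np.exp(-resid / (2 * sigma2))
    w /= w.sum(axis=1, keepdims=True)

labels_est = w.argmax(axis=1)
acc = max(np.mean(labels_est == labels_true),
          np.mean(labels_est == 1 - labels_true))
print(f"segmentation agreement: {acc:.2f}")
```

The alternation mirrors the structure described above: segmentation lives entirely in the responsibilities `w`, motion estimation entirely in `params`, and each step improves a common objective.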
2.2. Model-Based Approaches
In the above subsection, we examined several approaches to motion analysis that do not require a priori shape models. Such approaches are necessary when no shape model is available, but they typically make it more difficult to establish feature correspondence between consecutive frames. Therefore,
most methods for the motion analysis of human body parts apply
predefined models for feature correspondence and body structure recovery. Our discussion will be based on the representations of various models.
Chen and Lee [22] recovered the 3D configuration of a moving
subject according to its projected 2D image. Their model used 17
line segments and 14 joints to represent the features of the head,
torso, hip, arms, and legs (shown in Fig. 5). Various constraints
were imposed for the basic analysis of the gait. The method
was computationally expensive, as it searched through all possible combinations of 3D configurations, given the known 2D
projection, and required accurate extraction of 2D stick figures.
Bharatkumar et al. [7] also used stick figures to model the lower
limbs of the moving human body. They aimed at constructing
a general kinematic model for gait analysis in human walking.
Medial-axis transformations were applied to extract 2D stick
figures of the lower limbs. The body segment angle and joint
displacement were measured and smoothed from real image sequences, and then a common kinematic pattern was detected
for each walking cycle. A high correlation (>0.95) was found
between the real images and the model. The drawback to this
model is that it is view-based and sensitive to changes of the perspective angle at which the images are captured. Huber's human
model [23] is a refined version of the stick figure representation. Joints are connected by line segments with a certain degree
of relaxation as virtual springs. Thus, this articulated kinematic model behaves analogously to a mass-spring-damper system. Motion and stereo measurements of joints are confined to a
three-dimensional space called proximity space (PS). The human
head serves as the starting point for tracking all PS locations.
In the end, particular gestures were recognized based on the PS
states of the joints associated with the head, torso, and arms.
The key to solving this problem is obtaining the 3D positions of the
joints in an image sequence. A recent publication by Iwasawa
[24] focused on real-time extraction of stick figures from monocular thermal images. The height of the human image and the
distance between the subject and the camera were precalibrated.
Then the orientation of the upper half of the body was calculated
as the principal axis of inertia of the human silhouette. Significant points, such as the top of the head and the tips of the hands
and feet, were heuristically located as the extreme points farthest
from the center of the silhouette. Finally, major joints such as the
elbows and knees were estimated, based on the positions of the
detected points through genetic learning. The drawback to this
method is, again, that it is view-oriented and gesture-restricted.
For example, if the human arms are placed in front of the body,
there is no way to extract the fingertips using this method and,
therefore, it fails to locate the elbow joints.
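The principal-axis and extreme-point computations described above can be sketched on a toy binary silhouette using second-order image moments. This is only an illustration of the geometric idea, not Iwasawa's implementation; the grid size, ellipse shape, and names are made up.

```python
import numpy as np

# Toy binary "silhouette": an elongated blob tilted 30 degrees on a 64x64 grid.
h = w = 64
yy, xx = np.mgrid[0:h, 0:w]
theta = np.deg2rad(30)
cx, cy = 32, 32
xr = (xx - cx) * np.cos(theta) + (yy - cy) * np.sin(theta)
yr = -(xx - cx) * np.sin(theta) + (yy - cy) * np.cos(theta)
mask = (xr / 20) ** 2 + (yr / 6) ** 2 <= 1.0

ys, xs = np.nonzero(mask)
cx_m, cy_m = xs.mean(), ys.mean()          # silhouette centroid

# Second-order central moments give the principal axis of inertia.
mu20 = np.mean((xs - cx_m) ** 2)
mu02 = np.mean((ys - cy_m) ** 2)
mu11 = np.mean((xs - cx_m) * (ys - cy_m))
orientation = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

# "Significant points": silhouette pixels farthest from the centroid
# (in the paper, candidates for head top and hand/foot tips).
d = np.hypot(xs - cx_m, ys - cy_m)
tip = np.argmax(d)
print(f"orientation: {np.degrees(orientation):.1f} deg, "
      f"farthest point: ({xs[tip]}, {ys[tip]})")
```

On this synthetic blob the moment-based orientation recovers the 30-degree tilt, and the farthest-point heuristic lands on a tip of the major axis, which is exactly why the method fails when a limb folds in front of the torso and no longer produces a distal extreme point.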
Niyogi and Adelson [8] pursued another route to estimate
the joint motion of human body segments. They first examined the spatial-temporal (XYT) braided pattern produced by
the lower limb trajectories of a walking human and conducted
gait analysis for coarse human recognition. Then the projection
of head movements in the spatial-temporal domain was located,
followed by the identification of other joint trajectories. These
joint trajectories were then utilized to outline the contour of a
walking human, based on the observation that the human body
is spatially contiguous. Finally, a more accurate gait analysis
was performed using the outlined 2D contour, which led to a
fine-level recognition of specific humans. The major concern
we have with this work is how to obtain these XYT trajectories
in real image sequences without attaching specific sensors to the
head and feet during image acquisition.
In Akita's work [25], both stick figures and cone approximations were integrated and processed in a coarse-to-fine manner.
A key frame sequence of stick figures indicates the approximate
order of the motion and spatial relationships between the body
parts. A cone model is included to provide knowledge of the
rough shape of the body parts, whose 2D segments correspond
to the counterparts of the stick figure model. The preliminary
condition to this approach is to obtain these key frames prior
to body part segmentation and motion estimation. Perales and
Torres also made use of both stick figure and volumetric representations in their work [26]. They introduced a predefined
Elliptical cylinders are another commonly used representation for modeling human forms in 3D volumes. Compared to
O'Rourke and Badler's elaborate model, an elliptical cylinder
could be well represented by fewer parameters: the length of the
axis and the major and minor axes of the ellipse cross section.
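As a toy sketch of this compact parameterization, a limb segment can be generated from just those three numbers; the function name, sampling resolution, and dimensions below are illustrative, not from any cited model.

```python
import numpy as np

def elliptical_cylinder(length, a, b, n_ring=16, n_slice=8):
    """Sample surface points of an elliptical cylinder defined by its
    length and the semi-major (a) and semi-minor (b) axes of its
    elliptical cross section."""
    t = np.linspace(0, 2 * np.pi, n_ring, endpoint=False)
    z = np.linspace(0, length, n_slice)
    T, Z = np.meshgrid(t, z)
    # Points stacked as (slice, ring, xyz); the cylinder axis is z.
    return np.stack([a * np.cos(T), b * np.sin(T), Z], axis=-1)

# A "forearm" from three parameters, e.g. for projection and matching.
forearm = elliptical_cylinder(length=25.0, a=4.0, b=3.0)
print(forearm.shape)
```

Three scalars per body part is what makes this representation attractive for model fitting compared to more elaborate volumetric models.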
Hogg [29] and Rohr [30] used the cylinder model originally proposed by
Marr and Nishihara [31], in which the human body is represented
by 14 elliptical cylinders. The origin of the coordinate system is
located at the center of the torso. Both Hogg and Rohr intended
to generate 3D descriptions of a human walking by modeling.
Hogg [29] presented a computer program (WALKER) which
attempted to recover the 3D structure of a walking person. Rohr
applied eigenvector line fitting to outline the human image and
then fitted the 2D projections into the 3D human model using a
similar distance measure. In the same spirit, Wachter and Nagel
[32] also attempted to establish the correspondence between a
3D human model connected by elliptical cones and a real image
sequence. Both edge and region information were incorporated
in determining body joints, their degrees of freedom (DOFs) and
orientations to the camera by an iterated extended Kalman filter.
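To give a flavor of Kalman-style tracking of an articulated parameter, the sketch below runs a generic linear constant-velocity Kalman filter on a noisy 1D joint-angle signal. This is deliberately much simpler than Wachter and Nagel's iterated extended Kalman filter; all matrices, noise levels, and the synthetic signal are illustrative assumptions.

```python
import numpy as np

# Constant-velocity model for one joint angle: state = [angle, rate].
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition
H = np.array([[1.0, 0.0]])              # only the angle is measured
Q = 1e-4 * np.eye(2)                    # process noise covariance
R = np.array([[0.05]])                  # measurement noise covariance

rng = np.random.default_rng(1)
true_angle = 0.1 * np.arange(50)        # a steadily flexing joint
meas = true_angle + rng.normal(0, 0.2, 50)

x = np.zeros(2)                         # initial state estimate
P = np.eye(2)                           # initial state covariance
est = []
for z in meas:
    # Predict step
    x = F @ x
    P = F @ P @ F.T + Q
    # Update step
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (np.array([z]) - H @ x)
    P = (np.eye(2) - K @ H) @ P
    est.append(x[0])

err = np.mean(np.abs(np.array(est[20:]) - true_angle[20:]))
print(f"mean tracking error after burn-in: {err:.3f}")
```

The filtered estimate tracks the angle with considerably less error than the raw measurements; the iterated, nonlinear version referenced above additionally linearizes an image-projection measurement model around the current estimate at each step.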
A fair amount of work in human motion analysis has focused
on the motion of certain body parts. For example, Rehg et al. [33]
were interested in tracking the motion of self-occluded fingers
by matching an image sequence of hand motion to a 3D finger
model. The two occluded fingers were rendered, again, using
several cylinders. The major drawback to this work is that they
assumed an invariant visibility order of 2D templates of these
two occluded fingers to the viewing camera. This assumption
simplified the motion prediction and matching process between
the 3D shape model and its projection in the image plane. However, it also limits its application to image sequences captured
under normal conditions. Recent work by Goncalves et al. [34]
tain good estimation and prediction of body part positions. However, these algorithms are computationally expensive and do not
guarantee convergence if the initial conditions are ill defined.
3. TRACKING HUMAN MOTION WITHOUT
USING BODY PARTS
In the previous section, we discussed human motion analysis involving identification of joint-connected body parts. As
we know, the tasks of labeling body segments and then locating
their connected joints are certainly nontrivial. On the other hand,
applications such as surveillance require only tracking the movement of the subject of interest. It will be computationally more
efficient to track moving humans by directly focusing on the
motion of the whole body rather than its parts. Knowing that
the motion of individual body parts usually converges to that of
the whole body in the long run, we could employ the motion
of the whole body as a guide for predicting and estimating the
movement of body parts; i.e., we could use the human body
as the coordinate frame for local motion estimation of its parts.
In this section, we will review methods for tracking a moving
human without involving its body parts.
As is well known to computer vision researchers, tracking is
equivalent to establishing correspondence of the image structure
between consecutive frames based on features related to position, velocity, shape, texture, and color. Typically, the tracking
process involves matching between images using pixels, points,
lines, and blobs, based on their motion, shape, and other visual
information [41]. There are two general classes of correspondence models: iconic models, which use correlation templates, and structural models, which use image features [41]. Iconic models are generally suitable for any object, but
only when the motion between two consecutive frames is small
enough so that the object images in these frames are highly correlated. Since our interest is in tracking moving humans, which
retain a certain degree of nonrigidity, our literature review is
limited to the use of structural models.
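For completeness, the iconic (correlation-template) idea can be sketched as an exhaustive normalized cross-correlation search over a small window around the previous position; the frame sizes, patch size, and shift below are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Frame with a distinctive 8x8 patch; the "next frame" shifts it by (3, 2).
frame1 = rng.normal(size=(40, 40))
template = frame1[10:18, 10:18].copy()
frame2 = np.roll(np.roll(frame1, 3, axis=0), 2, axis=1)

def ncc(patch, tmpl):
    """Normalized cross-correlation between two equal-size patches."""
    p = patch - patch.mean()
    t = tmpl - tmpl.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum()) + 1e-12
    return float((p * t).sum() / denom)

# Exhaustive search over a small window around the old position.
best, best_pos = -2.0, None
for dy in range(-5, 6):
    for dx in range(-5, 6):
        y, x = 10 + dy, 10 + dx
        score = ncc(frame2[y:y + 8, x:x + 8], template)
        if score > best:
            best, best_pos = score, (dy, dx)

print(f"estimated shift: {best_pos}, NCC = {best:.3f}")
```

The search succeeds here precisely because the inter-frame motion is small and rigid; once the subject deforms between frames, the template decorrelates, which is why the review concentrates on structural models.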
Feature-based tracking typically starts with feature extraction,
followed by feature matching over a sequence of images. The
criteria for selecting a good feature are its brightness, contrast,
size, and robustness to noise. To establish feature correspondence between successive frames, well-defined constraints are
usually imposed to find a unique match. There is again a trade-off
between feature complexity and tracking efficiency. Lower-level
features such as points are easier to extract but relatively more
difficult to track than higher-level features such as lines, blobs,
and polygons.
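A minimal sketch of point-feature matching under an imposed constraint, in the spirit of the trade-off discussed above: a match must lie within a small search radius of the previous position, which here yields a unique nearest-neighbour match. The search radius and synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Feature points in frame t, displaced by a small common motion in frame t+1.
pts_t = rng.uniform(0, 100, size=(10, 2))
motion = np.array([2.0, -1.0])
perm = rng.permutation(10)                       # unknown ordering in frame t+1
pts_t1 = (pts_t + motion + rng.normal(0, 0.1, size=pts_t.shape))[perm]

# Constraint: accept only candidates within a small displacement radius.
max_disp = 5.0
matches = []
for i, p in enumerate(pts_t):
    d = np.linalg.norm(pts_t1 - p, axis=1)
    j = int(np.argmin(d))                        # nearest candidate
    if d[j] <= max_disp:
        matches.append((i, j))

print(f"{len(matches)} of {len(pts_t)} points matched")
```

With dense features or large motion, such a radius constraint no longer yields a unique match, which is exactly the point-versus-blob trade-off mentioned above.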
We will organize our discussion based on two scenarios
from a single camera view or from multiple perspectives. In
both cases, various levels of features could be used to establish
matching between successive frames. The major difference is
that the features used for matching from multiple perspectives
must be projected to the same spatial reference, while tracking
from a single view does not have this requirement.
dynamic systems to represent the complex movement. Recognition is completed by maximizing the posterior probability of the trained HMM models given the input image sequence. Most of these methods include low-level processing for
feature extraction. But again, all recognition methods relying on
2D features are view-based.
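The HMM-based recognition idea can be sketched with two toy two-state discrete models scored by the forward algorithm, choosing the model under which the observed sequence is most probable. The models and the observation sequence below are invented for illustration and unrelated to the cited systems.

```python
import numpy as np

def forward_loglik(A, B, pi, obs):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete HMM
    with transition matrix A, emission matrix B, initial distribution pi."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()
        loglik += np.log(s)
        alpha = alpha / s
    return loglik

B = np.array([[0.9, 0.1], [0.1, 0.9]])   # each state prefers one symbol
pi = np.array([0.5, 0.5])
models = {
    "alternating": np.array([[0.1, 0.9], [0.9, 0.1]]),  # states tend to swap
    "repeating":   np.array([[0.9, 0.1], [0.1, 0.9]]),  # states tend to stay
}
obs = [0, 1, 0, 1, 0, 1, 0, 1]           # a periodic, gait-like sequence
scores = {name: forward_loglik(A, B, pi, obs) for name, A in models.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The periodic sequence scores far higher under the alternating model, mirroring how a trained gait HMM outscores competing activity models on a walking sequence.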
5. CONCLUSION
We have presented an overview of past developments in human motion analysis. Our discussion has focused on three major
tasks: (1) motion analysis of human body parts, (2) high-level
tracking of human motion from a single view or multiple perspectives, and (3) recognition of human movements or activities
based on successfully tracking features over an image sequence.
Motion analysis of human body parts is essentially the 2D or 3D interpretation of the human body structure using the motion of the body parts over image sequences, and involves low-level tasks such as body part segmentation and joint location and detection. Tracking human motion is performed at a higher level, at
which the parts of the human body are not explicitly identified,
e.g., the human body is considered as a whole when establishing
matches between consecutive frames. Our discussion is categorized by whether the subject is imaged at one time instant from a
single view or from multiple perspectives. Tracking a subject in
image sequences from multiple perspectives requires that features be projected into a common spatial reference. The task
of recognizing human activity in image sequences assumes that
feature tracking for recognition has been accomplished. Two
types of techniques are addressed: state-space approaches and
template matching approaches. Template matching is simple to
implement, but sensitive to noise and the time interval of the
movements. State-space approaches, on the other hand, overcome these drawbacks but usually involve complex iterative
computation.
The key to successful execution of high-level tasks is to establish feature correspondence between consecutive frames, which
still remains a bottleneck in most of the approaches. Typically,
constraints on human motion or human models are imposed in
order to decrease the ambiguity during the matching process.
However, these assumptions may not conform to general imaging conditions or may introduce other intractable issues, such as
the difficulty in estimating model parameters from the real image data. Recognition of human motion is just in its infancy,
and there exists a trade-off between computational cost and motion duration accuracy for methodologies based on state-space
models and template matching. We are still looking forward to
new techniques to improve the performance and, meanwhile,
decrease the computational cost.
ACKNOWLEDGMENTS
The work was supported in part by the United States Army Research
Office under Contracts DAAH04-95-I-0494 and DAAG55-98-1-0230 and the
Texas Higher Education Coordinating Board Advanced Technology Project
REFERENCES
6. W. Kinzel, Pedestrian recognition by modelling their shapes and movements, in Progress in Image Analysis and Processing III: Proc. of the 7th Intl. Conf. on Image Analysis and Processing 1993, Singapore, 1994, pp. 547-554.
21. A. Jepson and M. Black, Mixture models for optical flow computation, in Proc. of Intl. Conf. on Pattern Recognition, New York City, 1993, pp. 760-761.
44. J. Segen and S. Pingali, A camera-based system for tracking people in real time, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 63-67.
45. Y. Okawa and S. Hanatani, Recognition of human body motions by robots, in Proc. IEEE/RSJ Intl. Conf. Intelligent Robots and Systems, Raleigh, NC, 1992, pp. 2139-2146.
46. M. Rossi and A. Bozzoli, Tracking and counting people, in 1st Intl. Conf. on Image Processing, Austin, Texas, November 1994, pp. 212-216.
47. S. S. Intille and A. F. Bobick, Closed-world tracking, in Proc. Intl. Conf. Comp. Vis., 1995, pp. 672-678.
48. S. S. Intille, J. W. Davis, and A. F. Bobick, Real-time closed-world tracking, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 697-703.
49. C. Wren, A. Azarbayejani, T. Darrel, and A. Pentland, Pfinder: Real-time tracking of the human body, in Proc. SPIE, Bellingham, WA, 1995.
50. A. Azarbayejani and A. Pentland, Real-time self-calibrating stereo person tracking using 3D shape estimation from blob features, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 627-632.
51. B. Gunsel, A. M. Tekalp, and P. J. L. Beek, Object-based video indexing for virtual studio production, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 769-774.
52. B. Heisele, U. Kreßel, and W. Ritter, Tracking non-rigid moving objects based on color cluster flow, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 257-260.
53. J. Yang and A. Waibel, A real-time face tracker, in Proc. of IEEE Computer Society Workshop on Applications of Computer Vision, Sarasota, FL, 1996, pp. 142-147.