Computer Vision and Image Understanding

Vol. 73, No. 3, March, pp. 428–440, 1999


Article ID cviu.1998.0744, available online at http://www.idealibrary.com on

Human Motion Analysis: A Review


J. K. Aggarwal and Q. Cai
Computer and Vision Research Center, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712
Received October 22, 1997; accepted September 25, 1998

Human motion analysis is receiving increasing attention from
computer vision researchers. This interest is motivated by a wide
spectrum of applications, such as athletic performance analysis,
surveillance, man–machine interfaces, content-based image storage
and retrieval, and video conferencing. This paper gives an overview
of the various tasks involved in motion analysis of the human body.
We focus on three major areas related to interpreting human motion: (1) motion analysis involving human body parts, (2) tracking a
moving human from a single view or multiple camera perspectives,
and (3) recognizing human activities from image sequences. Motion
analysis of human body parts involves the low-level segmentation
of the human body into segments connected by joints and recovers the 3D structure of the human body using its 2D projections
over a sequence of images. Tracking human motion from a single
view or multiple perspectives focuses on higher-level processing, in
which moving humans are tracked without identifying their body
parts. After successfully matching the moving human image from
one frame to another in an image sequence, understanding the human movements or activities comes naturally, which leads to our
discussion of recognizing human activities. © 1999 Academic Press

1. INTRODUCTION
Human motion analysis is receiving increasing attention from
computer vision researchers. This interest is motivated by applications over a wide spectrum of topics. For example, segmenting the parts of the human body in an image, tracking the
movement of joints over an image sequence, and recovering the
underlying 3D body structure are particularly useful for analysis of athletic performance, as well as medical diagnostics. The
capability to automatically monitor human activities using computers in security-sensitive areas such as airports, borders, and
building lobbies is of great interest to the police and military.
With the development of digital libraries, the ability to automatically interpret video sequences will save tremendous human
effort in sorting and retrieving images or video sequences using content-based queries. Other applications include building
man–machine user interfaces, video conferencing, etc. This paper gives an overview of recent developments in human motion
analysis from image sequences using a hierarchical approach.
In contrast to our previous review of motion estimation of a
rigid body [1], this survey concentrates on motion analysis of the

human body, which is a nonrigid form. Our discussion covers
three areas: (1) motion analysis of the human body structure,
(2) tracking human motion without using the body parts from
a single view or multiple perspectives, and (3) recognizing human activities from image sequences. The relationship among
these three areas is depicted in Fig. 1. Motion analysis of the human body usually involves low-level processing, such as body
part segmentation, joint detection and identification, and the recovery of 3D structure from the 2D projections in an image
sequence. Tracking moving individuals from a single view or
multiple perspectives involves applying visual features to detect
the presence of humans directly, i.e., without considering the geometric structure of the body parts. Motion information such as
position and velocity, along with intensity values, is employed
to establish matching between consecutive frames. Once feature
correspondence between successive frames is solved, the next
step is to understand the behavior of these features throughout
the image sequence. Therefore, our discussion turns to recognition of human movements and activities.
Two typical approaches to motion analysis of human body
parts are reviewed, depending on whether a priori shape models
are used. Figure 2 lists a number of publications in this area
over the past years. In both model-based and nonmodel-based
approaches, the representation of the human body evolves from
stick figures to 2D contours to 3D volumes as the complexity
of the model increases. The stick figure representation is based
on the observation that human motion is essentially the movement of the supporting bones. The use of 2D contours is directly
associated with the projection of the human figure in images.
Volumetric models, such as generalized cones, elliptical cylinders, and spheres, attempt to describe the details of a human body
in 3D and, therefore, require more parameters for computation.
Various levels of representations could be used for graphical
animation with different resolutions [2].
With regard to the tracking of human motion without involving body parts, we differentiate the work based on whether the
subject is imaged at one time instant by a single camera or from
multiple perspectives using different cameras. In both configurations, the features to be tracked vary from points to 2D blobs
and 3D volumes. There is always a trade-off between feature
complexity and tracking efficiency. Lower-level features, such
as points, are easier to extract but relatively more difficult to
track than higher-level features such as blobs and 3D volumes.

FIG. 1. Relationship among the three areas of human motion analysis addressed in the paper.

Most of the work in this area is listed in Fig. 3.
To recognize human activities from an image sequence, two
types of approaches are addressed: approaches based on a
state-space model and those using the template matching technique. In the first case, the features used for recognition could be
points, lines, and 2D blobs. Methods using template matching
usually apply meshes of a subject image to identify a particular
movement. Figure 4 gives an overview of the research in this
area. In some of the publications, recognition is conducted using only parts of the human figure. Since these methods can be
naturally extended to recognition of a whole body movement,
we also include them in our discussion.
This paper is an extension of the review in [3]. The organization of the paper is as follows: Section 2 reviews work on
motion analysis of the human body structure. Section 3 covers
research on the higher-level tasks of tracking a human body as a
whole. Section 4 extends the discussion to recognition of human
activity in image sequences based upon successfully tracking
the features between consecutive frames. Finally, Section 5 concludes the paper and delineates possible directions for future research.

FIG. 2. Past research on motion analysis of human body parts.

FIG. 3. Past research on tracking of human motion without using body parts.

2. MOTION ANALYSIS OF HUMAN BODY PARTS


This section focuses on motion analysis of human body parts,
i.e., approaches which involve 2D or 3D analysis of the human body structure throughout image sequences. Conventionally, human bodies are represented as stick figures, 2D contours,
or volumetric models [4]; therefore, body segments can be approximated as lines, 2D ribbons, and 3D volumes, accordingly.
Figures 5, 6, and 7 show examples of the stick figure, 2D contour, and elliptical cylinder representations of the human body,
respectively. Human body motion is typically addressed by the
movement of the limbs and hands [5–8], such as the velocities
of the hand or limb segments, or the angular velocity of various
body parts.
Two general strategies are used, depending upon whether information about the object shape is employed in the motion
analysis, namely, model-based approaches and methods which
do not rely on a priori shape models. Both methodologies follow the general framework of: (1) feature extraction, (2) feature
correspondence, and (3) high-level processing. The major difference between the two methodologies is in establishing feature
correspondence between consecutive frames. Methods which
assume a priori shape models match the real images to a predefined model. Feature correspondence is automatically achieved
once matching between the real images and the model is established. When no a priori shape models are available, however,
correspondence between successive frames is based upon prediction or estimation of features related to position, velocity,
shape, texture, and color. These two methodologies can also be
combined at various levels to verify the matching between consecutive frames and, finally, to accomplish more complex tasks.
Since we have surveyed these methods previously in [4], we will
restrict our discussion to very recent work.
2.1. Motion Analysis without a priori Shape Models
Most approaches to 2D or 3D interpretation of human body
structure focus on motion estimation of the joints of body segments. When no a priori shape models are assumed, heuristic
assumptions are usually used to establish the correspondence
of joints between successive frames. These assumptions impose constraints on feature correspondence, decrease the search
space, and eventually result in a unique match.

FIG. 4. Past work on human activity recognition.

FIG. 5. A stick-figure human model (based on Chen and Lee's work [22]).

FIG. 7. A volumetric human model (derived from Hogg's work [29]).

The simplest representation of a human body is the stick figure, which consists of line segments linked by joints. The motion
of joints provides the key to motion estimation and recognition
of the whole figure. This concept was initially considered by
Johansson [9], who marked joints as "moving light displays"
(MLD). Along this vein, Rashid [10] attempted to recover a connected human structure with projected MLD by assuming that
points belonging to the same object have higher correlations in
projected positions and velocities.

FIG. 6. A 2D contour human model (similar to Leung and Yang's model [27]).

Later, Webb and Aggarwal
[11, 12] extended this work to the 3D structure recovery of Johansson-type figures in motion. They imposed the "fixed axis" assumption, which
assumes that the motion of each rigid object (or part of an articulated object) is constrained so that its axis of rotation remains
fixed in direction. Therefore, the depth of the joints can be estimated from their 2D projections. A detailed review of [9–12] can
be found in [4]. All of these approaches inevitably demand a high
degree of accuracy in extracting body segments and joints. The
segmentation problem is avoided by directly using MLDs, which
implies that such methods do not extend to human images with natural clothing.
Another way to describe the human body is using 2D contours.
In such descriptions, the human body segments are analogous
to 2D ribbons or blobs. For example, Shio and Sklansky [15]
focused their work on 2D translational motion of human blobs.
The blobs were grouped based on the magnitude and direction of
the pixel velocity, which was obtained using techniques similar
to the optical flow method [16]. The velocity of each part was
considered to converge to a global average value over several
frames. This average velocity corresponds to the motion of the
whole human body and leads to identification of the whole body
via region grouping of blobs with a similar smoothed velocity.
Kurakake and Nevatia [17] attempted to locate the joints
in images of walking humans by establishing correspondence
between extracted ribbons. They assumed small motion between
two consecutive frames, and feature correspondence was conducted using various geometric constraints. Joints were finally
identified as the center of the area where two ribbons overlap.
Recent work by Kakadiaris et al. [18, 19] focused on body part
decomposition and joint location from image sequences of the
moving subject using a physics-based framework. Unlike [17],
where joints are located from shapes, here the joint is revealed
only when the body segments connected to it undergo motion.
In the beginning, the subject image is assumed to be one deformable model. As the subject moves and new postures occur,
multiple new models are produced to replace the old ones, with
each of them representing an emerging subpart. Joints are determined based on the relative motion and shape of two moving
subparts. These methods usually require small image motion
between successive frames, which will not be a major concern as
the video sampling rate increases. Our main concern is still segmentation under normal circumstances. Among the addressed
methods, Shio and Sklansky [15] relied on motion as the cue
for segmentation, while the latter two approaches are based on
intensity or texture. To the best of our knowledge, robust segmentation
technologies are yet to be developed. Since this problem is the
major obstacle to most of the work involving low-level processing, we will not mention it repeatedly in later discussions.
Finally, we want to address the recent work by Rowley and
Rehg [20], which focuses on the segmentation of optical flow
fields of articulated objects. It is an extension of motion analysis of rigid objects using the expectation-maximization (EM)
algorithm [21]. Compared to [21], the major contribution of [20]
is the addition of kinematic motion constraints to the data at each pixel. The
strength of this work lies in its combination of motion segmentation and estimation in EM computation, i.e., segmentation is
accomplished in the E-step, and motion analysis in the M-step.
These two steps are computed iteratively in a forward–backward
manner to minimize the overall energy function of the whole image. The motion addressed in the paper is restricted to 2D affine
transforms. We would expect to see its extension to 3D cases
under perspective projections.
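
To make the E-step/M-step alternation concrete, here is a minimal sketch of EM-based segmentation of a precomputed flow field into K affine motion layers. It is a generic formulation under simplifying assumptions (isotropic Gaussian residuals, known K, our own function names), not the actual algorithm of [20] or [21]:

```python
import numpy as np

def em_affine_flow_segmentation(xy, uv, K=2, iters=20, sigma=1.0):
    """Soft-segment a flow field into K affine motion models via EM.

    xy: (N, 2) pixel coordinates; uv: (N, 2) measured flow vectors.
    Model k predicts the flow at (x, y) as [x, y, 1] @ A_k, where
    A_k is a (3, 2) affine parameter matrix.
    """
    N = xy.shape[0]
    X = np.hstack([xy, np.ones((N, 1))])          # homogeneous coordinates
    rng = np.random.default_rng(0)
    resp = rng.dirichlet(np.ones(K), size=N)      # initial soft assignments
    for _ in range(iters):
        # M-step: weighted least-squares fit of each affine motion model.
        models = []
        for k in range(K):
            w = np.sqrt(resp[:, k:k + 1])
            A, *_ = np.linalg.lstsq(X * w, uv * w, rcond=None)
            models.append(A)
        # E-step: responsibilities from Gaussian likelihoods of the residuals.
        log_lik = np.stack([-np.sum((uv - X @ A) ** 2, axis=1) / (2 * sigma**2)
                            for A in models], axis=1)
        log_lik -= log_lik.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp, models        # per-pixel soft segmentation and motion models
```

A hard segmentation, if needed, is resp.argmax(axis=1); [20] additionally imposes kinematic constraints on the per-layer motions, which this sketch omits.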
2.2. Model-Based Approaches
In the above subsection, we examined several approaches to
motion analysis that do not require a priori shape models. In this
type of approach, which is necessary when no a priori shape
models are available, it is typically more difficult to establish
feature correspondence between consecutive frames. Therefore,
most methods for the motion analysis of human body parts apply
predefined models for feature correspondence and body structure recovery. Our discussion will be based on the representations of various models.
Chen and Lee [22] recovered the 3D configuration of a moving
subject according to its projected 2D image. Their model used 17
line segments and 14 joints to represent the features of the head,
torso, hip, arms, and legs (shown in Fig. 5). Various constraints
were imposed for the basic analysis of the gait. The method
was computationally expensive, as it searched through all possible combinations of 3D configurations, given the known 2D
projection, and required accurate extraction of 2D stick figures.
Bharatkumar et al. [7] also used stick figures to model the lower
limbs of the moving human body. They aimed at constructing
a general kinematic model for gait analysis in human walking.
Medial-axis transformations were applied to extract 2D stick
figures of the lower limbs. The body segment angle and joint
displacement were measured and smoothed from real image sequences, and then a common kinematic pattern was detected
for each walking cycle. A high correlation (>0.95) was found

between the real images and the model. The drawback to this
model is that it is view-based and sensitive to changes of the perspective angle at which the images are captured. Huber's human
model [23] is a refined version of the stick figure representation. Joints are connected by line segments with a certain degree
of relaxation as virtual springs. Thus, this articulated kinematic model behaves analogously to a mass-spring-damper system. Motion and stereo measurements of joints are confined to a
three-dimensional space called proximity space (PS). The human
head serves as the starting point for tracking all PS locations.
In the end, particular gestures were recognized based on the PS
states of the joints associated with the head, torso, and arms.
The key to solving this problem requires the 3D positions of the
joints in an image sequence. A recent publication by Iwasawa et al.
[24] focused on real-time extraction of stick figures from monocular thermal images. The height of the human image and the
distance between the subject and the camera were precalibrated.
Then the orientation of the upper half of the body was calculated
as the principal axis of inertia of the human silhouette. Significant points, such as the top of the head and the tips of the hands
and feet, were heuristically located as the extreme points farthest
from the center of the silhouette. Finally, major joints such as the
elbows and knees were estimated, based on the positions of the
detected points through genetic learning. The drawback to this
method is, again, that it is view-oriented and gesture-restricted.
For example, if the human arms are placed in front of the body,
there is no way to extract the finger tips using this method and,
therefore, it fails to locate the elbow joints.
Niyogi and Adelson [8] pursued another route to estimate
the joint motion of human body segments. They first examined the spatial-temporal (XYT) braided pattern produced by
the lower limb trajectories of a walking human and conducted
gait analysis for coarse human recognition. Then the projection
of head movements in the spatial-temporal domain was located,
followed by the identification of other joint trajectories. These
joint trajectories were then utilized to outline the contour of a
walking human, based on the observation that the human body
is spatially contiguous. Finally, a more accurate gait analysis
was performed using the outlined 2D contour, which led to a
fine-level recognition of specific humans. The major concern
we have with this work is how to obtain these XYT trajectories
in real image sequences without attaching specific sensors to the
head and feet during image acquisition.
In Akita's work [25], both stick figures and cone approximations were integrated and processed in a coarse-to-fine manner.
A key frame sequence of stick figures indicates the approximate
order of the motion and spatial relationships between the body
parts. A cone model is included to provide knowledge of the
rough shape of the body parts, whose 2D segments correspond
to the counterparts of the stick figure model. The preliminary
condition of this approach is to obtain these key frames prior
to body part segmentation and motion estimation. Perales and
Torres also made use of both stick figure and volumetric representations in their work [26]. They introduced a predefined
library with two levels of biomechanical graphical models. Level
one is a stick figure tree with nodes for body segments and arcs
for joints. Level two is composed of descriptions of surface and
body segments using 3D graphical primitives. Both levels of the
model are applied in various matching stages. The experiments
were conducted in a relatively restricted environment with one
subject and multiple calibrated cameras. The distance between
the camera and the subject was also known. We would like to
see their methods extended to settings where the imaging conditions are
much more relaxed.
Leung and Yang [27] applied a 2D ribbon model to recognize
poses of a human performing gymnastic movements. They estimated motion solely from the outline of the moving human subject. This indicates that the proposed framework is not restricted
by the small motion assumption. The system consists of two major processes: extraction of human outlines and interpretation of
human motion. In the first process, a sophisticated 2D ribbon
model (shown in Fig. 6) was applied to explore the structural
and shape relationships between the body parts and their associated motion constraints. A spatial-temporal relaxation process
was proposed to determine if an extracted 2D ribbon belongs
to a part of the body or to the background. In the end, a
description of the body parts and the appropriate body joints is
obtained, based on the structure and motion. Their work is one
of the most complete systems for human motion analysis, ranging from low-level segmentation to the high-level task of body
part labeling. It would be very interesting to see how the system
reacts to more general human motion sequences.
As we know, the major drawback of 2D models is their restriction to viewing camera angles. Applying a 3D volumetric
model alleviates this problem, but it requires more parameters
and more expensive computation during the matching process.
In a pioneering work in model-based 3D human motion analysis
[28], O'Rourke and Badler applied a very elaborate volumetric
model consisting of 24 rigid segments and 25 joints. The surface
of each segment is defined by a collection of overlapping sphere
primitives. A coordinate system is embedded in the segments
along with various motion constraints of human body parts. The
system consists of four main processes: prediction, simulation,
image analysis, and parsing (shown in Fig. 8). First, the image
analysis phase accurately locates the body parts based on the
previous prediction results. After the range of the predicted 3D
location of the body parts becomes smaller, the parser fits these
location-time relationships into certain linear functions. Then,
the prediction phase estimates the position of the parts in the next
frame using the determined linear functions. Finally, a simulator
embedded with extensive knowledge of the human body translates the prediction data into corresponding 3D regions, which
will be verified by the image analysis phase in the next loop.
Our question is whether the computation converges in the four-process loop when the initial prediction is random and whether
the use of an elaborate representation of a human figure is necessary when we merely want to estimate the motion of body
parts.


FIG. 8. The overview of O'Rourke and Badler's system.

Elliptical cylinders are another commonly used representation for modeling human forms in 3D volumes. Compared to
O'Rourke and Badler's elaborate model, an elliptical cylinder
could be well represented by fewer parameters: the length of the
axis and the major and minor axes of the ellipse cross section.
Hogg [29] and Rohr [30] used the cylinder model originated by
Marr and Nishihara [31], in which the human body is represented
by 14 elliptical cylinders. The origin of the coordinate system is
located at the center of the torso. Both Hogg and Rohr intended
to generate 3D descriptions of a human walking by modeling.
Hogg [29] presented a computer program (WALKER) which
attempted to recover the 3D structure of a walking person. Rohr
applied eigenvector line fitting to outline the human image and
then fitted the 2D projections into the 3D human model using a
similar distance measure. In the same spirit, Wachter and Nagel
[32] also attempted to establish the correspondence between a
3D human model connected by elliptical cones and a real image
sequence. Both edge and region information were incorporated
in determining body joints, their degrees of freedom (DOFs) and
orientations to the camera by an iterated extended Kalman filter.
A fair amount of work in human motion analysis has focused
on the motion of certain body parts. For example, Rehg et al. [33]
were interested in tracking the motion of self-occluded fingers
by matching an image sequence of hand motion to a 3D finger
model. The two occluded fingers were rendered, again, using
several cylinders. The major drawback to this work is that they
assumed an invariant visibility order of 2D templates of these
two occluded fingers to the viewing camera. This assumption
simplified the motion prediction and matching process between
the 3D shape model and its projection in the image plane. However, it also limits its application to image sequences captured
under normal conditions. Recent work by Goncalves et al. [34]
addressed the problem of motion estimation of a human arm in
3D using a calibrated camera. Both the upper and lower arm
were modeled as truncated circular cones, and the shoulder and
elbow joints were assumed to be spherical joints. They used perspective projection of a 3D arm model to fit the blurred image
of a real arm. Matching was conducted by recursively minimizing the error between the model projection and the real image
through dynamically adapting the size and orientation of the
model. Iterative searching for the global minimum might be a
concern for this approach.
Kakadiaris and Metaxas [35], on the other hand, consider the
problem of 3D model-based tracking of human body parts from
three calibrated cameras placed in a mutually orthogonal configuration. The 3D human model was extracted, based on its 2D
projections from the view of each camera using a physics-based
framework [19] discussed in the previous section. Cameras are
switched for tracking, based on the visibility of the body parts
and the observability of their predicted motion. Correspondence
between the points on the occluding contours to points on the
3D shape model is established using concepts from projective
geometry. The Kalman filter is used for prediction of the body
part motion. Since the authors in [19] claimed that their method
was general to multiple calibrated cameras, Gavrila and Davis
[36, 37] extended it to four calibrated cameras for 3D human
body modeling. The class of tapered super-quadrics, including
cylinders, spheres, ellipsoids, and hyper-rectangles, were used
for the 3D graphical model with 22 DOFs. The general framework for human pose recovery and tracking of body parts was
similar to that proposed in O'Rourke and Badler's [28]. Matching between the model view and the actual scene was based on
the similarity of arbitrary edge contours using a variant of chamfer matching [38]. Finally, we address the recent work by Hunter
et al. [39], who applied mixture density modeling for 3D human
pose recovery from a real image sequence. The 3D human model
consists of five ellipsoidal component shapes representing the
arms and the trunk with 14 DOFs. A modified EM algorithm was
employed to solve a constrained mixture estimation of the low-level segmented images. Bregler [40] also applied the mixture
of Gaussian estimation to 2D blobs in their low-level tracking
process. However, their ultimate goal was to recognize human
dynamics, which will be discussed in later sections.
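
Chamfer matching of the kind referred to above can be sketched generically as follows; this illustrative version uses scipy's Euclidean distance transform and is not the exact variant of [38] used in [36, 37]:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_score(edge_map, model_points):
    """Mean distance from projected model contour points to scene edges.

    edge_map: 2D boolean array, True at detected image edges.
    model_points: (N, 2) array of (row, col) points on the projected
    model contour. A lower score means a better pose hypothesis.
    """
    # Distance transform: each pixel -> distance to the nearest edge pixel.
    dist = distance_transform_edt(~edge_map)
    rows = np.clip(model_points[:, 0].astype(int), 0, edge_map.shape[0] - 1)
    cols = np.clip(model_points[:, 1].astype(int), 0, edge_map.shape[1] - 1)
    return dist[rows, cols].mean()
```

Pose recovery then amounts to searching the model's parameter space for the projection that minimizes this score.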
All of these approaches encounter the problem of matching a
human image to its abstract representation by models of different complexity. The problem is itself nontrivial. The complexity
of the matching process is governed by the number of model parameters and the efficiency of human body segmentation. When
fewer model parameters are used, it is easier to match the feature
to the model, but it is more difficult to extract the feature. For
example, the stick figure is the simplest way to represent a human body, and thus it is relatively easier to fit the extracted lines
into the corresponding body segments. However, extracting a
stick figure from real images needs more care than searching for
2D blobs or 3D volumes. In some of the approaches mentioned
above, the Kalman filter and the EM algorithm were used to obtain good estimation and prediction of body part positions. However, these algorithms are computationally expensive and do not
guarantee convergence if the initial conditions are ill defined.
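
As an illustration of the prediction machinery mentioned above, a minimal constant-velocity Kalman filter for a single 2D body-part position is sketched below; the dynamics and noise levels are illustrative assumptions, not parameters from any cited system:

```python
import numpy as np

# State [x, y, vx, vy]; we observe only the position [x, y].
dt = 1.0                                     # one frame between updates
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], float)          # constant-velocity dynamics
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], float)          # measurement model
Q = 0.01 * np.eye(4)                         # process noise (assumed)
R = 4.0 * np.eye(2)                          # measurement noise (assumed)

def kalman_step(x, P, z):
    """One predict/update cycle; z is the measured part position."""
    x_pred = F @ x                           # predict state for the next frame
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)    # correct with the measurement
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new
```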
3. TRACKING HUMAN MOTION WITHOUT
USING BODY PARTS
In the previous section, we discussed human motion analysis involving identification of joint-connected body parts. As
we know, the tasks of labeling body segments and then locating
their connected joints are certainly nontrivial. On the other hand,
applications such as surveillance require only tracking the movement of the subject of interest. It will be computationally more
efficient to track moving humans by directly focusing on the
motion of the whole body rather than its parts. Knowing that
the motion of individual body parts usually converges to that of
the whole body in the long run, we could employ the motion
of the whole body as a guide for predicting and estimating the
movement of body parts; i.e., we could use the human body
as the coordinates for local motion estimation of its body parts.
In this section, we will review methods for tracking a moving
human without involving its body parts.
As is well known to computer vision researchers, tracking is
equivalent to establishing correspondence of the image structure
between consecutive frames based on features related to position, velocity, shape, texture, and color. Typically, the tracking
process involves matching between images using pixels, points,
lines, and blobs, based on their motion, shape, and other visual
information [41]. There are two general classes of correspondence models, namely, "iconic models," which use correlation
templates, and "structural models," which use image features
[41]. Iconic models are generally suitable for any objects, but
only when the motion between two consecutive frames is small
enough so that the object images in these frames are highly correlated. Since our interest is in tracking moving humans, which
retain a certain degree of nonrigidity, our literature review is
limited to the use of structural models.
Feature-based tracking typically starts with feature extraction,
followed by feature matching over a sequence of images. The
criteria for selecting a good feature are its brightness, contrast,
size, and robustness to noise. To establish feature correspondence between successive frames, well-defined constraints are
usually imposed to find a unique match. There is again a trade-off
between feature complexity and tracking efficiency. Lower-level
features such as points are easier to extract but relatively more
difficult to track than higher-level features such as lines, blobs,
and polygons.
We will organize our discussion based on two scenarios:
tracking from a single camera view and tracking from multiple perspectives. In
both cases, various levels of features could be used to establish
matching between successive frames. The major difference is
that the features used for matching from multiple perspectives
must be projected to the same spatial reference, while tracking
from a single view does not have this requirement.


3.1. Tracking from a Single View


Most methods focus on tracking humans in image sequences
from a single camera view. Features used for tracking have
evolved from points to motion blobs.
Polana and Nelson's work [42] is a good example of point feature tracking. They considered that the movements of arms and
legs converge to that of the torso. In [42], each walking subject
image was bounded by a rectangular box, and the centroid of the
bounding box was used as the feature to track. Tracking was successful even when the two subjects occluded each other
in the middle of the image sequence, as long as the velocity of the
centroids could be differentiated. The advantage of this approach
is its simplicity and the use of body motion to solve occlusion.
However, it only considers 2D translational motion and its tracking robustness could be further improved by incorporating other
features such as texture, color, and shape. Cai et al. [43] also
focused on tracking the movements of the whole human body
using a viewing camera with 2D translational movement. They
focused on dynamic recovery of still or changing background
images. The image motion of the viewing camera was estimated
by matching the line segments of the background image. Then,
three consecutive frames were adjusted into the same spatial reference with motion compensation. At the final stage, subjects
were tracked using the center of the bounding boxes and estimated motion information. Compared to [42], their work is able
to handle a moving camera. However, it also shares the strengths
and weaknesses of the previous approach. Segen and Pingali's
[44] people-tracking system utilized the corner points of moving
contours as the features for correspondence. These feature points
were matched in a forward–backward manner between two successive frames using a distance measure related to position and
curvature values of the points. The matching process implies
an assumption of a certain degree of rigidity of the moving human body
and of small motion between consecutive frames. Finally,
short-lived or partially overlapped trajectories were merged into
long-lived paths. Compared to the one-trajectory approaches in
[42, 43], their work relies on multiple trajectories and thus is
more robust if all matches are correctly resolved. However, the
robustness of corner points for matching is questionable, given
the condition of natural clothing and nonrigid human shapes.
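
A minimal sketch of such centroid tracking, in the spirit of [42, 43] but with our own names and thresholds: each subject is reduced to its bounding-box centroid, and frame-to-frame matching is nearest neighbor around a constant-velocity prediction.

```python
import numpy as np

def match_centroids(tracks, detections, max_dist=30.0):
    """Associate tracked centroids with detections in the next frame.

    tracks: list of dicts with 2D 'pos' and 'vel' arrays.
    detections: (M, 2) array of bounding-box centroids in the new frame.
    """
    if len(detections) == 0:
        return tracks
    used = set()
    for t in tracks:
        pred = t['pos'] + t['vel']               # constant-velocity prediction
        d = np.linalg.norm(detections - pred, axis=1)
        d[list(used)] = np.inf                   # each detection matches once
        j = int(np.argmin(d))
        if d[j] < max_dist:
            t['vel'] = detections[j] - t['pos']  # update velocity estimate
            t['pos'] = detections[j]
            used.add(j)
    for j in range(len(detections)):             # unmatched -> new subjects
        if j not in used:
            tracks.append({'pos': detections[j], 'vel': np.zeros(2)})
    return tracks
```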
Another commonly used feature is 2D blobs or meshes. In
Okawa and Hanatani's work [45], background pixels were estimated by voting
for the most frequent value over the image sequence [15]. The
meshes belonging to a moving foot were then detected by incorporating low-pass filtering and masking. Finally, the distance
between the human foot and the viewing camera was computed
by comparing the relative foot position to the precalibrated CAD
floor model. Rossi and Bozzoli [46] also used moving blobs to
track and count people crossing the field of view of a vertically
mounted camera. Occlusion of multiple subjects was avoided
due to the viewing angle of the camera, and tracking was performed using position estimation during the period when the subject enters the top and the bottom of the image. Intille and Bobick
[47, 48] solved their tracking problem by taking advantage of
the knowledge of a so-called "closed-world." A "closed-world"
is a space-time domain in which all possible objects present in
the image sequences are known. They illustrated their tracking algorithm using the example of a football game, where the
background and the rules for play are known a priori. Camera
motion was removed by establishing homographic transforms
between the football field and its model using landmarks
in the field. The players were detected as moving blobs, and
tracking was performed by template matching the neighbor region of the player image between consecutive frames. These
approaches are well justified, given the restricted environments
of their applications. We would not recommend using them for
image sequences acquired under general conditions.
Pentland et al. [49, 50] explored the blob feature thoroughly.
In their work, blobs were not restricted to regions due to motion
and could be any homogeneous areas, such as color, texture,
brightness, motion, shading, or a combination of these. Statistics such as mean and covariance were used to model the blob
features in both 2D and 3D. In [49], the feature vector of a blob
is formulated as (x, y, Y, U, V), consisting of spatial (x, y)
and color (Y, U, V) information. A human body is constructed
by blobs representing various body parts such as head, torso,
hands, and feet. Meanwhile, the surrounding scene is modeled as
a texture surface. Gaussian distributions were assumed for both
models of the human body and the background scene. In the end,
pixels belonging to the human body were assigned to different body
part blobs using the log-likelihood measure. Later, Azarbayejani
and Pentland [50] recovered the 3D geometry of the blobs from
a pair of 2D blob features via nonlinear modeling and recursive estimation. The 3D geometry included the shape and the
orientation of the 3D blobs along with the relative rotation and
translation to the binocular camera. Tracking of the 2D blobs
was inherent in the recovery process which iteratively searched
the equilibria of the nonlinear state space model across image
sequences. The strength of these approaches is their comprehensive feature selection, which could result in more robust performance when the cost of computation is not a major concern.
Another trend is to use color meshes [51] or clusters [52]
for human motion tracking. In the work by Gunsel et al. [51],
video objects such as moving humans were identified as groups
of meshes with similar motion and color (YUV) information.
Nodes of the meshes were tracked in image sequences using
block matching. In particular, they focused on the application
of content-based indexing of video streams. Heisele et al. [52]
initially adopted a clustering process to categorize objects into
various prototypes based on color in RGB space and their corresponding image locations. These prototypes then serve as the
seeds for the parallel k-means clustering during which centroids
of the clusters are implicitly tracked in subsequent images. Color
is also applied for tracking human faces [53, 54] and identifying
nude pictures on the internet [55]. Since our interest is more on
motion analysis, we will not discuss these approaches in detail.
Overall, we believe color is a vital cue for segmentation and
tracking of moving humans, as long as extra computation and
storage is not a major concern.
3.2. Tracking from Multiple Perspectives
The disadvantage of tracking human motion from a single
view is that the monitored area is relatively narrow due to the
limited field of view of a single camera. To increase the monitored area, one strategy is to mount multiple cameras at various locations of the area of interest. As long as the subject is
within the area of interest, it will be imaged from at least one
of the perspectives. Tracking from multiple perspectives also
helps to solve ambiguities in matching when subjects occlude
each other from a certain viewing angle. In recent
years, more work on tracking of human motion from multiple
perspectives has emerged [56–59]. Compared to tracking moving humans from a single view, establishing feature correspondence between images captured from multiple perspectives is
more challenging because the features are recorded in different
spatial coordinates. These features must be adjusted to the same
spatial reference before matching is performed.
Recent work by Cai and Aggarwal [59, 60] uses multiple
points belonging to the medial axis of the human upper body as
the feature to track. These points are sparsely sampled and assumed to be independent of each other, which preserves a certain
degree of nonrigidity of the human body. Location and average
intensity of the feature points are used to find the most likely
match between two consecutive frames imaged from different
viewing angles. One feature point is assumed to match to its
corresponding epipolar line via motion estimation and precalibration of the cameras. Camera switching was automatically
implemented, based on the position and velocity of the subject
relative to the viewing cameras. Experimental results of tracking
moving humans in indoor environments using a prototype system equipped with three cameras indicated robust performance
and a potential for real-time implementation. The strength of this
approach is its comprehensive framework and its relatively low
computational cost, given the complexity of the problem.
However, the approach still relies heavily on the accuracy of
the segmentation results, and more powerful and sophisticated
segmentation methods would improve its performance.
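
The epipolar constraint underlying such cross-camera matching can be sketched generically as follows; it assumes a fundamental matrix F obtained from precalibration and is not the exact matching criterion of [59, 60], which also uses intensity and motion cues:

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Distance from point p2 (view 2) to the epipolar line of p1 (view 1).

    F is the 3x3 fundamental matrix mapping view-1 points to view-2
    epipolar lines; p1 and p2 are (x, y) image points.
    """
    a, b, c = F @ np.array([p1[0], p1[1], 1.0])  # line a*x + b*y + c = 0
    return abs(a * p2[0] + b * p2[1] + c) / np.hypot(a, b)

def match_across_views(F, pts1, pts2, tol=2.0):
    """Greedily match each view-1 point to the view-2 point nearest
    its epipolar line, provided it lies within tol pixels of the line."""
    matches = []
    for i, p1 in enumerate(pts1):
        d = [epipolar_distance(F, p1, p2) for p2 in pts2]
        j = int(np.argmin(d))
        if d[j] < tol:
            matches.append((i, j))
    return matches
```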
Sato et al. [56], on the other hand, treated a moving human as
a combination of various blobs of its body parts. All distributed
cameras are calibrated in a world coordinate system that corresponds to a CAD model of the surrounding indoor environment.
The blobs of body parts were matched over image sequences using their area, average brightness, and rough 3D position in the
world coordinates. The 3D position of a 2D blob was estimated,
based on height information, by measuring the distance between
the center of gravity of the blob and the floor. These small blobs
were then merged into an entire region of the human image using the spatial and motion parameters from several frames. Kelly
et al. [57, 58] adopted a strategy similar to [56] to construct a 3D
environmental model. They introduced a feature called "voxels,"
which are sets of cubic volume elements containing information
such as the object to which this voxel belongs and the history of
this object. The depth information of the voxel is also obtained
using height estimation. Moving humans are tracked as a group
of these voxels from the best angle of the viewing system.
A significant amount of effort is made in [57, 58] to construct
a friendly graphical interface for browsing multiple image sequences of a same event. Both methods require a CAD model of
the surrounding environment and tracking relies on finding the
3D positions of blobs in the world coordinate system.
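
The height-based 3D localization used in [56–58] can be approximated by intersecting the viewing ray of a blob's lowest image point with the floor plane; a sketch assuming a calibrated pinhole camera and a floor at z = 0 in world coordinates:

```python
import numpy as np

def blob_floor_position(K, R, t, pixel):
    """Back-project an image point assumed to lie on the floor plane z = 0.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation,
    so that x_cam = R @ x_world + t. pixel: (u, v), e.g., the lowest
    point of a human blob (the feet).
    """
    cam_center = -R.T @ t                       # camera center in world frame
    ray_cam = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    ray_world = R.T @ ray_cam                   # viewing ray in world frame
    s = -cam_center[2] / ray_world[2]           # solve (center + s*ray).z = 0
    return cam_center + s * ray_world           # 3D point on the floor
```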
4. HUMAN ACTIVITY RECOGNITION
In this section, we move on to recognizing human movements
or activities from image sequences. A large portion of literature
in this area is devoted to human facial motion recognition and
emotion detection, which fall into the category of elastic nonrigid
motion. In this paper, we are interested only in motion involving
a human body, i.e., human body motion as a form of articulated
motion. The difference between articulated motion and elastic
motion is well addressed in [4]. There also exists a body of work
on human image recognition by applying geometric features,
such as 2D clusters [15, 60], profile projections [61], and texture
and color information [53, 55]. Since motion information was
not incorporated in the recognition process, we will not include
them for further discussion.
For human activity or behavior recognition, most efforts have
used state-space approaches [62] to understand the human motion sequence [5, 13, 14, 40, 63–65]. An alternative is to
use the template matching technique [42, 66–68] to convert an
image sequence into a static shape pattern and then compare
it to the prestored prototypes during the recognition process.
The advantage of using the template matching technique is its
lower computational cost; however, it is usually more sensitive
to the variance of the movement duration. Approaches based on
state-space models, on the other hand, define each static posture
as a state. These states are connected by certain probabilities.
Any motion sequence is considered as a tour going through various states of these static poses. Joint probabilities are computed
through these tours, and the maximum value is selected as the
criterion for classifying activities. In such a scenario, duration of
motion is no longer an issue because each state can repeatedly
visit itself. However, approaches using these methods usually
apply intrinsic nonlinear models and do not have a closed-form
solution. As we know, nonlinear modeling typically requires
searching for a global optimum in the training process, which
requires expensive computing iterations. Meanwhile, selecting
the proper number of states and dimension of the feature vector
to avoid underfitting or overfitting remains an issue. In the
following subsections, both template matching and state-space
approaches [62] will be discussed.
4.1. Approaches Using Template Matching
The work [68] by Boyd and Little is perhaps the only one
which uses MLDs as the feature for recognition, based on
template matching. They assumed that a single pedestrian is
walking parallel to the camera and that the movement of the
subject is periodic. First, they computed n optical flow images
from n + 1 frames and then chose m characteristics, such as the
centroid of the MLDs and moments, from each optical flow image.
Thus the kth characteristic (1 ≤ k ≤ m) over the n frames forms a time-varying sequence with a phase φ_k. The differences between these
phases, F_k = φ_k − φ_m (1 ≤ k ≤ m − 1), are grouped into the feature vector for recognition. As can be seen from the procedure,
this approach is restricted by the relative position between the
walking subject and the viewing camera.
Perhaps the most commonly used feature for recognition of
human motion is 2D meshes. For example, Polana and Nelson
[42] computed the optical flow fields [16] between consecutive frames and divided each flow frame into a spatial grid in
both X and Y directions. The motion magnitude in each cell
is summed, forming a high-dimensional feature vector used for
recognition. To normalize the duration of the movement, they
assumed that human motion is periodic and divided the entire
sequence into a number of cycles of the activity. Motion in
a single cycle was averaged throughout the number of cycles
and divided into a fixed number of temporal divisions.
Finally, activities were recognized using the nearest neighbor
algorithm. Recent work by Bobick and Davis [66] follows the
same vein, but extracts the motion feature differently. They interpret human motion in an image sequence by using motion-energy images (MEI) and motion-history images (MHI). The
motion images in a sequence are calculated via differencing between successive frames and then thresholded into binary values. These motion images are accumulated in time and form
MEI, which are binary images containing motion blobs. The
MEI are later enhanced into MHI, where each pixel value is
proportional to the duration of motion at that position. Each
action consists of MEIs and MHIs in image sequences with
various viewing angles. Moment-based features are extracted
from MEIs and MHIs and employed for recognition using template matching. Cui and Weng [67] intended to segment hand
images with different gestures from cluttered backgrounds using a learning-based approach. They first used the motion cue
to fix an attention window containing the moving object. Then
they searched the training set to find the closest match, which
served as the basis for predicting the pattern in the next frame.
This predicted pattern is later applied to the input image for verification. If the pattern is verified, then a refinement process is
followed to obtain an accurate representation of the hand image with any cluttered backgrounds excluded. As can be seen
from these approaches, the key to the recognition problem is
finding the vital and robust feature sets. Due to the nature of 2D
meshes, all the features addressed above are view-based. There
are several alternatives. One is to use multiple
views and incorporate features from different perspectives for
recognition. Another is to directly apply 3D features, such as
the trajectories of 3D joints. Once the features are extracted, any
good method for machine learning and classification, such as nearest neighbor, decision tree, and regression models, could be employed.
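
As one concrete instance of template matching, here is a minimal sketch of the temporal templates of Bobick and Davis [66]; the decay constant and difference threshold are illustrative, and the moment-based feature extraction is omitted:

```python
import numpy as np

def temporal_templates(frames, tau=30, thresh=15):
    """Compute MEI and MHI from a sequence of grayscale frames.

    frames: list of 2D uint8 arrays. The MHI pixel value decays linearly
    with the time since motion last occurred at that position; the MEI is
    the binary union of all motion over the window.
    """
    mhi = np.zeros(frames[0].shape, float)
    for prev, cur in zip(frames, frames[1:]):
        moving = np.abs(cur.astype(int) - prev.astype(int)) > thresh
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))
    mei = mhi > 0
    return mei, mhi / tau                    # MHI scaled to [0, 1]
```

Moment-based features computed from these images are then compared against stored prototypes of each action.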
4.2. State-Space Approaches
State space models have been widely used to predict, estimate,
and detect time series over a long period of time. One representative model is the hidden Markov model (HMM), which is a
probabilistic technique for the study of discrete time series.
HMM has been used extensively in speech recognition, but only
recently has it been adopted for recognition of human motion
sequences in computer vision [5]. Its model structure could be
summarized as a hidden Markov chain and a finite set of output
probability distributions [69]. The basic structure of an HMM
is shown in Fig. 9, where S_i represents each state, connected by
probabilities to other states or itself, and y(t) is the observation
derived from each state. The main tool in HMM is the Baum–Welch (forward–backward) algorithm for maximum likelihood
estimation of the model parameters. Features for recognition in
each state vary from points and lines to 2D blobs. We will organize our discussion according to the complexity of feature
extraction.

FIG. 9. The basic structure of a hidden Markov model.

Bobick [13] and Campbell [14] applied 2D or 3D Cartesian
tracking data sensed by MLDs of body joints for activity recognition. More specifically, the trajectories of multiple
body joints form a high-dimensional phase space, and this phase
space or its subspace is employed as the feature for recognition.
Both researchers intended to convert the continuous motion into
a set of discrete symbolic representations. The feature vector in
each frame in a motion sequence is portrayed as a point in the
phase space belonging to a certain state. A typical gesture is defined as an ordered sequence of these states restricted by motion
constraints. In Campbell [14], the learning/training process is
accomplished by fitting the unique curve of a gesture into the
subset of the phase space with low-order polynomials. Gesture
recognition is based on finding the maximum correlation between the predictor and the current motion. Bobick [13], on the
other hand, applied the k-means algorithm to find the center for
each cluster (or state) from the sample points of the trajectories. Matching of trajectories of 3D MLDs to a gesture

was accomplished through dynamic programming. The major
concern with these approaches is how to reliably extract these 3D
trajectories from general video streams.
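
For concreteness, the sketch below scores an observation sequence under a trained discrete-output HMM using the scaled forward algorithm; recognition selects the activity class whose model yields the highest log-likelihood. The parameters are assumed to have been estimated beforehand, e.g., by Baum–Welch:

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """log P(obs | model) for a discrete-output HMM.

    pi: (S,) initial state probabilities; A: (S, S) transitions with
    A[i, j] = P(state j | state i); B: (S, V) output probabilities over
    V symbols; obs: sequence of observed symbol indices.
    """
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()                # rescale at every step to avoid underflow
    alpha /= c
    log_lik = np.log(c)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()
        alpha /= c
        log_lik += np.log(c)
    return log_lik
```

Classification then reduces to an argmax of forward_log_likelihood over the per-class models.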
Goddard's human movement recognition [63] focused on the
lower limb segments of the human stick figure. 2D projections of
the joints were used directly, without interpretation, as inputs,
and features for recognition were encoded by coarse orientation
and coarse angular speed of the line segments in the image plane.
Although Goddard [63] did not directly apply HMM in his work
for discrimination of human gaits, he also considered a movement as a composition of events linked by time intervals. Each
event is independent of the movement. The links between these
events are restrained by feature and temporal constraints. This
type of structure, called a "scenario," is easily extended by assembling scenarios into composite scenarios. Matching real scene
kinematics to modeled movements involves visual motion features in the scene, a priori events matched to the scenario, and
the temporal constraints on the sequence and time intervals.
Another commonly used feature for identifying human movements or activities is 2D blobs, which could be obtained from any
homogeneous regions based on motion, color, texture, etc. The
work by Yamato et al. [5] is perhaps the first one on recognition
of human action in this category. Mesh features of binary moving
human blobs are used as the low-level feature for learning and
recognition. Learning was accomplished by training the HMMs
to generate symbolic patterns for each class. Optimization of the
model parameters is achieved using the Baum–Welch algorithm.
Finally, recognition is based on the output of the given image
sequence using forward calculation. Recent work by Starner and
Pentland [64] applied a similar method to recognition of American Sign Language. Instead of directly using the low-level mesh
feature, they introduced the position of a hand blob, its angle of
axis of least inertia, and eccentricity of its bounding ellipse in
the feature vector. Brand et al. [65] applied the coupled HMMs
to recognize the actions by two associated blobs (for example,
the right- and left-hand blobs in motion). Instead of directly
composing the two interacting processes into a single HMM process whose structural complexity is the product of that of each
individual process, they coupled the two HMMs in such a way
that the structural complexity is merely twice that of each individual HMM. Bregler's paper [40] presents a comprehensive
framework for recognizing human movements using statistical
decomposition of human dynamics at various levels of abstraction. The recognition process starts with low-level processing,
during which blobs are estimated as a mixture of Gaussian models based on motion and color similarity, spatial proximity, and
prediction from the previous frame using the EM algorithm. During
this process, blobs of various body parts are implicitly tracked
over image sequences. In the mid-level processing, blobs with
coherent motion are fitted into simple movements of dynamic systems. For example, a walking cycle is considered as a composition of two simple movements, one when the leg provides the
ground support and one when the leg swings in the air. At the
highest level, HMMs are used as a mixture of these mid-level

dynamic systems to represent the complex movement. Recognition is completed by maximizing the posterior probability of
the HMM model from trained data given the input image sequence. Most of these methods include low-level processing of
feature extraction. But again, all recognition methods relying on
2D features are view-based.
5. CONCLUSION
We have presented an overview of past developments in human motion analysis. Our discussion has focused on three major
tasks: (1) motion analysis of human body parts, (2) high-level
tracking of human motion from a single view or multiple perspectives, and (3) recognition of human movements or activities
based on successfully tracking features over an image sequence.
Motion analysis of human body parts is essentially the 2D or
3D interpretation of the human body structure using the motion
of the body parts over image sequences, and involves low-level
tasks such as body part segmentation and joint location and detection. Tracking human motion is performed at a higher level at
which the parts of the human body are not explicitly identified,
e.g., the human body is considered as a whole when establishing
matches between consecutive frames. Our discussion is categorized by whether the subject is imaged at one time instant from a
single view or from multiple perspectives. Tracking a subject in
image sequences from multiple perspectives requires that features be projected into a common spatial reference. The task
of recognizing human activity in image sequences assumes that
feature tracking for recognition has been accomplished. Two
types of techniques are addressed: state-space approaches and
template matching approaches. Template matching is simple to
implement, but sensitive to noise and the time interval of the
movements. State-space approaches, on the other hand, overcome these drawbacks but usually involve complex iterative
computation.
The key to successful execution of high-level tasks is to establish feature correspondence between consecutive frames, which
still remains a bottleneck in most of the approaches. Typically,
constraints on human motion or human models are imposed in
order to decrease the ambiguity during the matching process.
However, these assumptions may not conform to general imaging conditions or may introduce other intractable issues, such as
the difficulty in estimating model parameters from the real image data. Recognition of human motion is just in its infancy,
and there exists a trade-off between computational cost and motion duration accuracy for methodologies based on state-space
models and template matching. We are still looking forward to
new techniques to improve the performance and, meanwhile,
decrease the computational cost.
ACKNOWLEDGMENTS
The work was supported in part by the United States Army Research
Office under Contracts DAAH04-95-I-0494 and DAAG55-98-1-0230 and the
Texas Higher Education Coordinating Board Advanced Technology Project

95-ATP-442 and Advanced Research Project 97-ARP-275. Special thanks also
go to Ms. Debi Paxton and Mr. Shishir Shah for their generous help in editing
and commenting on the paper.

REFERENCES

1. J. K. Aggarwal and N. Nandhakumar, On the computation of motion of sequences of images—A review, Proc. IEEE 76(8), 1988, 917–934.
2. S. Bandi and D. Thalmann, A configuration space approach for efficient animation of human figures, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Puerto Rico, 1997, pp. 38–45.
3. J. K. Aggarwal and Q. Cai, Human motion analysis: A review, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Puerto Rico, 1997, pp. 90–102.
4. J. K. Aggarwal, Q. Cai, W. Liao, and B. Sabata, Non-rigid motion analysis: Articulated & elastic motion, CVGIP: Image Understanding, 1997.
5. J. Yamato, J. Ohya, and K. Ishii, Recognizing human action in time-sequential images using hidden Markov model, in Proc. IEEE Conf. CVPR, Champaign, IL, June 1992, pp. 379–385.
6. W. Kinzel, Pedestrian recognition by modelling their shapes and movements, in Progress in Image Analysis and Processing III, Proc. of the 7th Intl. Conf. on Image Analysis and Processing 1993, Singapore, 1994, pp. 547–554.
7. A. G. Bharatkumar, K. E. Daigle, M. G. Pandy, Q. Cai, and J. K. Aggarwal, Lower limb kinematics of human walking with the medial axis transformation, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Austin, TX, 1994, pp. 70–76.
8. S. A. Niyogi and E. H. Adelson, Analyzing and recognizing walking figures in XYT, in Proc. IEEE CS Conf. on CVPR, Seattle, WA, 1994, pp. 469–474.
9. G. Johansson, Visual motion perception, Sci. Am. 232(6), 1975, 76–88.
10. R. F. Rashid, Towards a system for the interpretation of moving light displays, IEEE Trans. PAMI 2(6), 1980, 574–581.
11. J. A. Webb and J. K. Aggarwal, Visually interpreting the motion of objects in space, IEEE Comput., August 1981, 40–46.
12. J. A. Webb and J. K. Aggarwal, Structure from motion of rigid and jointed objects, Artif. Intell. 19, 1982, 107–130.
13. A. F. Bobick and A. D. Wilson, A state-based technique for the summarization and recognition of gesture, in Proc. of 5th Intl. Conf. on Computer Vision, 1995, pp. 382–388.
14. L. W. Campbell and A. F. Bobick, Recognition of human body motion using phase space constraints, in Proc. of 5th Intl. Conf. on Computer Vision, 1995, pp. 624–630.
15. A. Shio and J. Sklansky, Segmentation of people in motion, in Proc. of IEEE Workshop on Visual Motion, IEEE Computer Society, October 1991, pp. 325–332.
16. B. K. P. Horn and B. G. Schunck, Determining optical flow, Artif. Intell. 17, 1981, 185–204.
17. S. Kurakake and R. Nevatia, Description and tracking of moving articulated objects, in 11th Intl. Conf. on Pattern Recognition, The Hague, Netherlands, 1992, Vol. 1, pp. 491–495.
18. I. A. Kakadiaris, D. Metaxas, and R. Bajcsy, Active part-decomposition, shape and motion estimation of articulated objects: A physics-based approach, in Proc. IEEE CS Conf. on CVPR, Seattle, WA, 1994, pp. 980–984.
19. I. A. Kakadiaris and D. Metaxas, 3D human body model acquisition from multiple views, in Proc. of 5th Intl. Conf. on Computer Vision, 1995, pp. 618–623.
20. H. A. Rowley and J. M. Rehg, Analyzing articulated motion using expectation-maximization, in Proc. of Intl. Conf. on Pattern Recognition, Puerto Rico, 1997, pp. 935–941.
21. A. Jepson and M. Black, Mixture models for optical flow computation, in Proc. of Intl. Conf. on Pattern Recognition, New York City, 1993, pp. 760–761.
22. Z. Chen and H. J. Lee, Knowledge-guided visual perception of 3D human gait from a single image sequence, IEEE Trans. Systems, Man, Cybernet. 22(2), 1992, 336–342.
23. E. Huber, 3D real-time gesture recognition using proximity space, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 136–141.
24. S. Iwasawa, K. Ebihara, J. Ohya, and S. Morishima, Real-time estimation of human body posture from monocular thermal images, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 15–20.
25. K. Akita, Image sequence analysis of real world human motion, Pattern Recognit. 17(1), 1984, 73–83.
26. F. J. Perales and J. Torres, A system for human motion matching between synthetic and real images based on a biomechanic graphical model, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Austin, TX, 1994, pp. 83–88.
27. M. K. Leung and Y. H. Yang, First Sight: A human body outline labeling system, IEEE Trans. PAMI 17(4), 1995, 359–377.
28. J. O'Rourke and N. I. Badler, Model-based image analysis of human motion using constraint propagation, IEEE Trans. PAMI 2(6), 1980, 522–536.
29. D. Hogg, Model-based vision: A program to see a walking person, Image and Vision Computing 1(1), 1983, 5–20.
30. K. Rohr, Towards model-based recognition of human movements in image sequences, CVGIP: Image Understanding 59(1), 1994, 94–115.
31. D. Marr and H. K. Nishihara, Representation and recognition of the spatial organization of three-dimensional shapes, in Proc. R. Soc. London, 1978, Vol. B, pp. 269–294.
32. S. Wachter and H. H. Nagel, Tracking of persons in monocular image sequences, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Puerto Rico, 1997, pp. 2–9.
33. J. M. Rehg and T. Kanade, Model-based tracking of self-occluding articulated objects, in Proc. of 5th Intl. Conf. on Computer Vision, Cambridge, MA, 1995, pp. 612–617.
34. L. Goncalves, E. Di Bernardo, E. Ursella, and P. Perona, Monocular tracking of the human arm in 3D, in Proc. of 5th Intl. Conf. on Computer Vision, Cambridge, MA, 1995, pp. 764–770.
35. I. A. Kakadiaris and D. Metaxas, Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection, in Proc. IEEE CS Conf. on CVPR, San Francisco, CA, 1996, pp. 81–87.
36. D. M. Gavrila and L. S. Davis, Towards 3-D model-based tracking and recognition of human movement: A multi-view approach, in Proc. Intl. Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, June 1995.
37. D. M. Gavrila and L. S. Davis, 3-D model-based tracking of humans in action: A multi-view approach, in Proc. IEEE CS Conf. on CVPR, San Francisco, CA, 1996, pp. 73–80.
38. H. G. Barrow, Parametric correspondence and chamfer matching: Two new techniques for image matching, in Proc. IJCAI, 1977, pp. 659–663.
39. E. A. Hunter, P. H. Kelly, and R. C. Jain, Estimation of articulated motion using kinematically constrained mixture densities, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Puerto Rico, 1997, pp. 10–17.
40. C. Bregler, Learning and recognizing human dynamics in video sequences, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 568–574.
41. J. K. Aggarwal, L. S. Davis, and W. N. Martin, Correspondence processes in dynamic scene analysis, Proc. IEEE 69(5), 1981, 562–572.
42. R. Polana and R. Nelson, Low level recognition of human motion (or how to get your man without finding his body parts), in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Austin, TX, 1994, pp. 77–82.
43. Q. Cai, A. Mitiche, and J. K. Aggarwal, Tracking human motion in an indoor environment, in Proc. of 2nd Intl. Conf. on Image Processing, Washington, D.C., October 1995, Vol. 1, pp. 215–218.
44. J. Segen and S. Pingali, A camera-based system for tracking people in real time, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 63–67.
45. Y. Okawa and S. Hanatani, Recognition of human body motions by robots, in Proc. IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, Raleigh, NC, 1992, pp. 2139–2146.
46. M. Rossi and A. Bozzoli, Tracking and counting people, in 1st Intl. Conf. on Image Processing, Austin, TX, November 1994, pp. 212–216.
47. S. S. Intille and A. F. Bobick, Closed-world tracking, in Proc. Intl. Conf. Comp. Vis., 1995, pp. 672–678.
48. S. S. Intille, J. W. Davis, and A. F. Bobick, Real-time closed-world tracking, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 697–703.
49. C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, Pfinder: Real-time tracking of the human body, in Proc. SPIE, Bellingham, WA, 1995.
50. A. Azarbayejani and A. Pentland, Real-time self-calibrating stereo person tracking using 3D shape estimation from blob features, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 627–632.
51. B. Gunsel, A. M. Tekalp, and P. J. L. Beek, Object-based video indexing for virtual studio production, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 769–774.
52. B. Heisele, U. Kreßel, and W. Ritter, Tracking non-rigid moving objects based on color cluster flow, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 257–260.
53. J. Yang and A. Waibel, A real-time face tracker, in Proc. of IEEE Computer Society Workshop on Applications of Computer Vision, Sarasota, FL, 1996, pp. 142–147.
54. P. Fieguth and D. Terzopoulos, Color-based tracking of heads and other mobile objects at video frame rate, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 21–27.
55. D. A. Forsyth and M. M. Fleck, Identifying nude pictures, in Proc. of IEEE Computer Society Workshop on Applications of Computer Vision, Sarasota, FL, 1996, pp. 103–108.
56. K. Sato, T. Maeda, H. Kato, and S. Inokuchi, CAD-based object tracking with distributed monocular camera for security monitoring, in Proc. 2nd CAD-Based Vision Workshop, Champion, PA, February 1994, pp. 291–297.
57. R. Jain and K. Wakimoto, Multiple perspective interactive video, in Proc. of Intl. Conf. on Multimedia Computing and Systems, 1995, pp. 201–211.
58. P. H. Kelly, A. Katkere, D. Y. Kuramura, S. Moezzi, S. Chatterjee, and R. Jain, An architecture for multiple perspective interactive video, in Proc. of ACM Conf. on Multimedia, 1995, pp. 201–212.
59. Q. Cai and J. K. Aggarwal, Tracking human motion using multiple cameras, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 68–72.
60. Q. Cai and J. K. Aggarwal, Automatic tracking of human motion in indoor scenes across multiple synchronized video streams, in Intl. Conf. on Computer Vision, Bombay, India, January 1998.
61. Y. Kuno and T. Watanabe, Automatic detection of human for visual surveillance system, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 865–869.
62. J. Farmer, M. Casdagli, S. Eubank, and J. Gibson, State-space reconstruction in the presence of noise, Physica D 51, 1991, 52–98.
63. N. H. Goddard, Incremental model-based discrimination of articulated movement from motion features, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Austin, TX, 1994, pp. 89–95.
64. T. Starner and A. Pentland, Visual recognition of American Sign Language using hidden Markov models, in Proc. Intl. Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, June 1995.
65. M. Brand, N. Oliver, and A. Pentland, Coupled hidden Markov models for complex action recognition, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 994–999.
66. A. F. Bobick and J. Davis, Real-time recognition of activity using temporal templates, in Proc. of IEEE Computer Society Workshop on Applications of Computer Vision, Sarasota, FL, 1996, pp. 39–42.
67. Y. Cui and J. J. Weng, Hand segmentation using learning-based prediction and verification for hand sign recognition, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 88–93.
68. J. E. Boyd and J. J. Little, Global versus structured interpretation of motion: Moving light displays, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Puerto Rico, 1997, pp. 18–25.
69. A. B. Poritz, Hidden Markov models: A guided tour, in Proc. IEEE Intl. Conf. on Acoust., Speech and Signal Proc., 1988, pp. 7–13.