
Model-based Human Posture Estimation for Gesture Analysis in an Opportunistic Fusion Smart Camera Network

Chen Wu and Hamid Aghajan
Department of Electrical Engineering, Stanford University, USA

Abstract
In multi-camera networks, rich visual data is provided both spatially and temporally. In this paper a method of human posture estimation is described incorporating the concept of an opportunistic fusion framework, aiming to employ manifold sources of visual information across space, time, and feature levels. One motivation for the proposed method is to reduce raw visual data in a single camera to elliptical parameterized segments for efficient communication between cameras. A 3D human body model is employed as the convergence point of spatiotemporal and feature fusion. It maintains both the geometric parameters of the human posture and the adaptively learned appearance attributes, all of which are updated from the three dimensions of space, time, and features of the opportunistic fusion. At sufficient confidence levels, parameters of the 3D human body model are in turn used as feedback to aid subsequent in-node vision analysis. The color distribution registered in the model is used to initialize segmentation. Perceptually Organized Expectation Maximization (POEM) is then applied to refine color segments with observations from a single camera. The geometric configuration of the 3D skeleton is estimated by Particle Swarm Optimization (PSO).

1. Introduction

In a multi-camera network, access to multiple sources of visual data often allows for making more comprehensive interpretations of events and gestures. It also creates a pervasive sensing environment for applications where it is impractical for the users to wear sensors. Example applications include surveillance, smart home care, and gaming. In this paper we propose a method of human posture estimation using an opportunistic fusion framework to employ manifold sources of information obtained from the camera network in a principled way. The framework spans the three dimensions of space (different camera views), time (each camera collecting data over time), and feature levels (selecting and fusing different feature subsets). Our work aims for intelligent and efficient vision interpretations in a camera network. One underlying constraint of the network is its relatively low bandwidth. Therefore, for efficient collaboration between cameras, we expect concise descriptions instead of raw image data as outputs from local processing in a single camera. This process inevitably removes certain details in the images of a single camera, which requires the camera to have some intelligence about its observations (smart cameras), i.e., some knowledge of the subject. This motivates opportunistic data fusion between cameras, which compensates for partial observations in individual cameras. The output of opportunistic data fusion (a model of the subject) is fed back to local processing; in turn, the outputs of local processing in single cameras enable opportunistic data fusion by contributing local descriptions from multiple views. It is this interactive loop that brings the potential for achieving both efficient and adequate vision-based analysis in the camera network. An example of the communication model between five cameras to reconstruct the person's model is shown in Fig. 1; the circled numbers represent the sequence of events.

In our approach a 3D human body model embodies up-to-date information from both current and historical observations of all cameras in a concise way. It has the following components:

1. Geometric configuration: body part lengths and angles.
2. Color or texture of body parts.
3. Motion of body parts.

The three components are all updated from the three dimensions of space, time, and features of the opportunistic fusion. The 3D human model takes on two roles. One is as an intermediate step for high-level, application-pertinent gesture interpretation; the other is to create a feedback path from spatiotemporal and feature fusion operations to low-level vision processing in each camera. It is true that for a number of gestures a human body model may not be needed to interpret the gesture. There is existing work on hand gesture recognition [1] where only part of the body is analyzed, and some gestures can also be detected through spatiotemporal motion patterns of certain body parts [2, 3]. However, as the set of gestures to differentiate expands, it becomes increasingly difficult to devise methods for gesture recognition based on only a few cues. A 3D human body model provides a unified interface for a variety of gesture interpretations. Moreover, instead of being a passive output representing decisions from spatiotemporal and feature fusion, the 3D model implicitly enables more interaction between the three dimensions by being actively involved in vision analysis. For example, although predefined appearance attributes are generally not reliable, adaptively learned appearance attributes can be used to identify the person or body parts; such attributes are usually more distinguishable than generic features such as edges.

Fitting human models to images or videos has been an interesting topic for which a variety of methods have been developed. Some reconstruct 3D representations of human models from a single camera's view [4, 5]. Due to the self-occlusive nature of the human body, which causes ambiguity from a single view, most of these methods rely on a restricted dynamic model of behaviors, and tracking can easily fail in the case of sudden motions or other movements that differ much from the dynamic model. In 3D model reconstruction from multi-view cameras [6, 7], most methods start from silhouettes in different cameras, from which the points occupied by the subject are estimated, and finally a 3D model with principal body parts is fit in the 3D space [8]. This approach relies heavily on the silhouettes obtained from each image and is also sensitive to the accuracy of camera calibration. However, in many situations background subtraction for silhouettes suffers in quality or is almost impossible due to a cluttered background or camouflaged foreground. Another aspect of the human model fitting problem is the choice of image features. All human model fitting methods are based on some image features as targets to fit the model. Most are based on generic features such as silhouettes or edges [9, 7]. Some use skin color, but such methods are prone to failure in some situations, since lighting usually has a big influence on colors and skin color varies from person to person.

In the rest of the paper, we first introduce the opportunistic fusion framework, as well as an implementation of its concepts through human gesture analysis, in Section 2. In Section 3, image segmentation in a single camera is described in detail. The color distribution maintained in the model is used to initialize segmentation. Perceptually Organized Expectation Maximization (POEM) is then applied to refine color segments with observations from a single camera, followed by a watershed algorithm to assign segment labels to all pixels based on spatial relationships. Finally, ellipse fitting is used to parameterize segments in order to create concise segment descriptions for communication. In Section 4, Particle Swarm Optimization (PSO) is used for 3D model fitting. Examples are shown to demonstrate the capability of the elliptical segments for posture estimation.

[Figure 1 sketch: five cameras (CAM 1-CAM 5) observe the subject; circled numbers mark the sequence of events. Camera 5 wants to update its knowledge of the subject and broadcasts a request for collaboration; the other cameras send the requested descriptions (vectors of parameters from local processing); fusion updates the local knowledge of the subject; the updated knowledge is fed back and used in local processing.]
Figure 1: Communication for collaboration in the camera network.


2. Opportunistic Fusion for Human Gesture Analysis


We propose a framework of opportunistic fusion in multi-camera networks in order to both employ the rich visual information provided by the cameras and incorporate learned knowledge of the subject into active vision analysis. The opportunistic fusion framework is composed of three dimensions: space, time, and feature levels. In the rest of the paper, the problem of human gesture analysis is elaborated to show how these concepts can be implemented.

2.1. The Fusion Framework Overview


The opportunistic fusion framework for gesture analysis is shown in Fig. 2. At the top of Fig. 2 are the spatial fusion modules; in parallel is the progression of the 3D human body model. Suppose at time t0 we have the model with the collection of parameters M0. At the next instant t1, the current model M0 is input to the spatial fusion module for t1, and the output decisions are used to update M0, from which we get the new 3D model M1. Now we look into a specific spatial fusion module (the lower part of Fig. 2) for the detailed process. In the lowest level of the layered gesture analysis, image features are extracted by local processing. No explicit collaboration between cameras is done at this stage, since communication is not expected until images/videos are reduced to short descriptions. Distinct features (e.g., colors) specific to the subject are registered in the current model M0 and are used for analysis (arrow 1 in Fig. 2). The intuition here is that we adaptively learn the attributes that distinguish the subject, save them as marks in the 3D model, and then use those marks to look for the subject. After local processing, data is shared between cameras to derive a new estimate of the model.

[Figure 2 sketch: along the time axis, the 3D human body model progresses from the old model to the updated model (the output of spatiotemporal fusion), updated through model history and new observations; along the space axis, local processing and spatial collaboration take place in the camera network. Description layers rise from Layer 1 (images: regions R1-R3) through Layer 2 (features f11-f32, fused into F1-F3) and Layer 3 (gesture elements E1-E3) to Layer 4 (gestures G). Decision Layer 1 operates within a single camera; Decision Layers 2 and 3 involve collaboration between cameras. Arrow 1: active vision (temporal fusion); arrow 2: decision feedback to update the model (spatial fusion); arrow 3: model -> gesture interpretations.]

Figure 2: Spatiotemporal fusion framework for human gesture analysis.

Parameters in M0 specify a smaller space of possible models M1. Decisions from the spatial fusion of cameras are then used to update M0 to obtain the new model M1 (arrow 2 in Fig. 2). Therefore, every update of the model M combines space (spatial collaboration between cameras), time (the previous model M0), and feature levels (the choice of image features in local processing, drawn from both new observations and the subject-specific attributes in M0). Finally, the new model M1 is used for high-level gesture deductions in a certain scenario (arrow 3 in Fig. 2).
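As a minimal illustration of this loop (the names BodyModel, local_processing, spatial_fusion, and fusion_step are hypothetical and the functions are stubs written for this article, not taken from the paper), one update M0 -> M1 might be organized as:

```python
from dataclasses import dataclass

@dataclass
class BodyModel:
    angles: dict   # geometric configuration: body part lengths, angles
    colors: dict   # adaptively learned appearance attributes
    motion: dict   # recent motion of body parts

def local_processing(image, model):
    # Feature level: attributes registered in M0 guide in-node analysis
    # (arrow 1 in Fig. 2); output is a concise description, not raw pixels.
    return {"ellipses": [], "colors": model.colors}

def spatial_fusion(descriptions, model):
    # Space: merge concise per-camera descriptions into posture decisions.
    return {"angles": model.angles, "colors": model.colors}   # stub decision

def fusion_step(model, images):
    # Time: the previous model M0 constrains and is updated to M1 (arrow 2).
    descriptions = [local_processing(im, model) for im in images]
    d = spatial_fusion(descriptions, model)
    return BodyModel(angles=d["angles"], colors=d["colors"], motion=model.motion)

m1 = fusion_step(BodyModel({}, {}, {}), images=[None] * 5)
```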

2.2. 3D Body Model Reconstruction Overview


An implementation of 3D human body posture estimation is presented in this paper. Elements of the opportunistic fusion framework described above are incorporated in this algorithm as illustrated in Fig. 3. Local processing in a single camera includes segmentation and ellipse fitting for a concise parameterization of segments. We assume the 3D model is initialized with a distinct color distribution for the subject. For each camera, the color distribution is first refined using the EM (Expectation Maximization) algorithm and then used for segmentation. Undetermined pixels from EM are assigned labels through watershed segmentation. For spatial collaboration, ellipses from all cameras are merged to find the geometric configuration of the 3D skeleton model. Candidate configurations are examined using PSO (Particle Swarm Optimization). Details and experimental results of the algorithm are presented in Sections 3 and 4.

3. In-Node Feature Extraction

The goal of local processing in a single camera is to reduce raw images/videos to simple descriptions so that they can be efficiently transmitted between the cameras. The output of the algorithm is a set of ellipses fitted to segments, together with the mean color of each segment. As shown in the upper part of Fig. 3, local processing includes image segmentation for the subject and ellipse fitting to the extracted segments.

We assume the subject is characterized by a distinct color distribution. The foreground area is obtained through background subtraction. Pixels with very high or low illumination are also removed, since their chrominance may not be reliable. A rough segmentation of the foreground is then computed, based either on K-means clustering of the chrominance of the foreground pixels or on the color distributions from the model. In the initialization stage, when the model has not been well established or when we do not have high confidence in the model, we need to start from the image itself and use a method such as K-means to find the color distribution of the subject. However, when a model with a reliable color distribution is available, we can directly assign pixels to segments based on the existing color distribution. In practice, the color distribution maintained by the model may not be uniformly accurate for all cameras, due to effects such as color map changes or illumination differences; the subject's appearance may also change within a single camera due to movement or lighting conditions. Therefore, the color distribution of the model is only used as a rough initialization of the segmentation. An EM algorithm is then used to refine the color distribution for the current image. The initial color-distribution estimate plays an important role because it can prevent EM from becoming trapped in local minima.

Suppose the color distribution is a mixture of N Gaussian modes with parameters \Theta = \{\theta_1, \theta_2, \ldots, \theta_N\}, where \theta_l = \{\mu_l, \Sigma_l\} are the mean and covariance matrix of mode l, and the mixing weights of the modes are A = \{\alpha_1, \alpha_2, \ldots, \alpha_N\}. We need to find the probability of each pixel x_i belonging to a certain mode l: \Pr(y_i = l \mid x_i). From standard EM for Gaussian mixture models (GMM), the E step is

\[ \Pr^{(k+1)}(y_i = l \mid x_i) \propto \alpha_l^{(k)} P_{\theta_l^{(k)}}(x_i), \quad l = 1, \ldots, N, \qquad \sum_{l=1}^{N} \Pr^{(k+1)}(y_i = l \mid x_i) = 1, \tag{1} \]

and the M step is

\[ \mu_l^{(k+1)} = \frac{\sum_{i=1}^{M} x_i \Pr(y_i = l \mid x_i, \Theta^{(k)})}{\sum_{i=1}^{M} \Pr(y_i = l \mid x_i, \Theta^{(k)})}, \tag{2} \]

\[ \Sigma_l^{(k+1)} = \frac{\sum_{i=1}^{M} (x_i - \mu_l^{(k)})(x_i - \mu_l^{(k)})^T \Pr(y_i = l \mid x_i, \Theta^{(k)})}{\sum_{i=1}^{M} \Pr(y_i = l \mid x_i, \Theta^{(k)})}, \tag{3} \]

\[ \alpha_l^{(k+1)} = \frac{1}{M} \sum_{i=1}^{M} \Pr^{(k+1)}(y_i = l \mid x_i), \tag{4} \]

where k is the iteration index and M is the number of pixels. The M step is obtained by maximizing the log-likelihood

\[ L(x; \Theta) = \sum_{i=1}^{M} \sum_{l=1}^{N} \Pr(y_i = l \mid x_i) \log \Pr(x_i \mid \theta_l). \tag{5} \]
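For concreteness, eqs. (1)-(5) can be sketched in NumPy as follows. This is a generic GMM-EM sketch written for this article rather than the authors' code; it assumes the pixel colors are given as an M x d array x, with alphas, mus, and covs initialized from the rough segmentation.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Evaluate the Gaussian density P_theta(x) for each row of x."""
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def em_step(x, alphas, mus, covs):
    """One EM iteration for a Gaussian mixture, following eqs. (1)-(4)."""
    # E step, eq. (1): posteriors Pr(y_i = l | x_i), normalized over modes
    post = np.stack([a * gaussian_pdf(x, m, c)
                     for a, m, c in zip(alphas, mus, covs)], axis=1)
    post /= post.sum(axis=1, keepdims=True)
    # M step, eqs. (2)-(4): posterior-weighted mean, covariance, weights
    for l in range(len(alphas)):
        w = post[:, l]
        mus[l] = (w[:, None] * x).sum(axis=0) / w.sum()
        diff = x - mus[l]
        covs[l] = np.einsum('i,ij,ik->jk', w, diff, diff) / w.sum()
        alphas[l] = w.mean()
    return post, alphas, mus, covs

# Iterate em_step until the log-likelihood of eq. (5) stops improving.
```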

[Figure 3 sketch: local processing comprises background subtraction, rough segmentation, EM refinement of the color models, watershed segmentation, and ellipse fitting, with the previous color distribution supplied by the 3D human body model. For model fitting, test configurations are generated from the previous geometric configuration and motion, scored together with the local processing results from the other cameras, and updated using PSO until the stop criterion is met; the three views are combined to obtain the geometric configuration of the 3D skeleton, and the 3D model (color/texture, motion) is updated.]
Figure 3: Algorithm flowchart for 3D human skeleton model reconstruction.




Figure 4: Ellipse fitting. (a) original image; (b) segments; (c) simple ellipse fitting to connected regions; (d) improved ellipse fitting.

However, this basic EM algorithm treats each pixel independently, ignoring the fact that pixels belonging to the same mode are usually spatially close to each other. Perceptually Organized EM (POEM) was introduced in [10] to address this. In POEM, the influence of neighbors is incorporated through a weighting measure

\[ w(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{\sigma_1^2} - \frac{\|s(x_i) - s(x_j)\|^2}{\sigma_2^2}}, \tag{6} \]

where s(x_i) is the spatial coordinate of x_i. The votes for x_i from its neighborhood are then given by

\[ V_l(x_i) = \sum_{x_j} \pi_l(x_j)\, w(x_i, x_j), \quad \text{where } \pi_l(x_j) = \Pr(y_j = l \mid x_j). \tag{7} \]

Based on this voting scheme, the following modifications are made to the EM steps. In the E step, \alpha_l^{(k)} is changed to \alpha_l^{(k)}(x_i); that is, every pixel x_i has its own mixing weights, determined in part by the influence of its neighbors. In the M step, the mixing weights are updated as

\[ \alpha_l^{(k)}(x_i) = \frac{e^{\beta V_l(x_i)}}{\sum_{m=1}^{N} e^{\beta V_m(x_i)}}, \tag{8} \]

in which \beta controls the softness of the neighbors' votes. If \beta is close to 0, the mixing weights are always uniform; as \beta approaches infinity, the mixing weight of the mode with the largest vote tends to 1.

After refining the color distribution with POEM, we take pixels with a high probability (e.g., larger than 99.9%) of belonging to a certain mode as markers for that mode. A watershed segmentation algorithm then assigns labels to the undecided pixels. Finally, to obtain a concise parameterization, an ellipse is fitted to each segment. Note that a segment refers to a spatially connected region of the same mode; a single mode can therefore have several segments. When a segment is generally convex and has a shape similar to an ellipse, the fitted ellipse represents it well. When the segment's shape differs considerably from an ellipse, however, a direct fitting step may not be sufficient. To address such cases, we first test the similarity between the segment and an ellipse by fitting an ellipse to the segment and measuring their overlap. If the similarity is low, the segment is split into two segments, and this process is carried out recursively on every segment until all segments meet the similarity criterion. In Fig. 4, direct ellipse fitting to every segment gives Fig. 4(c), while the test-and-split procedure produces the correct ellipses shown in Fig. 4(d). Experimental results are shown in Fig. 5.
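The neighborhood voting of eqs. (6)-(8) can be sketched as follows. This is an illustrative NumPy implementation written for this article, not the authors' code; it assumes an H x W x 3 float image img, E-step posteriors post of shape H x W x N, and limits the neighborhood to a (2r+1) x (2r+1) window, with sigma1, sigma2, and beta as free parameters.

```python
import numpy as np

def poem_mixing_weights(img, post, sigma1=10.0, sigma2=3.0, beta=1.0, r=2):
    """Per-pixel mixing weights alpha_l(x_i) from neighborhood votes."""
    H, W, N = post.shape
    votes = np.zeros_like(post)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            ys = slice(max(0, dy), H + min(0, dy))
            xs = slice(max(0, dx), W + min(0, dx))
            ys2 = slice(max(0, -dy), H + min(0, -dy))
            xs2 = slice(max(0, -dx), W + min(0, -dx))
            # eq. (6): weight from color distance and spatial distance
            cdist = np.sum((img[ys, xs] - img[ys2, xs2]) ** 2, axis=-1)
            sdist = float(dy * dy + dx * dx)
            w = np.exp(-cdist / sigma1**2 - sdist / sigma2**2)
            # eq. (7): accumulate votes pi_l(x_j) * w(x_i, x_j)
            votes[ys, xs] += post[ys2, xs2] * w[..., None]
    # eq. (8): softmax over modes; beta controls the softness of the votes
    e = np.exp(beta * votes)
    return e / e.sum(axis=-1, keepdims=True)
```

The returned per-pixel weights play the role of \alpha_l(x_i) in the E step; thresholding the refined posteriors (e.g., above 99.9%) then yields the markers for the watershed step described above.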


Figure 6: 3D skeleton model fitting. (a) Top view of the experiment setting. (b) The 3D skeleton reconstructed from ellipses from multi-view cameras.

Figure 5: Experiment results of local processing. (a) original images; (b) segments; (c) fitted ellipses.
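The recursive test-and-split fitting that produces ellipses like those in Fig. 4(d) and Fig. 5(c) can be sketched with moment-based ellipses. This is a simplified reconstruction written for this article, not the authors' implementation; the 2-sigma boundary, the IoU similarity measure, the 0.75 threshold, and the median split are illustrative choices.

```python
import numpy as np

def ellipse_params(ys, xs):
    """Moment-based ellipse of a segment: centroid + coordinate covariance."""
    c = np.array([ys.mean(), xs.mean()])
    cov = np.cov(np.stack([ys, xs])) + 1e-9 * np.eye(2)
    return c, cov

def ellipse_raster(shape, c, cov):
    """Boolean mask of the 2-sigma ellipse defined by (c, cov)."""
    d = np.stack(np.indices(shape), axis=-1) - c
    md = np.einsum('...i,ij,...j->...', d, np.linalg.inv(cov), d)
    return md <= 4.0

def fit_or_split(mask, min_iou=0.75):
    """Fit an ellipse; if its overlap with the segment is low, split the
    segment along its major axis and recurse (test-and-split)."""
    ys, xs = np.nonzero(mask)
    c, cov = ellipse_params(ys, xs)
    ell = ellipse_raster(mask.shape, c, cov)
    iou = np.logical_and(mask, ell).sum() / np.logical_or(mask, ell).sum()
    if iou >= min_iou or len(ys) < 20:
        return [(c, cov)]
    # Split at the median of the projections onto the major axis.
    axis = np.linalg.eigh(cov)[1][:, -1]
    proj = (np.stack([ys, xs], axis=1) - c) @ axis
    sel = proj < np.median(proj)
    if sel.all() or not sel.any():          # degenerate split, stop here
        return [(c, cov)]
    left = np.zeros_like(mask)
    right = np.zeros_like(mask)
    left[ys[sel], xs[sel]] = True
    right[ys[~sel], xs[~sel]] = True
    return fit_or_split(left, min_iou) + fit_or_split(right, min_iou)
```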

4. Collaborative Posture Estimation


Human posture estimation is essentially treated as an optimization problem in which we aim to minimize the distance between the posture and the ellipses from the multiple cameras. There are several possible ways to find the 3D skeleton model from multi-view observations. One method is to solve directly for the unknown parameters through geometric calculation. This requires first establishing correspondences between points/segments in different cameras, which is itself a hard problem: common point observations are rare for human subjects, and body parts may take on very different appearances from different views. It is therefore difficult to resolve the ambiguity in 3D space based on 2D observations. A second method is to cast the task as an optimization problem, finding the optimal \theta_i's and \phi_i's that minimize an objective function (e.g., the difference between the projections of a candidate 3D model and the actual segments) by exploiting properties of the objective function. However, if the problem is highly nonlinear or non-convex, it may be very difficult or time consuming to solve in this way. Therefore, search strategies that do not explicitly depend on the formulation of the objective function are desired. Motivated by [11, 12], Particle Swarm Optimization (PSO) is used for our optimization problem.

The lower part of Fig. 3 shows the estimation process. Ellipses from the local processing of single cameras are merged to reconstruct the skeleton (Fig. 6). Here we consider a simplified problem in which only the arms change position while the other body parts remain in their default locations. The elevation angles (\theta_i) and azimuth angles (\phi_i) of the left/right upper/lower parts of the arms are the parameters (Fig. 6(b)). The assumption is that the projection matrices from the 3D skeleton to the 2D image planes are known. They can be obtained either from the locations of the cameras and the subject, or calculated from known projective correspondences between the 3D subject and points in the images, without knowing the exact locations of the cameras or the subject.

PSO is suitable for posture estimation as an evolutionary optimization mechanism. It starts from a group of initial particles; during the evolution, the particles are directed toward good positions while keeping some randomness to explore the search space. Suppose there are N particles (test configurations) x_i, each a vector of \theta_i's and \phi_i's. The velocity of x_i is denoted v_i, the best position visited by x_i so far is \hat{x}_i, and the global best position over all particles so far is \hat{g}. The objective function is f(\cdot), and we seek the position x that minimizes f(x). The PSO algorithm is as follows:

1. Initialize x_i and v_i. The value of v_i is usually set to 0, and \hat{x}_i = x_i. Evaluate f(x_i) and set \hat{g} = \arg\min_{x_i} f(x_i).

2. While the stop criterion is not satisfied, for every x_i:

\[ v_i \leftarrow \omega v_i + c_1 r_1 (\hat{x}_i - x_i) + c_2 r_2 (\hat{g} - x_i), \qquad x_i \leftarrow x_i + v_i, \]

and if f(x_i) < f(\hat{x}_i), set \hat{x}_i = x_i; if f(x_i) < f(\hat{g}), set \hat{g} = x_i.

The stop criterion is that, after all N particles have been updated once, the improvement in f(\hat{g}) falls below a threshold, at which point the algorithm exits. Here \omega is the inertia coefficient, c_1 and c_2 are the social coefficients, and r_1 and r_2 are random vectors with each element uniformly distributed on [0, 1]. The choice of \omega, c_1 and c_2 controls the convergence of the evolution. If \omega is large, the particles have more inertia and tend to keep their own directions while exploring the search space, which gives a better chance of finding the true global optimum when the group is currently around a local optimum. If c_1 and c_2 are large, the particles are more social and move quickly to the best positions known by the group. In our experiment, N = 16, \omega = 0.3 and c_1 = c_2 = 1.

Like other search techniques, PSO is likely to converge to a local optimum if the initial particles are not chosen carefully. In our experiments we assume that the 3D skeleton does not change drastically within one time interval; therefore, at time t1 the search space formed by the particles is centered around the optimal geometric configuration found at time t0. That is, temporal consistency of postures is used to initialize the particles for the search. Examples showing images from 3 views and the corresponding posture estimates are shown in Fig. 7.
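A compact NumPy sketch of this PSO loop, using the paper's settings N = 16, omega = 0.3, c1 = c2 = 1, is given below. It is illustrative rather than the authors' code: the objective f (which would project a candidate arm configuration into each view and compare it with the observed ellipses) is passed in as a callable, and x0 is the previous frame's angle vector used to seed the particles, per the temporal-consistency initialization described above.

```python
import numpy as np

def pso(f, x0, n_particles=16, omega=0.3, c1=1.0, c2=1.0,
        spread=0.2, tol=1e-4, max_iter=100, seed=0):
    """Minimize f over angle vectors; particles are seeded around x0."""
    rng = np.random.default_rng(seed)
    x = x0 + spread * rng.standard_normal((n_particles, x0.size))
    v = np.zeros_like(x)                      # velocities start at 0
    pbest, pbest_f = x.copy(), np.array([f(p) for p in x])
    g_i = pbest_f.argmin()
    g, g_f = pbest[g_i].copy(), pbest_f[g_i]
    for _ in range(max_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = omega * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f                 # update personal bests
        pbest[better], pbest_f[better] = x[better], fx[better]
        if pbest_f.min() >= g_f - tol:        # stop: f(g) barely improved
            break
        g_i = pbest_f.argmin()
        g, g_f = pbest[g_i].copy(), pbest_f[g_i]
    return g
```

Calling pso with a reprojection-error objective over the eight arm angles (four arm parts, each with an elevation \theta_i and an azimuth \phi_i) would return the estimated configuration for the current frame.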

Figure 7: Experiment results for 3D skeleton reconstruction. Original images from 3 camera views and the corresponding skeletons are shown.

5. Conclusion and Future Work

There are two main motivations for our work on gesture analysis in a multi-camera network. One is to reduce image data to short descriptions through local processing in each camera, for efficient communication among cameras; the other is to exploit the consistency and distinctiveness of the subject by opportunistic fusion of information across space, time, and feature levels. We studied the use of a 3D human model to keep both geometric and appearance parameters of the subject. In some of our experiments the problem of PSO converging to a local minimum persists, especially when a sudden move leaves the initial search space relatively far from the new posture. Future work includes defining a more versatile model and using more information from local features of the cameras to better initialize the search space. The current method also assumes calibrated cameras; however, we found that fitting is sensitive to the accuracy of calibration. Since accurate camera calibration is not always practical in posture recognition applications, we are exploring solutions based on uncalibrated cameras.

References
[1] Andrew D. Wilson and Aaron F. Bobick, "Parametric hidden Markov models for gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp. 884-900, 1999.

[2] Yanxi Liu, Robert Collins, and Yanghai Tsin, "Gait sequence analysis using frieze patterns," in Proceedings of the 7th European Conference on Computer Vision (ECCV '02), May 2002.

[3] Y. Rui and P. Anandan, "Segmenting visual actions based on spatio-temporal motion patterns," in CVPR '00.


[4] Hedvig Sidenbladh, Michael J. Black, and Leonid Sigal, "Implicit probabilistic models of human motion for synthesis and tracking," in ECCV '02: Proceedings of the 7th European Conference on Computer Vision, Part I, London, UK, 2002, pp. 784-800, Springer-Verlag.

[5] J. Deutscher, A. Blake, and I.D. Reid, "Articulated body motion capture by annealed particle filtering," in CVPR '00, 2000, pp. II:126-133.

[6] Kong Man Cheung, Simon Baker, and Takeo Kanade, "Shape-from-silhouette across time, part II: Applications to human modeling and markerless motion tracking," International Journal of Computer Vision, vol. 63, no. 3, pp. 225-245, August 2005.

[7] Clément Ménier, Edmond Boyer, and Bruno Raffin, "3D skeleton-based body pose recovery," in Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization and Transmission, Chapel Hill, USA, June 2006.

[8] Ivana Mikic, Mohan Trivedi, Edward Hunter, and Pamela Cosman, "Human body model acquisition and tracking using voxel data," International Journal of Computer Vision, vol. 53, no. 3, pp. 199-223, 2003.

[9] H. Sidenbladh and M.J. Black, "Learning the statistics of people in images and video," International Journal of Computer Vision, vol. 54, no. 1-3, pp. 183-209, August 2003.

[10] Y. Weiss and E. Adelson, "Perceptually organized EM: A framework for motion segmentation that combines information about form and motion," Tech. Rep. 315, M.I.T. Media Lab, 1995.

[11] S. Ivecovic and E. Trucco, "Human body pose estimation with PSO," in IEEE Congress on Evolutionary Computation, 2006, pp. 1256-1263.

[12] C. Robertson and E. Trucco, "Human body posture via hierarchical evolutionary optimization," in BMVC '06, 2006, p. III:999.

