Chen Wu and Hamid Aghajan, Department of Electrical Engineering, Stanford University, USA
Abstract
In multi-camera networks, rich visual data is provided both spatially and temporally. In this paper a method of human posture estimation is described, incorporating the concept of an opportunistic fusion framework that aims to employ manifold sources of visual information across space, time, and feature levels. One motivation for the proposed method is to reduce raw visual data in a single camera to elliptical parameterized segments for efficient communication between cameras. A 3D human body model is employed as the convergence point of spatiotemporal and feature fusion. It maintains both the geometric parameters of the human posture and the adaptively learned appearance attributes, all of which are updated from the three dimensions of space, time, and features of the opportunistic fusion. At sufficient confidence levels, parameters of the 3D human body model are in turn fed back to aid subsequent in-node vision analysis. The color distribution registered in the model is used to initialize segmentation. Perceptually Organized Expectation Maximization (POEM) is then applied to refine color segments with observations from a single camera. The geometric configuration of the 3D skeleton is estimated by Particle Swarm Optimization (PSO).
1. Introduction

In a multi-camera network, access to multiple sources of visual data often allows for making more comprehensive interpretations of events and gestures. It also creates a pervasive sensing environment for applications where it is impractical for the users to wear sensors. Example applications include surveillance, smart home care, and gaming. In this paper we propose a method of human posture estimation using an opportunistic fusion framework to employ manifold sources of information obtained from the camera network in a principled way. The framework spans the three dimensions of space (different camera views), time (each camera collecting data over time), and feature levels (selecting and fusing different feature subsets). Our work aims for intelligent and efficient vision interpretations in a camera network. One underlying constraint of the network is its relatively low bandwidth. Therefore, for efficient collaboration between cameras, we expect concise descriptions instead of raw image data as outputs from local processing in a single camera. This process inevitably removes certain details in the images of a single camera, which requires the camera to have some intelligence about its observations (smart cameras), i.e., some knowledge of the subject. This derives one of the motivations for opportunistic data fusion between cameras, which compensates for partial observations in individual cameras. The output of opportunistic data fusion (a model of the subject) is thus fed back to local processing. On the other hand, outputs of local processing in single cameras enable opportunistic data fusion by contributing local descriptions from multiple views. It is this interactive loop that brings in the potential for achieving both efficient and adequate vision-based analysis in the camera network. An example of the communication model between five cameras to reconstruct the person's model is shown in Fig. 1, where the circled numbers represent the sequence of events.

[Figure 1: Communication model between five cameras. Camera 5 requests an update of its knowledge of the subject; the other cameras reply with vectors of descriptions from their local processing; the up-to-date knowledge of the subject is then used in local processing.]

In our approach, a 3D human body model embodies up-to-date information from both current and historical observations of all cameras in a concise way. It has the following components: 1. geometric configuration: body part lengths and angles; 2. color or texture of body parts; 3. motion of body parts. The three components are all updated from the three dimensions of space, time, and features of the opportunistic fusion. The 3D human model takes up two roles: one as an intermediate step for high-level, application-pertinent gesture interpretation; the other to create a feedback path from spatiotemporal and feature fusion operations to low-level vision processing in each camera. It is true that for a number of gestures a human body model may not be needed to interpret the gesture. There is existing work on hand gesture recognition [1] where only part of the body is analyzed. Some gestures can also be detected through spatiotemporal motion patterns of some body parts [2, 3]. However, as the set of gestures to differentiate expands, it becomes increasingly difficult to devise methods for gesture recognition based on only a few cues. A 3D human body model provides a unified interface for a variety of gesture interpretations. On the other hand, instead of being a passive output representing decisions from spatiotemporal and feature fusion, the 3D model implicitly enables more interactions between the three dimensions by being actively involved in vision analysis. For example, although predefined appearance attributes are generally not reliable, adaptively learned appearance attributes can be used to identify the person or body parts. Those attributes are usually more distinguishable than generic features such as edges.

Fitting human models to images or videos has been an interesting topic for which a variety of methods have been developed. Some reconstruct 3D representations of human models from a single camera's view [4, 5]. Due to the self-occlusive nature of the human body, which causes ambiguity from a single view, most of these methods rely on a restricted dynamic model of behaviors; tracking can then easily fail in case of sudden motions or other movements that differ much from the dynamic model. In 3D model reconstruction from multi-view cameras [6, 7], most methods start from silhouettes in the different cameras, from which the points occupied by the subject are estimated, and finally a 3D model with principal body parts is fit in the 3D space [8]. This approach relies heavily on the silhouettes obtained from each image, and it is also sensitive to the accuracy of camera calibration. However, in many situations background subtraction for silhouettes suffers in quality or is almost impossible due to cluttered background or camouflaged foreground. Another aspect of the human model fitting problem is the choice of image features. All human model fitting methods use some image features as targets to fit the model. Most are based on generic features such as silhouettes or edges [9, 7]. Some use skin color, but such methods are prone to failure in some situations, since lighting usually has a big influence on colors and skin color varies from person to person.

In this paper, we first introduce the opportunistic fusion framework as well as an implementation of its concepts through human gesture analysis in Section 2. In Section 3, image segmentation in a single camera is described in detail. The color distribution maintained in the model is used to initialize segmentation. Perceptually Organized Expectation Maximization (POEM) is then applied to refine color segments with observations from a single camera, followed by a watershed algorithm that assigns segment labels to all pixels based on spatial relationships. Finally, ellipse fitting is used to parameterize segments in order to create concise segment descriptions for communication. In Section 4, Particle Swarm Optimization (PSO) is used for 3D model fitting. Examples are shown to demonstrate the capability of the elliptical segments for posture estimation.
Figure 2: Spatiotemporal fusion framework for human gesture analysis (description layers and decision layers).

of the model. Parameters in $M_0$ specify a smaller space of possible $M_1$'s. Then decisions from spatial fusion of the cameras are used to update $M_0$ to obtain the new model $M_1$ (arrow 2 in Fig. 2). Therefore, every update of the model $M$ combines space (spatial collaboration between cameras), time (the previous model $M_0$), and feature levels (the choice of image features in local processing, from both new observations and subject-specific attributes in $M_0$). Finally, the new model $M_1$ is used for high-level gesture deductions in a given scenario (arrow 3 in Fig. 2).
be efficiently transmitted between the cameras. The output of the algorithm will be ellipses fitted to segments and the mean colors of the segments. As shown in the upper part of Fig. 3, local processing includes image segmentation for the subject and ellipse fitting to the extracted segments. We assume the subject is characterized by a distinct color distribution. The foreground area is obtained through background subtraction. Pixels with very high or low illumination are also removed, since for those pixels chrominance may not be reliable. Then a rough segmentation of the foreground is done either by K-means on the chrominance of the foreground pixels, or from the color distributions in the model. In the initialization stage, when the model has not yet been well established, or when we do not have high confidence in the model, we need to start from the image itself and use a method such as K-means to find the color distribution of the subject. However, when a model with a reliable color distribution is available, we can directly assign pixels to different segments based on the existing color distribution. In practice, the color distribution maintained by the model may not be uniformly accurate for all cameras due to effects such as color map changes or illumination differences. The subject's appearance may also change in a single camera due to movement or lighting conditions. Therefore, the color distribution of the model is only used for a rough segmentation that initializes the segmentation, and an EM algorithm is then used to refine the color distribution for the current image. The initial estimated color distribution plays an important role because it can prevent EM from being trapped in local minima. Suppose the color distribution is a mixture of $N$ Gaussian modes with parameters $\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$, where $\theta_l = \{\mu_l, \Sigma_l\}$ are the mean and covariance matrix of mode $l$. The mixing weights of the modes are $A = \{\alpha_1, \alpha_2, \ldots, \alpha_N\}$.
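As a sketch, the rough K-means initialization on foreground chrominance can be written as follows. The two-channel chrominance representation, the number of modes, and the iteration count are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

def kmeans_chrominance(chroma, k=4, iters=20, seed=0):
    """Plain K-means over foreground chrominance values (illustrative sketch).

    chroma: (M, 2) array, one chrominance pair per foreground pixel.
    Returns (labels, centers): a mode index per pixel and the mode centers.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen pixels.
    centers = chroma[rng.choice(len(chroma), size=k, replace=False)].astype(float)
    labels = np.zeros(len(chroma), dtype=int)
    for _ in range(iters):
        # Assign each pixel to its nearest center.
        d = np.linalg.norm(chroma[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; an emptied cluster keeps its old center.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = chroma[labels == j].mean(axis=0)
    return labels, centers
```

When a reliable color distribution is already stored in the model, this step would be skipped and pixels assigned directly from the stored modes.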
[Figure 3: Color segmentation and ellipse fitting in local processing (background subtraction, rough segmentation, EM refinement of color models, watershed segmentation, ellipse fitting), and maintenance of the 3D human body model (combining three views to get the 3D skeleton's geometric configuration, updating each test configuration with PSO until the stop criterion is met, then updating the model's color/texture and motion using the previous color distribution).]

Figure 4: Ellipse fitting. (a) Original image; (b) segments; (c) simple ellipse fitting to connected regions; (d) improved ellipse fitting.

We need to find the probability of each pixel $x_i$ belonging to a certain mode $l$: $\Pr(y_i = l \mid x_i)$. From standard EM for Gaussian Mixture Models (GMM) we have the E step as

$$\Pr^{(k+1)}(y_i = l \mid x_i) \propto \alpha_l^{(k)} \, p(x_i \mid \theta_l^{(k)}), \qquad l = 1, \ldots, N \qquad (4)$$

where $k$ is the iteration number, and the M step is obtained by maximizing the log-likelihood

$$L(x; \Theta) = \sum_{i=1}^{M} \log \sum_{l=1}^{N} \alpha_l \, p(x_i \mid \theta_l) \qquad (5)$$

which yields the updates

$$\mu_l^{(k+1)} = \frac{\sum_{i=1}^{M} x_i \Pr(y_i = l \mid x_i, \Theta^{(k)})}{\sum_{i=1}^{M} \Pr(y_i = l \mid x_i, \Theta^{(k)})}, \qquad \Sigma_l^{(k+1)} = \frac{\sum_{i=1}^{M} (x_i - \mu_l^{(k)})(x_i - \mu_l^{(k)})^T \Pr(y_i = l \mid x_i, \Theta^{(k)})}{\sum_{i=1}^{M} \Pr(y_i = l \mid x_i, \Theta^{(k)})}, \qquad \alpha_l^{(k+1)} = \frac{1}{M} \sum_{i=1}^{M} \Pr(y_i = l \mid x_i, \Theta^{(k)})$$
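The EM refinement of Eqs. (4)-(5) can be sketched in a few lines of NumPy. The small covariance regularizer and the fixed iteration count are our illustrative choices, not part of the paper's method.

```python
import numpy as np

def em_gmm(x, mu, cov, alpha, iters=10):
    """EM for a Gaussian mixture over pixel features (illustrative sketch).

    x: (M, d) pixel features; mu: (N, d) means; cov: (N, d, d) covariances;
    alpha: (N,) mixing weights. The initial estimates (in the paper, the
    model's stored color distribution) keep EM away from poor local optima.
    """
    M, d = x.shape
    N = len(alpha)
    for _ in range(iters):
        # E step (eq. 4): posterior responsibilities Pr(y_i = l | x_i).
        resp = np.empty((M, N))
        for l in range(N):
            diff = x - mu[l]
            inv = np.linalg.inv(cov[l])
            maha = np.einsum('ij,jk,ik->i', diff, inv, diff)
            norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov[l]))
            resp[:, l] = alpha[l] * np.exp(-0.5 * maha) / norm
        resp /= resp.sum(axis=1, keepdims=True)
        # M step (eq. 5): re-estimate means, covariances, mixing weights.
        for l in range(N):
            r = resp[:, l]
            mu[l] = r @ x / r.sum()
            diff = x - mu[l]
            cov[l] = (r[:, None] * diff).T @ diff / r.sum() + 1e-6 * np.eye(d)
        alpha = resp.mean(axis=0)
    return mu, cov, alpha
```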
However, this basic EM algorithm treats each pixel independently, without considering the fact that pixels belonging to the same mode are usually spatially close to each other. In [10] Perceptually Organized EM (POEM) is introduced. In POEM, the influence of neighbors is incorporated by a weighting measure

$$w(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{\sigma_1} - \frac{\|s(x_i) - s(x_j)\|^2}{\sigma_2}} \qquad (6)$$

where $s(x_i)$ is the spatial coordinate of $x_i$. Then the votes for $x_i$ from its neighborhood are given by

$$V_l(x_i) = \sum_{x_j} w(x_i, x_j) \Pr(y_j = l \mid x_j) \qquad (7)$$

Based on this voting scheme, the following modifications are made to the EM steps. In the E step, $\alpha_l^{(k)}$ is changed to $\alpha_l^{(k)}(x_i)$, which means that for every pixel $x_i$ the mixing weights of the modes are different, partially due to the influence of its neighbors. In the M step, the mixing weights are updated by

$$\alpha_l^{(k)}(x_i) = \frac{e^{\beta V_l(x_i)}}{\sum_{m=1}^{N} e^{\beta V_m(x_i)}} \qquad (8)$$

in which $\beta$ controls the softness of the neighbors' votes. If $\beta$ is as small as 0, the mixing weights are always uniform; if $\beta$ approaches infinity, the mixing weight of the mode with the largest vote will be 1.

After refinement of the color distribution with POEM, we set pixels that belong to a certain mode with high probability (e.g., larger than 99.9%) as markers for that mode. A watershed segmentation algorithm is then applied to assign labels to the undecided pixels. Finally, in order to obtain a concise parameterization of each segment, an ellipse is fitted to it. Note that a segment refers to a spatially connected region of the same mode; a single mode can therefore have several segments. When the segment is generally convex and has a shape similar to an ellipse, the fitted ellipse represents the segment well. However, when the segment's shape differs considerably from an ellipse, a direct fitting step may not be sufficient. To address such cases, we first test the similarity between the segment and an ellipse by fitting an ellipse to the segment and comparing their overlap. If the similarity is low, the segment is split into two segments, and this process is carried out recursively on every segment until all segments meet the similarity criterion. In Fig. 4, direct ellipse fitting to every segment yields Fig. 4(c), while the test-and-split procedure obtains the correct ellipses shown in Fig. 4(d). Experimental results are shown in Fig. 5.
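A minimal sketch of the POEM-style reweighting in Eqs. (6)-(8). For simplicity we let every pixel vote for every other pixel; a practical implementation would restrict the votes to a local neighborhood, and the specific vote formula here (affinity-weighted neighbor posteriors) is our reading of Eq. (7).

```python
import numpy as np

def poem_mixing_weights(colors, coords, resp, sigma1=1.0, sigma2=1.0, beta=1.0):
    """Per-pixel mixing weights from neighbor votes (eqs. 6-8, sketch).

    colors: (M, d) pixel colors; coords: (M, 2) pixel positions;
    resp: (M, N) current posteriors Pr(y_j = l | x_j).
    """
    # Pairwise affinity (eq. 6): large when close in BOTH color and space.
    dc = ((colors[:, None, :] - colors[None, :, :]) ** 2).sum(-1)
    ds = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    w = np.exp(-dc / sigma1 - ds / sigma2)
    # Votes for each mode (eq. 7): affinity-weighted sum of neighbor posteriors.
    votes = w @ resp
    # Softmax over votes (eq. 8); beta controls how hard the assignment is
    # (beta = 0 gives uniform weights, beta -> inf a one-hot assignment).
    e = np.exp(beta * (votes - votes.max(axis=1, keepdims=True)))
    return e / e.sum(axis=1, keepdims=True)
```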
Figure 6: 3D skeleton model fitting. (a) Top view of the experiment setting. (b) The 3D skeleton reconstructed from ellipses from multi-view cameras.

be calculated from some known projective correspondences between the 3D subject and points in the images, without knowing the exact locations of the cameras or the subject. PSO is suitable for posture estimation as an evolutionary optimization mechanism. It starts from a group of initial particles; during the evolution, the particles are directed toward good positions while keeping some randomness to explore the search space. Suppose there are $N$ particles (test configurations) $x_i$, each being a vector of the skeleton's angle parameters. The velocity of $x_i$ is denoted by $v_i$. Assume the best position of $x_i$ up to now is $\hat{x}_i$, and the global best position of all the $x_i$'s up to now is $\hat{g}$. The objective function is $f(\cdot)$, for which we wish to find the optimal position $x$ minimizing $f(x)$. The PSO algorithm is as follows:

1. Initialize $x_i$ and $v_i$. The value of $v_i$ is usually set to 0, and $\hat{x}_i = x_i$. Evaluate $f(x_i)$ and set $\hat{g} = \arg\min_{x_i} f(x_i)$.

2. While the stop criterion is not satisfied, do for every $x_i$:
$$v_i \leftarrow \omega v_i + c_1 r_1 (\hat{x}_i - x_i) + c_2 r_2 (\hat{g} - x_i)$$
$$x_i \leftarrow x_i + v_i$$
If $f(x_i) < f(\hat{x}_i)$, set $\hat{x}_i = x_i$; if $f(x_i) < f(\hat{g})$, set $\hat{g} = x_i$.

The stop criterion: after all $N$ particles $x_i$ have been updated once, if the improvement in $f(\hat{g})$ falls below a threshold, the algorithm exits. Here $\omega$ is the inertia coefficient, while $c_1$ and $c_2$ are the social coefficients; $r_1$ and $r_2$ are random vectors with each element uniformly distributed on $[0, 1]$. The choice of $\omega$, $c_1$, and $c_2$ controls the convergence of the evolution. If $\omega$ is big, the particles have more inertia and tend to keep their own directions to explore the search space, which allows more chance of finding the true global optimum when the group of particles is currently around a local optimum. If $c_1$ and $c_2$ are big, the particles are more social and move quickly to the best positions known
Figure 5: Experiment results of local processing. (a) Original images; (b) segments; (c) fitted ellipses.
and using more information from local features of the cameras to better initialize the search space. In the current method calibrated cameras are assumed. However, we found that the fitting is sensitive to the accuracy of calibration. Since accurate camera calibration is not always practical in applications of posture recognition, we are exploring solutions based on uncalibrated cameras.
References
[1] Andrew D. Wilson and Aaron F. Bobick, "Parametric hidden Markov models for gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp. 884–900, 1999.
[2] Yanxi Liu, Robert Collins, and Yanghai Tsin, "Gait sequence analysis using frieze patterns," in Proceedings of the 7th European Conference on Computer Vision (ECCV'02), May 2002.
[3] Y. Rui and P. Anandan, "Segmenting visual actions based on spatio-temporal motion patterns," in CVPR'00.
by the group. In our experiment, $N = 16$, $\omega = 0.3$, and $c_1 = c_2 = 1$. Like other search techniques, PSO is likely to converge to a local optimum unless the initial particles are chosen carefully. In the experiment we assume that the 3D skeleton does not go through a big change within a time interval; therefore, at time $t_1$ the search space formed by the particles is centered around the optimal solution of the geometric configuration at time $t_0$. That is, time consistency of postures is used to initialize the particles for searching. Some examples showing images from 3 views and the posture estimates are given in Fig. 7.

Figure 7: Experiment results for 3D skeleton reconstruction. Original images from 3 camera views and the reconstructed skeletons are shown.
[4] Hedvig Sidenbladh, Michael J. Black, and Leonid Sigal, "Implicit probabilistic models of human motion for synthesis and tracking," in ECCV '02: Proceedings of the 7th European Conference on Computer Vision, Part I, London, UK, 2002, pp. 784–800, Springer-Verlag.
[5] J. Deutscher, A. Blake, and I. D. Reid, "Articulated body motion capture by annealed particle filtering," in CVPR'00, 2000, pp. II: 126–133.
[6] Kong Man Cheung, Simon Baker, and Takeo Kanade, "Shape-from-silhouette across time, part II: Applications to human modeling and markerless motion tracking," International Journal of Computer Vision, vol. 63, no. 3, pp. 225–245, August 2005.
[7] Clément Ménier, Edmond Boyer, and Bruno Raffin, "3D skeleton-based body pose recovery," in Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization and Transmission, Chapel Hill, USA, June 2006.
[8] Ivana Mikic, Mohan Trivedi, Edward Hunter, and Pamela Cosman, "Human body model acquisition and tracking using voxel data," International Journal of Computer Vision, vol. 53, no. 3, pp. 199–223, 2003.
[9] H. Sidenbladh and M. J. Black, "Learning the statistics of people in images and video," IJCV, vol. 54, no. 1–3, pp. 183–209, August 2003.
[10] Y. Weiss and E. Adelson, "Perceptually organized EM: A framework for motion segmentation that combines information about form and motion," Tech. Rep. 315, M.I.T. Media Lab, 1995.
[11] S. Ivekovic and E. Trucco, "Human body pose estimation with PSO," in IEEE Congress on Evolutionary Computation, 2006, pp. 1256–1263.
[12] C. Robertson and E. Trucco, "Human body posture via hierarchical evolutionary optimization," in BMVC'06, 2006, p. III: 999.