Hand Detection and Tracking in an Active Vision System

Yuliang Zhu

A thesis submitted to the Faculty of Graudate Studies in partial fufillment of the requirements for the degree of

Master of Science

Graduate Program in Computer Science York University North York, Ontario June, 2003

Copyright by Yuliang Zhu 2003

Hand Detection and Tracking in an Active Vision System

Approved by Supervising Committee:

We use a number of methods to direct the visual attention of those with whom we interact. and display technologies progress. In most current teleconferencing or distance learning systems. facial expression and body language play important roles.Sc. gesture. in natural communication between people.Hand Detection and Tracking in an Active Vision System Yuliang Zhu. York University. the camera is fixed or is controlled by an operator. humancomputer interaction (HCI) has become more and more important in our daily lives. A potential avenue for natural interaction is the use of human gesture and gaze. 2003 Supervisor: Prof. One very common tool is ‘to point’ with a finger to items of interest. Motivated by the above ideas. limit the speed and naturalness of our interaction and may become a bottleneck in the effective usage of computers. as the computing. the existing HCI techniques. such as mice and keyboards. In fact. which detects and tracks a hand in a pointing gesture by using the CONDENSATION algorithm iv . John Tsotsos As the impact of modern computer systems on every day life increases. M. However. this thesis presents a visual hand tracker. One domain of application is video-conferencing. communication.

and the stereo cameras move actively. Due to the errors in calibration of the active stereo cameras. The average error in estimation of rotation in the vertical plane is less than 7 degrees. The tracker estimates the translation. the resolution in depth is about 10cm at a distance of 1 meter. By utilizing the parameters of the camera system. It achieves a best tracking accuracy of 12dB measured by signal noise ratio. rotation and scaling of the hand contour in the two image sequences captured from a pair of active cameras mounted on a robotic head.applied to image sequences. The background may be highly cluttered. the 3D orientation of the hand is calculated using the epipolar geometry. v .

NSERC and PRECARN for funding this project. This thesis is dedicated to my parents. Bill Kapralos. A huge thanks to my friends from the online outdoor club. Professor Minas Spetsakis and Professor Doug Crawford. Yuliang Zhu York University June 2003 vi . Special thanks to IRIS. Without their support and encouragement. Yueju Liu.Acknowledgments The author wishes to thank Professor John Tsotsos. who made my graduate school life much more enjoyable. etc. I would not be able to make this far. Andrei Rotenstein. Albert Rothenstein. Markus Latzel. Professor Richard Wildes. Marc Pomplun. Erich Leung. who gave me helpful advice. who directed and supported all the research work on my thesis. Kunhao Zhou. Jack Gryns. I really appreciate the supports from all the lab members such as Kosta Derpanis. Yongjian Ye.

. . . . . . . . . . . . . . . . . . . . . Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . Extended Kalman Filter . . Thesis Outline . . . . . . . . . . . . . . . . . . . iv vi x xiv 1 1 4 4 7 8 . . . . . . . . . . 8 10 11 13 14 16 17 19 20 Chapter 2 Review of Related Work 2. . . . . . . . . . Mean Shift . . . . . . . . . . . .6 Detecting Motion with an Active Camera Skin Blob Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . vii . . . . . . . . . . . . . . . . .1 2. . . . . . . . . . . . . Goals . . . . . . . . . . . . . . .Contents Abstract Acknowledgments List of Figures List of Tables Chapter 1 Introduction 1. . . . .7 CONDENSATION Algorithm . . . . . . . . . . . . . . . . . Optical Flow . . . . . . . . . .3 2. . . . . . . . 2. . . . . . . . . . . . .3 1. . . . .4 Motivation . . . . . . . . . . . .5 2.2 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6. . . . . .1 2. . . . . . . . . . . . . . Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2. .2 Standard Kalman Filter . . . . .6. . . . . . . .4 2. . . . . . .1 1. Active Contour . . . . . . . . . . . . . . . . . . . . . . .2 2. . . . . . . .

1 3. . . . . . . . . . . . . . . . . . . . . . .2 4. .7. . . .7. . . . . . . . .3 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Factored Sampling . . . .4 3. . .2 3. . . . . . . viii . . . . . . . . . . . . . . .8 Refining the Result . . . . . . . . . . . . . .3 Accuracy of Tracking . .2 2. . . . . . . . . . . . . .3 3. . . . . . . . . .8 Probability distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4. . .5 2. . . . . . . . . . . . . . . . . . . . . . . . . . .7. . . . . Correspondence .7. . . .1 Performance of Tracker with Low Cluttered Background . . . . .1 4. . . . . . . . . . . . . 3D Orientation of the Hand . . . 3D Orientation . . . . .4 2. . .7. . . . . . . . . . . . . . . . State Space . . . . . . .7. . . . . . . . . . . . . System Architecture and Implementation . .1 3. . . . . . . . . . Measurement Model . Summary . . . . . . . . . . . . . . . . . . . . . . . Epipolar geometry . . . . . . . . . . . . . . . . . . . . . .2. . . . . . Chapter 3 CONDENSATION Hand Tracker 3. Propagation of state density .7. 21 22 22 23 25 26 29 31 31 39 40 43 52 58 58 58 62 65 68 70 70 74 76 76 Stochastic Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dynamic Model . . . . . . .4 3. Hand Detection . .7 Initialization . . . . . . . . . . Shape Representation . . . .7.1 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 2. . . . . . . Experimental Results of Tracking on Real Images . . . . . . .2 3. . . . . . . . . . Measurement Model . . . . . . .7. . . . . . . . . . . . Chapter 4 Experiments and Discussion 4. . . . . . . . . . . . .6 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3. . . . . . . . . . . . . .5 3. . . . . .3. . Computational Complexity . . . . . . .

. . . . . . . . . . . . . . . . 81 85 89 99 101 Experiments on 3D orientation Summary . .2 4. . . . . . Chapter 5 Discussion and Future Work 5. . . . . . . . .4 4. . . . . . . . . . . . . . . . Performance of Tracker with Highly Cluttered Background . . . . .4. . . . . . . . .3 4. .3. . . . . . . . 103 ix . . . . . .1 Future Work . . . . . . . . . .5 Performance of Tracker with Lightly Cluttered Background . .3. . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . The process of CONDENSATION algorithm . . . . . . . . . . .3 3. . .2 1. . . . . . . . . . . . . . .List of Figures 1.1 1. . . . . . . . Detection of the skin color edge . . .10 State space parameters . . . . . . . . . . . Representation of the hand contour after initialization: the dots represent the control points of the curve. . . .3 2. . . .11 The distribution of hypotheses in translation (the points on the top and left indicate are samples on the distribution of translation on x. .4 3. . . . . . . . . . . . Substraction of the two frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . Skin color distribution in HS space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . evolved from previous iteration). . . . . . . . . . Skin color in RGB space . . . . Image filtered by skin color model .9 The binocular head (TRISH-2) in GestureCAM . . . . . . . . . . . . . . . . .6 3. . . . . . . . y axis. .1 3. . . . . . . . . . . . . . .5 3. 40 42 3. . . . . 3. . . . . . . . . . . . . . . . . . . . . . . .2 3. . . . . . . . . . . . . .1 3. . . . The degrees of freedom of the camera system . . . . . . . . the dashed curve is constructed from the points . . 46 x . . . . . . . Raw image taken from camera in RGB . . . . 5 6 6 27 33 34 35 35 37 37 38 38 System diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7 3. . Skin color distribution in normalized RG space . Skin color in HSV space .8 3. . .

. .18 The normals along the contour for measurement of the features . . . . . . . . . . . . . .) 3. . . . . . . . .when the cameras fixate on the object .20 Epipolar geometry of the camera system . . . . . .12 The distribution of hypotheses on rotation (points on the bottom are samples evolved from previous iteration). . . . . xi . . . . The points on the bottom are samples on distribution of parameter rotation. . . . . . . . . . . . . . . . . . evolved from previous iteration. . the arrows shows the direction from interior to exterior of the hand shape . . . . .14 The distribution of the state in translation. . . . . . . 3. . . 3. . . . . . 53 55 51 52 49 48 47 3. 3. y axis. . .15 Distance from the object to each of the cameras can be maintained roughly the equal. . The dashed line with two arrows measures the nearest feature point to the hypothetical contour. . .17 The normals along the hand contour. 3. . . . . . . . . . . . . . .3. . . . .16 The normals along the hand contour . . . when there are no changes in other parameters. . The solid line with one arrow shows the measurement taken from interior to exterior portion of the contour. . . . . . . 3. . . .19 Measurement line along the hypothetical contour. . The points on the top and left indicate are samples on the distribution of translation on x. . . . . . . .13 The distribution of the scaling (points on the right are samples evolved from previous iteration). . . . rotation and scaling.21 Finding correspondence along the epipolar line . . . . . . . . . . . . when there are no changes in other parameters. . . . . . . . . . . . . . . 56 60 63 3. The points on the right are samples on distribution of parameter scaling. . . . The shaded part illustrates the real hand region. . . 3. . . while the black curve indicates the hypothetical contour which is measured.

. . . . . . . 3. .23 Transformations to the head coordinate system . . . . . . . . . .13 Frame 60 . . . . . . . . . . . . .74% of the pixels are skin color). . . . . . . . . 4. . . 64 66 67 69 73 75 77 77 78 78 79 79 80 81 82 82 83 83 4. . . . . . .25 Tracker Diagram . 3. . . . . . .6 4. . xii .5 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .24 3D orientation of the hand . . 4. . Frame 45 . . . . .12 Frame 50 . . . . . .10 Frame 30 .8 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .35% of the pixels are skin color). . . . Frame 15 . number of samples: the error bars are the standard deviation of the experimental results . . . . . 4. . . . . . . . . Frame 25 . . .3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4. the thick green curve shows the one with highly cluttered background (9. 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . number of samples: the solid curve shows the accuracy of tracking hand with no cluttered background (0. . . . . . . . . . . . . . . . .03% of the pixels are skin color). . .11 Frame 40 . . the red curve shows the one with light cluttered background (3. .22 View overlapping when cameras verge . . . . . . . . . . . . . . . . . . . . . . . .2 Computational complexity vs. 4. . . . . . . . . .1 Accuracy vs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9 Frame 5 . . . . . . . . . . . the error bars are the standard deviation in the results in the experiments.14 Frame 70 . . . . . . . . . . 4. . . . . . . . . . . . . . . . . . . . . Frame 65 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Frame 35 . . . . . . . . . . . . . . . . . . . . . . . . . . .7 4. . . . . Frame 55 . . . . . 3. . . . . .3 4. . . . . . . . . .

. . . 4. . . . . . Measurements are taken at   860mm (red). . . . . . . .16 Frame 90 . . . . . . . . . . . . . . . . and -45 (blue). estimated distance . . 4. . . . . . . 4. . . . . . . . .5 (yellow) and 45 (pink). . . . . . . . . . . . . . .23 Frame 80 . . . 4. 4. . . . . . . . . 4. . . . . . . . . . . . . 106mm (black) and 125mm (green). . . .5 (yellow) and 45     (pink). . . . . . . . . . . . .15 Frame 80 . . . . .21 Frame 60 . . . . . . . 4. . . . . . . . . . . . . .24 System setup for experiment on 3D orientation . 4. . . . . . . . . . . . . . . . . . Measurements are taken at 860mm (red). . . . . . . . . . . . . . . . . . . . . . .28. . . . .19 Frame 40 . . . . . . . . . 4. . . . . 4. . . . .5 (cyan). . . . . . 22. . . . . 106mm (black) and 125mm (green). . . . . . . . . . . . . . . . . . .28 Arm orientation projected in xz plane. . 22. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . -22. . . . . . . . . . . . . . .5 (cyan). . . 4. . The vertex in each color 97 represents the position of the elbow in each experiment. . . . . . . . . . .20 Frame 50 . .         98 xiii . . . . . . . . . . . . . . . . . . . .17 Frame 20 . . . . . . . . . . . . . . . . . . . . . . . . . .27 The orientation of the arm vertical . . .4. . . . . . . . .22 Frame 70 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .26 Orientation in xz plane . . . . . . . . . . . . . -22. . and -45   (blue). . . . . . . . . . . 4. . . . . . . .18 Frame 30 . . . . . . . . . . . . . 84 84 85 86 86 87 87 88 88 90 92 94 96 4. . . . . . . . . . . . . . . 4.29 A 3D view of the experimental results on tracking showed in Figure 4. . . . . . .25 Real distances vs.

. . .List of Tables 4. . . . . . 72 74 xiv . . . . . . . . .2 Experimental result on accuracy . . . . . . . . . . . . . . . . . . Experimental result on complexity . . . . .1 4. .

mice and joysticks. However as the complexity of the applications increases some requirements that the conventional interfaces cannot satisfy are emerging. computer supported cooperative work (CSCW) applications have made rather significant progress during the past decades. [12]. sign language recognition (Starner and Pentland [46]. [47]) and emotion recognition (Cowie et al. [20]. Starner et al. [36]). The primary means of input for Human Computer Interaction (HCI) are keyboards. almost everyone can enjoy such new technology almost everywhere. One of the important issues in CSCW is the human-computer interface. Moreover 1 . Gesture recognition (Gutta et al. Importing natural means of human communication and interaction into HCI is an approach to design easier to use.Chapter 1 Introduction 1. With the development of the Internet. Rosenblum et al. [40]) are examples of many research areas that can be utilized in communicating with the computer in a structured manner. A trend in HCI enhancement is importing human body based communication and interaction methods into the interfaces.1 Motivation In order to satisfy the increasing need to permit groups of people to communicate quickly and efficiently over distance. more effective input methods. Pavlovic et al.

[36].g. [38]. Hand pose data is analyzed for mainly two purposes: communication which interprets hand shape and motion as gestures such as Pavlovic et al. which are collectively called virtual reality in Benford et al. There are two types of constraints applied on the input: 1) Background constraints. this technique has some limitations and is too expensive for most applications. and manipulation which interprets hand shape and motion as a manipulation tool such as Rauterberg et al. One has to apply segmentation to locate the hand and/or fingers. in Rehg and Kanade [39]. However. These techniques require a precise estimate of the human body pose. Another important issue related to the design of the system is the set of constraints applied on the input of the system. which corresponds to estimating the pose of the hand using one or more image sequences that captures the hand shape and motion in real time . and animation of a copy of the human body at different scales on the display devices. This requires measurement of a set of parameters related to hand kinematics and dynamics to make decisions about the interaction between the hand and virtual objects. Electro-Mechanical sensing using gloves provided a first solution to the hand pose estimation and tracking problem. e. or feature extraction to detect fingertip and/or joint locations in the input images. are emerging. Background constraints refer to the constraints on the environment in which the hand will be tracked. Usually a 2 . A more recent solution used non-contact computer vision techniques.with the advancement in processing speed and display technology more sophisticated interaction methods like immersive virtual reality and telepresence. Most of the human hand based HCI techniques require the estimation and tracking of hand pose over time. [3]. 2) Foreground constraints.

. [35]. and the lecturer should be able to see who may have a question and select one to respond. [31].g. the lecturer could walk around on the platform. Meanwhile all the others know who is asking what question. which will make hand localization much easier. Normally. in Oka et al. Dorner [16] and Dorfmuller-Ulhaas and Schmalstieg [15] use gloves with markers. In applications such as video-conferencing. in a distance learning system. These constraints make such systems less flexible and robust to the applied environment. In a common class room. even today’s highend teleconferencing systems only provide remote manual camera control or audio based adjusting. Furthermore. the audience follows the speaker’s body and what is being pointed out. people in the audience put up their hand when they have a question. however. pan and tilt settings of the camera or cameras regardless of the action of the speaker and audience. To increase the robustness of feature extraction. Currently. The most advanced applications of CSCW are trying to make the communication or collaboration more realistic and intelligent. distance learning and telemedicine systems. another choice is a uniform background. Without advanced features such as active tracking or zooming based on changing circumstances. e. no system is known to actively respond to visual cues for 3 .. e. Foreground constraints refer to the hand itself. it has been found very useful that the participants can share pointing and other gestures over shared documents or objects. current video subsystems within the above systems simply capture a scene from fixed zoom. For example. the collaborative environment designed by Leung et al. Some of them can be adjusted by an operator or participant.static background where no other object is moving is used. the function is very limited and far from intelligent.g. point out some words on the blackboard and so on.

simple background subtraction algorithms for detecting a moving object are useless while tracking. it can compute the 3D direction of the pointing finger. GestureCAM was designed to address this challenge.3 Contributions This implementation of a hand tracker works in a highly cluttered visual environment (for example. The hand tracker extracts the translation.attention. i. The dynamic model of the object maybe non-linear. and the model becomes even more complicated. furniture). especially when camera motion is considered. a lab with lots of books.2 Goals GestureCAM is an active stereovision system that is able to detect and track faces and hands. 1. on a highly cluttered background.. the distribution of the states could be affected by noise. 1. Additionally. where the hand is pointing. and interpret gestures in a real world environment. i. With the help of depth information.e.. The 3D orientation of the hand is based 4 . It acts like an active observer or like a virtual cameraman. tracking a hand in a pointing gesture is not easy because the background may be highly cluttered and the cameras are always active (under the control of a tracking program). the states of the moving object are ambiguous and multi-modal. tracking a hand in a pointing gesture. In an active stereovision system like GestureCAM. Therefore. devices.e. rotation and scaling parameters.

1 and 1. There are 4 mechanical degrees of freedom: head pan. Each color camera mounted on the robot head can be controlled independently.2 ). and guides the camera to that direction. The system consists of a robotically controlled binocular head called TRISH-2 (Figure 1. respectively. The client computer. connected to a Dual Pentium II PC platform. Two sets of video are captured by Imaging Technologies S-Video frame grabber. where the application is running. Figure 1. with a resolution of 512x480 pixels and color depth of 24 bits. The cameras can be used independently or as a stereo pair.on the tracking result. The server computer directly connects with the motors and cameras (control part) of the robotic head through motor control cards and serial port.1: The binocular head (TRISH-2) in GestureCAM 5 . so that the head is controlled by the application ( as in figure 1. which acts as its server. independent eye vergence. has two video inputs from the cameras and can send TCP/IP packets to the server to set and get the parameters of the motors and cameras. and head tilt.3 ).

3: System diagram 6 .2: The degrees of freedom of the camera system Stereo Video Network (TCP/IP) Motor & Camera Control Setting and getting parameters of Camera & Motors Figure 1.Figure 1.

4 Thesis Outline The following chapters will provide more detailed descriptions of the hand tracker: Chapter 2 provides a review of related works in detection and tracking. Analysis         of algorithms such as skin color blob. dynamic model. Chapter 4 presents the experimental result of the tracker working in different conditions. Chapter 5 provides the conclusions and future research work. hand shape representation. 7 .2. it gives the reasons why the CONDENSATION algorithm is chosen so that the tracker can achieve the goal introduced in section 1.1. occlusion. In the summary. measurement model and calculation of 3D orientation. such as different lighting. clutter. Kalman filter. Chapter 3 presents detailed models and algorithm descriptions of the CONDENSATION tracker. More detailed experiments on the depth and orientation calculations are shown last. including hand detection. active contour. conditional density propagation and so on are presented. mean shift.

2. The value observed for each pixel in a new frame is compared to the current 8 . gradient of intensity in each image. Building probabilistic models to describe the likely motion and appearance of an object of interest is a promising approach. and the information over multiple consecutive images can be helpful to track individual objects or to perform a more general motion segmentation.1 Detecting Motion with an Active Camera When the background is uniform or does not change. the whole process of the motion of the object can be done in a similar way. edge. The problem is that if the object stops moving for a while. background pixel values are modelled as multi-dimensional Gaussian distributions in HSV color space. detection of the moving object can be easily done by subtracting two frames. Especially. There are large numbers of applications which apply different techniques to track different targets in different conditions. The features such as color. In the system presented by Francois and Medioni [18]. More details of some of these techniques are given in the following sections. the tracking strategy may lose it.Chapter 2 Review of Related Work Tracking has been studied extensively in the computer vision literature. when the initial background is remembered.

a Gaussian shape prior is chosen to specifically develop a near real-time tracker for vehicle tracking in aerial videos. At the intermediate level. Principal Component Analysis (PCA) was used to analyze the training set and detect the shape in the tracking. such as autonomous vehicle navigation. A similar framework was proposed by Tao et al. The pixels on the moving object in the image then are grouped into connected components. In order to apply this technique to real-time tasks. The distribution is updated using the latest observation. In Philomin et al. a sequence of ‘focal probes’ examines the motion in different parts of the region. a shape model and CONDENSATION algorithm was employed to track pedestrians from a moving vehicle. A class of training shapes were represented by a Point Distribution Model. motion. [8] applied a dynamic motion analysis technique which was based on optical flow. it needs special hardware to compute the motion. In this work. segmentation and shape using the expectation maximization algorithm over time. a global motion model is built and updated continuously. Burt et al. the whole background changes from frame to frame. because both the object (foreground) and the background change together. There is no reference frame for eliminating the background pixels. Then. The algorithm just described cannot work. Finally. At the early level of the analysis. For a moving camera. [37]. which are used for the constructing the background distribution. flow vectors are computed between frames. The assumption is that the object could not appear in the first frames. [49].corresponding distribution. at the highest level. which implemented a dynamic motion layer tracker by modelling and estimating the layer representation of appearance. If the shape of the 9 .

YUV (Kampmann [29]) and so on. It achieved a accuracy about 80% in detecting head contour using a ellipse model. 2. [54]). and related them to the contour models of different head appearances. Sigal et al. The computations of the skin color detection module are based on HSV color values transferred from the 24-bit RGB video signals. A radial scan line detection algorithm was developed for real-time tracking. [21. such as RGB (Jones and Rehg [26]). HSV (Herpers et al. When the object moves in a wide area. It scans outward from the center of a region of 10 .object varies significantly. the tracker utilized parameters of the camera. This allowed them to adaptively select the model to deal with the variations in the head appearance due to the human activities. In Herpers et al. normalized RGB (Yang et al. Sandeep and Rajagopalan [41]. a large training contour should be used leading to increased computation in the tracking process. [53]. In Yachi et al. [42]). Almost all of them are transformed from raw RGB space to obtain robustness against changes in light conditions. the skin color blobs are detected by a method using scan lines and a Bayesian algorithm. 22]. Different color spaces were applied in tracking and detecting face or hand systems. 22].2 Skin Blob Tracking Skin color of the hand and face has been used as a good feature for tracking for a long time. its appearance changes significantly with respect to a relatively fixed camera. [21. Jones and Rehg [26] defined a generic color distribution model in RGB space for skin and non-skin classes by using sets of photos on the web.

The algorithm was implemented in normalized RGB color space. β(s). The insertion of a new scan line is iterated whenever the distance between two neighboring scan points is above the threshold. Then connected blobs were merged while too small blobs are eliminated. 11 . jumps. 2.interest along concentric circles with a particular step width. the snake is initialized near the object of interest and attracted toward the contour of the object by forces depending on the intensity gradient. The energy function of a contour c = c(s) is given by: ε= (α(s)Econt + β(s)Ecurv + γ(s)Eimage )ds. γ(s) control the rel- ative influence of the corresponding energy term. an adaptation technique estimated the new parameters for the mean and covariance of the multivariate Gaussian skin color distribution by using a linear combination of previous parameters. Econt . where α(s).3 Active Contour Active contours or “Snakes” have been used in deformable contour tracking and segmenting rigid or non-rigid objects. Ecurv and Eimage are the energy terms that encourage continuity. If the arc between two radial scan points is higher than a particular threshold a new radial scan line positioned between them and with intermediate orientation is introduced. smoothness and edge influence. It can track a person’s face in 30 frames/second while the person walks. Usually. image energy and external energy. In the system of Yang et al. sits and rises. The tracking can be done by the minimization of the Snake energies such as internal energy. respectively. [54].

The boundary between the two domains is a curve. the descriptor is proportional to the difference of statistical fits to objects and background. which gives rough information on the direction and magnitude of the moving objects by a correlation process between two images.. that is. the pixel in the previous frame is projected to the new position in the current frame in order to compensate for the camera motion. Based on such a model. The minimization of the total energy gives the boundary. Jehan-Besson et al. 12 . In Kim and Lee [30]. It is hard to distinguish the motion of the object of interest when it moves in a similar speed as the camera. The success of tracking is largely based on the calculation of the image flow. the situation with moving cameras. it could become complicated in active vision.. The image domain is made up of two parts: the foreground part.e. containing the objects to segment and the background. the contour of the object. The camera motion is modelled by 6 parameters in rotation and translation. to make the snake “jump” to the new location.e. Normally.The classic snake algorithm will not operate well if there are large differences in the position or form of the object between successive images. the contour of the object. [25] proposed a general framework for region-based active contours to segment moving objects with a moving camera. i. the tracker utilizes the image flow. The snake may fall into local minima while moving to new positions. The whole image is moving including both the foreground and background. Unfortunately. For each of the parts there is a descriptor of the energy. i.

The tracker moves and resizes the search window until its center converges with the center of mass. e. which can give the orientation of the object.g.2. In Comaniciu and Ramesh [10].. Without describing the moving objects by states and a mechanism of prediction and correction. Their statistical distributions characterize the object of interest. is required instead of a box surrounding the object. [11] the spatial gradient of the statistical measurement is exploited. The Mean Shift algorithm depends on the lower level feature detections. texture and gradient.4 Mean Shift The mean shift algorithm is a simple iterative procedure that shifts each data point to the average of data points in its neighborhood. a combination of Mean Shift and Kalman filter does a better job. a modified Mean Shift algorithm named Continuously Adaptive MeanShift is applied. a contour or shape. Highly cluttered background may distract the tracker from the object. In Bradski [7]. The data could be visual features of the object such as color. in Comaniciu et al. The probability is created via a histogram model of the skin color or other specific colors. which finds the center and size of the color object on a color probability image of the frame. the tracker cannot distinguish occluded objects. In my case. 13 .

y.5 Optical Flow Optical flow has long been used as a way both to estimate dense motion fields over the entire visible region of an image sequence (Beauchemin and Barron [2]). I(x. After expanding the intensity function in a Taylor series and ignoring the higher order terms. it is the temporal image gradient). y + dy. [28]). t). t + dt) = I(x. y + dy. y) and time t. The term It is the rate of change of the grey level image function with respect to time for a given image point (i. t) then I(x + dx.2.1) is known as the optical flow constraint equation (also called the image brightness constancy equation) and may be written as −It = ∂I I = ( ∂x . we obtain I(x + dx. ∂t and v = (u. t). − where u = dx dt (2.e. y + dy. t + dt) is really a translation of the brightness value at (x. ∂I ∂I ∂I dx + dy + dt = 0 ∂x ∂y ∂t i. t + dt) = I(x. The equation (2. the image sequence is modelled by an intensity function. So it must follow that. In order to explore how the optical flow may be estimated.. which varies continuously with position (x. y. I. y.. v) is the optical flow vector with components (u. dt They are the x and y components of the optical flow. where It = ∂I . y. and It may all be measured from the images in an image 14 . t) + ∂I ∂I ∂I dx + dy + dt ∂x ∂y ∂t If the brightness value at (x + dx. Iy = ∂I ) ∂y I · v . The spatial and temporal gradients. v).e.1) ∂I ∂I ∂I = u+ v ∂t ∂x ∂y and v = dy . and to segment areas of consistent flow into discrete objects (Kalafatic et al.

However. 44]. v⊥. assuming change in motion is smooth over an image region. This is the problem that arises when using only the local spatial and temporal derivatives to estimate the optical flow. as there is only one equation and two unknowns. This is referred to as the aperture problem and may be understood by considering the edge of an object moving below a small aperture. of the edge has a component along the edge. The velocity. Only the component of optical flow in the direction of the maximum spatial derivative. Smith [43. The equation implies that the time rate of change of the intensity of a point in the image is the spatial rate of change in the intensity multiplied by the velocity with which the brightness point is moving in the image plane. v⊥. they may cause large global motion. it is not possible to estimate both components of the optical flow from the local spatial and temporal derivatives. v. these assumptions may not be satisfied in an active vision system with moving cameras.sequence for a particular pixel.2) In order to solve the optical flow constraint equation it is necessary to either apply regularization. and a component perpendicular to the edge. Smith and Brady [45] built a system based on feature-based image 15 . But only the component of the velocity in the direction perpendicular to the edge (parallel to the spatial gradient) can be observed and estimated. from the fundamental flow constraint equation. or parameterize the motion in an entire region using a low-dimensional model. v . (the two components of optical flow). However. may be estimated and. When the cameras follow the object. for example an affine model. it can be shown to be given by: v⊥ = −It I I 2 (2.

6 Kalman Filter The behavior of a dynamic system can be described by the evolution of a set of variables. In practice. we usually find that the measurements that we make are functions of the state variables and that these measurements are corrupted by random noise. avoid local minima.3) 16 . The tracker was implemented on a set of PowerPC based image processing system. The clusters of flow vectors which are spatially and temporally significant provide the object motion information. instead. wt ) (2. a dynamic system (in discrete-time form) can be described by xt = f (xt−1 . often called state variables. 2. The system itself may also be subjected to random disturbances. ut . If we denote the state vector by xt . Skin colour was used to restrict the region of support to image data that arises from the hand. Based on the analysis of idealized gesture movements Derpanis [14] modelled the optical flow parametrically as an affine transformation. Using robust hierarchical motion estimator to capture the unknown parameters. it can handle motion larger than one pixel. It is then required to estimate the state variables from the noisy observations.motion estimation. 2D features such as corners and edges are extracted to compute the optical flow. It tracks vehicles in a video take from a moving platform. which makes the real-time performance possible. the measurement vector by zt and an optional control input by ut . the individual state variables of a dynamic system cannot be determined exactly by direct measurements.

where the random variables wt and vt represent the process and measurement noise respectively. that is E[wt vt ] = 0. A priori and a posteriori estimate errors are defined as e− = xt − x− . lim Kt = 0. − R→ 0 Pt →0 17 .e. In general. vt ). The difference (zt − H x− ) is called the measurement innovation. Specifically. ˆ The a priori estimate error covariance is then Pt = E[e− e−T ] and the a posteriori t t estimate error covariance is Pt = E[et eT ]. In practice..p(v) ∼ N (0. i. One of ˆt the popular forms of K is given by Kt = Pt− H T (HPt− H T + R)−1 . where K is the gain ˆt ˆ ˆt ˆt t or lending factor matrix that minimizes the a posteriori error covariance Pt . the system noise covariance Q and measurement noise covariance R matrixes are usually determined on the basis of experience and intuition. They are assumed to be independent. if the process is linear. The a posteriori error e− can be calculated t t by the function e− = K(zt − H x− ).with a measurement that is zt = h(xt . Gaussian distributed: p(w) ∼ N (0. ˆt ˆ t where x− is the a priori state estimate given the knowledge of the process prior to ˆt time t. We assume then there is no correlation between T the noise process of the system and that of the observation. xt = x− + K(zt − H x− ). R). lim Kt = H −1 . and xt is the a posteriori state estimate at time t. Q). and the measurement is given as zt = Hxt + vt . white.6. and et = xt − xt . 2. these noise levels are determined independently. given measurement zt .1 Standard Kalman Filter In Welch and Bishop [51]. or the residual. it can be described as: xt = Axt + But + wt−1 .

then the residual contains considerable information about errors in the state estimate and strong correction should be made to the state estimate. In their experiments. xt = x− + Kt (zt − H x− ) and Pt = (I − Kt H)Pt− . Thus. and measurement update equations (corrector) Kt = Pt− H T (HPt− H T + ˆt R)−1 . a finger and lip tracking system is developed by Blake and Isard [5] to estimate coefficients in a B spline. and minimum error variance algorithm to optimally estimate the unknown state of a linear dynamic system from noisy data taken at discrete real-time intervals. the gain matrix is “proportional” to the uncertainty in the estimate and “inversely proportional” to that in the measurement. Based on Kalman filtering. If the measurement is very uncertain and the state estimate is relatively precise. On the other hand. then the residual is dominated mainly by the measurement noise and little change in the state estimate should be made. The background clutter affects the tracking result significantly. A Kalman filter was applied in the system of Martin et al. [33]. ˆ ˆt The Kalman filter gives a linear. where hand shape and position are tracked 18 . These measurements are used as the next input to the Kalman filter. The equations for the Kalman filter fall into two groups: time update equations (predictor) x− = Aˆt−1 + But and Pt− = ˆt x APt−1 AT +Q. if the uncertainty in the measurement is small and that in the state estimate is big. Measurements are made to find the minimum distance to move the spline so that it lies on a maximal gradient portion of the image. unbiased. In order to be robust to clutter the parameters of the motion model are trained from examples.The Kalman filter estimates state of a discrete-time controlled process by using a form of recursive solution: the filter estimates the process state at some time and then obtains feedback in the form of measurements. almost all the motions are oscillatory rigid motion.

with the cue of skin color and basic hand geometrical features. The resulting system provides robust and precise tracking which operates continuously at approximately 5 frames/second on a 150 megahertz Silicon Graphics Indy.

2.6.2

Extended Kalman Filter

If the process function (2.3) is not linear or a linear relationship between x and z cannot be written down, the so-called Extended Kalman Filter (EKF for abbreviation) can be applied (Azoz et al. [1], Dellaert and Thorpe [13]). The EKF approach is to apply the standard Kalman filter (for linear systems) to nonlinear systems with additive white noise by continually updating a linearization around the previous state estimate, starting with an initial guess. In other words, we only consider a linear Taylor series approximation of the system function at the previous state estimate and that of the observation function at the corresponding predicted position. This approach gives a simple and efficient algorithm to handle a nonlinear model. However, convergence to a reasonable estimate may not be obtained if the initial guess is poor or if the disturbances are so large that the linearization is inadequate to describe the system. To estimate the state of a non-linear process, the Extended Kalman Filter (EKF) can be used to give out an approximation to optimal non-linear estimation. It has a fundamental flaw that the distributions of the random variables are no longer normal after undergoing non-linear transformation. So large errors maybe introduced into the posterior mean and covariance of the transformed Gaussian.

19

UKF, Unscented Kalman Filter, which was used in Stenger et al. [48], uses the unscented transformation algorithm proposed by Julier and Uhlmann [27] to approximate a Gaussian random variable, which is accurate to at least second order of the distribution. The tracker estimated the pose of a hand in 3D model in front of a dark background at a frame rate of 3 frames/second. The uni-modal Gaussian distribution assumption in Kalman filters, including EKF and UKF, maybe a great disadvantage in some tracking problem, for example, multimodal object tracking. The computational complexity of a Kalman filter increases sharply, when the number of tracked objects increases. In active vision systems, motion of both object and camera makes the distribution of the state more complicated and unpredictable.

2.7

CONDENSATION Algorithm

The Conditional Density Propagation algorithm presented in Isard [23], Isard and Blake [24], is a Bayesian filtering method that uses a factor sampling based density representation. It samples and propagates the posterior density of the states over time. There is no assumption on the state probability density function. In other words, it can work with arbitrary probability density functions. For example, based on the CONDENSATION algorithm, the system of Meier and Ade [34] tracks multiple objects with multiple hypotheses in range images. In Isard and Blake [24], an importance sampling function was introduced as an extension of the standard CONDENSATION algorithm, to improve the efficiency of the

20

factored sampling. In order to robustly track sudden movement, the process noise of the motion model could be very high, so that the probability of each predicted cluster in state space becomes higher. Therefore, to populate these larger clusters with enough samples to permit effective tracking, the sample set size must be increased, thus also increases the computational complexity. Importance sampling applies when auxiliary knowledge is available in the form of an importance function describing which areas of the state space contain most information about the posterior. In the sampling stage, two given probabilities were set to determine the method from standard sampling, importance sampling and reinitialization. The hand blobs in Isard and Blake [24] were detected by using a Gaussian prior in RGB color space. The importance function, which was a mixture of Gaussians, gave more weight to the predictions that were near the center of the hand blob. It used a second order auto regressive process for the motion model.

2.7.1

Probability distribution

At time t, an object is characterized by a state vector Xt . Its history is Xt = {X1 , ..., Xt }. The set of features in the image is denoted by Zt with history Zt = {Z1 , ..., Zt }. There is no assumption on the density distribution, i.e., p(Xt ) can be a non-Gaussian or a multi-modal function, which cannot be described in closed form.

21

The dynamics of the evolution are described by a stochastic differential equation.2 Stochastic Dynamics The object dynamics are assumed to be a temporal Markov chain. which means that the current state Xt only depends on the immediately preceding state Xt−1 and not on any distribution prior to t − 1. for example.5).7. ∀t > 1..4) changes to p(Zt−1 |Xt−1 ) = t i=1 p(Zi |Xi ). (2. both mutually and with respect to the dynamics.5) (2. This is expressed as t−1 p(Zt−1 . while the stochastic part. Zt−1 |Xt ) = p(Zt |Zt−1 . defined by A.6) p(Zt |Xt ) = i=1 p(Zi |Xi ) p(Zt |Xt ) = p(Zt . defined by BWt .3 Measurement Model The observations of the features Zt are assumed to be independent.2. Xt )p(Zt−1 |Xt ) Integrating over Zt on both sides of equation (2. models the uncertainties caused by factors such as noise.e.7.4) After integrating over Xt . models the system knowledge. p(Xt |Xt−1 ) = p(Xt |Xt−1 ). (2. i. we get t t−1 p(Zt |Xt ) = Zt Zt i=1 p(Zi |Xi ) = Zt p(Zt |Xt ) i=1 p(Zi |Xi ) 22 . Xt |Xt−1 ) = p(Xt |Xt−1 ) i=1 p(Zi |Xi ) t−1 (2. 2. Xt = AXt−1 + BWt . so that. The deterministic part of the equation.

Zt−1 )p(Xt |Zt−1 ) = kt p(Zt |Xt )p(Xt |Zt−1 ) 23 (2. p(Zt |Xt .7. Zt−1 )p(Xt |Zt−1 ) p(Zt |Zt−1 ) p(Xt |Zt ) = = kt p(Zt |Xt . we get t−1 p(Zt−1 |Xt ) = i=1 p(Zi |Xi ) (2.11) . the conditional state density is given by p(Xt |Zt ).8).6) with (2.6) with (2.7) and the left side of (2. Xt−1 ) = p(Zt−1 .10) 2.10) p(Xt |Zt−1 . (2.9) by using the Markov assumption.9) finally is given as (2. Xt−1 ) = p(Xt |Xt−1 ) = p(Xt |Xt−1 ) (2. Xt ) i=1 p(Zi |Xi ) (2.Further. Xt |Xt−1 ) = p(Xt |Xt−1 ) p(Zt−1 |Xt−1 ) (2.7) Substituting the second term in (2. Xt |Xt−1 ) = p(Xt |Xt−1 )p(Zt−1 |Xt−1 ) p(Xt |Zt−1 . we can derive the formula of calculating the p(Xt |Zt ).4 Propagation of state density According to the assumptions that the process is a Markov chain and the observations are independent of the state.8) p(Zt |Xt ) = p(Zt |Zt−1 . Following Bayes’ rule and using (2. Xt ) From (2.5) t t−1 i=1 p(Zi |Xi ) = p(Zt |Zt−1 . we know p(Zt−1 .4).

so that p(Xt |Zt ) = p(Xt |Zt ) Xt−1 By integrating the right of the equation (2.kt is a normalization factor. p(Xt |Zt−1 ) = Xt−1 (2.11) over Xt−1 . By integrating the left of the equation (2.14) p(Xt |Xt−1 )p(Xt−1 |Zt−1 ) Xt−1 = 24 . we derive the following equation: p(Xt |Zt−1 ) = Xt−1 Xt−2 p(Xt |Xt−1 )p(Xt−1 |Zt−1 ) (2. Zt−1 )p(Xt−1 |Zt−1 ) Xt−1 = Substituting the first term on the right side of the equation by (2.12) p(Xt |Zt−1 ) (2.13) p(Xt |Xt−1 . p(Xt |Zt ) = kt p(Zt |Xt )p(Xt |Zt−1 ) The second term in equation (2.11) over Xt−1 .10).12) is calculated by as follows. we get p(Zt |Xt )p(Xt |Zt−1 ) = p(Zt |Xt )p(Xt |Zt−1 ) Xt−1 Thus.

The factor sampling method is used to find an approximation to a probability density.14) give out the propagation of the conditional state density from p(Xt−1 |Zt−1 ) to p(Xt |Zt−1 ). The probability or weight of the samples after the measurement is given by p(Zt |Xt = St ). because of the background.Equation (2. a new set of samples is generated for the time step t: St . the state space is multi-dimensional (in our case.12) and (2.7. 4 dimensions). Additionally. [19]. After normalizing weights.5 Factored Sampling One of the key techniques in the CONDENSATION algorithm is factored sampling introduced in Grenander et al. According to the dynamic model density p(Xt |Xt−1 ). The dynamics of the objects could be driven by a non-linear process and the system noise could also be non-Gaussian.12) the state density given observation Zt is also generally non-Gaussian. 2. the density of the time (n) (n) (n) 25 . Generally the density of p(Xt |Zt ) can not be evaluated simply in closed form. A set of samples St−1 is drawn from the density of the previous time step t − 1. So when it is applied to (2. The observation model density p(Zt |Xt ) is normally non-Gaussian. which is superimposed by the dynamical model p(Xt |Xt−1 ).

The whole process of sampling and propagation is shown in figure 2. The Kalman filters. they track the object shape in rectangle or oval. This may disable it in some tracking problems. In our tracking system. which is too coarse to estimate a hand in a certain gesture. we found the CONDENSATION algorithm can handle the hand tracking in our system. 26 . this problem can be solved by obtaining camera status from the server.step t is P (Xt |Zt ) = P (Zt |Xt = St )P (Xt |Zt−1 ) n (n) P (Zt |Xt = St ) (n) N n=1 n (n) = P (Zt |Xt = St ) P (Xt |Xt−1 = St−1 )P (Xt−1 |Zt−1 ) (n) (n) (2.15) P (Zt |Xt = St ) The entire sample set St (n) is going to be used for the next iteration.8 Summary From the above analysis and comparison of current tracking techniques. Image-based tacking such as background subtraction.1. for example. The large displacement of pixels in the images due to the moving camera makes the assumption of small motion in optical flow unsatisfiable. but in most of the cases they have to feed into a high level model to estimate the motion of objects. including EKF and UKF. Normally. skin color blob or mean shift tracker can be easily implemented. assume uni-modal Gaussian distributions in the state space. 2.

p(xt−1 | Zt −1) x p( xt | Zt −1 ) propagation x p(zt | xt ) observation x p(xt | Zt ) =kt p(zt | xt )p(xt | Zt−1) x Figure 2.1: The process of CONDENSATION algorithm 27 .

multiple object tracking. The computational complexity of a Kalman filter increases sharply. Based on factored sampling of the state probability density function. when the number of tracked objects increases. 28 . the more samples that are taken from the previous distribution the more accurate the tracking. a CONDENSATION tracker can estimate the position of the hand contour on skin color filtered image. The detailed design and implementation of the tracker will be presented in the next chapter. Theoretically.

and these parameters can be retrieved from the server over the network. verge and fixate the object during the tracking. in which there is fluorescent 29 . which makes it suitable to deal with the tracking problem in an active vision system. The tracker works in a normal lab room environment. It tracks a hand in a rigid pointing gesture using a binocular sequence of images taken from the active vision system described in the Chapter 1. We want to make no assumptions about how the camera is moving or about the viewing angle. however. and the hand moves according to unpredictable/unknown motion models. we find that the CONDENSATION algorithm has no assumption on the state density function nor on the object motion model. Hence it is not feasible to break up the dynamics (motion model) into several different motion classes introduced in Blake et al. is complicated because there is significant camera motion. The cameras can pan and tilt. [6] and learn the dynamics of each class and the class transition probabilities.Chapter 3 CONDENSATION Hand Tracker From the previous chapter on related work. a hand tracker based on the CONDENSATION algorithm is presented. We need a general model that is able to cope with the wide variety of motions exhibited by both the camera and the hand. This tracking problem. In the following sections.

so that the distribution of the state is reshaped according to the observations. The motion of the camera can be estimated from the parameters of the stereo camera system. i. detects the motion of the hand. and without the extra light source the object is too dark. the tracker stops the movement of the cameras. derived by sampling the previous state space. the hand is assumed to point to the object in the same side of space with respect to the body. In section 3. hypotheses of the hand state. and as a result the tracking will fail. right/left hand always points to the objects on the right/left side. Otherwise.e. One reason to set up a secondary light source is that the automatic cameras have no setting for backlighting (the primary source of light comes from behind the subject). so that relative motion of the camera to the hand can be 30 . there is no distinct finger appearance in the image. the subject waves his/her hand to “tell” the tracker there is a gesture to be tracked. Each hypothesis is measured on the skin color map of the image. pointing to an object with their index finger in a rigid gesture. In other words. which is assumed to be the only moving object at the beginning detailed in section 3. The subjects normally face the cameras with their hand stretched out. Moreover. In the initialization stage.1. propagate through a dynamic model. In each iteration.2 a Gaussian normalized distribution of skin color is built from samples of pixels on the hand. for example. when the finger points to the camera. The hand pose projected onto the image plane is assumed to show obvious finger and palm part.. the tracker can not estimate the hand state evolved from the initial state.lighting from the ceiling and an incandescent lamp in front of the hand. When the hand changes its pointing direction from the right to the left.

Meanwhile. we make an assumption that the only motion is due to a hand. y0 . r0 . The dynamic. This means that the hand is assumed to be the moving object with skin color within the first several frames. the template of the hand should be initialized without much human intervention.5 and 3.2 Hand Detection Human hands usually have a similar skin color as the face of their owner. Thus. natural sunlight and artificial indoor 31 .2. 3. But in GestureCAM. the templates for the trackers are initialized by hand.cancelled out. setting the value of X0 = [x0 . a template of the hand can be extracted within the region by applying the skin color filter introduced in section 3. measurement model will presented in detail in sections 3. But different lighting. for example. s0 ]T . 3. At the beginning. A general color model generated by statistics of the human skin color may work in most situations.6. the position of the hand in the images is used to initialize the state space. when the cameras are static.1 Initialization In the original CONDENSATION algorithm proposed by Isard [23]. To get the initial position of the hand and bootstrap the tracker. we freeze the cameras for a moment and take the difference of two successive frames.

past research Bergasa et al. Cheng et al. A threshold obtained from experiments is always applied to remove this effect. [9]. [4].4) has similar properties. Based on an analysis of distribution of skin color in different color spaces. [54]. In this implementation. G.1.1 shows in RGB space a typical aggregated color occurrence distribution from a set of skin color pixels. Figure 3. The transformation from RGB to normalized RGB space is simple and fast. B). Therefore. In this space the individual color components are independent of the brightness of the image and robust to changes in illumination. 32 .3 and 3. Fang and Tan [17] came to the conclusion that normalized RGB space is suitable for skin color detection. Each point in the figure designates the presence of a color with coordinate (R. the 3-dimension color space RGB is converted to 2-dimension normalized RGB color space. pixels in RGB color space are transformed to a normalized color space by the equations in 3. Another commonly used color space HSV (distribution shown in figure 3. but the conversion from standard RGB costs more that the conversion to normalized RGB. b can be represented by r and g. One of its disadvantages is that it is very noisy at low intensities due to nonlinear transformation. Since r + g + b = 1. may weaken the result of such color detection. It has been observed that human skin colors cluster in a small region in a color space and differ more in intensity than in color. A skin color distribution can be characterized by a multivariate normal distribution in the normalized color space Yang et al.lighting.

and both of them).R .1: Skin color in RGB space 33 .2.1) By taking sample pixels from the pictures of 11 different subjects’ hand under 3 different lighting conditions (fluorescent lamp. we found that they cluster in normalized RGB space as in figure 3. incandescent lamp. R+G+B B b= R+G+B r= (3. R+G+B G g= . 250 200 150 B 100 50 0 250 200 150 100 50 G 0 0 50 R 100 150 200 250 Figure 3.

05 Normalized G 0 0 0.2 Normalized R 0.1 0.2: Skin color distribution in normalized RG space 34 .05 0.25 0.1 0.05 0.3 0.2 0.5 2 Probability −3 1.2 0.1 0.5 1 0.15 0.1 0.15 0.2 0.05 0 0 0.3 0.x 10 3 2.25 0.25 0.5 0 0.25 Normalized G 0.3 Figure 3.15 0.15 Normalized R 0.3 0.

045 0.4 0.02 0.4 0.4 0.2 0 1 0.1 0.2 S 0 0 0.4 0.015 0.04 0.025 0.005 0 1 0.6 0.8 1 Figure 3.6 0.8 0.8 0.035 0.3: Skin color in HSV space 0.6 0.4: Skin color distribution in HS space 35 .2 H 0.2 S 0 0 50 100 150 200 250 300 350 H Figure 3.01 0.6 V 0.03 Probablity 0.8 0.

Equation 3.The mean vector m and covariance matrix Σ of both the R and G channels of the skin color can be calculated by selecting a region where the hand is located. we assume that at the initial stage.6) is the result of back projecting the distribution to the raw image (figure 3. Then a bivariate normal distribution model of the skin color is constructed by N (m. As shown in figure (3. g = ¯ i=1 1 N N gi (N is the number of pixels) and σrg = ρσr σg (ρ is the i=1 correlation of r and g). so that it can be detected easily by subtracting the first two frames. g) = σr σg The skin color map or probability of skin color image (figure 3. Since the robotic head can be fully controlled by the application. To minimize the search region for the hand. the box is the region where hand motion occurred between two frames. g ] r ¯   2  σr σrg  Σ=  2 σrg σg where r = ¯ 1 N N ri . Σ). cameras can be stopped for the operation of subtraction whenever it is necessary (initialization or re-initialization).2) ¯ ¯ (r − r)2 2ρ(r − r)(g − g ) (g − g )2 ¯ ¯ − + 2 2 σr σr σg σg σrg ρ ≡ cor(r. m = [¯. 36 . g) = where z≡ 1 2πσr σg 1 − ρ2 exp − z 2(1 − ρ2 ) (3.5). the only moving object in the scene is the hand. p(r.2 gives such density function.7).

The threshold τ can be computed by substituting the z in (3. A binary image of the hand is segmented out by using threshold τ . if hand moves against a background in skin color. Figure 3. at the end of each iteration the mean value of the r and g are updated by sampling the pixels within the tracked contour. 3σr ] and g ∈ [−3σg . In order to deal with the changes in lighting during tracking. when r ∈ [−3σr . the probability of encountering a point outside ±3σ is less than 0.2).5: Raw image taken from camera in RGB Figure 3. For a normal distribution.3% (σ is the standard deviation).8.Then the skin color filter is applied to this region. 3σg ] the minimum of z is 18(1 − ρ). Therefore. After applying the morphological operation ‘close’ to remove small ‘holes’ in the hand.6: Image filtered by skin color model 37 . we assume that there is no large continuous skin color area in the background with respect to the size of the hand shape. the largest connected component within that box is extracted as the initial hand shape shown in figure 3.2). From the distribution function (3. Because the segmentation depends on the color filter. the tracker can not distinguish the it from the background.

Figure 3.8: Detection of the skin color edge 38 .7: Substraction of the two frames Figure 3.

known as the knot e vector. p−1 (t) ti+p − ti ti+p+1 − ti+1 n   0 Then the curve defined by C(t) = i=0 Pi Ni. 1]. Second. First..tm } where T is a nondecreasing sequence with ti ∈ [0. p (t) = ti+p+1 − t t − ti Ni.3 Shape Representation After the hand is segmented from the skin color map of the raw image. t2 ... otherwise Ni.. if a control point is moved.3. be defined T = {t0 . Let a vector. .. So a cubic B-spline. only the segments around this control points are affected. and define control points P0 . a parametric curve that smoothly fits the contour of the shape could be a good representation.. To track a hand with rigid pointing gesture.p−1 (t) + Ni+1. Define the basis functions as Ni. tm−p−1 are called internal knots. Third. In this application. A B-spline is a generalization of the B´zier curve. . where the hand contour is supposed to be. 0 (t) =    1 if ti ≤ t < ti+1 and ti < ti+1 . the curve is completely controlled by the control points. The knots tp+1 . the shape should be represented in a certain way so that it can be fed into the tracker. t1 . There are many important properties inherent in a B-spline curve. the tracker needs to find the nearest edge in the skin color map of the image. Define the degree as p = m − n − 1.p (t) is a B-spline.Pn . . which 39 .. the curve can have different degrees without affecting the number of control points.

9: Representation of the hand contour after initialization: the dots represent the control points of the curve. give the estimated position of the hand. To help the measurement model put more weight on the finger part of the contour. introduced in section 3. the dashed curve is constructed from the points 3. The tracker generates hypotheses of 40 . The curve passing through the points is then generated. because the index finger gives more information about the orientation.3. a sequence of control points on the contour is extracted.is closer to its control polygon. Figure 3. A hand shape is extracted by filtering the image with the skin color model. By scanning the hand shape extracted from the top to the bottom at a given interval of pixels. the shape will fit the finger better than the palm part of the hand. the selection of the control points is taken unevenly.4 State Space For a given time t the control points of the contour curve. which will be used to measure the distance to the closest edge with skin color. In other words. is a better choice.

pointing gesture. relative to the first detected contour. the moving hand is in a rigid. The state at time t is given by Xt = [xt . rt . the template represented in B-spline curve. A tracker could conceivably be designed to allow arbitrary variations in control point positions over time. r0 .the points to match to the underlying raw image features. noted by Xt (xt . y0 . respectively. i.y). rotation r and scaling s. st ]T . Therefore. 41 . The process model of the system specifies the likely dynamics of the curve over time. This would allow maximum flexibility in deformation to accommodate moving shapes.    1 0  R0 =   0 1 The scaling parameter is st with the initial value s0 = 1. st ). O0 and Ot are the origin of the raw image. The rotation matrix in image plane at time t is   sin θt   cos θt Rt =   − sin θt cos θt where θt is the rotation from the initial place. All these parameters are estimated and measured in the image. particularly for complex shapes requiring many control points to describe them. the state vector in a given time t. However. OI . At the beginning. For each point on the curve. yt .e.. The control points on the contour of the hypothesis are calculated as follows: The initial state is given byX0 = [x0 . this is known to lead to instability in tracking. yt . rt . s0 ]T . template and prediction coordinates. there is a transformation from the initial state X0 . describes the freedom respectively in translation (x. In my application.

10: State space parameters This is based on the assumption that the components of the hand motion including translation. tt = [∆x.The translation parameter is tt = [∆x. st ) = p(xt )p(yt )p(rt )p(st ). ∆y]T . ∆y]T = Ot − O0 . i. and scaling are independent to each other. 42  © ( @  BA0  0  0 © ¥ 498¤76¦0  0  0 © § ¥0 5¤41321   ( &  £  £ © ¥ )'%£ "$#¤"!£  @ 0  © £  £ © § ¥£ ¤¤¨¦¤  £   ^ ^ ¡ ¢  . Finally. Figure 3. So the probability density of the state at time t is given by p(Xt ) = p(xt . yt .. yt ]T . rotation. where ut = [xt .10 shows these definitions. Figure 3. ut = u0 Rt st + tt .e. rt . the transformation from coordinates with origin O0 to coordinates with origin O0 is given by formula. ∆x and ∆y are the translation of the origin Ot from O0 .

σy ). the state of the hand motion is represented in a 4-dimension space. [37]. σy . It could be defined as Xt = AXt−1 + Bwt . In their system there was no assumption about how the camera moves. σx ). σs ). a 4-variate normal distribution is set at the beginning. When the background is too noisy and the motion of the camera is introduced. σr ) and s0 ∈ N (0. The initial probability density is given as p(X0 ) = p(x0 )p(y0 )p(r0 )p(s0 ) where x0 ∈ N (0. The dynamic model of the system describes the features of the motion. a more general model is applied as Philomin et al. Without pre-knowledge of the state distribution. y0 ∈ N (0. A and B can be learned by experiment. Complicated motion can be modelled by extending the model to higher orders. The σx . To deal with the wide variety of motions exhibited by both the camera and the object. r0 ∈ N (0. 3. wt is the random variable of noise. A simple linear model could work with smooth motion of the object. which is used by the tracker to predict the next state. the density of the state is sampled.Therefore. σr and σs are initialized with experiential values.5 Dynamic Model The hypothesis in the state space is propagated from the previous time step according to the system model. the standard deviation of the noise variable should be set to a larger value. In each iteration of the tracking. where Xt is the object state vector. By using a zero-order motion model with large process noise and 43 .

quasi-random sampling. which is a little more than the speed multiplied by the time interval. Intuitively the setting of the sigma is such that the next step of the translation should be within a circle. The sampling and propagation algorithm may change the shape of the distribution after iterations. respectively. the A = 1. rotation and scaling. the tracker concentrates the samples in large regions around highly probable locations at the previous time step. K is the coefficient for all the state parameters. An adaptive parameter of the distribution changes during accelerating (including both increasing and decreasing speed).11 shows the distribution in translation on x and y axis when the hand does 44 . and the B is given by Bt = KBt−1 . So in the equation Xt + Ct = A(Xt−1 + Ct−1 ) + Bwt . The following images shows the distribution of the hypotheses in state space. K= |Xt + Ct − A(Xt−1 + Ct−1 )| |Xt−1 + Ct−1 − A(Xt−2 + Ct−2 )| where Ct is the state of camera at time t. All the distributions are initialized as Gaussian. Quasi-random sampling technique generates the points that span the sample space so that the points are maximally far away from each other. which has better sampling results than the standard random algorithm. The deviation of the predictions are related to the speed of the changes in translation. Figure 3. where Ct is the relative motion of the object introduced by the camera motion. When the object moves in a constant speed the sigma of the distribution can be fixed.

The black shape is the real hand position in the image. The middle point represents zero.not changes its orientation and scale. Figure 3. respectively. The scaling distribution is computed over the ratio of the hypothetic contour size to the initial one. Finally.14 shows the distribution of all the samples in state space. while the contours are samples of the state space propagated from the previous iteration. while negative and positive values mean counter clockwise and clockwise rotation. The envelope of the distribution is no longer Gaussian after iteration. The rotation distribution is shown over the angles rotated from the initial pose.13 show the distribution of the samples in rotation and scaling respectively.12 and 3. when the rest of the state parameters are assumed constant. 45 . Figures 3.

y axis.Figure 3. 46 . evolved from previous iteration).11: The distribution of hypotheses in translation (the points on the top and left indicate are samples on the distribution of translation on x.

47 .12: The distribution of hypotheses on rotation (points on the bottom are samples evolved from previous iteration).Figure 3. when there are no changes in other parameters.

13: The distribution of the scaling (points on the right are samples evolved from previous iteration). when there are no changes in other parameters.Figure 3. 48 .

The points on the right are samples on distribution of parameter scaling. rotation and scaling.) 49 .Figure 3.14: The distribution of the state in translation. evolved from previous iteration. The points on the bottom are samples on distribution of parameter rotation. The points on the top and left indicate are samples on the distribution of translation on x. y axis.

except in the situation that hand moves toward one camera suddenly more quickly than the movement of the system motors. but also translation of the image plane. the motor driving the camera) may not intersect with the optical axis of the lens. there are still problems. it is one of the camera system’s goals: fixating the target. However. we can get the status of the motors. especially.e. Because the robotic head can pan at the ’neck’ and each of the cameras (eyes) can also pan independently. so we can utilize this to cancel out the motion of the camera. the distance from the object to each of the cameras can be maintained roughly equal (see figure 3. Actually. For our specific system. So. One is that the rotation axis of the camera (i. changes together at similar speed. a global movement is presented in each image. generally the size of the hand in both images. In normal cases.In an active vision system. the scaling parameter in the tracker. the rotation of the motor causes not only rotation of the camera.. the mean of the scaling is almost the same in both of the cameras. i. Therefore. The errors in the estimation may introduce errors in depth calculation.15). when the camera fixates on object and follows its motion. the hand is at the same distance to both of the cameras. The second is that the distance from the rotation center to the image plane is estimated by experiment. So the distribution of the scaling. 50 .e. We assume that this situation rarely happens..

object

left camera O neck of the robotic head

right camera

Figure 3.15: Distance from the object to each of the cameras can be maintained roughly the equal,when the cameras fixate on the object

51

3.6

Measurement Model

The likelihood between the observation and the hypothesis can be evaluated by taking the normals along the hypothesized contour (see figure 3.16) and calculating the distance of the nearest skin color edge (edge between skin and non-skin color). The observation process defined by p(Zt |Xt ) is hard to estimate from data, so a reasonable assumption is made that p(Z|X) is specified as a static function assuming that any true target measurement is unbiased and normally distributed, the observation density is given as 1 p(Zt |Xt ) ∝ exp(− 2 × min|z(sm ) − r(sm )|) 2σ m=1
M

(3.3)

where min|(z(sm )−r(sm )| is the distance between the hypothesis point to the nearest edge, σ is the deviation proportional to the size of the search window along the normal.

Figure 3.16: The normals along the hand contour This approach is similar to what Blake and Isard [5] used in the implementation of their trackers. The only difference is that we introduce skin color information, while in their implementation, the edges are extracted from grey scale images. The points inside the closed hand contour contain more information about the hand than those outside the contour, which maybe introduced by the image background. Normal hand 52

middle point between edges

center

Figure 3.17: The normals along the hand contour, the arrows shows the direction from interior to exterior of the hand shape shape is illustrated in figure 3.17. The points which are inside the palm, should be closer to the center of the palm than the exterior ones. Similarly, the points which are inside the finger region, should be closer to the middle points between the two edges. We augment the measurement model by probing the skin color edge along the normals from the point inside the hand shape to the outer part. We measure the nearest edge introduced by skin color probability sharply changing from high to low, that is, the probing along the normal now carrying not only the information of the position of the edge but also direction of the normal. In our case, the direction is from the region within the hand to the outside surrounding area. Usually, part or all of a measurement line lies in the interior of a target. Many extensions and modifications have been applied to the CONDENSATION algorithm to improve performance and expand its usefulness. MacCormick and Blake [32] have

53

This provides a more accurate model for the probability of features. The contour discriminant is a ratio of likelihoods that indicates how much more “target-like” a configuration is than “clutter-like”. For a hand in a pointing gesture. The idea behind the contour discriminant is that each sample represents some contour configuration in the image and that there is some likelihood that each configuration matches the true target and some other likelihood that the configuration matches clutter in the image.proposed the use of a “contour discriminant”. The main difference between the method proposed by Isard and Blake [24] and the contour discriminant method involves the assumptions made regarding the distribution of observed features in the image. Isard and Blake [24] assume that features along these normals in the interior of the contour are distributed similarly to the features on the exterior of the contour along the normals. Therefore. The contour discriminant is much more computationally expensive. which is a metric associated with each sample. but MacCormick and Blake [32] claim that it gives much better performance. since the interior feature distribution can be determined by making measurements of the target interior in the first image of the sequence. Both methods calculate likelihoods by defining normals to the contour under consideration and searching for edges along these 1-D normals. the finger indicates more information about the orientation. we measure the likelihood by probing along unevenly distributed normals on the contour (Figure 54 . small changes in the shape of the palm will distract the tracker. MacCormick and Blake [32] assume different distributions for the interior and the exterior of a contour. disregarding any knowledge of the interior. This effectively treats the contour as a wire-frame. given the observations. If the measurements taken along the contour are uniformly distributed. assuming that interior features are due to the target and exterior features are due to clutter.

a simple measurement was applied.19 as dashed bidirectional line. 55 . which makes the matching of the finger more important. If the estimation of likelihood is done along the measurement line from the hypothetical contour shown in Figure 3. Based on the assumption that all the parts of the hand within the contour are in skin color. i. This means we take more measurements on the finger than the palm of the hand. Figure 3. a measurement line from the interior to the exterior will first pass the contour of the hand (shown in Figure 3. it is much easier to be picked up by a bidirectional measurement line. In the figure. the clutter which is in skin color may affect the measurement. or in other words there is bias on the finger. palm and part of the wrist do not change. where the clutter is closer to the hypothetical contour.3. the relative position of the index finger.18: The normals along the contour for measurement of the features Under the assumption that the hand tracked in our system is in rigid pointing gesture.e.18)..19 as a solid unidirectional line.

hand shape Figure 3. The shaded part illustrates the real hand region.  ¢ ¤¥ ¡£ clutter hypothetical contour 56 . The solid line with one arrow shows the measurement taken from interior to exterior portion of the contour. while the black curve indicates the hypothetical contour which is measured.19: Measurement line along the hypothetical contour. The dashed line with two arrows measures the nearest feature point to the hypothetical contour.

background clutter distracts the tracking easily. The first iteration of the tracking process right after the initialization gives the best estimation of the state (Pinit ). if the threshold is set too low. the result is going to be more accurate. In other words.The tracker reinitializes itself by sampling the whole image with the help of the skin color map.2. On the order hand. So we set 0. it means that the confidence on this hypothesis is so low that the object is out of sight or has been lost by the tracker. background clutter and hand motion introduce noise in the result.When all samples of P (Zt |Xt ) are less than a certain probability. 57 . This threshold determines tolerance to the error in tracking. By analyzing experimental results on different lighting and background conditions.6 · Pinit as the threshold. The length of the normals used in measurement model is set to 10. whose middle point is on the hypothetical contour. which is extracted by the skin color filter presented in section 3. When the tracker follows the hand. but reinitializing may be too frequent. there are errors between such curves and the real contour due to the interpolation. the hypothetical contours never perfectly match the real hand contour. Therefore. Because the hand contour is represented by spline curves.6 of the largest initial probability. A simple way to reinitialize is stop the cameras and detect the motion of the hand again with same constraints as the initialization stage of the tracker. So the width of the finger part in the image has to be more than 5 pixels. If it is set too high. we found that when the probability of states are below 0. the P (Zt |Xt ) can not reach 1. it lose tracking of the hand.

non-rigid hand movement. It operates similarly to the process of the measurement phase in the CONDENSATION tracker. After refining the hand contour.20. A refining process is initialized which localizes the nearest hand contour to the tracking result. Based on epipoloar geometry in Xu [52]. or tracking errors. Precision can be degraded due to inaccuracy in color edge detection.3. 3. All these factors affect the result.2 Epipolar geometry The setting of the two cameras in the system is shown in the figure 1. By searching the edge within a predefined region along the contour. Π is the 58 . changes in lighting. followed by calculation of the position of the points on the hand in 3D space. the more precise the finger is located the more accurate the result can be computed.7 3D Orientation of the Hand The hypothesis with the highest evaluation after the measurement step gives the position of the tracked hand. I and I are the image planes. Its epipolar geometry is shown in figure 3. a refined contour of the hand can be detected. 3.2. focus. Using the perspective projection model. pairs of correspondence are found by correlation.7. C and C are the optical centers of left and right camera.1 Refining the Result For the upcoming step of calculating the depth. a more precise shape of the hand is extracted.7.

A point in the first camera coordinate system can be expressed by a point in the second camera through rotation R followed by a translation T . m = [x.4)  r11 r12 r13  where R =  r21 r22 r23   r31 r32 r33     and T = [tX . The focal lengths of the two cameras are f and f .. Z ]T . that is. we have M T [T × (RM )] = 0.4). 59 . y . M and M − T . z]T . m = [x . The above points can be denoted as following vectors: M = [X. Because of the coplanarity of the vectors T . Z]T . z = f and z = f . C and C lie. The two lines lm and lm are epipolar line of m and m . y. thus. M = [X . the projection of the ray through MC and MC . m and m are the projected points of M on both image planes. i. Y .   Similarly to equation (3. .epipolar plane where the object point M. tZ ]T . Z]T = R[X . e and e are the epipoles which are the points where the baseline CC intersects with the left and right image plane. They are also the intersection of the epipolar plane and the two image planes. Y. Y . z ]T where M and m are the points expressed in the first camera coordinate system while M and m are in second camera coordinate system. [X. Y. And R satisfies RRT = 1. The space point is projected to m = f M Z and m = f Z M in the two image planes.e. we have M = RM + T . Z ]T + T  (3. tY .

20: Epipolar geometry of the camera system 60   M lm y’ .lm’ C m e e’ m’ C’ I I’ lm M m y z z’ T O x baseline R m’ x’ O’ Figure 3.

noted by tx . noted by θ. which is constructed by previous experiments (not be discussed in this thesis). which is also called Essential Matrix is a function   Dividing equation (3. gives mT Em = 0.5) by ZZ . The intrinsic matrix.    R. to calculate the epipolar line in a pixel coordinate system within the reference frame. which transforms the normalized coordinates to pixel coordi- 61 . we can get the epipolar line in the other image. Em is the projective line in the first image that goes through the point m.5) tX 0 the translation between the two cameras. Therefore. and relative rotation between the two optical axes. So given a point in one of the images.  0  0 −tZ  E =  tZ 0 −tX   0 tX 0    cos θ 0 sin θ   0 1 0   − sin θ 0 cos θ       Since the essential matrix relates corresponding image points expressed in the camera coordinate system. In our system geometry shown in figure 1.The above equation can be rewritten as M T EM = 0  0  where E =  tZ   −tY of the rotation and  −tZ tY 0 −tX  (3. in which the focal length is fixed. we need the intrinsic matrix of the camera. because the two cameras tilt synchronously.2 there is translation between the center of two cameras along the baseline.

21. (v0 . 3. m2 ) V ar(m1 )V ar(m2 ) 62 (3. as shown in figure 3.3 Correspondence The corresponding pixels on the hand in both images are found by calculating the correlation within the region where the epipolar line and the hand overlap. respectively. The formula is given as Score(m1 . m2 ) = Cov(m1 . u0 ) is the principal point in pixel image coordinates and α is the angle between the two axes.7. ˜ ˜ So the fundamental matrix F is given as F = A −T EA−1 . m = Am and m = A m are the points in pixel image coordinates. (3. can be expressed as  f ku f ku cos α u0     A= 0 f kv / sin α v0      0 0 1 where kv and ku are the ratios between the units of the camera coordinates and the pixel coordinates. They satisfy the ˜ ˜ epipolar equation so mT A−T EA −1 m = 0. We used a normalized correlation algorithm to compute the correlation coefficient or score of two pixels on left and right image.7) .nates.6)  The projective epipolar line lm corresponding to the point m in the left reference frame can be calculated by equation lm = F m. The intrinsic matrices of the right and left camera are noted as A and A .

vk + j)] Ik (uk . v2 + j) − I2 (u2 .where n −m Cov(m1 . The pixel m2 in image I2 whose correlation score is the maximum within the search window along the epipolar line l1 is the corresponding pixel m1 in I1 . v2 )] [Ik (uk + i. vk + j) − Ik (uk . vk ) = i=−n j=−m (2n + 1)(2m + 1) n −m i=−n j=−m V ar(mk ) = [Ik (uk + i. m2 ) = i=−n j=−m n −m [I1 (u1 + i. epipolar line l1 of m1 u2 u1 v2 v1 m1 m2 I1 I2 correlation window search window Figure 3. v1 + j) − I1 (u1 . v1 )][I2 (u2 + i.21: Finding correspondence along the epipolar line 63 . vk )] (2n + 1)(2m + 1) .

22. This makes it hard to select correspondence from the candidates. this situation seldom appears and the images from both of the cameras are similar. as show in figure 3.Because of the inherent properties of the two cameras. Furthermore. iris and so on. in a lab or lecture room. when there is back light appearing in area 1. 2 £¤¥  ¡¢ ¥  ¡¢  ¡  ¡¢  ¡¢  ¡  ¡  ¡  ¡ 1 3 only in the right image only in the left image in the both images Figure 3. Normally. Another problem is caused by the high vergence of the cameras in that the images from each camera shows different views of the object (in area 3) and such views contain different information about the object.22: View overlapping when cameras verge 64 . the right camera adapts to the change. the images are different in the sensitivity to the same color. exposure. when the hand is too close to the robotic head. the camera is set with automatic focus. which makes two images differ from each other very much. For example.

the lines connecting the correspondence and the optical center in each image intersect the corresponding one in other image at the point of the object. In figure 3. Occlusion and unequal lighting on the object in each image can make the error of the correspondence detection even worse. With the information of pairs of correspondences in each image. Normally the variation is within 10 cm. Since we have the pair of correspondences in the two images from the previous step. By solving the line equations. But the correspondences can not be detected perfectly. regardless of the pan and tilt. line mOl and m Or intersect at M. For the majority of hand gestures. the depth of the points on the hand do not vary too much.Based on the assumption that the relative distance among neighboring points do not change dramatically on the hand. the depth of each point of the hand can be calculated.7.23. which includes the vergence and tilt of the cameras and the rotation of the neck. we can get the 3D position in the coordinate system C. the epipolar geometry of the cameras system is constructed. 65 . 3.4 3D Orientation Based on the parameters of the robotic head and cameras. the feedback from the depth calculation introduced in the next section can be used to cancel out part of the unreliable correspondence.

the coordinate is transformed to coordinate system A.left camera bt y Ol y y O x zp yt zt z z m M bt y O x ap z O xp right camera A x bt B y ' Or z x z ' m’ x ' C Figure 3.23: Transformations to the head coordinate system  1 0 0     After applying two rotation matrices (tilt Rt =  0 cos bt sin bt  and pan Rp =     0 − sin bt cos bt    cos ap 0 sin ap     0 1 0 ). whose     sin ap 0 cos ap origin is located at the intersect of the baseline and the ’neck’ of the head.  66 .

the orientation of the hand is given by H = F − P as show in figure 3. Y H F P Z F H P X O Figure 3.24: 3D orientation of the hand 67 . the orientation of the hand in 3D space is computed by a simple line equation. If the vector formed by the average location of the finger tip pixels is F and the vector of the average location of the palm pixels is P .24.Furthermore.

8 System Architecture and Implementation The tracker is intended to operate in the GestureCAM system introduced in section 1.25 shows the main process of the system. they are implemented by two separate threads which sample. The process of the tracker for right and left camera are symmetric. These two threads were synchronized before the start of calculating the 3D orientation.2GHz CPU. The operating system is Windows 2000. Due to the independency of the trackers for both cameras. The tracker runs on a normal desktop computer which is equipped with AMD 1. propagate and measure the density of the hand state distribution in each image over time. 512 MB memory. The program is developed under Visual C++ 6. and two frame grabbers used to capture images from the stereo cameras. 68 . Figure 3. The whole process works iteratively beginning with reading image data and ending with a 3D orientation of the hand.0 and Intel Image Processing Library.3. In the figure only shows the details of the one of them which is circled by a dashed oval.

25: Tracker Diagram 69 .System Initialization Initializing State and Shape Model Dynamic Model Left Image Right Image Propagation Skin color filter Hand Tracker Hand Tracker Measurement Searching Correspondence No Yes lost track? 3D Orientation Figure 3.

The hand tracker was run in different lighting and background image clutter conditions to show its robustness. It tracked a hand moving over a dark background. Accuracy and complexity of estimating the 3D orientation are presented in the following sections. the number of samples used in the tracker ranges from 10 to 5000. followed by analysis and discussion of the results. It other words.Chapter 4 Experiments and Discussion In order to analyze the major factors which affect the performance of the hand tracker a series of experiments are shown in this chapter.1 shows the relation between tracking accuracy and number of samples used in the CONDENSATION tracker. while other parameters were kept unchanged. rotation and scaling were examined. In this controlled situation. the factors affect the accuracy are the models and number of samples in CONDENSATION algorithm. Three kinds of hand motion such as translation. In this set of experiments. An image processing specific measure is employed to assess the accuracy of the track- 70 . 4.1 Accuracy of Tracking The figure 4. there was almost no background clutter that could distract the tracker.

The accuracy of shape. which was obtained by applying the same skin color filter on the image frame and manually marking the hand region. The scale factor of 2 in the signal value was chosen so that a Signal to Noise Ratio (SNR) of 0 (i. The hand contour resulting from the tracking process is rendered flat filled in by the ‘foreground’ color (colored with white) into the image Itrack . the accuracy is improved significantly. It means that from this point having more samples of the distribution of the state does not yield major improvement.y [Iref (x. The output SNR (in dB) denoted as out SNR is calculated by using the following equation. The signal and noise are calculated using the following quantities.ing process (i. SN Rout (dB) = 10 log signal noise (4. position and orientation of the tracked contour) in Tissainayagam and Suter [50]. y) − Itrack (x.1) 2 noise = images x. This is the ‘worst case’ scenario where the tracker has completely failed to track the object. Especially. y)] Iref is the pixel value at (x.e. Theoret71 . The pixel value for a ‘background’ pixel is 0. Thus the error measure is independent of the contour representation. It is more appropriate to measure the signal in terms of the area of ‘foreground’ pixels in the ground truth image. y)]2 (4.e.. the accuracy asymptotically converges.y [Iref (x. when the number of samples goes from 1 to 1000. y) for the ground truth image. signal = 2 images x. signal = noise) would occur if the tracker silhouette consisted of a shape of the same area as the ground truth shape but inaccurately placed so that there is no overlap between the two.. we can see that the percentage of successfully tracked frames increases with respect to the increasing number of samples.2) From the graph. After 1000.

especially skin color. Due to the use of skin color filter.82 8.80 8.21 9. The solid curve in figure 4. In fact.59 10. the more clutter in the background.1 shows the accuracy of tracking hand with clutter-less background while the dotted curve shows the one with higher cluttered background.8 500 1000 2000 3000 5000 10.03 6. In this experiment. the more likely tracker is distracted by the noise. causing SN R in (4.73 11. This experiment shows that increasing the number of samples in CONDENSATION algorithm can improve the tracking accuracy to a certain extent.2) to go to infinity.74 8. number of samples SN Rnoclutter (dB) SN Rhighclutter (dB) 20 50 100 6.ically.14 4.76 11.1: Experimental result on accuracy 72 . because the hand contour is represented by the spline curve which is not exactly the hand shape edge.61 8. Another factor affecting the tracking is the noise in the image. a perfect tracking result will make the noise equal to 0. Five image sequences of different combination of hand motion were tested.75 11.16 7.74 6. there is inherent noise introduced by the shape representation. The extent of the clutter was measure by the average percentage of pixel in skin color. different light sources made different background illumination.73 9.75 Table 4.

03% of the pixels are skin color).35% of the pixels are skin color). the red curve shows the one with light cluttered background (3.74% of the pixels are skin color).1: Accuracy vs. number of samples: the solid curve shows the accuracy of tracking hand with no cluttered background (0.13 11 accuracy measureed by SNR (dB) 9 7 5 3 0 500 1000 1500 2000 2500 3000 3500 number of samples used in tracker 4000 4500 5000 5500 Figure 4. 73 . the error bars are the standard deviation in the results in the experiments. the thick green curve shows the one with highly cluttered background (9.

number of samples time of computation (sec) 20 50 100 200 500 1000 2000 3000 0.04 Table 4. The computation of the tracker in each frame was timed with respect to the different number of samples used in the tracker.4.98 2. the computation of each sample can be parallelized.36 0. The graph in figure 4.1.1 shows why there is always a trade-off between accuracy and complexity. The graph in figure 4. The curve in the graph shows the roughly linear relation between complexity and number of samples.67 12. the computation cost of tracking is stable.2 shows the relation between number of samples used in the tracker and the time consumption in calculation using the same image sequences in section 4. Thus.21 4.2: Experimental result on complexity 74 .54 0. This result is independent of the complexity of the image or the hand motion. N is the number of samples for each iteration and log N is the cost for randomly picking up a sample from the base sample set by using binary subdivision.7 0. Since the samples go through the filter independently. and the images from each camera can also be processed independently.11 7.2 Computational Complexity In Isard [23] the use of the random-sampling algorithm causes one iteration of the CONDENSATION algorithm to have formal complexity O(N log N ).

14 12 10 time of computation (sec) 8 6 4 2 0 0 500 1000 1500 2000 number of samples 2500 3000 3500 Figure 4. number of samples: the error bars are the standard deviation of the experimental results 75 .2: Computational complexity vs.

4. The tracker keeps track of the hand successfully. left.3 Experimental Results of Tracking on Real Images Given sequences of images taken from the stereo cameras with different lighting and background clutter. the hand motion can be tracked in both cameras more accurately. towards the cameras.03% of the background pixels are skin color.4.3. 76 . The frame rate is 10 frame/second.1. There is no back lighting in the scene. experiments on tracking show how these two factors affect the result. Finally. rotation. 1000 samples were used in the tracking algorithm. In this experiment.9 show tracking using a black curtain as background. the distribution of state (translation in x and y. The following pairs of images in Figures 4. The accuracy of the tracking result is shown by the solid curve in figure 4. which introduces error for a rigid contour tracker.1 Performance of Tracker with Low Cluttered Background When there is no skin color clutter in the background.3 to 4. bottom and right boundary. In each following image. The factors that affect the result are the lighting condition and vergence of the camera. These may change the hand shape appearance in each camera during tracking. scaling) is shown respectively in top. an average of 0. a set of experiments show the performance of estimating 3D orientation. The hand moved up and down.

4: Frame 15 77 .3: Frame 5 Figure 4.Figure 4.

6: Frame 35 78 .5: Frame 25 Figure 4.Figure 4.

7: Frame 45 Figure 4.8: Frame 55 79 .Figure 4.

9: Frame 65 80 .Figure 4.

74% the background pixels in skin color.16.4. There are about 3. the clutter in the background increases.3. Figure 4.10: Frame 30 81 .1. the cameras moved when the tracker followed the hand. The accuracy of the tracking result is shown by the dotted curve in figure 4. the tracker loses the tracking of the hand. 1500 samples were used in the tracking algorithm. that is. an object on the background with similar color distribution.10 to 4.2 Performance of Tracker with Lightly Cluttered Background Without the black curtain. In the pairs of images from 4. the tracker works well until it moves in front of the face. but using the only one main light source. In the following images. When the hand occludes the face.

Figure 4.11: Frame 40 Figure 4.12: Frame 50 82 .

Figure 4.14: Frame 70 83 .13: Frame 60 Figure 4.

16: Frame 90 84 .15: Frame 80 Figure 4.Figure 4.

4. The accuracy of the tracking result is shown by the thick dashed curve in figure 4.35% of the background pixels are in skin color.23 are frames during the tracking of hand in a lab situation. in which background is cluttered and both of the cameras move actively.17: Frame 20 85 .1. 1500 samples were used in the tracking algorithm. The light sources are the fluorescent lamp from the ceiling and a incandescent lamp in front of the subject in order to reduce the backlight effect.17 to 4. Figure 4.3.3 Performance of Tracker with Highly Cluttered Background The following images from figure 4. From the result we found that the motion of cameras almost did not affect the tracking. Approximately 9.

Figure 4.19: Frame 40 86 .18: Frame 30 Figure 4.

20: Frame 50 Figure 4.21: Frame 60 87 .Figure 4.

Figure 4.22: Frame 70

Figure 4.23: Frame 80

88

4.4

Experiments on 3D orientation

The tracking of the hand from the images of the cameras is in two dimensions. By using epipolar geometry with the help of the intrinsic information of the camera system, we can get the 3D location of the object in the view. There are mechanical errors introduced by the motors and the cameras system, that is, the information we get from the server indicating the state of the motor might not be the true rotation of the camera system, when the images were captured. When the object is far from the cameras, the error of estimating the location might become very large. The experimental setup is shown in figure 4.24. The thick arrow in the lower image represents the hand and arm while the point at the end of the arrow indicates where the elbow is located. The arm moves against the plane A that is perpendicular to the xz plane. In other words, the hand moves in a vertical plane. The dashed arrow and planes indicate the possible position of the arm in the experiments. Plane A changes its orientation and distance to the cameras as shown in the lower diagram. 4 different orientations of rotation with respect to the axis y, and 5 angles within the plane A were tested. Furthermore, three positions (z= 860 mm, 1060 mm and 1250 mm) of the plane parallel to the xy plane were taken to test the accuracy of the depth. To minimize the effect of other noise during tracking, the background was pure black and light source was from the front of hand.

89

A

Rotation in vertical plane

rotation in xz plane

Plane A y

z x

Figure 4.24: System setup for experiment on 3D orientation

90

The slope is about 1. The hand moved in a vertical plane parallel to the plane xy. 0 degree in tilt and pan. errors in tracking results may aggravate the inaccuracy in searching corresponding points.3. There are several major sources of the error in depth estimation. Mismatching a pair of points may introduces large error in calculating depth. The cameras were fixed with 10 degree in vergence. First.e. the rotation angle of the motor which drives the camera is not very accurate due to mechanical errors.The figure 4. the corresponding points search method is based on estimating correlation of the pixels along epipolar line. Additionally. Second. Third. the error in the calibration of the stereo camera is a dominant factor affecting the accuracy. calculating the distance from the camera to the hand in axis z.25 shows the experimental result on distance estimation. 91 . i. The deviation of the estimation gets larger when the distance from camera to the hand increases.. We found that the estimation is linear with the real distance. which gives a resolution of less than 10cm within a distance of 1 meter. The figure also shows that the depths of the points on frontoparallel plane cluster around the estimated value.

00 1800.25: Real distances vs.00 1200.00 1000.00 1060.2000.00 1088.00 200.00 1600.80 1743.20 1439.00 860. estimated distance 92 .00 0.00 400.00 800.00 Estimated depth (mm) 1400.40 Figure 4.00 600.00 Read depth (mm) 1250.

4. 0 . The cameras were fixed during the experiment with 10 degree in vergence. where (xh . The deviation of the estimation becomes larger when the plane turning away from the plane parallel to the xy plane. and the (xe . 0 degree in tilt and pan.       -22. They are -45 . 22. yh . 93 . the distortion of hand shape was more significant than in 0 .5 . The large error in estimation of point’s z coordinate worsens this calculation. the vertical plane was placed in 5 different angles.5     and 45 . ye .There are 5 experimental results on estimating the rotation of the arm projected on z xz plane shown in figure 4. When     the hand moved in the 45   or -45 plane. by using formula arctan[ xh −ze ]. ze ) is the position of the elbow. In the experiment. while the position of elbow was 1 meter from the camera in the z axis. zh ) is h −xe the average position of the points on hand.

00 -40.2096 -40 -60 -80 -45.63 Estimated Angle (degree) 20 12.00 -22.50 45.00 22.26: Orientation in xz plane 94 .80 -20 -22.00 Real angle (degree) Figure 4.60 40 30.16 0 21.50 0.

yh . The angle is computed by anglev = arctan[ √ (yh −ye ) (xh −xe )2 +(zh −ze )2 ]. Comparing to the previous graph showing high deviation in estimating rotation in xz plane. where (xh . The maximal standard deviation is less than 2 degree. ye . 95 . The maximal error from real value is less than 10 degree.27 shows relation between the estimated and the real rotation of the arm within a vertical plane. due to the smaller accuracy in depth estimation. it shows that tracking motion in a vertical plane is much more accurate than tracking horizontal motion.The Figure 4. ze ) is the position of the elbow. zh ) is the average position of the points on hand. and the (xe .

                                7UXH DQJHO GHJUHH.

27: The orientation of the arm vertical (VWLPDWHG DQJHO GHJUHH. HVWLPDWHG WUXH Figure 4.

96 .

28: Arm orientation projected in xz plane.28 shows the estimation of the arm orientation projected on the xz plane.5 (cyan). 106mm (black) and 125mm (green). All the lines converge at a point where the elbow is located. and -45 (blue). Figure 4.Figure 4. -22.         97 § § ¥  © 1200 § § ¥ £ ¡ ©¨¦¤¢  § § ¥  ¥ ©¨¢  R R 6 5 4 2 1 0 ) 4 4 ©$3¢¤) ¦( # # $&$"©  ' % # !  U T R VSSQ B B E C FDB IH PG @ 8 A97 . 22. The vertex in each color represents the position of the elbow in each experiment. 2200 2000 1800 1600 Z 1400 1000 800 -150 -100 -50 0 50 X 100 150 200 250 300 Figure 4. The hand moved in the vertical plane at 3 depth positions and 4 horizontal rotations.29 shows a 3D view of the experiment result.5 (yellow) and 45 (pink). The arm rotating in the vertical plane is illustrated as a line. Measurements are taken at 860mm (red).

28.5 (yellow) and 45 (pink). and -45 (blue).29: A 3D view of the experimental results on tracking showed in Figure 4.         98 . Measurements are taken at 860mm (red). 22. 106mm (black) and 125mm (green).5 (cyan).300 200 300 200 100 0 −100 −200 −100 −300 2200 2000 1800 1600 1400 Z 1200 −200 1000 800 X 0 Y 100 Figure 4. -22.

the noisy depth calculation caused by errors in camera calibration and searching corresponding points.5 Summary From the above experiments.3 show the tracking results with the distribution of each dimension in state space. To some extent increasing the number of samples using in CONDENSATION algorithm can improve the accuracy. The linear relation between the number of samples and computational cost implies that there is trade-off of accuracy and complexity. the tracker estimates the translation. The points on 99 . 4. 1. The active moving cameras introduces more noise in the motion model. The images in 4. The lighting condition changes the clutter in the background as well as the distribution of the skin color model. The clutter in the background also reduces the accuracy. 2.4. rotation and scale of the hand contour in the stereo images. 5. enlarges the errors in estimating the position. Although initialized as a Gaussian distribution (a reasonable general guess). As we can see from the tracking result on the real image sequences. The shape of the density function changes after tracking iterations. it becomes non-Gaussian quickly when the background gets cluttered. 6. In the experiments on estimation of hand orientation in 3D space. 3. we found that the following affect the performance of the hand tracker. Tracking speed is about 1 frame/second using a Pentium3 937MHz normal desktop.

100 . The hand motion in 3D space is projected to 2D image plane in each camera. 7.the hand seem to cluster when the estimated positions in vertical plane are projected to horizontal plane shown in 4. The last two kinds of rotation may change the aspect ratio of the hand shape.28. so that the tracking result become less accurate. The rotation with respect to z axis is estimated by the rotation parameters of hand state. while the rotation with respect to y ro x axis is estimated by scaling parameter. but the resolution is about 10cm at the distance of 1 meter.

then generate a set of hypotheses. it reduces the non-skin color clutter. and searching the nearest feature points from interior to exterior based on the knowledge of the hand shape at the initialization stage. The weight of the hypothesis is normalized and the state distribution is updated as the base for next iteration. Using the factored sampling technique. The tracker estimates the translation. it can deal with arbitrary distributions of the state. Initial hand contour is detected by simple substraction of two consecutive frames followed by skin color filtering and a morphological operation. we presented a hand tracker based on CONDENSATION algorithm. 101 . rotation. and scaling of such contours. It tracks hand in a rigid gesture in an active vision system and gives the 3D orientation of the hand in a lab situation.Chapter 5 Discussion and Future Work In this thesis. The measurement model estimates the likelihood of the hypothesis to the features in the image. according to the CONDENSATION algorithm. Such samples go through the dynamic model combining with camera motion. By applying an adaptive bivariate normal model in normalized RG color space. The tracker measures the feature points on the normals along the contour with bias on finger points.

the average resolution in depth is more than 10 cm when the distance is above 1 meter. Tracking speed is about 1 frame/second using a Pentium3 937MHz normal desktop. The tracker can guide the stereo cameras fixate at the moving hand with appropriate zooming in. the 3D position of the point on hand is calculated. the hand contour decreases the range of candidates.1. the appearances of the object in both images are almost the same. the robotic head may be placed at the back of a lecture room tracking a speaker standing in the front of the room. Since the vergence of the cameras are small. using 2000 samples. The base line between the two cameras in the epipolar geometry is far from the object. so that the stereo cameras are almost parallel to each other and the major motion is head panning. In the GestureCAM project. Zooming changes the focal length which need to be calibrated. which helps in the searching corresponding points on the hand shape. The tracker based on the CONDENSATION algorithm works well in cluttered and globally moving background. Due to the noise introduced by calibration of the stereo camera and corresponding points.The weighted mean of the hypotheses is the estimation of current hand contour state. 102 . By applying epipolar geometry and the correlation algorithm. The accuracy ranges from about 9dB to about 12db measured by SNR introduced in 4. As a cue for searching correspondences in the pair of stereo images.

5. it can be computed by parallel algorithm in the future in order to run in real time and keep a good accuracy at the same time. hand can do more complicated motion. In the real world. The hand tracker presented in this thesis can only track a single gesture. It could deal with more gestures and gesture switching if there is mechanism to evaluate which model is the best fit to the hand gesture. The computational complexity is the proportional to the number of samples used in the CONDENSATION tracker. is crucial in tracking object in 3D space. estimating the intrinsic and extrinsic parameters. A well-calibrated stereo vision system would not only dramatically reduce the complexity of the stereo correspondence problem but also significantly reduce the 3D estimation error. we found that generating the hypothetic contours and the normals takes large part of the total running time. The accurate calibration of active stereo camera. which is the full palm with the pointing finger up. the kinematic calibration and head/eye calibration. Since the measurement on each hypothesis is independent. or applying deformable model. In our implementation. The complexity of calibrating active camera is higher than for a passive one. It includes processes such as motorized lens calibration. 103 . more types of hand motion and gesture can by tracked.1 Future Work In this implementation we assume that the gesture of the hand is rigid. and may appear differently due to the change of lighting and perspective. By extending the state space to higher dimension.

1998. Third International Conference on Automatic Face and Gesture Recognition. In Proceedings of Fourth IEEE Workshop on Applications of Computer Vision (WACV) ’98. M.html. [3] S. 1998. Beauchemin and J. URL citeseer.A. [7] G. ACM Conf. Blake and M. C. Sharma. North.M. Nara. Sotelo. Barron. August 1970. 18:987. Bowers. L. and M. Bergasa.nj. Azoz. Blake. In Proc. Isard. April 1998. Greenhalgh. Fahl´n. S. and R. pages 185–192. 1994. Human Factors in Computing Systems. Unsupervised and adaptive Gaussian skin-color model. Devi. In In Proc. L. and D.com/article/blake98learning. URL citeseer. [2] S. pages 274–279. B. Boquete. [5] A. pages 214 –219. volume 1. Isard. A. 3d position.nj. ACM Computing Surveys. In Proceeding of ACM Siggraph. Snowdon. Mazo. Benford.nec. attitude and shape input using video tracking of hands and lips. M. L. pages 242–249. Image and Vision Computing. Tracking hand dynamics in unconstrained environments.R.com/benford95user. Gardel. Japan. 1995. Learning multi-class dynamics. [4] L. E. and L.Bibliography [1] Y. 27(3):433–467. [6] A. J. CHI. Real time face and object tracking as a component of a perceptual user interface.nec. 1995. Bradski.html. 104 . User e embodiment in collaborative virtual environments. The computation of optical flow.

34:2259. Cowie. pages 142–149. G. 1989. Cheng. In Proceedings of the IEEE and ACM International Symposium on Augmented Reality 2001. Mean shift and optimal prediction for efficient object tracking. [10] D. Hingorani. Computer Science. Pattern Recognition. J. Derpanis. Hilton Head Island. Dorfmuller-Ulhaas and D. Fellenz. Leung. [9] H. J. A. South Carolina. Thorpe. W.D. Real-time tracking of non-rigid objects using mean shift. [11] D. Ramesh. X. Douglas-Cowie. and J. Tsapatsoulis. May 2003. 18(1):32–80. Robust car tracking using Kalman filtering and Bayesian templates. [14] K. W. G. Dellaert and C. Comaniciu and V. pages 70–73. and J. pages 55–64. and P. volume 2. York University. 105 . Irvine. [15] K. Bergen. Jiang. Ramesh. J. Y. E. 2001. pages 2–12.G. Master’s thesis. R. Emotion recognition in human-computer interaction. Lee. Schmalstieg. In Proceedings of International Conference on Image Processing. Shvaytser. 2000. Finger tracking for interaction in augmented environments. Object tracking with a moving camera. In Conference on Intelligent Transportation Systems. and J. R. 1997. In IEEE Conf.H. IEEE Signal Processing Magazine. volume 3. Wang. Color image segmentation: advances and prospects. A. [12] R. [13] F. Lubin. Comaniciu. Meer. S. R. Taylor. Burt. In IEEE Workshop on Visual Motion. J. Computer Vision and Pattern Recognition (CVPR’00). Kolczynski. CA.[8] P. N. V. Vision based gesture recognition within a linguistics framework. Kollias. Sun. 2000. August 1970. Votsis. Jan 2001.

[16] B. Dorner. Chasing the colour glove: Visual hand tracking. Master’s thesis, Simon Fraser University, June 1994.
[17] Y. Fang and T. Tan. A novel adaptive colour segmentation algorithm and its application to skin detection. In Proceedings of The Eleventh British Machine Vision Conference, pages 23–31, September 2000.
[18] A. Francois and G. Medioni. Adaptive color background modeling for real-time segmentation of video streams. In Proceedings of the International Conference on Imaging Science, Systems, and Technology, pages 227–232, Las Vegas, Nevada, 1999.
[19] U. Grenander, Y. Chow, and D. Keenan. Hands: a pattern theoretic study of biological shapes. Springer-Verlag, New York, 1991.
[20] S. Gutta, J. Huang, I. Imam, and H. Weschler. Face and hand gesture recognition using hybrid classifiers. 1996. URL citeseer.nj.nec.com/gutta96face.html.
[21] R. Herpers, G. Verghese, K. Darcourt, K. Derpanis, R. Enenkel, J. Kaufman, M. Jenkin, E. Milios, A. Jepson, and J. Tsotsos. An active stereo vision system for recognition of faces and related hand gestures. In Second Int. Conference on Audio- and Video-based Biometric Person Authentication, pages 217–223, Washington, D.C., 1999.
[22] R. Herpers, K. Derpanis, D. Topalovic, J. MacLean, and J. Tsotsos. In Workshop Dynamische Perzeption, Universitaet Ulm, Germany, 2000.
[23] M. Isard. Visual Motion Analysis by Probabilistic Propagation of Conditional Density. PhD thesis, Robotics Research Group, Department of Engineering Science, University of Oxford, 1998.
[24] M. Isard and A. Blake. Icondensation: Unifying low-level and high-level tracking in a stochastic framework. In European Conference on Computer Vision, pages 893–908, 1998.
[25] S. Jehan-Besson, M. Barlaud, and G. Aubert. Region-based active contours for video object segmentation with camera compensation. In Proceedings of 2001 International Conference on Image Processing, 2001.
[26] M. Jones and J. Rehg. Statistical color models with application to skin detection. In Computer Vision and Pattern Recognition (CVPR 99), pages 274–280, Ft. Collins, CO, 1999. URL citeseer.nj.nec.com/227514.html.
[27] S. Julier and J. Uhlmann. A general method for approximating nonlinear transformations of probability distributions. 1996. URL citeseer.nj.nec.com/julier96general.html.
[28] Z. Kalafatic, S. Ribaric, and V. Stanisavljevic. A system for tracking laboratory animals based on optical flow and active contours. In Proc. 11th International Conference on Image Analysis and Processing (ICIAP 2001), pages 334–339, Palermo, Italy, September 2001.
[29] M. Kampmann. Segmentation of a head into face, ears, neck and hair for knowledge-based analysis-synthesis coding of videophone sequences.
[30] W. Kim and J. Lee. Visual tracking using snake for object’s discrete motion. In Proceedings of the 2001 IEEE International Conference on Robotics and Automation, Seoul, Korea, 2001.
[31] W. Leung, K. Goudeaux, S. Panichpapiboon, S.-B. Wang, and T. Chen. Networked intelligent collaborative environment (NetICE). In IEEE Intl. Conf. on Multimedia and Expo, New York, July 2000.
[32] J. MacCormick and A. Blake. A probabilistic contour discriminant for object localisation. In International Conference on Computer Vision, 1998.
[33] J. Martin, V. Devin, and J. Crowley. Active hand tracking. In IEEE Third International Conference on Automatic Face and Gesture Recognition, FG ’98, April 1998. URL citeseer.nj.nec.com/martin98active.html.
[34] E. Meier and F. Ade. Tracking cars in range images using the condensation algorithm. In IEEE/IEEJ/JSAI International Conference on Intelligent Transportation Systems ITSC’99, Tokyo, Japan, October 1999.
[35] K. Oka, Y. Sato, and H. Koike. Real-time tracking of multiple fingertips and gesture recognition for augmented desk interface systems. In Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition (FGR’02), pages 429–434, 2002.
[36] V. Pavlovic, R. Sharma, and T. Huang. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677–695, 1997. URL citeseer.nj.nec.com/pavlovic97visual.html.
[37] V. Philomin, R. Duraiswami, and L. Davis. Pedestrian tracking from a moving vehicle. In Proceedings of the IEEE Intelligent Vehicles Symposium 2000, pages 350–355, 2000.
[38] M. Rauterberg, M. Bichsel, and M. Fjeld. A gesture based interaction technique for a planning tool for construction and design. In 6th IEEE International Workshop on Robot and Human Communication, pages 212–217, September 1997.
[39] J. Rehg and T. Kanade. Visual tracking of high DOF articulated structures: an application to human hand tracking. In European Conference on Computer Vision, pages 35–46, 1994. URL citeseer.nj.nec.com/rehg94visual.html.
[40] M. Rosenblum, Y. Yacoob, and L. Davis. Human emotion recognition from motion using a radial basis function network architecture. In IEEE Workshop on Motion of Non-Rigid and Articulated Objects, pages 43–49, 1994. URL citeseer.nj.nec.com/rosenblum94human.html.
[41] K. Sandeep and A. Rajagopalan. Human face detection in cluttered color images using skin color and edge information. URL citeseer.nj.nec.com/557854.html.
[42] L. Sigal, S. Sclaroff, and V. Athitsos. Estimation and prediction of evolving color distributions for skin segmentation under varying illumination. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2000), 2000. URL citeseer.nj.nec.com/article/sigal00estimation.html.
[43] S. Smith. Asset-2: visual tracking of moving vehicles. In IEEE Colloquium on Image Processing for Transport Applications, 1993.
[44] S. Smith and J. Brady. Asset-2: real-time motion segmentation and shape tracking. In Fifth International Conference on Computer Vision, pages 237–244, 1995.
[45] S. Smith and J. Brady. Asset-2: real-time motion segmentation and shape tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:814–820, Aug 1995.
[46] T. Starner and A. Pentland. Real-time American Sign Language recognition from video using hidden Markov models. In International Symposium on Computer Vision, pages 265–270, 1995.
[47] T. Starner, J. Weaver, and A. Pentland. A wearable computer based American Sign Language recognizer. pages 130–137, 1997.
[48] B. Stenger, P. Mendonça, and R. Cipolla. Model-based hand tracking using an unscented Kalman filter. In Proc. British Machine Vision Conference, volume I, pages 63–72, Manchester, UK, September 2001.
[49] H. Tao, H. Sawhney, and R. Kumar. Object tracking with Bayesian estimation of dynamic layer representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):75–89, 2002.
[50] P. Tissainayagam and D. Suter. Performance measures for assessing contour trackers. International Journal of Image and Graphics, 2:343–359, April 2002.
[51] G. Welch and G. Bishop. An introduction to the Kalman filter. Technical Report TR 95-041, Department of Computer Science, University of North Carolina, NC, 1995.
[52] G. Xu and Z. Zhang. Epipolar geometry in stereo, motion, and object recognition: a unified approach. Kluwer Academic Publishers, 1996.
[53] K. Yachi, T. Wada, and T. Matsuyama. Human head tracking using adaptive appearance models with a fixed-viewpoint pan-tilt-zoom camera. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000.
[54] J. Yang, W. Lu, and A. Waibel. Skin-color modeling and adaptation. In Proceedings of ACCV’98, pages 687–694, 1998.