
Proceedings of the 2008 IEEE International Conference on Robotics and Biomimetics, Bangkok, Thailand, February 21 - 26, 2009

Dual Hand Extraction Using Skin Color and Stereo Information∗
Thien Cong Pham, Xuan Dai Pham, Dung Duc Nguyen, Seung Hun Jin and Jae Wook Jeon, Member, IEEE
School of Information and Communication Engineering
Sungkyunkwan University
Suwon, Korea

Abstract—Extracting the positions of hands is an important step in Human-Computer Interaction and Robot Vision applications. Posture and gesture can be extracted from hand positions and the appropriate task can be performed. In this paper, we propose an approach to extract hand images using skin color and stereo information. Our method does not require a clear hand size or high-quality disparity. With a sound training database and an adequate working environment, we obtain nearly 100 percent accuracy. The run time, excluding disparity generation, is also acceptable.

Index Terms—Hand extraction, skin detection, stereo vision.

I. INTRODUCTION

Computer Vision plays an important role in Human-Computer Interaction (HCI). Much research has been done and many papers have been published in this field. Among these topics, hand extraction has been of recent interest. Extraction of hand information is a task required for robots to understand our commands. The robot reads commands using a camera, taking images of the environment, extracting hand information and finding the appropriate operation to perform. Hand extraction is usually not the final step in an HCI application.
Later steps may include posture recognition or gesture recognition.

This paper concentrates on hand extraction to provide information on hand position and hand shape. Our inputs are the left color image, the right color image and the disparity image of the environment. Our target result is an image that contains only the two hands. Fig. 1 is an example of hand extraction. The more challenging final result is to obtain a clean and clear hand image. That is, there should not be too much noise or many holes in the resulting image.

Fig. 1. An example of hand extraction. (a) Original left image. (b) Hand result image.

In [1], the authors use only one hand and consider the plane that contains the back of the hand to estimate hand position and orientation. In our program, we have two hands in the camera view and we find their locations. The two hand planes in our application do not lie in approximately the same plane. That is, the distances from the left and right hands to the camera can differ. In [2], the authors use a special regular diffuser to detect hands. Requiring more calculation time, the authors in [3] use image segmentation. Our work, using a different approach, uses only a camera and requires fewer calculations.

Our program performs correctly under an operational environment that satisfies these conditions:
• The operational environment is indoors, either during day-time or night-time.
• The left hand, right hand and face do not overlap in the camera view.
• The difference in the vertical positions of the left and right hands should not be great.

∗ This work is partially supported by Samsung Electronics.

The rest of this paper is organized as follows: in Section

978-1-4244-2679-9/08/$25.00 ©2008 IEEE 330

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SAO CARLOS. Downloaded on August 03,2010 at 11:18:39 UTC from IEEE Xplore. Restrictions apply.
II, we describe related work and our method for skin-based detection. This includes training the model and extracting skin pixels. Section III explains some methods to generate disparity images for stereo input images and the method that we use. Section IV describes a technique that uses the disparity information and the skin-based image to remove background and small noises; a technique to detect connected components is used in this section. In Section V, we combine the results from the previous sections to correctly generate images of the two hands. Our experimental results, together with the operational environment, are provided in Section VI. Conclusions and future work are described in Section VII.

II. SKIN DETECTION

A. Related works

A survey of techniques related to skin-based detection is provided in [4]. There are four groups of methods.

1) Explicitly defined skin region: This is the simplest and most static method. An (R, G, B) pixel is classified as skin if

    R > 95 and G > 40 and B > 20 and
    max{R, G, B} − min{R, G, B} > 15 and        (1)
    |R − G| > 15 and R > G and R > B.

We experimented to check the performance of this method. The results were of poor quality, with much noise and many holes. In addition, the quality was dependent on illumination. We conclude that this method is inadequate for our application.

2) Nonparametric skin distribution modelling: Three main types are reported:
• Normalized lookup table (LUT).
• Bayes classifier.
• Self Organizing Map.

In LUT, the colorspace is divided into a number of bins, each representing a particular range of color component value pairs. These bins form a 2D or 3D (depending on the number of dimensions of the colorspace) histogram called a reference lookup table (LUT). Each bin stores the number of times this particular color occurred in the training skin sample images. The values of the lookup table bins constitute the likelihood that the requested color corresponds to skin. Similarly, in the Bayes classifier, not only P(c|skin) is calculated, but also P(c|¬skin) is counted. The Self Organizing Map, using the famous Compaq skin database provided by [5], is reported to be marginally better than the Mixture of Gaussians model. In the Self Organizing Map method, both skin-only and skin + non-skin labelled images are used to train the model. Many colorspaces were tested with the Self Organizing Map detector.

The advantages of non-parametric methods are that they are rapid to train and use, and they are independent of the shape of the skin distribution. Their drawbacks include storage space and the inability to interpolate or generalize from the training data.

3) Parametric skin distribution modelling: The four main reported types are:
• Single Gaussian.
• Mixture of Gaussians.
• Multiple Gaussian clusters.
• Elliptic boundary model.

In Single Gaussian, the skin color distribution is modelled by a Gaussian joint probability density function, as follows:

    p(c|skin) = (1 / (2π |Σ_s|^{1/2})) · exp(−(1/2)(c − µ_s)^T Σ_s^{−1} (c − µ_s)).        (2)

In (2), c is the color vector, and µ_s and Σ_s are the mean vector and covariance matrix of the distribution. The p(c|skin) probability can alternatively be used as a skin-likeness measurement. The Mixture of Gaussians method considers the skin color distribution as a mixture of Gaussian probability density functions:

    p(c|skin) = Σ_{i=1..k} π_i · p_i(c|skin),        (3)

where k is the number of mixture components and π_i are the mixture parameters. By approximating the skin color cluster with three 3D Gaussians in YCbCr colorspace, the Multiple Gaussian Clusters method was proposed. A pixel is classified as skin if the Mahalanobis distance from the c color vector to the nearest model cluster center is less than a threshold. In another approach, the Elliptic boundary model claims that the Gaussian model shape is insufficient to approximate the skin color distribution; instead, it proposes an elliptical boundary model.

The performance of these methods clearly depends on the distribution shape in the particular application. Some research proves the correctness of the distribution choice for specific cases. Our work, described in the next part of this section, uses the Mixture of Gaussians model.

4) Dynamic skin distribution models: Rather than fixing the skin distribution model, methods of this type tune the model dynamically under different operational conditions. These methods require rapid training and classification. In addition, they should have the ability to adapt to changing conditions. The complexity of this type of method is clearly greater than that of the earlier methods.

Asserting the best method is outside the scope of this paper. A comparative evaluation of skin-based detection methods can be found in [4].

B. Our work

After considering the properties of these skin detection methods, we concluded that the Mixture of Gaussians is the best fit for our program. In [6], the HSV colorspace is used to obtain better tolerance to illumination. Operational conditions in [6] change very rapidly, because they detect skin in real video.
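As an illustration, rule (1) translates directly into code. The following Python sketch (the function name is ours) classifies a single (R, G, B) pixel:

```python
def is_skin_explicit(r, g, b):
    """Classify an (R, G, B) pixel as skin using the explicit rule (1)."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15   # sufficient color spread
            and abs(r - g) > 15                    # red clearly separated from green
            and r > g and r > b)                   # red dominates
```

As the text above notes, such a static rule is illumination-dependent and tends to produce noisy results.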

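The densities in (2) and (3) can be sketched in pure Python for the two-dimensional chrominance vectors used in this kind of model. The parameter values one would plug in (means, covariances, mixture weights) come from training and are not reproduced in the paper, so the values in any call are placeholders:

```python
import math

def gaussian_density(c, mu, cov):
    """Single-Gaussian skin likelihood, eq. (2), for a 2-D vector c.
    cov is a 2x2 covariance matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b                                   # |Sigma_s|
    inv = [[d / det, -b / det], [-b / det, a / det]]      # 2x2 matrix inverse
    dx, dy = c[0] - mu[0], c[1] - mu[1]
    # Mahalanobis quadratic form (c - mu)^T Sigma^-1 (c - mu)
    m = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.exp(-0.5 * m) / (2.0 * math.pi * math.sqrt(det))

def mixture_density(c, weights, mus, covs):
    """Mixture-of-Gaussians skin likelihood, eq. (3): sum_i pi_i * p_i(c|skin)."""
    return sum(w * gaussian_density(c, mu, cov)
               for w, mu, cov in zip(weights, mus, covs))
```

A pixel would then be labelled skin when `mixture_density` exceeds a chosen threshold; the threshold itself is application-specific.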

In our application, we use the LUV colorspace and ignore the luminance part of the pixel's information.

Since the database in [5] is no longer freely available, our model is trained using two sources:
• The skin color database provided by Massey in [7].
• A skin color database created by ourselves: we capture sample images in the same operational environment as our target hand extraction program. We then manually erase the non-skin parts of those images and retain the skin pixels.

Fig. 2 shows some of the samples used to train the Mixture of Gaussians model.

Fig. 2. Some images of our training database. (a) Images provided by Massey. (b) Images generated by ourselves.

We plot the samples and view the distribution shape to find the parameter k for the mixture of Gaussians in (3). Fig. 3 shows the plot of part of our database. This part of the database is best approximated by a mixture of two Gaussians, because its shape contains two peaks. In our real experiment, the database contains day-time and night-time images; each part alone is fit with k = 2, so we use k = 4 for the entire database.

Fig. 3. Plotting the sample database to guess the number of Gaussians to train.

Fig. 4 is an intermediate result after the skin detection step. Background and noise remain in the resulting image. We will filter those pixels in future steps.

Fig. 4. Intermediate result after skin detection. (a) Original left image and (b) Skin result image.

III. DISPARITY IMAGE GENERATION

A. Related work

Much current research focuses on generating disparity images. The official website [8] shows a number of recent publications. A taxonomy and evaluation of these methods is provided in [9]. This paper does not focus on disparity images; our purpose is only to consider disparity as an input component. Graph-Cut, a global matching method described in [10], [11], is reported to be the best method.

B. Our work

To improve extractor performance, disparity images should satisfy these requirements:
• High accuracy.
• Short running time.

The input images, including the raw skin-based image and the disparity image, are not required to be highly accurate; we will combine them to extract high-quality results. Run time is considered in order to increase the system frame rate.

The method in [10] is chosen, because it provides adequate results in an acceptable run time. Fig. 5 shows a sample of the disparity images generated using [10].
Fig. 5. Sample disparity images of Fig. 4(a), generated by Graph Cut [10].

IV. BACKGROUND AND SMALL NOISE REMOVAL

Disparity information is useful for removing the background from the raw skin-based image. The value range of the disparity is 0 → 255. In this paper, using the range 55 → 200, we can simply remove the background. The appropriate range-threshold can be re-adjusted for different applications.

We follow these two steps to remove small noises:
• Every connected component is extracted.
• Small connected components containing less than a pre-defined number of pixels are removed.

Fig. 6 shows the intermediate result. At this step, the background is filtered using the disparity range. In addition, all connected components that have a small number of pixels are considered noise and are removed from the image. The resulting image, Fig. 6(b), still has some big components, located at the bottom left corner. These will be filtered in the next step.

Fig. 6. Intermediate result after filtering background and small noises. (a) Original left image. (b) Intermediate result that contains only the face, hands and big noises.

V. HAND EXTRACTION

After removing the background and small noises, the intermediate result contains only the face, the two hands and big noises. Some research uses many hand features or limitations to rapidly and easily extract hands. In [12], the hand is always the largest component inside the camera view. In [13], many hand shapes must be trained before extraction. In another approach, we use skin and depth information from both hands. This makes the extractor powerful without being too time consuming.

The relative information of any two components is considered to extract hands. Before designing an evaluation function to extract hands, we need to define some features of the detected connected components:
• Size of a component A: size(A), the number of pixels in A.
• Size index of a component A: size index(A), the index of A, using size information, compared to the other remaining components. The largest component has size index 0; the second largest component has size index 1; the smallest component has size index equal to the number of components minus 1.
• Average height of a component A: avg height(A), the average vertical position of all pixels in A.
• Average disparity of a component A: avg disp(A), the average disparity value of all pixels in A.
• Disparity index of a component A: disp index(A), the index of A, using average disparity information, compared to the other remaining components. The component with the greatest average disparity value has disparity index 0; the component with the second largest average disparity value has disparity index 1; the component with the smallest average disparity value has disparity index equal to the number of components minus 1.

We define the evaluation function as follows:

    f(A, B) = f1(A, B) + f2(A, B) + f3(A, B) + f4(A, B),        (4)

where A and B are any two components; f1, f2, f3 and f4 are sub-functions that respectively represent the size index, size difference, disparity index and height difference of A and B.
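The two filters of Section IV, the disparity-range cut and the small-component removal, can be sketched in pure Python. The 4-connectivity choice and all names are our assumptions; a real implementation would run on full 640×480 images:

```python
from collections import deque

def remove_background_and_noise(skin, disp, lo=55, hi=200, min_size=4):
    """Keep skin pixels whose disparity lies in [lo, hi], then drop
    connected components smaller than min_size pixels (4-connectivity).
    skin: 2-D list of 0/1; disp: 2-D list of disparity values 0..255."""
    h, w = len(skin), len(skin[0])
    keep = [[1 if skin[y][x] and lo <= disp[y][x] <= hi else 0
             for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if keep[y][x] and not seen[y][x]:
                comp, q = [], deque([(y, x)])
                seen[y][x] = True
                while q:                          # BFS over one component
                    cy, cx = q.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and keep[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                if len(comp) >= min_size:         # small components are noise
                    for cy, cx in comp:
                        out[cy][cx] = 1
    return out
```

The component lists collected during the BFS are also exactly what the feature definitions above (size, average height, average disparity) would be computed from.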


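Anticipating the sub-function definitions in (5)-(8) below, the pairwise evaluation in (4) can be sketched as follows. Representing each detected component as a plain dict is our assumption, and the middle return value of (8), garbled in our copy, is reconstructed as 5 from the surrounding pattern:

```python
import itertools

def f1(A, B):
    """Size-index sub-function, per eq. (5)."""
    if A["size_index"] <= 3 and B["size_index"] <= 3:
        return 5
    if A["size_index"] <= 5 and B["size_index"] <= 5:
        return 3
    return 0

def f2(A, B):
    """Size-difference sub-function, per eq. (6)."""
    d = abs(A["size"] - B["size"])
    m = max(A["size"], B["size"])
    if d < m / 10:
        return 5
    if d < m / 5:
        return 3
    if d < m / 3:
        return 1
    return 0

def f3(A, B):
    """Disparity-index sub-function, per eq. (7)."""
    if A["disp_index"] <= 3 and B["disp_index"] <= 3:
        return 5
    if A["disp_index"] <= 5 and B["disp_index"] <= 5:
        return 3
    return 0

def f4(A, B):
    """Height-difference sub-function, per eq. (8)."""
    d = abs(A["avg_height"] - B["avg_height"])
    if d < 70:
        return 10
    if d < 100:
        return 5
    if d < 150:
        return 3
    return 0

def score(A, B):
    """Evaluation function f(A, B) of eq. (4)."""
    return f1(A, B) + f2(A, B) + f3(A, B) + f4(A, B)

def extract_hands(components):
    """Return the pair of components with the largest score: the two hands."""
    return max(itertools.combinations(components, 2),
               key=lambda pair: score(*pair))
```

With a hypothetical face component plus two hand components, `extract_hands` selects the hand pair, since the two hands share similar size, similar height and the greatest disparity (closeness to the camera).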
Each sub-function in (4) is defined as follows. By observing that the hands, together with the face, are usually the biggest components, we define the size index sub-function as:

    f1(A, B) = 5, if size index(A) ≤ 3 and size index(B) ≤ 3;
               3, if size index(A) ≤ 5 and size index(B) ≤ 5;        (5)
               0, otherwise.

The two hands, in our application, usually have the same size. We define the size difference component of the evaluation function as:

    f2(A, B) = 5, if |size(A) − size(B)| < (1/10) · max(size(A), size(B));
               3, if |size(A) − size(B)| < (1/5) · max(size(A), size(B));        (6)
               1, if |size(A) − size(B)| < (1/3) · max(size(A), size(B));
               0, otherwise.

We observe that the two hands are usually the closest components to the camera, causing their average disparity values to be larger than any others. We define the disparity index sub-function as:

    f3(A, B) = 5, if disp index(A) ≤ 3 and disp index(B) ≤ 3;
               3, if disp index(A) ≤ 5 and disp index(B) ≤ 5;        (7)
               0, otherwise.

The final part of the evaluation function comes from observing that the two hands usually have a close average height value. We define the height difference sub-function as:

    f4(A, B) = 10, if |avg height(A) − avg height(B)| < 70;
               5,  if |avg height(A) − avg height(B)| < 100;        (8)
               3,  if |avg height(A) − avg height(B)| < 150;
               0,  otherwise.

Every pair of components is evaluated using this evaluation function. The pair providing the largest return value is taken as the two hands. The return values of all the sub-functions of f can also be re-adjusted to fit the appropriate application.

VI. EXPERIMENT

A. Working environment

We used a standard personal computer for the experiments:
• CPU: AMD Athlon 64 X2 Dual 3800+.
• Memory: 1GB RAM.
• Operating System: Windows XP 32-bit SP3.
• Camera: BumbleBee 2 ICX204, a product of Point Grey Research Inc. [14].
• Frame resolution: 640×480.
• Operational environment: indoor (either day time or night time).

B. Result

We chose an environment close to the training environment for the experiment. About 80 sample images were used in the training process. More than 200 images were evaluated by our extractor. Table I and Table II summarize the statistical information for our experiment.

For Table I, we captured 120 stereo images in the morning or afternoon, extracted the hand results, and entered them in the upper row; these were considered the results under day-time conditions. For night conditions, we captured another 120 stereo images in the evening, extracted the hand results, and entered them in the lower row. The statistical information shows that we obtained 100 percent accuracy.

TABLE I
ACCURACY STATISTICS

              Total frames   Correct frames   Correct percentage
Day-time      120            120              100%
Night-time    120            120              100%

We calculated the average running times of skin-based detection, noise removal and hand extraction and entered them in Table II. The statistical information shows that the speed of the entire process is good. The remaining time-consuming step is generating the disparity by Graph Cut. In our test, this usually takes around 7 minutes for one frame. This problem can be overcome by using BVZ [11] or Census [15] with the help of an FPGA. Our current Census software result is 5 seconds for one frame. We will apply this to our application in future work.

TABLE II
RUNNING TIME STATISTICS

Skin extraction   Noises removal & hand extraction   Total
30 ms/frame       90 ms/frame                        120 ms/frame

Fig. 7 shows some of our hand extraction results. Some holes in the results can be simply removed using the morphological opening operation (chapter 9, [16]). In later steps, for some applications, hole removal is sometimes unnecessary. Therefore, we chose not to implement it in our work, as it would increase the run time.

VII. CONCLUSION AND FUTURE WORK

Our work, under specific conditions, uses skin color and disparity information to extract the images of two hands. The
Fig. 7. Experimental results. (a), (b) and (c) are the original left color images. (d), (e) and (f) are the corresponding hand extraction results.

program has been implemented and tested. Accuracy is high and the run time is acceptable.

Future research will focus on posture and gesture recognition. The extracted hands will provide us with features related to hand position and hand posture. The gesture recognizer will be taught using a training model. In [17], a Hidden Markov Model is used with a high reported accuracy. We will apply this model to our gesture recognizer.

ACKNOWLEDGMENT

This research was performed as part of the Samsung Project on Gesture Recognition, funded by Samsung Electronics, Republic of Korea.

REFERENCES

[1] A. Sepehri, Y. Yacoob and L. S. Davis, "Estimating 3D Hand Position and Orientation Using Stereo," Proc. of Conference on Computer Vision, Graphics and Image Processing, pp. 58-63, 2004.
[2] L. W. Chan, Y. F. Chuang, Y. W. Chia, Y. P. Hung and J. Y. Hsu, "A New Method for Multi-finger Detection Using a Regular Diffuser," Proc. of International Conference on Human-Computer Interaction, pp. 573-582, 2007.
[3] X. Yin, D. Guo and M. Xie, "Hand Image Segmentation using Color and RCE Neural Network," International Journal of Robotics and Autonomous Systems, pp. 235-250, 2001.
[4] V. Vezhnevets, V. Sazonov and A. Andreeva, "A Survey on Pixel-based Skin Color Detection Techniques," Proc. of Graphicon, pp. 85-92, 2003.
[5] M. J. Jones and J. M. Rehg, "Statistical Color Models with Application to Skin Detection," Proc. of Computer Vision and Pattern Recognition, vol. 1, pp. 274-280, 1999.
[6] L. Sigal, S. Sclaroff and V. Athitsos, "Skin Color-Based Video Segmentation under Time-Varying Illumination," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 862-877, 2004.
[7] F. Dadgostar, A. L. C. Barczak and A. Sarrafzadeh, "A Color Hand Gesture Database for Evaluating and Improving Algorithms on Hand Gesture and Posture Recognition," Research Letters in the Information and Mathematical Sciences, vol. 7, pp. 127-134, 2005.
[8] The Middlebury website. [Online]. 2008.
[9] D. Scharstein and R. Szeliski, "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms," International Journal of Computer Vision, vol. 47, pp. 7-42, 2002.
[10] V. Kolmogorov and R. Zabih, "Multi-camera Scene Reconstruction via Graph Cuts," Proc. of European Conference on Computer Vision, pp. 82-96, 2002.
[11] Y. Boykov, O. Veksler and R. Zabih, "Efficient Approximate Energy Minimization via Graph Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 1222-1239, 2001.
[12] S. J. Schmugge, M. A. Zaffar, L. V. Tsap and M. C. Shin, "Task-based Evaluation of Skin Detection for Communication and Perceptual Interfaces," Journal of Visual Communication and Image Representation, pp. 487-495, 2007.
[13] E. J. Ong and R. Bowden, "A Boosted Classifier Tree for Hand Shape Detection," Proc. of Automatic Face and Gesture Recognition, pp. 889-894, 2004.
[14] Point Grey Research Inc. [Online]. 2008.
[15] J. Woodfill and B. V. Herzen, "Real-Time Stereo Vision on the PARTS Reconfigurable Computer," IEEE Symposium on FPGAs for Custom Computing Machines, pp. 201-210, 1997.
[16] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd Edition. Prentice Hall, 2008.
[17] H. K. Lee and J. H. Kim, "An HMM-Based Threshold Model Approach for Gesture Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 961-973, 1999.

