
16th IEEE International Conference on Robot & Human Interactive Communication, MP-10
August 26-29, 2007 / Jeju, Korea

Hand Gesture Recognition To Understand Musical Conducting Action


Hongmo Je, Jiman Kim, and Daijin Kim
Intelligent Multimedia Laboratory, Dept. of Computer Science & Engineering
Pohang University of Science and Technology (POSTECH), Pohang, Korea
{invu71,jmk,dkim}@postech.ac.kr

Abstract- This paper deals with the understanding of four musical time patterns and three tempos that are generated by a human conductor of a robot orchestra or an operator of a computer-based music play system, using hand gesture recognition. We use only a stereo vision camera, with no extra special devices such as a sensor glove, a 3D motion capture system, an infra-red camera, or an electronic baton. We propose a simple and reliable vision-based hand gesture recognition using the conducting feature point (CFP), the motion-direction code, and motion histogram matching. The proposed hand gesture recognition system operates as follows. First, it extracts the human hand region by segmenting the depth information generated by stereo matching of image sequences. Next, it follows the motion of the center of gravity (COG) of the extracted hand region and generates gesture features such as the CFP and the direction code. Finally, we obtain the current timing pattern of the music's beat and tempo by the proposed hand gesture recognition using either CFP tracking or motion histogram matching. The experimental results show that the musical time pattern and tempo recognition rates are over 86% on the test data set when motion histogram matching is used.

I. INTRODUCTION

Hand gesture recognition, as a pattern recognition and analysis problem, is important because the motion of human hands can provide machines with abundant information about human intention and implicit meaning in the real world. Many reports on intelligent human-machine interaction using hand gesture recognition have already been presented [1]; they can be mainly divided into "Data Glove-based" and "Vision-based" approaches.

The "Data Glove-based" methods use a special input device, the "hand data sensor glove", to digitize hand and finger motions into multi-parametric data. It is possible to analyze hand motion in 3D space with the sensing data. However, the device is expensive and users might feel uncomfortable when they communicate with a machine [2].

The "Vision-based" methods use only a vision sensor, i.e., a camera [3]. In general, the entire system of vision-based hand gesture recognition can be much simpler than the Data Glove-based approach, and it enables human-friendly interaction with no extra device. Vision-based hand gesture recognition is a challenging problem in the field of computer vision and pattern analysis, since it involves difficult algorithmic problems such as camera calibration, image segmentation, and feature extraction.

Conducting a music band is a highly sophisticated art that has matured over centuries [4]. Recently, some researchers in the field of human-computer interaction have also been concerned with creating machine-based music play systems, which include intelligent robots or computers and consider the conductor's desired beat and tempo. The first electronic orchestra, with a complex performance database and Musical Instrument Digital Interface (MIDI) controllers, responds to the gestures of the conductor through a sensor glove; a special-purpose electronic baton was also introduced in [5]. It can identify the 4/4 beat timing pattern by following the motion of the electronic baton in the right hand, while recognizing the play speed by tracking the 3D motion of the left hand wearing a sensor glove. Another method using a sensor glove was proposed by Winkler [6]. In Modler [7], neural networks for mapping hand gestures into music play parameters in a "virtual musical system" are described. In addition, research on mapping gestures into music using 3D motion data captured by a commercial 3D motion capture system (Vicon 8) has been reported [8].

Bien and Kim [9] suggested a vision-based method for understanding a human's conducting action for a chorus with a special-purpose light baton and an infra-red camera. They proposed a vision system which captures image sequences, tracks each end-point of the baton (a stick with a distinctive color feature that a camera can detect easily), and analyzes the conducting action by fuzzy-logic-based inference. Lately, Watanabe and Yachida [10] have proposed a real-time interactive virtual conducting system using Principal Component Analysis (PCA)-based gesture recognition that can identify only the 3/4 time pattern.

In general, conductors perform various pieces of music using both hands, and natural conducting actions may be very difficult to represent. Hence, we make the following assumptions to alleviate the problem:

• The conductor uses only one hand.
• The conducting action must be within the view range of the camera.
• The conductor may indicate four timing patterns (2/4, 3/4, 4/4, 6/8; see Fig. 1) with three tempos (Andante, Moderato, Allegro) by his/her hand motion.
• The conductor needs no special devices.

We propose a very simple but reliable vision-based hand gesture recognition of the music conductor with no extra devices. Unlike previous vision-based hand gesture recognition, we use the depth information generated by a stereo vision camera, instead of the intensity or color information of the image, to extract the human hand region, which is the key region of interest (ROI) in this application.

978-1-4244-1635-6/07/$25.00 ©2007 IEEE. 163


Fig. 1. Four typical musical time patterns: (a) 2/4, (b) 3/4, (c) 4/4, (d) 6/8.

Our proposed system can obtain the motion velocity and direction by tracking the center of gravity (COG) of the hand region, which provides the speed of any conducting time pattern. We introduce two methods to recognize the musical time pattern. One is CFP tracking, which uses only salient features such as the conducting feature point; the other is motion histogram matching, which can identify the time pattern and the tempo at once, with the "Mahalanobis distance" chosen as the distance metric of the motion histogram matching.

The remainder of this paper is organized as follows. Section II describes the hand segmentation of our system. Section III presents the proposed hand gesture recognition for understanding the time pattern and music tempo. Section IV presents the experimental results on both simulated and real-world videos. Finally, Section V draws conclusions and discusses future work.

II. HAND SEGMENTATION

Hand segmentation is the first step of our proposed hand gesture recognition system; it separates the human hand region from the rest of the scene. Most methods for hand segmentation use skin color information to extract the hand region. Skin color-based hand region extraction is quite simple, but it is sensitive to changes in lighting conditions and to complicated, cluttered backgrounds that contain many skin-like colored objects such as wood and wallpaper. We use the depth information of a stereo image instead of the 2D pixel image. The depth information is not only insensitive to lighting conditions but also robust to a complicated background.

Normally, the members of an orchestra must concentrate their attention on the conductor's face and hands to read his/her intention, and the conductor's hands are placed in front of the body when conducting. Based on this fact, we utilize a face detector to find the human face, which allows us to locate the hand candidate region easily, because the hand region must be closer to the stereo camera than the face region. Fig. 2 shows how the hand candidate region changes according to the distance of the human face from the camera.

The exact hand region can be completely segmented after postprocessing of the extracted hand candidate region. Fig. 3 shows several postprocessing stages, such as the morphological operator and the connected component analysis used to remove the non-hand region [11].

To track the motion of the hand, the COG of the hand region needs to be computed. We approximate the COG by the mean coordinates of the segmented hand region as

    x_cog = (1/N) Σ_{i=1}^{N} x_i,    y_cog = (1/N) Σ_{i=1}^{N} y_i,    (1)

where x_i and y_i are the x and y coordinates of the i-th pixel, respectively, and N is the number of pixels in the hand region.

III. THE PROPOSED HAND GESTURE RECOGNITION FOR UNDERSTANDING THE MUSICAL TIME PATTERN AND TEMPO

In contrast to gesture recognition for hand sign language, hand gesture recognition for understanding a musical time pattern and tempo does not have to be exact everywhere. While a slight posture or movement of the hands in hand sign language carries an independent and important meaning, only salient features, such as the beat transition points of the hand motion, carry the important information in a conducting gesture.

A. The direction code of the hand motion

The easiest way to find the trajectory of the conducting gesture is to track the motion direction of the hand. We obtain the direction angle of the hand motion from the difference between the previous COG of the hand region and the current one as

    Δx_cog(t) = x_cog(t) − x_cog(t−1),
    Δy_cog(t) = y_cog(t) − y_cog(t−1),                    (2)
    θ(t) = arctan( Δy_cog(t) / Δx_cog(t) ),

where θ(t) is the direction of the hand movement at time t. To represent the direction code, the real-valued direction is quantized into eight directions.
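The depth-based segmentation of Section II and the COG and direction coding of Eqs. (1)-(2) can be sketched in a few lines. This is a minimal NumPy sketch, not the authors' implementation; the depth margin value and the exact code-to-angle convention (taken here as code 0 = rightward, increasing counter-clockwise) are illustrative assumptions:

```python
import numpy as np

def hand_candidate_mask(depth, face_depth, margin=0.15):
    """Hand-candidate segmentation (Sec. II): keep pixels closer to the
    camera than the detected face.  `margin` (metres) is an assumed minimum
    face-to-hand depth gap, not a value from the paper."""
    return depth < (face_depth - margin)

def center_of_gravity(mask):
    """Eq. (1): the COG as the mean coordinates of the segmented pixels."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

def direction_code(prev_cog, cur_cog):
    """Eq. (2) plus quantization: the angle of the COG displacement is
    quantized into eight 45-degree sectors (the code layout of Fig. 4)."""
    dx = cur_cog[0] - prev_cog[0]
    dy = cur_cog[1] - prev_cog[1]
    theta = np.arctan2(dy, dx)      # full-range angle, unlike plain arctan
    return int(np.round(theta / (np.pi / 4))) % 8
```

Note that `arctan2` is used instead of the plain `arctan` of Eq. (2) so that all eight sectors are distinguishable; in image coordinates the y-axis points downward, so the sign of dy is flipped relative to screen-up motion.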

Fig. 2. Real images vs. depth images of the hand candidate regions: (a),(d) a human conductor stands at 1.0 m from the camera, (b),(e) 1.5 m from the camera, (c),(f) 2.0 m from the camera.

Fig. 3. Postprocessing: (a) the extracted hand candidate region, (b) noise removal by morphological operation, and (c) component merging by connected component analysis.

Fig. 4 shows the three-bit codes for the eight dominant directions of hand movement.

B. Conducting feature point

Instead of analyzing the entire sequence of motion directions, we simply track the salient features of the conducting gestures. Fig. 6 illustrates these representative features, called "Conducting Feature Points (CFPs)", for each musical time pattern. For example, the 2/4 time pattern has three CFPs: 3to6, 6to1, and 1to3. If the newly arriving CFP is 6to1 while the previous CFP is 3to6 (or the start point of the initial gesture), then the gesture recognition system expects the next CFP to be 1to3. Thus, the recognition system can identify the time pattern by observing the sequence of CFPs.

C. Motion histogram matching

Although the analysis of CFP sequences is reliable for identifying musical time patterns, it can fail when the system misses an important CFP. This can occur for a complicated time pattern like 6/8, which shows large variation among different human conductors. To avoid this problem, we propose a motion histogram matching based on the analysis of the musical time pattern and tempo. Fig. 5 plots the average motion histogram of direction codes for each musical time pattern at the moderato tempo.

We can obtain a cycle of each time pattern, where one cycle means a sequence of direction codes from the start point of the time pattern to its end point (usually the start point and the end point are the same).
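The CFP extraction and sequence check of Sec. III-B can be sketched as follows. Only the 2/4 cycle (3to6, 6to1, 1to3) is taken from the text; treating every change of direction code as a CFP is a simplification of the salient-feature tracking, and the cycle-dictionary layout is an illustrative assumption:

```python
# Expected CFP cycle per time pattern; entries for 3/4, 4/4, and 6/8
# would be read off Fig. 6 in the same fashion.
CFP_CYCLES = {
    "2/4": ["3to6", "6to1", "1to3"],
}

def extract_cfps(codes):
    """Reduce a direction-code stream to CFP labels at code transitions."""
    return [f"{p}to{c}" for p, c in zip(codes, codes[1:]) if p != c]

def matches_cycle(cfps, cycle):
    """True if the observed CFPs repeat the expected cycle in order."""
    return bool(cfps) and all(
        c == cycle[i % len(cycle)] for i, c in enumerate(cfps))
```

A real implementation would also tolerate spurious transitions between beats; here an observed stream matches only if it follows the cycle exactly, which is enough to show the sequence-checking idea.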

In general, most musical time patterns have the "3to6" type of CFP as the start point of their action.

In the training stage, we collect the histogram vectors H = [h_0, h_1, ..., h_7], where h_0 to h_7 count the occurrences of each direction code over one cycle, and obtain the statistics (mean, variance) of the motion histogram for all combinations of time patterns (2/4, 3/4, 4/4, 6/8) and tempos (Andante, Moderato, Allegro). The mean and variance vectors H_μ and H_σ of the histogram vectors are computed as

    H_μ = (1/N) Σ_i H_i,                                  (3)
    H_σ = (1/(N−1)) Σ_i (H_i − H_μ)²,

where N is the number of training data.

Thus, we have twelve profiles of motion histograms. Table I lists the profile index for the time patterns and tempos. For example, H_μ,1 represents the mean histogram of 2/4 with the moderato tempo, and H_σ,11 represents the variance histogram of 6/8 with the allegro tempo. We selected the "Mahalanobis distance" as the metric of the motion histogram matching:

    MD_k = (H_c − H_μ,k)ᵀ H_σ,k⁻¹ (H_c − H_μ,k),          (4)

    ProfileIndex = arg min_k MD_k,                         (5)

where H_c is the current motion histogram and k is the profile index given in Table I. By Eq. (4), the similarity scores for all profiles are evaluated, and the proposed musical time pattern and tempo recognition system identifies the time pattern and tempo by taking the profile whose similarity score is the minimum, as in Eq. (5).

Fig. 4. Eight direction codes (three-bit codes): 000(0), 001(1), 010(2), 011(3), 100(4), 101(5), 110(6), 111(7).

TABLE I
THE PROFILE INDEX FOR THE TIME PATTERNS AND TEMPOS.

    index         0    1    2     3    4    5     6    7    8     9    10   11
    Time pattern  2/4  2/4  2/4   3/4  3/4  3/4   4/4  4/4  4/4   6/8  6/8  6/8
    Tempo         And. Mod. Alle. And. Mod. Alle. And. Mod. Alle. And. Mod. Alle.

TABLE II
THE RECOGNITION RESULTS IN CONFUSION MATRIX FORM: USING CFP SEQUENCE ANALYSIS.

    CFP   0   1   2   3   4   5   6   7   8   9  10  11
     0   41   0   1   0   2   1   1   2   0   1   0   1
     1    0  42   0   0   0   1   1   0   0   2   3   1
     2    0   0  40   0   0   0   3   1   1   1   2   0
     3    0   1   0  39   0   0   1   3   2   1   2   2
     4    0   2   0   0   0  42   1   0   1   0   1   1
     5    1   0   0   2   1  39   0   4   1   0   0   2
     6    3   0   1   0   1   1  38   1   0   3   1   1
     7    0   2   2   0   0   0   0  40   2   3   1   0
     8    0   0   0   1   2   2   3   2  39   1   0   0
     9    2   0   1   0   2   2   3   1   1  36   0   2
    10    0   0   1   2   1   0   2   1   1   1  39   2
    11    0   1   2   0   1   0   2   0   4   0   2  38

TABLE III
THE RECOGNITION RESULTS IN CONFUSION MATRIX FORM: USING MOTION HISTOGRAM MATCHING.

    CFP   0   1   2   3   4   5   6   7   8   9  10  11
     0   45   0   1   0   1   0   0   2   0   0   0   0
     1    0  44   0   0   0   0   1   0   0   2   2   1
     2    1   0  45   0   0   0   0   0   2   1   0   1
     3    0   1   0  47   0   1   0   0   1   0   0   0
     4    0   2   0   0  43   1   0   1   0   0   1   2
     5    1   0   0   2   0  44   0   3   0   0   0   0
     6    3   0   1   0   1   1  43   0   0   0   1   0
     7    0   2   2   0   0   0   0  41   2   3   0   0
     8    0   0   0   1   2   1   2   2  42   0   0   0
     9    2   0   1   0   2   0   2   0   1  40   0   2
    10    0   0   1   2   1   0   1   1   1   0  42   1
    11    0   1   2   0   1   0   2   0   3   0   1  40

IV. EXPERIMENTAL RESULTS

We used the Bumblebee stereo camera, which provides a depth map for each frame of stereo images. We collected conducting gesture data consisting of 150 cycles for each combination of time pattern and tempo. We divided them into 100 cycles for training the recognition system and 50 cycles for testing it. Tables II and III summarize the recognition results in confusion matrix form, obtained from the CFP sequence analysis and the motion histogram matching, respectively. As a result, the recognition rate of the CFP sequence analysis is 78.8% and that of the motion histogram matching is 86.5%.

V. CONCLUSION AND FUTURE WORK

We proposed a vision-based hand gesture recognition method for understanding the musical time pattern and tempo generated by a human conductor.
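For reference, the profile training and matching of Eqs. (3) to (5) amount to a minimum-Mahalanobis-distance classifier over direction-code histograms. A sketch, under the assumption that H_σ is used as a diagonal covariance (the training histograms below are synthetic, not the paper's data):

```python
import numpy as np

def profile_stats(histograms):
    """Eq. (3): per-bin mean and unbiased variance of training histograms."""
    H = np.asarray(histograms, dtype=float)
    return H.mean(axis=0), H.var(axis=0, ddof=1)

def mahalanobis(h_c, mean, var):
    """Eq. (4) with a diagonal covariance built from the per-bin variances."""
    d = h_c - mean
    return float(d @ (d / var))

def classify(h_c, profiles):
    """Eq. (5): return the profile index with the minimum distance."""
    return min(profiles, key=lambda k: mahalanobis(h_c, *profiles[k]))
```

In practice one entry of `profiles` would be built per (time pattern, tempo) combination, giving the twelve profiles of Table I.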

(a) 2/4   (b) 3/4   (c) 4/4   (d) 6/8

Fig. 5. The average motion histogram of direction codes for each time pattern with the moderato tempo.

(a) 2/4   (b) 3/4   (c) 4/4   (d) 6/8

Fig. 6. The real trajectories and the approximated directions of each conducting pattern (solid line: real trajectory, dashed line: motion direction, red circle: conducting feature point).

We used only a stereo vision camera, with no extra devices such as a motion capture system or data gloves. Instead of using the color pixel image, we used the depth information of the current stereo image. Face detection helped us to find the hand candidate regions, because we assumed that the conductor's hand motion is usually made in front of the face. With this assumption, we could extract the hand region faster and more easily.

We introduced the conducting feature points together with the direction codes, which simply indicate the musical time pattern and tempo generated by the hand motions. When the human conductor made a conducting gesture, our proposed system tracked the COG of the hand region and encoded the motion information into the direction code. We also suggested the motion histogram matching, which can identify the current musical time pattern and tempo simultaneously by finding the best-matched distribution of direction codes in a cycle. From numerous experiments, the recognition accuracies were 78.8% and 86.5% using the CFP sequence analysis and the motion histogram matching, respectively.

Currently, the proposed conducting gesture recognition method has two limitations. First, it cannot recognize decorative conducting actions such as crescendo (gradually stronger), decrescendo (gradually weaker), staccato (separated sounds), and various dynamics signs (mf, f, mp, p). We will extend the gesture recognition algorithm by using more powerful gesture recognition algorithms such as the HMM (Hidden Markov Model) and the DBN (Dynamic Bayesian Network), which are promising state-space models that can deal with complicated motions. Second, it can only treat the frontal view of the conductor, which means the conductor always needs to face the vision sensor, while some members of a robot orchestra might sit not directly in front but at another angle. We can deal with unrestricted hand motion in 3D space by considering the structural connectivity of the human body. If we have the hand motion information at any position, then the conductor does not have to face the camera.

It is also a challenging problem to implement a robot orchestra that has the ability to perform a virtual instrument and play music by following the conducting actions of the human conductor.

ACKNOWLEDGMENT

This research was performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs partially funded by the Ministry of Science and Technology of Korea. It was also partially supported by the Ministry of Education and Human Resources Development (MOE), the Ministry of Commerce, Industry and Energy (MOCIE), and the Ministry of Labor (MOLAB) through the fostering project of the Lab of Excellency.

REFERENCES

[1] Ying Wu and Thomas S. Huang, "Vision-based gesture recognition: A review," LNCS: Gesture-Based Communication in Human-Computer Interaction: International Gesture Workshop, vol. 1739, p. 103, 2004.
[2] A. Mulder, "Hand gestures for HCI," Technical Report 96-1, Simon Fraser University, 1996.
[3] F. Quek, "Toward a vision-based hand gesture interface," in Proceedings of Virtual Reality Software and Technology, Singapore, 1994, pp. 17-31.
[4] Ingo Grull, "Conga: A conducting gesture analysis framework," Ulm University, 2005.
[5] Hideyuki Morita, Shuji Hashimoto, and Sadamu Ohteru, "A computer music system that follows a human conductor," IEEE Computer, vol. 24, pp. 44-53, 1991.
[6] Todd Winkler, "Making motion musical: Gesture mapping strategies for interactive computer music," in Proceedings of the 1995 International Computer Music Conference, Canada, 1995, pp. 27-31.
[7] Paul Modler, Trends in Gestural Control of Music, pp. 301-313, IRCAM, 2000.
[8] Christopher Dobrian and Frederic Bevilacqua, "Gestural control of music using the Vicon 8 motion capture system," 2002.
[9] Zeungnam Bien and Jong-Sung Kim, "On-line analysis of music conductor's two-dimensional motion," San Diego, CA, USA, 1992, pp. 1047-1053.
[10] Takahiro Watanabe and Masahiko Yachida, "Real-time gesture recognition using eigenspace from multi-input image sequences," Systems and Computers in Japan, vol. 30, no. 13, pp. 810-821, 1999.
[11] R. Bloem, H. N. Gabow, and F. Somenzi, "An algorithm for strongly connected component analysis in n log n symbolic steps," LNCS 1954, pp. 37-54, Nov. 2000.

