
Device Agnostic 3D Gesture Recognition using Hidden Markov Models

Anthony Whitehead, Kaitlyn Fox


School of Information Technology
Carleton University
1125 Colonel By Drive, Ottawa, Ontario, Canada
{awhitehe},{kfox2}@connect.carleton.ca

ABSTRACT
Hidden Markov Models have been effectively used in pattern recognition systems in the past. In this work, we identify the necessary elements to successfully use an HMM system for 3D gesture recognition regardless of the sensor device being used. So long as the sensor system itself is capable of outputting information about the 3 axes of motion (X, Y, and Z), that information can be used in this generic model for accurate, high speed gesture recognition. The proposed system works with accelerometer data, positional data and gyro data alike.

Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User Interfaces - Input devices and strategies; K.8.0 [Personal Computing]: General - Games; I.5.5 [Pattern Recognition]: Implementation - Interactive systems

General Terms
Performance, Design, Experimentation, Human Factors, Theory.

Keywords
Gestures, HMM, Games.

1. INTRODUCTION
A hidden Markov model (HMM) is a statistical model, initially developed for speech recognition [1]. Some current applications of HMMs in gesture recognition include, among others, work by Kratz [2], Keskin [3], and Segen [4]. In a hidden Markov model, a sequence is modeled as an output generated by a stochastic process progressing through discrete time steps. At each time step, the process outputs a symbol (from a predefined alphabet) and moves from one state to the next state. Both actions, the transition from state to state and the emission of an alphabet symbol, follow probability distributions that define the model. Thus we can estimate these probabilities using a training-by-example process.

In a hidden Markov model, only the sequence of emitted symbols is observed. The path of states followed by the process is “hidden” from the observer. Given a hidden Markov model (trained by examples), M, and an input sequence S (a gesture), the standard question is whether S has the properties of the model M. Less precisely: Does the input gesture “look like” the training gestures? To answer, we compute the probability, P[S|M], of an observed sequence S (an input gesture) being generated by the (trained) model M. The log of the ratio of P[S|M] to the probability of generating S by chance is usually used as an assessment function.

2. Decomposing the HMM
To remain device agnostic, we require a mapping of sensor inputs to a common alphabet of symbols. This may seem quite difficult at first glance, given that different sensors record different types of information that are not altogether related. However, given the discrete nature of any sensor, it is relatively simple to output symbols based on a discrete segmentation of the sensor's range. Suffice it to say that the granularity of the decomposition depends primarily on the precision of the sensors being used. A finer granularity does not invalidate the model; it simply expands the alphabet. The only issue that remains is the computational requirement of an increased alphabet. It is also important to note that this decomposition requires a 1:1 mapping of alphabet symbols to states, i.e., each state outputs a unique symbol. Since our gestures are in 3D space, and our sensors emit 3-dimensional data, we carve our space into discrete sub-cubes. In our tests, we experimented with discrete segmentations of 3, 4 and 5 subspaces in each dimension, resulting in alphabet sizes of 27, 64 and 125. Our experiments show that an alphabet size of 27 allows real-time interactivity; the larger sizes are still too computationally expensive for effective real-time applications.
Figure 1. 7 Gestures
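
The sub-cube decomposition described in Section 2 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the single calibration range [lo, hi] shared by all three axes, the clamping of out-of-range readings, and the uniform "chance" model are all assumptions made for the example. The scoring function relies on the 1:1 symbol-to-state requirement: if every state emits exactly one symbol, the state path is fixed by the observed symbols, so P[S|M] reduces to a product of start and transition probabilities.

```python
import math
from typing import List, Sequence


def quantize(sample: Sequence[float], lo: float, hi: float, bins: int = 3) -> int:
    """Map one (x, y, z) reading to a sub-cube index in [0, bins**3).

    Assumption: every axis shares the same range [lo, hi].
    """
    symbol = 0
    for value in sample:
        # Clamp so out-of-range readings land in the edge sub-cubes.
        clamped = min(max(value, lo), hi)
        # Scale to a cell index; the min() guards the value == hi edge.
        cell = min(int((clamped - lo) / (hi - lo) * bins), bins - 1)
        symbol = symbol * bins + cell
    return symbol


def encode_gesture(samples: Sequence[Sequence[float]], lo: float, hi: float,
                   bins: int = 3) -> List[int]:
    """Turn a recorded gesture into a sequence of alphabet symbols."""
    return [quantize(s, lo, hi, bins) for s in samples]


def log_odds(symbols: Sequence[int], start: Sequence[float],
             trans: Sequence[Sequence[float]]) -> float:
    """log(P[S|M] / P[S|chance]) under the 1:1 symbol-to-state assumption.

    Assumptions: `symbols` is non-empty, all probabilities are smoothed
    (strictly positive), and "chance" means each of the k symbols is
    equally likely at every step.
    """
    k = len(start)
    score = math.log(start[symbols[0]] * k)
    for prev, cur in zip(symbols, symbols[1:]):
        score += math.log(trans[prev][cur] * k)
    return score
```

With bins=3 the encoder yields the 27-symbol alphabet; bins=4 and bins=5 yield the 64- and 125-symbol alphabets. A positive log-odds score means the trained model explains the sequence better than the uniform chance model does.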

2.1 Gestures in Training
To record gesture data, a sensor was placed on a subject’s wrist. We gathered and stored sensor data for the duration of a gesture performance. This procedure was repeated multiple times with several subjects to generate the training set. The gestures used in our experiments are shown in Figure 1.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
FuturePlay @ GDC Canada, May 12-13 2009, Vancouver, Canada.
Copyright 2009

2.2 The Number of HMM States
A balance needs to be struck between false negative results and false positive results in order for the system to be usable. We experimented with 27, 64, and 125 states in the HMM. Overall, the best results were achieved using 27 states, with the exception of the false positive rate, which increases slightly at 27 states. This slight increase is negligible, however, given the significant increase in correct classification rate attained at 27 states. This experiment also illustrates that providing more training samples results in better recognition rates.

Figure 2. Correct Classification Rates for 27, 64 and 125 state HMMs

The side benefit of using a smaller number of states is that the recognition time decreases significantly. We are able to achieve 800 gesture recognition tests per second on a 27 state HMM (2.6 GHz processor), while the 125 state HMM runs a single recognition task on the order of seconds.

2.3 Number of Samples in a Training Set
Previous experiments have shown that data from several subjects is required for easy replication by the general population. We conducted several experiments to determine the optimal number of samples required to generalize the gesture training data.

Figure 3. Effect of Training Sample Size on Classification Rates for 27, 64 and 125 state HMMs

As the number of gesture samples increases, the curves of the graphs begin to flatten. Based on these results, a set of around 250 samples in a training set is a sufficient and feasible amount of data to collect while still providing good detection results. It should be noted that the more states the HMM has, the more training samples are needed to achieve higher classification rates. In our experiments, since we have a real-time interaction constraint, we are effectively limited to 27 and 64 states, as 125 does not compute results quickly enough.

2.4 Culling Training Data
We also found that during the data collection phase there were a number of issues with the samples being inconsistent. For example, the training set was created by pressing a button on the device, performing the gesture and letting go of the button. In this case, delays in button presses and releases skew the statistics. This is solved by culling the input data based on sequence lengths. We throw away the longest and the shortest sequences. We used a 1.5 standard deviation rule to decide which sequences were to be removed.

2.5 Grouping Data to Lower False Positives
Our gestures were selected as part of a game design rather than chosen arbitrarily. This created a set of gestures that had significantly different lengths and allowed us to further reduce false positive rates by using a linear discriminator to first classify the gestures into short gestures and long gestures. Any gesture sequences with fewer than 27 elements were considered short and any with more than 27 were considered long.

During our live testing, no long moves performed were ever shorter than 28 elements and only one short move (out of 70 trials) was longer than 27 elements, thus effectively removing the false positives that occurred between small and large moves.

2.6 Left Hand vs. Right Hand Training
We conducted an experiment to determine whether gestures trained by the right hand could be detected if the sensor was placed on the left hand for testing. As expected, for the majority of gestures, the left hand tests were not recognized as valid gestures when compared to the right hand training data. This is because our left and right arms have a different “natural tilt” to them, which results in different sensor values when the mirrored gestures are performed. Rates dropped by more than half in most cases. However, some gestures that are unidirectional and involve no roll or tilt can be performed fairly easily by both hands. For optimal results in gesture recognition, it is recommended that each gesture be trained individually for each limb.

3. RESULTS AND CONCLUSION
We have achieved very good recognition rates (Table 1) using the above stated decomposition of the HMM, which is applicable in a sensor agnostic way.

                     False +ve  True +ve  False -ve  True -ve  Correct
In training set        11.2%     94.4%      5.6%      88.8%     91.6%
Not in training set    11.4%     84.1%     15.9%      88.6%     86.4%

Table 1: Recognition results for tests of all 7 gestures when the person is part of the training set and when the person is not.

The decomposition technique allows for any 3D sensing device to use a discrete HMM as part of a gesture recognition system, while the speed allows for multiple sensors to be used together.

4. REFERENCES
[1] Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Readings in Speech Recognition, Morgan Kaufmann Inc., San Francisco, CA, 1990.
[2] Kratz, L., Smith, M., and Lee, F. J. Wiizards: 3D gesture recognition for game play input. In Proceedings of the 2007 Conference on Future Play, Toronto, Canada, Nov '07.
[3] Keskin, C., Erkan, A., and Akarun, L. Real time hand tracking and 3D gesture recognition for interactive interfaces using HMM. In Proceedings of the Joint International Conference ICANN/ICONIP 2003. Springer.
[4] Segen, J. and Kumar, S. Fast and Accurate 3D Gesture Recognition Interface. In Proc. of the 14th ICPR. IEEE Computer Society, Washington, DC, August 1998.
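
The data preparation steps of Sections 2.4 and 2.5 (culling training sequences by length, then splitting gestures into short and long groups) might be sketched as follows. This is an illustrative reading, not the authors' code. The paper does not say how the 1.5 standard deviation rule was applied, so the sketch assumes that a sequence is kept when its length lies within 1.5 population standard deviations of the mean length. The function names are hypothetical, and a sequence of exactly 27 elements is treated as short, which is an assumption consistent with the live-testing remark that no long move was shorter than 28 elements.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def cull_by_length(sequences: Sequence[Sequence[int]],
                   k: float = 1.5) -> List[Sequence[int]]:
    """Drop the unusually short and unusually long training sequences.

    Assumed rule: keep a sequence when |length - mean| <= k * std,
    with std taken as the population standard deviation of the lengths.
    """
    lengths = [len(s) for s in sequences]
    mu = mean(lengths)
    sigma = pstdev(lengths)
    return [s for s in sequences if abs(len(s) - mu) <= k * sigma]


def length_class(sequence: Sequence[int], threshold: int = 27) -> str:
    """Split gestures into 'short' and 'long' before HMM scoring.

    Assumption: a length of exactly `threshold` counts as short.
    """
    return "short" if len(sequence) <= threshold else "long"
```

In use, each incoming gesture would first pass through length_class so that it is only compared against models from the matching group, which is how the length split removes confusions between short and long moves.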
