
Time Invariant Gesture Recognition

by Modelling Body Posture Space

Binu M. Nair and Vijayan K. Asari
Computer Vision and Wide Area Surveillance Laboratory,
Electrical and Computer Engineering,
300 College Park Avenue, Kettering Labs KL-302A,
University of Dayton, Dayton, OH - 45469, USA

Abstract. We propose a framework for recognizing actions or gestures
by modelling variations of the corresponding shape postures with respect
to each action class, thereby removing the need to normalize for the
speed of motion. The three main aspects are a shape descriptor suitable
for describing the posture, the formation of a suitable posture space, and
a regression mechanism to model the posture variations with respect to
each action class. Histogram of gradients (HOG) is used as the shape
descriptor, with the variations being mapped to a reduced Eigenspace
by PCA. The mapping of each action class from the HOG space to the
reduced Eigenspace is done using GRNN. Classification is performed
by comparing the points on the Eigenspace to those determined by
each of the action models using the Mahalanobis distance. The framework
is evaluated on the Weizmann action dataset and the Cambridge Hand Gesture
dataset, providing significant and positive results.
Keywords: Histogram of gradients (HOG), Generalized Regression
Neural Networks (GRNN), Human Action Modelling, Principal Component
Analysis (PCA), K-Means Clustering.


Human gesture recognition has been a widely researched area over the last few
years due to potential applications in the field of security and surveillance. Early
research on gesture recognition used the concept of space time shapes, which are
concatenated silhouettes over a set of frames, to extract certain features
corresponding to the variation within the spatio-temporal space. Gorelick et al. [7]
modelled the variation within the space time shape using Poisson's equation and
extracted space time structures which provide discriminatory features. Wang et
al. recognized human activities using a derived form of the Radon transform
known as the R-Transform [17,16]. A combination of a 3D distance transform
along with the R-Transform is used to represent a space time shape at multiple
levels, which serve as the corresponding action features [11].
H. Jiang et al. (Eds.): IEA/AIE 2012, LNAI 7345, pp. 124–133, 2012.
© Springer-Verlag Berlin Heidelberg 2012



Action sequences can also be represented as a collection of spatio-temporal
words, with each word corresponding to a certain set of space-time interest points
which are detected by a 2D spatial Gaussian filter and 1D Gabor temporal
filters [12]. Here, Niebles et al. compute the probability distributions of the
spatio-temporal words corresponding to each class of human action using a
probabilistic Latent Semantic Analysis model. A similar algorithm is
given by Batra et al., where a dictionary of mid-level features called space time
shapelets is created. These shapelets characterize the local motion patterns within
a space time shape, thereby representing an action sequence as a histogram of these
space time shapelets over the trained dictionary [2]. However, these methods are
susceptible to illumination variation or require good foreground segmentation of the
silhouettes. Another approach is to model the non-linear dynamics of the human
action by tracking the trajectories of certain points on the body and capturing
features from those trajectories. Ali et al. used concepts from Chaos Theory
to reconstruct the phase space from each of the trajectories and compute the
dynamic and metric invariants, which are then used as action feature vectors [1].
This method is affected by partial occlusions, as some trajectories may be
missing, which may affect the metrics extracted. Scovanner et al. used a 3D-SIFT
descriptor to represent spatio-temporal words in a bag-of-words model representation
of action videos [13]. Sun et al. extended this methodology by combining
local descriptors based on SIFT features with holistic moment-based features [15].
The local features comprise the 2D SIFT and 3D SIFT features computed
from suitable interest points, and the holistic features are the Zernike moments
computed from motion energy images and motion history images. That approach
assumes that the scene is static, as it relies on frame differencing
to get suitable interest points.
A different approach for characterizing human action sequences is to consider
these sequences as multi-dimensional arrays called tensors. Kim et al. presented
a new framework called Tensor Canonical Correlation Analysis, where descriptive
similarity features between two video volumes are used in a nearest neighbour
classification scheme for recognition [8]. Lui et al., however, studied the
underlying geometry of the tensor space occupied by human action sequences and
performed factorization on this space to obtain product manifolds [10].
Classification is done by projecting a video or a tensor onto this space and classifying
it using a geodesic distance measure. Unlike the space time approach, this type of
methodology shows much improved performance on datasets with
large variations in illumination and scale. However, the classification is done per
video sequence and not on a set of frames constituting a part of a video
sequence. A 3D gradient-based shape descriptor representing the local variations
was introduced by Klaser et al. [9] and is based on the 2D HOG descriptor used
for human body detection [5,6]. Here, each space time shape is divided into cubes
and, in each cube, the histogram is computed from the spatial and temporal
gradients. Chin et al. performed an analysis on modelling the variation of the
human silhouettes with respect to time [4]. They studied the use of different
dimensionality reduction techniques such as PCA and LLE and the use of neural



networks to model the mapping. The technique proposed in this paper uses the
histogram of spatial gradients in a region of interest, finds an underlying function
which captures the temporal variance of these 2D shape descriptors with respect
to each action, and classifies a set of contiguous frames irrespective of the speed
of the action or the time instant of the body posture.

Proposed Methodology

In this paper, we focus on three main aspects of the action recognition framework.
The first is feature extraction, where a shape descriptor is computed for
the region of interest in each frame. The second is the computation of an
appropriate reduced space which spans the shape change variations across time.
The third is a suitable modelling of the mapping from the shape
descriptor space to the reduced space. A block diagram illustrating the
framework is shown in Figure 1.

Fig. 1. Gesture Recognition Framework

Histogram of gradients is used as the shape descriptor as it provides a more
local representation of the shape and is partially invariant to illumination.
To obtain a reduced dimensional space where the inter-frame variation among
the HOG descriptors is large, we use Principal Component Analysis. The
inter-frame variation between the HOG descriptors is due to the change in shape
of a body or hand with respect to the particular action being performed. So, we
propose modelling these posture or shape variations with respect to each
action, the underlying idea being that the posture variations differ with action class.
Modelling actions in this manner removes the need for normalization with
respect to time, so a slow or fast execution of the same action class makes
no difference. Only the postures of each frame are
mapped onto a reduced space containing variations in time, thereby making
the framework time-invariant. In other words, while classifying a slow action,
the posture variations will occupy a small part of the manifold when compared
to a fast moving action, where the posture variations occupy a large section of
the action manifold. Moreover, due to the varying speed of the action in different
individuals, some of the postures in the action sequence may not be present
during the training phase. So, when these particular postures occur in a test
sequence of an action, the action model can estimate where that posture lies
on the reduced space. This approach gives a more accurate estimation of the


(a) Silhouette (b) Gradient


(c) HOG Descriptor

Fig. 2. HOG descriptor extracted from a binary human silhouette from the Weizmann
Database [7]

(a) Hand

(b) Gradient

(c) HOG Descriptor

Fig. 3. HOG descriptor extracted from a gray scale hand image from Cambridge Hand
Gesture Database [8]

corresponding location of that particular shape on the action manifold than an
approach which uses nearest neighbours to determine the corresponding reduced
posture point. In this paper, we use a separate model for each action class, and
the modelling is done using generalized regression neural networks, which are
multiple-input multiple-output networks.

Shape Representation Using HOG

The histogram of gradients is computed by first taking the gradient of the image
in the x and y directions and calculating the gradient magnitude and orientation
at each pixel. The image is then divided into K overlapping blocks and the
orientation range is divided into n bins. From each block, the gradient magnitudes
of those pixels corresponding to the same range of orientation (belonging to the
same bin) are added up to form a histogram. The histograms from the various
blocks are normalized and concatenated to form the HOG shape descriptor.
An illustration of the HOG descriptor extracted from a masked human silhouette
image is shown in Figure 2. It can be seen that, since the binary silhouette
produces a gradient where all of its points correspond to the silhouette, the HOG
descriptor produces a discriminative shape representation. Moreover, due to the
block operation during the computation of the HOG, this descriptor provides
a more local representation of the particular posture or shape. An illustration
of the HOG descriptor (first 50 elements) applied on a gray scale hand image is



shown in Figure 3. Unlike the binary image, there is some noise in the gradient
image which gets reflected onto the HOG descriptor. Since the HOG descriptors
are partially illumination invariant, we can assume that under varying illumination
conditions, the feature descriptors do not vary much.
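
The computation described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hog_descriptor`, the use of non-overlapping blocks (the paper uses overlapping ones), central-difference gradients, and unsigned orientations in [0, 180) degrees are all simplifications chosen here.

```python
import math

def hog_descriptor(image, cells=2, bins=9):
    """HOG sketch over a cells x cells grid of NON-overlapping blocks (assumption)."""
    rows, cols = len(image), len(image[0])
    block_h, block_w = rows // cells, cols // cells
    descriptor = []
    for by in range(cells):
        for bx in range(cells):
            hist = [0.0] * bins
            for y in range(by * block_h, (by + 1) * block_h):
                for x in range(bx * block_w, (bx + 1) * block_w):
                    # Central differences, clamped at the image border.
                    gx = image[y][min(x + 1, cols - 1)] - image[y][max(x - 1, 0)]
                    gy = image[min(y + 1, rows - 1)][x] - image[max(y - 1, 0)][x]
                    magnitude = math.hypot(gx, gy)
                    # Unsigned orientation in [0, 180) degrees (assumption).
                    angle = math.degrees(math.atan2(gy, gx)) % 180.0
                    index = min(int(angle / (180.0 / bins)), bins - 1)
                    # Gradient magnitudes falling in the same bin are summed.
                    hist[index] += magnitude
            # L2-normalize each block histogram before concatenation.
            norm = math.sqrt(sum(v * v for v in hist)) or 1.0
            descriptor.extend(v / norm for v in hist)
    return descriptor

# Toy binary "silhouette": left half background, right half foreground.
toy_image = [[0] * 4 + [1] * 4 for _ in range(8)]
toy_descriptor = hog_descriptor(toy_image)
```

For the toy image, every gradient is horizontal, so all the energy of each block falls into the first orientation bin. With `cells=7` and `bins=9` the descriptor has 441 elements.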

Computation of Reduced Posture Space Using PCA

The next step in the framework is to determine an appropriate space which
represents the inter-frame variation of the HOG descriptors. An illustration of the
reduced posture or shape space using PCA is shown for the Weizmann dataset
and the Cambridge Hand dataset using three Eigenvectors in Figure 4. The
reduced posture points shown in Figure 4 are color-coded by action class to
illustrate how close the action manifolds are and the separability existing
between them. We can see that there is a lot of overlap between different action
manifolds in the reduced space, and our aim is to use a functional mapping for
each manifold to distinguish between them. We first collect the HOG descriptors
from all the possible postures of the body irrespective of the action class and
form what is known as the action space, denoted by $S_D$. We can
express the action space mathematically as

$$S_D = \{ h_{k,m} : 1 \le k \le K(m) \text{ and } 1 \le m \le M \}$$

where $K(m)$ is the number of frames taken over all the training video sequences
of action $m$ out of $M$ action classes, and $h_{k,m}$ is the corresponding
HOG descriptor of dimension $D \times 1$. The reduced action or posture space is
obtained by extracting the principal components of the matrix $HH^T$, where
$H = [h_{1,1}\ h_{2,1}\ \dots\ h_{K(M),M}]$, using PCA. This is done by finding the
Eigenvectors or Eigenpostures $v_1, v_2, \dots, v_d$ corresponding to the largest
variances between the HOG descriptors. In this reduced space, the inter-frame
variation between the extracted HOG descriptors due to the changes of shape of the
body (due to the motion or the action) is maximized by selecting the appropriate
number of Eigenpostures while, at the same time, reducing the effect of noise due
to illumination

(a) Weizmann Dataset

(b) Cambridge Hand Dataset

Fig. 4. Reduced Posture Space for the HOG descriptors extracted from video sequences



by removing those Eigenpostures having low Eigenvalues. In other words, the
Eigenvectors with the highest Eigenvalues correspond to the directions along
which the variance between the HOG descriptors due to the posture or shape
change is maximum, and all the other Eigenvectors with lower Eigenvalues can
be considered as directions which correspond to the noise in the HOG shape
descriptor.
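
A small sketch may help make the construction of the reduced posture space concrete. It is an illustration only, under assumptions that are not in the paper: the descriptors are mean-centred before the covariance is formed, and the leading Eigenvectors are found by power iteration with deflation rather than by a library eigen-solver. The names `eigenpostures` and `project` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def eigenpostures(descriptors, d, iterations=500):
    """Return the mean descriptor and the d leading covariance eigenvectors."""
    n, dim = len(descriptors), len(descriptors[0])
    mean = [sum(h[j] for h in descriptors) / n for j in range(dim)]
    centred = [[hj - mj for hj, mj in zip(h, mean)] for h in descriptors]
    # Covariance matrix (dim x dim); acceptable for a toy example only.
    cov = [[sum(c[a] * c[b] for c in centred) / n for b in range(dim)]
           for a in range(dim)]
    basis = []
    for k in range(d):
        v = [1.0 / (j + k + 1) for j in range(dim)]  # deterministic start
        for _ in range(iterations):
            w = [dot(row, v) for row in cov]
            # Deflation: remove components along eigenvectors already found.
            for u in basis:
                proj = dot(w, u)
                w = [wi - proj * ui for wi, ui in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:  # no variance left in the remaining directions
                break
            v = [wi / norm for wi in w]
        basis.append(v)
    return mean, basis

def project(h, mean, basis):
    """Map a descriptor to its point in the reduced posture space."""
    centred = [hj - mj for hj, mj in zip(h, mean)]
    return [dot(centred, v) for v in basis]

# Toy 2-D "descriptors" lying close to the diagonal direction.
toy = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.1], [3.0, 2.9]]
toy_mean, toy_basis = eigenpostures(toy, 1)
```

Keeping only the first Eigenvector here retains the variation along the diagonal and discards the small off-diagonal scatter, which plays the role of noise in this toy setting.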

Modelling of Mapping from HOG Space to Posture Space Using GRNN


The mapping from the HOG descriptor space ($D \times 1$) to the reduced posture or
shape space ($d \times 1$) can be represented as $S_D \rightarrow S_d$, where
$S_d = \{ p_{k,m} : m = 1 \text{ to } M \}$ and $p$ is a vector representing a point
in the reduced posture space. In this framework, we aim to model the mapping from
the HOG to the posture space for each action $m$ separately using the Generalized
Regression Neural Network [14,3]. This network is a one-pass learning algorithm
which provides fast convergence to the optimal regression surface. It is memory
intensive, as it requires the storage of the training input and output vectors,
where each node in the first layer is associated with one training point. The
network models an equation of the form

$$\hat{y} = \frac{\sum_{i=1}^{n} y_i \,\mathrm{radbasis}(x - x_i)}{\sum_{i=1}^{n} \mathrm{radbasis}(x - x_i)}$$

where $(x_i, y_i)$ are the $n$ training input/output pairs and $\hat{y}$ is the
estimated point for the test input $x$. In our algorithm, since a lot of training
points are present, a lot of nodes would have to be implemented for each class,
which is not memory efficient. To get suitable training points that mark the
transitions in the posture space for a particular action class $m$, k-means
clustering is done to get $L(m)$ clusters. So, the mapping of the
HOG descriptor space to its reduced space for a particular action class $m$ can
be modelled by a general regression equation given as



$$\hat{p}^{(m)} = \frac{\sum_{i=1}^{L(m)} \bar{p}_{i,m} \exp\left(-\frac{D_{i,m}^2}{2\sigma^2}\right)}{\sum_{i=1}^{L(m)} \exp\left(-\frac{D_{i,m}^2}{2\sigma^2}\right)}; \qquad D_{i,m}^2 = (h - \bar{h}_{i,m})^T (h - \bar{h}_{i,m})$$

where $(\bar{p}_{i,m}, \bar{h}_{i,m})$ are the $i$-th cluster centres in the posture
space and the HOG descriptor space, respectively. The standard deviation $\sigma$
for each action class is taken as the median Euclidean distance between the
corresponding action's cluster centres. The action class is determined by first
projecting the consecutive set of $R$ frames onto the Eigenpostures. These
projections of the frames, given by $p_r : 1 \le r \le R$, are compared with the
projections $\hat{p}^{(m)}_r$ of the corresponding frames estimated by each of the
GRNN action models using the Mahalanobis distance. The action model which gives the
closest estimates of the projections is selected as the action class.
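
The per-class regression can be sketched as a kernel-weighted average of cluster centres. This is a minimal sketch under assumptions, not the authors' code: the function names are hypothetical, the cluster centres are taken as already computed (the k-means step is omitted), and the median spacing is computed over all pairs of centres in the HOG space, which the paper does not spell out.

```python
import math
from statistics import median

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def median_centre_spacing(centres):
    """Median Euclidean distance over all pairs of cluster centres (assumption)."""
    pairs = [math.sqrt(squared_distance(a, b))
             for i, a in enumerate(centres) for b in centres[i + 1:]]
    return median(pairs)

def grnn_estimate(h, hog_centres, posture_centres, sigma):
    """Estimate the posture point for descriptor h from one action model."""
    d2 = [squared_distance(h, c) for c in hog_centres]
    # Subtracting the smallest distance leaves the ratio unchanged but
    # avoids a zero denominator when h is far from every centre.
    nearest = min(d2)
    weights = [math.exp(-(d - nearest) / (2.0 * sigma ** 2)) for d in d2]
    total = sum(weights)
    dim = len(posture_centres[0])
    return [sum(w * p[j] for w, p in zip(weights, posture_centres)) / total
            for j in range(dim)]

# Toy action model with two cluster centres (1-D HOG space, 2-D posture space).
hog_centres = [[0.0], [2.0]]
posture_centres = [[0.0, 0.0], [1.0, 2.0]]
sigma = median_centre_spacing(hog_centres)
midway = grnn_estimate([1.0], hog_centres, posture_centres, sigma)
```

A descriptor midway between the two HOG centres is mapped to the midpoint of the two posture centres, which illustrates how the model interpolates postures that were never seen during training.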



(a) Weizmann Dataset [7]

(b) Cambridge Hand Dataset [8]

Fig. 5. Datasets for Testing

Experiments and Results

The algorithm presented in this paper has been evaluated on two datasets, the
Weizmann Human Action [7] and the Cambridge Hand Gestures [8] datasets. The
histogram of gradients feature descriptor has been extracted by dividing the
detection region into $7 \times 7$ overlapping cells. From each cell, a histogram
of gradients is computed with 9 orientation bins and normalized by its L2 norm,
and the normalized histograms are concatenated to form the feature vector of
size $441 \times 1$.
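
As a quick check on the stated descriptor size, assuming each of the $7 \times 7$ cells contributes exactly one 9-bin histogram and nothing else is appended:

$$\underbrace{7 \times 7}_{\text{cells}} \times \underbrace{9}_{\text{bins per cell}} = 49 \times 9 = 441$$

which agrees with the $441 \times 1$ feature vector.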

Weizmann Action Dataset

This action dataset consists of 10 action classes, each containing 9–10
samples performed by different people. The background in these video sequences
is static, with uniform lighting at low resolution, so silhouettes of the person
can be extracted by a simple background segmentation. HOG features computed
for these silhouettes represent the shape or the posture of the silhouette at one
particular instant. During the training phase, all the frames of every training
sequence of each class are taken together to get the HOG feature set for each
action class. The test sequence is split up into overlapping windows (partial
sequences) of size $N$ with an overlap of $N - 1$. The HOG features of each frame
of the window are compared with the estimated features from each action class
model corresponding to this particular frame using the Mahalanobis distance, and
the overall distance from each class is computed by taking the L2 norm
of the distances for each frame. The action model which gives the minimum
final distance measure to the testing partial sequence is determined to be its
action class. Table 1 gives the results for the framework with GRNN having 10
clusters with a window size of 20 frames. The testing is done using a leave-10-out
procedure, where 10 sequences, each one corresponding to a particular action
class, are considered as the testing set while the remaining sequences are taken
as the training set. The variation of the overall accuracy for different window
sizes of 10, 12, 15, 18, 20, 23, 25, 28, 30 of the test partial sequences is shown in
Figure 6.
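
The windowing and decision rule can be sketched as follows. This is an illustration under assumptions rather than the evaluated system: the function names are hypothetical, and the Mahalanobis distance uses a diagonal covariance (one variance per posture dimension), since the paper does not state which covariance is used.

```python
import math

def windows(frames, size):
    """Overlapping windows of `size` frames with an overlap of size - 1."""
    return [frames[i:i + size] for i in range(len(frames) - size + 1)]

def mahalanobis_diag(p, q, variances):
    """Mahalanobis distance with a diagonal covariance (assumption)."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(p, q, variances)))

def classify_window(projections, estimates_by_class, variances):
    """Return the class whose per-frame estimates are closest over the window.

    projections: posture points of the R frames in the window.
    estimates_by_class: {label: R estimated posture points from that model}.
    """
    scores = {}
    for label, estimates in estimates_by_class.items():
        per_frame = [mahalanobis_diag(p, e, variances)
                     for p, e in zip(projections, estimates)]
        # L2 norm of the per-frame distances gives one score per class.
        scores[label] = math.sqrt(sum(d * d for d in per_frame))
    return min(scores, key=scores.get), scores

# Toy window of two frames and two hypothetical action models.
toy_projections = [[0.0, 0.0], [1.0, 1.0]]
toy_estimates = {
    "walk": [[0.0, 0.0], [1.0, 1.1]],
    "run": [[2.0, 2.0], [3.0, 3.0]],
}
label, scores = classify_window(toy_projections, toy_estimates, [1.0, 1.0])
```

Because each window is scored only through the postures of its frames, the same rule applies whether the window covers a small or a large part of the action cycle.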



Table 1. Confusion Matrix for Weizmann Actions: a1 - bend; a2 - jplace; a3 - jack;
a4 - jforward; a5 - run; a6 - side; a7 - wave1; a8 - skip; a9 - wave2; a10 - walk
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10
a1 99

Fig. 6. Average Accuracy computed for the action classes for window size
10, 12, 15, 18, 20, 23, 25, 28, 30


Cambridge Hand Gesture Dataset

The dataset contains 3 main action classes showing different postures of the
hand: flat, spread out and V-shape. Each of the main classes has three
sub-classes which differ in the direction of movement. In total, we have 9 different
action classes which differ in the posture of the hand as well as its direction
of motion. The main challenge is to differentiate between different motions and
shapes under different illumination conditions. The dataset is shown in Figure 5(b).
There are 5 sets, each containing a different illumination of all the action classes,
with each class having 20 sequences. From each of the video sequences, we applied skin
segmentation to get a rough region of interest, and extracted the HOG based
shape descriptor from the gray scale detection region. Unlike the descriptors
extracted from silhouettes in the Weizmann dataset, these descriptors contain
noise variations due to different illumination conditions. The testing strategy
we used is the same as that of the Weizmann dataset, with leave-9-out video sequences



Table 2. Confusion Matrix and Overall Accuracy for Cambridge Hand Gesture Dataset

(a) Confusion Matrix

a1 a2 a3 a4 a5 a6 a7 a8 a9
a1 94.0
91.0 6
a3 2
1 95.0
91.0 1
5 85.0 10
1 99.0
83.0 3 14
86.0 14
1 13 9 77.0

(b) Overall Accuracies for each Set

      Set 1   Set 2   Set 3   Set 4   Set 5
Acc   96.11   73.33   70.00   86.67   87.72

where each test sequence corresponds to an action class. The confusion matrix
for the action classes obtained from the framework with 4 clusters is given in
Table 2(a). We can see that if all the illumination conditions are trained into
the system, the overall accuracy obtained with the framework is high. Using
the same testing strategy, we tested the system for overall accuracy on each
set, and this is given in Table 2(b). For set 1, the overall accuracy is high
(96.11), as the non-uniform lighting does not affect the feature vectors and noise
is diminished by the partial illumination invariance of the HOG descriptor. Sets 4
and 5 show moderate accuracies (86.67 and 87.72), while sets 2 and 3 (73.33 and
70.00) give an average overall accuracy.

Conclusions and Future Work

In this paper, we presented a framework for recognizing actions from partial
video sequences which is invariant to the speed of the action being performed. We
illustrated this approach using the Histogram of Gradients shape descriptor and
computed the mapping from the HOG space to the reduced dimensional posture
space using Principal Component Analysis. The mapping from the HOG space
to the reduced posture space for each action class is learned separately using a
Generalized Regression Neural Network. Classification is done by projecting the
HOG descriptors of the partial sequence onto the posture space and comparing
the reduced dimensional representation with that of the estimated posture from
the GRNN action models using the Mahalanobis distance. The results show the
accuracy of the framework as illustrated on the Weizmann database. However,
when using gray scale images to compute the HOG, severe illumination
conditions can affect the framework, as illustrated by the Hand Gesture database
results. In future, our plan is to extract a shape descriptor which represents
a shape from a set of corner points, where relationships between them are
determined in the spatial and temporal scale. Other regression and classification
schemes will also be investigated in this framework.

References


1. Ali, S., Basharat, A., Shah, M.: Chaotic invariants for human action recognition.
In: IEEE 11th International Conference on Computer Vision, ICCV 2007, pp. 1–8
(October 2007)
2. Batra, D., Chen, T., Sukthankar, R.: Space-time shapelets for action recognition.
In: IEEE Workshop on Motion and Video Computing, WMVC 2008, pp. 1–6
(January 2008)
3. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science
and Statistics), 1st edn. Springer (2006), corr. 2nd printing edn. (October 2007)
4. Chin, T.J., Wang, L., Schindler, K., Suter, D.: Extrapolating learned manifolds for
human activity recognition. In: IEEE International Conference on Image
Processing, ICIP 2007, vol. 1, pp. 381–384 (October 2007)
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
CVPR 2005, vol. 1, pp. 886–893 (June 2005)
6. Dalal, N., Triggs, B., Schmid, C.: Human Detection Using Oriented Histograms of
Flow and Appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006,
Part II. LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006)
7. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as
space-time shapes. Transactions on Pattern Analysis and Machine Intelligence 29(12),
2247–2253 (2007)
8. Kim, T.K., Wong, S.F., Cipolla, R.: Tensor canonical correlation analysis for action
classification. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2007, pp. 1–8 (June 2007)
9. Klaser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on
3d-gradients. In: Proceedings of the British Machine Vision Conference (BMVC 2008),
pp. 995–1004 (September 2008)
10. Lui, Y.M., Beveridge, J., Kirby, M.: Action classification on product manifolds.
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010,
pp. 833–839 (June 2010)
11. Nair, B., Asari, V.: Action recognition based on multi-level representation of 3d
shape. In: Proceedings of the International Conference on Computer Vision Theory
and Applications, pp. 378–386 (March 2010)
12. Niebles, J., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories
using spatial-temporal words. In: British Machine Vision Conference, BMVC 2006
13. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional sift descriptor and its
application to action recognition. In: Proceedings of the International Conference on
Multimedia (MultiMedia 2007), pp. 357–360 (September 2007)
14. Specht, D.: A general regression neural network. IEEE Transactions on Neural
Networks 2(6), 568–576 (1991)
15. Sun, X., Chen, M., Hauptmann, A.: Action recognition via local descriptors and
holistic features. In: IEEE Computer Society Conference on Computer Vision and
Pattern Recognition Workshops, CVPR Workshops 2009, pp. 58–65 (June 2009)
16. Tabbone, S., Wendling, L., Salmon, J.: A new shape descriptor defined on the radon
transform. In: Computer Vision and Image Understanding, vol. 102, pp. 42–51 (2006)
17. Wang, Y., Huang, K., Tan, T.: Human activity recognition based on r transform.
In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007,
pp. 1–8 (June 2007)