[...] augmentation method also incorporates existing spatial augmentation techniques [8]. This work bears similarities to the multi-sensor approach of Molchanov et al. [12], but differs in the use of two separate sub-networks and data augmentation.

We demonstrate that our system with two sub-networks, which employs spatio-temporal data augmentation for training, outperforms both a single CNN and the baseline feature-based algorithm [14] on the VIVA challenge's dataset.

2. METHOD

We use a convolutional neural network classifier for dynamic hand gesture recognition. Sec. 2.1 briefly describes the VIVA challenge's hand gesture dataset used in this paper. Secs. 2.2 to 2.4 describe the preprocessing steps needed for our model, the details of the classifier, and the training pipeline for the two sub-networks (Fig. 1). Finally, we introduce a spatio-temporal data augmentation method in Sec. 2.5 and show how it is combined with spatial transformations.

2.1. Dataset

The VIVA challenge was organized to evaluate and advance the state-of-the-art in multi-modal dynamic hand gesture recognition under challenging conditions (with variable lighting and multiple subjects). The VIVA challenge's dataset contains 885 intensity and depth video sequences of 19 different dynamic hand gestures performed by 8 subjects inside a vehicle [14]. Both channels were recorded with the Microsoft Kinect device and have a resolution of 115 × 250 pixels. The dataset was collected under varying illumination conditions. The gestures were performed either with the right hand by subjects in the driver's seat or with the left hand by subjects in the front passenger's seat. The hand gestures involve hand and/or finger motion.

2.2. Preprocessing

Each hand gesture sequence in the VIVA dataset has a different duration. To normalize the temporal lengths of the gestures, we first re-sampled each gesture sequence to 32 frames using nearest neighbor interpolation (NNI) by dropping or repeating frames [12]. We also spatially down-sampled the original intensity and depth images by a factor of 2, to 57 × 125 pixels. We computed gradients from the intensity channel using the Sobel operator of size 3 × 3 pixels; gradients helped to improve robustness to the different illumination conditions present in the dataset. We normalized each channel of a particular gesture's video sequence to zero mean and unit variance, which helped our gesture classifier converge faster. The final inputs to the gesture classifier were 57 × 125 × 32 sized volumes containing interleaved image gradient and depth frames.
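The preprocessing pipeline is simple enough to sketch end-to-end. Below is a minimal NumPy/SciPy reconstruction; the function and variable names are ours, and the channels-first output layout and the row-dropping convention are assumptions, not the authors' code:

```python
import numpy as np
from scipy import ndimage

def preprocess_gesture(intensity, depth, out_frames=32):
    """Preprocess one gesture: intensity, depth are (T, 115, 250) arrays.

    Returns a (2, 57, 125, 32) volume of gradient and depth channels.
    """
    T = intensity.shape[0]
    # Temporal NNI: drop or repeat frames to reach exactly 32 frames.
    t = np.round(np.linspace(0, T - 1, out_frames)).astype(int)
    intensity, depth = intensity[t], depth[t]

    # Spatial down-sampling by a factor of 2; dropping the last strided
    # row maps 115 rows to the paper's 57 (a convention we assume).
    intensity = intensity[:, :114:2, ::2]
    depth = depth[:, :114:2, ::2]

    # 3x3 Sobel gradient magnitude of each intensity frame.
    grad = np.stack([np.hypot(ndimage.sobel(f, axis=0),
                              ndimage.sobel(f, axis=1))
                     for f in intensity])

    # Normalize each channel to zero mean and unit variance.
    grad = (grad - grad.mean()) / (grad.std() + 1e-8)
    depth = (depth - depth.mean()) / (depth.std() + 1e-8)

    # Two-channel volume, 2@57x125x32.
    return np.stack([grad, depth]).transpose(0, 2, 3, 1)
```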
2.3. Classifier

Our convolutional neural network classifier consisted of two sub-networks (Fig. 1): a high-resolution network (HRN) and a low-resolution network (LRN), with network parameters W_H and W_L, respectively. Each network, with parameters W, produced class-membership probabilities P(C | x, W) for classes C given the gesture's observation x. We multiplied the class-membership probabilities from the two networks element-wise to compute the final class-membership probabilities of the gesture classifier:

P(C \mid x) = P(C \mid x, W_H) \circ P(C \mid x, W_L). \quad (1)

We predicted the class label c^* = \arg\max_c P(C \mid x). The networks contained more than 1.2 million trainable parameters.

The high-resolution network consisted of four 3D convolution layers, each of which was followed by a max-pooling operator. Fig. 1 shows the sizes of the convolution kernels, the volumes at each layer, and the pooling operators. We input the output of the fourth 3D convolutional layer to two fully-connected layers (FCLs) with 512 and 256 neurons, respectively. The output of this high-resolution network was a softmax layer, which produced class-membership probabilities P(C | x, W_H) for the 19 gesture classes.

We input a spatially down-sampled (via NNI) gesture volume of 28 × 62 × 32 interleaved depth and image gradient values to the low-resolution network. Similar to the HRN, the LRN also comprised a number of 3D convolutional layers, each followed by a max-pooling layer, two FCLs, and an output softmax layer that estimated the class-membership probabilities P(C | x, W_L) (Fig. 1).

All the layers in the networks, except for the softmax layers, had rectified linear unit (ReLU) activation functions:

f(z) = \max(0, z). \quad (2)

We computed the output of the softmax layers as:

P(c \mid x, W) = \frac{\exp(z_c)}{\sum_q \exp(z_q)}, \quad (3)

where z_q was the output of neuron q.
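To make the fusion rule concrete, here is a minimal sketch of Eqs. (1)-(3) (our own illustration, not the authors' implementation): each sub-network's logits pass through a softmax, the two probability vectors are multiplied element-wise, and the class with the largest product is predicted.

```python
import numpy as np

def softmax(z):
    """Softmax of Eq. (3), shifted for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_and_predict(logits_hrn, logits_lrn):
    """Fuse the two sub-networks per Eq. (1) and predict c*.

    logits_hrn, logits_lrn: length-19 arrays of softmax-layer inputs
    (hypothetical outputs of the HRN and LRN).
    """
    p = softmax(logits_hrn) * softmax(logits_lrn)  # element-wise product
    return p, int(np.argmax(p))                    # c* = argmax_c P(C|x)
```

Note that the element-wise product is unnormalized; dividing by p.sum() recovers proper probabilities, but the argmax decision is unaffected.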
2.4. Training

The process of training a CNN involves the optimization of the network's parameters W to minimize a cost function for the dataset D. We selected the negative log-likelihood as the cost function:

\mathcal{L}(W, D) = -\frac{1}{|D|} \sum_{i=1}^{|D|} \log P\big(C^{(i)} \mid x^{(i)}, W\big). \quad (4)
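Eq. (4) is the standard mean negative log-likelihood; a short sketch (illustrative, with hypothetical variable names) of how it would be computed from the predicted class probabilities:

```python
import numpy as np

def nll_cost(probs, labels):
    """Mean negative log-likelihood of Eq. (4).

    probs:  (N, 19) array of class-membership probabilities P(C|x, W).
    labels: (N,) array of ground-truth class indices C^(i).
    """
    eps = 1e-12  # guards against log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))
```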
[Figure 1 diagram: only the layer annotations survive extraction — input 2@57×125×32; convolution kernels 32@5×5×3; intermediate volumes 32@6×11×4 and 64@2×3×2; max-pooling of 2×2×2, 2×2×2, and 1×4×1; FCL 1 and FCL 2 with 512 and 256 neurons; softmax (SM) over 19 classes; the low-resolution network is drawn below the high-resolution one.]

Figure 1: Overview of our CNN classifier. We used a CNN-based classifier for hand gesture recognition. The inputs to the classifier were 57 × 125 × 32 sized volumes of image gradient and depth values. The classifier consisted of two sub-networks: a high-resolution network (HRN) and a low-resolution network (LRN). The outputs of the sub-networks were class-membership probabilities P(C | x, W_H) and P(C | x, W_L), respectively. The two networks were fused by multiplying their respective class-membership probabilities element-wise.
We performed optimization via stochastic gradient descent with mini-batches of 40 and 20 training samples for the LRN and the HRN, respectively. We updated the network's parameters w ∈ W with the Nesterov accelerated gradient (NAG) [22] at every iteration i as:

\nabla w_i = \left\langle \frac{\delta \mathcal{L}}{\delta w}(w_{i-1}) \right\rangle_{\text{batch}}, \quad (5a)

v_i = \mu v_{i-1} - \lambda \nabla w_i, \quad (5b)

w_i = w_{i-1} + \mu v_i - \lambda \nabla w_i, \quad (5c)

where λ was the learning rate, μ was the momentum coefficient, and ∇w_i was the gradient of the cost function with respect to the parameter w, averaged across the mini-batch. We set the momentum to 0.9. We observed that NAG converged faster than gradient descent with momentum alone.
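A compact sketch of the update rule in Eqs. (5a)-(5c); this is our own illustration, and grad_fn is a hypothetical callable returning the mini-batch-averaged gradient:

```python
def nag_step(w, v, grad_fn, lr=0.005, mu=0.9):
    """One NAG update per Eqs. (5a)-(5c); w and v are NumPy arrays."""
    g = grad_fn(w)                    # (5a): gradient at w_{i-1}
    v_new = mu * v - lr * g           # (5b): velocity update
    w_new = w + mu * v_new - lr * g   # (5c): look-ahead parameter update
    return w_new, v_new
```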
We initialized the weights of the 3D convolution layers with random samples from a uniform distribution on [-w_b, w_b], where w_b = \sqrt{6 / (n_i + n_o)}, and n_i and n_o were the number of input and output neurons, respectively. We initialized the weights of the fully-connected hidden layers and the softmax layer with random samples from a normal distribution N(0, 0.01). The biases of all layers, except for the softmax layer, were initialized to 1 in order to have a non-zero partial derivative; the softmax layers' biases were set to 0.
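The uniform range is the classic Glorot/Xavier-style fan-in/fan-out bound. A sketch of both initializations (names and shapes are illustrative; we read N(0, 0.01) as a standard deviation of 0.01, which is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_conv3d(n_in, n_out, shape):
    """3D-conv weights: uniform on [-w_b, w_b], w_b = sqrt(6/(n_i + n_o))."""
    w_b = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-w_b, w_b, size=shape)

def init_fc(n_in, n_out, is_softmax=False):
    """FC/softmax weights from N(0, 0.01); biases of 1 for hidden
    layers (non-zero ReLU gradient), 0 for the softmax layer."""
    weights = rng.normal(0.0, 0.01, size=(n_in, n_out))
    biases = np.zeros(n_out) if is_softmax else np.ones(n_out)
    return weights, biases
```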
We trained the LRN and the HRN separately, and merged them only during the forward propagation stage employed for decision making. We applied weight decay to all convolution layers: after processing each mini-batch, we subtracted 0.5% from the network weights. We observed that regularization with weight decay usually led to better generalization for gestures from different subjects. We also applied drop-out (with p = 0.5) to the outputs of the fully-connected hidden layers [6]. During drop-out, the outputs were randomly (with p = 0.5) set to 0, and were consequently not used in the back-propagation step of that training iteration. During the forward propagation stage, the weights of the layer following the dropped layer were multiplied by 2 to compensate for the effect of drop-out.

For tuning the learning rate, we first initialized the rate to 0.005 and reduced it by a factor of 2 if the cost function did not improve by more than 10% over the preceding 40 epochs. We terminated network training after the learning rate had decayed at least 4 times or once the number of epochs exceeded 300. Since the dataset is small, we did not reserve data from any subjects to construct a validation set. Instead, we selected the network configuration that resulted in the smallest error on the training set.
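A sketch of this decay-and-stop policy (our own paraphrase of the schedule above, applied to a recorded list of per-epoch costs):

```python
def replay_lr_schedule(epoch_costs, lr=0.005, window=40, max_decays=4,
                       max_epochs=300):
    """Replay the decay policy over per-epoch training costs.

    Halves the rate when the cost fails to improve by more than 10%
    over the preceding 40 epochs; stops after 4 decays or 300 epochs.
    (A real schedule would also reset the comparison window per decay.)
    """
    lrs, decays = [], 0
    for e, cost in enumerate(epoch_costs[:max_epochs]):
        lrs.append(lr)
        if e >= window and cost > 0.9 * epoch_costs[e - window]:
            lr /= 2.0
            decays += 1
        if decays >= max_decays:
            break
    return lrs
```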
2.5. Spatio-temporal Data Augmentation

The VIVA challenge dataset contains fewer than 750 gestures for training, which is not enough to prevent overfitting. To avoid overfitting, we performed offline and online spatio-temporal data augmentation. Note that we did not augment the test data.
[Figures 2/3: only the panel labels survive extraction — "original: swipe to the left - right hand", "mirroring: swipe to the right - left hand", "gradient of the image" (with a colorbar from -10 to 10), and "fixed pattern drop-out" vs. "random drop-out".]
Figure 4: Temporal Elastic Deformation. TED temporally warps video sequences while maintaining the order of their frames and the size of their volume. Its key characteristic is the principal point. After the deformation, the frames are re-mapped according to the temporal elastic deformation curve. [Plot residue: "Input frame number" on the horizontal axis, ticks 10-30.]

Figure 6: LRN Training. The curves show average values for leave-one-subject-out cross-validation. Observe that the network does not overfit even after a very large number of epochs. [Plot residue: epochs 0-300 on the horizontal axis, tick values 0.8-0.9 on the vertical axis.]
[...] standard deviation b, and M is half of the gesture's temporal length. The elastic deformation function X(g) is approximated as a polynomial of order 2 that fits the positions of three points: the first frame [1, 1], the principal point g = [n, m], and the last frame [T, T]. Finally, the frames within the gesture's volume are interpolated using the function X(g) via nearest-neighbor interpolation. The parameters of the spatial and temporal data augmentation were randomly sampled from the corresponding distributions as described above. After each epoch, all the spatial and temporal transformations were re-applied to the gestures in the training dataset.
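The warp can be reproduced directly from this description. Below is a minimal sketch (our own reconstruction, assuming 1-based frame indices) that fits the order-2 polynomial through [1, 1], the principal point [n, m], and [T, T], then remaps frames by nearest-neighbor interpolation:

```python
import numpy as np

def temporal_elastic_deformation(volume, n, m):
    """Warp a T-frame gesture volume along time (TED).

    volume: array of shape (T, ...); (n, m) is the principal point.
    The order-2 polynomial X(g) passes through [1, 1], [n, m], and
    [T, T] (1-based frame indices), as described above.
    """
    T = volume.shape[0]
    coeffs = np.polyfit([1, n, T], [1, m, T], deg=2)  # fit X(g)
    g = np.arange(1, T + 1)
    # Nearest-neighbor re-mapping, clamped to valid frame indices.
    src = np.clip(np.round(np.polyval(coeffs, g)), 1, T).astype(int) - 1
    return volume[src]
```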
Method | HOG [14] | LRN   | HRN   | LRN+HRN
Mean   | 64.5%    | 74.4% | 70.0% | 77.5%
Std    | 16.9%    | 8.9%  | 7.8%  | 7.9%

Table 1: Classification results for leave-one-subject-out cross-validation. Both the LRN and the HRN outperformed the HOG-based approach [14]. Our final classifier, which combined the LRN and the HRN, resulted in the best performance. [The assignment of the extracted Std values to the last three columns is a best-effort reconstruction.]
3. RESULTS

We evaluated the performance of our dynamic hand gesture recognition system using leave-one-subject-out cross-validation on the VIVA challenge's dataset [14]. We used data from one of the 8 subjects for testing and trained the classifier with data from the 7 remaining subjects; we repeated this process for each of the 8 subjects and averaged the accuracy. Fig. 6 shows the performance of the LRN during training. We applied various forms of regularization to the network in order to prevent overfitting even after a large number of training epochs. Data augmentation and drop-out were key components of the classifier's successful generalization.

The correct classification rates for our gesture recognition system are listed in Table 1. We compared our classifier to the baseline method proposed by Ohn-Bar and Trivedi [14], which employs HOG+HOG2 features. Both the low- and high-resolution convolutional neural networks that we proposed outperformed Ohn-Bar and Trivedi's method, by 9.8% and 5.5%, respectively. Furthermore, the final classifier that combined the outputs of the LRN and the HRN outperformed the baseline method by 13.0%. Moreover, 52% of the final classifier's errors were associated with the second most probable class. The results indicate that our CNN-based classifier for in-car dynamic hand gesture recognition considerably outperforms approaches that employ hand-crafted features.

CNNs have been shown to be effective at combining data from different sensors [12]. We compared the performance of early and late fusion of sensors in the CNN architecture. In Table 2, we present the correct classification rates for the LRN trained with different input modalities. Observe that, individually, the depth data (accuracy = 65%) performed better than the intensity data (accuracy = 57%). On element-wise late multiplication of the class-membership probabilities of the two LRNs trained individually with the two modalities, the accuracy improved to 70.4%. However, the highest accuracy of 74.4% was obtained when the LRN was trained with interleaved depth and image gradient frames as input.

In Table 3, we present the correct classification rates for the low-resolution network with different forms of data augmentation. We observed that the training error increased on enabling data augmentation, but the test error decreased. This demonstrates that the proposed data augmentation method successfully reduced over-fitting and improved generalization of the gesture classifier. Additionally, we observed that image gradients increased the final correct classification rate by 1.1%, and spatial and temporal elastic deformations applied to the training data increased it by 1.2% and 1.72%, respectively.
frame #
original:                     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
temporal elastic deformation: 1 2 4 5 7 8 9 10 12 13 14 15 16 17 19 20 21 22 22 23 24 25 26 27 27 28 29 30 30 31 31 32
+ scaling:                    1 1 1 2 4 5 7 9 10 12 13 14 16 17 19 20 21 22 23 24 25 26 27 28 29 30 30 31 32 32 32 32
+ translation:                1 1 1 1 1 1 1 2 4 5 7 9 10 12 13 14 16 17 19 20 21 22 23 24 25 26 27 28 29 30 30 31

Figure 5: Examples of temporal data augmentation. The numbers in each row correspond to the frame IDs in the original sequence (top row). Note how the order and the occurrence of frames are changed by incrementally applying temporal elastic deformation (TED) (2nd row), TED + scaling (3rd row), and TED + scaling + translation (4th row) to the original input.
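Scaling and translation compose naturally with TED as successive index re-mappings. The sketch below illustrates one plausible composition; the principal point [12, 9], scale factor 1.15, and shift of 7 frames are illustrative values chosen to mimic the qualitative patterns in Fig. 5, not the paper's sampled parameters:

```python
import numpy as np

T = 32
g = np.arange(1, T + 1)  # output frame positions, 1-based

def apply_map(frame_ids, mapping):
    """Remap source frame IDs through a rounded, clamped index map."""
    m = np.clip(np.round(mapping), 1, T).astype(int)
    return frame_ids[m - 1]

ids = g.copy()                                     # row 1: original frame IDs
ted = np.polyval(np.polyfit([1, 12, T], [1, 9, T], deg=2), g)
ids = apply_map(ids, ted)                          # row 2: TED
ids = apply_map(ids, (g - T / 2) * 1.15 + T / 2)   # row 3: + scaling about the center
ids = apply_map(ids, g - 7)                        # row 4: + translation by 7 frames
print(ids)
```

Clamping at the sequence boundaries is what produces the repeated leading 1s and trailing 32s visible in the figure.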
Input        | intensity | depth | late fusion (product) | interleaved
LRN accuracy | 57.0%     | 65.0% | 70.4%                 | 74.4%

Table 2: The accuracy of the LRN with different inputs. Leave-one-subject-out cross-validation results of the LRN trained with different inputs. The best results were obtained when we interleaved the intensity and depth frames. [Column labels reconstructed from the running text; the extracted header fragments read "LRN_D" and "LRN_RGBD".]

Table 3: The average correct classification rates for the LRN with different forms of data augmentation. When we employed both forms of data augmentation, the accuracy on the test set increased considerably. [The table body did not survive extraction.]

Table 4: The confusion matrix for our proposed gesture classifier. Abbreviations: L - left, R - right, D - down, U - up, CW/CCW - clock/counter-clockwise. [Most of the 19 × 19 matrix did not survive extraction; the legible per-class correct rates (%) are 77, 74, and 66 for classes 1-3, and 75, 74, 65, 70, 88, and 84 for classes 14-19.]
[...] we get 400 FPS for the LRN, 160 FPS for the HRN, and 110 FPS for their combination. The decision-making time of our classifier is nearly half that of the baseline HOG+HOG2 feature-based method (50 FPS) [14].
4. CONCLUSIONS

We developed an effective method for dynamic hand gesture recognition with 3D convolutional neural networks. [...]

REFERENCES

[9] J. J. LaViola Jr. An introduction to 3D gestural interfaces. In SIGGRAPH Course, 2014.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278-2324, 1998.
[11] S. Mitra and T. Acharya. Gesture recognition: A survey. IEEE Transactions on Systems, Man, and Cybernetics, 37:311-324, 2007.
[12] P. Molchanov, S. Gupta, K. Kim, and K. Pulli. Multi-sensor system for driver's hand-gesture recognition. In AFGR, [...]