PII: S0031-3203(18)30030-X
DOI: 10.1016/j.patcog.2018.01.035
Reference: PR 6444
Please cite this article as: Aparna Mohanty, Rajiv R. Sahay, Rasabodha: Understanding Indian
classical dance by recognizing emotions using deep learning, Pattern Recognition (2018), doi:
10.1016/j.patcog.2018.01.035
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
aparnamhnty@gmail.com and rajivsahay@gmail.com
Computational Vision Laboratory, Department of Electrical Engineering, Indian Institute of Technology, Kharagpur, India
Abstract
Understanding human behaviour using computer vision techniques for recognizing body posture, gait, hand gestures, and facial expressions has recently witnessed significant research activity. Emotions/affect have a direct correlation with the mental state as well as the intention of a person, based on which his/her present and future states can be understood and predicted. As a case study in this work, we demonstrate the utility of deep learning in understanding videos of Indian classical dance (ICD) forms. ICD comprises hand gestures, body poses and facial expressions enacted by the performer. We capture RGB images along with associated depth information collected using the Microsoft Kinect sensor. We propose a deep learning framework using convolutional neural networks to understand the semantic meaning associated with videos of ICD by recognizing the enacted emotions (Navarasas).
1. Introduction
1. Introduction
Non-verbal communication plays an important role in human-computer interaction (HCI). It is interesting to observe that classical dance forms across the world also use non-verbal communication to engage the audience. Evidence of dance exists from pre-historic times and is still found in the paintings of the Bhimbetka rock shelters in India. Early Egyptian tomb paintings also depict dancing figures. Dance has always been part and parcel of the Indian heritage. This is evident from the archaeological excavations of the Harappa and Mohenjodaro civilisations, wherein sculptures striking various dance poses were unearthed. Dance poses can still be found inscribed on the walls of temples in India. The Chidambaram temple in a southern province of India has several dance poses of Bharatnatyam sculpted on its walls.
Dance has been a prominent means of transferring knowledge from one generation to the other since earlier times, when written scripts were not in use, and has always remained a means of emotionally connecting with the audience. Initially, dance was used to narrate stories to people through the hand gestures and facial expressions associated with it. Prior work in understanding ICD has attempted to interpret the meaning of dance through the hand gestures and poses enacted by the performer [1].
Bharatnatyam is a famous dance form in India which has been practised since time immemorial and is a living example of our heritage. It is associated with hand gestures, body postures, facial expressions, neck and eye movements, etc. The emotions/affect associated with Bharatnatyam dance are called Navarasas.
The word Navarasas is basically a combination of Nava, i.e. nine, and Rasas, or emotions. Akin to sign languages, wherein meaningful information is conveyed with the usage of hand gestures as well as facial expressions [2], Navarasas in ICD too are mainly enacted with the aid of facial expressions, which may also be augmented with appropriate upper body postures and hand gestures. Recently, affect recognition has gained popularity because of its applications in cognitive science, psychology, and human-computer interaction (HCI) [3], [4], [5]. Our work differs from several works [3], [4] in the field of affect recognition which use a multi-modal approach for recognition of emotions. In fact, we do not use two cameras to capture multiple modalities independently, such as facial expressions, body postures or hand gestures, as in [6]. Also, we do not use the audio cue for affect recognition in ICD. Rather, the analysis of Navarasas in this work involves affect recognition in ICD using both facial expressions as well as hand gestures coupled with upper body postures in some cases, and only facial expressions for other Rasas. In fact, we use a single Kinect sensor to capture both RGB and depth data while performers enact Navarasas. The word 'Rasabodha' used in the title of this paper refers to understanding the meaning associated with short videos of ICD by recognizing Navarasas enacted in them. Since there are no publicly available datasets for affect in ICD, in this work we propose a dataset of Navarasas consisting of a subset (eight out of the nine classes) of emotions, namely, Adbhuta, Bhayanaka, Bibhatsa, Hasya, Roudra, Shaanta, Shringaar, and Veera. Prior works on facial expression recognition have mostly focused on the six basic emotions (anger, disgust, fear, happiness, sadness and surprise) recorded under controlled conditions. Generally, the approaches proposed in the literature are not robust enough to work accurately in unconstrained scenarios. Facial expressions convey intricate details about the emotion associated with them. This
Figure 1: Data captured in real-world unconstrained settings. (a) Poor illumination. (b) Low resolution of face due to large distance between the camera and the performer. (c) Non-frontal appearance of the face of the performer.
is more evident in ICD, since performers of ICD rely heavily on the non-verbal mode of communication using hand gestures, body poses and facial expressions.
In real-world scenarios ICD is performed in an unconstrained environment. Recognising emotions enacted by the dancer in such unconstrained environments is always a challenge due to:
• Make-up and costume of the dancer, which might vary drastically, posing a serious challenge for the classifier.
Figure 2: Variation in the enactment of the Hasya Navarasa.
In this work we attempt to decipher the meaning associated with the emotions enacted in ICD forms. The proposed dataset of emotions/affect (Navarasas) enacted by dancers of ICD comprises both color images as well as corresponding depth information. We propose a deep learning based approach to classify the Navarasas, which include facial expressions, from both RGB and depth datasets without the need for extracting any hand-crafted features as was reported in prior works. To prove the robustness of our approach we use the proposed convolutional neural network architecture on two standard facial expression datasets, namely, the Taiwanese Facial Expression Image Database (TFEID) [7], [8] and the extended Cohn-Kanade (CK+) dataset [9].
The primary contributions of this work are:
• To the best of our knowledge, the proposed algorithm is the first to use deep learning for recognizing Navarasas in order to semantically understand ICD videos.
Figure 3: Variation in the enactment of the Bibhatsya Navarasa.
2. Prior work
Facial expression recognition in the wild, i.e. in un-controlled environments, is still far from solved due to the complexities associated with it. In our analysis of facial expressions, we manually crop the appropriate portion of the input image in order to classify the particular Navarasa. Generally, automatic facial expression analysis systems work on a small set of prototypic emotions such as disgust, fear, joy, surprise, sadness, and anger. The origin of such works can be traced back to [10].
The relation between facial expression and emotion, as well as the information conveyed by emotions, is explored in [11]. The work of [12] investigates the issues in the design and implementation of a system that could perform automated facial expression analysis. A detailed five-step approach for representation, detection, identification, expression analysis and classification based on physical features is attempted in [13]. An exhaustive survey and comparison of recent techniques for facial expression recognition in the automated facial action coding system (FACS) is presented by [14]. For analyzing the emotional state of an individual, a signal processing approach is proposed in [15] by analysing EEG signals after feature smoothing and unrelated noise removal, followed by tracking the trajectory of emotion changes with manifold learning.
An optical flow based cue was used in [16] to obtain a local parameterized model for image motion. The work of [16] handled both rigid (planar model) and non-rigid facial motion (affine-plus-curvature model). A hierarchical optical flow method was used by [17] to automatically track facial fiducial points.
The subtle changes in facial muscle action units (AUs) were analyzed for temporal behavior by [18]. Facial features have also been used in [19] for determining the age of an individual. A scheme for head-pose invariant facial expression recognition has also been explored. The work in [23] has attempted to evaluate the efficiency of an automatic emotion recognition system using a fused representation of facial expression and body motion. The importance of emotions in decision making for robots was emphasized in the work of [24]. The importance of various modalities such as body postures, speech intonation, and motion is further explored in the work in [25]. The work in [26] showed that using all multimodal information to obtain a single feature vector resulted in improved recognition accuracy as opposed to using individual modalities. The importance of information in each modality for recognizing
emotional states was further analyzed in [27]. The work in [27] discussed the importance of complementing facial expressions with body posture/motion in non-verbal communication for the determination of emotional states. A detailed survey of the field of affect recognition is given in [5], which emphasizes the importance of using information about affective behavior along with commands in human-centered interfaces. Several prior works have focused on affect-sensitive multimodal HCI systems [28], [29], [30], and [31]. The work in [32] uses facial expressions, speech and multimodal information for emotion recognition.
Deep convolutional neural networks have been used by researchers for the recognition of expressions too, as reported by [33]. However, the work in [33] only focuses on the recognition of facial expressions, not on their semantic interpretation. Moreover, the algorithm in [33] relies on a deep architecture which is computationally complex. A fully automated system for real-time facial expression recognition of six categories on monochrome frontal views in cluttered and dynamic scenes is proposed by [34]. Facial expression recognition in image sequences is investigated in [35] using geometric deformation features and support vector machines. A detailed overview of the research work in the field of facial expression recognition is presented in [36]. For facial expression recognition using smart phones, a five-layer CNN architecture with dropout and data augmentation is proposed by [37].
The authors of [38] have attempted to recognize dynamic facial expressions using 3D CNNs. Local binary patterns have been used for person-independent facial expression recognition with low-resolution video sequences captured in real-world environments in [39]. The MMI dataset for facial expressions is proposed in [40]. From a single static image, facial expressions are recognized using the topographic context as the descriptor in [41]. A novel feature descriptor, the local directional number pattern (LDN), which encodes the directional information of face textures in a compact way, is proposed for face and expression recognition by [42]. A toolbox for automatically coding the intensities of 19 different facial actions as well as 6 different prototypical facial expressions is presented in [43]. The work of [43] also focuses on estimating the location of 10 facial features.
Unlike prior works, our approach focuses on the identification of facial expressions augmented by upper body posture and hand gestures. Unlike the work in [33], our approach uses a simpler CNN architecture and focuses on the semantic interpretation of the expressions. Though the work in [37] uses a five-layer CNN for facial expression recognition, it neither focuses on the expressions associated with dance nor does it give a semantic interpretation of the expressions considered. Moreover, the work in [37] does
not analyse the impact of pre-training, and focuses mainly on the development of a mobile application for recognizing facial expressions. Hand-crafted features have been used in the past for recognition of facial expressions, as in [39], [41]. Although the work in [42] presents an approach for face analysis, it does not use deep learning. The proposed work here does not rely on any hand-crafted features and focuses on the semantic interpretation of Navarasas in ICD. We believe that our work is the first to address the highly challenging problem of semantically understanding emotions in ICD videos using a computer vision approach.
3. Proposed methodology
We propose a deep learning based approach, with both RGB images and depth data collected using the Microsoft Kinect sensor, wherein we use convolutional neural networks (CNNs). CNNs were initially proposed by [44] and were shown to be robust and effective for various complex real-world machine learning problems [44, 45]. The emotions associated with ICD in unconstrained settings are affected by clutter in the scene, the non-frontal nature of the scene, as well as illumination variations, etc. Hence, we are motivated to use CNNs in our problem.
The proposed CNN architecture is shown in Fig. 4 which consists of two
Figure 4: Block diagram of the proposed architecture used on the proposed datasets (both RGB and depth data) for the emotions of ICD, i.e. Navarasas.
convolutional and two pooling layers. The enacted Navarasas, in the form of images of size 32 × 32 pixels, are fed as input to the network. The number of nodes in the output layer depends on the number of classes, as shown in Fig. 4. The input image is convolved with 10 filter maps of size 5 × 5 pixels, resulting in 10 maps of size 28 × 28 pixels, which are sub-sampled by max-pooling to 14 × 14 pixels in layer 2. The output maps of layer 2 are convolved with each of the 20 kernels of size 5 × 5 pixels, yielding 20 maps of size 10 × 10 pixels. This is followed by sub-sampling by a factor of 2 via max-pooling to obtain 20 maps of size 5 × 5 pixels. Neurons of the fully connected output layer are connected to all neurons in the previous layer. Akin to the neurons in the convolutional layers, the output layer neurons too are modulated by a non-linear activation function, namely, the rectified linear unit, i.e. ReLU (f(x) = max(0, x)), to produce the resultant score for each class. To avoid the problem of over-fitting, a dropout layer is used as a regularizer and placed after the max-pooling layers. A major advantage of our approach is that we do away with the generation of hand-crafted features, since the proposed CNN extracts the best features from the input images to optimize
Figure 5: Block diagram of the proposed architecture used with the CIFAR-10 [46] dataset for generating the pre-trained model, which is further trained using the datasets of Navarasas.
classification performance.
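The spatial sizes traced through the network above can be verified with a short sketch (an illustrative Python fragment; the paper's implementation uses the Matlab deep learning toolbox):

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output width/height of a convolution layer."""
    return (size - kernel + 2 * pad) // stride + 1

def relu(x):
    """Rectified linear unit used after the convolutional and output layers."""
    return max(0.0, x)

size = 32                 # input Navarasa image, 32 x 32 pixels
size = conv_out(size, 5)  # conv1 (5 x 5 kernels): 10 maps of 28 x 28
size //= 2                # max-pool by 2 -> 14 x 14 (layer 2)
size = conv_out(size, 5)  # conv2 (5 x 5 kernels): 20 maps of 10 x 10
size //= 2                # max-pool by 2 -> 5 x 5
print(size)               # -> 5
```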
We show the impact of changing the architecture of the CNN on the recognition task at hand. We also show the advantages of using transfer learning. A pre-trained model using the proposed CNN architecture in Fig. 5 was obtained by training the network on the CIFAR-10 [46] dataset from randomly initialized weights. We also used the publicly available pre-trained "AlexNet" [45] model, which was already trained from random initial weights on the ImageNet dataset [47].
We now provide details of another pre-trained CNN using the CIFAR-10 [46] dataset, with a bigger architecture comprising 5 convolutional layers, as shown
Figure 6: Block diagram of the AlexNet [45] architecture used on the ImageNet dataset [47] for creating the pre-trained model. Note that this architecture does not show the use of 2 GPUs.
in Fig. 5. Image patches of size 32 × 32 × 3 pixels were fed as input to the architecture, with 10 kernels in the first convolutional layer with a stride of 1 pixel. This layer is followed by a cross-channel normalization layer; the subsequent convolutional layers also use kernels with a stride of 1 pixel. The fourth and fifth convolutional layers consist of 128 kernels of size 3 × 3 × 128 and 256 kernels of size 3 × 3 × 128, respectively. The third and fifth convolutional layers are followed by dropout of 40% each. The fully connected layer has 500 neurons.
The pre-trained model obtained using the ImageNet dataset [47] has an architecture as shown in Fig. 6.
Neurons of the fully connected layers are connected to all neurons in the previous layer. A ReLU non-linearity follows the convolutional layers as well as the fully connected layers. The input to this architecture is an image of size 224 × 224 × 3 pixels, which is convolved with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels. The output of the first convolutional layer is response-normalized and pooled before being fed as input to the second layer. The second layer has 256 filters of size 5 × 5 × 48. The third, fourth and fifth convolutional layers are connected to each other without any pooling or normalization layers. In the third layer, there are 384 kernels of size 3 × 3 × 256 connected to the normalized and pooled outputs of the previous layer. Similarly, the fourth convolutional layer has 384 kernels of size 3 × 3 × 192 and the fifth layer has 256 kernels of size 3 × 3 × 192. The fully connected layer has 4096 neurons. AlexNet [45] is publicly available; its detailed architecture is represented in Fig. 6.
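A commonly noted detail of AlexNet's first layer is that with a nominal 224-pixel input, (224 − 11)/4 + 1 is not an integer; implementations effectively operate on a 227-pixel input (or use equivalent padding), which yields the familiar 55 × 55 output maps. A sketch of the check (illustrative Python, not part of the paper's Matlab code):

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output width/height of a convolution layer."""
    return (size - kernel + 2 * pad) // stride + 1

# 11 x 11 kernels, stride 4: an effective 227-pixel input yields 55 x 55 maps
print(conv_out(227, 11, stride=4))   # -> 55
```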
Note that in our experiments we used the pre-trained AlexNet [45], without the last classification layer, only for extracting features from images of the proposed datasets. The 4096-dimensional feature vector from the penultimate layer of the pre-trained AlexNet [45] is fed as input to train an SVM classifier on the proposed datasets. In contrast to the smaller pre-trained CNNs of Figs. 4 and 5, whose weights in various layers were fine-tuned during training, for the deeper and more complex AlexNet [45] we only train the SVM classifier without touching the pre-trained weights in its layers.
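The frozen-feature pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: `extract_features` is a hypothetical stand-in for the frozen penultimate AlexNet layer, and a nearest-centroid rule stands in for the SVM actually used.

```python
import math

def extract_features(image):
    """Hypothetical stand-in for the frozen penultimate AlexNet layer
    (4096-dimensional in the paper); here only four toy statistics."""
    return [sum(image) / len(image), max(image), min(image),
            sum(abs(x) for x in image)]

class NearestCentroid:
    """Classifier trained on the frozen features; a nearest-centroid rule
    stands in here for the SVM used in the paper."""
    def fit(self, feats, labels):
        self.centroids = {}
        for y in set(labels):
            rows = [f for f, l in zip(feats, labels) if l == y]
            self.centroids[y] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, feat):
        return min(self.centroids,
                   key=lambda y: math.dist(feat, self.centroids[y]))

# toy "images": two well-separated clusters standing in for two Rasas
train = [[1.0, 1.0, 1.0], [0.9, 1.1, 1.0], [-1.0, -1.0, -1.0], [-1.1, -0.9, -1.0]]
labels = ["Hasya", "Hasya", "Roudra", "Roudra"]
clf = NearestCentroid().fit([extract_features(im) for im in train], labels)
print(clf.predict(extract_features([1.05, 0.95, 1.0])))   # -> Hasya
```

The design point is that only the classifier is trained; the feature extractor's weights are never updated, mirroring the frozen AlexNet layers described above.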
The proposed models were also trained and tested on the extended Cohn-Kanade (CK+) [9] data. The trained models, whose architectures are shown in Figs. 4 and 6, were also tested on the real-world unconstrained Navarasas obtained from professional dancers.
4. Datasets
Various ICD forms can trace their roots to the ancient Indian text of the Natya Shastra [48]. For example, syntactic and semantic descriptions of body postures,
Figure 7: (a)–(h).
emotions (Navarasas), hand gestures, and neck and eye movements of Bharatnatyam have been detailed in the Natya Shastra [48]. This ancient treatise also describes the art of conveying emotions (Rasas) with various body/hand movements along with facial expressions [49]. Hence, our proposed dataset depicting emotions in ICD, namely the Navarasas, is captured considering the relevant portions of the body.
CVLND-RGB: Images of the eight Navarasas associated with ICD are collected under controlled laboratory settings. This dataset is collected from a total of fourteen persons, each enacting each emotion (Rasa) 10 times. All images are of size 1920 × 1080 pixels, captured with a Microsoft Kinect camera. We obtained the data for eight distinct emotions, namely, Adbhuta, Bhayanaka, Bibhatsya, Hasya, Roudra, Shaanta, Shringaar and Veera. We name this dataset the Computational Vision Laboratory Navarasas Data-RGB (CVLND-RGB).
Figure 8: (a)–(h).
CVLND-D: Depth maps corresponding to the enacted Navarasas are also collected. These depth maps, of size 512 × 424 pixels, are captured using the Microsoft Kinect sensor. The dataset is named the Computational Vision Laboratory Navarasas Data-Depth (CVLND-D) and comprises eight distinct Navarasas identical to those in CVLND-RGB.
Figure 9: A snapshot of Navarasas collected under an unconstrained real-world scenario for the classes of Adbhuta, Bhayanaka, Bibhatsya, Hasya, Roudra, Shaanta, Shringaar and Veera, respectively. Note the variation in illumination and the presence of costume, make-up, and ornaments in the dataset captured in a real concert.
A snapshot of this dataset is shown in Fig. 9. Note the challenges associated with this dataset, since it is captured in uncontrolled conditions. In particular, one can observe variation in the illumination, costume, make-up and jewellery worn by the dancer. Also, faces appear at different spatial resolutions and non-frontal to the camera. Background clutter due to curtains in the backdrop adds to the complexity of this dataset.
5. Experimental results
Feature   Accuracy
SURF      39.37%
BRISK     15.0%
LBP+SVM   66.75%
Table 1: Performance of hand-crafted features on the proposed CVLND-RGB dataset.
An SVM classifier with a suitable kernel is trained to recognize the emotions. While obtaining the HoG feature vector, a dense grid was considered over the entire image by computing a 9-bin histogram of gradients over cells of size 8 × 8 pixels and blocks of 2 × 2 cells. The SIFT, SURF and BRISK features used were keypoint based, and the feature vector length depended on the number of keypoints detected per image. The descriptor length for each keypoint is 128 for SIFT and 64 for SURF and BRISK features.
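The length of the dense HoG descriptor follows directly from the cell, block and bin parameters above; a short sketch (illustrative Python, assuming the conventional block stride of one cell, which the paper does not state explicitly):

```python
def hog_length(width, height, cell=8, block=2, bins=9):
    """Length of a dense-grid HoG descriptor: blocks of `block` x `block`
    cells slide by one cell; each block contributes block*block*bins values."""
    cells_x, cells_y = width // cell, height // cell
    blocks_x, blocks_y = cells_x - block + 1, cells_y - block + 1
    return blocks_x * blocks_y * block * block * bins

# e.g. a hypothetical 64 x 64 crop with the parameters above
# (8 x 8 cells, 2 x 2 blocks, 9 bins): 7 * 7 blocks * 36 values
print(hog_length(64, 64))   # -> 1764
```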
Note that the SURF and BRISK detectors failed to perform on the depth data due to their failure in identifying keypoints on the smooth, texture-less depth maps. The SURF features performed poorly, with accuracies of 39.37% and 12.5% on the CVLND-RGB and CVLND-D data, respectively. The performance obtained with the BRISK detector is merely 15.0% and 12.5% on the CVLND-RGB
Feature   Accuracy
SURF      12.5%
BRISK     12.5%
LBP+SVM   60.625%
Table 2: Classification results using hand-crafted features on the proposed CVLND-D dataset.
and CVLND-D data, respectively. The accuracy obtained with LBP features followed by an SVM classifier is 66.75% on the proposed CVLND-RGB dataset and 60.625% on the proposed CVLND-D data.
5.2. Results on the proposed real-world dataset of Navarasas in ICD
The HoG features followed by an SVM classifier yielded an accuracy of 83.39%. The accuracies obtained with the keypoint detector based approaches, namely SIFT, SURF, and BRISK, on this data are 59.10%, 47.32%, and 39.46%, respectively. The use of local binary patterns as features followed by an SVM classifier resulted in an accuracy of 60.2% on the real-world Navarasa data, as reported in Table 3. In Fig. 10 we show the inverse HoG features [50] extracted from the images of Fig. 9. Note that these visualizations are perceptually intuitive for humans, which justifies the performance obtained using the HoG features.
The proposed CNN architecture, shown in Fig. 4, is now evaluated. The CVLND-RGB dataset of 1120 images was divided into a training set of 800 images collected from 10 different individuals, each enacting an expression 10 times for the 8 Navarasas. Similarly, 160 images of emotions collected from 2 individuals, each enacting an expression 10 times
Feature   Accuracy
SURF      47.32%
BRISK     39.46%
LBP+SVM   60.1786%
Table 3: Classification performance of hand-crafted features on the proposed real-world Navarasas dataset.
Figure 10: Inverse HoG [50] features extracted from images of various Rasas as shown in Fig. 9.
for all the eight Navarasas were used for the validation set, and the data from the remaining 2 individuals were used for the test set. Therefore, 800 images were used for training, while 160 images each were used for the validation and test sets, respectively. The same split of data was repeated for the corresponding depth maps of CVLND-D. The proposed architecture, as shown in Fig. 4, was implemented using the deep learning toolbox of Matlab R2017a and yielded an accuracy of 90.63% on the test set for RGB images, while an accuracy of 94.75% was obtained for the depth data of CVLND-D. Details of the experiment are reported in the first two rows of Table 4. Note that we used the ReLU activation function and two dropout layers after the pooling layers in the proposed architecture. The weights of the proposed CNN were randomly initialized.
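The person-disjoint split above (14 persons, each contributing 8 Rasas × 10 enactments = 80 images; 10 persons for training, 2 for validation, 2 for testing) can be sketched as follows (an illustrative Python fragment; person identifiers are hypothetical):

```python
def subject_split(persons, train_n=10, val_n=2, rasas=8, repeats=10):
    """Split image counts by person so that no individual appears in more
    than one of the training, validation and test sets."""
    per_person = rasas * repeats                 # 80 images per person
    train = persons[:train_n]
    val = persons[train_n:train_n + val_n]
    test = persons[train_n + val_n:]
    return tuple(len(group) * per_person for group in (train, val, test))

# 14 persons -> (800, 160, 160) images, matching the CVLND-RGB split
print(subject_split(list(range(14))))
```

Splitting by person rather than by image keeps the evaluation honest: the network is never tested on expressions of an individual it saw during training.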
CVLND depth: (0.001, 10000)
Navarasas (real-world): 8 classes, 2800 training and validation images, 560 test images, 100% training accuracy, 83.9% test accuracy (0.001, 5000)
Table 4: Performance of the proposed CNN model in Fig. 4 on our CVLND-RGB and CVLND-D datasets and real-world Navarasa images.
Figure 11: (a) through (d) represent the output of the first filter at the 1st and 2nd convolutional layers, at the output of the penultimate layer and the final layer, respectively, for the Adbhuta Navarasa. (e) through (h) represent the same sequence of outputs for the Hasya Navarasa.
The outputs obtained at the intermediate layers of the proposed CNN are shown in Figs. 11 and 12 for the CVLND-RGB and real-world Navarasas data, respectively. As evident from the snapshots of Figs. 11 and 12, the weights of the CNN in the various layers of the architecture extract the prominent features for classification of the input data. Note that the outputs of the various layers consist of fine-grained features.
Figure 12: (a) through (d) represent the output of the first filter at the 1st and 2nd convolutional layers, at the output of the penultimate layer and the final layer, respectively, for the Adbhuta Navarasa. (e) through (h) represent the same sequence of outputs for the Hasya Navarasa.
The real-world Navarasas data was also divided into training, validation and test sets. The training set comprises data collected from 9 professional dancers for the eight Navarasas, each enacting each expression 35 times. The validation set is composed of data collected from 1 individual enacting the 8 expressions 35 times. The test data is composed of the data collected from 2 dancers, each enacting the 8 emotions 35 times. In summary, the real-world Navarasas data consists of 2520 images in the training set, 280 photographs in the validation set and 560 images in the test set. The proposed CNN of Fig. 4 resulted in an accuracy of 83.9% on the test set, as shown in the third row of Table 4. Note that the accuracies reported in Table 4 are obtained using a six-fold cross-validation approach to train the proposed CNN on the CVLND-RGB and CVLND-D datasets, while a ten-fold cross-validation strategy is followed to train the CNN on the real-world Navarasas dataset. The values inside brackets in Table 4 denote the learning rate and number of epochs, respectively.
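The fold construction for the six-fold and ten-fold cross-validation mentioned above can be sketched as follows (an illustrative Python fragment with contiguous folds; the paper does not specify how folds were drawn):

```python
def kfold_indices(n, k):
    """Partition n sample indices into k near-equal contiguous folds; each
    fold serves once as the held-out set (minimal k-fold cross-validation)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # each entry: (training indices, held-out indices)
    return [(sum(folds[:i] + folds[i + 1:], []), folds[i]) for i in range(k)]

# ten folds over the 2800 real-world training/validation images
splits = kfold_indices(2800, 10)
print(len(splits), len(splits[0][1]))   # -> 10 280
```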
The proposed architecture shown in Fig. 4 was chosen after rigorous experimentation by varying the architecture and experimenting with different ratios of training, validation and test data. Details of several experiments in which the architecture of the proposed CNN was varied are tabulated in Table 5. The architecture in Fig. 4 was modified by increasing the number of convolutional
CVLND depth: 8 classes, 960 training and validation images, 160 test images; 99% training and 86.25% test accuracy with 3 convolutional layers (0.001, 10000); 100.0% training and 91.87% test accuracy with 4 convolutional layers (0.001, 10000)
Navarasas (real-world): 8 classes, 2800 training and validation images, 560 test images; 100% training and 63.75% test accuracy with 3 convolutional layers (0.001, 5000); 100.0% training and 73.57% test accuracy with 4 convolutional layers (0.001, 5000)
Table 5: Performance of the end-to-end CNN with 3 and 4 convolutional layers on the proposed datasets. Note the results reported are on gray-scale images. The values inside brackets denote the learning rate and number of epochs, respectively.
layers successively and tested on the proposed datasets using the deep learning toolbox of Matlab R2017a, resulting in the accuracies reported in Table 5. Test accuracies of 66.87%, 86.25%, and 63.67% were obtained on the CVLND-RGB, CVLND-D, and real-world unconstrained Navarasas data, respectively, by using 3 convolutional layers in the proposed CNN architecture, as reported in the 6th column of Table 5. Similarly, the test accuracies obtained using the four convolutional layer architecture are reported in the 8th column of Table 5.
Note that there is a drop in performance on the proposed datasets, as can be observed from the test accuracies reported in the final column of Table 4 and the 6th and 8th columns of Table 5. The results in Tables 4 and 5 denote the accuracies obtained on gray-scale images. The superior performance obtained using 2 convolutional layers in the architecture of Fig. 4, as reported in Table 4, acted as the design motivation for the proposed architecture.
The interesting impact of varying the split of training, validation and test data is further demonstrated in Tables 6 and 7. Experiments were performed by varying the number of images in the test set for the CVLND (RGB and depth) datasets, as shown in Table 6. A similar experiment was also performed for the real-world Navarasas data, as reported in Table 7. Note that accuracies of 91.25% and 88.5% were obtained while considering 400 images in the test set with a 6 : 3 split of the training and validation data for the proposed CVLND-RGB and CVLND-D datasets, respectively, as seen in the 7th column of Table 6.
Columns 5 and 6 of Table 6 detail the test accuracies obtained by varying the training and validation split while considering 320 images in the test set. Similarly, the 7th and 8th columns of Table 6 detail the variation in accuracy considering varying training-validation splits for the case of 400 images in the test set. A similar set of experiments was performed for the real-world unconstrained Navarasas dataset, as in Table 7. Note that the values reported in Tables 6 and 7 were obtained considering gray-scale images of the proposed datasets. Note that the values inside brackets in Tables 6 and 7 denote the learning rate and number of epochs, respectively. These experiments guided our choice of the split of training, validation and test data.
Table 6: Performance obtained using the proposed CNN model in Fig. 4 by varying the split of training, validation, and testing data for the proposed CVLND-RGB and CVLND-D datasets.
Table 7: Performance obtained using the proposed CNN model in Fig. 4 by varying the split of training, validation, and testing data for the real-world unconstrained Navarasas data. Columns correspond to 1120 and 1400 images in the test set, with train:validation splits of 5:3, 6:2, 4:3 and 5:2 over the total set of persons.
CVLND depth: (0.001, 2000)
Navarasas (real-world): 8 classes, 16800 training and validation images, 560 test images, 100% training accuracy, 86.79% test accuracy (0.001, 3000)
Table 8: Performance of the proposed CNN model in Fig. 4 on the augmented proposed CVLND-RGB and CVLND-D datasets and real-world Navarasa images.
The training images were augmented in five different ways, followed by flipping the pictures along the vertical axis. Note that augmentation was performed only for the training dataset, and no augmentation was carried out for the test database. Note the improvement in the accuracies obtained on each of the proposed datasets in Table 8 using the augmented gray-scale data. The values inside brackets in the table denote the learning rate and number of epochs, respectively. Test accuracies of 91.25%, 95.1%, and 86.79% were obtained on the augmented CVLND-RGB, CVLND-D and real-world Navarasas data, respectively, using the two convolutional and two pooling layers of the proposed architecture, which is implemented with the deep learning toolbox of Matlab R2017a. Again, the weights of the proposed CNN were randomly initialized.
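The flipping step above can be sketched as follows (an illustrative Python fragment on a nested-list image; the other five augmentation operations are not specified here and are omitted):

```python
def flip_horizontal(image):
    """Mirror an image (a list of rows of pixel values) about its vertical axis."""
    return [list(reversed(row)) for row in image]

def augment(images):
    """Keep each original training image and add its mirrored copy,
    a sketch of one of the augmentation steps described above."""
    out = []
    for im in images:
        out.append(im)
        out.append(flip_horizontal(im))
    return out

img = [[1, 2, 3],
       [4, 5, 6]]
print(flip_horizontal(img))   # -> [[3, 2, 1], [6, 5, 4]]
print(len(augment([img])))    # -> 2
```

Applying such transforms only to the training set, as done above, leaves the test distribution untouched while enlarging the effective training data.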
Table 9: Performance of the end-to-end CNN with 3 and 4 convolutional layers on the proposed datasets. Note the results reported are on augmented gray-scale images. The values inside brackets in the table denote the learning rate and number of epochs, respectively.
25
Test accuracies of 81.25%, 92.3%, and 84.82% were obtained on the augmented datasets of CVLND-RGB, CVLND-D and real-world Navarasas data, respectively, as shown in the final column of Table 9. Note the drop in performance of the test accuracies for the augmented datasets reported in the 6th and 8th columns of Table 9 as compared to the test accuracies reported in the final column of Table 8. Similar to Tables 4 and 5, the results reported in Tables 8 and 9 were obtained for gray-scale images.
Table 10: Performance of the proposed CNN model in Fig. 4 on the proposed CVLND-RGB dataset and real-world Navarasas images. The values inside brackets in the table denote the learning rate and number of epochs, respectively.

Data         Classes   Training and     Testing   2 Convolutional layers
                       Validation set   set       Training acc.   Testing acc.
CVLND-RGB    8         960              160       99.5%           97.5% (0.001, 5000)
The proposed architecture of Fig. 4 was modified to take as input color images of the proposed dataset. Test accuracies of 97.5% and 83.93% were obtained for the CVLND-RGB and real-world unconstrained Navarasas datasets, respectively. The proposed CNN was initialized with random weights. Note that, as in the case of gray-scale images, a drop in accuracy was observed for color images on varying the convolutional layers, as reported in the 6th and 8th columns of Table 11. Test accuracies of 81.25% and 86.61% were obtained for the architectures comprising 4 convolutional layers for the CVLND-RGB and real-world unconstrained Navarasas
data, respectively, as shown in the final column of Table 11. The details of the accuracies obtained by varying the proposed architecture are reported in the 6th and 8th columns of Table 11. Note that the values inside brackets in the table denote the learning rate and number of epochs, respectively.
Table 11: Performance of the end-to-end CNN with 3 and 4 convolutional layers on the proposed datasets. Note that the proposed CNN takes color images as input, and hence we do not report on the CVLND depth data.

Data          Classes  Training and    Testing  3 Convolutional          4 Convolutional
                       Validation set  set      Training   Testing       Training   Testing
CVLND-RGB     8        960             160      100%       71.25%        100.0%     81.25%
                                                (0.001, 5000)            (0.001, 5000)
Navarasas     8        2800            560      100%       75.54%        100.0%     86.61%
(real-world)                                    (0.001, 5000)            (0.001, 5000)
Again, similar to the set of experiments performed for the augmented gray-scale images (Tables 8 and 9), experiments were performed for the augmented color images, as reported in Tables 12 and 13, respectively. The proposed CNN was initialized with random weights. Note that the same procedure was followed for augmenting the color dataset as was used for the gray-scale images.
Test accuracies of 98.1% and 87.86% were obtained for the augmented proposed color datasets of CVLND-RGB and real-world Navarasas, as reported in the final column of Table 12. Since CVLND-D contains single-channel data, the results reported in Tables 12 and 13 are obtained only for the CVLND-RGB and real-world unconstrained Navarasas datasets, where the values inside brackets in the tables denote the learning rate and number of epochs, respectively.
Table 12: Performance of the proposed CNN model in Fig. 4 on the augmented proposed CVLND-RGB dataset and real-world Navarasa images. (Row fragment recoverable from the extraction: Navarasas (real-world), learning rate 0.001, 5000 epochs.)
Table 13: Performance of the end-to-end CNN with 3 and 4 convolutional layers on the proposed datasets. Note that the results reported are on augmented color images.

Data          Classes  Training and    Testing  3 Convolutional          4 Convolutional
                       Validation set  set      Training   Testing       Training   Testing
CVLND-RGB     8        5760            160      100%       85.62%        100.0%     86.0%
                                                (0.001, 2000)            (0.001, 4000)
Navarasas     8        16800           560      100%       86.8%         100.0%     85.36%
(real-world)                                    (0.001, 3000)            (0.001, 3000)
Note that two dropout layers are placed after the respective pooling layers, and the third dropout layer appears after the fully connected layer in the proposed architecture shown in Fig. 4. The test accuracy obtained by the proposed CNN on the CIFAR-10 [46] data is 61.6%. The converged weights obtained by training on CIFAR-10 [46] data were used for initialization of the CNN, which was trained further on the augmented CVLND-RGB, CVLND-D and the real-world unconstrained Navarasas datasets using the deep learning toolbox in Matlab R2017a. The details of the accuracies are reported in Table 14. Note that we obtained accuracies of 94.0%, 95.5%, and 88.0% for the augmented CVLND-RGB, CVLND-D, and real-world unconstrained Navarasas data, respectively, as opposed to the accuracies reported in the final column of Table 8. Additionally, the use of a pre-trained model of the proposed CNN using CIFAR-10 [46] reduced the number of training epochs (reported in the seventh column of Table 14) in contrast to the large number of epochs needed for training the CNN from random initial weights.
Table 14: Performance of the proposed CNN in Fig. 4 (pre-trained on the CIFAR-10 [46] data) on the augmented CVLND-RGB and CVLND-D datasets and real-world Navarasas database. (Rows recoverable from the extraction; the CVLND-RGB row was lost.)

Data          Classes  Training and    Testing  Batch  Learning  Epochs  Testing
                       Validation set  set      size   rate              accuracy
CVLND-depth   8        5760            160      10     0.001     1000    95.5%
Navarasas     8        16800           560      10     0.001     1000    88.0%
(real-world)
We also investigated the utility of a deeper CNN, shown in Fig. 5, pre-trained on the CIFAR-10 [46] dataset after pre-processing the augmented images. Note that the input to the network was the pre-processed augmented CIFAR-10 [46] dataset, obtained by considering the original data along with its flipped version. The pre-processing is achieved by converting the images from the RGB to the YUV domain. The individual channels were then normalized using their respective mean and standard deviation values. The pre-processing implemented for the CIFAR-10 [46] dataset is inspired by the work in [51]. Hence, a total of 100,000 pre-processed training images (50,000 original and 50,000 flipped images) and 10,000 test images were fed as input to the deeper architecture in Fig. 5. Note that only the training data was augmented. The network was trained for 100 epochs with a learning rate of 0.001 using the deep learning toolbox of Matlab R2017a, resulting in a training accuracy of 94.0% and a test accuracy of 81.53%. The advantage of using a deeper pre-trained CNN is evident from the results reported in Table 15. Using this deeper pre-trained model of Fig. 5 for the un-augmented proposed datasets of CVLND-RGB and real-world Navarasas, we obtained test accuracies of 97.8% and 89.82%, respectively, as shown in the 6th column of Table 15. The accuracies obtained on the augmented CVLND-RGB and real-world Navarasas datasets are 98.3% and 94.64%, as shown in the 10th column of Table 15.
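The RGB-to-YUV conversion and per-channel normalization described above can be sketched in NumPy. The paper used the Matlab deep learning toolbox; the ITU-R BT.601 conversion matrix used here is an assumption, as the exact coefficients are not given in the text.

```python
import numpy as np

# BT.601 RGB -> YUV conversion matrix (an assumption; the paper does not
# state which RGB-to-YUV convention was used).
RGB2YUV = np.array([[ 0.299,  0.587,  0.114],
                    [-0.147, -0.289,  0.436],
                    [ 0.615, -0.515, -0.100]])

def preprocess(images):
    """images: float array of shape (N, H, W, 3) with RGB values in [0, 1].
    Converts each image to YUV, then standardizes each channel by the
    channel's mean and standard deviation over the whole batch."""
    yuv = images @ RGB2YUV.T                        # RGB -> YUV per pixel
    mean = yuv.mean(axis=(0, 1, 2), keepdims=True)  # per-channel statistics
    std = yuv.std(axis=(0, 1, 2), keepdims=True)
    return (yuv - mean) / (std + 1e-8)

def augment_flip(images):
    """Doubles the training set with mirror flips, as described above."""
    return np.concatenate([images, images[:, :, ::-1, :]], axis=0)
```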
Table 15: Performance of the deeper CNN model (Fig. 5) pre-trained on CIFAR-10 [46] data on the un-augmented and augmented proposed CVLND-RGB dataset and real-world Navarasas data. Note that the values inside brackets in the table denote the learning rate and number of epochs, respectively. (The entries of the CVLND-RGB row other than the learning rates and epochs were lost in extraction.)

                       Un-augmented                         Augmented
Data          Classes  Train+Val  Test  Training  Testing   Train+Val  Test  Training  Testing
CVLND-RGB     8        (0.001, 500)                         (0.001, 2000)
Navarasas     8        2800       560   100%      89.82%    16800      560   100.0%    94.64%
(real-world)                            (0.001, 2000)                        (0.001, 2000)

Note that the pre-trained AlexNet [45] consists of eight layers and is initially trained on the ImageNet dataset [47], which has 1000 object categories and 1.2 million images. This pre-trained AlexNet [45] model is tested on the proposed CVLND (both RGB and depth) datasets and the real-world Navarasas database. The final layer of the architecture shown in Fig. 6 was removed, and the feature vector of length 4096 was extracted and fed to an SVM for classification. The accuracies obtained on the test data of CVLND-RGB, CVLND-D and real-world unconstrained Navarasas data are 97.50%, 95.63% and 88.04%, respectively, as reported in Table 16. Note that these results are obtained using gray-scale images of the proposed datasets.
The impact of the bigger AlexNet [45] architecture trained on the ImageNet data [47] is then analysed by fine-tuning the SVM to operate on the proposed color datasets, and the results are reported in Table 17. Test accuracies of 98.5% and 95.2% were obtained on the CVLND-RGB and real-world Navarasas data, respectively, using the pre-trained AlexNet [45] of Fig. 6. Note the improvement in accuracy achieved by using the color images, as reported in the final column of Table 17.
Training the complete AlexNet [45] will consume an inordinate amount of time and computational resources. Note that we use the AlexNet pre-trained CNN (without the last classification layer) only for extracting features from images of the proposed datasets. In the AlexNet+SVM technique, only the SVM classifier is fine-tuned during training, whereas the weights of the different layers of AlexNet are left untouched. This is unlike the smaller pre-trained CNNs of Figs. 4 and 5, wherein the weights of the various layers are fine-tuned during training on the proposed datasets. We adopt this procedure for transfer learning with the deeper and more complex AlexNet, since training its weights further on our datasets would be computationally expensive.
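The transfer-learning setup described above, a frozen feature extractor with only a linear classifier trained on top, can be illustrated with a minimal one-vs-rest hinge-loss classifier in NumPy. The paper does not specify its SVM solver; the subgradient-descent routine and its hyperparameters below are illustrative assumptions, with arbitrary feature vectors standing in for the 4096-dimensional AlexNet features.

```python
import numpy as np

def train_linear_svm(feats, labels, n_classes, lam=1e-3, epochs=200, lr=0.1):
    """Trains a one-vs-rest linear SVM (hinge loss, subgradient descent) on
    frozen feature vectors: only these weights are learned, mirroring the
    AlexNet+SVM setup where the feature extractor is never updated.
    Solver and hyperparameters are illustrative, not the paper's."""
    n, d = feats.shape
    W = np.zeros((n_classes, d))
    b = np.zeros(n_classes)
    Y = np.where(labels[:, None] == np.arange(n_classes), 1.0, -1.0)  # +-1 targets
    for _ in range(epochs):
        margins = Y * (feats @ W.T + b)        # (n, n_classes)
        active = (margins < 1).astype(float)   # samples violating the margin
        W -= lr * (lam * W - (active * Y).T @ feats / n)
        b -= lr * (-(active * Y).sum(axis=0) / n)
    return W, b

def predict(W, b, feats):
    """Assigns each feature vector to the class with the highest score."""
    return np.argmax(feats @ W.T + b, axis=1)
```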
Table 16: Performance of the pre-trained AlexNet [45] + SVM classifier on the proposed Navarasas datasets. Note that the results reported are on gray-scale images.

Data          No. of   Total  Training and    Testing  Validation accuracy  Testing accuracy
              Classes  data   Validation set  set      CNN-SVM              CNN-SVM
CVLND-RGB     8        1120   960             160      98.18%               97.50%
CVLND-depth   8        1120   960             160      97.92%               95.63%
Navarasas     8        3360   2800            560      99.91%               88.04%
(real-world)
Table 17: Performance of the pre-trained AlexNet [45] + SVM classifier on the proposed datasets of CVLND-RGB and real-world Navarasas data. Note that the results reported are on color images. (Column headings recoverable from the extraction: Data; No. of Classes; Total data; Training and Validation set; Testing set; Validation accuracy CNN-SVM; Testing accuracy CNN-SVM. The data rows were lost in extraction.)
We also evaluate the proposed CNN on two standard facial expression datasets: the Taiwanese facial expression image database (TFEID) [7] [8] and a subset of the extended Cohn-Kanade (CK+) dataset [9].

The TFEID database [7] [8] consists of 336 images corresponding to 8 facial expressions enacted by both men and women. We randomly selected a subset of 267 images as training data, 25 images for the validation set, and 44 test images. We used the proposed CNN architecture shown in Fig. 4 with random weight initialization. The average training accuracy achieved on the TFEID [7] [8] dataset for the above case is 99%. The average test accuracy achieved on this data using the proposed CNN is 98.1%, obtained using a learning rate of 0.001 and a batch size of 10 after 3000 epochs. Details of the experiment are provided in Table 18. We also report the performance of the proposed CNN on a subset of the extended Cohn-Kanade (CK+) [9] dataset comprising 593 sequences collected from 123 subjects. We report on a set consisting of five expressions, namely, 'Anger', 'Disgust', 'Happy', 'Surprise' and 'Neutral'. The data were sorted using the annotations provided. The images categorised as 'Neutral' were obtained from the initial frame of each sequence. The proposed CNN in Fig. 4 achieved an accuracy of 92.0% on the test data with a learning rate of 0.001 and a batch size of 10 after 2000 epochs, as reported in Table 18. The accuracies obtained on the TFEID and CK+ datasets are comparable to the average state-of-the-art accuracies of 99.63% [8] and 94.09% [52], respectively.
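The random 267/25/44 split of the 336 TFEID images described above can be sketched as follows; the seeded shuffle is an assumption for reproducibility, not the paper's procedure.

```python
import random

def split_dataset(items, n_train, n_val, seed=0):
    """Random split into train/validation/test sets; whatever remains after
    the first n_train + n_val shuffled items becomes the test set.
    The fixed seed is an illustrative choice."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```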
Table 18: Performance of the proposed CNN model of Fig. 4 (pre-trained on the CIFAR-10 [46] data) on the TFEID [7] [8] and CK+ [9] standard facial expression datasets.
Table 19: Comparison of computation time for the CNN of Fig. 4 (with and without pre-training) with AlexNet [45] (with and without SVM classifier) on the proposed datasets. Note that all results reported in the table are obtained using gray-scale images.

Data          CNN of Fig. 4     CNN of Fig. 4    AlexNet + SVM        AlexNet
              (random weights)  (pre-trained)    (SVM training only)  (random weights)
CVLND-RGB     approx. 3 hours   approx. 1 hour   approx. 10 minutes   approx. 8 hours
CVLND-depth   approx. 3 hours   approx. 1 hour   approx. 10 minutes   approx. 8 hours
Navarasas     approx. 7 hours   approx. 2 hours  approx. 15 minutes   approx. 13 hours
(real-world)
Note that though AlexNet is a quite shallow network (compared to other recently popular ones such as VGG [53]), its training/fine-tuning is not trivial. We observe that the time complexity of the proposed CNN architecture in Fig. 4 is much less. In Table 19, we provide comparison details of the computation time for the various architectures. The second column of Table 19 shows the time incurred during training for the CNN architecture of Fig. 4 initialized with random weights. The third column shows the pre-trained CNN of Fig. 4 during training on the proposed datasets. The fourth column shows the time incurred for training only the SVM classifier coupled with a pre-trained AlexNet [45] model. The last column in the table depicts the time taken for training the AlexNet [45] architecture from random initial weights. Note that here the last layer of the AlexNet [45] performs softmax classification. Note that all values reported in Table 19 are obtained for gray-scale images. With the use of the pre-trained AlexNet [45] + SVM classifier, a substantial amount of training time is saved (Table 19), as expected, while yielding only a marginal improvement in the accuracies obtained (Table 16).
6. Semantic interpretation of a Shloka
Our work is related to prior research on fine-grained activity recognition [55] [54]. The work in [55] proposes a dataset for semantic activities, e.g., those involved during cooking. However, prior work does not address the problem of semantic interpretation of short videos of ICD. The understanding of ICD videos would aid choreographers and professional dancers in composing new dance pieces and training new dancers. We believe that our work is the first to address the highly challenging problem of semantically understanding ICD using a computer vision approach with the aid of Navarasas.
In ICD, songs are enacted by the dancer using a blend of body postures, hand gestures and emotions. These songs are written in the ancient Sanskrit language and are known as Shlokas. We demonstrate the utility of our deep learning based approach for interpreting one Shloka by recognizing the emotions enacted by the dancer. Note that in this Shloka the dancer expresses the emotions of the consort of the Hindu god, Siva. This is the 51st Shloka of the Saundarya Lahari, which reads as follows:
Shive shringarardra
taditarajane kutsanapara
sarosham gangayam
girishacharite vismayavati
harahibhyo bhita
sarasiruha sowbhagya janani
sakhi sushmera
te mayi janani drishti sakarunam
The above Shloka expresses the emotions seen in the Divine Mother's eyes: Shive shringarardra - love on seeing Siva; taditarajane kutsanapara - disgust towards other men; sarosham gangayam - jealousy on seeing Ganga; girishacharite vismayavati - wonder when she hears of the deeds of Siva; harahibhyo bhita - fear when she sees the snakes that adorn Siva as garlands; sarasiruha sowbhagya janani - her face is as lovely as a lotus, symbolizing heroism; sakhi sushmera - an indulgent smile when she sees her girl friends; te mayi janani drishti sakarunam - she looks with compassion at her devotees.
As the Shloka is enacted using various Navarasas, we attempt to identify them, thereby interpreting the semantic meaning of the dance piece. The trained classifiers obtained using the proposed datasets of Navarasas are used to understand the semantic meaning of real-world dance videos.
A video wherein a dancer is enacting the Shive Sringaradra Shloka is taken from Youtube and is broken down into frames as shown in Fig. 13. We manually isolated the frames and cropped them to separate the enactments of the Navarasas depicted in the Shloka. Note that the images which are fed as input are manually cropped, and there is no automatic detection of face and hand regions to facilitate the proposed approach. The enactment contains several Navarasas,
Figure 13: Snapshot of the Navarasas enacted in the Shloka Shive Sringaradra. (a) through (g) represent the Navarasas, namely, Shringaar, Bibhatsya, Adbhuta, Bhayanaka, Hasya, Karunaa and Shaanta, respectively.
of which Adbhuta, Hasya, and Shanta could be correctly identified by the trained CNN.
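The per-frame recognition step described above can be sketched as follows: given the trained CNN's class probabilities for each frame, low-confidence frames are discarded and runs of identical labels are collapsed into the ordered sequence of enacted Navarasas. The confidence threshold and this helper itself are illustrative assumptions, not part of the paper's pipeline.

```python
from itertools import groupby

NAVARASAS = ["Shringaar", "Hasya", "Karunaa", "Roudra", "Veera",
             "Bhayanaka", "Bibhatsya", "Adbhuta", "Shaanta"]

def rasa_sequence(frame_probs, threshold=0.6):
    """frame_probs: per-frame class probability lists from the classifier.
    Frames whose top probability falls below `threshold` are treated as
    uncertain and skipped; consecutive repeats are merged into one entry."""
    labels = []
    for probs in frame_probs:
        best = max(range(len(probs)), key=lambda k: probs[k])
        if probs[best] >= threshold:
            labels.append(NAVARASAS[best])
    return [name for name, _ in groupby(labels)]
```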
7. Practical relevance
Figure 14: A snapshot of Navarasas collected from novices for the classes of Adbhuta, Bhayanaka, Roudra, and Veera, respectively, which were rated poorly by the proposed deep learning algorithm.
We envisage a training application which shows some images of a particular Navarasa from the training set to a novice; when the trainee performs the Navarasa, the system captures a picture of the dancer and feeds it to the proposed machine learning algorithm, which then rates the performed enactment.
To illustrate this possible application of the proposed approach, a set of images was collected from novice dancers for various Navarasas to obtain a test set for evaluation purposes. The students were initially shown images from our CVLND-RGB dataset. They then performed the Navarasa shown to them, and images were captured. The data comprised images collected from five trainees, each enacting a particular Navarasa ten times. Hence, the test dataset is composed of a total of 400 images, with 50 images for each of the eight Navarasas. The images collected from novices were then rated using the proposed deep learning approach. A snapshot of some of the images collected from novices which were rated poorly by the proposed algorithm is shown in Fig. 14. For the images of Adbhuta, Bhayanaka, Roudra, and Veera shown in Fig. 14, classification probabilities of 0.5, 0.5, 0.5, and 0.4, respectively, were obtained using the trained CNN model in Fig. 5. By checking these scores, novice dancers can further improve their enactment of the Navarasas.
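The rating step of this envisaged application can be sketched with a simple rule that maps the classifier's probability for the intended Navarasa to coarse feedback. The cut-off values and the function itself are hypothetical illustrations, not taken from the paper.

```python
def rate_enactment(probs, intended, names, good=0.8, fair=0.6):
    """probs: class probabilities from the trained model; intended: name of
    the Navarasa the trainee attempted; names: class names in probability
    order. The `good`/`fair` thresholds are illustrative assumptions."""
    p = probs[names.index(intended)]
    if p >= good:
        return "good"
    if p >= fair:
        return "fair"
    return "needs practice"
```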
8. Conclusion
In this work, we addressed the problem of understanding the emotions associated with ICD. We proposed two datasets, namely, CVLND-RGB and CVLND-D, consisting of RGB images and the corresponding depth maps captured under controlled laboratory conditions. We also captured a large real-world dataset of professional dancers enacting the Navarasas in unconstrained scenarios. We evaluated the performance of the proposed CNN with a relatively smaller architecture on all our datasets. We also demonstrated the role of data augmentation, transfer learning and deeper CNN architectures. To the best of our knowledge, this work is the first to address the interesting problem of understanding ICD videos by recognizing emotions enacted by the dancer. However, there exists ample scope to improve the robustness of the proposed approach so as to handle the complexities associated with real-world videos of ICD. We are presently investigating the handling of occlusions and severe illumination variations for the recognition of Navarasas in real dance videos. The semantic understanding of ICD videos may aid choreographers and professional dancers in training new dancers as well as in composing new dance pieces. In future, our approach may find use in dance concerts for judging the performances.
Acknowledgments

We are grateful to Mr. Ashis Kumar Das and his team at 'Odissi Nritya Mandal' for providing data enacted by professional dancers.
References
analysis from static face images, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34 (3) (2004) 1449–1461.
[5] Z. Zeng, M. Pantic, G. I. Roisman, T. S. Huang, A survey of affect recognition methods: Audio, visual, and spontaneous expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (1) (2009) 39–58.
[6] H. Gunes, M. Piccardi, Bi-modal emotion recognition from expressive face
and body gestures, Journal of Network and Computer Applications 30 (4)
(2007) 1334–1345.
[7] L.-F. Chen, Y.-S. Yen, Taiwanese facial expression image database, Brain Mapping Laboratory, Institute of Brain Science, National Yang-Ming University.
[10] C. Darwin, The expression of the emotions in man and animals, Oxford
University Press, USA, 1998.
faces and facial expressions: A survey, Pattern Recognition 25 (1) (1992) 65–77.
[14] G. Donato, M. S. Bartlett, J. C. Hager, P. Ekman, T. J. Sejnowski, Classi-
fying facial actions, IEEE Transactions on Pattern Analysis and Machine
Intelligence 21 (10) (1999) 974–989.
[15] X.-W. Wang, D. Nie, B.-L. Lu, Emotional state classification from EEG
data using machine learning approach, Neurocomputing 129 (2014) 94–106.
[16] M. J. Black, Y. Yacoob, Recognizing facial expressions in image sequences
using local parameterized models of image motion, International Journal
of Computer Vision 25 (1) (1997) 23–48.
[18] M. Valstar, M. Pantic, Fully automatic facial action unit detection and
temporal analysis, in: IEEE Conference on Computer Vision and Pattern
Recognition Workshop, 2006, pp. 149–149.
[19] L.-L. Shen, Z. Ji, Modelling geometric features for face based age classification, in: IEEE Conference on Machine Learning and Cybernetics, Vol. 5.
[21] R. W. Picard, R. Picard, Affective computing, Vol. 252, MIT press Cam-
bridge, 1997.
[23] H. Gunes, M. Piccardi, Automatic temporal segment detection and affect
recognition from face and body display, IEEE Transactions on Systems,
Man, and Cybernetics, Part B (Cybernetics) 39 (1) (2009) 64–84.
specific influences on judgement and choice, Cognition & Emotion 14 (4) (2000) 473–493.
[25] M. E. Kret, K. Roelofs, J. J. Stekelenburg, B. de Gelder, Emotional signals
from faces, bodies and scenes influence observers’ face expressions, fixations
and pupil-size, Frontiers in Human Neuroscience 7 810–850.
[27] Y. Gu, X. Mai, Y.-j. Luo, Do bodily expressions compete with facial expressions? Time course of integration of emotional signals from the face and the body, PLoS One 8 (7) (2013) 736–762.
single-user office scenarios, in: Artificial Intelligence for Human Computing, Springer, 2007, pp. 251–271.
[32] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh,
S. Lee, U. Neumann, S. Narayanan, Analysis of emotion recognition using
facial expressions, speech and multimodal information, in: Proceedings of
the 6th International Conference on Multimodal Interfaces, ACM, 2004,
pp. 205–211.
[33] P. Burkert, F. Trier, M. Z. Afzal, A. Dengel, M. Liwicki, Dexpression: Deep
convolutional neural network for expression recognition, arXiv preprint
arXiv:1509.05371.
[36] C.-D. Căleanu, Face expression recognition: A brief overview of the last decade.
[37] I. Song, H.-J. Kim, P. B. Jeon, Deep learning for real-time robust facial
expression recognition on a smartphone, in: IEEE Conference on Consumer
Electronics, 2014, pp. 564–567.
[38] Y.-H. Byeon, K.-C. Kwak, Facial expression recognition using 3D convolu-
tional neural network, International Journal of Advanced Computer Science
and Applications 5 (12) (2014) 107–112.
local binary patterns: A comprehensive study, Image and Vision Computing 27 (6) (2009) 803–816.
[40] M. Pantic, M. Valstar, R. Rademaker, L. Maat, Web-based database for
facial expression analysis, in: IEEE Conference on Multimedia and Expo,
2005, pp. 5–10.
[41] J. Wang, L. Yin, Static topographic modeling for facial expression recog-
nition and analysis, Computer Vision and Image Understanding 108 (1)
(2007) 19–34.
[42] A. R. Rivera, J. R. Castillo, O. O. Chae, Local directional number pattern for face analysis: Face and expression recognition, IEEE Transactions on Image Processing 22 (5) (2013) 1740–1752.
[46] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny im-
ages, Master’s thesis, Computer Science Department, University of Toronto
(2009).
[48] B. Gupt, Dramatic Concepts Greek & Indian: A Study of the Poetics and the Natyasastra, DK Printworld, 1994.
[49] A. Lal, The Oxford Companion to Indian Theatre, Oxford University Press, USA, 2004.
[52] W. Xie, L. Shen, M. Yang, Z. Lai, Active AU based patch weighting for
facial expression recognition, Sensors 17 (2) (2017) 275–297.
Aparna Mohanty received the B.Tech. and M.Tech. degrees in Electronics and Communication Engineering from Biju Patnaik University of Technology. She is currently a research scholar in the Department of Electrical Engineering at the Indian Institute of Technology, Kharagpur, India, working in the field of computer vision and machine learning.
Rajiv Ranjan Sahay received the B.Tech. degree in electrical engineering from the National Institute of Technology, Hamirpur, India, in 1998, the M.Tech. in electronics and communication engineering from the Indian Institute of Technology, Guwahati, India, in 2001, and the Ph.D. degree from the Indian Institute of Technology, Madras, in 2009. He was employed as a research engineer at the Indian Space Research Organization (ISRO) from 2001 to 2003. Presently, he is a postdoctoral researcher at the School of Computing, National University of Singapore. His research interests include 3-D shape reconstruction using real-aperture cameras, super-resolution, and machine learning. His dissertation investigated several extensions to the shape-from-focus technique, including exploiting defocus and parallax effects in a sequence of images.