
Accepted Manuscript

Rasabodha: Understanding Indian classical dance by recognizing emotions using deep learning

Aparna Mohanty, Rajiv R. Sahay

PII: S0031-3203(18)30030-X
DOI: 10.1016/j.patcog.2018.01.035
Reference: PR 6444
To appear in: Pattern Recognition
Received date: 15 December 2016
Revised date: 24 November 2017
Accepted date: 23 January 2018

Please cite this article as: Aparna Mohanty, Rajiv R. Sahay, Rasabodha: Understanding Indian classical dance by recognizing emotions using deep learning, Pattern Recognition (2018), doi: 10.1016/j.patcog.2018.01.035


Highlights

• A deep learning based approach using CNN to recognize facial expressions.

• Three datasets are proposed for Indian classical dance Navarasas.


Rasabodha: Understanding Indian classical dance by recognizing emotions using deep learning

Aparna Mohanty, Rajiv R. Sahay
aparnamhnty@gmail.com and rajivsahay@gmail.com
Computational Vision Laboratory, Department of Electrical Engineering, Indian Institute of Technology, Kharagpur, India

Abstract

Understanding human behaviour using computer vision techniques for recognizing body posture, gait, hand gestures, and facial expressions has recently witnessed significant research activity. Emotions/affect have a direct correlation with the mental state as well as the intention of a person, based on which his/her present and future states can be understood and predicted. As a case study, in this work we demonstrate the utility of deep learning in understanding videos of Indian classical dance (ICD) forms. ICD comprises hand gestures, body poses and facial expressions enacted by the performer along with the accompanying music and songs/shlokas. In this work we attempt to decipher the meaning of the Navarasas associated with ICD. Recognizing these emotions from images/videos of ICD is a challenge due to factors such as ambiguity in the enactment, costume, make-up, clutter, etc. Here we propose a dataset of various emotions (Navarasas) enacted in ICD comprising RGB images along with associated depth information collected using the Microsoft Kinect sensor. We propose a deep learning framework using convolutional neural networks to understand the semantic meaning associated with videos of ICD by recognizing the Navarasas enacted by the performer.

Keywords: Deep learning, convolutional neural network, facial expression recognition, emotion recognition


1. Introduction

Communication is an integral part of life and can be either verbal or non-verbal. Non-verbal communication is generally achieved with hand gestures or facial expressions. Emotion/affect recognition finds utility in human computer interaction (HCI). It is interesting to observe that classical dance forms across the world also use non-verbal communication to engage the audience. Evidence of dance exists from pre-historic times and can still be found in the Bhimbetka rock shelter paintings in India. Early Egyptian tomb paintings also depict dancing figures. Dance has always been a part and parcel of the Indian heritage. This is evident from the archaeological excavations of the Harappa and Mohenjodaro civilisations, wherein sculptures striking various dance poses were unearthed. Dance poses can still be found inscribed on the walls of temples of India. The Chidambaram temple in a southern province of India has several dance poses of Bharatnatyam sculpted on its walls.

Dance has been a prominent means of transferring knowledge from one generation to the other in earlier times when written scripts were not in use, and has always remained a means of emotionally connecting with the audience. Initially, dance was used to narrate stories to people with the associated hand gestures, poses and facial expressions. Understanding Indian classical dance (ICD) depends on the recognition of the hand gestures, poses and facial expressions associated with it. Prior work in understanding ICD has attempted to interpret the meaning of dance through the hand gestures and poses enacted by the performer [1].

In this work, we propose a deep learning based approach using convolutional neural networks (CNNs) for recognition of emotions associated with typical ICD forms which convey semantic meaning in accordance with the context. Bharatnatyam is a famous dance form of India which has been practised since time immemorial and is a living example of our heritage. It is associated with hand gestures, body postures, facial expressions, neck and eye movements, etc. The emotions/affect associated with Bharatnatyam dance are called Navarasas.

The word Navarasas is a combination of Nava, i.e., nine, and Rasas, or emotions. Akin to sign languages, wherein meaningful information is conveyed with the usage of hand gestures as well as facial expressions [2], Navarasas in ICD too are mainly enacted with the aid of facial expressions, which may also be augmented with appropriate upper body postures and hand gestures. Recently, affect recognition has gained popularity because of its applications in cognitive science, psychology, and human computer interaction (HCI) [3], [4], [5]. Our work differs from several works [3], [4] in the field of affect recognition which use a multi-modal approach for recognition of emotions. In fact, we do not use two cameras to capture multiple modalities such as facial expressions, body postures or hand gestures independently, as in [6]. Also, we do not use the audio cue for affect recognition in ICD. Rather, the analysis of Navarasas in this work involves affect recognition in ICD using facial expressions coupled with hand gestures and upper body postures in some cases, and only facial expressions in other Rasas. In fact, we use a single Kinect sensor to capture both RGB and depth data while performers enact Navarasas. The word 'Rasabodha' used in the title of this paper refers to understanding the meaning associated with short videos of ICD by recognizing the Navarasas enacted in them. Since there are no publicly available datasets for affect in ICD, in this work we propose a dataset consisting of eight out of the nine Navarasa classes, namely, Adbhuta, Bhayanaka, Bibhatsya, Hasya, Roudra, Shaanta, Shringaar and Veera. Note that we exclude the Navarasa corresponding to Karunaa from the proposed dataset since it is difficult to enact.

Though significant literature exists in the field of facial expression recognition, the prior works mainly revolved around the prototypical (basic) facial expressions which had limited meanings associated with them. Until now, the works on facial expression recognition have mostly focused on the six basic emotions of anger, disgust, fear, happiness, sadness and surprise recorded under controlled conditions. Generally, the approaches proposed in the literature are not robust enough to work accurately in unconstrained scenarios. Facial expressions convey intricate details about the emotion associated with them.

Figure 1: Data captured in real-world unconstrained settings. (a) Poor illumination. (b) Low resolution of the face due to the large distance between the camera and the performer. (c) Non-frontal appearance of the face of the performer.
This is more evident in ICD since performers of ICD rely heavily on the non-verbal mode of communication using hand gestures, body poses and facial expressions.

In real-world scenarios, ICD is performed in an unconstrained environment. Recognising emotions enacted by the dancer in such unconstrained environments is always a challenge due to:

• Illumination variation and poor spatial resolution, as shown in Figs. 1 (a) and (b), respectively.

• Make-up and costume of the dancer might vary drastically, posing a serious challenge for the classifier.

• Non-frontal enactment by the performer can be difficult to classify, as shown in Fig. 1 (c).

• Ambiguity among various enactments of the same Navarasa can result in mis-classification. A particular Navarasa might be enacted in various ways by dancers practising various styles of Bharatnatyam such as the Melattur, Pandanallur, Vazhuvoor or Kalakshetra streams. This poses a serious challenge for classification. For example, the enactment of the Hasya Navarasa can be done in two ways, as shown in Figs. 2 (a) and (b). Similarly, the Bibhatsya Navarasa can be performed in either of the ways shown in Figs. 3 (a) and (b).

Figure 2: Variation in the enactment of the Hasya Navarasa.
In this work we attempt to decipher the meaning associated with the emotions enacted in ICD forms. The proposed dataset of emotions/affect (Navarasas) enacted by dancers of ICD comprises both color images as well as corresponding depth information. We propose a deep learning based approach to classify the Navarasas, which include facial expressions, from both RGB and depth datasets without the need for extracting any hand-crafted features as was reported by prior works. To prove the robustness of our approach we use the proposed convolutional neural network architecture on two standard facial expression datasets, namely, the Taiwanese Facial Expression Image Database (TFEID) [7] [8] and the extended Cohn-Kanade (CK+) dataset [9].

The primary contributions of this work are:

• A deep learning based approach is formulated using convolutional neural networks to recognize emotions (Navarasas) in order to understand videos of ICD.

• Two datasets are captured under controlled laboratory conditions using the Microsoft Kinect sensor, which records both RGB images and depth data. We also propose a dataset of Navarasas captured using professional dancers of ICD in an unconstrained environment.

• To the best of our knowledge, the proposed algorithm is the first to use deep learning for recognizing Navarasas in order to semantically understand videos of ICD.
Figure 3: Variation in the enactment of the Bibhatsya Navarasa.
A detailed survey of related literature is given in section 2. The proposed methodology is described in section 3. A detailed description of the proposed datasets is given in section 4. The experimental results are presented in section 5. A semantic interpretation of a Shloka using the proposed algorithm is presented in section 6. Section 7 concludes the work.
2. Prior work

Emotion recognition in humans by identification of facial expressions has been an active area of research for a long time. However, the problem of expression recognition in the wild, i.e., in uncontrolled environments, is still far from solved due to the complexities associated with it. The analysis of facial expressions can be broadly summarized as a combination of face acquisition, facial data extraction and expression recognition. Note that the proposed pipeline here does not involve detection of faces, hands, or upper body from the images. We manually crop the appropriate portion of the input image in order to classify the particular Navarasa. Generally, automatic facial expression analysis systems work on a small set of prototypic emotions such as disgust, fear, joy, surprise, sadness, and anger. The origin of such works can be traced back to [10]. The relation between facial expression and emotion, as well as the information conveyed by emotions, is explored in [11]. The work of [12] investigates the issues in the design and implementation of a system that could perform automated facial expression analysis. A detailed five-step approach for representation, detection, identification, expression analysis and classification based on physical features is attempted in [13]. An exhaustive survey and comparison of recent techniques for facial expression recognition in the automated facial action coding system (FACS) is presented by [14]. For analyzing the emotional state of an individual, a signal processing approach is proposed in [15] by analysing EEG signals after feature smoothing and unrelated noise removal, followed by tracking the trajectory of emotion changes with manifold learning.

An optical flow based cue was used in [16] to obtain a local parameterized model for image motion. The work of [16] handled both rigid (planar model) and non-rigid facial motion (affine-plus-curvature model). A hierarchical optical flow method was used by [17] to automatically track facial fiducial points. The subtle changes in facial muscle action units (AUs) were analyzed for temporal behavior by [18]. Facial features have also been used in [19] for determining the age of an individual. A scheme for head-pose invariant facial expression recognition based on a set of characteristic facial points using a coupled scaled Gaussian process regression (CSGPR) model was proposed by [20]. The need for affective computers for analysing emotions has also been emphasised by [21]. An improvement in recognition accuracy is achieved with the use of hierarchical features and multimodal information in the work of [22]. The work in [23] has attempted to evaluate the efficiency of an automatic emotion recognition system using a fused representation of facial expression and body motion. The importance of emotions in decision making for robots was emphasized in the work of [24]. The importance of various modalities such as body postures, speech intonation, and motion is further explored in the work in [25]. The work in [26] showed that using all multimodal information to obtain a single feature vector resulted in improved recognition accuracy as opposed to using individual modalities. The importance of information in each modality for recognizing

emotional states was further analyzed in [27]. The work in [27] discussed the importance of complementing facial expressions with body posture/motion in non-verbal communication for the determination of emotional states. A detailed survey of the field of affect recognition is given in [5], which emphasizes the importance of using information about affective behavior along with commands in human-centered interfaces. Several prior works have focused on affect-sensitive multimodal HCI systems [28], [29], [30], and [31]. The work in [32] uses facial expressions, speech and multimodal information for emotion recognition.

Deep convolutional neural networks have also been used by researchers for the recognition of expressions, as reported by [33]. However, the work in [33] only focuses on the recognition of facial expressions but not on their semantic interpretation. Moreover, the algorithm in [33] relies on a deep architecture which is computationally complex. A fully automated system for real-time facial expression recognition for six categories of monochrome frontal views in cluttered and dynamic scenes is proposed by [34]. Facial expression recognition in image sequences is investigated in [35] using geometric deformation features and support vector machines. A detailed overview of the research work in the field of facial expression recognition is presented in [36]. For facial expression recognition using smart phones, a five layer CNN architecture with dropout and data augmentation is proposed by [37].

The authors of [38] have attempted to recognize dynamic facial expressions using 3D CNNs. Local binary patterns have been used for person independent facial expression recognition with low-resolution video sequences captured in real-world environments in [39]. The MMI dataset for facial expressions is proposed in [40]. From a single static image, facial expressions are recognized using the topographic context as the descriptor in [41]. A novel feature descriptor, the local directional number pattern (LDN), which encodes the directional information of face textures in a compact way, is proposed for face and expression recognition by [42]. A toolbox for automatically coding the intensities of 19 different facial actions as well as 6 different prototypical facial expressions is presented in [43]. The work of [43] also focuses on estimating the location of 10 facial features along with the 3D orientation of the head.

Prior work in [1] has attempted to semantically infer the meaning associated with ICD videos by recognizing body postures and hand gestures. In contrast, the work here presents an approach for understanding short videos of ICD by identification of facial expressions augmented by upper body postures and hand gestures. Unlike the work in [33], our approach uses a simpler CNN architecture and focuses on the semantic interpretation of the expressions. Though the work in [37] uses a five layer CNN for facial expression recognition, it neither focuses on the expressions associated with dance nor does it give a semantic interpretation of the expressions considered. Moreover, the work in [37] does not analyse the impact of pre-training and focuses mainly on the development of a mobile application for recognizing facial expressions. Hand-crafted features have been used in the past for recognition of facial expressions, as in [39], [41]. Although the work in [42] presents an approach for face analysis, it does not use deep learning. The proposed work here does not rely on any hand-crafted features and focuses on the semantic interpretation of Navarasas in ICD. We believe that our work is the first to address the highly challenging problem of semantically understanding emotions in ICD videos using a computer vision approach.

3. Proposed methodology
We propose a deep learning based approach with both RGB images and depth data collected using the Microsoft Kinect sensor, wherein we use convolutional neural networks (CNNs). CNNs were initially proposed by [44] and were shown to be robust and effective for various complex real-world machine learning problems [44, 45]. The emotions associated with ICD in unconstrained settings are affected by clutter in the scene, the non-frontal nature of the scene as well as illumination variations, etc. Hence, we are motivated to use CNNs in our problem.
Figure 4: Block diagram of the proposed architecture used on the proposed datasets (both RGB and depth data) for emotions of ICD, i.e., Navarasas.

The proposed CNN architecture is shown in Fig. 4 and consists of two convolutional and two pooling layers. The enacted Navarasas in the form of images of size 32 × 32 pixels are fed as input to the network. The number of nodes in the output layer depends on the number of classes, as shown in Fig. 4. The input image is convolved with 10 filters of size 5 × 5 pixels, resulting in 10 output maps of size 28 × 28 pixels in layer 1. These feature maps are followed by a max-pooling layer over 2 × 2 regions to obtain output maps of size 14 × 14 pixels in layer 2. The output maps of layer 2 are convolved with each of the 20 kernels of size 5 × 5 pixels, yielding 20 maps of size 10 × 10 pixels. This is followed by sub-sampling by a factor of 2 via max-pooling to obtain the 20 output maps of size 5 × 5 pixels of layer 4. The output of layer 4 is concatenated, yielding a single vector of size 500 during training which is fed to the next layer. There is full connectivity between the neurons in the output layer and the outputs of the previous layer. Akin to the neurons in the convolutional layers, the output layer neurons too are modulated by a non-linear activation function, namely, the rectified linear unit, i.e., ReLU (f(x) = max(0, x)), to produce the resultant score for each class. To avoid the problem of over-fitting, dropout layers are used as regularizers and placed after the max-pooling layers. A major advantage of our approach is that we do away with the generation of hand-crafted features since the proposed CNN extracts the best features from the input images to optimize classification performance.
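To make the layer arithmetic above concrete, the following is a minimal PyTorch sketch of this architecture, assuming single-channel 32 × 32 inputs and eight output classes; the original network is implemented in the Matlab R2017a deep learning toolbox, and the class name, the 0.5 dropout rate and the example batch below are illustrative assumptions rather than the authors' exact settings.

import torch
import torch.nn as nn

class NavarasaCNN(nn.Module):
    def __init__(self, num_classes=8, dropout_p=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5),   # 32x32 input -> 10 maps of 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 10 maps of 14x14
            nn.Dropout(dropout_p),             # dropout after pooling, as in the paper
            nn.Conv2d(10, 20, kernel_size=5),  # -> 20 maps of 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 20 maps of 5x5
            nn.Dropout(dropout_p),
        )
        # 20 x 5 x 5 = 500-dimensional vector fed to the output layer
        self.classifier = nn.Linear(20 * 5 * 5, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# Example: a batch of 10 gray-scale 32x32 crops, as used for the CVLND data.
scores = NavarasaCNN()(torch.randn(10, 1, 32, 32))
print(scores.shape)  # torch.Size([10, 8])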

Figure 5: Block diagram of the proposed architecture used with the CIFAR-10 [46] dataset for generating the pre-trained model which is further trained using the datasets of Navarasas.
We show the impact of changing the architecture of the CNN used on the recognition task at hand. We also show the advantages of using transfer learning in this work. We report the accuracies obtained on the proposed datasets of Navarasas using the relatively smaller proposed CNN architecture of Fig. 4 along with the deeper pre-trained models of Figs. 5 and 6. Specifically, we obtained a pre-trained CNN by training the proposed CNN architecture in Fig. 4 with the CIFAR-10 [46] dataset from random weight initialization. Next, a deeper pre-trained model using the proposed CNN architecture in Fig. 5 was also obtained by training the network with the CIFAR-10 [46] dataset from randomly initialized weights. We also used the publicly available pre-trained AlexNet [45] model which was already trained from random initial weights on the ImageNet dataset [47].
Figure 6: Block diagram of the AlexNet [45] architecture used on the ImageNet dataset [47] for creating the pre-trained model. Note that this architecture does not show the use of 2 GPUs.

We now provide details of another pre-trained CNN using the CIFAR-10 [46] dataset with a bigger architecture comprising 5 convolutional layers, as shown in Fig. 5. Image patches of size 32 × 32 × 3 pixels were fed as input to the architecture, which consists of 10 kernels in the first convolutional layer with stride 1 pixel. This layer is followed by a cross-channel normalization layer, along with a ReLU non-linearity and a dropout of 30%. A window of 5 adjacent feature maps was used to normalize each element in a convolutional layer. The second convolutional layer, comprising 64 filters of size 3 × 3 × 10 pixels, is followed by a cross-channel normalization layer (considering 5 feature maps per element), ReLU, and max-pooling over 2 × 2 pixels with stride 2 pixels. The next convolutional layer comprises 128 kernels of size 3 × 3 × 64 with stride 1 pixel. The fourth and fifth convolutional layers consist of 128 kernels of size 3 × 3 × 128 and 256 kernels of size 3 × 3 × 128, respectively. The third and fifth convolutional layers are each followed by a dropout of 40%. The fully connected layer has 500 neurons.
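A hedged PyTorch sketch of this deeper five-convolutional-layer network is given below. The filter counts, 3 × 3 kernels, cross-channel (local response) normalization, dropout rates and the 500-unit fully connected layer follow the text; the first-layer kernel size, the padding and the final 10-way CIFAR-10 output layer are assumptions, since the paper does not state them.

import torch
import torch.nn as nn

deeper_cnn = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3, stride=1, padding=1),   # 10 kernels, stride 1 (kernel size assumed)
    nn.LocalResponseNorm(size=5),                            # normalize over 5 adjacent feature maps
    nn.ReLU(),
    nn.Dropout(0.3),                                         # 30% dropout
    nn.Conv2d(10, 64, kernel_size=3, padding=1),             # 64 filters of 3x3x10
    nn.LocalResponseNorm(size=5),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 2x2 max-pooling, stride 2
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),  # 128 kernels of 3x3x64
    nn.ReLU(),
    nn.Dropout(0.4),                                         # 40% dropout after the 3rd conv layer
    nn.Conv2d(128, 128, kernel_size=3, padding=1),           # 128 kernels of 3x3x128
    nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, padding=1),           # 256 kernels of 3x3x128
    nn.ReLU(),
    nn.Dropout(0.4),                                         # 40% dropout after the 5th conv layer
    nn.Flatten(),
    nn.LazyLinear(500),                                      # 500-neuron fully connected layer
    nn.ReLU(),
    nn.LazyLinear(10),                                       # 10 CIFAR-10 classes (assumed output head)
)

out = deeper_cnn(torch.randn(4, 3, 32, 32))  # 32x32x3 CIFAR-10 patches
print(out.shape)  # torch.Size([4, 10])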
The pre-trained model obtained using the ImageNet dataset [47] has the architecture [45] shown in Fig. 6. It consists of eight layers, comprising five convolutional layers followed by three fully connected layers. The final layer output is fed to a softmax classifier that produces an output over 1000 class labels. The network maximizes the multinomial logistic regression objective. Neurons of a fully connected layer are connected to all neurons in the previous layer. A ReLU non-linearity follows the convolutional layers as well as the fully connected layers. The input to this architecture is an image of size 224 × 224 × 3 which is convolved with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels. The output of the first convolutional layer is response-normalized and pooled before being fed as input to the second layer. The second layer has 256 filters of size 5 × 5 × 48. The third, fourth and fifth convolutional layers are connected to each other without any pooling or normalization layers. In the third layer, there are 384 kernels of size 3 × 3 × 256 connected to the normalized and pooled outputs of the previous layer. Similarly, the fourth convolutional layer has 384 kernels of size 3 × 3 × 192 and the fifth layer has 256 kernels of size 3 × 3 × 192. The fully connected layers have 4096 neurons each. The AlexNet [45] model is publicly available and its detailed architecture is represented in Fig. 6.
Note that in our experiments we used the pre-trained AlexNet [45] without the last classification layer only for extracting features from images of the proposed datasets. The 4096 dimensional feature vector in the penultimate layer of the pre-trained AlexNet [45] is fed as input to train a SVM classifier on the proposed datasets. In contrast to the smaller pre-trained CNNs of Figs. 4 and 5, whose weights in various layers were fine-tuned during training, for the deeper and more complex AlexNet [45] we only train the SVM classifier without touching the pre-trained weights in its layers.
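The sketch below illustrates this feature-extraction pipeline under stated assumptions: an ImageNet pre-trained AlexNet with its 1000-way output layer removed supplies 4096-dimensional penultimate features to a linear SVM, while the CNN weights stay frozen. The paper uses the Matlab AlexNet model; here torchvision and scikit-learn stand in, the weights argument may be pretrained=True on older torchvision versions, and load_crops() is a hypothetical loader for the proposed datasets.

import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import LinearSVC

alexnet = models.alexnet(weights="IMAGENET1K_V1")   # older torchvision: pretrained=True
alexnet.classifier[6] = nn.Identity()               # drop the 1000-way classification layer
alexnet.eval()

@torch.no_grad()
def extract_features(images):                        # images: (N, 3, 224, 224) tensor
    return alexnet(images).cpu().numpy()             # (N, 4096) penultimate-layer features

# train/test tensors and labels are assumed to come from the proposed CVLND or
# real-world Navarasas data, resized to 224 x 224 (hypothetical loader).
train_images, train_labels = load_crops("train")
test_images, test_labels = load_crops("test")

svm = LinearSVC(C=1.0)                                # only the SVM is trained
svm.fit(extract_features(train_images), train_labels)
print("test accuracy:", svm.score(extract_features(test_images), test_labels))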
We also show the robustness of our proposed approach on standard datasets such as the Taiwanese facial expression image database (TFEID) [7] [8] and the extended Cohn-Kanade (CK+) [9] data. The trained models whose architectures are shown in Figs. 4 and 6 were also tested on the real-world unconstrained Navarasas data obtained from professional dancers.
4. Datasets

Various ICD forms can trace their roots to the ancient Indian text of the Natya Shastra [48]. For example, syntactic and semantic descriptions of the body postures, emotions (Navarasas), hand gestures, and neck and eye movements of Bharatnatyam have been detailed in the Natya Shastra [48]. This ancient treatise also describes the art of conveying emotions (Rasas) with various body/hand movements along with facial expressions [49]. Hence, our proposed dataset depicting emotions in ICD, namely, the Navarasas, is captured considering the relevant portions of the body.

Figure 7: A snapshot of data collected under a constrained laboratory setting. (a) through (h) represent sample images depicting the enactments of the eight Navarasas, namely, Adbhuta, Bhayanaka, Bibhatsya, Hasya, Roudra, Shaanta, Shringaar and Veera, respectively.
CVLND-RGB: Images of eight Navarasas associated with ICD are collected under controlled laboratory settings. This dataset is collected from a total of fourteen persons, each enacting an emotion (Rasa) 10 times. All images are of size 1920 × 1080 pixels, captured with a Microsoft Kinect camera. We obtained the data for eight distinct emotions, namely, Adbhuta, Bhayanaka, Bibhatsya, Hasya, Roudra, Shaanta, Shringaar and Veera. We name this dataset the Computational Vision Laboratory Navarasas Dataset-RGB (CVLND-RGB). A snapshot of the CVLND-RGB dataset is shown in Fig. 7. In summary, a total of 1120 images were collected for the eight distinct Navarasas, wherein performers were not wearing proper costume and make-up.

Figure 8: A snapshot of depth data collected under a constrained laboratory setting. (a) through (h) represent the depth maps corresponding to the RGB data in Fig. 7 for Navarasas belonging to the categories of Adbhuta, Bhayanaka, Bibhatsya, Hasya, Roudra, Shaanta, Shringaar and Veera, respectively.
CVLND-D: Corresponding depth maps for the RGB images of the CVLND-RGB dataset are collected in artificial laboratory conditions from the same 14 volunteers, each enacting a single emotion 10 times for all the eight Navarasas. These depth maps of size 512 × 424 pixels are captured using a Microsoft Kinect sensor. The dataset is named the Computational Vision Laboratory Navarasas Data-Depth (CVLND-D) and comprises eight distinct Navarasas identical to those in the CVLND-RGB dataset. A snapshot of the CVLND-D dataset is shown in Fig. 8. A total of 1120 depth maps were obtained corresponding to the RGB images of the eight distinct Navarasas.
Real-world unconstrained Navarasas data: Emotion data is also collected from performers of ICD in an unconstrained environment. This data is a challenge to work with due to the associated complexities, as outlined in section 1. The data is collected from 12 professional dancers for a total of eight Navarasas identical to those in the CVLND-RGB dataset. A total of 35 enactments are captured from each dancer, resulting in a total of 3360 images for the eight Navarasas as a whole.

Figure 9: A snapshot of Navarasas collected under an unconstrained real-world scenario for the classes of Adbhuta, Bhayanaka, Bibhatsya, Hasya, Roudra, Shaanta, Shringaar and Veera, respectively. Note the variation in illumination and the presence of costume, make-up, and ornaments in the dataset captured in real concerts.
We show the image corresponding to each Rasa in Fig. 9. Note the challenges associated with this dataset, since it is captured in uncontrolled conditions. In particular, one can observe variation in the illumination, costume, make-up and jewellery worn by the dancer. Also, faces appear at different spatial resolutions and non-frontal to the camera. Background clutter due to curtains in the backdrop adds to the complexity of this dataset.
5. Experimental results
One of the primary objectives of this work is to propose a deep learning framework for recognition of Navarasas in ICD. In order to comprehensively evaluate the proposed approach we provide comparisons with traditional feature engineering based shallow machine learning algorithms. Hand-crafted operators such as the histogram of oriented gradients (HoG), scale-invariant feature transform (SIFT), speeded up robust features (SURF), binary robust invariant scalable keypoints (BRISK), and local binary patterns (LBP) were initially used to extract features from the input images, and a SVM classifier with a linear kernel is trained to recognize the emotions.
17
ACCEPTED MANUSCRIPT

No. of Classes  Training set  Test set  Feature vector  Accuracy
8               960           160       HoG + SVM       93.125%
                                        SIFT            69.375%
                                        SURF            39.37%
                                        BRISK           15.0%
                                        LBP + SVM       66.75%

Table 1: Performance of hand-crafted features on the proposed CVLND-RGB dataset.

While obtaining the HoG feature vector, a dense grid was considered over the entire image by computing 9-bin histograms of gradients over cells of size 8 × 8 pixels and blocks of 2 × 2 cells. The SIFT, SURF and BRISK features used were keypoint based, and the feature vector length depended on the number of keypoints detected per image. The descriptor length for each keypoint is 128 for SIFT and 64 for SURF and BRISK features.
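A minimal sketch of the HoG + linear-SVM baseline described above is given below: 9-bin gradient histograms over 8 × 8-pixel cells grouped into 2 × 2-cell blocks are fed to an SVM with a linear kernel. scikit-image and scikit-learn stand in for whatever toolbox the authors used, and load_gray_images() is a hypothetical loader for the proposed datasets.

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    # images: iterable of 2-D gray-scale arrays of identical size
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

X_train, y_train = load_gray_images("train")   # hypothetical loader
X_test, y_test = load_gray_images("test")

clf = LinearSVC()                               # SVM with a linear kernel
clf.fit(hog_features(X_train), y_train)
print("test accuracy:", clf.score(hog_features(X_test), y_test))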

5.1. Results on CVLND-RGB and CVLND-D datasets


The details of the accuracies obtained using the hand-crafted features on the CVLND-RGB and CVLND-D data are summarized in Table 1 and Table 2, respectively. For the CVLND-RGB images, using the HoG feature extractor followed by a SVM classifier we obtained an accuracy of 93.125%, whereas the accuracy on the CVLND-D depth data is 91.87%. The performance of the keypoint based hand-crafted features was affected by the complexities associated with the dataset. Using SIFT features we obtained accuracies of 69.375% and 64.37% on the CVLND-RGB and CVLND-D data, respectively. It was observed that the SURF and BRISK detectors failed to perform on the depth data due to their failure in identifying keypoints on the smooth, texture-less depth maps. The SURF features performed poorly, with accuracies of 39.37% and 12.5% on the CVLND-RGB and CVLND-D data, respectively. The performance obtained with the BRISK detector is merely 15.0% and 12.5% on the CVLND-RGB and CVLND-D data, respectively. The accuracy obtained with LBP features followed by a SVM classifier is 66.75% on the proposed CVLND-RGB dataset and 60.625% on the proposed CVLND-D data.

No. of Classes  Training set  Test set  Feature vector  Accuracy
8               960           160       HoG + SVM       91.87%
                                        SIFT            64.37%
                                        SURF            12.5%
                                        BRISK           12.5%
                                        LBP + SVM       60.625%

Table 2: Classification results using hand-crafted features on the proposed CVLND-D dataset.
5.2. Results on proposed real-world dataset of Navarasas in ICD

We extracted HoG features from the real-world data collected in unconstrained conditions, which were fed as input to the SVM classifier to obtain an accuracy of 83.39%. The accuracies obtained with the keypoint detector based approaches such as SIFT, SURF, and BRISK on this data are 59.10%, 47.32%, and 39.46%, respectively. The use of local binary patterns as features followed by a SVM classifier resulted in an accuracy of 60.2% on the real-world Navarasas data, as reported in Table 3. In Fig. 10 we show the inverse HoG features [50] extracted from the images of Fig. 9. Note that these visualizations are perceptually intuitive for humans, which justifies the performance obtained using HoG features.

5.3. Results with Deep Learning

The performance of the proposed convolutional neural network, shown in Fig. 4, is now evaluated. The CVLND-RGB dataset of 1120 images was divided into a training set of 800 images collected from 10 different individuals, each enacting an expression 10 times for the 8 Navarasas. Similarly, 160 images of emotions collected from 2 individuals, each enacting an expression 10 times for all the eight Navarasas, were used for the validation set, and the data from the remaining 2 individuals were used for the testing set.

No. of Classes  Training set  Test set  Feature vector  Accuracy
8               2800          560       HoG + SVM       83.39%
                                        SIFT            59.10%
                                        SURF            47.32%
                                        BRISK           39.46%
                                        LBP + SVM       60.1786%

Table 3: Classification performance of hand-crafted features on the proposed real-world Navarasas dataset.

Figure 10: Inverse HoG [50] features extracted from images of various Rasas as shown in Fig. 9.

Therefore, 800 images were used for training, while 160 images each were used for the validation and testing sets, respectively. The same split of data was repeated for the corresponding depth maps of CVLND-D. The proposed architecture, as shown in Fig. 4, is implemented using the deep learning toolbox of Matlab R2017a and yielded an accuracy of 90.63% on the test set for RGB images, while an accuracy of 94.75% was obtained for the depth data of CVLND-D. Details of the experiment are reported in the first two rows of Table 4. Note that we used the ReLU activation function and two dropout layers after the pooling layers in the proposed architecture. The weights of the proposed CNN were randomly initialized.

Data                    Classes  Training and    Test  Training  Testing   (learning rate,
                                 Validation set  set   accuracy  accuracy  epochs)
CVLND-RGB               8        960             160   100%      90.63%    (0.001, 5000)
CVLND-D                 8        960             160   100%      94.75%    (0.001, 10000)
Navarasas (real-world)  8        2800            560   100%      83.9%     (0.001, 5000)

Table 4: Performance of the proposed CNN model in Fig. 4 on our CVLND-RGB and CVLND-D datasets and real-world Navarasas images.

Figure 11: (a) through (d) represent the output of the first filter at the 1st and 2nd convolutional layers, at the output of the penultimate layer and at the final layer, respectively, for the Adbhuta Navarasa. (e) through (h) represent the same sequence of outputs for the Hasya Navarasa.

The output obtained at the intermediate layers of the proposed CNN are
shown in Figs. 11 and 12 for the CVLND-RGB, and real-world Navarasas data,
CE

respectively. As evident from the snapshots of Fig. 11 and 12, the weigths of
390 CNN in various layers of the architecture extract the prominent features for
classification of the input data. Note the output in various layers consist of fine
AC

details including eye grooves, nose, and mouth regions, etc.


We now report the performance of the proposed CNN shown in Fig. 4 on
the real-world unconstrained Navrasas dataset. This database consists of 3360
395 images collected from twelve professional dancers. Each dancer enacts eight
expressions 35 times. The total data is split into training, validation and test

21
ACCEPTED MANUSCRIPT

T
(a) (b) (c) (d)

IP
CR
(e) (f) (g) (h)

Figure 12: (a) through (d) represent the output of the first filter at the 1st and 2nd convo-

US
lutional layers, at the output of penultimate layer and final layer, respectively for Adbhuta
Navarasa. (e) through (h) represent the same sequence of outputs for the Hasya Navarasa.
AN
sets. The training set comprises data collected from 9 professional dancers for
the eight Navarasas each enacting the expression 35 times. The validation set
is composed of data collected from 1 individual enacting the 8 expressions 35
M

400 times. The test data is composed of the data collected from 2 dancers each
enacting the 8 emotions 35 times. In summary, the real-world Navarasas data
ED

consists of 2520 images in the training set, 280 photographs in the validation
set and 560 images in the test dataset. The proposed CNN of Fig. 4, resulted
in an accuracy of 83.9% on the test dataset as shown in third row of Table
PT

405 4. Note that the accuracies reported in Table 4 are obtained using a six-fold
cross-validation approach to train the proposed CNN on the CVLND-RGB and
CVLND-D datasets, while a ten-fold cross-validation strategy is followed to
CE

train the CNN on the real-world Navarasas dataset. The values inside brackets
in Table 4 denote the learning rate and number of epochs, respectively.
AC

410 The proposed architecture shown in Fig. 4 was chosen after rigorous exper-
imentation by varying the architecture and experimenting with different ratios
of training, validation and test data. Details of several experiments in which
the architecture of the proposed CNN was varied are tabulated in Table 5. The
architecture in Fig. 4 was modified by increasing the number of convolutional

22
ACCEPTED MANUSCRIPT

Data                    Classes  Training and    Test  3 Convolutional        4 Convolutional
                                 Validation set  set   Training  Testing      Training  Testing
                                                       accuracy  accuracy     accuracy  accuracy
CVLND-RGB               8        960             160   100%      66.87%       100.0%    76.88%
                                                       (0.001, 10000)
CVLND-D                 8        960             160   99%       86.25%       100.0%    91.87%
                                                       (0.001, 10000)         (0.001, 10000)
Navarasas (real-world)  8        2800            560   100%      63.75%       100.0%    73.57%
                                                       (0.001, 5000)          (0.001, 5000)

Table 5: Performance of the end-to-end CNN with 3 and 4 convolutional layers on the proposed datasets. Note that the results reported are on gray-scale images. The values inside brackets denote the learning rate and number of epochs, respectively.
Test accuracies of 66.87%, 86.25%, and 63.67% were obtained on the CVLND-RGB, CVLND-D, and real-world unconstrained Navarasas data, respectively, using 3 convolutional layers in the proposed CNN architecture, as reported in the 6th column of Table 5. Similarly, the test accuracies obtained using the four convolutional layer architecture are reported in the 8th column of Table 5. Note that there is a drop in performance on the proposed datasets, as can be observed by comparing the test accuracies reported in the final column of Table 4 with those in the 6th and 8th columns of Table 5. The results in Tables 4 and 5 denote the accuracies obtained on gray-scale images. The superior performance obtained using 2 convolutional layers in the architecture of Fig. 4, as reported in Table 4, acted as the design motivation for the proposed architecture.

The interesting impact of varying the split of training, validation and test data is further demonstrated in Tables 6 and 7. Experiments were performed by varying the number of images in the test set for the CVLND (RGB and depth) datasets, as shown in Table 6. A similar experiment was also performed for the real-world Navarasas data, as reported in Table 7. Note that accuracies of 91.25% and 88.5% were obtained while considering 400 images in the test set for a 6 : 3 split of the training and validation data for the proposed CVLND-RGB and CVLND-D datasets, respectively, as seen in the 7th column of Table 6.

Columns 5 and 6 of Table 6 detail the test accuracies obtained by varying the training and validation split while considering 320 images in the test set. Similarly, the 7th and 8th columns of Table 6 detail the variation in accuracy for varying training-validation splits in the case of 400 images in the test set. A similar set of experiments was performed for the real-world unconstrained Navarasas dataset, as reported in Table 7. Note that the values reported in Tables 6 and 7 were obtained considering gray-scale images of the proposed datasets. The values inside brackets in Tables 6 and 7 denote the learning rate and number of epochs, respectively. We concluded from the above experimentation, by varying the architecture and the split of training, validation and testing data, that the architecture of Fig. 4 performs the best.
Data       Classes  Total  Number of  320 images in test             400 images in test
                    data   persons    6:4 train:val  7:3 train:val   6:3 train:val  7:2 train:val
CVLND-RGB  8        1120   14         86.56%         88.75%          91.25%         86.5%
                                      (0.001, 5000)  (0.001, 5000)   (0.001, 5000)  (0.001, 5000)
CVLND-D    8        1120   14         86.25%         92.81%          88.5%          89.25%
                                      (0.001, 10000) (0.001, 10000)  (0.001, 10000) (0.001, 10000)

Table 6: Performance obtained using the proposed CNN model in Fig. 4 by varying the split of training, validation, and testing data for the proposed CVLND-RGB and CVLND-D datasets.

Data                    Classes  Total  Number of  1120 images in test           1400 images in test
                                 set    persons    5:3 train:val  6:2 train:val  4:3 train:val  5:2 train:val
Navarasas (real-world)  8        3360   12         78.39%         80.45%         80.5%          81.43%
                                                   (0.001, 5000)  (0.001, 5000)  (0.001, 5000)  (0.001, 5000)

Table 7: Performance obtained using the proposed CNN model in Fig. 4 by varying the split of training, validation, and testing data for the real-world unconstrained Navarasas data.

To overcome the problem of over-fitting, more data is required to train the network. This is more evident in the results summarized in Table 8, which are obtained on the augmented data generated by cropping and resizing the images in five different ways followed by flipping the pictures along the vertical axis.

Data                    Classes  Training and    Test  Training  Testing   (learning rate,
                                 Validation set  set   accuracy  accuracy  epochs)
CVLND-RGB               8        5760            160   99%       91.25%    (0.001, 2000)
CVLND-D                 8        5760            160   100%      95.1%     (0.001, 2000)
Navarasas (real-world)  8        16800           560   100%      86.79%    (0.001, 3000)

Table 8: Performance of the proposed CNN model in Fig. 4 on the augmented proposed CVLND-RGB and CVLND-D datasets and real-world Navarasas images.
Note that augmentation was performed only for the training dataset and no augmentation was carried out for the test database. Note the improvement in accuracies obtained on each of the proposed datasets in Table 8 using augmented gray-scale data. The values inside brackets in the table denote the learning rate and number of epochs, respectively. Test accuracies of 91.25%, 95.1%, and 86.79% were obtained on the augmented CVLND-RGB, CVLND-D and real-world Navarasas datasets, respectively, using the proposed architecture in Fig. 4. We used the ReLU activation function and two dropout layers after the pooling layers in the proposed architecture, which is implemented with the deep learning toolbox of Matlab R2017a. Again, the weights of the proposed CNN were randomly initialized.
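The following is a hedged sketch of crop-resize-flip augmentation in the spirit of the scheme described above (five crops of each training image, each also mirrored about the vertical axis). The exact crop geometry and the precise combination used by the authors, whose augmented set sizes are six times the originals, are not specified in the text, so the 28-pixel crop size and the corner-plus-centre layout below are assumptions.

from PIL import Image
import torchvision.transforms.functional as TF

def augment(img: Image.Image, crop=28, out=32):
    w, h = img.size
    boxes = [                                  # five crops: four corners plus the centre
        (0, 0, crop, crop),
        (w - crop, 0, w, crop),
        (0, h - crop, crop, h),
        (w - crop, h - crop, w, h),
        ((w - crop) // 2, (h - crop) // 2, (w + crop) // 2, (h + crop) // 2),
    ]
    crops = [img.crop(b).resize((out, out)) for b in boxes]
    flips = [TF.hflip(c) for c in crops]       # mirror about the vertical axis
    return crops + flips                       # crop + flip variants of one training image

variants = augment(Image.new("L", (32, 32)))   # e.g. a 32x32 gray-scale training crop
print(len(variants))                           # 10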
Data                    Classes  Training and    Test  3 Convolutional        4 Convolutional
                                 Validation set  set   Training  Testing      Training  Testing
                                                       accuracy  accuracy     accuracy  accuracy
CVLND-RGB               8        5760            160   100%      78.75%       100.0%    81.25%
                                                       (0.001, 2000)          (0.001, 2000)
CVLND-D                 8        5760            160   100%      90.63%       100.0%    92.3%
                                                       (0.001, 2000)          (0.001, 10000)
Navarasas (real-world)  8        16800           560   100%      73.93%       100.0%    84.82%
                                                       (0.001, 2000)          (0.001, 2000)

Table 9: Performance of the end-to-end CNN with 3 and 4 convolutional layers on the proposed datasets. Note that the results reported are on augmented gray-scale images. The values inside brackets denote the learning rate and number of epochs, respectively.

Akin to the experimentation with varying the architecture of the CNN on the original proposed datasets, a similar set of experiments was performed for the corresponding augmented datasets, as reported in Table 9. Using a modified architecture comprising 4 convolutional layers, test accuracies of 81.25%, 92.3%, and 84.82% were obtained on the augmented datasets of CVLND-RGB, CVLND-D and real-world Navarasas data, respectively, as shown in the final column of Table 9. Note the drop in performance in the test accuracies for the augmented datasets, as reported in the 6th and 8th columns of Table 9, compared to the test accuracies reported in the final column of Table 8. Similar to Tables 4 and 5, the results reported in Tables 8 and 9 were obtained for gray-scale images.

Data                    Classes  Training and    Test  Training  Testing   (learning rate,
                                 Validation set  set   accuracy  accuracy  epochs)
CVLND-RGB               8        960             160   99.5%     97.5%     (0.001, 5000)
Navarasas (real-world)  8        2800            560   100%      83.93%    (0.001, 5000)

Table 10: Performance of the proposed CNN model in Fig. 4 on the proposed CVLND-RGB dataset and real-world Navarasas images. The values inside brackets denote the learning rate and number of epochs, respectively.
The proposed architecture of Fig. 4 was modified to take as input color images of the proposed datasets. Test accuracies of 97.5% and 83.93% were obtained for the CVLND-RGB and real-world Navarasas datasets, respectively, as reported in Table 10. These results are obtained by using the ReLU activation function and two dropout layers after the pooling layers in the proposed architecture. The proposed CNN was initialized with random weights. Note that, as in the case of gray-scale images, a drop in accuracy was observed for color images when varying the number of convolutional layers, as reported in the 6th and 8th columns of Table 11. Test accuracies of 81.25% and 86.61% were obtained for the architectures comprising 4 convolutional layers for the CVLND-RGB and real-world unconstrained Navarasas data, respectively, as shown in the final column of Table 11. The details of the accuracies obtained by varying the proposed architecture are reported in the 6th and 8th columns of Table 11. Note that the values inside brackets in the table denote the learning rate and number of epochs, respectively.
Data                    Classes  Training and    Test  3 Convolutional        4 Convolutional
                                 Validation set  set   Training  Testing      Training  Testing
                                                       accuracy  accuracy     accuracy  accuracy
CVLND-RGB               8        960             160   100%      71.25%       100.0%    81.25%
                                                       (0.001, 5000)          (0.001, 5000)
Navarasas (real-world)  8        2800            560   100%      75.54%       100.0%    86.61%
                                                       (0.001, 5000)          (0.001, 5000)

Table 11: Performance of the end-to-end CNN with 3 and 4 convolutional layers on the proposed datasets. Note that the proposed CNN takes color images as input and hence we do not report on the CVLND-D depth data.
Similar to the set of experiments performed for the augmented gray-scale images (Tables 8 and 9), experiments were also performed for the augmented color images, as reported in Tables 12 and 13, respectively. The proposed CNN was initialized with random weights. Note that the same procedure was followed for augmenting the color dataset as was used for the gray-scale images. Test accuracies of 98.1% and 87.86% were obtained for the augmented proposed color datasets of CVLND-RGB and real-world Navarasas data, as reported in the final column of Table 12. Since CVLND-D contains single-channel data, the results reported in Tables 12 and 13 are obtained only for the CVLND-RGB and real-world unconstrained Navarasas datasets; the values inside brackets in the tables denote the learning rate and number of epochs, respectively.

5.4. Effect of Transfer Learning



A trained model was obtained using the proposed CNN architecture of Fig. 4 (with random weight initialization) on the CIFAR-10 [46] dataset containing 50,000 training images and 10,000 test images belonging to objects of 10 different classes, namely, airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The CNN was trained for 500 epochs with three dropout layers.

Data                    Classes  Training and    Test  Training  Testing   (learning rate,
                                 Validation set  set   accuracy  accuracy  epochs)
CVLND-RGB               8        5760            160   100%      98.1%     (0.001, 2000)
Navarasas (real-world)  8        16800           560   100%      87.86%    (0.001, 5000)

Table 12: Performance of the proposed CNN model in Fig. 4 on the augmented proposed CVLND-RGB dataset and real-world Navarasas images.

Data                    Classes  Training and    Test  3 Convolutional        4 Convolutional
                                 Validation set  set   Training  Testing      Training  Testing
                                                       accuracy  accuracy     accuracy  accuracy
CVLND-RGB               8        5760            160   100%      85.62%       100.0%    86.0%
                                                       (0.001, 2000)          (0.001, 4000)
Navarasas (real-world)  8        16800           560   100%      86.8%        100.0%    85.36%
                                                       (0.001, 3000)          (0.001, 3000)

Table 13: Performance of the end-to-end CNN with 3 and 4 convolutional layers on the proposed datasets. Note that the results reported are on augmented color images.

Note that two dropout layers are placed after the respective pooling layers and the third dropout layer appears after the fully connected layer in the proposed architecture shown in Fig. 4. The test accuracy obtained by the proposed CNN on the CIFAR-10 [46] data is 61.6%. The converged weights obtained by training on the CIFAR-10 [46] data were used for initialization of the CNN, which was trained further on the augmented CVLND-RGB, CVLND-D and real-world unconstrained Navarasas datasets. An improvement in performance was achieved by pre-training the proposed CNN on the CIFAR-10 [46] dataset using the deep learning toolbox in Matlab R2017a. The details of the accuracies are reported in Table 14. Note that we obtained accuracies of 94.0%, 95.5%, and 88.0% for the augmented CVLND-RGB, CVLND-D, and real-world unconstrained Navarasas data, respectively, as opposed to the accuracies reported in the final column of Table 8. Additionally, the use of the pre-trained model of the proposed CNN using CIFAR-10 [46] reduced the number of training epochs (reported in the seventh column of Table 14) in contrast to the large number of epochs needed for training the network from random weight initialization (Table 8).
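A hedged PyTorch sketch of this transfer-learning step is given below: the small CNN of Fig. 4 is first trained on CIFAR-10, its converged weights are saved, and the same architecture is re-loaded from those weights and fine-tuned in all layers on the Navarasa data. NavarasaCNN refers to the earlier sketch, navarasa_train_loader is a hypothetical DataLoader, and the optimizer, loss and gray-scale conversion of CIFAR-10 are assumptions rather than the authors' exact Matlab settings.

import torch
import torch.nn as nn

# 1) Pre-train on CIFAR-10 (10 classes), then save the converged weights.
#    (CIFAR-10 images are assumed converted to gray-scale to match the 1-channel input.)
pretrain_net = NavarasaCNN(num_classes=10)
# ... train pretrain_net on CIFAR-10 here ...
torch.save(pretrain_net.state_dict(), "cifar10_pretrained.pt")

# 2) Initialize the Navarasa network from those weights, except the final layer,
#    whose shape differs (8 classes instead of 10).
net = NavarasaCNN(num_classes=8)
state = torch.load("cifar10_pretrained.pt")
state = {k: v for k, v in state.items() if not k.startswith("classifier")}
net.load_state_dict(state, strict=False)       # keep the randomly initialized output layer

# 3) Fine-tune every layer on the (augmented) Navarasa training set.
optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
for images, labels in navarasa_train_loader:   # hypothetical DataLoader
    optimizer.zero_grad()
    loss = criterion(net(images), labels)
    loss.backward()
    optimizer.step()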

Data                    No. of   Training and    Test  Batch  α      Epochs  CIFAR-10 [46]
                        Classes  Validation set  set   size                  pre-training
CVLND-RGB               8        5760            160   10     0.001  1000    94.0%
CVLND-D                 8        5760            160   10     0.001  1000    95.5%
Navarasas (real-world)  8        16800           560   10     0.001  1000    88.0%

Table 14: Performance of the proposed CNN in Fig. 4 (pre-trained on the CIFAR-10 [46] data) on the augmented CVLND-RGB and CVLND-D datasets and the real-world Navarasas database.
We also investigated the utility of a deeper CNN, shown in Fig. 5, pre-trained on the CIFAR-10 [46] dataset after pre-processing the augmented images. Note that the input to the network was the pre-processed augmented CIFAR-10 [46] dataset obtained by considering the original data along with its flipped version. The pre-processing is achieved by converting the images from the RGB to the YUV domain. Then the individual channels were normalized using the respective mean and standard deviation values. The pre-processing implemented for the CIFAR-10 [46] dataset is inspired by the work in [51]. Hence, a total of 100,000 pre-processed training images (50,000 original and 50,000 flipped images) and 10,000 test images were fed as input to the deeper architecture in Fig. 5. Note that only the training data was augmented. The network was trained for 100 epochs with a learning rate of 0.001 using the deep learning toolbox of Matlab R2017a, resulting in a training accuracy of 94.0% and a test accuracy of 81.53%. The advantage of using a deeper pre-trained CNN is evident from the results reported in Table 15. Using this deeper pre-trained model of Fig. 5 for the un-augmented proposed datasets of CVLND-RGB and real-world Navarasas we obtained test accuracies of 97.8% and 89.82%, respectively, as shown in the 6th column of Table 15. The accuracies obtained on the augmented CVLND-RGB and real-world Navarasas datasets are 98.3% and 94.64%, as shown in the 10th column of Table 15.
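A minimal numpy sketch of the pre-processing described above is given below: RGB images are mapped to the YUV domain and each channel is normalized by its mean and standard deviation computed over the training set. The BT.601 conversion coefficients are assumed, since the paper (following [51]) does not give the exact matrix.

import numpy as np

RGB_TO_YUV = np.array([[ 0.299,  0.587,  0.114],
                       [-0.147, -0.289,  0.436],
                       [ 0.615, -0.515, -0.100]])

def preprocess(images):
    # images: (N, H, W, 3) float array with RGB values in [0, 1]
    yuv = images @ RGB_TO_YUV.T
    mean = yuv.mean(axis=(0, 1, 2), keepdims=True)    # per-channel statistics over the set
    std = yuv.std(axis=(0, 1, 2), keepdims=True)
    return (yuv - mean) / (std + 1e-8)

batch = preprocess(np.random.rand(8, 32, 32, 3))      # e.g. augmented CIFAR-10 patches
print(batch.shape, batch.mean(axis=(0, 1, 2)))        # roughly zero-mean channels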
                                 Pre-trained CNN on un-augmented color data   Pre-trained CNN on augmented color data
Data                    Classes  Training and    Test  Training  Testing      Training and    Test  Training  Testing
                                 Validation set  set   accuracy  accuracy     Validation set  set   accuracy  accuracy
CVLND-RGB               8        960             160   100%      97.8%        5760            160   100.0%    98.3%
                                                       (0.001, 500)                                 (0.001, 2000)
Navarasas (real-world)  8        2800            560   100%      89.82%       16800           560   100.0%    94.64%
                                                       (0.001, 2000)                                (0.001, 2000)

Table 15: Performance of the deeper CNN model (Fig. 5) pre-trained on CIFAR-10 [46] data on the un-augmented and augmented proposed CVLND-RGB dataset and real-world Navarasas data. Note that the values inside brackets denote the learning rate and number of epochs, respectively.

Note that the pre-trained AlexNet [45] consists of eight layers and is initially trained on the ImageNet dataset [47], which has 1000 object categories and 1.2 million images. This pre-trained AlexNet [45] model is tested on the proposed CVLND (both RGB and depth) datasets and the real-world Navarasas database. The final layer of the architecture shown in Fig. 6 was removed and the feature vector of length 4096 is extracted and fed to a SVM for classification. The accuracies obtained on the test data of CVLND-RGB, CVLND-D and real-world unconstrained Navarasas data are 97.50%, 95.63% and 88.04%, respectively, as reported in Table 16. Note that these results are obtained using grayscale images of the proposed datasets.
The impact of the bigger AlexNet [45] architecture trained on the ImageNet data [47] is then analysed by fine-tuning the SVM to operate on the proposed color datasets, and the results are reported in Table 17. Test accuracies of 98.5% and 95.2% were obtained on the CVLND-RGB and real-world Navarasas data, respectively, using the pre-trained AlexNet [45] of Fig. 6. Note the improvement in accuracy achieved by using color images, as reported in the final column of Table 17, compared to the gray-scale counterpart shown in the final column of Table 16.

The advantage of the proposed pre-trained CNNs (using CIFAR-10 [46] data) is that we can easily fine-tune the weights of all the layers during training on the proposed datasets. However, such a pre-training procedure for AlexNet [45] would consume an inordinate amount of time and computational resources. Note that we use the AlexNet pre-trained CNN (without the last classification layer) only for extracting features from images of the proposed datasets. In the AlexNet+SVM technique only the SVM classifier is fine-tuned during training, whereas the weights of the different layers of the AlexNet are left untouched. This is unlike the smaller pre-trained CNNs of Figs. 4 and 5, wherein the weights of the various layers are fine-tuned during training on the proposed datasets. We adopt this procedure for transfer learning with the deeper and more complex AlexNet since training its weights further on our datasets would be computationally expensive.

Data | No. of Classes | Total data | Training and Validation set | Testing set | Validation accuracy (CNN-SVM) | Testing accuracy (CNN-SVM)
CVLND-RGB | 8 | 1120 | 960 | 160 | 98.18% | 97.50%
CVLND-depth | 8 | 1120 | 960 | 160 | 97.92% | 95.63%
Navarasas (real-world) | 8 | 3360 | 2800 | 560 | 99.91% | 88.04%

Table 16: Performance of the pre-trained AlexNet [45] + SVM classifier on the proposed Navarasas datasets. Note that the results reported are on gray-scale images.
ED

Data | No. of Classes | Total data | Training and Validation set | Testing set | Validation accuracy (CNN-SVM) | Testing accuracy (CNN-SVM)
CVLND-RGB | 8 | 1120 | 960 | 160 | 99.48% | 98.5%
Navarasas (real-world) | 8 | 3360 | 2800 | 560 | 100.0% | 95.2%

Table 17: Performance of the pre-trained AlexNet [45] + SVM classifier on the proposed datasets of CVLND-RGB and real-world Navarasas data. Note that the results reported are on color images.

5.5. Results on standard emotion datasets

In order to evaluate the performance of the proposed deep learning algorithms
on standard emotion datasets we choose to report our results on the
Taiwanese facial expression image database (TFEID) [7] [8] and a subset of the
extended Cohn-Kanade (CK+) dataset [9].
The TFEID database [7] [8] consists of 336 images corresponding to 8 facial
expressions enacted by both men and women. We randomly select a subset of
267 images as training data, 25 images for the validation set and 44 test images.
We used the proposed CNN architecture shown in Fig. 4 with random weight
initialization. The average training accuracy achieved on the TFEID [7] [8]
dataset for the above case is 99%. The average test accuracy achieved on the
above data using the proposed CNN is 98.1%, obtained using a learning
rate of 0.001 and a batch size of 10 after 3000 epochs. Details of the experiment are
provided in Table 18. We also report the performance of the proposed CNN on
a subset of the extended Cohn-Kanade (CK+) [9] dataset, which comprises 593 sequences
collected from 123 subjects. We report on a set consisting of five expressions,
namely, ‘Anger’, ‘Disgust’, ‘Happy’, ‘Surprise’ and ‘Neutral’. The data were
sorted using the annotations provided. The images categorised as “Neutral”
were obtained from the initial frame of each sequence. The proposed CNN in
Fig. 4 could achieve an accuracy of 92.0% on the test data with a learning
rate of 0.001 and a batch size of 10 after 2000 epochs, as reported in Table 18. The
accuracies obtained on the TFEID and CK+ datasets are comparable to the
average state-of-the-art accuracies of 99.63% [8] and 94.09% [52], respectively.

Data | No. of Classes | Training set | Validation set | Testing set | Batch size | α | Epochs | Testing accuracy (CIFAR-10 [46] pre-training)
TFEID | 8 | 267 | 25 | 44 | 10 | 0.001 | 3000 | 98.1%
CK+ (subset) | 5 | 733 | 72 | 144 | 10 | 0.001 | 2000 | 92.0%

Table 18: Performance of the proposed CNN model of Fig. 4 (pre-trained on the CIFAR-10 [46] data) on the TFEID [7] [8] and CK+ [9] standard facial expression datasets.
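As a concrete illustration of the training runs summarised in Table 18, a minimal loop using the reported learning rate (0.001) and batch size (10) is sketched below; the optimiser type and data pipeline are assumptions, since the paper does not state them.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model, train_set, val_set, epochs, lr=0.001, batch_size=10):
    """Train a CNN classifier with the learning rate and batch size reported in Table 18."""
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # optimiser type is an assumption
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        # Report validation accuracy once per epoch.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                predictions = model(images).argmax(dim=1)
                correct += (predictions == labels).sum().item()
                total += labels.numel()
        print(f"epoch {epoch + 1}: validation accuracy {correct / total:.3f}")

# e.g. train(model, tfeid_train, tfeid_val, epochs=3000)   # TFEID setting of Table 18
# e.g. train(model, ck_train, ck_val, epochs=2000)         # CK+ subset setting
```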
We also investigated the utility of deeper CNN architectures along with
transfer learning for the problem at hand. For this purpose, we used the deep
AlexNet [45] CNN which has been pre-trained with ImageNet data [47]. The
testing accuracies obtained using the deep AlexNet [45] network followed by an
SVM classifier on the TFEID [7] [8] and CK+ [9] datasets are 98.2% and 92.3%,
respectively.

Data | Proposed CNN with 2 convolutional layers (Fig. 4), random weight initialization | Pre-training using CIFAR-10 [46] (Fig. 4) | Pre-trained AlexNet [45] + SVM classifier | AlexNet [45] fine-tuning
CVLND-RGB | approx. 3 hours | approx. 1 hour | approx. 10 minutes | approx. 8 hours
CVLND-depth | approx. 3 hours | approx. 1 hour | approx. 10 minutes | approx. 8 hours
Navarasas (real-world) | approx. 7 hours | approx. 2 hours | approx. 15 minutes | approx. 13 hours

Table 19: Comparison of computation time for the CNN of Fig. 4 (with and without pre-training) with AlexNet [45] (with and without SVM classifier) on the proposed datasets. Note that all results reported in the table are obtained using gray-scale images.
Note that though AlexNet is quite a shallow network (compared to other
recently popular ones such as VGG [53]), its training/fine-tuning is not trivial.
We observe that the time complexity of the proposed CNN architecture in Fig.
4 is much lower. In Table 19, we provide a comparison of the computation time
incurred by the CNN architecture of Fig. 4 (with and without pre-training)
versus AlexNet [45] (with and without the SVM classifier). The second column of Table
19 shows the time incurred during training for the CNN architecture of Fig.
4 initialized with random weights. The third column shows the pre-trained CNN
of Fig. 4 during training on the proposed datasets. The fourth column shows the
time incurred for training only the SVM classifier coupled with a pre-trained
AlexNet [45] model. The last column of the table depicts the time taken for training
the AlexNet [45] architecture from random initial weights. Note that here the
last layer of the AlexNet [45] performs softmax classification. Note that all values
reported in Table 19 are obtained for gray-scale images. Using the
AlexNet [45] architecture corresponding to the last column of Table 19, accuracies
of 97.8%, 96% and 88.9% were obtained for the CVLND-RGB,
CVLND-depth, and real-world Navarasas data, respectively. This AlexNet
[45] was trained from random initial weights and softmax classification was performed
in its last layer.
It is interesting to compare these accuracies with those obtained by the pre-trained
AlexNet [45] + SVM classifier reported in Table 16. Observe that the AlexNet
[45] trained from random initial weights consumed an inordinate amount of
time (Table 19), as expected, only to yield a marginal improvement over the accuracies
obtained by the pre-trained AlexNet [45] + SVM classifier (Table 16).

6. Semantic interpretation of shloka

In the recent literature, several works address the problem of automatic
interpretation of short videos [54]. The proposed work bears resemblance to
work on fine-grained activity recognition [55] [54]. The work in [55]
proposes a dataset for semantic activities, e.g., those involved in cooking.
However, prior work does not address the problem of semantic interpretation of
short videos of ICD. The understanding of ICD videos would aid choreographers
and professional dancers in composing new dance pieces and training new
dancers. We believe that our work is the first to address the highly challenging
problem of semantically understanding ICD using a computer vision approach
with the aid of Navarasas.

Shloka with Navarasas on Siva: Sivey Sringaradra. Generally, performers
enacting dance pieces from ICD are accompanied by songs/music. These songs
are enacted by the dancer using a blend of body postures, hand gestures and
emotions. These songs are written in the ancient Sanskrit language and are
known as Shlokas. We demonstrate the utility of our deep learning based approach
for interpreting one Shloka by recognizing the emotions enacted by the
dancer. Note that in this Shloka the dancer is expressing the emotions of the
consort of the Hindu god, Siva. This is the 51st Shloka of Soundarya Lahari, which
describes the various emotions of the Divine Mother (Maa Tripurasundari).

Shive shringarardra
taditarajane kutsanapara
sarosham gangayam
girishacharite vismayavati
harahibhyo bhita
sarasiruha sowbhagya janani
sakhi sushmera
te mayi janani drishti sakarunam
The above Shloka expresses the emotions seen in the Divine Mother’s eyes as follows:
Shive shringarardra - love on seeing Siva; taditarajane kutsanapara -
disgust at other men; sarosham gangayam - jealousy on seeing Ganga; girishacharite
vismayavati - wonder when she hears of the deeds of Siva; harahibhyo bhita - fear
when she sees the snakes that adorn Siva as garlands; sarasiruha sowbhagya
janani - an indulgent smile when she sees her girl friends; sakhi sushmera - she
looks with compassion at her devotees; te mayi janani drishti sakarunam - her
face is as lovely as a lotus, symbolizing heroism.
As the Shloka is enacted using various Navarasas, we attempt to identify
them, thereby interpreting the semantic meaning of the dance piece. The trained
classifiers obtained using the proposed datasets of Navarasas are used to understand
the semantic meaning of the real-world dance videos.
Figure 13: Snapshot of the Navarasas enacted in the Shloka Shive Sringaradra. (a) through
(g) represent the Navarasas, namely, Shringaar, Bibhatsya, Adbhuta, Bhayanaka, Hasya,
Karunaa and Shaanta, respectively.

A video wherein a dancer is enacting the Sivey sringaradra shloka is taken
from Youtube and broken down into frames as shown in Fig. 13. We manually
isolated the frames and cropped them to separate the enactments of the Navarasas
depicted in the Shloka. Note that the images which are fed as input are manually
cropped and there is no automatic detection of face and hand regions to
facilitate the proposed approach. The enactment contains several Navarasas,
namely, Shringara, Bibhatsya, Adbhuta, Bhayanaka, Hasya, Karunaa and Shanta,
which were tested using the proposed CNN model (Fig. 4) trained on the
real-world Navarasas dataset. Among the Navarasas contained in the shloka,
Shringara is enacted in a different way compared to the data present in our
proposed dataset and hence could not be recognised by our trained model. Also,
the enactment of Karuna could not be recognised by the trained model due to
its absence in the proposed dataset. However, the enactments of Bibhatsya,
Adbhuta, Hasya, and Shanta could be correctly identified by the trained CNN.
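No code accompanies this step in the paper; a minimal sketch of sampling frames from such a video and labelling each with a trained classifier is given below. The OpenCV-based frame extraction, the sampling step and the class list are illustrative assumptions, and in our experiments the relevant regions were cropped manually rather than detected automatically.

```python
import cv2
import torch

# Eight-class label set used for illustration; the exact class set and ordering are assumptions.
NAVARASAS = ['Shringara', 'Hasya', 'Roudra', 'Veera',
             'Bhayanaka', 'Bibhatsya', 'Adbhuta', 'Shanta']

def classify_video_frames(video_path, model, preprocess, step=30):
    """Sample every `step`-th frame of a dance video and predict a Navarasa for each.

    `model` is a trained CNN returning per-class scores; `preprocess` converts a BGR
    frame into the tensor the model expects (in the paper the relevant regions were
    cropped manually before classification).
    """
    capture = cv2.VideoCapture(video_path)
    predictions = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            with torch.no_grad():
                scores = model(preprocess(frame).unsqueeze(0))
            predictions.append((index, NAVARASAS[scores.argmax(dim=1).item()]))
        index += 1
    capture.release()
    return predictions  # e.g. [(0, 'Adbhuta'), (30, 'Hasya'), ...]
```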
AN
7. Practical relevance

We foresee several applications of the proposed algorithm which can aid
choreographers and trainee dancers of ICD. Since we propose a deep learning
approach to identify Navarasas in various frames extracted from the video of a
dance performance, it is possible to map words (or a text description) to images
containing particular enactments of emotions. This has been illustrated
in the detailed description of semantically interpreting a shloka in the previous
section. Hence, choreographers can synthesize/visualize short dance pieces
containing enactments of emotions/Navarasas corresponding to a poem/shloka
which is provided as a text description. Our proposed CVLND-RGB and real-world
Navarasa databases can serve as a lexicon using which dancers can create
the visual enactments. Our datasets could also prove useful for building query-based
retrieval systems. Herein, a sample image showing an enactment of a
Navarasa can be used to query a repository containing several videos of dance
performances. Trainee dancers can study how several accomplished performers
of ICD enact Navarasas in real-world concerts.
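Such a query-by-example retrieval system is not implemented in this work; the following is a minimal sketch of one possible design, assuming CNN features (e.g., from the AlexNet extractor sketched earlier) and a simple cosine-similarity nearest-neighbour search.

```python
import numpy as np

def build_index(frame_features, frame_ids):
    """Store L2-normalised CNN features for frames sampled from a video repository."""
    feats = np.asarray(frame_features, dtype=np.float32)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    return feats, list(frame_ids)

def query(index, query_feature, top_k=5):
    """Return the repository frames whose enacted emotion looks most similar to the query."""
    feats, ids = index
    q = np.asarray(query_feature, dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = feats @ q                      # cosine similarity against every stored frame
    best = np.argsort(-scores)[:top_k]
    return [(ids[i], float(scores[i])) for i in best]

# Usage: extract a feature for the query image with the trained CNN, then look up
# the closest frames in the repository.
# index = build_index(all_frame_features, all_frame_ids)
# matches = query(index, extract_features([query_image])[0])
```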
Figure 14: A snapshot of Navarasas collected from novices for the classes of Adbhuta,
Bhayanaka, Roudra, and Veera, respectively, which were rated poorly by the proposed deep
learning algorithm.

The algorithm can also be used as an online tutor wherein the system initially
shows some images of a particular Navarasa from the training set to a novice,
and when the trainee performs the Navarasa the system captures a picture of the
dancer and feeds it to the proposed machine learning algorithm, which then rates
the performed enactment.
AN
To illustrate this possible application of the proposed approach, a set of images
was collected from novice dancers for various Navarasas to obtain a test
set for evaluation purposes. The students were initially shown images from our
CVLND-RGB dataset. They then performed the Navarasa shown to them and
images were captured. The data comprised images collected from five trainees, each
enacting a particular Navarasa ten times. Hence, the test dataset is composed
of a total of 400 images, with 50 images for each of the eight Navarasas. The
images collected from the novices were then rated using the proposed deep learning
approach. A snapshot of some of the images collected from novices which were
rated poorly by the proposed algorithm is shown in Fig. 14. For the images of
Adbhuta, Bhayanaka, Roudra, and Veera shown in Fig. 14, classification probabilities
of 0.5, 0.5, 0.5, and 0.4, respectively, were obtained using the trained
CNN model of Fig. 5. By checking these scores, novice dancers can further
improve their enactments of Navarasas and train themselves effectively.
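A minimal sketch of how such a tutor could turn the CNN output into feedback is given below; the probability threshold and verdict wording are illustrative assumptions, since the paper reports only the class probabilities themselves.

```python
import torch
import torch.nn.functional as F

def rate_enactment(model, image_tensor, target_class, threshold=0.6):
    """Score a novice's enactment by the softmax probability the trained CNN assigns
    to the intended Navarasa class (the threshold is an illustrative choice)."""
    with torch.no_grad():
        probs = F.softmax(model(image_tensor.unsqueeze(0)), dim=1)[0]
    score = probs[target_class].item()
    verdict = "good" if score >= threshold else "needs improvement"
    return score, verdict

# A probability of 0.5 or 0.4 for the intended class, as in the examples of Fig. 14,
# would be reported back to the trainee as "needs improvement".
```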

8. Conclusion

In this work we proposed a deep learning framework using convolutional neural
networks to recognize emotions corresponding to the Navarasas associated
with ICD. We proposed two datasets, namely, CVLND-RGB and CVLND-D,
consisting of RGB images and the corresponding depth maps captured under controlled
laboratory conditions. We also captured a large real-world dataset of professional
dancers enacting the Navarasas in unconstrained scenarios. We evaluated
the performance of the proposed CNN with a relatively smaller architecture on
all our datasets. We also demonstrated the role of data augmentation, transfer
learning and deeper CNN architectures. To the best of our knowledge this
work is the first to address the interesting problem of understanding ICD videos
by recognizing emotions enacted by the dancer. However, there exists ample
scope to improve the robustness of the proposed approach so as to handle the
complexities associated with real-world videos of ICD. We are presently investigating
the handling of occlusions and severe illumination variations for the recognition of
Navarasas in real dance videos. The semantic understanding of ICD videos may
aid choreographers and professional dancers in training new dancers as well as
in composing new dance pieces. In the future our approach may find use in dance
concerts for judging performances.

Acknowledgments
ED

We are grateful to Mr. Ashis Kumar Das and his team at ‘Odissi Nritya
Mandal’ for providing data enacted by professional dancers.

References

[1] A. Mohanty, P. Vaishnavi, P. Jana, A. Majumdar, A. Ahmed, T. Goswami, R. R. Sahay, Nrityabodha: Towards understanding Indian classical dance using a deep learning approach, Signal Processing: Image Communication 47 (2016) 529–548.

[2] V. Athitsos, C. Neidle, S. Sclaroff, J. Nash, A. Stefan, Q. Yuan, A. Thangali, The American sign language lexicon video dataset, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2008, pp. 1–8.

[3] S. Chen, Y. Tian, Q. Liu, D. N. Metaxas, Recognizing expressions from face and body gesture by temporal normalized motion and appearance features, Image and Vision Computing 31 (2) (2013) 175–185.

[4] M. Pantic, L. J. Rothkrantz, Facial action recognition for facial expression analysis from static face images, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34 (3) (2004) 1449–1461.

[5] Z. Zeng, M. Pantic, G. I. Roisman, T. S. Huang, A survey of affect recognition methods: Audio, visual, and spontaneous expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (1) (2009) 39–58.

[6] H. Gunes, M. Piccardi, Bi-modal emotion recognition from expressive face and body gestures, Journal of Network and Computer Applications 30 (4) (2007) 1334–1345.

[7] L.-F. Chen, Y.-S. Yen, Taiwanese facial expression image database, Brain Mapping Laboratory, Institute of Brain Science, National Yang-Ming University, Taipei, Taiwan. URL http://bml.ym.edu.tw/tfeid/modules/wfdownloads/

[8] M. J. Cossetin, J. C. Nievola, A. L. Koerich, Facial expression recognition using a pairwise feature selection and classification approach, in: IEEE International Joint Conference on Neural Networks, 2016, pp. 5149–5155.

[9] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2010, pp. 94–101.

[10] C. Darwin, The expression of the emotions in man and animals, Oxford University Press, USA, 1998.

[11] P. Ekman, Facial expression and emotion, American Psychologist 48 (4) (1993) 384–392.

[12] M. Pantic, L. J. M. Rothkrantz, Automatic analysis of facial expressions: The state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1424–1445.

[13] A. Samal, P. A. Iyengar, Automatic recognition and analysis of human faces and facial expressions: A survey, Pattern Recognition 25 (1) (1992) 65–77.

[14] G. Donato, M. S. Bartlett, J. C. Hager, P. Ekman, T. J. Sejnowski, Classifying facial actions, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (10) (1999) 974–989.

[15] X.-W. Wang, D. Nie, B.-L. Lu, Emotional state classification from EEG data using machine learning approach, Neurocomputing 129 (2014) 94–106.

[16] M. J. Black, Y. Yacoob, Recognizing facial expressions in image sequences using local parameterized models of image motion, International Journal of Computer Vision 25 (1) (1997) 23–48.

[17] J. F. Cohn, A. J. Zlochower, J. J. Lien, T. Kanade, Feature-point tracking by optical flow discriminates subtle differences in facial expression, in: Third IEEE Conference on Automatic Face and Gesture Recognition, 1998, pp. 396–401.

[18] M. Valstar, M. Pantic, Fully automatic facial action unit detection and temporal analysis, in: IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2006, pp. 149–149.

[19] L.-L. Shen, Z. Ji, Modelling geometric features for face based age classification, in: IEEE Conference on Machine Learning and Cybernetics, Vol. 5, 2008, pp. 2927–2931.

[20] O. Rudovic, M. Pantic, I. Patras, Coupled Gaussian processes for pose-invariant facial expression recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (6) (2013) 1357–1369.

[21] R. W. Picard, R. Picard, Affective computing, Vol. 252, MIT Press, Cambridge, 1997.

[22] P. Barros, D. Jirak, C. Weber, S. Wermter, Multimodal emotional state recognition using sequence-dependent deep hierarchical features, Neural Networks 72 (2015) 140–151.

[23] H. Gunes, M. Piccardi, Automatic temporal segment detection and affect recognition from face and body display, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39 (1) (2009) 64–84.

[24] J. S. Lerner, D. Keltner, Beyond valence: Toward a model of emotion-specific influences on judgement and choice, Cognition & Emotion 14 (4) (2000) 473–493.

[25] M. E. Kret, K. Roelofs, J. J. Stekelenburg, B. de Gelder, Emotional signals from faces, bodies and scenes influence observers’ face expressions, fixations and pupil-size, Frontiers in Human Neuroscience 7, 810–850.

[26] G. Castellano, L. Kessous, G. Caridakis, Emotion recognition through multiple modalities: face, body gesture, speech, Affect and Emotion in Human-Computer Interaction (2008) 92–103.

[27] Y. Gu, X. Mai, Y.-j. Luo, Do bodily expressions compete with facial expressions? Time course of integration of emotional signals from the face and the body, PLoS One 8 (7) (2013) 736–762.

[28] Z. Duric, W. D. Gray, R. Heishman, F. Li, A. Rosenfeld, M. J. Schoelles, C. Schunn, H. Wechsler, Integrating perceptual and cognitive modeling for adaptive and intelligent human-computer interaction, Proceedings of the IEEE 90 (7) (2002) 1272–1289.

[29] A. Kapoor, W. Burleson, R. W. Picard, Automatic prediction of frustration, International Journal of Human-Computer Studies 65 (8) (2007) 724–736.

[30] C. L. Lisetti, F. Nasoz, Maui: a multimodal affective user interface, in: Proceedings of the Tenth ACM International Conference on Multimedia, ACM, 2002, pp. 161–170.

[31] L. Maat, M. Pantic, Gaze-x: Adaptive, affective, multimodal interface for single-user office scenarios, in: Artificial Intelligence for Human Computing, Springer, 2007, pp. 251–271.

[32] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, S. Narayanan, Analysis of emotion recognition using facial expressions, speech and multimodal information, in: Proceedings of the 6th International Conference on Multimodal Interfaces, ACM, 2004, pp. 205–211.

[33] P. Burkert, F. Trier, M. Z. Afzal, A. Dengel, M. Liwicki, Dexpression: Deep convolutional neural network for expression recognition, arXiv preprint arXiv:1509.05371.

[34] K. Anderson, P. W. McOwan, A real-time automated system for the recognition of human facial expressions, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 36 (1) (2006) 96–105.

[35] I. Kotsia, I. Pitas, Facial expression recognition in image sequences using geometric deformation features and support vector machines, IEEE Transactions on Image Processing 16 (1) (2007) 172–187.

[36] C.-D. Căleanu, Face expression recognition: A brief overview of the last decade, in: 8th IEEE International Symposium on Applied Computational Intelligence and Informatics, 2013, pp. 157–161.

[37] I. Song, H.-J. Kim, P. B. Jeon, Deep learning for real-time robust facial expression recognition on a smartphone, in: IEEE Conference on Consumer Electronics, 2014, pp. 564–567.

[38] Y.-H. Byeon, K.-C. Kwak, Facial expression recognition using 3D convolutional neural network, International Journal of Advanced Computer Science and Applications 5 (12) (2014) 107–112.

[39] C. Shan, S. Gong, P. W. McOwan, Facial expression recognition based on local binary patterns: A comprehensive study, Image and Vision Computing 27 (6) (2009) 803–816.

[40] M. Pantic, M. Valstar, R. Rademaker, L. Maat, Web-based database for facial expression analysis, in: IEEE Conference on Multimedia and Expo, 2005, pp. 5–10.

[41] J. Wang, L. Yin, Static topographic modeling for facial expression recognition and analysis, Computer Vision and Image Understanding 108 (1) (2007) 19–34.

[42] A. R. Rivera, J. R. Castillo, O. O. Chae, Local directional number pattern for face analysis: Face and expression recognition, IEEE Transactions on Image Processing 22 (5) (2013) 1740–1752.

[43] G. Littlewort, J. Whitehill, T. Wu, I. Fasel, M. Frank, J. Movellan, M. Bartlett, The computer expression recognition toolbox (CERT), in: IEEE Conference on Automatic Face & Gesture Recognition and Workshops, 2011, pp. 298–305.

[44] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.

[45] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[46] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Master’s thesis, Computer Science Department, University of Toronto (2009).

[47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.

[48] B. Gupt, Dramatic Concepts Greek & Indian: A Study of the Poetics and the Natyasastra, DK Printworld, 1994.

[49] A. Lal, The Oxford Companion to Indian Theatre, Oxford University Press, USA, 2004.

[50] C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba, Hoggles: Visualizing object detection features, in: IEEE Conference on Computer Vision, 2013, pp. 1–8.

[51] CIFAR-10 in Torch, https://github.com/szagoruyko/cifar.torch/blob/master/provider.lua, [Online; accessed 29-May-2017].

[52] W. Xie, L. Shen, M. Yang, Z. Lai, Active AU based patch weighting for facial expression recognition, Sensors 17 (2) (2017) 275–297.

[53] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.

[54] J. Lei, X. Ren, D. Fox, Fine-grained kitchen activity recognition using RGB-D, in: Proceedings of the 2012 ACM Conference on Ubiquitous Computing, 2012, pp. 208–211.

[55] M. Rohrbach, S. Amin, M. Andriluka, B. Schiele, A database for fine-grained activity detection of cooking activities, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1194–1201.
Aparna Mohanty received the B.Tech. and M.Tech. degrees in Electronics
and Communication Engineering from Biju Patnaik University of Technology.
She is currently working as a research scholar in the Department of Electrical Engineering
at the Indian Institute of Technology, Kharagpur, India. Her current work is
in the field of computer vision and machine learning.

Rajiv Ranjan Sahay received the B.Tech. degree in electrical engineering
from the National Institute of Technology, Hamirpur, India, in 1998, the
M.Tech. degree in electronics and communication engineering from the Indian Institute
of Technology, Guwahati, India, in 2001, and the Ph.D. degree from the
Indian Institute of Technology, Madras, in 2009. He was employed as a research
engineer at the Indian Space Research Organization (ISRO) from 2001 to 2003.
Presently, he is a postdoctoral researcher at the School of Computing, National
University of Singapore. His research interests include 3-D shape reconstruction
using real-aperture cameras, super-resolution, and machine learning. His
dissertation investigated several extensions to the shape-from-focus technique,
including exploiting defocus and parallax effects in a sequence of images.