
International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835, Volume-5, Issue-3, Mar.-2018
http://iraj.in
FACIAL EXPRESSION RECOGNITION USING CNN: A SURVEY
1RAVICHANDRA GINNE, 2KRUPA JARIWALA

1M.Tech Student, Department of Computer Engineering, SVNIT, Surat
2Assistant Professor, Department of Computer Engineering, SVNIT, Surat
E-mail: 1ravichandraginne@yahoo.com, 2knj@coed.svnit.ac.in

Abstract - Facial expression recognition (FER) has become an active research area with many applications, including human-computer interfaces, human emotion analysis, psychological analysis and medical diagnosis. Popular methods for this task are based on geometry and appearance. Deep convolutional neural networks (CNNs) have been shown to outperform traditional methods in various visual recognition tasks, including facial expression recognition. Even though efforts have been made to improve the accuracy of CNN-based FER systems, existing methods may still fall short of practical requirements. This study presents a generic review of CNN-based FER systems and their strengths and limitations, which helps us understand and further improve FER systems.

Keywords - CNN, FER, feature map, ReLU, MLP, BN.

I. INTRODUCTION

People have been trying to build artificially intelligent (AI) systems that are equivalent to humans for a long time. The increased availability of computational power and of training data has helped the development of machine learning techniques and fast learning machines. Among these, deep learning is widely considered a promising technique for building intelligent machines. Facial expression recognition is the process of identifying human emotion from facial expressions. While humans are naturally capable of understanding emotions, it has remained a challenging task for machines. The facial expression recognition process consists of feature extraction and classification, as shown in Fig. 1.

Fig.1 Facial Expression Recognition System

Facial expression recognition is a classification task: the classifier takes as input a set of features retrieved from the input image, and these features describe the facial expression. Choosing a good feature set, an efficient learning technique and a diversified database for training are the important factors in classification.

A CNN combines deep learning technology with artificial neural networks. The massive development in deep learning and the application of CNNs to classification problems have attained great success [1,2,3]. This success comes from the fact that feature extraction and classification can be performed simultaneously in a single network: critical features are extracted by the deep learning method itself, which updates the weights using back-propagation and error optimization.

II. CONVOLUTIONAL NEURAL NETWORKS

CNNs are biologically inspired variants of multi-layer perceptron (MLP) networks. They use an architecture which is particularly well suited to classifying images. The connections between the layers, the shared weights and some form of subsampling result in features that are invariant to translation, and this architecture also makes convolutional networks fast to train.

A. Architecture of CNN
A CNN contains an input layer, multiple hidden layers and an output layer. Convolutional layers, sub-sampling (pooling) layers, normalization layers and fully connected layers are treated as hidden layers. Fig.2 shows the architecture of a 5-layered CNN.

Fig. 2 Architecture of a 5-layered CNN [4]

• In the convolutional layer, the input image is convolved with kernel(s) to produce feature map(s) or activation map(s).
• Pooling reduces the dimensionality of each feature map but retains the most important
information. It partitions the image into non-overlapping regions, and each region is subsampled (down-sampled) by a non-linear function such as maximum, minimum or average. Max pooling is the most common down-sampling function; it gives the maximum value of each region as output.
• In the Rectified Linear Units (ReLU) layer, the activation function f(x) = max(0, x) is applied element-wise. ReLU introduces non-linearity into the network. Other functions used to introduce non-linearity are the hyperbolic tangent, the sigmoid, etc.
• The fully connected layer of a CNN is located after several convolutional and pooling layers and is a traditional multi-layer perceptron (MLP). All neurons in this layer are fully connected to all activations in the previous layer.
• In the loss layer, different loss functions suitable for different tasks are used. A softmax loss function is used for the classification of an image into multiple classes.
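The layer operations listed above can be illustrated in a few lines of NumPy. This is a toy sketch, not code from any of the surveyed systems; the 6x6 "image" and the 2x2 kernel values are made up for demonstration.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise activation f(x) = max(0, x)."""
    return np.maximum(0, x)

def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size regions."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    """Turn raw scores into class probabilities (as in the loss layer)."""
    e = np.exp(z - z.max())
    return e / e.sum()

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image" (a ramp)
kernel = np.array([[-1., 1.], [-1., 1.]])         # toy kernel: horizontal change
fmap = relu(conv2d(image, kernel))                # 5x5 feature map, all >= 0
pooled = max_pool(fmap)                           # 2x2 map after pooling
probs = softmax(pooled.flatten())                 # pseudo class probabilities
```

On this ramp image each feature-map entry comes out the same, so the softmax output is uniform; with a real image and learned kernels, the pipeline is the same but the maps carry discriminative structure.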

III. CNN FOR FACIAL EXPRESSION RECOGNITION

In recent years various CNN architectures and models have been created and used for facial expression recognition. This section presents a summary review of research on facial expression recognition using CNN.

A. Deep CNN for FER
Ayşegül Uçar [5] proposed a CNN model with 10 layers for FER, as shown in Fig.3. In their architecture, a convolutional layer with kernel size 5 x 5, stride 1 and pad 2 is applied first, followed by a max pooling layer with kernel size 3 x 3, stride 2 and pad 1. This process is repeated 3 times with different strides and pads, as shown in Fig. 3. Then a convolutional layer with kernel size 2 x 2, stride 1 and pad 0 is applied, followed by another convolutional layer with kernel size 1 x 1, stride 1 and pad 0. Finally, a fully connected layer is added to the network.

Fig.3 Pipeline of proposed CNN [5]
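Feature-map sizes in such a stack can be traced with the standard output-size formula, out = floor((in + 2*pad - kernel)/stride) + 1. The sketch below walks a 16x16 input (the resolution Uçar trains on, per the evaluation below) through the stack; since the exact strides and pads of the repeated stages are given only in the figure, reusing the same 5x5/3x3 parameters for all three stages is an assumption made here for illustration.

```python
def out_size(in_size, kernel, stride, pad):
    """Spatial output size of a conv/pool layer: floor((in + 2p - k)/s) + 1."""
    return (in_size + 2 * pad - kernel) // stride + 1

size = 16                                              # input resized to 16x16
for _ in range(3):                                     # three conv + pool stages
    size = out_size(size, kernel=5, stride=1, pad=2)   # conv 5x5, s1, p2
    size = out_size(size, kernel=3, stride=2, pad=1)   # max pool 3x3, s2, p1
size = out_size(size, kernel=2, stride=1, pad=0)       # conv 2x2, s1, p0
size = out_size(size, kernel=1, stride=1, pad=0)       # conv 1x1, s1, p0
print(size)  # prints 1
```

With these assumed parameters the 16x16 input collapses to a 1x1 map (16 -> 8 -> 4 -> 2 -> 1), ready for the final fully connected layer.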
Their model was evaluated on images from the JAFFE database [6] and the CK database [7]. In the first evaluation, seven facial expressions from various images of the JAFFE database were used for training. In the second evaluation, the Cohn-Kanade database, which contains images of all races, was employed for six expressions. They resized the images to 16x16 and trained the proposed model for 30 epochs with a batch size of 40, using a learning rate of 0.001 for the first eleven epochs, 0.002 for epochs 13-29 and 0.00001 for the last epoch. Their results show that the proposed method outperforms traditional geometric and appearance based methods with high accuracy.

B. Baseline CNN structure for FER
Minchul Shin et al. [8] have looked into four network models that are known to perform well in facial expression recognition, in order to identify the most efficient network structure. The effect of input image pre-processing techniques on performance was also investigated.
The first network examined was Tang's CNN structure [9]. It contains a layer for input transformation, three convolutional and pooling layers, and a fully connected two-layer perceptron at the end.

The second network is Yu's structure [2], which contains five convolutional layers, three stochastic pooling layers, and three fully connected layers. The network has two convolutional layers prior to each pooling layer, except for the first layer.
The third network investigated is Kahou's structure [3], which contains three convolutional and pooling layers followed by a two-layer MLP.
The last one is the Caffe-ImageNet structure [10]. It was designed for the classification of images taken from the ImageNet dataset into 1000 classes, but the output nodes were reduced to seven in the baseline CNN approach. Every convolutional and fully connected layer of all four networks is followed by a ReLU layer and a dropout layer.
Five test sets (FER-2013, SFEW2.0, CK+, KDEF and JAFFE) were chosen to perform tests with the four network structures. For the pre-processing of the input image, histogram equalization was found to show the most reliable performance for all four networks. It was also observed that Tang's network could achieve reasonably high accuracy on histogram equalized images compared to the other network models. Based on this observation, they suggested Tang's simple network along with histogram equalization as the baseline model for carrying out further research.

C. FER with CNN ensemble
Kuang Liu et al. [11] have proposed a model consisting of many subnets that are structured differently. Each of these subnets is separately trained on a training set, and they are then combined: the output layers are removed, the layers before the last layer are concatenated together, and finally the connected network is trained to output the final expression labels.
They evaluated their model on the Facial Expression Recognition 2013 (FER-2013) dataset, which contains grayscale images of faces of size 48 x 48 pixels. They divided the dataset into an 80% training set and a 20% validation set and trained the subnets separately; each of the subnets achieved a different accuracy on the dataset. By combining and averaging the outputs of CNNs of different structures, their network reports an improvement in performance compared to a single CNN structure.

D. Stacked Deep Convolutional Auto-Encoders for Emotion Recognition from Facial Expressions
Ariel Ruiz-Garcia et al. [12] have studied the effect of reducing the number of convolutional layers and pre-training the deep CNN as a Stacked Convolutional Auto-Encoder (SCAE) in a greedy layer-wise unsupervised fashion for emotion recognition from facial expressions. They incorporated Batch Normalization (BN) for both the convolutional and the fully connected layers in their model to accelerate training and improve classification performance.
In the SCAE emotion recognition model, each convolutional layer and its subsequent layers (BN, ReLU and max pooling) are treated as a single block, and an Auto-Encoder is created for each of these blocks. The first Auto-Encoder learns to reconstruct raw pixel data; the second Auto-Encoder learns to reconstruct the output of the first encoder, and so on. Finally, the fully connected layer is trained to associate the output of the last convolutional encoder with its corresponding label.
Their CNN with BN and the SCAE emotion recognizers were trained and tested on the KDEF [13] dataset. Applying this pre-training technique to initialize the weights of the CNN using Auto-Encoders increased their model's performance to 92.53% and dramatically reduced the training time.

OBSERVATIONS

From the above study it is clear that, even though there are numerous approaches to facial expression recognition, new models are still being developed continuously. The main reason for this is accuracy: researchers are continuously trying to improve the accuracy of FER by proposing various architectures and models, and they have also adopted additional techniques in their architectures, as discussed in Section III, to address the accuracy problem. Efforts have also been made to reduce the training time for better performance, and ensemble CNNs are used to improve the accuracy of facial expression recognition.

CONCLUSIONS

This paper includes a study of some of the facial expression recognition systems based on CNNs. Different architectures, approaches, requirements, databases for training/testing images and their performance have been studied here. Each method has its own strengths and limitations. This study helps to understand different kinds of models for facial expression recognition and to develop new CNN architectures for better performance and accuracy.

REFERENCES

[1] Ian J. Goodfellow et al., "Challenges in representation learning: A report on three machine learning contests", in Neural Information Processing, ICONIP, 2013, pp. 117-124.
[2] Z. Yu, C. Zhang, "Image based static facial expression recognition with multiple deep network learning", Proceedings of the 2015 ACM International Conference on Multimodal Interaction, pp. 435-442, November 2015.
[3] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre, R. Memisevic, M. Mirza, "Combining modality specific deep neural networks for emotion recognition in video", Proceedings of the 15th ACM International Conference on Multimodal Interaction, pp. 543-550, December 2013.
[4] B. Fasel, "Head-pose invariant facial expression recognition using convolutional neural networks," Proceedings of the

Fourth IEEE International Conference on Multimodal Interfaces, 2002, pp. 529-534.
[5] A. Uçar, "Deep Convolutional Neural Networks for facial expression recognition," 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Gdynia, 2017, pp. 371-375.
[6] M. Lyons, J. Budynek, S. Akamatsu, "Automatic classification of single facial images", IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 12, pp. 1357-1362, 1999.
[7] T. Kanade, J. F. Cohn, T. Yingli, "Comprehensive database for facial expression analysis", in IEEE 4th International Conference on Automatic Face and Gesture Recognition, pp. 46-53, 2000.
[8] M. Shin, M. Kim and D. S. Kwon, "Baseline CNN structure analysis for facial expression recognition," 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), New York, NY, 2016, pp. 724-729.
[9] Y. Tang, "Deep learning using support vector machines", CoRR abs/1306.0239, 2013.
[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, T. Darrell, "Caffe: Convolutional architecture for fast feature embedding", Proceedings of the ACM International Conference on Multimedia, pp. 675-678, November 2014.
[11] K. Liu, M. Zhang and Z. Pan, "Facial Expression Recognition with CNN Ensemble," 2016 International Conference on Cyberworlds (CW), Chongqing, 2016, pp. 163-166.
[12] A. Ruiz-Garcia, M. Elshaw, A. Altahhan and V. Palade, "Stacked deep convolutional auto-encoders for emotion recognition from facial expressions", 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, 2017, pp. 1586-1593.
[13] D. Lundqvist, A. Flykt, A. Öhman, The Karolinska Directed Emotional Faces - KDEF, CD ROM from Department of Clinical Neuroscience, Psychology Section, Karolinska Institute, pp. 3-5, 1998.


