
Spontaneous Facial Expression Recognition by Heterogeneous Convolutional Networks


Xianlin Peng1, Lei Li1, Xiaoyi Feng1, and Jianping Fan2
1 School of Electronics and Information, Northwestern Polytechnical University
e-mail: 397448497@qq.com
2 School of Information Science and Technology, Northwest University
e-mail: pjy@nwu.edu.cn

Abstract— Spontaneous facial expression has attracted much attention recently because of its potential applications in computer vision and pattern recognition. Although convolutional networks have been applied to recognizing acted facial expressions with state-of-the-art performance, the performance of recognizing spontaneous facial expressions still needs to be improved. In this paper, a heterogeneous deep model is presented to recognize spontaneous expressions. The deep model consists of two types of convolutional networks with different architectures. To leverage the acted data, these two deep sub-networks are pre-trained on acted data and then transferred to the spontaneous data. Experiments show the advantages of the proposed method on a spontaneous facial expression dataset.

Keywords— Spontaneous facial expression, Deep learning, Convolutional networks, Heterogeneous architecture

I. INTRODUCTION

Facial expression recognition (FER) is a popular topic in the fields of computer vision and pattern recognition [1]. Over the last three decades, FER technologies have been applied in many scenarios, such as news analysis and human-computer interfaces. A number of approaches have been presented for these applications, and various visual features have been proposed to describe the diversity of expressions. Since facial expressions depend on race, culture, personality, etc., classifying facial expressions into basic types is still challenging.

Existing approaches to recognizing facial expressions can be classified into two categories. The first category concentrates on the six basic emotions: happiness, sadness, surprise, anger, disgust and fear. Many approaches are devoted to recognizing these basic expressions on conventional datasets, such as the MMI facial expression database [2] and the Japanese Female Facial Expression (JAFFE) database [3]. The MMI database has more than 2,000 images or frames of expressions and more than 500 expression images of 50 persons. The JAFFE database [3] contains 213 expression images of 10 Japanese females. The second category focuses on extracting fine-grained descriptions of facial expressions, such as action units (AUs) [4], neutral [5], [6], [7], [8], wink [6], [7], sleepy [6], talk [7], and scream [8]. The most famous dataset, the Cohn-Kanade database [4], contains many action units and combinations of action units based on the Facial Action Coding System (FACS) proposed by Ekman. All the above approaches address acted expressions; however, spontaneous facial expressions are very different from acted ones, which are acquired in a highly-controlled environment and merely mimic spontaneous expressions. Therefore, approaches learned from acted expressions suffer from low performance in real-world applications.

Many approaches have tried to improve the generalization of expression recognition. Shan et al. [9] trained their methods across datasets: after extracting local binary pattern (LBP) features from the Cohn-Kanade database, a Support Vector Machine (SVM) was tested on the MMI database and the JAFFE database, respectively. On the other hand, deep learning techniques have been used to recognize acted expressions, as deep models can learn robust expression features. A 3D convolutional neural network (CNN) [10] is used to recognize dynamic expressions in video data with strong spatial structural constraints on the action parts. In [11], conventional deep models, i.e., CNNs and deep neural networks (DNNs), were jointly used to learn appearance and geometric features for acted expressions. These deep models achieve good performance on video-based expression data but focus on acted data.

Recently, a spontaneous image dataset built from web images has been constructed [12]. Since these spontaneous-expression images were labeled by a large number of web users and have large diversity, it is difficult to recognize the basic expressions even with deep models (verified in the subsequent experiments). In this paper, a novel facial expression method with deep models is proposed to address the problem of spontaneous FER. As the variations in spontaneous-expression images are very large, a heterogeneous architecture is used to describe the variations of spontaneous expressions. In the heterogeneous model, CNNs based on VGG-Face [13] and ResNet [14] are employed to automatically extract the visual features. These two models are then learned jointly with a shared classification layer. To leverage the acted expressions, we utilize the acted data from the Cohn-Kanade database [4] to pre-train the deep model.

The rest of this paper is organized as follows. The proposed method is presented in Section II. Then we discuss the experimental results for algorithm evaluation in Section III. At last, Section IV describes our conclusions.



[Figure: a three-stage pipeline of (1) Face Alignment, (2) Heterogeneous Convolutional Networks, and (3) Automatic Recognition.]
Fig. 1. The flowchart of recognizing the spontaneous expressions.

II. PROPOSED METHOD

Fig. 1 illustrates the framework of our proposed deep recognition method. Given a still image, a feature vector $x = (x_1, x_2, \ldots, x_d)^T$ is automatically learned by the deep framework from the aligned image and then classified into one of six basic types $c \in \{1, 2, \ldots, 6\}$.

A. Face Alignment

Although the web images have been cropped by the Viola-Jones face detection algorithm, the face images still need to be aligned to reduce the influence of expression-unaware facial regions. In this paper, we use an effective method to align the facial images, as shown in Fig. 2.

[Figure: eye localization on the original image, followed by five-keypoint localization, produces the automatically aligned image.]
Fig. 2. The face alignment for expressions.

First, an accurate eye localization method [15] is used to detect the positions of the eyes for the active shape model (ASM). A conventional projection algorithm is employed to obtain candidate positions; then the eye positions and gray values are used as features, and the final positions are determined by an SVM model. The SVM model was trained on a face image dataset, namely the Yale B dataset [16]. Using this fast learning-based model, the initial points for the ASM can be obtained.

Second, the ASM is used to detect five keypoints (landmarks) in the facial region, i.e., the centers of the eyes, the tip of the nose, and the corners of the mouth. Based on the trained model used in [17], we select only a five-point ASM model for facial expressions in this context. With the two initial points, the ASM predicts accurate landmark positions for the facial images.

Finally, all images in the spontaneous dataset are aligned to the first image using the five landmarks. The alignment of the ith image is calculated by

  $z'_{ik} = \begin{pmatrix} s\cos\theta \, x_{ik} - s\sin\theta \, y_{ik} - t_{ix} \\ s\sin\theta \, x_{ik} + s\cos\theta \, y_{ik} - t_{iy} \end{pmatrix}, \qquad z_i = (x_{i1}, y_{i1}, \ldots, x_{ik}, y_{ik}, \ldots, x_{i5}, y_{i5})^T$   (1)

where $(x_{ik}, y_{ik})$ is the kth landmark of the ith image, $t_i = (t_{ix}, t_{iy})^T$ represents the translation of shape $z_i$ along the x and y axes, $s$ is the scale factor, and $\theta$ is the rotation angle, which can be calculated from the eye positions of the two images. From Fig. 2, it can be seen that the aligned image covers a smaller region than the original image. Before the convolution operations are performed, the raw images are aligned and cropped to 224 × 224 for feature extraction. A sketch of this transform is given below.
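For concreteness, the following is a minimal sketch of the similarity transform in Eq. (1), assuming NumPy and a 5 × 2 landmark array ordered as (left eye, right eye, nose tip, left mouth corner, right mouth corner); the function and variable names are illustrative, not taken from our implementation.

    import numpy as np

    def align_landmarks(landmarks, ref_left_eye, ref_right_eye):
        # Similarity transform of Eq. (1): rotate and scale the landmarks
        # (x_ik, y_ik) so the eye segment matches the reference image,
        # then translate by t_i. `landmarks` is a 5x2 array whose first
        # two rows are the left and right eye centers.
        d_src = landmarks[1] - landmarks[0]    # eye-to-eye vector, this image
        d_ref = ref_right_eye - ref_left_eye   # eye-to-eye vector, reference
        theta = np.arctan2(d_ref[1], d_ref[0]) - np.arctan2(d_src[1], d_src[0])
        s = np.linalg.norm(d_ref) / np.linalg.norm(d_src)

        # Scaled 2x2 rotation applied point-wise, as in Eq. (1).
        R = s * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
        rotated = landmarks @ R.T

        # Translation chosen so the left eye lands on its reference position.
        t = rotated[0] - ref_left_eye
        return rotated - t

The same rotation, scaling, and translation are then applied to the image pixels before cropping to 224 × 224.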

B. Heterogeneous Convolutional Networks

The heterogeneous convolutional networks (HCN) contain two heterogeneous branches: (1) VGG-Face based layers and (2) ResNet based layers. Fig. 3 shows the proposed architecture, which extracts the deep features automatically.

VGG-Face. The conventional VGG-Face model [13] is trained on roughly one million face images; thus, we build a new expression recognition model on this pre-trained model in this work. The VGG-Face model was originally designed for face recognition [13]. We use the pre-trained deep model and change its final layer for the expression recognition problem.

To adapt the VGG-Face model to our facial expression recognition problem, we fine-tune it on the constructed natural-expression dataset. Since facial expression recognition here has six classes (i.e., the six basic expressions), we change the output of the last fully-connected layer from 2622 to 6 (2622 is the number of subjects used for face recognition). After making these changes, we retrain the deep model by freezing certain layers of the CNN and learning the weights and biases of the unfrozen layers. To select which layers to learn, we repeat the experiment with different frozen layers, as sketched below.
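A minimal PyTorch-style sketch of this adaptation follows; the paper does not name an implementation, and torchvision's vgg16 merely stands in for the pre-trained VGG-Face weights of [13].

    import torch.nn as nn
    from torchvision import models

    NUM_EXPRESSIONS = 6  # the six basic expressions

    # Stand-in for the pre-trained VGG-Face network (a VGG-16 trained on
    # 2622 identities in [13]); here ImageNet weights are loaded instead.
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

    # Replace the final 2622-way identity layer with a 6-way expression head.
    in_features = model.classifier[-1].in_features
    model.classifier[-1] = nn.Linear(in_features, NUM_EXPRESSIONS)

    # Freeze the convolutional layers; which layers to leave unfrozen is
    # chosen empirically by repeating the experiment, as described above.
    for param in model.features.parameters():
        param.requires_grad = False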
ResNet. The standard ResNet model [14] is trained on the ImageNet data and used for image classification, so we likewise build a new expression recognition model on the pre-trained model in this work.

The standard ResNet mostly uses 3 × 3 filters in its convolutional layers.



[Figure: the input aligned face image (224 × 224) feeds two branches: a VGG-Face branch (convolutional layers with ReLU and pooling, global pooling, FC, probability output) and a ResNet branch (convolutional layers, global pooling, FC, probability output). An element-wise MAX over the two probability vectors gives the prediction over the six expressions: angry, disgust, fear, happy, sadness, surprise.]
Fig. 3. The architecture of heterogeneous convolutional networks.

Following the configuration of convolutional layers in [14], downsampling is performed directly by convolutional layers that have a stride of 2. At the end of the convolutional layers, a global average pooling layer is used to reduce the dimensionality of the features. In the standard model, a 1000-way fully-connected (FC) layer with softmax is employed to predict the classes, whereas we use a 6-way FC layer to recognize expressions. The total number of weighted layers is 50 in this context.

In the convolutional layers, shortcut connections are inserted between every two layers to obtain the residual representations. Because there is no downsampling within these two adjacent layers, the identity shortcuts can be employed directly and the identity mapping is performed. In the residual addition, an element-wise addition is performed on the two feature maps, channel by channel. A sketch of such a residual block and of the adapted 6-way head follows.
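The sketch below illustrates both points under the same assumptions as the previous snippet: a residual block with an identity shortcut and element-wise addition, and the 6-way head placed after global average pooling (torchvision's resnet50 stands in for the 50-layer model).

    import torch.nn as nn
    from torchvision import models

    class ResidualBlock(nn.Module):
        # Two 3x3 convolutions without downsampling, so the identity
        # shortcut can be added element-wise, channel by channel.
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + x)  # identity mapping + residual addition

    # 50 weighted layers; global average pooling is built in, so only the
    # 1000-way ImageNet classifier needs to become a 6-way expression head.
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    resnet.fc = nn.Linear(resnet.fc.in_features, 6)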
C. Classification and Learning

As shown in Fig. 3, a common classification layer with full connections is used to fuse the predictions of the two heterogeneous networks. The softmax function is employed to predict the probability of each expression for the two deep models. It is defined by

  $y_c = \frac{\exp(w_c^T v)}{\sum_{c'=1}^{C} \exp(w_{c'}^T v)}$   (2)

where $y_c$ is the predicted probability of the cth expression and $v$ denotes the output feature vector of the last pooling layer. A MAX operation is performed on the outputs of the two deep models, and the larger probability is used to predict the expression, as sketched below.
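A sketch of this fusion is given below, under the same PyTorch assumptions as the earlier snippets; it implements only the per-branch softmax of Eq. (2) followed by the element-wise MAX, with the two adapted networks passed in by the caller (the class and argument names are ours, not the paper's).

    import torch
    import torch.nn as nn

    class HCNFusion(nn.Module):
        # Fuses the two heterogeneous branches: each branch ends in its
        # own 6-way FC layer; Eq. (2) converts both outputs into
        # probabilities and the element-wise MAX keeps the larger one.
        def __init__(self, vgg_branch, resnet_branch):
            super().__init__()
            self.vgg_branch = vgg_branch        # adapted VGG-Face model
            self.resnet_branch = resnet_branch  # adapted ResNet-50 model

        def forward(self, x):
            p_vgg = torch.softmax(self.vgg_branch(x), dim=1)     # Eq. (2)
            p_res = torch.softmax(self.resnet_branch(x), dim=1)  # Eq. (2)
            return torch.maximum(p_vgg, p_res)  # element-wise MAX fusion

With the two snippets above, hcn = HCNFusion(model, resnet) maps a batch of aligned face images to six-way probability vectors.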
When learning the parameters of the deep models, we employ pre-trained models to avoid over-fitting before fine-tuning the deep models for the FER problem. The two deep models, i.e., the VGG-Face based CNN and the ResNet based CNN, are pre-trained on the Cohn-Kanade database [4]. Then we fine-tune the models on the spontaneous dataset. The log-loss function is used, and the parameters are learned by the stochastic gradient descent (SGD) algorithm.

In the SGD procedure for parameter learning, the momentum is set to 0.9 and the weight decay to 0.0005. The stopping criterion for the SGD iterations is set to $10^{-4}$. The learning rate is set to $10^{-3}$ at the beginning and is multiplied by a damping factor of 0.5 each time all mini-batches have been traversed and randomly re-allocated. A sketch of this schedule follows.
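The schedule above might be realized as follows; this is a sketch in which model is the fused HCN from the earlier snippet, and num_epochs and the data loader are placeholders rather than values from the paper.

    import torch
    import torch.nn.functional as F

    def train_hcn(model, train_loader, num_epochs=30):  # num_epochs is a placeholder
        # SGD with momentum 0.9 and weight decay 0.0005; the learning rate
        # starts at 1e-3 and is multiplied by the 0.5 damping factor once
        # all mini-batches have been traversed (i.e., per epoch).
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                    momentum=0.9, weight_decay=5e-4)
        scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)
        for _ in range(num_epochs):
            for images, labels in train_loader:  # reshuffled each epoch
                optimizer.zero_grad()
                probs = model(images)            # fused probability vectors
                loss = F.nll_loss(torch.log(probs + 1e-8), labels)  # log-loss
                loss.backward()
                optimizer.step()
            scheduler.step()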
III. EXPERIMENTS

A. Experimental Setup

In our experiments, the spontaneous facial expression database [12] with 1648 images is employed to evaluate the performance. All images are classified into six basic categories (i.e., angry, disgust, fear, happy, sadness, surprise), with about 300 images available for each kind of expression. In this dataset, the spontaneous expressions are obtained from different persons across diverse cultures and countries. The images are of different resolutions, and most of them are frontal images.

To assess the effectiveness of our proposed deep model on the spontaneous expression database, we conduct experiments evaluating the performance of the proposed model and comparing the results against those of traditional shallow models and deep models. Since Gabor features and LBP features [18] have been used for the FER problem, these two approaches serve as the shallow models. The conventional VGG-Face model and ResNet model are used as the deep models.

Recognition accuracy is used to evaluate all approaches on the spontaneous dataset. The entire dataset is randomly split into a training set and a test set: in each expression category, the spontaneous images are assigned to the training set and test set with a fixed ratio of 90% : 10%, as sketched below.
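A per-class split of this kind can be sketched as follows, assuming a list of (image_path, label) pairs; the function name and the seed are illustrative.

    import random
    from collections import defaultdict

    def split_per_class(samples, train_ratio=0.9, seed=0):
        # Within each of the six expression categories, assign 90% of the
        # images to the training set and 10% to the test set at random.
        rng = random.Random(seed)
        by_label = defaultdict(list)
        for path, label in samples:
            by_label[label].append((path, label))
        train, test = [], []
        for items in by_label.values():
            rng.shuffle(items)
            cut = int(len(items) * train_ratio)
            train.extend(items[:cut])
            test.extend(items[cut:])
        return train, test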



Table 1. The recognition accuracy of all approaches for comparison on the spontaneous-expression dataset.

                                           Spontaneous Expression
Approach                     Anger   Disgust  Fear    Happy   Sadness  Surprise  Average
Gabor+KNN                    0.567   0.408    0.300   0.667   0.533    0.308     0.464
LBP+SVM                      0.700   0.400    0.575   0.733   0.633    0.508     0.593
AU+KNN                       0.721   0.440    0.564   0.701   0.633    0.617     0.613
DCNN [12]                    0.672   0.754    0.755   0.983   0.908    0.936     0.834
ResNet [14]                  0.654   0.552    0.738   0.750   0.892    0.823     0.788
Our proposed method (HCN)    0.723   0.792    0.762   0.971   0.935    0.942     0.874

B. Experimental Results

The recognition rates obtained using the different methods are shown in Table 1. As one might expect, the performance of the traditional LBP and Gabor features is much lower than that of the features learned with the deep models. This confirms the relative sensitivity of traditional LBP and Gabor features to diverse expressions. In most configurations the deep methods outperform the shallow models, while in some cases the shallow models are competitive with the deep models. That might be due to the insufficient number of samples for the large variations in these expressions.

The conventional CNNs, i.e., the VGG-Face based DCNN [12] and ResNet [14], are compared with our proposed method. As shown in Table 1, our proposed method outperforms the other two deep models. It is worth noting that ResNet cannot outperform the VGG-Face based model in most configurations of the FER problem. Although ResNet achieves better performance than VGG-series models for image classification, the smaller number of samples in the FER problem, compared to the ImageNet dataset, still limits its learning ability. Through the combination of the two heterogeneous networks, the performance of recognizing spontaneous expressions can be improved.

IV. CONCLUSION

A heterogeneous deep architecture is developed for spontaneous FER. In the heterogeneous model, CNNs based on VGG-Face and ResNet are employed to automatically extract the visual features. These two models are then learned jointly with a shared classification layer. To leverage the acted expressions, we utilize the acted data to pre-train the deep model. Experimental results demonstrated the effectiveness of our proposed approach, which outperformed state-of-the-art methods.

REFERENCES

[1] S. Z. Li and A. K. Jain, Handbook of Face Recognition, Springer Publishing Company, Incorporated, 2nd edition, 2011.
[2] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, "Web-based database for facial expression analysis," in IEEE International Conference on Multimedia and Expo (ICME), 2005, pp. 1-5.
[3] M. Kamachi, M. Lyons, and J. Gyoba, "The Japanese Female Facial Expression (JAFFE) database," 1998, http://www.kasrl.org/jaffe.html.
[4] T. Kanade, J. F. Cohn, and Y. Tian, "Comprehensive database for facial expression analysis," in Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2000, pp. 46-53.
[5] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, "Coding facial expressions with Gabor wavelets," in IEEE International Conference on Automatic Face and Gesture Recognition (FG), 1998, pp. 200-205.
[6] P. N. Belhumeur, J. P. Hespanha, and D. Kriegman, "Eigenfaces vs. fisherfaces: recognition using class specific linear projection," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 19, no. 7, pp. 711-720, 1997.
[7] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression (PIE) database," in IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2002, pp. 46-51.
[8] A. M. Martinez, "The AR face database," CVC Technical Report, vol. 24, 1998.
[9] C. Shan, S. Gong, and P. W. McOwan, "Facial expression recognition based on local binary patterns: A comprehensive study," Image & Vision Computing, vol. 27, no. 6, pp. 803-816, 2009.
[10] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen, "Deeply learning deformable facial action parts model for dynamic expression analysis," in Asian Conference on Computer Vision (ACCV), 2014, pp. 143-157.
[11] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, "Joint fine-tuning in deep neural networks for facial expression recognition," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2983-2991.
[12] X. Peng, Z. Xia, L. Li, and X. Feng, "Towards facial expression recognition in the wild: A new database and deep recognition system," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2016, pp. 1544-1550.
[13] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference (BMVC), 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[15] Z. Xia, W. Zhang, F. Tan, X. Feng, and A. Hadid, "An accurate eye localization approach for smart embedded system," in International Conference on Image Processing Theory (IPTA), 2016, pp. 1-5.
[16] K. C. Lee, J. Ho, and D. J. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 27, no. 5, pp. 684-698, 2005.
[17] Z. Xia, X. Feng, J. Peng, X. Peng, and G. Zhao, "Spontaneous micro-expression spotting via geometric deformation modeling," Computer Vision & Image Understanding, vol. 147, no. C, pp. 87-94, 2016.
[18] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.

