
Facial Emotion Recognition in Real Time

Dan Duncan Gautam Shine Chris English
duncand@stanford.edu gshine@stanford.edu chriseng@stanford.edu

Abstract

We have developed a convolutional neural network for classifying human emotions from dynamic facial expressions in real time. We use transfer learning on the fully-connected layers of an existing convolutional neural network which was pretrained for human emotion classification. A variety of datasets, as well as our own unique image dataset, is used to train the model. An overall training accuracy of 90.7% and test accuracy of 57.1% is achieved. Finally, a live video stream connected to a face detector feeds images to the neural network. The network subsequently classifies an arbitrary number of faces per image simultaneously in real time, wherein appropriate emojis are superimposed over the subjects' faces (https://youtu.be/MDHtzOdnSgA). The results demonstrate the feasibility of implementing neural networks in real time to detect human emotion.

Figure 1. a) Input image to the convolutional neural network. b) Output of the application, with emojis superimposed over the subjects' faces to indicate the detected emotion.
1. Introduction

Emotions often mediate and facilitate interactions among human beings. Thus, understanding emotion often brings context to seemingly bizarre and/or complex social communication. Emotion can be recognized through a variety of means such as voice intonation, body language, and more complex methods such as electroencephalography (EEG) [1]. However, the easier, more practical method is to examine facial expressions. There are seven types of human emotions shown to be universally recognizable across different cultures [2]: anger, disgust, fear, happiness, sadness, surprise, and contempt. Interestingly, even for complex expressions where a mixture of emotions could be used as descriptors, cross-cultural agreement is still observed [3]. Therefore a utility that detects emotion from facial expressions would be widely applicable. Such an advancement could bring applications in medicine, marketing, and entertainment [4].

The task of emotion recognition is particularly difficult for two reasons: 1) there does not exist a large database of training images, and 2) classifying emotion can be difficult depending on whether the input image is static or a transition frame into a facial expression. The latter issue is particularly difficult for real-time detection, where facial expressions vary dynamically.

Most applications of emotion recognition examine static images of facial expressions. We investigate the application of convolutional neural networks (CNNs) to emotion recognition in real time with a video input stream. Given the computational requirements and complexity of a CNN, optimizing a network for efficient computation for frame-by-frame classification is necessary. In addition, accounting for variations in lighting and subject position in a non-laboratory environment is challenging. We have developed a system for detecting human emotions in different scenes, angles, and lighting conditions in real time. The result is a novel application where an emotion-indicating emoji is superimposed over the subjects' faces, as shown in Figure 1.

2. Background/Related Work

Over the last two decades, researchers have significantly advanced human facial emotion recognition with computer vision techniques. Historically, there have been many approaches to this problem, including using pyramid histograms of gradients (PHOG) [5], AU aware facial features [6], boosted LBP descriptors [7], and RNNs [8].

However, recent top submissions [9], [10] to the 2015 Emotions in the Wild (EmotiW 2015) contest for static images all used deep convolutional neural networks (CNNs), generating up to 62% test accuracy.

A recent development by G. Levi et al. [11] showed significant improvement in facial emotion recognition using a CNN. The authors addressed two salient problems: 1) a small amount of data available for training deep CNNs, and 2) appearance variation, usually caused by variations in illumination. They used Local Binary Patterns (LBP) to transform the images to an illumination-invariant, 3D space that could serve as an input to a CNN. This special data pre-processing was applied to various publicly available models such as VGG S [12]. The model was then re-trained on the large CASIA WebFace dataset [13] and transfer-learned on the Static Facial Expressions in the Wild (SFEW) dataset, which is a smaller database of labeled facial emotions released for the EmotiW 2015 challenge [14]. Final results showed a test accuracy of up to 54.56%, an improvement of 15% over baseline scores. Since this modified VGG S neural network is pre-trained for facial recognition and freely available, we chose to use VGG S as a starting point in developing our own model.

A notable implementation of a CNN for real-time detection of emotions from facial expressions is by S. Ouellet [15]. The author implemented a game, where a CNN was applied to an input video stream to capture the subject's facial expressions, acting as a control for the game. This work demonstrated the feasibility of implementing a CNN in real time by using a running average of the detected emotions from the input stream, reducing the effects of variation and noise.

Figure 2. Visualization of the filters used in the first convolutional layer of VGG S, revealing the different kernels optimized for feature detection. See section 2 for discussion.

3. Approach

To develop a working model, we use two different freely-available datasets: the extended Cohn-Kanade dataset (CK+) [16, 17] and the Japanese Female Facial Expression (JAFFE) database [18]. The CK+ dataset provides well-defined facial expressions in a controlled laboratory environment, while the JAFFE database provides additional images with more subtle facial expressions, also recorded under laboratory conditions. We also uniquely developed our own new (home-brewed) database that consists of images from five individuals, with many images recorded for each of the seven primary emotions (anger, disgust, neutral, fear, happy, sad, surprise) from each subject. We pre-processed the data by subtracting the mean image (Figure 3b) and subsequently applied jitter to these images to account for variations in lighting and subject position in the final implementation.

Figure 3. a) Example image from the JAFFE dataset that was tested on VGG S. The image was labeled as 'happy' and predicted as 'angry'. See section 3 for discussion. b) Mean image of the JAFFE dataset.
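The exact pre-processing code used in this project is not reproduced in the paper; the sketch below illustrates the two steps just described (mean-image subtraction and jitter) under stated assumptions: the images are already cropped to faces, the network input size is taken to be 224x224, and the function names are ours.

import cv2
import numpy as np

def mean_image(images):
    # images: list of HxWx3 uint8 face crops, all of the same size
    return np.mean(np.stack(images).astype(np.float32), axis=0)

def jitter(image, crop_frac=0.10, brightness_frac=0.20, rng=np.random):
    # Randomly shift the crop window and scale the brightness, using the
    # 10% / 20% variations quoted in Section 4.
    h, w = image.shape[:2]
    dy = int(rng.uniform(-crop_frac, crop_frac) * h)
    dx = int(rng.uniform(-crop_frac, crop_frac) * w)
    M = np.float32([[1, 0, dx], [0, 1, dy]])
    shifted = cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REFLECT)
    scale = 1.0 + rng.uniform(-brightness_frac, brightness_frac)
    return np.clip(shifted.astype(np.float32) * scale, 0, 255).astype(np.uint8)

def preprocess(face, mean, size=(224, 224)):
    # Resize and subtract the dataset mean image before the forward pass.
    face = cv2.resize(face, size).astype(np.float32)
    return face - cv2.resize(mean, size)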

Initially, we directly implemented the VGG S network from ref. [11] for image classification. Figure 2 visualizes the first convolutional layer of VGG S, revealing the different kernels optimized for feature detection. The resulting confusion matrices for the JAFFE and CK+ datasets are shown in Figures 4 and 5, respectively. We were unable to obtain similar results to ref. [11]: at best we could obtain a test accuracy of 24% on the CK+ dataset and 14% on the JAFFE dataset. It is not clear why the accuracy was lower than that reported in ref. [11]; the authors may have included some data pre-processing that they neglected to describe in their paper. Many different facial expressions were incorrectly classified as 'fear' by VGG S for both datasets, and there were incorrect classifications even for images that appeared to clearly convey a particular emotion, such as that in Figure 3a, which was labeled as 'happy' but predicted as 'angry'. One issue is that the facial expressions in the JAFFE dataset are quite subtle, which makes it harder to differentiate between emotions. Another issue is that there are few images labeled with 'fear' and 'disgust' in both the JAFFE and CK+ datasets, making it difficult to train the network to recognize these two emotions correctly.

Figure 4. Confusion matrix for the unaltered VGG S net on the JAFFE dataset, with a 14% accuracy.

Figure 5. Confusion matrix for the unaltered VGG S net on the CK+ dataset, with a 24% accuracy.

To improve the classification accuracy of VGG S, we applied transfer learning to the network on the JAFFE and CK+ datasets, as well as our own dataset, which significantly improved test and training accuracy. Due to time constraints, we only trained the last few fully-connected layers of the network (fc6, fc7, fc8); the convolutional layers were left unchanged. The architecture of VGG S (see Figure 6) consists of five convolutional layers and three fully-connected layers, followed by a softmax classifier, for a total of approximately 7x10^6 convolutional parameters and 7x10^7 fully-connected parameters.
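The project's training code is framework-specific and not shown here; as a generic illustration of the fc-only transfer learning described above, the following PyTorch-style sketch (PyTorch is not the framework used in this work) freezes every parameter outside the assumed layer names fc6, fc7, and fc8 and optimizes only the remaining ones.

import torch

def freeze_all_but_fc(model, trainable=("fc6", "fc7", "fc8"), lr=1e-3):
    # Keep only the listed fully-connected layers trainable; the layer
    # names and learning rate are illustrative assumptions.
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(t) for t in trainable)
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(params, lr=lr, momentum=0.9)

# Hypothetical usage, assuming `model` exposes parameters named fc6.*, fc7.*, fc8.*:
#   optimizer = freeze_all_but_fc(model)
#   loss = torch.nn.functional.cross_entropy(model(images), labels)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()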

4. Experiment

For our own 'home-brewed' dataset, we recorded images from five different individuals. Multiple images were recorded for each of the seven primary emotions (anger, disgust, neutral, fear, happy, sad, surprise) from each subject, resulting in a total of 2118 labeled images. In the final implementation we omitted the emotions 'disgust' and 'contempt', since they are very difficult to classify. We also applied a Haar-Cascade filter, provided by OpenCV, to crop the input images to the subjects' faces (see Figure 6).

In any implementation of a CNN in a real environment, effects that are usually omitted in a laboratory setting must be accounted for. These include variations in lighting, distance from the camera, incorrect face cropping, and variations in the orientation of the subject's face. To account for some of these issues, we implemented randomized jitter in the home-brewed dataset. The jitter was applied by randomly changing both the cropping (10% variation) and the brightness (20% variation) of the input images, as demonstrated in Figure 7. By re-examining the same images with different variations in cropping and lighting, VGG S could learn to account for these effects.

Figure 6. Architecture of the convolutional neural network used in this project. Only the fc6-fc8 layers were adjusted in training. The input image and the cropped faces produced by the Haar-Cascade detector are also shown.

Figure 7. Effect of jitter applied to the input image. Each image results from changing the offset of the face cropping and the brightness.

The loss versus training iteration is shown in Figure 8. A small batch size was required for training due to lengthy computation times and is the cause of the 'jagged' profile of the curve. Convergence to low loss appears to occur after roughly 100 iterations.

Figure 8. Training loss versus iteration for the VGG S convolutional neural network.

Once a newly trained version of VGG S was obtained, we connected a video stream to the network using a standard webcam. The resulting classification from the network chooses a particular emoji to superimpose over each subject's face. The run-time for image cropping using the face detector was 150 ms, and that for a forward pass through VGG S was 200 ms. Note that for any number N of subjects in the camera's view, the run-time for a single frame increases to 150 ms + N x 200 ms to account for all persons in the image. These operations limited the frame-rate of our emotion-recognition algorithm to 2.5 frames/second, which is sufficient for a real-time demonstration. Our network was developed on a laptop with insufficient GPU capabilities (for VGG S) to utilize CUDNN, a GPU-accelerated library provided by Nvidia for deep neural networks. We tried to implement VGG S on the Jetson TK1, but found that the run-time memory requirements (> 2 GB) were too large for the board to process without crashing. Other machines with optimized graphics capabilities, such as the Jetson TX1, could implement our framework.

It is interesting to examine the responses of the second fully-connected layer's neurons to different facial expressions. Figure 9 shows the output of the second fully-connected layer (fc7) for two different input images displaying the emotion 'happy'. We use the cosine similarity, defined as s = A·B/(|A||B|), to measure the similarity between two outputs A and B from this layer. The similarity between the two 'happy' outputs is quite high, with an average of s = 0.94, regardless of the individual in the input image. However, if one examines the cosine similarity for the same subjects but different emotions (see Figure 10), the resulting similarity is only s = 0.04. This large difference demonstrates that the layers in the network have "learned" to differentiate emotions, highlighting the salient features of the corresponding emotion necessary for classification.
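As a concrete illustration of the comparison above, the cosine similarity s = A·B/(|A||B|) between two fc7 activation vectors can be computed directly with NumPy; the activation vectors named below are hypothetical and would come from forward passes of the network.

import numpy as np

def cosine_similarity(a, b):
    # s = A.B / (|A| |B|), a value in [-1, 1]
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example (hypothetical fc7 vectors):
#   cosine_similarity(fc7_happy_subject1, fc7_happy_subject2)     -> ~0.94 in our experiments
#   cosine_similarity(fc7_happy_subject1, fc7_surprise_subject1)  -> ~0.04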

Figure 9. (a-c): Input image, output of the second fully-connected layer, and cosine similarity for the 'happy' emotion. (d-f): The same sequence of figures for a different subject showing the same emotion.

Figure 10. (a-c): Input image, output of the second fully-connected layer, and cosine similarity for the 'surprise' emotion. (d-f): The same sequence of figures for the same subject showing the 'happy' emotion.

Figures 11-13 show the confusion matrices for our custom VGG S network trained on the JAFFE, CK+, and home-brewed datasets, respectively. The worst performance is obtained on the JAFFE dataset, with a training accuracy of 62%. Here, the differences between the emotions expressed by the subjects are very subtle, particularly between 'disgust' and 'fear'. Also, these images are in grey-scale format, whereas our VGG S network is optimized for RGB images; the lack of RGB information can make it harder for the network to distinguish between important features and background elements.

The CK+ dataset shows much improved performance (90.7% training accuracy), with some incorrect classifications for the angry, disgust, neutral, and sad emotions due to the low number of available labeled images. Figure 13 shows the confusion matrix for our home-brewed dataset, which consisted primarily of exaggerated facial expressions; VGG S excelled at classifying this dataset, with an overall training accuracy as high as 90.9%, and we obtained 57% test accuracy on the remaining images in our own dataset. This dataset was effective for the implementation of our demonstration. Note that, with the pre-trained 'off-the-shelf' VGG S network from ref. [11], we could only obtain training accuracies of 14% and 24% on the JAFFE and CK+ datasets, respectively.

Figure 11. Confusion matrix during training for the custom VGG S network on the JAFFE dataset.

Figure 12. Confusion matrix during training for the custom VGG S network on the CK+ dataset.

Figure 13. Confusion matrix during training for the custom VGG S network on our home-brewed dataset.

5. Conclusions

The goal of this project was to implement real-time facial emotion recognition. Using our custom-trained VGG S network with a face detector provided by OpenCV, we successfully implemented an application wherein an emoji indicating one of six expressions (anger, fear, happy, neutral, sad, surprise) is superimposed over a user's face in real time (see https://youtu.be/MDHtzOdnSgA for a demonstration). Some of the results are shown in Figure 14. Our project was implemented under Python 2.7, and the source code can be downloaded from a github repository [19].

Figure 14. Samples of the video demonstration. The emotion detected by the CNN is indicated by the type of emoji superimposed over the subject's face.
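The application described above amounts to a capture, detect, classify, and overlay loop. The sketch below is a simplified, illustrative version of such a loop using OpenCV's Haar-Cascade face detector; classify_face and overlay_emoji are hypothetical placeholders for the VGG S forward pass and the emoji compositing, and the cascade file path may need to be adjusted for a given OpenCV installation.

import cv2

def run(classify_face, overlay_emoji,
        cascade_path="haarcascade_frontalface_default.xml"):
    # classify_face(face_img) -> emotion label (network forward pass, not shown)
    # overlay_emoji(frame, box, label) -> draws the emoji over the face (not shown)
    detector = cv2.CascadeClassifier(cascade_path)
    cap = cv2.VideoCapture(0)  # standard webcam
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Face detection took ~150 ms in our tests; each forward pass adds ~200 ms per face.
        faces = detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
        for (x, y, w, h) in faces:
            label = classify_face(frame[y:y + h, x:x + w])
            overlay_emoji(frame, (x, y, w, h), label)
        cv2.imshow("emotion", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()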

While we achieved a successful implementation, significant improvements can be made by addressing several key issues. First, while we achieved > 90% accuracy under laboratory conditions (perfect lighting, camera at eye level, subject facing the camera with an exaggerated expression), any deviation from those conditions caused the accuracy to fall significantly. For example, any shadow on a subject's face would cause an incorrect classification of 'angry'. Also, the webcam needs to be level with the subjects' faces for accurate classification. Regardless, our implementation appeared to classify a subject's emotion reliably. Augmenting our home-brewed dataset to include off-center faces could have addressed this problem, and a much larger dataset should be designed to improve the model's generality. Heavier pre-processing of the data would certainly have improved test-time accuracy; for example, adjusting the brightness to the same level on all the images might have removed the requirement for providing jittered input images.

In addition, a particularly difficult aspect of real-time recognition is deciding how to classify transition frames from neutral to a fully formed expression of emotion. One viable solution is to use a running average of the top classes reported by each frame, which would ameliorate the problem of noise and errors caused by dynamic expressions [15]. Unfortunately, the relatively slow frame-rate of our demonstration made this solution untenable, since VGG S is relatively slow; a running average would require future implementations that run at higher frame-rates. Also, fully training a network other than VGG S might yield substantial improvements in computation speed.
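The running-average strategy discussed above can be implemented by keeping the last few per-frame predictions and reporting the most common one; a minimal sketch follows, with the window length chosen arbitrarily.

from collections import Counter, deque

class EmotionSmoother:
    # Report the most frequent emotion over the last `window` frames to
    # suppress noise from transition frames and single-frame errors.
    def __init__(self, window=10):
        self.history = deque(maxlen=window)

    def update(self, label):
        self.history.append(label)
        return Counter(self.history).most_common(1)[0][0]

# Usage: smoother = EmotionSmoother(); stable_label = smoother.update(frame_prediction)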

References

[1] P. Abhang, S. Rao, B. W. Gawali, and P. Rokade, "Article: Emotion recognition using speech and EEG signal - a review," International Journal of Computer Applications, vol. 15, pp. 37-40, February 2011.

[2] P. Ekman and W. V. Friesen, "Universals and cultural differences in facial expressions of emotion," Lincoln, Nebraska, USA: University of Nebraska Press, 1971.

[3] P. Ekman et al., "Universals and cultural differences in the judgements of facial expressions of emotion," Journal of Personality and Social Psychology, vol. 53, 1987.

[4] A. Kołakowska, A. Landowska, M. Szwoch, W. Szwoch, and M. R. Wróbel, "Emotion recognition and its applications," in Human-Computer Systems Interaction: Backgrounds and Applications 3, pp. 51-62, Cham: Springer International Publishing, 2014.

[5] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 2106-2112, Nov 2011.

[6] A. Yao, J. Shao, N. Ma, and Y. Chen, "Capturing au-aware facial features and their latent relations for emotion recognition in the wild," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI '15, (New York, NY, USA), pp. 451-458, ACM, 2015.

[7] C. Shan, S. Gong, and P. W. McOwan, "Facial expression recognition based on local binary patterns: A comprehensive study," Image and Vision Computing, vol. 27, no. 6, pp. 803-816, 2009.

[8] S. E. Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, "Recurrent neural networks for emotion recognition in video," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI '15, (New York, NY, USA), pp. 467-474, ACM, 2015.

[9] Z. Yu and C. Zhang, "Image based static facial expression recognition with multiple deep network learning," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI '15, (New York, NY, USA), pp. 435-442, ACM, 2015.

[10] B. Kim, J. Roh, S. Dong, and S. Lee, "Hierarchical committee of deep convolutional neural networks for robust facial expression recognition," Journal on Multimodal User Interfaces, pp. 1-17, 2016.

[11] G. Levi and T. Hassner, "Emotion recognition in the wild via convolutional neural networks and mapped binary patterns," in Proc. ACM International Conference on Multimodal Interaction (ICMI), November 2015.

[12] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," in British Machine Vision Conference, 2014.

[13] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," CoRR, vol. abs/1411.7923, 2014.

[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.

[15] S. Ouellet, "Real-time emotion recognition for gaming using deep convolutional network features," CoRR, vol. abs/1408.3750, 2014.

[16] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pp. 94-101, June 2010.

[17] T. Kanade, J. F. Cohn, and Y. Tian, "Comprehensive database for facial expression analysis," in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on, pp. 46-53, 2000.

[18] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, "Coding facial expressions with gabor wavelets," in Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, pp. 200-205, Apr 1998.

[19] https://github.com/GautamShine/emotion-conv-net