
Artificial Intelligence

Final Project Report

CNN-Based Facial Detection and Expression Recognition
and its Deployment to Real-Time Images and Videos

Under the guidance of Dr. Annapurna Jonnalagadda

Presented By:
Sudesha Basu Majumder (19BEE0228), Parichay Singh (19BEE0229), Sreyan Ghosh (19BEE0232)

1. Problem Statement

Although humans find it easy to read emotions from facial expressions,
performing the same feat with a computer program is considerably harder.
Recent advances in computer vision and machine learning have made it
possible to discern emotions from images. In this report, we propose a
facial emotion recognition methodology based on convolutional neural
networks.
2. Introduction

An automatic facial expression recognition (FER) system has many applications,
including, but not limited to, understanding human behavior, detecting
mental disorders, and synthesizing human expressions. The two approaches
used most widely in the literature for automatic FER are geometry-based and
appearance-based. Although many studies work with static images, research
continues to develop new methods that are computationally cheaper and use
less memory than their predecessors.

Emotion recognition is already used in society in many ways. Affectiva, a
company that emerged from the Massachusetts Institute of Technology, provides
artificial intelligence software that automates tasks previously performed
manually by humans, chiefly collecting facial and vocal expression data in
contexts where the audience has agreed to share this information. For example,
instead of filling out a lengthy questionnaire about how you feel at each
moment of an educational video or advertisement, you can agree to let a camera
watch your face and listen to your voice, noting which parts of the experience
elicit expressions of boredom, interest, confusion, or smiling. (Note that such
a system does not read your inner feelings; it only reads what you express
externally.)

Other uses of Affectiva's technology include helping children with autism,
helping blind people read facial expressions, helping robots interact more
intelligently with people, and monitoring signs of driver attention to improve
driver safety.

A patent filed by Snapchat in 2015 describes a method for extracting data
about crowds at public events by algorithmically recognizing the emotions in
users' geotagged selfies.

Emotient is a start-up company that applies emotion recognition to reading
frowns, smiles, and other facial expressions, in particular using artificial
intelligence to predict "attitudes and behaviors based on facial expressions."
Apple acquired Emotient in 2016 and uses its emotion recognition technology to
enhance the emotional intelligence of its products.
nViso provides real-time emotion recognition for web and mobile applications
through real-time APIs. Visage Technologies AB provides sentiment estimates as
part of its Visage SDK for marketing, scientific research, and similar
purposes.

3. Literature Survey

Despite the notable success of traditional facial recognition methods based on
the extraction of handcrafted features, over the past decade researchers have
turned to deep learning because of its high capacity for automatic recognition.
In this context, we present some recent FER studies that propose deep learning
methods to obtain better detection, trained and tested on several static or
sequential databases.

Some of the gaps identified are:


Given an image, the system should be able to predict the expression and return
the result immediately, so there is a low-latency requirement. Our model
achieves this, but not to the level we expected. Moreover, interpretability is
important for still images but less so in real time.

For still images, the probabilities of the predicted expressions could also be
reported; this feature has not been included in our project. Our goal was to
predict the expression of a face in an image as accurately as possible: the
higher the test accuracy, the better the model will perform in the real world.

With the advent of intelligent conversational agents that can understand
human speech, we have stepped into a new era of computational intelligence.
Yet this technology is still in its early stages, with only about 60%-70% of
human speech recognized correctly. One primary reason, pondered by poets and
storytellers alike for centuries, is that "what a man says is not what he
means." Words in a sentence carry multiple connotations and meanings, and a
mathematical algorithm cannot disambiguate them without visual cues. Elements
of speech such as irony, sarcasm, and rhetoric can be understood by
artificially intelligent systems only if visual cues are recognized. In this
project, we have built and trained a convolutional neural network (CNN) from
scratch to recognize facial expressions. This will help such systems
understand what people actually "mean" when they "say" something.
4. Implementation

Data

The data consists of 48x48 pixel grayscale images of faces. The faces have
been automatically registered so that the face is more or less centered and
occupies about the same amount of space in each image. The task is to
categorize each face based on the emotion shown in the facial expression into
one of seven categories (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad,
5=Surprise, 6=Neutral).

“train.csv” contains two columns, "emotion" and "pixels". The "emotion"
column contains a numeric code from 0 to 6, inclusive, for the emotion present
in the image. The "pixels" column contains a quoted string for each image,
whose contents are space-separated pixel values in row-major order. “test.csv”
contains only the "pixels" column, and the task is to predict the emotion
column. The training set consists of 28,709 examples.
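As a concrete illustration, the snippet below loads and reshapes the data as
described above. It is a minimal sketch using pandas and NumPy; the file name
"train.csv" and the column names come from the dataset description, while the
use of pandas, the [0, 1] scaling, and the one-hot encoding are our own
preprocessing assumptions.

import numpy as np
import pandas as pd

# Load the training data: each row holds an "emotion" label (0-6) and a
# "pixels" string of 2304 space-separated grayscale values for a 48x48 face.
df = pd.read_csv("train.csv")

# Parse each pixel string into a 48x48x1 array and scale it to [0, 1].
X = np.stack([np.array(s.split(), dtype=np.float32) for s in df["pixels"]])
X = X.reshape(-1, 48, 48, 1) / 255.0

# One-hot encode the seven emotion classes for categorical cross-entropy.
y = pd.get_dummies(df["emotion"]).to_numpy(dtype=np.float32)

print(X.shape, y.shape)  # expected: (28709, 48, 48, 1) (28709, 7)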

Fig 1.
Model
The model is built with TensorFlow 2.0 and Keras. It is a convolutional neural
network (CNN) consisting of four Conv2D layers, each followed by a
MaxPooling2D layer, and two Dense layers at the end that output the
probability distribution over the seven classes. The activation function in
the Conv2D layers is ReLU, and the last Dense layer uses softmax. We treat the
padding of the convolutions as a hyperparameter and check whether the model
performs better with or without it.
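A minimal sketch of such a model in TensorFlow 2.0/Keras is shown below. The
layer types, activations, and the padding hyperparameter follow the
description above; the specific filter counts, the width of the first Dense
layer, and the choice of the Adam optimizer are illustrative assumptions
rather than the exact configuration shown in Fig 2.

from tensorflow.keras import layers, models

def build_model(padding="same"):
    # Four Conv2D + MaxPooling2D blocks followed by two Dense layers.
    # `padding` is treated as a hyperparameter ("same" vs. "valid").
    model = models.Sequential([
        layers.Conv2D(64, (3, 3), padding=padding, activation="relu",
                      input_shape=(48, 48, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), padding=padding, activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(256, (3, 3), padding=padding, activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(512, (3, 3), padding=padding, activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(7, activation="softmax"),  # distribution over 7 emotions
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model(padding="same")
model.summary()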

Fig 2.
Algorithm

We apply two separate techniques to obtain the final real-time emotion
recognition. A Convolutional Neural Network (CNN) extracts features from the
image data and classifies the emotion. The training data consists only of
cropped face images, but in real-world scenarios we may be given an image of
an entire person, so the face must first be extracted. For this we apply a
second technique, Haar cascade classification. To detect the face and
construct a bounding box around it, we use the tools provided by the OpenCV
framework, which draw a rectangular box around each face found by the cascade
classifier. Once a face is detected, we run the CNN model on the cropped
region and obtain the emotion being portrayed by the face.
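The sketch below shows how these two stages can be combined using OpenCV's
bundled frontal-face Haar cascade and a trained Keras model. The file name
"model.h5", the detector parameters, and the drawing details are illustrative
assumptions; only the overall pipeline (Haar cascade detection followed by CNN
classification of the 48x48 grayscale crop) comes from the description above.

import cv2
import numpy as np
from tensorflow.keras.models import load_model

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

# Haar cascade face detector shipped with OpenCV; the trained CNN is assumed
# to have been saved as "model.h5" (the filename is illustrative).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
model = load_model("model.h5")

def predict_emotions(image_bgr):
    # Detect faces with the Haar cascade, then classify each crop with the CNN.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        roi = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
        probs = model.predict(roi.reshape(1, 48, 48, 1), verbose=0)[0]
        label = EMOTIONS[int(np.argmax(probs))]
        # Draw the bounding box and predicted emotion on the original image.
        cv2.rectangle(image_bgr, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(image_bgr, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
        results.append(((x, y, w, h), label))
    return results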

We now have a working prediction model. To put it to real-life use, we can
apply it to detect emotions in pictures and videos. For pictures the process
is the same as above, whereas for videos it is tweaked slightly: a video is
simply a sequence of images, so each frame can be extracted as an image and
run through the model to obtain the emotion being portrayed. After doing this
for every frame, we stitch the annotated frames back together and obtain
real-time predictions on the video.
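A minimal sketch of this frame-by-frame pipeline, reusing the
predict_emotions() helper from the previous sketch, might look as follows; the
codec choice and output handling are illustrative assumptions.

import cv2

def annotate_video(in_path, out_path):
    # Read a video frame by frame, run face detection + emotion prediction on
    # each frame, and write the annotated frames back out as a new video.
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        predict_emotions(frame)   # defined earlier; draws boxes/labels in place
        writer.write(frame)       # "stitch" the annotated frames back together
    cap.release()
    writer.release()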

Workflow

● Generating training, validation, and testing batches.


● Creating the Convolutional Neural Network (CNN) model.
● Training and Evaluating the model.
● Creating a Flask app to serve predictions (see the sketch after this list).
● Using the model to recognize facial expressions in videos.
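
A minimal sketch of such a Flask app, again reusing the predict_emotions()
helper defined earlier, is shown below; the route name, the request field
"image", and the JSON response format are illustrative assumptions.

import cv2
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Decode the uploaded image and reuse predict_emotions() from above.
    data = np.frombuffer(request.files["image"].read(), dtype=np.uint8)
    image = cv2.imdecode(data, cv2.IMREAD_COLOR)
    results = predict_emotions(image)
    return jsonify([{"box": [int(v) for v in box], "emotion": label}
                    for box, label in results])

if __name__ == "__main__":
    app.run(debug=True)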

5. Future Work

Building on the groundwork laid by this project, we can, in the future, aim
towards:
● Making the prediction model more robust and lower-latency.
● Increasing the amount of data fed into the model and incorporating deeper
neural networks to make more accurate predictions.
● Deploying the model to edge devices or on the web so that it can be used easily.
Fig 3.

6. References

● Pourmirzaei, M., Esmaili, F. and Montazer, G.A., 2021. Using Self-Supervised Co-Training to
Improve Facial Representation. arXiv preprint arXiv:2105.06421.

● Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W.,
Tang, Y., Thaler, D., Lee, D.H. and Zhou, Y., 2013, November. Challenges in representation
learning: A report on three machine learning contests. In International conference on neural
information processing (pp. 117-124). Springer, Berlin, Heidelberg.

● Zhou, H., Meng, D., Zhang, Y., Peng, X., Du, J., Wang, K. and Qiao, Y., 2019, October.
Exploring emotion features and fusion strategies for audio-video emotion recognition. In 2019
International Conference on Multimodal Interaction (pp. 562-566).
● Meng, D., Peng, X., Wang, K. and Qiao, Y., 2019, September. Frame attention networks for
facial expression recognition in videos. In 2019 IEEE International Conference on Image
Processing (ICIP) (pp. 3866-3870). IEEE.

● Shi, J. and Zhu, S., 2021. Learning to Amend Facial Expression Representation via De-albino
and Affinity. arXiv preprint arXiv:2103.10189.

● T. Vo, G. Lee, H. Yang and S. Kim, "Pyramid With Super Resolution for In-the-Wild Facial
Expression Recognition," in IEEE Access, vol. 8, pp. 131988-132001, 2020, doi:
10.1109/ACCESS.2020.3010018.

● Acharya, D., Huang, Z., Pani Paudel, D. and Van Gool, L., 2018. Covariance pooling for facial
expression recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition Workshops (pp. 367-374).

● Burkert, P., Trier, F., Afzal, M.Z., Dengel, A. and Liwicki, M., 2015. Dexpression: Deep
convolutional neural network for expression recognition. arXiv preprint arXiv:1509.05371.

● Ming, Z., Xia, J., Luqman, M.M., Burie, J.C. and Zhao, K., 2019. Dynamic multi-task learning
for face recognition with facial expression. arXiv preprint arXiv:1911.03281.

● Wang, K., Peng, X., Yang, J., Meng, D. and Qiao, Y., 2020. Region attention networks for
pose and occlusion robust facial expression recognition. IEEE Transactions on Image
Processing, 29, pp.4057-4069.

● Minaee, S., Minaei, M. and Abdolrashidi, A., 2021. Deep-emotion: Facial expression
recognition using attentional convolutional network. Sensors, 21(9), p.3046.

● Gacav, C., Benligiray, B. and Topal, C., 2017, March. Greedy search for descriptive spatial
face features. In 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (pp. 1497-1501). IEEE.

● Bulat, A., Cheng, S., Yang, J., Garbett, A., Sanchez, E. and Tzimiropoulos, G., 2021.
Pre-training strategies and datasets for facial representation learning. arXiv preprint
arXiv:2103.16554.

● Cuimei, L., Zhiliang, Q., Nan, J. and Jianhua, W., 2017. Human face detection algorithm via
Haar cascade classifier combined with three additional classifiers. In 2017 IEEE International
Conference on Electronic Measurement & Instruments (ICEMI), Yangzhou. IEEE.

● Yang, H., Ciftci, U. and Yin, L., 2018. Facial expression recognition by de-expression
residue learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (pp. 2168-2177).
