
PROJECT REPORT

TOPIC: REAL TIME FACIAL EXPRESSION RECOGNITION

NAME: HARSH KULKARNI

ID: 151070012

BATCH: A

BRANCH: COMPUTER TECHNOLOGY

SEM: VIII

UNDER GUIDANCE OF: PROF. V.K. SAMBHE

PROF. S.T. SHINGADE


INDEX

1. Introduction
2. Motivation
3. Problem Statement
4. Methodology
5. Apparatus or Necessary Tools with Specification
6. Implementation
7. Results and Discussion
8. Conclusion
9. Future Scope
10. References
INTRODUCTION:

Computer-animated agents and robots bring a new dimension to human-computer interaction, which makes it vital to understand how computers can affect our social lives in day-to-day activities. Face-to-face communication is a real-time process operating at a time scale on the order of milliseconds. The level of uncertainty at this time scale is considerable, making it necessary for humans and machines to rely on sensory-rich perceptual primitives rather than slow symbolic inference processes.

This project aims to classify the emotion on a person's face into one of seven categories using deep convolutional neural networks. The implementation follows an existing research approach, and the model is trained on the FER-2013 dataset, which was introduced in a challenge at the International Conference on Machine Learning (ICML). The dataset consists of 35887 grayscale, 48x48-pixel face images labeled with seven emotions: angry, disgusted, fearful, happy, neutral, sad and surprised.

A facial expression recognition system is implemented using a Convolutional Neural Network (CNN). The CNN model of the project is based on the LeNet architecture. The Kaggle facial expression dataset with seven facial expression labels (happy, sad, surprise, fear, anger, disgust and neutral) is used in this project. The system achieved 56.77% accuracy and 0.57 precision on the testing dataset.
MOTIVATION:
This report discusses a Facial Expression Recognition System which performs facial expression analysis in near real time from a live webcam feed. The primary objectives were to obtain results in near real time in a light-invariant, person-independent and pose-invariant way. The system is composed of two different entities: a trainer and an evaluator. Each frame of the video feed is passed through a series of steps, including Haar classifiers, skin detection, feature extraction, feature point tracking, and a learned Support Vector Machine model to classify emotions, to achieve a trade-off between accuracy and result rate. A processing time of 100-120 ms per 10 frames was achieved with an accuracy of around 60%. Accuracy is measured over a variety of interaction and classification scenarios. We conclude by discussing the relevance of this work to human-computer interaction and by exploring further measures that can be taken.

This model can be used to predict expressions in both still images and real-time video. In both cases an image must be provided to the model; for real-time video, a frame is grabbed at a point in time and fed to the model for expression prediction. The system automatically detects the face using a Haar cascade, crops it, resizes the image to a specific size and passes it to the model for prediction. The model generates seven probability values corresponding to the seven expressions, and the expression with the highest probability is the predicted expression for that image, as sketched below.
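As a minimal illustration of this final step (the probability values below are assumed examples, not actual model output):

import numpy as np

EMOTIONS = ['angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprised', 'neutral']

# assumed example output of the model: seven softmax probabilities
probabilities = np.array([0.05, 0.02, 0.08, 0.60, 0.10, 0.05, 0.10])

# the predicted expression is the class with the highest probability
print(EMOTIONS[np.argmax(probabilities)])  # happy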

Although the goal here is to predict human expressions, the model was trained on both human and animated images. Since only approximately 1500 human images were available, which is too few to build a good model, approximately 9000 animated images were added and leveraged for training, with prediction ultimately performed on human images.

For better prediction, the size of each image was kept at 350 x 350 pixels.
PROBLEM STATEMENT:
Human emotions and intentions are expressed through facial expressions, and deriving an efficient and effective feature representation is the fundamental component of a facial expression recognition system. Facial expressions convey non-verbal cues, which play an important role in interpersonal relations. Automatic recognition of facial expressions can be an important component of natural human-machine interfaces; it may also be used in behavioral science and in clinical practice. An automatic Facial Expression Recognition system needs to solve the following problems: detection and location of faces in a cluttered scene, facial feature extraction, and facial expression classification.

In this project, the facial expression recognition system is implemented using a convolutional neural network. Facial images are classified into seven facial expression categories, namely Anger, Disgust, Fear, Happy, Sad, Surprise and Neutral. The Kaggle dataset is used to train and test the classifier.
METHODOLOGY:
The facial expression recognition system is implemented using convolutional neural networks.
The block diagram of the system is shown in the following figures:

The following methods are used:


1) Convolution:
The primary purpose of Convolution in case of a CNN is to extract features from the
input image. Convolution preserves the spatial relationship between pixels by learning image
features using small squares of input data.
The 2-dimensional convolution between image A and filter B can be given as:

C(i, j) = Σ_m Σ_n A(m, n) · B(i − m, j − n)

where the size of A is (Ma x Na), the size of B is (Mb x Nb), 0 ≤ i < Ma + Mb − 1 and 0 ≤ j < Na + Nb − 1.
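As a minimal NumPy sketch (illustrative only; the project builds its convolution layers with tflearn), the filter is slid over the image and an element-wise product is summed at each valid position. As in most CNN libraries, this computes cross-correlation, which equals the convolution above up to a flip of the filter, and it yields the output size given later in the ConvPool section:

import numpy as np

def conv2d_valid(A, B):
    """Slide filter B over image A and sum the element-wise products at each valid position."""
    Ma, Na = A.shape
    Mb, Nb = B.shape
    out = np.zeros((Ma - Mb + 1, Na - Nb + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(A[i:i + Mb, j:j + Nb] * B)
    return out

# example: a 48x48 image and a 5x5 filter give a 44x44 feature map
A = np.random.rand(48, 48)
B = np.random.rand(5, 5)
print(conv2d_valid(A, B).shape)  # (44, 44)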
2) Rectified Linear Unit:
An additional operation called ReLU is used after every convolution operation. A Rectified Linear Unit (ReLU) is a neural network unit that applies the following activation function to its input x:
R(x) = max(0, x)
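As a minimal illustration (not part of the project code), ReLU can be applied element-wise with NumPy:

import numpy as np

def relu(x):
    # element-wise max(0, x): negative values are clipped to zero
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # negative entries become 0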

3)Pooling (sub-sampling):
Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of
each feature map but retains the most important information. Spatial Pooling can be of different
types: Max, Average, Sum etc. In practice, Max Pooling has been shown to work better.
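A minimal sketch of 2x2 Max Pooling with NumPy (illustrative only; the project uses tflearn's max_pool_2d):

import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2-D feature map by keeping the maximum of each 2x2 block."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % 2, :w - w % 2]      # crop to even dimensions
    return fm.reshape(fm.shape[0] // 2, 2, fm.shape[1] // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 0],
               [7, 2, 9, 1],
               [0, 8, 3, 5]], dtype=float)
print(max_pool_2x2(fm))   # [[6. 4.]
                          #  [8. 9.]]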

4)Classification (Multilayer Perceptron):

The Fully Connected layer is a traditional Multi-Layer Perceptron that uses a softmax activation function in the output layer. The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features to classify the input image into various classes based on the training dataset.
Softmax is used as the activation function of the output layer. It treats the raw outputs of the network as scores for each class; these scores are interpreted as unnormalized log probabilities and are converted into a probability distribution. For a score vector z over K classes, Softmax is calculated as:

softmax(z_i) = e^(z_i) / Σ_j e^(z_j),  for j = 1, ..., K
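A numerically stable NumPy sketch of this calculation (illustrative only; the project relies on the built-in softmax activation in tflearn, and the score vector here is an assumed example):

import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the maximum for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

scores = np.array([2.0, 1.0, 0.1, 3.0, -1.0, 0.5, 0.0])  # assumed scores for the 7 classes
print(softmax(scores))   # seven probabilities summing to 1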
APPARATUS OR NECESSARY TOOLS WITH
SPECIFICATION:
1)Dataset:
The dataset from the Kaggle Facial Expression Recognition Challenge (FER-2013) is used for training and testing. It comprises pre-cropped, 48-by-48-pixel grayscale images of faces, each labeled with one of the 7 emotion classes: anger, disgust, fear, happiness, sadness, surprise, and neutral. The dataset has a training set of 35887 facial images with facial expression labels. The dataset has class imbalance issues, since some classes have a large number of examples while some have few. The dataset is balanced using oversampling, by increasing the number of samples in the minority classes. The balanced dataset contains 40263 images, of which 29263 images are used for training, 6000 images are used for testing, and 5000 images are used for validation.
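A minimal sketch of the oversampling idea, assuming the images and labels are held in NumPy arrays (the exact resampling scheme behind the 40263-image balanced set is not detailed here, so this sketch simply duplicates random minority-class samples until every class matches the largest one):

import numpy as np

def oversample(images, labels, seed=0):
    """Duplicate random samples of minority classes until every class matches the largest class."""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        class_idx = np.where(labels == c)[0]
        idx.extend(class_idx)
        # randomly repeat samples of this class to reach the target count
        idx.extend(rng.choice(class_idx, target - len(class_idx), replace=True))
    idx = np.array(idx)
    return images[idx], labels[idx]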
2)Architecture of CNN:
A typical architecture of a convolutional neural network contains an input layer, some convolutional layers, some fully-connected layers, and an output layer. The CNN is designed with some modifications to the LeNet architecture [10]. It has 6 layers, not counting the input and output. The architecture of the Convolutional Neural Network used in the project is shown in the following figure.

1.Input Layer:
The input layer has predetermined, fixed dimensions, so the image must be pre-processed before it can be fed into the layer. Normalized grayscale images of size 48 x 48 pixels from the Kaggle dataset are used for training, validation and testing. For testing, laptop webcam images are also used; in these, the face is detected and cropped using the OpenCV Haar Cascade Classifier and then normalized.
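A minimal sketch of this pre-processing step with OpenCV (illustrative; the full version appears in singleface.py in the Implementation section):

import cv2

face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

def preprocess(frame):
    """Detect the first face, crop it, resize it to 48x48 and normalize it to [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    face = gray[y:y + h, x:x + w]
    return cv2.resize(face, (48, 48), interpolation=cv2.INTER_CUBIC) / 255.0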
2.Convolution and Pooling (ConvPool) Layers:
Convolution and pooling are done in batches. Each batch has N images, and the CNN filter weights are updated on those batches. Each convolution layer takes a four-dimensional image batch as input (N x color channels x width x height). The feature maps, or filters, for convolution are also four-dimensional (number of feature maps in, number of feature maps out, filter width, filter height). In each convolution layer, a four-dimensional convolution is calculated between the image batch and the feature maps. After convolution, the only parameters that change are the image width and height:
New image width = old image width – filter width + 1
New image height = old image height – filter height + 1
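For example, convolving a 48 x 48 input with a 5 x 5 filter produces a 44 x 44 feature map, since 48 − 5 + 1 = 44.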
3. Fully Connected Layer:
This layer is inspired by the way neurons transmit signals through the brain. It takes a
large number of input features and transforms features through layers connected with trainable
weights. Two hidden layers of size 500 and 300 units are used in fully-connected layers. The
weights of these layers are trained by forward propagation of the training data, followed by backward propagation of its errors. Back-propagation starts by evaluating the difference between the prediction and the true value, and back-calculates the weight adjustments needed for every preceding layer. We can control the training speed and the complexity of the architecture by tuning the
hyper-parameters, such as learning rate and network density. Hyper-parameters for this layer
include learning rate, momentum, regularization parameter, and decay.

4.Output Layer:
The output from the second hidden layer is connected to an output layer with seven distinct classes. Using the Softmax activation function, the output is obtained as probabilities for each of the seven classes. The class with the highest probability is the predicted class.
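A sketch of the network described in this section, written with tflearn for consistency with the Implementation section (the hidden layer sizes of 500 and 300 units follow the description above; the two ConvPool stages with 20 feature maps of size 5x5 follow the Conclusion; the ReLU activations and 2x2 pooling windows are assumptions):

import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

# 48x48 grayscale input
net = input_data(shape=[None, 48, 48, 1])
# two ConvPool stages (20 feature maps of size 5x5 each)
net = conv_2d(net, 20, 5, activation='relu')
net = max_pool_2d(net, 2)
net = conv_2d(net, 20, 5, activation='relu')
net = max_pool_2d(net, 2)
# fully connected hidden layers of 500 and 300 units
net = fully_connected(net, 500, activation='relu')
net = fully_connected(net, 300, activation='relu')
# 7-way softmax output layer
net = fully_connected(net, 7, activation='softmax')
net = regression(net, optimizer='momentum', loss='categorical_crossentropy')
model = tflearn.DNN(net)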

3)Webcam:
A laptop webcam is used to capture live frames during the physical testing phase; the captured images are fed to the trained model for prediction.
IMPLEMENTATION:

FILE CONTENT:
● emojis (folder)
● model.py (file)
● multiface.py (file)
● singleface.py (file)
● model_1_atul.tflearn.data-00000-of-00001 (file)
● model_1_atul.tflearn.index (file)
● model_1_atul.tflearn.meta (file)
● haarcascade_frontalface_default.xml (file)
CODE:
model.py (file)

import numpy as np
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected, flatten
from tflearn.layers.conv import conv_2d, max_pool_2d, avg_pool_2d
from tflearn.layers.merge_ops import merge
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.estimator import regression
from os.path import isfile, join
import sys
import tensorflow as tf
import os

# prevents appearance of tensorflow warnings
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.logging.set_verbosity(tf.logging.ERROR)


class EMR:

    def __init__(self):
        self.target_classes = ['angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprised', 'neutral']

    def build_network(self):
        """
        Build the convnet.
        Input is 48x48.
        3072 nodes in the fully connected layer.
        """
        self.network = input_data(shape=[None, 48, 48, 1])
        print("Input data ", self.network.shape[1:])
        self.network = conv_2d(self.network, 64, 5, activation='relu')
        print("Conv1 ", self.network.shape[1:])
        self.network = max_pool_2d(self.network, 3, strides=2)
        print("Maxpool1 ", self.network.shape[1:])
        self.network = conv_2d(self.network, 64, 5, activation='relu')
        print("Conv2 ", self.network.shape[1:])
        self.network = max_pool_2d(self.network, 3, strides=2)
        print("Maxpool2 ", self.network.shape[1:])
        self.network = conv_2d(self.network, 128, 4, activation='relu')
        print("Conv3 ", self.network.shape[1:])
        self.network = dropout(self.network, 0.3)
        print("Dropout ", self.network.shape[1:])
        self.network = fully_connected(self.network, 3072, activation='relu')
        print("Fully connected", self.network.shape[1:])
        self.network = fully_connected(self.network, len(self.target_classes), activation='softmax')
        print("Output ", self.network.shape[1:])
        print("\n")
        # Generates a TrainOp which contains the information about the
        # optimization process - optimizer, loss function, etc.
        self.network = regression(self.network, optimizer='momentum',
                                  metric='accuracy', loss='categorical_crossentropy')
        # Creates a model instance.
        self.model = tflearn.DNN(self.network, checkpoint_path='model_1_atul',
                                 max_checkpoints=1, tensorboard_verbose=2)
        # Loads the model weights from the checkpoint
        self.load_model()

    def predict(self, image):
        """
        Image is resized to 48x48, and predictions are returned.
        """
        if image is None:
            return None
        image = image.reshape([-1, 48, 48, 1])
        return self.model.predict(image)

    def load_model(self):
        """
        Loads the pre-trained model.
        """
        if isfile("model_1_atul.tflearn.meta"):
            self.model.load("model_1_atul.tflearn")
        else:
            print("---> Couldn't find model")


if __name__ == "__main__":
    print("\n------------Emotion Detection Program------------\n")
    network = EMR()
    if sys.argv[1] == 'singleface':
        import singleface
    if sys.argv[1] == 'multiface':
        import multiface
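Based on the entry point at the bottom of model.py, the program is launched from the command line with the detection mode as its first argument, for example:

python model.py singleface
python model.py multiface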
multiface.py (file)

import cv2
import sys
import numpy as np
from model import EMR

# prevents opencl usage and unnecessary logging messages
cv2.ocl.setUseOpenCL(False)

EMOTIONS = ['Angry', 'Disgusted', 'Fearful', 'Happy', 'Sad', 'Surprised', 'Neutral']

# Initialize object of EMR class
network = EMR()
network.build_network()

# In case you want to detect emotions on a video, provide the video file path
# instead of 0 for VideoCapture.
cap = cv2.VideoCapture(0)
font = cv2.FONT_HERSHEY_SIMPLEX
feelings_faces = []

# append the list with the emoji images
for index, emotion in enumerate(EMOTIONS):
    feelings_faces.append(cv2.imread('./emojis/' + emotion + '.png', -1))

while True:
    # Again find haar cascade to draw bounding box around face
    ret, frame = cap.read()
    if not ret:
        break
    facecasc = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = facecasc.detectMultiScale(gray, 1.3, 5)
    if len(faces) > 0:
        # draw box around faces
        for face in faces:
            (x, y, w, h) = face
            frame = cv2.rectangle(frame, (x, y - 30), (x + w, y + h + 10), (255, 0, 0), 2)
            newimg = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
            newimg = cv2.resize(newimg, (48, 48), interpolation=cv2.INTER_CUBIC) / 255.
            result = network.predict(newimg)
            if result is not None:
                maxindex = np.argmax(result[0])
                font = cv2.FONT_HERSHEY_SIMPLEX
                cv2.putText(frame, EMOTIONS[maxindex], (x + 5, y - 35), font, 2, (255, 255, 255), 2, cv2.LINE_AA)

    cv2.imshow('Video', cv2.resize(frame, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC))
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
singleface.py (file)

import cv2
import sys
import numpy as np
from model import EMR

# prevents opencl usage and unnecessary logging messages
cv2.ocl.setUseOpenCL(False)

EMOTIONS = ['angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprised', 'neutral']


def format_image(image):
    """
    Function to format a frame: convert to grayscale, detect the largest face,
    crop it and resize it to 48x48.
    """
    if len(image.shape) > 2 and image.shape[2] == 3:
        # determine whether the image is color
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        # Image read from buffer
        image = cv2.imdecode(image, cv2.CV_LOAD_IMAGE_GRAYSCALE)

    cascade_classifier = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
    faces = cascade_classifier.detectMultiScale(image, scaleFactor=1.3, minNeighbors=5)

    if not len(faces) > 0:
        return None

    # initialize the first face as having maximum area, then find the one with max area
    max_area_face = faces[0]
    for face in faces:
        if face[2] * face[3] > max_area_face[2] * max_area_face[3]:
            max_area_face = face
    face = max_area_face

    # extract ROI of face
    image = image[face[1]:(face[1] + face[2]), face[0]:(face[0] + face[3])]

    try:
        # resize the image so that it can be passed to the neural network
        image = cv2.resize(image, (48, 48), interpolation=cv2.INTER_CUBIC) / 255.
    except Exception:
        print("----->Problem during resize")
        return None

    return image


# Initialize object of EMR class
network = EMR()
network.build_network()

cap = cv2.VideoCapture(0)
font = cv2.FONT_HERSHEY_SIMPLEX
feelings_faces = []

# append the list with the emoji images
for index, emotion in enumerate(EMOTIONS):
    feelings_faces.append(cv2.imread('./emojis/' + emotion + '.png', -1))

while True:
    # Again find haar cascade to draw bounding box around face
    ret, frame = cap.read()
    if not ret:
        break
    facecasc = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = facecasc.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)

    # compute softmax probabilities
    result = network.predict(format_image(frame))
    if result is not None:
        # write the different emotions and draw a bar to indicate the probability for each class
        for index, emotion in enumerate(EMOTIONS):
            cv2.putText(frame, emotion, (10, index * 20 + 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
            cv2.rectangle(frame, (130, index * 20 + 10),
                          (130 + int(result[0][index] * 100), (index + 1) * 20 + 4), (255, 0, 0), -1)

        # find the emotion with maximum probability and display it
        maxindex = np.argmax(result[0])
        font = cv2.FONT_HERSHEY_SIMPLEX
        cv2.putText(frame, EMOTIONS[maxindex], (10, 360), font, 2, (255, 255, 255), 2, cv2.LINE_AA)
        face_image = feelings_faces[maxindex]

        for c in range(0, 3):
            # The shape of face_image is (x, y, 4); the fourth channel is the alpha channel.
            # Blend the emoji over the region of interest using the alpha channel as a mask.
            frame[200:320, 10:130, c] = (face_image[:, :, c] * (face_image[:, :, 3] / 255.0)
                                         + frame[200:320, 10:130, c] * (1.0 - face_image[:, :, 3] / 255.0))

    if len(faces) > 0:
        # draw box around face with maximum area
        max_area_face = faces[0]
        for face in faces:
            if face[2] * face[3] > max_area_face[2] * max_area_face[3]:
                max_area_face = face
        face = max_area_face
        (x, y, w, h) = max_area_face
        frame = cv2.rectangle(frame, (x, y - 50), (x + w, y + h + 10), (255, 0, 0), 2)

    cv2.imshow('Video', cv2.resize(frame, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC))
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
RESULTS AND DISCUSSION:

As noted in the Introduction, the system achieved 56.77% accuracy and 0.57 precision on the testing dataset.
CONCLUSION:
The CNN architecture for facial expression recognition described above was implemented in Python. Along with the Python programming language, the NumPy, Theano and CUDA libraries were used. The training image batch size was taken as 30, while the filter maps are of size 20x5x5 for both convolution layers. A validation set was used to validate the training process. At the last batch of every epoch, the validation cost, validation error, training cost and training error are calculated. The input parameters for training are the image set and the corresponding output labels. The training process updated the weights of the feature maps and hidden layers based on hyper-parameters such as learning rate, momentum, regularization and decay. In this system, a batch-wise learning rate of 10e-5 was used, with momentum 0.99, regularization 10e-7 and decay 0.99999.
The comparison of validation cost, validation error, training cost and training error is shown in the figures below.
FUTURE SCOPE:
We have obtained a reasonably good result, but there is still significant scope for improvement.

1. In order to get better accuracy, we need many more human images with good variance among them.

2. We can also fine-tune the last 2 or 3 convolution layers to increase accuracy.

3. We can also design our own CNN model, given sufficient time and computation power. Of course, we would need many more images for this. With careful hyper-parameter tuning, training the model on 100k human images with good variance among them, and keeping the size of each image above 400 x 400, we could potentially achieve close to 99% accuracy in real-world, real-time use.

4. In future work, the model can be extended to color images. This will allow us to investigate the efficacy of pre-trained models such as AlexNet or VGGNet for facial emotion recognition.
REFERENCES:
1. Shan, C., Gong, S., & McOwan, P. W. (2005). Robust facial expression recognition using local binary patterns. In Proceedings of the IEEE International Conference on Image Processing (ICIP 2005), Vol. 2, pp. II-370. IEEE.
2. Chibelushi, C. C., & Bourel, F. (2003). Facial expression recognition: A brief tutorial overview. CVonline: On-Line Compendium of Computer Vision, 9.
3. LeCun, Yann. "LeNet-5, convolutional neural networks". Retrieved 16 November 2013.
4. "Convolutional Neural Networks (LeNet) – DeepLearning 0.1 documentation". DeepLearning 0.1. LISA Lab. Retrieved 31 August 2013.
5. Matsugu, Masakazu; Mori, Katsuhiko; Mitari, Yusuke; Kaneda, Yuji (2003). "Subject independent facial expression recognition with robust face detection using a convolutional neural network" (PDF). Neural Networks, 16(5), 555–559. doi:10.1016/S0893-6080(03)00115-1. Retrieved 17 November 2013.
6. Zor, C. (2008). "Facial expression recognition." Master's thesis, University of Surrey, Guildford.
7. Suwa, M., Sugie, N., & Fujimora, K. (1978). A preliminary note on pattern recognition of human emotional expression. In Proceedings of the International Joint Conference on Pattern Recognition, pp. 408–410.
