NAME: HARSH KULKARNI
ID: 151070012
BATCH: A
SEM: VIII
1. Introduction
2. Motivation
3. Problem statement
4. Methodology
5. Apparatus or necessary tools with specification
6. Implementation
7. Results and discussion
8. Conclusion
9. Future scope
10. References
INTRODUCTION:
Computer-animated agents and robots bring a new dimension to human-computer interaction, making it vital to understand how computers can affect our social life in day-to-day activities. Face-to-face communication is a real-time process operating at a time scale on the order of milliseconds. The level of uncertainty at this time scale is considerable, making it necessary for humans and machines to rely on sensory-rich perceptual primitives rather than slow symbolic inference processes.
This project aims to classify the emotion on a person's face into one of seven categories using deep convolutional neural networks. The implementation follows a published research paper, and the model is trained on the FER-2013 dataset, which was introduced at the International Conference on Machine Learning (ICML). This dataset consists of 35,887 grayscale, 48x48-pixel face images labeled with seven emotions: angry, disgusted, fearful, happy, neutral, sad, and surprised.
The model can be used to predict expressions in both still images and real-time video. In both cases an image must be provided to the model; for real-time video, a frame is captured at some point in time and fed to the model for prediction. The system automatically detects the face using a Haar cascade, crops it, resizes it to a fixed size, and passes it to the model. The model generates seven probability values corresponding to the seven expressions, and the expression with the highest probability is taken as the prediction for that image.
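The final step, mapping the seven probability values to a predicted expression, can be sketched as follows. The `predict_label` helper and the example probability vector are illustrative, not part of the project code:

```python
import numpy as np

# Order of the seven expression classes, as listed above
EMOTIONS = ["angry", "disgusted", "fearful", "happy", "neutral", "sad", "surprised"]

def predict_label(probabilities):
    """Map the model's seven probability outputs to the predicted expression."""
    probabilities = np.asarray(probabilities)
    return EMOTIONS[int(np.argmax(probabilities))]

# A distribution peaking at index 3 maps to "happy"
print(predict_label([0.05, 0.05, 0.10, 0.50, 0.10, 0.10, 0.10]))  # happy
```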
Although our goal is to predict human expressions, we trained the model on both human and animated images. Since we had only about 1,500 human images, too few to build a good model, we added approximately 9,000 animated images and leveraged them during training, ultimately predicting expressions on human images.
For better prediction we decided to keep the size of each image at 350 x 350.
PROBLEM STATEMENT:
Human emotions and intentions are expressed through facial expressions, and deriving an efficient and effective feature representation is the fundamental component of a facial expression system. Facial expressions convey non-verbal cues, which play an important role in interpersonal relations. Automatic recognition of facial expressions can be an important component of natural human-machine interfaces; it may also be used in behavioral science and in clinical practice. An automatic Facial Expression Recognition system needs to solve the following problems: detection and location of faces in a cluttered scene, facial feature extraction, and facial expression classification.
In this project a facial expression recognition system is implemented using a convolutional neural network. Facial images are classified into seven facial expression categories, namely Anger, Disgust, Fear, Happy, Sad, Surprise, and Neutral. A Kaggle dataset is used to train and test the classifier.
METHODOLOGY:
The facial expression recognition system is implemented using convolutional neural networks.
The block diagram of the system is shown in the following figures:
3) Pooling (sub-sampling):
Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of
each feature map but retains the most important information. Spatial Pooling can be of different
types: Max, Average, Sum etc. In practice, Max Pooling has been shown to work better.
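A minimal NumPy sketch of 2x2 max pooling with stride 2; the 4x4 feature map is made up for illustration:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the maximum of each 2x2 block."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 0],
               [1, 2, 9, 8],
               [0, 1, 7, 6]])
print(max_pool_2x2(fm))  # [[6 5]
                         #  [2 9]]
```

Each 2x2 block of the input collapses to its maximum value, halving the width and height while keeping the strongest activations.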
The Fully Connected layer is a traditional Multi-Layer Perceptron that uses a softmax activation
function in the output layer. The term “Fully Connected” implies that every neuron in the previous layer
is connected to every neuron on the next layer. The output from the convolutional and pooling layers
represent high-level features of the input image. The purpose of the Fully Connected layer is to use these
features for classifying the input image into various classes based on the training dataset.
Softmax is used as the activation function of the output layer. It treats the outputs as scores for each class: the function mapping stays unchanged, and the scores are interpreted as the unnormalized log probabilities for each class. Softmax is calculated as:
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
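As a sketch, the softmax over seven class scores can be computed as follows; the score values here are made-up numbers:

```python
import numpy as np

def softmax(scores):
    """Turn unnormalized log probabilities (scores) into class probabilities."""
    exps = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1, 3.0, 0.5, 0.2, 1.5])
probs = softmax(scores)
print(probs.sum())  # sums to 1 (up to floating point)
```

The largest score always receives the largest probability, so the argmax of the probabilities equals the argmax of the raw scores.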
APPARATUS OR NECESSARY TOOLS WITH SPECIFICATION:
1) Dataset:
The dataset from the Kaggle Facial Expression Recognition Challenge (FER-2013) is used for training and testing. It comprises pre-cropped, 48-by-48-pixel grayscale images of faces, each labeled with one of seven emotion classes: anger, disgust, fear, happiness, sadness, surprise, and neutral. The dataset has a training set of 35,887 facial images with facial expression labels. The dataset has class imbalance issues, since some classes have a large number of examples while others have few. The dataset is balanced using oversampling, by increasing the number of examples in minority classes. The balanced dataset contains 40,263 images, of which 29,263 are used for training, 6,000 for testing, and 5,000 for validation.
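One simple way to oversample minority classes is random duplication with replacement, sketched here with NumPy; the exact balancing procedure used in the project is not specified, so this is an illustrative assumption:

```python
import numpy as np

def oversample_indices(labels, seed=0):
    """Return sample indices in which every class appears as often as the largest one."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    indices = []
    for c in classes:
        members = np.flatnonzero(labels == c)
        # duplicate random members of minority classes until they reach the target count
        extra = rng.choice(members, size=target - len(members), replace=True)
        indices.extend(members)
        indices.extend(extra)
    return np.array(indices)

labels = [0] * 6 + [1] * 2                            # toy imbalanced label set
balanced = np.asarray(labels)[oversample_indices(labels)]
print(np.bincount(balanced))                          # [6 6]
```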
2) Architecture of CNN:
A typical convolutional neural network contains an input layer, some convolutional layers, some fully connected layers, and an output layer. The CNN is designed with some modifications of the LeNet architecture [10]. It has 6 layers, not counting the input and output layers. The architecture of the convolutional neural network used in the project is shown in the following figure.
1. Input Layer:
The input layer has predetermined, fixed dimensions, so each image must be pre-processed before it can be fed into the layer. Normalized grayscale images of size 48 x 48 pixels from the Kaggle dataset are used for training, validation, and testing. For testing, laptop webcam images are also used; in these, the face is detected and cropped using the OpenCV Haar Cascade Classifier and then normalized.
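The normalization step can be sketched as scaling 8-bit pixel values into [0, 1]; the exact normalization used is not specified in detail, and the OpenCV detection and cropping steps are omitted here:

```python
import numpy as np

def normalize(gray_face):
    """Scale an 8-bit grayscale face crop to [0, 1] for the 48x48 input layer."""
    gray_face = np.asarray(gray_face, dtype=np.float32)
    return gray_face / 255.0

face = np.full((48, 48), 128, dtype=np.uint8)   # dummy mid-gray 48x48 face crop
x = normalize(face)
print(x.shape)  # (48, 48)
```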
2. Convolution and Pooling (ConvPool) Layers:
Convolution and pooling are done in batches. Each batch has N images, and the CNN filter weights are updated on those batches. Each convolution layer takes an image batch of four dimensions: N x color-channels x width x height. The feature maps (filters) for convolution are also four-dimensional (number of feature maps in, number of feature maps out, filter width, filter height). In each convolution layer, a four-dimensional convolution is computed between the image batch and the feature maps. After convolution, the only parameters that change are the image width and height:
New image width = old image width – filter width + 1
New image height = old image height – filter height + 1
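The size formulas above can be checked with a one-line helper; the 48-pixel input and 5-pixel filter match the sizes used elsewhere in this report:

```python
def conv_output_size(in_size, filter_size):
    """Output size of a 'valid' convolution with stride 1 and no padding."""
    return in_size - filter_size + 1

print(conv_output_size(48, 5))  # 44
```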
3. Fully Connected Layer:
This layer is inspired by the way neurons transmit signals through the brain. It takes a large number of input features and transforms them through layers connected with trainable weights. Two hidden layers of 500 and 300 units are used in the fully connected stage. The weights of these layers are trained by forward propagation of the training data followed by backward propagation of the errors. Backpropagation starts by evaluating the difference between the prediction and the true value, and back-calculates the weight adjustment needed in every preceding layer. We can control the training speed and the complexity of the architecture by tuning hyper-parameters such as the learning rate and network density. Hyper-parameters for this layer include learning rate, momentum, regularization parameter, and decay.
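A forward pass through the two hidden layers can be sketched with NumPy. The 2304-unit input (a flattened 48x48 map), the random weights, and the ReLU activations are illustrative assumptions, not the trained parameters of the project:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical weight matrices for the 2304 -> 500 -> 300 -> 7 stack
W1 = rng.normal(scale=0.01, size=(2304, 500))
W2 = rng.normal(scale=0.01, size=(500, 300))
W3 = rng.normal(scale=0.01, size=(300, 7))

x = rng.random(2304)       # flattened feature vector from the conv/pool stage
h1 = relu(x @ W1)          # first hidden layer, 500 units
h2 = relu(h1 @ W2)         # second hidden layer, 300 units
scores = h2 @ W3           # unnormalized scores for the seven classes
print(scores.shape)        # (7,)
```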
4. Output Layer:
The output from the second hidden layer is connected to the output layer, which has seven distinct classes. Using the Softmax activation function, the output is obtained as probabilities for each of the seven classes; the class with the highest probability is the predicted class.
3) Webcam:
A webcam supplies the live photos used during the physical testing phase.
IMPLEMENTATION:
FILE CONTENT:
● emojis (folder)
● model.py (file)
● multiface.py (file)
● singleface.py (file)
● model_1_atul.tflearn.data-00000-of-00001 (file)
● model_1_atul.tflearn.index (file)
● model_1_atul.tflearn.meta (file)
● haarcascade_frontalface_default.xml (file)
CODE:
model.py (file)

import sys
import os
from os.path import isfile

import numpy as np
import tensorflow as tf
import tflearn
from tflearn.layers.core import input_data, fully_connected

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.logging.set_verbosity(tf.logging.ERROR)

class EMR:
    def __init__(self):
        self.target_classes = ['angry', 'disgusted', 'fearful', 'happy',
                               'neutral', 'sad', 'surprised']

    def build_network(self):
        """
        Build the network. Input is a 48x48 grayscale image.
        """
        self.network = input_data(shape=[None, 48, 48, 1])
        # ... convolution, pooling and dropout layers elided in the original listing ...
        print("Dropout ", self.network.shape[1:])
        self.network = fully_connected(self.network,
                                       len(self.target_classes), activation='softmax')
        print("Output ", self.network.shape[1:])
        print("\n")
        self.model = tflearn.DNN(self.network, checkpoint_path='model_1_atul',
                                 max_checkpoints=1, tensorboard_verbose=2)
        # Loads the model weights from the checkpoint
        self.load_model()

    def predict(self, image):
        """
        Return the seven class probabilities for a pre-processed image.
        """
        if image is None:
            return None
        # reshape to the [batch, height, width, channels] input format
        image = image.reshape([-1, 48, 48, 1])
        return self.model.predict(image)

    def load_model(self):
        """
        Load pre-trained weights if a saved checkpoint exists.
        """
        if isfile("model_1_atul.tflearn.meta"):
            self.model.load("model_1_atul.tflearn")
        else:
            print("---> Couldn't find trained model checkpoint")

if __name__ == "__main__":
    if sys.argv[1] == 'singleface':
        import singleface
    if sys.argv[1] == 'multiface':
        import multiface
multiface.py (file)

import cv2
import numpy as np
from model import EMR

EMOTIONS = ['angry', 'disgusted', 'fearful', 'happy', 'neutral', 'sad', 'surprised']

network = EMR()
network.build_network()

cap = cv2.VideoCapture(0)
facecasc = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Use the Haar cascade to find faces and draw a bounding box around each
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = facecasc.detectMultiScale(gray, 1.3, 5)
    if len(faces) > 0:
        # draw a box around every detected face
        for (x, y, w, h) in faces:
            frame = cv2.rectangle(frame, (x, y-30), (x+w, y+h+10), (255, 0, 0), 2)
            newimg = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
            newimg = cv2.resize(newimg, (48, 48),
                                interpolation=cv2.INTER_CUBIC) / 255.
            result = network.predict(newimg)
            if result is not None:
                maxindex = np.argmax(result[0])
                font = cv2.FONT_HERSHEY_SIMPLEX
                cv2.putText(frame, EMOTIONS[maxindex], (x+5, y-35),
                            font, 2, (255, 255, 255), 2, cv2.LINE_AA)
    cv2.imshow('Video', cv2.resize(frame, None, fx=2, fy=2,
                                   interpolation=cv2.INTER_CUBIC))
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
singleface.py (file)

import cv2
import numpy as np
from model import EMR

EMOTIONS = ['angry', 'disgusted', 'fearful', 'happy', 'neutral', 'sad', 'surprised']

cascade_classifier = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

def format_image(image):
    """
    Format a frame or face crop into the 48x48 grayscale input for the network.
    """
    if len(image.shape) > 2 and image.shape[2] == 3:
        # color image: convert it to grayscale
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    elif len(image.shape) == 1:
        # encoded image read from a buffer
        image = cv2.imdecode(image, cv2.IMREAD_GRAYSCALE)
    try:
        # resize the image so that it can be passed to the neural network
        image = cv2.resize(image, (48, 48), interpolation=cv2.INTER_CUBIC) / 255.
    except Exception:
        print("----->Problem during resize")
        return None
    return image

network = EMR()
network.build_network()

cap = cv2.VideoCapture(0)
font = cv2.FONT_HERSHEY_SIMPLEX
feelings_faces = []

while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade_classifier.detectMultiScale(gray, scaleFactor=1.3,
                                                minNeighbors=5)
    if len(faces) > 0:
        # draw a box around the face with maximum area
        max_area_face = faces[0]
        for face in faces:
            if face[2] * face[3] > max_area_face[2] * max_area_face[3]:
                max_area_face = face
        (x, y, w, h) = max_area_face
        frame = cv2.rectangle(frame, (x, y-50), (x+w, y+h+10), (255, 0, 0), 2)
        # predict the expression for the cropped face and label the box
        result = network.predict(format_image(gray[y:y+h, x:x+w]))
        if result is not None:
            cv2.putText(frame, EMOTIONS[np.argmax(result[0])], (x+5, y-55),
                        font, 1, (255, 255, 255), 2, cv2.LINE_AA)
    cv2.imshow('Video', cv2.resize(frame, None, fx=2, fy=2,
                                   interpolation=cv2.INTER_CUBIC))
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
RESULTS AND DISCUSSION:
CONCLUSION:
The CNN architecture for facial expression recognition described above was implemented in Python. Along with the Python programming language, the NumPy, Theano, and CUDA libraries were used.
The training image batch size was 30, while the filter maps were of size 20x5x5 for both convolution layers. A validation set was used to validate the training process: at the last batch of every epoch, the validation cost, validation error, training cost, and training error are calculated. The input parameters for training are the image set and the corresponding output labels. The training process updated the weights of the feature maps and hidden layers based on hyper-parameters such as learning rate, momentum, regularization, and decay. In this system the batch-wise learning rate was 10e-5, momentum 0.99, regularization 10e-7, and decay 0.99999.
The comparison of validation cost, validation error, training cost, and training error is shown in the figures below.
FUTURE SCOPE:
We obtained a reasonably good result, but there is still considerable scope for improvement.
1. To get better accuracy, we need many more human images with good variance among them.
2. We can also design our own CNN model, given enough time and computation power. Of course, this requires far more images; with careful hyper-parameter tuning, training on the order of 100,000 human images with good variance among them, and keeping the size of each image above 400 x 400, accuracy close to 99% in real-world, real-time use may be achievable.
3. In future work, the model can be extended to color images. This will allow investigating the efficacy of pre-trained models such as AlexNet or VGGNet for facial emotion recognition.
REFERENCES:
1. Shan, C., Gong, S., & McOwan, P. W. (2005, September). Robust facial expression recognition using local binary patterns. In Image Processing, 2005. ICIP 2005. IEEE International Conference on (Vol. 2, pp. II-370). IEEE.
2. Chibelushi, C. C., & Bourel, F. (2003). Facial expression recognition: A brief tutorial overview. CVonline: On-Line Compendium of Computer Vision, 9.
3. LeCun, Y. "LeNet-5, convolutional neural networks". Retrieved 16 November 2013.
4. "Convolutional Neural Networks (LeNet) – DeepLearning 0.1 documentation". DeepLearning 0.1. LISA Lab. Retrieved 31 August 2013.
5. Matsugu, M., Mori, K., Mitari, Y., & Kaneda, Y. (2003). "Subject independent facial expression recognition with robust face detection using a convolutional neural network" (PDF). Neural Networks, 16(5), 555–559. doi:10.1016/S0893-6080(03)00115-1. Retrieved 17 November 2013.
6. Zor, C. (2008). Facial expression recognition. Master's thesis, University of Surrey, Guildford.
7. Suwa, M., Sugie, N., & Fujimora, K. (1978). A preliminary note on pattern recognition of human emotional expression. In Proc. International Joint Conference on Pattern Recognition, pages 408–410.