
SANJAY GHODAWAT UNIVERSITY

Kolhapur
Established under section 2(f) of UGC Act 1956
Sanjay Ghodawat University Act XL of 2017 of Govt. of Maharashtra | Approved by PCI, COA & AICTE

Project SRS
On
“Face Emotion Recognition Using Deep Learning”
A report submitted in partial fulfillment of the requirements for the

Project Phase I

School of Computer Science and Engineering

Harshad Nivas Patil PRN No: 21ST114282023


Prathmesh Babalu Bhat PRN No: 20ST114281004
Pratish Akash Kavade PRN No: 20ST114281026

Program: CSE Class: B. Tech Final Year Div: B

Under Supervision of
Dr. Chetan Arage



SANJAY GHODAWAT UNIVERSITY
Kolhapur
Established under section 2(f) of UGC Act 1956
Sanjay Ghodawat University Act XL of 2017 of Govt. of Maharashtra | Approved by PCI, COA & AICTE

School of Computer Science and Engineering

CERTIFICATE
This is to certify that the project synopsis entitled “Face Emotion Recognition Using Deep Learning” submitted

By
Harshad Nivas Patil PRN No: 21ST114282023
Prathmesh Babalu Bhat PRN No: 20ST114281004
Pratish Akash Kavade PRN No: 20ST114281026

Program: CSE Class: B. Tech Final Year Div: B


is work done by them and submitted during the 2023-24 academic year, in partial fulfillment of the requirements for Project Phase I at Sanjay Ghodawat University, Kolhapur.

Dr. Chetan Arage              Mrs. Veena Mali                    Dr. Mrs. Deepika Patil
Project Guide                 Project Phase I Coordinator        HOD (CSE)



Table of Contents

Chapter 1: Introduction
  1.1 Background and Context
  1.2 Purpose
  1.3 Significance of the Project

Chapter 2: Related Work
  2.1 Literature Survey
  2.2 Gap Identified

Chapter 3: Problem Statement and Objectives
  3.1 Problem Statement
  3.2 Objectives
  3.3 Scope

Chapter 4: Overall Description
  4.1 Product Perspective
  4.2 Product Functions
  4.3 User Characteristics
  4.4 Hardware and Software Requirements

Chapter 5: Proposed Work
  5.1 Functional Requirements
  5.2 Non-functional Requirements
    5.2.1 Performance Requirements
    5.2.2 Safety Requirements
    5.2.3 Security Requirements

Chapter 6: Other Requirements
  6.1 Design Constraints

Chapter 7: Methodology
  7.1 Proposed System
  7.2 System Flowchart
  7.3 Block Diagram
  7.4 Data Flow Diagram
  7.5 ER Diagram
  7.6 Sequence Diagram

Chapter 8: References

Chapter 9: Appendices



1. Introduction

1.1 Background and Context:

Facial emotion detection and counting is a challenging task in computer vision, especially for videos captured from aerial platforms or crowded scenes. It has many applications in security, surveillance, traffic analysis, crowd management, and social distancing monitoring.

However, traditional methods based on handcrafted features, background subtraction, or optical flow often fail to handle complex scenarios with varying illumination, occlusion, camera motion, and scale changes.

Therefore, deep learning methods based on CNNs have been proposed to improve
the performance and robustness of facial emotion detection and counting
systems.

CNNs are a type of artificial neural network that can learn hierarchical representations of visual data by applying multiple layers of convolution, pooling, and activation functions. CNNs have achieved state-of-the-art results in various computer vision tasks, such as image classification, object detection, semantic segmentation, and pose estimation.
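As an illustration, a minimal CNN of this kind can be expressed in a few lines of TensorFlow/Keras. The 48x48 grayscale input and the seven emotion classes follow common facial-expression datasets; all layer sizes below are illustrative assumptions, not a tuned architecture:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),                 # grayscale face crop (assumed size)
    layers.Conv2D(32, (3, 3), activation="relu"),    # convolution + ReLU activation
    layers.MaxPooling2D((2, 2)),                     # pooling halves the spatial size
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),           # 7 emotion classes (assumed)
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```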

CNNs can also be combined with other techniques, such as optical flow,
pretrained models, and extreme learning machines, to enhance the feature
extraction and classification capabilities for facial emotion detection and
counting.

In this project, we aim to develop a real-time facial emotion detection and counting system using CNNs for videos captured from aerial platforms or crowded scenes. We will review the existing literature on this topic, compare different CNN architectures and methods, implement and evaluate our proposed system, and discuss the challenges and future directions.

1.2 Purpose:

The purpose of our project is to develop a real-time facial emotion detection and counting system using convolutional neural networks (CNNs) for videos captured from aerial platforms or crowded scenes. This system will have the following benefits:

• It will enable timely identification of persons, recognition of human activity, and scene analysis.

• It will improve the performance and robustness of facial emotion detection and counting systems by using deep learning methods based on CNNs.

• It will reduce the computational cost and detection time of running a CNN by using a feature-based layered pre-filter.

1.3 Significance of the Project:

The significance of our project is to demonstrate the potential of using convolutional neural networks (CNNs) for real-time facial emotion detection and counting in videos captured from aerial platforms or crowded scenes. Our project will contribute to the following aspects:

1. It will provide a robust and efficient solution for facial emotion detection and tracking in noisy and occluded environments, by using data augmentation techniques, a softmax layer, and an integrated loss function.
2. It will enhance the precision and recall of facial emotion detection by using a feature-based layered pre-filter, which fuses a CNN with a layered classifier and filters out unnecessary objects.
3. It will enable concurrent crowd management and social distancing monitoring by using a centroid tracker, which assigns an ID to each person and counts the number of people in the scene.
4. It will perform human face emotion recognition using a CNN over temporal images, by using a hierarchical action structure with three levels: action layer, motion layer, and posture layer.
2. Related Work

2.1 Literature Survey:

1. Gesture Recognition Technologies: Investigates computer vision and machine learning applications, crucial for real-time recognition of finger-spelling gestures.

2. Sign Language Recognition Systems: Explores neural network and image processing techniques, informing the design of the current project's recognition system.

3. Neural Networks in Human-Computer Interaction: Examines the extensive use of neural networks in gesture recognition, aiding in optimizing the proposed Convolutional Neural Network (CNN) model.

4. Accessibility Tools for DHH Individuals: Studies the development of tools enhancing accessibility for Deaf and hard-of-hearing individuals, guiding the project's user interface design.

5. Real-time Translation Systems: Analyzes real-time translation systems for sign language, shaping the project's goal of breaking down communication barriers.

2.2 Gap Identified:

Despite the advances in facial emotion detection and counting using convolutional neural networks (CNNs), there are still some gaps and challenges that need to be addressed. Some of the gaps identified in the current literature and practice are:

• Most of the existing methods for facial emotion detection and counting are designed for videos captured from static cameras or ground platforms, and they do not perform well for videos captured from aerial platforms or non-static cameras, which have different characteristics such as varying altitudes, camera motion, occlusion, and scale changes.

• Most of the existing methods for facial emotion detection and counting rely on a single CNN model, which can be computationally expensive and time-consuming to run on a mobile robot, and they do not exploit the complementary information from other feature extraction and classification techniques.

• Most of the existing methods for facial emotion detection and counting do not consider the temporal information or the human actions in the videos, which can provide useful clues for distinguishing humans from other objects and for understanding human behavior and scene context.

Therefore, there is a need for a novel and robust method for facial emotion detection and counting that can handle videos captured from aerial platforms or non-static cameras, that can reduce the computational cost and detection time of running a CNN, and that can incorporate the temporal information and the human actions in the videos.

3. Problem Statement and Objectives

3.1 Problem Statement:

Crowd detection is the task of locating and counting people in an image or video,
which has applications in security, surveillance, and crowd management. However,
crowd detection is a challenging problem due to the following factors:

• Occlusion: People in a crowd may partially or completely occlude each other, making it difficult to identify and count them individually.

• Perspective distortion: The size and shape of people in a crowd may vary depending on their distance from the camera, resulting in different scales and aspect ratios.

• Background clutter: The background of a crowd image may contain objects or scenes that are similar to the appearance of people, such as signs, trees, or buildings, causing false positives or false negatives.

• Variation in appearance: People in a crowd may have different poses, clothing, hairstyles, and accessories, making it hard to distinguish them from each other.

3.2 Objectives:

Some of the major objectives of this project are:

- To design and implement a convolutional neural network (CNN) that can accurately detect and count the number of people in a crowded scene.

- To explore the potential applications and challenges of crowd detection using CNNs, such as traffic management.

- To compare the performance of the proposed CNN model with other existing methods of crowd detection, such as density estimation, regression, and detection.

3.3 Scope:

The scope of our project is to design and implement a crowd detection system using Convolutional Neural Networks (CNNs), which are a type of deep learning model that can learn to extract high-level features from images and videos.

Our system will be able to (a short sketch of steps 4 to 6 follows this list):

1. Take an image or video feed of a crowded area as input.
2. Preprocess the input to enhance the quality and remove noise.
3. Extract features from the input using a pre-trained CNN backbone.
4. Generate a density map of the input using a U-Net architecture.
5. Estimate the number of people in the crowd by summing up the values in the density map.
6. Evaluate the performance of the system using mean absolute error and mean squared error metrics.
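As a sketch of steps 4 to 6: the count is simply the sum of the predicted density map, and the evaluation metrics reduce to a few lines of NumPy. The function names here are illustrative assumptions; the density map itself would come from the U-Net of step 4:

```python
import numpy as np

def count_from_density_map(density_map: np.ndarray) -> float:
    # Each pixel of the density map holds a fractional person count,
    # so the estimated crowd count is the sum over all pixels (step 5).
    return float(density_map.sum())

def evaluate(predicted_counts, true_counts):
    # Step 6: mean absolute error and mean squared error over a test set.
    pred = np.asarray(predicted_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    mae = np.abs(pred - true).mean()
    mse = ((pred - true) ** 2).mean()
    return mae, mse
```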

4. Overall Description

4.1 Product Perspective:

The proposed system is a real-time facial emotion detection and counting system using convolutional neural networks (CNNs) for videos captured from aerial platforms or crowded scenes. The system is a standalone product that can be installed on a computer or a mobile device with a camera. The system can also be integrated with other applications or systems that require facial emotion detection and counting functionality, such as video surveillance, security, traffic analysis, crowd management, and social distancing monitoring.

4.2 Product Functions:

The system has the following main functions or features:

1. Facial emotion detection: The system can detect humans in videos captured from aerial platforms or crowded scenes by using a single shot detector (SSD) with a MobileNet backbone as the CNN model. The system can handle varying altitudes, camera motion, occlusion, and scale changes.

2. Human counting: The system can count the number of humans in the scene by using a centroid tracker that assigns an ID to each person and calculates the center of each bounding box (see the sketch after this list).

3. Social distancing monitoring: The system can monitor social distancing violations among the humans by using a clustering algorithm that measures the distance between the centroids of the bounding boxes and compares it with a threshold value.

4. Face detection: The system can detect the faces of the humans in the scene by using a face detection algorithm that locates the regions of interest (ROI) within the bounding boxes.
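A minimal sketch of the centroid tracker behind function 2 is given below. It assumes simple nearest-neighbour matching between frames; a production tracker would also handle people who disappear for several frames. The class and parameter names are illustrative:

```python
import math

class CentroidTracker:
    """Sketch of a centroid tracker: assigns an ID to each person and
    matches new detections to existing IDs by nearest centroid."""

    def __init__(self, max_distance=50.0):
        self.next_id = 0
        self.objects = {}            # ID -> (cx, cy) of last known centroid
        self.max_distance = max_distance

    def update(self, boxes):
        # boxes: list of (x, y, w, h) bounding boxes from the detector.
        centroids = [(x + w / 2.0, y + h / 2.0) for (x, y, w, h) in boxes]
        updated = {}
        for c in centroids:
            # Match to the closest existing ID within max_distance.
            best_id, best_dist = None, self.max_distance
            for oid, prev in self.objects.items():
                d = math.dist(c, prev)
                if d < best_dist and oid not in updated:
                    best_id, best_dist = oid, d
            if best_id is None:      # no match -> new person, new ID
                best_id = self.next_id
                self.next_id += 1
            updated[best_id] = c
        self.objects = updated
        return updated               # len(updated) is the current count
```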

4.3 User Characteristics:


1. Researchers and developers who are interested in facial emotion detection and counting using convolutional neural networks (CNNs) for videos captured from aerial platforms or crowded scenes. They have a background in computer vision, deep learning, and video processing, and they can use the system to test and evaluate different CNN models and methods for facial emotion detection and counting.

4.4 Hardware and Software Requirements:

Minimum HARDWARE REQUIREMENTS

• Windows / Linux / macOS device
• Minimum of 4 GB RAM

Minimum SOFTWARE REQUIREMENTS

• Python, with the OpenCV and TensorFlow libraries described in the appendices

5. Proposed Work

5.1 Functional Requirements:

1. The system shall detect humans in videos captured from aerial platforms or crowded scenes, by using a single shot detector (SSD) with a MobileNet backbone as the CNN model.
2. The system shall count the number of humans in the scene, by using a centroid tracker that assigns an ID to each person and calculates the center of each bounding box.
3. The system shall monitor social distancing violations among the humans, by using a clustering algorithm that measures the distance between the centroids of the bounding boxes and compares it with a threshold value (a sketch follows this list).
4. The system shall detect the faces of the humans in the scene, by using a face detection algorithm that locates the regions of interest (ROI) within the bounding boxes.
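A sketch of the distance check behind requirement 3, operating on the centroid dictionary produced by the tracker in Section 4.2. The pixel threshold is an illustrative assumption and would need calibration against real-world distance:

```python
import math
from itertools import combinations

def social_distance_violations(centroids, threshold_px=100.0):
    # centroids: dict mapping person ID -> (cx, cy), as returned by the
    # centroid tracker. Flags every pair closer than the threshold.
    violations = []
    for (id_a, c_a), (id_b, c_b) in combinations(centroids.items(), 2):
        if math.dist(c_a, c_b) < threshold_px:
            violations.append((id_a, id_b))
    return violations
```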

5.2 Non-functional Requirements:

5.2.1 Performance Requirements


1. Accuracy
The system shall achieve a high accuracy of facial emotion detection and counting,
by using a single shot detector (SSD) mobile net as the CNN model, a feature-based
layered pre-filter, a centroid tracker, and a hierarchical action structure. The system
shall have a minimum accuracy of 90% for facial emotion detection, 95% for
human counting, 85% for social distancing monitoring, 80% for face detection, 75%
for face mask classification, and 70% for human action recognition.

2. Speed
The system shall achieve a high speed of facial emotion detection and counting, by using a feature-based layered pre-filter, a centroid tracker, and a hierarchical action structure. The system shall have a minimum speed of 15 frames per second (FPS) for facial emotion detection and counting, 10 FPS for social distancing monitoring, 8 FPS for face detection, 6 FPS for face mask classification, and 4 FPS for human action recognition.
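The FPS targets above can be checked with a simple timing harness such as the sketch below, where process_frame is a placeholder for whichever pipeline stage is being measured:

```python
import time

def measure_fps(process_frame, frames):
    # Run the pipeline stage over a batch of frames and report throughput.
    start = time.time()
    for frame in frames:
        process_frame(frame)
    elapsed = time.time() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")
```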

3. Scalability
The system shall be able to handle videos with different resolutions, frame rates,
and formats, by using a video processing module that can resize, crop, rotate, and
convert the videos according to the system requirements. The system shall also be
able to handle videos with different numbers of humans, from a few to hundreds, by
using a feature-based layered pre-filter that can filter out unnecessary objects and
reduce the computational cost and detection time of running a CNN.

4. Reliability
The system shall be able to operate reliably and consistently, by using a robust
CNN model, a feature-based layered pre-filter, a centroid tracker, and a hierarchical
action structure. The system shall also be able to handle errors and exceptions, such
as missing or corrupted frames, network failures, power failures, and hardware
failures, by using an error-handling module that can detect, report, and recover from
the errors and exceptions.

5.2.2 Safety Requirements


1. Privacy
The system shall respect the privacy of the humans in the videos, by using a
face detection algorithm that locates the regions of interest (ROI) within the
bounding boxes, and a face mask classification algorithm that classifies the
faces as wearing a mask or not wearing a mask. The system shall also encrypt
the videos and the output data, and store them securely in a database. The
system shall also comply with the relevant laws and regulations regarding
data protection and privacy.

2. Security
The system shall protect the system and the data from unauthorized access, modification, or deletion, by using an authentication module that requires the user to enter a username and a password to access the system and the data. The system shall also use a firewall and antivirus software to prevent malware attacks and cyberattacks. The system shall also comply with the relevant laws and regulations regarding data security and cybersecurity.
3. Ethics
The system shall adhere to the ethical principles and values of the society and
the profession, by using a facial emotion detection and counting system that
is fair, transparent, accountable, and responsible. The system shall also
respect the human dignity, rights, and interests of the humans in the videos,
and avoid any harm or discrimination to them. The system shall also comply
with the relevant laws and regulations regarding data ethics
and human ethics.

5.2.3 Security Requirements


1. Authentication
The system shall require the user to enter a username and a password to access the
system and the data. The system shall verify the user’s identity and grant or deny
access accordingly. The system shall also prevent unauthorized users from
accessing the system and the data by using a lockout mechanism that blocks the
user after a certain number of failed login attempts.

2. Encryption
The system shall encrypt the videos and the output data, and store them securely in
a database. The system shall use a strong encryption algorithm and a secret key to
encrypt and decrypt the data. The system shall also protect the secret key from
being exposed or stolen by using a key management system that generates, stores,
and distributes the key.
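One way to satisfy this requirement (an assumption on our part, since the SRS does not mandate a specific library) is the Fernet recipe from the Python cryptography package, which provides authenticated symmetric encryption:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, obtained from the key
fernet = Fernet(key)          # management system, never hard-coded

ciphertext = fernet.encrypt(b"detection results for frame 1024")
plaintext = fernet.decrypt(ciphertext)
assert plaintext == b"detection results for frame 1024"
```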

3. Firewall
The system shall use a firewall to prevent unauthorized network access to the
system and the data. The firewall shall filter the incoming and outgoing network
traffic based on predefined rules and policies. The firewall shall also block any
suspicious or malicious network packets that may harm the system or the data.

6. Other Requirements

6.1 Design Constraints:

• Camera Quality: The resolution and frame rate of the camera used to capture the video sequences should be high enough to ensure clear and smooth images of the human subjects.

• Processing Power: The computational resources available for the system should be sufficient to support real-time image processing and CNN inference, without causing significant delays or errors.

• CNN Architecture: The CNN model used for facial emotion detection should be carefully designed and optimized to achieve high accuracy and efficiency, while avoiding overfitting or underfitting problems.

• Dataset Quality: The dataset used to train and test the CNN model should be large and diverse enough to cover various scenarios, such as different backgrounds, lighting conditions, poses, and occlusions of human subjects.

• User Interface: The user interface of the system should be intuitive and user-friendly, allowing users to easily interact with the system and obtain the desired results.

7. Methodology

7.1 Proposed System:


The proposed system aims to develop a real-time face emotion detection and counting system using convolutional neural networks (CNNs). The system will use a camera or a video source to capture the scenes of interest, and then apply a CNN-based model to detect and count the humans in the images or frames. The system will also handle the challenges of occlusions, illumination variations, and non-static camera movements.
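A minimal sketch of the per-frame loop is shown below; detect_emotions is a hypothetical placeholder for the CNN inference step, stubbed out here so the sketch runs as written:

```python
import cv2

def detect_emotions(frame):
    # Hypothetical placeholder: a trained CNN would return a list of
    # (x, y, w, h, label) tuples, one per detected face.
    return []

def run(source=0):
    cap = cv2.VideoCapture(source)           # camera index or video path
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        for (x, y, w, h, label) in detect_emotions(frame):
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, label, (x, y - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        cv2.imshow("Face Emotion Recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit
            break
    cap.release()
    cv2.destroyAllWindows()
```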

7.2 System Flowchart:

7.3 Block Diagram:

7.4 Data Flow Diagram:

7.5 ER Diagram:

7.6 Sequence Diagram:
8. References:

List of papers, books, and websites referred to for the project:

1. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks".

2. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, "You Only Look Once: Unified, Real-Time Object Detection".

3. Vahid Bahri, Amir Zarezade, Shohreh Kasaei, "Real-Time Facial Emotion Detection in Surveillance Videos Using Deep Learning".

4. Kaggle: https://www.kaggle.com/search?q=Human+detection+dataset

5. Jan Erik Solem, Programming Computer Vision with Python, 1st Edition, O'Reilly, 2012.

6. Adrian Kaehler and Gary Rost Bradski, Learning OpenCV, O'Reilly, 2008.

7. Giancarlo Zaccone, Md. Rezaul Karim, Ahmed Menshawy, Deep Learning with TensorFlow, 2017.

8. https://towardsdatascience.com/the-most-intuitive-and-easiest-guide-for-convolutional-neural-network-3607be47480

9. https://medium.com/dataseries/basic-overview-of-convolutional-neural-network-cnn-4fcc7dbb4f17

9. Appendices:

Open CV

OpenCV is a huge open-source library for computer vision, machine learning, and image processing, and it now plays a major role in the real-time operation that is very important in today's systems. By using it, one can process images and videos to identify objects, faces, or even human handwriting. When integrated with libraries such as NumPy, Python can process the OpenCV array structure for analysis. To identify image patterns and their various features, we use vector space and perform mathematical operations on these features.
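A short sketch of this NumPy integration: an OpenCV image is an ordinary NumPy array, so preprocessing for a CNN is a matter of standard array operations (the file name and the 48x48 input size are assumptions):

```python
import cv2
import numpy as np

image = cv2.imread("face.jpg")                   # loads as a NumPy array (BGR)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # convert to grayscale
resized = cv2.resize(gray, (48, 48))             # assumed CNN input size
normalized = resized.astype(np.float32) / 255.0  # scale pixels to [0, 1]
print(type(normalized), normalized.shape)        # <class 'numpy.ndarray'> (48, 48)
```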

AdaBoost

There is a set of features that captures certain facial structures, such as the eyebrows, the bridge between the eyes, or the lips. Originally, however, the feature set was not limited to these: it contained approximately 180,000 features, which were reduced to about 6,000. The Viola-Jones face detector used a boosting technique called AdaBoost, in which each of these 180,000 features was applied to the images separately to create weak learners.
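OpenCV ships Haar cascades trained with exactly this AdaBoost procedure, so the detector can be used without retraining. A standard usage sketch (the input file name is an assumption):

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("group_photo.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Each detection is a face that passed every stage of the boosted cascade.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 0), 2)
print(f"Detected {len(faces)} face(s)")
```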

Convolutional Neural Networks (CNNs)
Convolutional neural networks, a variation of multilayer perceptrons, are designed to minimize preprocessing requirements. They are characterized by a shared-weights architecture and translation invariance. Key aspects of CNNs include:

• Inspiration from Biology: Inspired by biological processes and the organization of the animal visual cortex.

• Shift Invariance: Also known as shift-invariant or space-invariant artificial neural networks (SIANN).

• Minimal Preprocessing: Require relatively little preprocessing compared to other image classification algorithms.

• Learned Filters: Learn filters, eliminating the need for the hand-engineered features of traditional algorithms.

• Applications: Widely used in image and video recognition, recommender systems, image classification, medical image analysis, and natural language processing.

TensorFlow
TensorFlow, an open-source software library, facilitates dataflow programming for
various tasks, including machine learning applications such as neural networks.
Developed by the Google Brain team, TensorFlow features:

• Symbolic Math Library: Supports dataflow programming and is a symbolic math library.

• Machine Learning: Used for machine learning applications, including neural networks.

• Open Source: Released under the Apache 2.0 open-source license on November 9, 2015.

• Flexible Architecture: Allows computation deployment across diverse platforms, including CPUs, GPUs, TPUs, and mobile devices.

• Cross-Platform: Available on 64-bit Linux, macOS, Windows, Android, and iOS.
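A minimal dataflow example: the same few lines run unchanged on a CPU, GPU, or TPU backend, with TensorFlow placing the matrix-multiplication operation on whatever device is available:

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
c = tf.matmul(a, b)      # a dataflow node: matrix multiplication
print(c.numpy())         # [[19. 22.] [43. 50.]]
```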
