Final SRS-3
Kolhapur
Established under section 2(f) of UGC Act 1956
Sanjay Ghodawat University Act XL of 2017 of Govt. of Maharashtra | Approved by PCI, COA & AICTE
Project SRS
On
“Face Emotion Recognition Using Deep Learning”
A report submitted in partial fulfillment of the requirements for the
Project Phase I
Under Supervision of
Dr. Chetan Arage
CERTIFICATE
This is to certify that the project synopsis entitled “Face Emotion Recognition” submitted
By
Harshad Nivas Patil PRN No: 21ST114282023
Prathmesh Babalu Bhat PRN No: 20ST114281004
Pratish Akash Kavade PRN No: 20ST114281026
2. Related Work
    2.1 Literature Survey
    2.2 Gap Identified
3. Problem Statement and Objectives
    3.1 Problem Statement
    3.2 Objectives
    3.3 Scope
4. Overall Description
    4.1 Product Perspective
    4.2 Product Functions
    4.3 User Characteristics
    4.4 Hardware and Software Requirements
5. Proposed Work
    5.1 Functional Requirements
    5.2 Non-functional Requirements
        5.2.1 Performance requirements
        5.2.2 Safety requirements
        5.2.3 Security requirements
6. Other Requirements
    6.1 Design Constraints
7. Methodology
    7.1 Proposed System
    7.2 System Flowchart
    7.3 Block Diagram
    7.4 Data Flow Diagram
    7.5 ER Diagram
    7.6 Sequence Diagram
8. References
9. Appendices
Therefore, deep learning methods based on CNNs have been proposed to improve
the performance and robustness of facial emotion detection and counting
systems.
CNNs are a type of artificial neural network that can learn hierarchical
representations of visual data by applying multiple layers of convolution,
pooling, and activation functions. CNNs have achieved state-of-the-art results in
various computer vision tasks, such as image classification, object detection,
semantic segmentation, and pose estimation.
CNNs can also be combined with other techniques, such as optical flow,
pretrained models, and extreme learning machines, to enhance the feature
extraction and classification capabilities for facial emotion detection and
counting.
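The convolution, pooling, and activation layers named above can be sketched in plain NumPy. This is an illustrative toy, not the project's model; the 6x6 image and the simple edge kernel are made up for demonstration.

```python
# Illustrative sketch of the three CNN building blocks described above:
# convolution, activation, and pooling, written in plain NumPy.
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with a kernel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    """ReLU activation: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling, halving spatial resolution for size=2."""
    h = (x.shape[0] // size) * size
    w = (x.shape[1] // size) * size
    x = x[:h, :w].reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))

# One conv -> ReLU -> pool stage on a toy 6x6 "image".
image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])  # simple vertical-edge filter
feature_map = max_pool(relu(conv2d(image, edge_kernel)))
print(feature_map.shape)  # (2, 2)
```

A real network stacks several such stages, with the kernels learned from data rather than hand-written as here.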
1.2 Purpose:
The purpose of this project is to develop a real-time facial emotion detection and
counting system using convolutional neural networks (CNNs) for videos captured
from aerial platforms or crowded scenes. This system will have the following
benefits:
1. It will reduce the computational cost of running a CNN by using a
feature-based layered pre-filter.
2. It will provide a robust and efficient solution for facial emotion detection and
tracking in noisy and occluded environments by using data augmentation
techniques, a softmax layer, and an integrated loss function.
3. It will enhance the precision and recall of facial emotion detection by using a
feature-based layered pre-filter, which fuses a CNN with a layered classifier and
filters out unnecessary objects.
4. It will enable concurrent crowd management and social distancing monitoring
by using a centroid tracker, which assigns an id to each person and counts the
number of people in the scene.
5. It will perform human face emotion recognition using a CNN over temporal
images by using a hierarchical action structure, which includes three levels:
action layer, motion layer, and posture layer.
2. Related Work
2.1 Literature Survey:
Studies the development of tools enhancing accessibility for Deaf and
hard-of-hearing individuals, guiding the project's user interface design.
2.2 Gap Identified:
• Most of the existing methods for facial emotion detection and counting are
designed for videos captured from static cameras or ground platforms, and
they do not perform well for videos captured from aerial platforms or non-static
cameras, which have different characteristics such as varying altitudes,
camera motion, occlusion, and scale changes.
• Most of the existing methods for facial emotion detection and counting rely
on a single CNN model, which can be computationally expensive and time-
consuming to run on a mobile robot, and they do not exploit the
complementary information from other feature extraction and classification
techniques.
• Most of the existing methods for facial emotion detection and counting do not
consider the temporal information or the human actions in the videos, which
can provide useful clues for distinguishing humans from other objects and for
understanding the human behavior and scene context.
• Therefore, there is a need for a novel and robust method for facial emotion
detection and counting that can handle videos captured from aerial platforms
or non-static cameras, that can reduce the computational cost of running a
CNN, and that can incorporate the temporal information and the
human actions in the videos.
3. Problem Statement and Objectives
3.1 Problem Statement:
Crowd detection is the task of locating and counting people in an image or video,
which has applications in security, surveillance, and crowd management. However,
crowd detection is a challenging problem due to the following factors:
• Perspective distortion: The size and shape of people in a crowd may vary
depending on their distance from the camera, resulting in different scales and
aspect ratios.
3.2 Objectives:
Some of the major objectives of this project are:
- To design and implement a convolutional neural network (CNN) that can
accurately detect and count the number of people in a crowded scene.
- To compare the performance of the proposed CNN model with other existing
methods of crowd detection, such as density estimation, regression, and detection.
3.3 Scope:
The scope of our project is to design and implement a crowd detection system using
Convolutional Neural Networks (CNNs), which are a type of deep learning model
that can learn to extract high-level features from images and videos.
4. Overall Description
1. Facial emotion detection: The system can detect humans in videos captured from
aerial platforms or crowded scenes by using a single shot detector (SSD)
MobileNet as the CNN model. The system can handle varying altitudes, camera
motion, occlusion, and scale changes.
2. Human counting: The system can count the number of humans in the scene by
using a centroid tracker that assigns an id to each person and calculates the center
of each bounding box.
3. Social distancing monitoring: The system can monitor social distancing
violations among the humans by using a clustering algorithm that measures the
distance between the centroids of the bounding boxes and compares it with a
threshold value.
4. Face detection: The system can detect the faces of the humans in the scene by
using a face detection algorithm that locates the regions of interest (ROI) within
the bounding boxes.
• Frontend - Python
5. Proposed Work
5.1 Functional Requirements:
1. The system shall detect humans in videos captured from aerial platforms or
crowded scenes by using a single shot detector (SSD) MobileNet as the CNN
model.
2. The system shall count the number of humans in the scene by using a centroid
tracker that assigns an id to each person and calculates the center of each bounding
box.
3. The system shall monitor social distancing violations among the humans by
using a clustering algorithm that measures the distance between the centroids of the
bounding boxes and compares it with a threshold value.
4. The system shall detect the faces of the humans in the scene by using a face
detection algorithm that locates the regions of interest (ROI) within
the bounding boxes.
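The counting and distancing requirements above can be sketched together. This is a minimal illustration assuming (x1, y1, x2, y2) pixel bounding boxes and a hypothetical 75-pixel threshold; it stands in for, and is not, the project's actual centroid tracker and clustering step.

```python
# Illustrative sketch (not the project's tracker): compute bounding-box
# centroids, assign sequential ids, and flag pairs of people closer than
# a pixel-distance threshold as social-distancing violations.
from itertools import combinations
import math

def centroid(box):
    """Center point of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def count_and_check(boxes, min_dist=75.0):
    """Return (person count, list of (id_a, id_b) pairs violating min_dist)."""
    centers = {pid: centroid(box) for pid, box in enumerate(boxes)}
    violations = []
    for (a, ca), (b, cb) in combinations(centers.items(), 2):
        if math.dist(ca, cb) < min_dist:
            violations.append((a, b))
    return len(centers), violations

# Three detected people; the first two stand close together.
boxes = [(0, 0, 50, 100), (40, 0, 90, 100), (300, 0, 350, 100)]
count, pairs = count_and_check(boxes)
print(count, pairs)  # 3 [(0, 1)]
```

A production tracker would additionally match ids across frames (e.g. by nearest-centroid association) rather than renumbering each frame as this sketch does.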
2. Speed
The system shall achieve a high speed of facial emotion detection and counting, by
using a feature-based layered pre-filter, a centroid tracker, and a hierarchical action
structure. The system shall have a minimum speed of 15 frames per second (FPS)
for facial emotion detection and counting, 10 FPS for social distancing monitoring,
8 FPS for face detection, 6 FPS for face mask classification, and 4 FPS for human
action recognition.
3. Scalability
The system shall be able to handle videos with different resolutions, frame rates,
and formats, by using a video processing module that can resize, crop, rotate, and
convert the videos according to the system requirements. The system shall also be
able to handle videos with different numbers of humans, from a few to hundreds, by
using a feature-based layered pre-filter that can filter out unnecessary objects and
reduce the computational cost of running a CNN.
4. Reliability
The system shall be able to operate reliably and consistently, by using a robust
CNN model, a feature-based layered pre-filter, a centroid tracker, and a hierarchical
action structure. The system shall also be able to handle errors and exceptions, such
as missing or corrupted frames, network failures, power failures, and hardware
failures, by using an error-handling module that can detect, report, and recover from
the errors and exceptions.
2. Security
The system shall protect the system and the data from unauthorized access,
modification, or deletion, by using an authentication module that requires the
user to enter a username and a password to access the system and the data.
The system shall also use a firewall and antivirus software to prevent
malware attacks and cyberattacks. The system shall also comply with the
relevant laws and regulations regarding data security and cybersecurity.
3. Ethics
The system shall adhere to the ethical principles and values of society and
the profession, by using a facial emotion detection and counting system that
is fair, transparent, accountable, and responsible. The system shall also
respect the dignity, rights, and interests of the humans in the videos,
and avoid any harm or discrimination to them. The system shall also comply
with the relevant laws and regulations regarding data ethics
and human ethics.
2. Encryption
The system shall encrypt the videos and the output data, and store them securely in
a database. The system shall use a strong encryption algorithm and a secret key to
encrypt and decrypt the data. The system shall also protect the secret key from
being exposed or stolen by using a key management system that generates, stores,
and distributes the key.
3. Firewall
The system shall use a firewall to prevent unauthorized network access to the
system and the data. The firewall shall filter the incoming and outgoing network
traffic based on predefined rules and policies. The firewall shall also block any
suspicious or malicious network packets that may harm the system or the data.
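The rule-based filtering described above can be illustrated in miniature. A deployed system would rely on an actual firewall rather than application code, and the Rule fields, addresses, and ports below are hypothetical.

```python
# Illustrative sketch of rule-based packet filtering as described above.
# The Rule fields and example addresses are hypothetical; a real firewall
# matches on full CIDR ranges, protocols, and connection state.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    action: str             # "allow" or "block"
    src_prefix: str         # match packets whose source IP starts with this
    port: Optional[int]     # match a specific port, or None for any port

RULES = [
    Rule("block", "10.0.66.", None),   # block a suspicious subnet entirely
    Rule("allow", "10.0.", 443),       # allow internal HTTPS traffic
]

def allowed(src_ip, port, rules=RULES, default=False):
    """First matching rule wins; unmatched packets fall back to `default` (deny)."""
    for rule in rules:
        if src_ip.startswith(rule.src_prefix) and rule.port in (None, port):
            return rule.action == "allow"
    return default

print(allowed("10.0.66.5", 443))   # False: blocked subnet
print(allowed("10.0.1.20", 443))   # True: internal HTTPS
print(allowed("8.8.8.8", 80))      # False: no rule matches, default deny
```

Ordering matters: the broad "block" rule is listed before the "allow" rule so the suspicious subnet cannot slip through on port 443.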
6. Other Requirements
• Camera Quality: The resolution and frame rate of the camera used to capture
the video sequences should be high enough to ensure clear and smooth
images of the human subjects.
• CNN Architecture: The CNN model used for facial emotion detection
should be carefully designed and optimized to achieve high accuracy and
efficiency, while avoiding overfitting or underfitting problems.
• Dataset Quality: The dataset used to train and test the CNN model should
be large and diverse enough to cover various scenarios, such as different
backgrounds, lighting conditions, poses, and occlusions of human subjects.
• User Interface: The user interface of the system should be intuitive and user-
friendly, allowing the users to easily interact with the system and obtain the
desired results.
7. Methodology
7.3 Block Diagram:
7.4 Data Flow Diagram:
7.5 ER Diagram:
7.6 Sequence Diagram:
8. References:
2. "You Only Look Once: Unified, Real-Time Object Detection", Joseph Redmon,
Santosh Divvala, Ross Girshick, Ali Farhadi.
4. Kaggle: https://www.kaggle.com/search?q=Human+detection+dataset
5. Programming Computer Vision with Python, 1st Edition, Jan Erik Solem,
O'Reilly, 2012.
6. Learning OpenCV, Adrian Kaehler and Gary Rost Bradski, O'Reilly, 2008.
7. Deep Learning with TensorFlow, Giancarlo Zaccone, Md. Rezaul Karim, Ahmed
Menshawy, 2017.
8. https://towardsdatascience.com/the-most-intuitive-and-easiest-guide-for-convolutional-neural-network-3607be47480
9. https://medium.com/dataseries/basic-overview-of-convolutional-neural-network-cnn-4fcc7dbb4f17
9. Appendices:
OpenCV
OpenCV is a huge open-source library for computer vision, machine
learning, and image processing, and it now plays a major role in the real-time
operation that is very important in today's systems. By using it, one can process
images and videos to identify objects, faces, or even the handwriting of a human.
When it is integrated with various libraries, such as NumPy, Python is capable of
processing the OpenCV array structure for analysis. To identify image patterns and
their various features we use vector spaces and perform mathematical operations on
these features.
AdaBoost
There’s a set of features which would capture certain facial structures like
eyebrows, the bridge between the eyes, or the lips. But originally the
feature set was not limited to this: it contained approximately 180,000 features,
which were later reduced to 6,000.
The authors used a boosting technique called AdaBoost, in which each of these
180,000 features was applied to the images separately to create weak learners.
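The boosting idea can be sketched with simple threshold "stumps" as the weak learners. This toy uses a 1-D feature and made-up data, not the Viola-Jones rectangle features; it only illustrates how AdaBoost reweights samples and combines weak learners.

```python
# Illustrative AdaBoost sketch: threshold stumps as weak learners on a
# 1-D toy feature (not the Viola-Jones feature set described above).
import numpy as np

def train_adaboost(x, y, rounds=5):
    """x: 1-D feature values, y: labels in {-1, +1}. Returns weighted stumps."""
    n = len(x)
    w = np.full(n, 1.0 / n)              # start with uniform sample weights
    stumps = []
    for _ in range(rounds):
        best = None
        for thr in x:                     # candidate thresholds
            for sign in (1, -1):          # stump predicts sign * (+1 if x > thr else -1)
                pred = sign * np.where(x > thr, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, thr, sign, pred)
        err, thr, sign, pred = best
        err = max(err, 1e-10)                    # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # weak-learner weight
        w *= np.exp(-alpha * y * pred)           # boost misclassified samples
        w /= w.sum()
        stumps.append((alpha, thr, sign))
    return stumps

def predict(stumps, x):
    """Weighted vote of all stumps."""
    total = sum(a * s * np.where(x > t, 1, -1) for a, t, s in stumps)
    return np.sign(total)

x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([-1, -1, -1, 1, 1, 1])
model = train_adaboost(x, y)
print(predict(model, x))  # matches y on this separable toy set
```

In Viola-Jones, each stump is built on one rectangle feature, and boosting both selects the few thousand useful features and weights them.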
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks, a variation of multilayer perceptrons, are designed to
minimize preprocessing requirements. They are characterized by a shared-weights
architecture and translation invariance. Key aspects of CNNs include:
• Shift Invariance: Also known as shift invariant or space invariant artificial neural
networks (SIANN).
• Learned Filters: The network learns its filters, eliminating the need for the
hand-engineered features used in traditional algorithms.
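The shift tolerance claimed above can be checked in a few lines of NumPy, under the simplifying assumption of a global-max readout: because the same filter is applied at every position, the peak response is the same wherever the pattern appears.

```python
# Small demonstration of shift invariance: the global maximum of a
# convolutional response is unchanged when the input pattern moves,
# because the shared-weight filter is applied at every position.
import numpy as np

def conv_valid(image, kernel):
    """Valid 2-D cross-correlation."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

pattern = np.array([[1.0, 2.0], [3.0, 4.0]])
img = np.zeros((8, 8)); img[1:3, 1:3] = pattern          # pattern at (1, 1)
shifted = np.zeros((8, 8)); shifted[4:6, 3:5] = pattern  # same pattern, shifted

r1 = conv_valid(img, pattern).max()
r2 = conv_valid(shifted, pattern).max()
print(r1 == r2)  # True: the peak response follows the pattern
```

Pooling layers extend the same idea locally, making the response tolerant of small shifts even without a global maximum.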
TensorFlow
TensorFlow, an open-source software library, facilitates dataflow programming for
various tasks, including machine learning applications such as neural networks.
Developed by the Google Brain team, TensorFlow features:
• Open Source: Released under the Apache 2.0 open source license on November 9,
2015.