
PROJECT REPORT

On

FACE-MASK DETECTION USING YOLO V3


ARCHITECTURE

Submitted by

Nisarg Pethani (IU1641050045)


Harshal Vora (IU1641050063)

In fulfillment for the award of the degree

Of

BACHELOR OF TECHNOLOGY
In

COMPUTER ENGINEERING

INSTITUTE OF TECHNOLOGY AND ENGINEERING


INDUS UNIVERSITY CAMPUS, RANCHARDA, VIA-THALTEJ
AHMEDABAD-382115, GUJARAT, INDIA,
WEB: www.indusuni.ac.in
MAY 2020
PROJECT REPORT
ON

FACE-MASK DETECTION USING YOLO V3

ARCHITECTURE

In the partial fulfillment of the requirement


for the degree of
Bachelor of Technology
in
Computer Engineering

PREPARED BY
Nisarg Pethani (IU1641050045)
Harshal Vora (IU1641050063)

UNDER GUIDANCE OF
Internal Guide
Mr. Hiren Mer
Assistant Professor,
Department of Computer Engineering,
I.T.E, Indus University, Ahmedabad

SUBMITTED TO
INSTITUTE OF TECHNOLOGY AND ENGINEERING
INDUS UNIVERSITY CAMPUS, RANCHARDA, VIA-THALTEJ
AHMEDABAD-382115, GUJARAT, INDIA,
WEB: www.indusuni.ac.in
MAY 2020
CANDIDATE’S DECLARATION

I declare that the final semester report entitled “Face-Mask Detection using YOLO V3
Architecture” is my own work conducted under the supervision of the guide Mr. Hiren
Mer.

I further declare that, to the best of my knowledge, the report for the B.Tech final semester
does not contain any part of work that has been submitted for the award of a B.Tech
degree either in this university or any other university without proper citation.

___________________________________
Candidate’s Signature

Nisarg Pethani (IU1641050045)

___________________________________
Guide: Mr. Hiren Mer
Assistant Professor
Department of Computer Engineering,
Indus Institute of Technology and Engineering
INDUS UNIVERSITY– Ahmedabad,
State: Gujarat
CANDIDATE’S DECLARATION

I declare that the final semester report entitled “Face-Mask Detection using YOLO V3
Architecture” is my own work conducted under the supervision of the guide Mr. Hiren
Mer.

I further declare that, to the best of my knowledge, the report for the B.Tech final semester
does not contain any part of work that has been submitted for the award of a B.Tech
degree either in this university or any other university without proper citation.

___________________________________
Candidate’s Signature

Harshal Vora (IU1641050063)

___________________________________
Guide: Mr. Hiren Mer
Assistant Professor
Department of Computer Engineering,
Indus Institute of Technology and Engineering
INDUS UNIVERSITY– Ahmedabad,
State: Gujarat
INDUS INSTITUTE OF TECHNOLOGY AND ENGINEERING
COMPUTER ENGINEERING
2019 -2020

CERTIFICATE

Date: May 10th, 2020

This is to certify that the project work entitled “Face-Mask Detection using YOLO V3
Architecture” has been carried out by Nisarg Pethani, Harshal Vora under my
guidance in partial fulfillment of degree of Bachelor of Technology in COMPUTER
ENGINEERING (Final Year) of Indus University, Ahmedabad during the academic
year 2019 - 2020.

___________________________ ________________________________
Mr. Hiren Mer Dr. Seema Mahajan
Assistant Professor, Head of the Department,
Department of Computer Engineering, Department of Computer Engineering,
I.T.E, Indus University I.T.E, Indus University
Ahmedabad Ahmedabad
ACKNOWLEDGEMENT

Towards the successful completion of our B.Tech in Computer Engineering final year project,
we feel greatly obliged to certain special people.

We are thankful and would like to express our gratitude to our internal guide Mr. Hiren Mer for
his conscientious guidance and for diligently helping us in this endeavor. We are grateful to him for
providing precise milestones to be achieved for our final year project. We also extend our
gratitude to all the teachers who taught us throughout our Engineering studies and thank them for the
knowledge they imparted to us, and for providing suggestions on the existing features
of the project and how they could be improved. Finally, we thank all those who
indirectly helped or contributed towards the completion of our final year project.

- Nisarg Pethani
- Harshal Vora

TABLE OF CONTENTS
Title Page No
ABSTRACT................................................................................................... v
LIST OF FIGURES........................................................................................ vi
LIST OF TABLES......................................................................................... ix
ABBREVIATIONS........................................................................................ x
CHAPTER 1 INTRODUCTION................................................................... 1
1.1 Project Summary.......................................................................... 2
1.2 Project Purpose............................................................................. 2
1.3 Project Scope................................................................................ 3
1.4 Objectives........................................................................ 3
1.5 Technology and Literature Overview.......................................... 4
1.5.1 Python........................................................................... 4
1.5.2 PyTorch......................................................................... 5
1.5.3 PyCharm........................................................................ 5
1.5.4 LabelImg....................................................................... 5
1.5.5 DarkLabel...................................................................... 6
1.6 Synopsis....................................................................................... 6
CHAPTER 2 PROJECT MANAGEMENT................................................... 7
2.1 Project Planning Objectives......................................................... 8
2.1.1 Project Development approach..................................... 8
2.1.2 Resource........................................................................ 8
2.1.2.1 Human Resource............................................ 8
2.1.2.2 Environment Resource................................... 8
2.2 Project Scheduling....................................................................... 8
2.3 Timeline Chart............................................................................. 9
CHAPTER 3 SYSTEM REQUIREMENTS.................................................. 10
3.1 Hardware Requirement................................................................ 11
3.2 Software Requirement.................................................................. 11
3.3 Environment Setup....................................................................... 14
CHAPTER 4 NEURAL NETWORK............................................................ 15
4.1 AI vs ML vs DL........................................................................... 16
4.1.1 Artificial Intelligence.................................................... 16

4.1.2 Machine Learning......................................................... 16


4.1.3 Deep Learning............................................................... 16
4.2 Neural Network............................................................................ 17
4.3 Convolutional Neural Network.................................................... 19
4.3.1 Kernel............................................................................ 20
4.3.2 Pooling.......................................................................... 21
4.3.2.1 Max Pooling................................................... 21
4.3.2.2 Average Pooling............................................. 22
4.4 Related Works.............................................................................. 22
4.4.1 Classification + Regression........................................... 22
4.4.2 Two-Stage Method........................................................ 23
4.4.3 Unified Method............................................................. 23
CHAPTER 5 YOLO...................................................................................... 25
5.1 Introduction.................................................................................. 26
5.2 Related Terms.............................................................................. 26
5.2.1 IOU................................................................................ 26
5.2.2 Anchor Box / Bounding Box......................................... 27
5.2.3 mAP............................................................................... 27
5.2.3.1 Recall.............................................................. 27
5.2.3.2 Precision......................................................... 28
5.2.3.3 mAP................................................................ 28
5.2.4 Threshold....................................................................... 29
5.2.4.1 Conf. Threshold.............................................. 29
5.2.4.2 NMS Threshold.............................................. 29
5.2.5 Activation Function....................................................... 29
5.2.5.1 Sigmoid Function........................................... 29
5.2.5.2 ReLU Function............................................... 30
5.2.5.3 LReLU Function............................................ 31
5.2.6 Loss Function................................................................ 32
5.2.6.1 MSE Loss....................................................... 32
5.2.6.2 BCE Loss....................................................... 33
5.3 Architecture.................................................................................. 33
5.3.1 Convolution Layer........................................................ 34


5.3.2 Shortcut Layer............................................................... 35


5.3.3 Residual Block.............................................................. 35
5.3.4 Upsample Layer............................................................ 35
5.3.5 YOLO Layer................................................................. 35
5.4 Approach: Standard YOLO Vs Self-Modified YOLO................ 35
5.5 Approach...................................................................................... 36
5.5.1 Detection Process.......................................................... 37
5.5.1.1 Bounding Box Evaluation.............................. 37
5.5.2 Thresholding................................................................. 39
5.5.3 Non-Max Suppression................................................... 40
5.5.4 Bounding Box Labelling............................................... 40
5.5.5 Final Results.................................................................. 41
CHAPTER 6 DETAILED DESCRIPTION AND IMPLEMENTATION....... 42
6.1 Dataset.......................................................................................... 43
6.1.1 Raw dataset & Labelling............................................... 43
6.1.1.1 LabelImg........................................................ 43
6.1.1.2 DarkLabel....................................................... 43
6.1.2 Training Dataset............................................................ 44
6.1.2.1 Image File & .txt File................... 44
6.2 Model Description........................................................................ 45
6.2.1 Configuration File......................................................... 45
6.2.1.1 Description..................................................... 45
6.2.1.2 Parsing............................................................ 47
6.2.2 Model Making............................................................... 48
6.2.3 .data File........................................................................ 49
6.2.4 .names File.................................................................... 49
6.2.5 train.txt File................................................................... 50
6.2.6 validate.txt File.............................................................. 50
6.3 Training........................................................................................ 50
6.3.1 Loss Calculation............................................................ 50
6.3.2 Training Process............................................................ 51
6.4 Detection...................................................................................... 51
6.4.1 Standard YOLO Vs Self-Modified YOLO................... 51


6.4.2 Real Time Detection..................................................... 53


6.4.3 Detection In video......................................................... 54
6.4.4 Detection In Image........................................................ 54
6.6 Directory Structure....................................................................... 55
CHAPTER 7 TESTING................................................................................. 56
7.1 Black Box Testing........................................................................ 57
7.2 White Box Testing....................................................................... 58
7.3 Testing Strategy........................................................................... 58
7.4 Test Suites.................................................................................... 59
7.4.1 Test Suite 1.................................................................... 59
7.4.2 Test Suite 2.................................................................... 60
7.4.3 Test Suite 3.................................................................... 60
7.4.4 Test Suite 4.................................................................... 61
7.5 Testing: Challenges & Solution................................................... 61
CHAPTER 8 LIMITATIONS AND FUTURE ENHANCEMENT.............. 63
8.1 Limitations................................................................................... 64
8.2 Future Enhancements................................................................... 64
CHAPTER 9 CONCLUSION........................................................................ 65
9.1 Conclusion.................................................................................... 66
BIBLIOGRAPHY.......................................................................................... 67


ABSTRACT

Object Detection is one of the most emerging and widely studied fields of computer
vision. The goal of object detection is to find objects of certain classes in a given image,
along with their locations, and assign each a respective class label. With the help of
deep learning, the usage and efficiency of object detection systems has increased
tremendously. Our project incorporates state-of-the-art techniques for object detection
that can also be used for real-time object detection.
A major drawback of many object detection pipelines is their dependency on other
computer vision approaches before deep learning is applied, which results in a loss of
performance in the system. In this project we make use of deep learning to solve the
problem of object detection in an end-to-end manner. The network is trained on a self-
developed dataset. The resulting module is very fast and accurate and can also be used for
real-time object detection.


LIST OF FIGURES

Figure No Title Page No.

Figure 1.1 Classification vs Localization vs Detection 3

Figure 2.1 Gantt Chart for Backend System 9

Figure 4.1 AI vs ML vs DL 16

Figure 4.2 Biological Neuron & Artificial Neuron 17

Figure 4.3 Neural Network 18

Figure 4.4 Convolutional Neural Network 19

Figure 4.5 Convolutional Process 20

Figure 4.6 Pooling Process 21

Figure 4.7 Classification + Regression 22

Figure 4.8 Two-Stage Method: Stage 1 23

Figure 4.9 Two-Stage Method: Stage 2 23

Figure 4.10 Unified Method 24

Figure 5.1 Intersect Over Union (IOU) 27

Figure 5.2 Bounding Box 27

Figure 5.3 Precision & Recall 28

Figure 5.4 Sigmoid Activation Function 30

Figure 5.5 ReLU Activation Function 30

Figure 5.6 Leaky ReLU Activation Function 31

Figure 5.7 Yolo v3 Architecture 33

Figure 5.8 Bounding Box Prediction 38

Figure 5.9 Detection Process 39


Figure 5.10 Thresholding 39

Figure 5.11 Non-Max Suppression 40

Figure 5.12 Bounding Box Labeling 41

Figure 5.13 Final Result 41

Figure 6.1 Sample Image File 44

Figure 6.2 .txt File 45

Figure 6.3 Configuration File: Network Information 45

Figure 6.4 Configuration File: Convolutional Layer Information 46

Figure 6.5 Configuration File: Route Layer Information 46

Figure 6.6 Configuration File: Upsample Layer Information 46

Figure 6.7 Configuration File: Shortcut Layer Information 46

Figure 6.8 Configuration File: YOLO Layer Information 47

Figure 6.9 Configuration File Parsing 47

Figure 6.10 YOLO Architecture Making Procedure 48

Figure 6.11 YOLO Architecture as Module_list 49

Figure 6.12 mask_dataset.data File 49

Figure 6.13 mask_dataset.names File 49

Figure 6.14 mask_dataset_train.txt File 50

Figure 6.15 mask_dataset_validate.txt File 50

Figure 6.16 Loss Calculation 51

Figure 6.17 Training Process 51

Figure 6.18 Standard Approach 52

Figure 6.19 Self-Modified Approach 52


Figure 6.20 Standard Approach 52

Figure 6.21 Self-Modified Approach 52

Figure 6.22 Standard Approach 52

Figure 6.23 Self-Modified Approach 52

Figure 6.24 Real Time Face-Mask Detection 53

Figure 6.25 Real Time Face-Mask Detection 53

Figure 6.26 Face-Mask Detection in Video 54

Figure 6.27 Face-Mask Detection in Image 54

Figure 6.28 Project File Structure 55

Figure 7.1 Test suite 1: mAP: 0.64 59

Figure 7.2 Test suite 2: mAP: 0.60 60

Figure 7.3 Test suite 3: mAP: 0.74 60

Figure 7.4 Test suite 4: mAP: 0.78 61


LIST OF TABLES

Table No Title Page No.

Table 1.1 Python Advantages and Disadvantages 4

Table 1.2 Synopsis 6

Table 3.1 Hardware Requirements 11

Table 3.2 Software Requirements 11

Table 3.3 Used Libraries of Python with Description 12


Table 5.1 Standard YOLO Approach Vs Self-Modified YOLO Approach 36

Table 6.1 Standard YOLO Approach Vs Self-Modified YOLO Approach 52


ABBREVIATIONS

Abbreviations used throughout this document are:

AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
NLP Natural Language Processing
YOLO You Only Look Once
PIL Python Imaging Library
CNN Convolutional Neural Network
RCNN Region-based Convolutional Neural Network
SSD Single Shot MultiBox Detector
IOU Intersection Over Union
mAP Mean Average Precision
NMS Non-Max Suppression
ReLU Rectified Linear Unit
LReLU Leaky Rectified Linear Unit
MSE Mean Square Error
BCE Binary Cross Entropy
FPS Frames Per Second
IO Input / Output



CHAPTER 1
INTRODUCTION
 PROJECT SUMMARY
 PROJECT PURPOSE
 PROJECT SCOPE
 PROJECT OBJECTIVES
 TECHNOLOGY AND
LITERATURE OVERVIEW
 SYNOPSIS

1.1 PROJECT SUMMARY

The most complicated problem in this project is to detect whether a person is wearing a
face mask or not, and that involves both classification and localization.
 Image classification, which involves predicting the class of an image.
 The more complicated problem is image localization, where the image has a
single object and the model must predict the class of the object as well as its
location and put a bounding box around it.

An overview of the problem is shown in Fig 1.1.

Fig 1.1 Classification vs Localization vs Detection

Here, in our project the input to the model will be an image or a video (mostly real-time)
and the output will be a bounding box corresponding to each person's face in the
image/video, along with an indication of whether that person is wearing a face mask or not.

1.2 PROJECT PURPOSE

Face Mask detection is an important aspect in the Health care industry and it cannot be
taken lightly.
This project is to help identify face masks as an object in video surveillance cameras
across different places like hospitals, emergency departments, out-patient facilities,
residential care facilities, emergency medical services, and home health care delivery to


provide safety to doctors and patients and reduce the outbreak of disease. The detection of
a face mask needs to happen in real time, so that the necessary actions in case of any
non-compliance can be taken on the spot.

1.3 PROJECT SCOPE

 Airports:
The Face Mask Detection System can be used at airports to detect travelers
without masks. Face data of travelers can be captured in the system at the
entrance. If a traveler is found to be without a face mask, their picture is sent to
the airport authorities so that they could take quick action. If the person’s face is
already stored, like the face of an Airport worker, it can send the alert to the
worker’s phone directly
 Hospitals:
Using Face Mask Detection System, Hospitals can monitor if their staff is wearing
masks during their shift or not. If any health worker is found without a mask, they
will receive a notification with a reminder to wear a mask. Also, for quarantined
people who are required to wear a mask, the system can keep watch, detect whether
the mask is present or not, and automatically send a notification or report to the
authorities.
 Offices:
The Face Mask Detection System can be used at office premises to detect if
employees are maintaining safety standards at work. It monitors employees
without masks and sends them a reminder to wear a mask. The reports can be
downloaded or sent by email at the end of the day to capture people who are not
complying with the regulations or requirements.

1.4 OBJECTIVES

It is not feasible for a human to detect face masks in real time, as there can be hundreds
of instances in a given frame; it would also be very time-consuming and inefficient for a
human to find every subject with or without a mask.


For this reason, we have to build a powerful model that can overcome the problem
of real-time detection and the inefficiency of a human.
The model should also be capable of performing face-mask detection on a real-time
surveillance camera feed, any video, or a set of images.

1.5 TECHNOLOGY AND LITERATURE OVERVIEW

The subsections below present an overview of the technologies that are used
in this project.

1.5.1 Python

Python is an interpreted, object-oriented, high-level, general-purpose


programming language which provides high support for machine learning
& Deep learning algorithms because of its library support.
Python is very simple and easy to learn. It has a syntax that is very easy to
learn and is very easily readable and it is very easily and efficiently
maintainable.
Python supports modules and has a lot of packages, which encourages
modularity and allows code to be reused.

Some of the features of python programming:


 Support for ML & DL Libraries
 Extensible in C and C++
 Interactive
 Dynamic
 Object-oriented

Table 1.1 Python Advantages and Disadvantages

Advantages Disadvantages
Vast libraries support Slow speed
Improved Productivity Not memory efficient
IOT opportunities Weak in Mobile computing


Portable, Free and Open source Design Restrictions


Dynamically typed Database Access
Embeddable Runtime errors

1.5.2 PyTorch

PyTorch is an open-source machine learning library based on the Torch


library. It is highly used in applications such as computer vision and
natural language processing (NLP). It was primarily developed by
Facebook’s AI Research Lab. It is Free and Open Source released under
the BSD license. PyTorch also has a C++ interface.

PyTorch has two main high-level features:


 Tensor computing (like NumPy) with strong acceleration via
graphics processing units (GPUs).
 Deep Neural Networks built on a tape-based automatic
differentiation system.
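
As an illustration of these two features, a minimal sketch (not taken from the project code) that runs a tensor computation on the GPU when one is available and differentiates a simple expression with autograd:

import torch

# Tensor computing: create tensors and move them to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.rand(3, 3, device=device)
b = torch.rand(3, 3, device=device)
c = a @ b  # matrix multiplication, accelerated on the GPU when present

# Tape-based automatic differentiation: operations are recorded and replayed backward
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x
y.backward()
print(c.shape, x.grad)  # x.grad = 2*x + 3 = tensor(7.)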

1.5.3 PyCharm

PyCharm is an integrated development environment (IDE) which is used


in computer programming. Although it supports most of the modern
programming languages, it is mainly used for python programming. It was
developed by a Czech company that goes by the name JetBrains.
Some of its functionalities are code analysis, a graphical debugger, an
integrated unit tester, and integration with version control systems; it also
supports web development with Django as well as data science with Anaconda.
PyCharm is cross-platform, which means it works on Windows, macOS
and Linux.

1.5.4 LabelImg

LabelImg is a graphical annotation tool. It is mainly written in Python and


uses Qt for its graphical interface. Annotations can be saved as a .txt file in


YOLO format, or as a .xml file in the PASCAL VOC format, which is used by
ImageNet.

1.5.5 DarkLabel

DarkLabel is a video annotation tool that makes labeling an object in a


video very simple and efficient. DarkLabel supports rectangular object
annotations and uses linear interpolation as a bounding-shape propagation
technique between frames. It is very easy to handle, with almost no learning
curve, and it is very time efficient and accurate.

1.6 SYNOPSIS

Table 1.2 Synopsis


Project Title Face-mask Detection
Daily work Approximately 5 hours
Time Duration Approximately 3.5 months
Software Specification Python, PyTorch, LabelImg, DarkLabel
Start Date January 17th, 2020
End Date May 6th, 2020



CHAPTER 2
PROJECT
MANAGEMENT
 PROJECT PLANNING
OBJECTIVE
 PROJECT SCHEDULING
 TIMELINE CHART

2.1 PROJECT PLANNING OBJECTIVES

The project is developed at Rajkot and the time duration for completing the project is
from 15th January, 2020 to May 5th, 2020.
During the project development period, we have submitted a report and presentations to
the internal guide on regular intervals whenever required.

2.1.1 Project Development approach

Our project is Face-mask Detection using Deep Learning Algorithm.


The motivation for this project is that machine learning and deep learning
are very fast-growing subjects in the field of computer vision.

2.1.2 Resource

2.1.2.1 Human Resource

The human resources required are


1. Project Guides.
2. Developers.

2.1.2.2 Environment Resource

The environment that supports the software project, often called the
software engineering environment, includes both software and hardware.

2.2 PROJECT SCHEDULING

Project scheduling is one of the most important aspects of any project. Any project must
have a precise schedule before developing it.
When a project developer works on a scheduled project, it is more advantageous for
him/her compared to an unscheduled project. It gives us a timeline and the motivation to
finish a particular activity. Scheduling gives us an idea about the project's length, its cost,
and its expected duration of completion, and we can also find the shortest way to complete
the project with the least overall cost.


The project schedule describes dependency between activities. It states the estimated time
required to reach each milestone and allocation of people to activities.

2.3 TIMELINE CHART

The overall project is estimated to be completed in approximately 4 months,
which is around 110 days. That includes the learning phase, the requirements
specification for the project, the development phases, and the testing phase with an
integration phase at the end.
Fig 2.1 is the Gantt chart for the same, followed by a table on the project scheduling
timeline for the object detection project, which provides a brief description of the sprints
of the development of the project.

Fig 2.1 Gantt Chart for Backend System



CHAPTER 3
SYSTEM
REQUIREMENTS
 HARDWARE REQUIREMENT
 SOFTWARE REQUIREMENT
 ENVIRONMENT SETUP

3.1 HARDWARE REQUIREMENT

The total amount of data that will process through this hardware is approximately 10GB.
Table 3.1 denotes the hardware required to process the project.

Table 3.1 Hardware Requirements (Used)

Requirement Specification
RAM 32 GB DDR4
CPU Intel Core i9 9th Gen 9900K
GPU Nvidia GeForce RTX 2080
Memory ~ 5 GB
CPU CORE Octa Core

3.2 SOFTWARE REQUIREMENT

We developed the whole project, including the image processing and machine
learning components, completely in the Python programming language. Table 3.2 denotes
the software required for the project.

Table 3.2 Software Requirements

Requirement Specification
Platform Python
IDE PyCharm
Technology Image and Video Processing, Deep
Learning
Libraries Torch, NumPy, PIL, tqdm, argparse, os,
Matplotlib, terminaltables, TorchVision,
TensorBoard, etc.


One of the advantages of Python is its vast library support. We used various libraries of
Python for this project. Table 3.3 shows the libraries we used during the project and the
descriptions of those libraries.

Table 3.3 Used Libraries of Python with Description

Library Description
Torch Torch is an open-source machine learning library, a scientific
computing framework, and a scripting language based on the Lua
programming language. It provides a wide range of algorithms for
deep learning and uses the scripting language LuaJIT, and an
underlying C implementation.
The core package of Torch is torch. It provides a flexible N-
dimensional array or Tensor, which supports basic routines for
indexing, slicing, transposing, type-casting, resizing, sharing storage
and cloning. [1]
NumPy NumPy is the fundamental package for scientific computing with
Python. NumPy is a library for the Python programming language,
adding support for large, multi-dimensional arrays and matrices,
along with a large collection of high-level mathematical functions to
operate on these arrays. [2]
PIL Python Imaging Library (abbreviated as PIL) (in newer versions
known as Pillow) is a free and open-source additional library for the
Python programming language that adds support for opening,
manipulating, and saving many different image file formats. It is
available for Windows, Mac OS X and Linux. [3]
tqdm TQDM supports nested progress bars. If you have Keras fit and
predict loops within an outer TQDM loop, the nested loops will
display properly. TQDM supports Jupyter/IPython notebooks. [4]
argparse The argparse module makes it easy to write user-friendly command-
line interfaces. It parses the defined arguments from the sys.argv.
The argparse module also automatically generates help and usage
messages, and issues errors when users give the program invalid


arguments.
A parser is created with ArgumentParser and a new parameter is
added with add_argument(). Arguments can be optional, required,
or positional. [5]
Os The OS module in Python provides functions for interacting with
the operating system.
Matplotlib Matplotlib is a comprehensive library for creating static, animated,
and interactive visualizations in Python. It is a plotting library for
the Python programming language. [6]
terminaltables Easily draw tables in terminal/console applications from a list of
lists of strings.
Multi-line rows: add newlines to table cells and terminaltables will
handle the rest.
Table titles: show a title embedded in the top border of the table.[7][8]
TorchVision The torchvision package consists of popular datasets, model
architectures, and common image transformations for computer
vision. Some of the popular packages that are present in
TorchVision are torchvision.datasets, torchvision.io,
torchvision.models, torchvision.ops, torchvision.transforms,
torchvision.utils, etc. [9]
TensorBoard TensorBoard provides the visualization and tooling needed for
machine learning experimentation:
 Tracking and visualizing metrics such as loss and accuracy
 Visualizing the model graph (ops and layers)
 Viewing histograms of weights, biases, or other tensors as
they change over time
 Projecting embeddings to a lower-dimensional space
 Displaying images, text, and audio data
 Profiling TensorFlow programs [10]
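
For example, the argparse usage described in the table above looks roughly like the following sketch; the flag names here are hypothetical and not necessarily the ones used in this project:

import argparse

# A parser is created with ArgumentParser, and parameters are added with add_argument()
parser = argparse.ArgumentParser(description="Face-mask detection (illustrative flags)")
parser.add_argument("--weights", type=str, required=True, help="path to trained weights")
parser.add_argument("--conf_threshold", type=float, default=0.5, help="confidence threshold")
parser.add_argument("--source", type=str, default="0", help="image, video file, or webcam index")

# In a real run, parse_args() reads sys.argv; here we pass a list for demonstration
args = parser.parse_args(["--weights", "yolov3_mask.weights"])
print(args.weights, args.conf_threshold, args.source)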

3.3 ENVIRONMENT SETUP


1. Download Anaconda3-2019.03-Windows-x86_64

2. Update Anaconda with following commands:


 conda update conda
 conda update anaconda
 conda update python
 conda update --all

3. Install & Update Nvidia GeForce drivers (Driver version: 442.19)

4. Install CUDA toolkit (CUDA version: 10.0)

5. Install cuDNN (Archive version: cudnn-10.0-windows10-x64-v7.6.0.64.zip)

6. Create appropriate Environment variables

7. Create environment for PyTorch using following command:


 conda create -n pytorch pip python

8. Install following requirements using pip install command:


 numpy: 1.18.1
 pillow: 6.2.2
 torch: 1.4.0
 tqdm
 terminaltables
 torchvision
 matplotlib
 argparse



CHAPTER 4
NEURAL NETWORK
 AI VS ML VS DL
 NEURAL NETWORK
 CONVOLUTIONAL NEURAL
NETWORK
 RELATED WORKS

4.1 AI VS ML VS DL

AI, ML and DL are interconnected in such a way that DL is a subset of ML which is in


turn a subset of AI. Their respective relations can be shown in Fig 4.1

Fig 4.1 AI vs ML vs DL

4.1.1 Artificial Intelligence

Artificial Intelligence (AI) is the broad discipline of creating
intelligent machines. It is the overarching discipline that covers anything
related to making machines smart, whether it is a robot, a refrigerator, a
car, or a software application.

4.1.2 Machine Learning

Machine Learning (ML), a subset of artificial intelligence (AI), refers to
systems that can learn by themselves: systems that get smarter and smarter
over time without human intervention.
Machine Learning is the study of computer algorithms that improve
automatically with experience. Machine Learning algorithms build a
mathematical model that is based on the “training data”, to make
predictions, or decisions without being explicitly programmed to do so. [11]

4.1.3 Deep Learning

Deep Learning (DL) is ML but applied to large data sets.


Deep Learning works in a layered architecture and uses the artificial neural
network, a concept inspired by the biological neural network.
Deep Learning algorithms are trained to identify patterns and classify
various types of information to give the desired output when it receives an
input. [12]

4.2 NEURAL NETWORK

A neural network is a massively parallel distributed processor made up of simple
processing units, inspired by the biological neural network, which has a natural
propensity for storing experiential knowledge and making it available for use.
It is just like our brain because of following two reasons:
• Knowledge is gained by the network from its surrounding through a learning
process.
• Interneuron connection strengths, which are generally known as synaptic weights
are used as memory to store the knowledge that is gained through the learning
process.

Neural networks are multi-layer networks of neurons that are used to
classify things and make predictions.

Artificial neurons are elementary units in an artificial neural network. An artificial


neuron is a mathematical function conceived as a model of biological neurons. [13] Fig 4.2
Shows the biological neuron on left and artificial neuron on the right.

Fig 4.2 Biological Neuron & Artificial Neuron


The working of an artificial neuron is as follows:

 First, the inputs are given to the perceptron, the basic artificial neuron.
 Then, each input is multiplied by its weight.
 The obtained values are summed and a bias is added.
 An activation function is then applied to get the output. Some of the popular
activation functions are sigmoid, hyperbolic tangent (tanh), rectified linear unit
(ReLU) and more.
 At last the output is triggered as 0 or 1.
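
As a minimal sketch of these steps (the input, weight and bias values below are hypothetical):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, 0.3, 0.2])      # hypothetical inputs to the perceptron
weights = np.array([0.4, 0.7, 0.2])     # one weight per input
bias = 0.1

z = np.dot(inputs, weights) + bias      # multiply, sum, then add the bias
activation = sigmoid(z)                 # apply the activation function
output = 1 if activation >= 0.5 else 0  # trigger the output as 0 or 1
print(z, activation, output)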

As artificial neurons are the elementary units of an artificial neural network, Fig 4.3 shows an
artificial neural network where each circle represents an artificial neuron.

Fig 4.3 Neural Network

Here,
 The first layer represents the input layer.
 The last layer represents the output layer (i.e. the prediction).
 All layers in between are hidden layers.
 Each circle represents an artificial neuron, as described above.


4.3 CONVOLUTIONAL NEURAL NETWORK

A Convolutional neural network (CNN) is a neural network that has one or more
convolutional layers and are used mainly for image processing, classification,
segmentation and also for other auto correlated data. The most common use for CNNs is
image classification. [14]

 A Convolutional Neural Network (CNN) consists of one or more convolutional


layers that are often present with a subsampling step and then they are followed by
one or more fully connected layers as in standard Multi-layer neural network.
 The architecture of a CNN is designed to take advantage of the 2D structure of
an input image (or other 2D input, such as a speech signal).
 The above mentioned is obtained with local connections and with tied weights
which are then followed by some sort of pooling which further results in
translation invariant features.
 Another benefit of Convolutional Neural Networks is that they are a lot easier to
train compared to other networks and they have very few parameters as compared
to fully connected networks with the same number of hidden units.

Fig 4.4 Convolutional Neural Network


The role of convolutional neural network is to transform the images into a format that is
easier to process, without losing the features which are necessary for getting a good
prediction.
The above mentioned is important when our goal is to design an architecture that is not
only good at learning features but is also scalable to massive datasets. Fig 4.4 shows the
Convolutional Neural Network.

4.3.1 The Kernel

The element which is involved in the process of carrying out the


convolution operation in the first part of the convolutional layer is called
the Kernel/Filter. [15]

Fig 4.5 Convolutional Process

In Fig. 4.5 the left section is a 5 × 5 × 1 matrix, which is the input
image.
In Fig. 4.5 the right section is a 3 × 3 × 1 matrix, which is the
kernel. It is represented here as K.

 Image Dimensions = 5 (Height) × 5 (Breadth) × 1 (Number of


channels, e.g. RGB).

 Kernel/Filter, K =

Here, the kernel will shift 9 times because Stride Length = 1, every time
performing a matrix multiplication operation between K and the portion P


of the image over which the kernel is hovering. The filter will keep on
moving to the right with some stride value until it parses the complete
width. Then it will move down to the left most beginning of the image
where it will again continue its journey to the end until the complete image
is traversed.
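
A minimal NumPy sketch of this convolution process; the 5 × 5 image and 3 × 3 kernel values are hypothetical, and with stride 1 the kernel shifts 9 times, producing a 3 × 3 feature map:

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image (no padding) and build the feature map
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)               # 5 x 5 x 1 input image
kernel = np.array([[1., 0., 1.], [0., 1., 0.], [1., 0., 1.]])  # 3 x 3 x 1 kernel K (hypothetical)
print(convolve2d(image, kernel))                               # 3 x 3 feature map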

4.3.2 Pooling Layer:

The function of the pooling layer is to reduce the spatial size of the
convolved feature. Because of this the computational power required to
process the data will decrease gradually through dimensionality reduction.
Also, it is useful for finding out the dominant features which are
independent of rotation and position thereby maintaining the process of
effectively training the model.

Pooling are of two types:


 Max Pooling
 Average pooling

Fig 4.6 Pooling Process

4.3.2.1 Max Pooling:

Max pooling works as a noise reducer. It removes the noisy


activations and performs de-noising along with dimensionality
reduction.


4.3.2.2 Average Pooling:

Average pooling simply performs dimensionality reduction as a noise-suppressing
mechanism. Hence, we can conclude that max pooling generally performs better than
average pooling.
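
A short sketch of both pooling operations using PyTorch's functional API on a hypothetical 4 × 4 feature map (shaped batch × channels × height × width):

import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 6., 1., 2.],
                    [7., 2., 8., 3.],
                    [4., 9., 0., 1.]]]])

# 2 x 2 windows with stride 2 halve each spatial dimension
max_pooled = F.max_pool2d(x, kernel_size=2, stride=2)  # keeps the strongest activation per window
avg_pooled = F.avg_pool2d(x, kernel_size=2, stride=2)  # averages each window
print(max_pooled)
print(avg_pooled)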

4.4 RELATED WORKS

There have been many works in the field of object detection using computer vision
techniques, which include the sliding window algorithm, deformable part models, etc. However,
all of them lack the accuracy that is provided by deep learning methods. There are two
main broad classes of methods:

 Two-stage detection (RCNN, Fast RCNN, Faster RCNN)


 YOLO and SSD

The major concepts that are used in the above techniques is shown below:

4.4.1 Classification + Regression

In this method the bounding box is predicted using regression and the class
that is present within the bounding box will be predicted with the help of
classification. The example of this architecture is shown in the image
below in Fig. 4.7

Fig 4.7 Classification + Regression


4.4.2 Two-Stage Method

In this method the region proposals are extracted with the help of some
other computer vision technique and then resized to the fixed input size of
the classification network, which will then work as a
feature extractor. An SVM will then be trained to classify the object and
the background, with one SVM for each class. A
bounding box regressor is also trained, which will output corrections for
some proposal boxes. The idea of the above is shown in Fig. 4.8 and Fig. 4.9.
This method is extremely effective, but on the other hand it is also
computationally very expensive.

Fig 4.8 Two-Stage Method: Stage 1

Fig 4. 9 Two-Stage Method: Stage 2

4.4.3 Unified Method

The difference in this method is that instead of producing the region


proposals, we will use a pre-defined set of boxes to look for our objects.
Using the convolutional feature maps from the later layers in the


network, we will run another network over these feature maps to predict
the class scores and the bounding box offsets. The overview idea of the
above is shown in Fig. 4.10

The steps are mentioned below:


 Train a CNN with classification and Regression objective
 Then gather activations from the later layers to infer
classification and localization with a fully connected layer or a
convolutional layer.
 During the training use IOU to relate the predictions with our
ground truth bounding box.
 While doing inference, use the non-max suppression to filter
multiple boxes around the same object

The more important techniques that follow this strategy are: SSD (which uses
different activation maps for the prediction of classes and the bounding
boxes) and YOLO (used in this project), which uses a single activation map
for predicting classes and bounding boxes. Here, we use multiple scales to
achieve a higher mAP (Mean Average Precision) by detecting objects that
vary in size with very high accuracy.

Fig 4.10 Unified Method



CHAPTER 5
YOLO
 INTRODUCTION
 RELATED TERMS
 ARCHITECTURE
 APPROACH: STANDARD YOLO
VS SELF-MODIFIED YOLO
 APPROACH

5.1 INTRODUCTION

There are currently 3 versions of the YOLO algorithm that are being used in practice.
Each version has its advantages and disadvantages. But YOLO v3 is right now the most
popular Real-time object detection algorithm being used around the globe.

YOLO v3 (You Only Look Once) is one of the fastest algorithms currently
being used worldwide. Even though it is not the most accurate algorithm out there, it
is a very good choice when there is a need for real-time object detection without losing
too much accuracy.

YOLO v3 consists of 53 layers while YOLO v2 consists of only 19 layers. Because of
the additional layers, YOLO v3 is slightly slower than YOLO v2, but in terms of
accuracy YOLO v3 is much better than YOLO v2.

Here, we have used the standard YOLO v3 algorithm with a change in the Non-Max
suppression process.

5.2 RELATED TERMS

5.2.1 IOU

 IOU is computed as the Area of Intersection divided by the Area of Union
of the two boxes.
 IOU must be ≥ 0 and ≤ 1.
 A prediction that matches the ground truth box well has IOU ≈ 1.
 In the left image of Fig 5.1, the IOU is very low.


Fig 5.1 Intersect Over Union (IOU)
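
A minimal sketch of this IOU computation for two axis-aligned boxes given in corner form; the coordinates below are hypothetical:

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) corner coordinates
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)      # area of intersection

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                        # area of union
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39, always between 0 and 1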

5.2.2 Anchor Box / Bounding Box

The bounding box is a rectangle that is drawn in such a way that it covers
the entire object and fits it perfectly. There exists a bounding box for every
instance of the object in the image. And for the box, 4 numbers are
predicted which are as follows:

 center_X, center_Y, width, height

Fig 5.2 Bounding Box

5.2.3 mAP

5.2.3.1 Recall

 Recall is the ratio of true positives (true predictions) to the total number of
ground truth positives (total number of actual objects) [16]
 How many relevant items are selected?


 The recall is the measure of how accurately we detect all the


objects in the data.

 Recall = TP / (TP + FN)

5.2.3.2 Precision

 Precision is the ratio of true positives (true predictions) (TP) to the
total number of predicted positives (total predictions) [16]
 How many selected items are relevant?

 Precision = TP / (TP + FP)

Fig 5.3 Precision & Recall

5.2.3.3 mAP

 Average precision is calculated by taking the area under the


precision-recall curve.
 Average Precision combines both precision and recall together
 Mean Average Precision is the mean of the AP calculated for all
the classes. [16]
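
A small sketch of how these quantities can be computed; the counts and the precision-recall points below are hypothetical, and AP is approximated as the area under that curve:

import numpy as np

tp, fp, fn = 80, 20, 40            # hypothetical true positives, false positives, false negatives
precision = tp / (tp + fp)         # how many selected items are relevant  -> 0.80
recall = tp / (tp + fn)            # how many relevant items are selected  -> 0.67

# Average Precision: area under a (hypothetical) precision-recall curve;
# mAP is then the mean of the per-class AP values.
recalls = np.array([0.0, 0.2, 0.4, 0.6, 0.8])
precisions = np.array([1.0, 0.9, 0.8, 0.7, 0.6])
ap = np.trapz(precisions, recalls)
print(precision, recall, ap)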


5.2.4 Threshold

5.2.4.1 Conf. Threshold

 Confidence Threshold is a base probability value above which the


detection made by the algorithm will be considered as an object.
Most of the time it is predicted by a classifier. [17]

5.2.4.2 NMS Threshold

 While performing non-max suppression, which bounding boxes


should be merged to a single bounding box is decided by the
nms_threshold during the computation of IOU between those
bounding boxes.

5.2.5 Activation Function

5.2.5.1 Sigmoid Function

 The Sigmoid Activation Function is sometimes known as the


logistic function or squashing function.
 Research carried out on sigmoid functions has resulted in three
variants of the sigmoid activation function, which are used in deep
learning applications. The sigmoid function is mostly used in
feedforward neural networks.
 It is a bounded differentiable real function, defined for real input
values, with positive derivatives everywhere and some degree of
smoothness.
 The sigmoid function is given by the Formula 5.2.1

 (5.2.1)

 The sigmoid function appears in the output layers of the DL


architectures, and they are useful for predicting probability-based
output. [18]


Fig 5.4 Sigmoid Activation Function

5.2.5.2 Rectified Linear Unit (ReLU) Function

 ReLU is the most widely used activation function for deep learning
applications with the most accurate results. It is faster compared to
many other Activation Functions. ReLU represents a nearly Linear
function and hence it preserves the properties of the linear function
that made it easy to optimize with gradient descent methods. The
ReLU activation function performs a threshold operation to each
input element where values less than zero are set to zero. [18]
 The ReLU is given by Formula 5.2.2

 f(x) = max(0, x) (5.2.2)

Fig 5.5 ReLU Activation Function


5.2.5.3 Leaky ReLU (LReLU) Function

 The leaky ReLU, was introduced to sustain and keep the weights
updates alive during the entire propagation process. A parameter
named alpha was introduced as a solution to ReLU’s dead neuron
problem so that the gradients will not be zero at any time during
training.
 LReLU computes the gradient with a very small constant value alpha
(around 0.01) for the negative part of the input. Thus LReLU is
computed as:

 f(x) = x if x > 0, and f(x) = αx otherwise (5.2.3)

 The LReLU has a similar result as compared to standard ReLU


with an exception that it will have non-zero gradients over the
entire duration and hence suggesting that there is no significant
result improvement except in sparsity and dispersion when
compared to standard ReLU and other activation functions. [18]

Fig 5.6 Leaky ReLU Activation Function
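
The three activation functions above can be written in a few lines of NumPy; a minimal sketch, with alpha = 0.01 for the leaky ReLU as mentioned above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes values into (0, 1), Formula 5.2.1

def relu(x):
    return np.maximum(0.0, x)             # zero for negative inputs, Formula 5.2.2

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small non-zero slope for negatives, Formula 5.2.3

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x), relu(x), leaky_relu(x))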


5.2.6 Loss Function

A Loss Function is a method of evaluating how well our algorithm models


our dataset. If the difference between the actual values and the predicted values
is very high, then the loss function will output a very high number. If the
difference is small, then it will output a lower number. When we make a
change in the algorithm to improve the model, our loss function will
tell us whether we are moving in the right direction or not.

Loss Function in YOLO v3:


There are 3 detection layers in the YOLO algorithm. Each of these 3 layers
is responsible for the calculation of loss at three different scales. Then the
losses that are calculated at the 3 scales are then summed up for
Backpropagation. Every layer of YOLO uses 7 dimensions to calculate the
Loss. The first 4 dimensions correspond to center_X, center_Y, width,
height of the bounding box. The next dimension corresponds to the
objectness score of the bounding box and the last 2 dimensions correspond
to the one-hot encoded class prediction of the bounding box.

Here, the following 4 losses will be calculated:

 MSE of center_X, center_Y, width and height of bounding box


 BCE of objectness score of a bounding box
 BCE of no objectness score of a bounding box
 BCE of multi-class predictions of a bounding box. [19]

There are many different types of Loss Functions but the ones that are
used here are
 Mean Square Error/Quadratic Loss/ L2 Loss
 Binary Cross Entropy

5.2.6.1 Mean Squared Error Loss (MSE)

 (5.2.4)


 Mean Squared Error is calculated as the average of squared


difference between predictions and actual observations. It is only
affected by the average value of error without worrying about their
direction. However, because of squaring, the predictions which
are already far from the actual value are affected heavily in
comparison to less deviated predictions. MSE has very effective
mathematical properties due to which it is easier to calculate
gradients in it. [20]

5.2.6.2 Binary Cross Entropy Loss (BCE)

 (5.2.5)

 BCE loss is useful in the tasks of binary classification. In the BCE


loss function we only need one output node to classify the data into
two classes. The output value will be passed into a Sigmoid
Activation function and the range of the output will be (0-1). [20]
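
A minimal PyTorch sketch of the two loss functions as they might be applied to box coordinates and an objectness score; the prediction and target values below are hypothetical:

import torch
import torch.nn as nn

mse = nn.MSELoss()   # used for the box coordinate terms
bce = nn.BCELoss()   # used for the objectness and class terms

pred_box = torch.tensor([0.45, 0.52, 0.30, 0.60])  # center_X, center_Y, width, height
true_box = torch.tensor([0.50, 0.50, 0.25, 0.55])

pred_obj = torch.sigmoid(torch.tensor([1.2]))      # objectness score squashed into (0, 1)
true_obj = torch.tensor([1.0])

print(mse(pred_box, true_box))  # mean squared error over the 4 box attributes
print(bce(pred_obj, true_obj))  # binary cross entropy for the objectness score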

5.3 ARCHITECTURE

The network that is used in this project is based on YOLO V3. And the architecture is
shown in Fig 5.7

Fig 5.7 Yolo v3 Architecture


YOLO v3 works on Darknet-53. This means it has 53 layers in its network which are
trained on ImageNet. For the task of detection, another 53 layers are stacked on top,
giving YOLO v3 a 106-layer fully convolutional underlying architecture.

The newer architecture of YOLO consists of residual skip connections and upsampling. It
makes detection at 3 different scales.

YOLO is a fully convolutional network and the output is eventually obtained by applying
a 1×1 kernel on the feature map. In YOLO v3, detection occurs by applying a 1×1
detection kernel on feature maps of 3 different sizes at three different places in the
network.

There are in total 5 types of layers that are used as building blocks of the YOLO v3 algorithm.
They are explained below.

5.3.1 Convolution Layer

A convolution layer consists of a set of filters whose parameters need to be


learned. The height and width of the filters are smaller than those of the
input volume.
Here in YOLO v3 the shape of the detection kernel will be calculated
based on the formula 5.3.1

Shape of the detection kernel = 1 × 1 × (B × (5 + C)) (5.3.1)

Where,
 B is the number of bounding boxes that can be predicted by a
single cell.
 The number “5” is for the 4 bounding box attributes and one object
confidence
 C will determine the number of classes. [21]

For this project, B = 3, C = 2 (MASK and NO_MASK). Hence, the kernel


size will be 1 × 1 × 21. The feature map that will be produced by this


kernel will have identical height and width to the previous feature map and
will have the detection attributes along the depth as described above.
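
As a quick check of Formula 5.3.1 for this project's settings, a one-line sketch:

def detection_kernel_depth(num_boxes, num_classes):
    # Formula 5.3.1: depth of the 1 x 1 detection kernel is B x (5 + C)
    return num_boxes * (5 + num_classes)

# B = 3 boxes per cell, C = 2 classes (MASK, NO_MASK)  ->  1 x 1 x 21 kernel
print(detection_kernel_depth(3, 2))  # 21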

5.3.2 Shortcut Layer

A shortcut layer is a skip connection similar to the one that is used in the
Resnet.
The output of the shortcut layer is obtained by adding the feature map of the
previous layer and the feature map of the layer indicated by the 'from' parameter
(defined in the configuration file), counted backward from the shortcut layer.

5.3.3 Residual Block

A building block of ResNet is called a residual block or also known as


identity block.
A residual block simply means that the activation of a layer is fast-forwarded to a
deeper layer in the neural network.

5.3.4 Upsample Layer

The working of the upsample layer is pretty simple. It upsamples the
feature map from the previous layer by a factor of 'stride' using bilinear
upsampling.
Upsampling is needed because, as we go deeper into the network, the
size of the feature map keeps decreasing, and upsampling helps make the
feature map bigger so that it can be combined with other layers.

5.3.5 YOLO Layer

The YOLO layer corresponds to the detection layer that was discussed
before. The configuration for the YOLO layer describes 9 anchors, but only the
anchors which are indexed by the attributes of the 'mask' tag are used.

5.4 APPROACH: STANDARD YOLO VS SELF-MODIFIED YOLO

Table 5.1 shows the main difference between the standard YOLO approach and the self-
modified YOLO approach which we have used in this project.


Table 5.1 Standard YOLO Approach Vs Self-Modified YOLO Approach

Standard YOLO Approach:
The flow of standard YOLO is as follows:
 Object Detection Process
o Localization
o Class Prediction
 Thresholding
 Non-max suppression with respect to class

Self-Modified YOLO Approach:
The flow of self-modified YOLO is as follows:
 Object Detection Process
o Localization
o Class Prediction
 Thresholding
 Non-max suppression irrespective of class label
 Bounding Box Labelling

The main reasons for using the self-modified YOLO approach are as follows:

 In many object detection settings there is a chance that an
object is present inside the bounding box of another object, and so, according to the
standard YOLO, both the inner and the outer object could be detected because non-max
suppression is applied with respect to the object class label.

 But in face-mask detection the main object is a face, and in a real-life situation it is
impossible for the face of one person to be inside another person's face bounding
box; that is why there is no harm in doing non-max suppression irrespective of
the object class label. A minimal sketch of this class-agnostic suppression follows.
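
The sketch below uses torchvision's nms operator, which only considers box geometry and confidence score (not class labels); the boxes, scores and IOU threshold here are hypothetical:

import torch
from torchvision.ops import nms

def class_agnostic_nms(boxes, scores, iou_threshold=0.4):
    # boxes: (N, 4) corners (x1, y1, x2, y2); scores: (N,) confidences
    # Returns indices of the boxes that survive suppression, highest score first
    return nms(boxes, scores, iou_threshold)

# Two heavily overlapping detections (even with different class labels) collapse to one box
boxes = torch.tensor([[10., 10., 50., 50.], [12., 12., 52., 52.], [100., 100., 140., 140.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(class_agnostic_nms(boxes, scores))  # tensor([0, 2])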

5.5 APPROACH

Here, we will discuss the working of the YOLO algorithm and how the algorithm detects
the object in the image.

In our project, we have an input image of 416 × 416.


5.5.1 Detection Process

 YOLO v3 makes detections at 3 scales, obtained by down-sampling the dimensions of the input image by factors of 32, 16 and 8 respectively.
 The first detection is made by the 82nd layer, as shown in Fig 5.7. The first 81 layers down-sample the image such that, by the time it reaches the 81st layer, the network has a stride of 32. So, for an input image of 416 × 416, the resulting feature map is 13 × 13. A detection is made here using the 1 × 1 detection kernel, giving a detection feature map of 13 × 13 × 21.
 Next, the feature map from the 79th layer is passed through a few convolutional layers before being upsampled by 2× to 26 × 26. This feature map is then concatenated with the feature map from layer 61, and the combined feature map is passed through a few 1 × 1 convolutional layers to fuse the information from layer 61. The second detection is made by the 94th layer, which outputs a detection feature map of 26 × 26 × 21.
 The same procedure is followed again: the feature map from the 91st layer is passed through a few convolutional layers before being concatenated with the feature map from the 36th layer. As before, a few 1 × 1 convolutional layers follow to combine the information from layer 36. The final (3rd) detection is made at the 106th layer, which gives a feature map of size 52 × 52 × 21.
The 13 × 13 scale is responsible for detecting larger objects, the 26 × 26 scale for medium-sized objects, and the 52 × 52 scale for smaller objects.
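
A quick sanity check of the three grid sizes for a 416 × 416 input (plain arithmetic, not project code):

```python
input_size = 416
for stride in (32, 16, 8):
    grid = input_size // stride
    print(f"stride {stride}: {grid} x {grid} grid, {grid * grid * 3} boxes")
# stride 32: 13 x 13 grid, 507 boxes
# stride 16: 26 x 26 grid, 2028 boxes
# stride 8: 52 x 52 grid, 8112 boxes
```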

5.5.1.1 Bounding Box Evaluation

There are 13 × 13 × 21 = 3549, 26 × 26 × 21 = 14196 and 52 × 52 × 21 = 56784 output values across the three scales.

Of these, 7 values (4 box attributes, 1 objectness score and 2 class scores) are used for each bounding box.


So, there are:
 13 × 13 × 3 = 507
 26 × 26 × 3 = 2028
 52 × 52 × 3 = 8112
detections, i.e. 10,647 detections in total, which are evaluated as follows:

 Predicted Box (Blue)


 Prior Box (Black Dotted)

Fig 5.8 Bounding Box Prediction

Here,
bx, by are the x, y center coordinates and bw, bh are the width and height of our prediction;
tx, ty, tw, th are what the network outputs;
cx and cy are the top-left coordinates of the grid cell;
pw and ph are the anchor dimensions for the box. [22]
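
The transformation shown in Fig 5.8 can be written as follows (a sketch of the standard YOLO v3 box decoding; the function and tensor names are ours):

```python
import torch

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = torch.sigmoid(tx) + cx      # center is offset from the grid cell's top-left corner
    by = torch.sigmoid(ty) + cy
    bw = pw * torch.exp(tw)          # width/height scale the anchor (prior) dimensions
    bh = ph * torch.exp(th)
    return bx, by, bw, bh
```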

During training, MSE loss is used for the box coordinates, and the objectness score is predicted using logistic regression. Its target value is 1 if the bounding box prior overlaps a ground-truth object more than any other bounding box prior does.


Only one bounding box prior is assigned to each ground-truth object.

Fig 5.9 Detection Process

5.5.2 Thresholding

 The YOLO algorithm outputs 10,647 boxes, most of which are irrelevant or redundant. Hence, we have to filter out the unneeded boxes.
 We discard all boxes that have a low probability of containing an object. This is done with a confidence threshold: only boxes whose objectness probability exceeds the threshold are kept.
 This step gets rid of anomalous detections.
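
A minimal sketch of this filtering step (assuming a predictions tensor of shape [10647, 7] whose fifth column is the objectness score; the column layout and the threshold value of 0.5 are assumptions):

```python
import torch

predictions = torch.rand(10647, 7)   # [cx, cy, w, h, objectness, p_mask, p_no_mask]
conf_thresh = 0.5
predictions = predictions[predictions[:, 4] > conf_thresh]   # keep confident boxes only
```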

Fig 5.10 Thresholding


5.5.3 Non-Max Suppression

Even after such thresholding, we end up with many boxes for each detected object, but we only need one box per object. This final box is selected using non-max suppression.
Non-max suppression makes use of a concept called “intersection over union”, or IoU. It takes two boxes as input and, as the name implies, computes the ratio of the area of their intersection to the area of their union.

Having defined the IoU, non-max suppression works as follows.

Repeat until there are no boxes left to process:

 Select the box with the highest probability of detection.
 Remove all the boxes that have a high IoU with the selected box.
 Mark the selected box as “processed”.

This type of filtering ensures that only one bounding box is returned per detected object.
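
A minimal sketch of class-agnostic non-max suppression along these lines (boxes given as [x1, y1, x2, y2]; it ignores the class label, in line with the self-modified approach, and the IoU threshold of 0.4 is illustrative rather than the project's exact value):

```python
import torch
from torchvision.ops import box_iou

def nms_class_agnostic(boxes, scores, iou_thresh=0.4):
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())                 # the "processed" box
        if order.numel() == 1:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thresh]    # drop boxes that overlap the best box
    return keep
```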

Fig 5.11 Non-Max Suppression

5.5.4 Bounding Box Labeling

In the process of non-max suppression we have neglected the class label. To assign the class label, while merging we check whether any of the merged bounding boxes has the class label MASK.

 If YES:
The final merged bounding box is labeled MASK.
 Otherwise:
All of the merged bounding boxes carry the NO_MASK label, so the final merged bounding box is labeled NO_MASK.
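
A sketch of this labeling rule (assuming cluster_labels contains the predicted class labels of all boxes merged into one final detection; the class indices MASK = 0 and NO_MASK = 1 are an assumption, not taken from the project files):

```python
def merged_label(cluster_labels, mask_id=0, no_mask_id=1):
    # if any merged box was predicted as MASK, the final box is labeled MASK
    return mask_id if mask_id in cluster_labels else no_mask_id
```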

Fig 5.12 Bounding Box Labeling

5.5.5 Final Result

Fig 5.13 Final Result

CHAPTER 6
DETAILED
DESCRIPTION AND
IMPLEMENTATION
 DATASET
 MODEL DESCRIPTION
 TRAINING
 DETECTION
 DIRECTORY STRUCTURE

6.1 DATASET

For this project, a dataset with two classes (MASK and NO_MASK) was obtained in the following manner.

Masked-faces dataset:
 The Baidu face-mask detection dataset was downloaded, consisting of approximately 4000 images.
 A video dataset of around 45 videos was gathered from friends and family.
 Finally, additional videos were obtained from YouTube.

NO_MASK dataset:
 The WIDER FACE face-detection benchmark provided approximately 14000 NO_MASK images.

6.1.1 Raw Dataset & Labelling

The data we downloaded was in raw video or image form and had to be labelled before it could be used as input to the YOLO training process. The following tools were used to produce the label files.

6.1.1.1 LabelImg

The data downloaded from the Baidu dataset is of image type and was labelled with the labelImg tool, which produces a text file containing the following values:
 Label Center_X Center_Y Width Height
 Where Center_X, Center_Y, Width and Height are normalized values in the range 0–1.

6.1.1.2 DarkLabel

We used the DarkLabel tool to label our self-made videos and the videos downloaded from YouTube.


The DarkLabel tool produces an output text file in the following format:
 FRAME#,N[,CX,CY,W,H,LABEL]
 Where
FRAME#: frame number
N: number of bounding boxes
CX: Center_X
CY: Center_Y
W: Width
H: Height
 Center_X, Center_Y, Width and Height are not normalized to the 0–1 range that YOLO requires as input.

To extract each required frame and its corresponding label text file (with normalized values), we wrote a Python script; a sketch is shown below.
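
A minimal sketch of such a conversion (the file paths, the class-name mapping and the exact DarkLabel field handling are illustrative assumptions, not the script actually used):

```python
import cv2

def darklabel_to_yolo(video_path, label_path, out_dir):
    class_map = {"mask": 0, "no_mask": 1}          # assumed label names
    cap = cv2.VideoCapture(video_path)
    img_w = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    img_h = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    with open(label_path) as f:
        for line in f:
            parts = line.strip().split(",")
            frame_no, n = int(parts[0]), int(parts[1])
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no)
            ok, frame = cap.read()
            if not ok:
                continue
            cv2.imwrite(f"{out_dir}/{frame_no:06d}.jpg", frame)
            with open(f"{out_dir}/{frame_no:06d}.txt", "w") as out:
                for i in range(n):
                    cx, cy, w, h = map(float, parts[2 + 5 * i: 6 + 5 * i])
                    label = class_map[parts[6 + 5 * i]]
                    # normalize pixel values to the 0-1 range YOLO expects
                    out.write(f"{label} {cx / img_w} {cy / img_h} {w / img_w} {h / img_h}\n")
    cap.release()
```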

6.1.2 Training Dataset

The image and its corresponding text file must share the same name. The data was shuffled and divided into two parts: 80% for training and 20% for validation.
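
A minimal sketch of the shuffle and split (the output file names follow the .txt files shown in Fig 6.14 and Fig 6.15; the image directory layout is an assumption):

```python
import glob
import random

image_paths = glob.glob("data/mask_dataset/images/*.jpg")   # assumed location
random.seed(0)
random.shuffle(image_paths)

split = int(0.8 * len(image_paths))
with open("mask_dataset_train.txt", "w") as f:
    f.write("\n".join(image_paths[:split]))       # 80% for training
with open("mask_dataset_validate.txt", "w") as f:
    f.write("\n".join(image_paths[split:]))       # 20% for validation
```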

6.1.2.1 Image File & .txt File

Fig 6.1 Sample Image File


For visualization purposes, the bounding boxes are drawn in Fig 6.1, and the corresponding .txt file containing the bounding-box values and labels is shown in Fig 6.2.

Fig 6.2 .txt File

6.2 MODEL DESCRIPTION

6.2.1 Configuration File

6.2.1.1 Description

Fig 6.3 to Fig 6.8 show the YOLO configuration file contents.

Fig 6.3 Configuration File: Network Information


Fig 6.4 Configuration File: Convolutional Layer Information

Fig 6.5 Configuration File: Route Layer Information

Fig 6.6 Configuration File: Upsample Layer Information

Fig 6.7 Configuration File: Shortcut Layer Information


Fig 6.8 Configuration File: YOLO Layer Information

6.2.1.2 Parsing

 Fig 6.9 shows how the yolov3.cfg file is parsed to collect the YOLO architecture information into the module_def list.
 Each list element holds one layer's information as a dictionary.

Fig 6.9 Configuration File Parsing
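
A minimal sketch of such a parser (in the spirit of the reference PyTorch-YOLOv3 implementation [30]; the function and variable names are illustrative):

```python
def parse_cfg(path):
    module_defs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                                   # skip blanks and comments
            if line.startswith("["):                       # a new block, e.g. [convolutional]
                module_defs.append({"type": line[1:-1].strip()})
            else:
                key, value = line.split("=", 1)
                module_defs[-1][key.strip()] = value.strip()
    return module_defs
```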


6.2.2 Model Making

Fig 6.10 shows how the YOLO architecture is built from the module_def list.

Fig 6.10 YOLO Architecture Making Procedure


Fig 6.11 YOLO Architecture as Module_list

Fig 6.11 shows the YOLO architecture information stored in the form of a ModuleList:
 The ModuleList contains a list of modules.
 Each of these modules is one layer of the YOLO architecture.
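
A condensed sketch of how the module list can be built from the parsed configuration (only the convolutional case is spelled out; route, shortcut, upsample and YOLO blocks follow the same pattern; this is illustrative, not the full project code):

```python
import torch.nn as nn

def create_modules(module_defs):
    module_list = nn.ModuleList()
    channels = [3]                                  # RGB input
    for mdef in module_defs:
        block = nn.Sequential()
        if mdef["type"] == "convolutional":
            filters = int(mdef["filters"])
            bn = int(mdef.get("batch_normalize", 0))
            block.add_module("conv", nn.Conv2d(channels[-1], filters,
                                               kernel_size=int(mdef["size"]),
                                               stride=int(mdef["stride"]),
                                               padding=int(mdef["size"]) // 2,
                                               bias=not bn))
            if bn:
                block.add_module("bn", nn.BatchNorm2d(filters))
            if mdef.get("activation") == "leaky":
                block.add_module("leaky", nn.LeakyReLU(0.1))
            channels.append(filters)
        # route / shortcut / upsample / yolo blocks are handled similarly
        module_list.append(block)
    return module_list
```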

6.2.3 .data File

Fig 6.12 mask_dataset.data File

6.2.4 .names File

Fig 6.13 mask_dataset.names File


6.2.5 train.txt File

Fig 6.14 mask_dataset_train.txt File

6.2.6 validate.txt File

Fig 6.15 mask_dataset_validate.txt File

6.3 TRAINING

6.3.1 Loss Calculation

As shown in Fig 6.16, the following 4 losses are calculated:

 MSE of center_X, center_Y, width and height of the bounding box
 BCE of the objectness score of a bounding box
 BCE of the no-objectness score of a bounding box
 BCE of the multi-class predictions of a bounding box
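
A condensed sketch of how these four terms can be combined (using standard PyTorch loss modules; obj_mask and noobj_mask select the cells with and without an assigned ground-truth box, and all tensor names are illustrative, not the exact project code):

```python
import torch.nn as nn

mse, bce = nn.MSELoss(), nn.BCELoss()

def yolo_loss(pred_box, tgt_box, pred_conf, tgt_conf, pred_cls, tgt_cls,
              obj_mask, noobj_mask):
    loss_box = mse(pred_box[obj_mask], tgt_box[obj_mask])          # x, y, w, h
    loss_obj = bce(pred_conf[obj_mask], tgt_conf[obj_mask])        # objectness
    loss_noobj = bce(pred_conf[noobj_mask], tgt_conf[noobj_mask])  # no-objectness
    loss_cls = bce(pred_cls[obj_mask], tgt_cls[obj_mask])          # class predictions
    return loss_box + loss_obj + loss_noobj + loss_cls
```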


Fig 6.16 Loss Calculation

6.3.2 Training Process

Fig 6.17 Training Process

6.4 DETECTION

6.4.1 Standard YOLO Vs Self-Modified YOLO

Table 6.1 compares the face-mask detection results of the standard YOLO approach and the self-modified YOLO approach.


Table 6.1 Standard YOLO Approach Vs Self-Modified YOLO Approach

Standard YOLO Approach Self-Modified YOLO Approach

Fig 6.18 Standard Approach Fig 6.19 Self-Modified Approach

Fig 6.20 Standard Approach Fig 6.21 Self-Modified Approach

Fig 6.22 Standard Approach Fig 6.23 Self-Modified Approach


6.4.2 Real Time Detection

Frames per second (FPS) and the current time are displayed during real-time face-mask detection, as shown in Fig 6.24 and Fig 6.25.

Fig 6.24 Real Time Face-Mask Detection

Fig 6.25 Real Time Face-Mask Detection


6.4.3 Detection In video

Fig 6.26 Face-Mask Detection in Video

6.4.4 Detection In Image

Fig 6.27 Face-Mask Detection in Image


6.5 DIRECTORY STRUCTURE

Fig 6.28 Project File Structure

CHAPTER 7
TESTING
 BLACK BOX TESTING
 WHITE BOX TESTING
 TESTING STRATEGY
 TEST SUITES
 TESTING: CHALLENGES &
SOLUTIONS

7.1 BLACK BOX TESTING

Black-box testing treats the system as a ‘black box’, so it does not explicitly use knowledge of the internal structure or code. In other words, the test engineer need not know the internal working of the ‘black box’, i.e. the application. The main focus in black-box testing is on the functionality of the system as a whole. The term ‘behavioral testing’ is also used for black-box testing, and white-box testing is sometimes called ‘structural testing’. Behavioral test design is slightly different from black-box test design because the use of internal knowledge is not strictly forbidden, although it is still discouraged.
Each testing method has its own advantages and disadvantages; there are some bugs that cannot be found using only black-box or only white-box testing. The majority of applications are tested by the black-box method, so we need to cover the majority of test cases so that most of the bugs are discovered this way. Black-box testing occurs throughout the software development and testing life cycle, i.e. in the unit, integration, system, acceptance and regression testing stages.

Advantages of Black Box Testing


 Since the tester and developer are independent of each other, testing is balanced
and unprejudiced.
 Tester can be non-technical.
 There is no need for the tester to have detailed functional knowledge of the system.
 Tests will be done from an end user's point of view, because the end user should
accept the system. (This testing technique is sometimes also called Acceptance
testing.)
 Testing helps to identify vagueness and contradictions in functional specifications.
 Test cases can be designed as soon as the functional specifications are complete.

Disadvantages of Black Box Testing


 Test cases are challenging to design without having clear functional
specifications.
 It is difficult to identify tricky inputs if the test cases are not developed based on
specifications.


 It is difficult to identify all possible inputs in limited testing time. As a result,


writing test cases may be slow and difficult.
 There are chances of having unidentified paths during the testing process.
 There is a high probability of repeating tests already performed by the
programmer.

7.2 WHITE BOX TESTING

White-box testing is also called structural or glass-box testing. It involves looking at the structure of the code: when the internal structure of a product is known, tests can be conducted to ensure that the internal operations are performed according to the specification and that all internal components have been adequately exercised.

Why do we perform white-box testing?

To ensure:
 That all independent paths within a module have been exercised at least once.
 That all logical decisions are verified on both their true and false values.
 That all loops are executed at their boundaries and within their operational bounds, and that internal data structures are valid.

Limitations of White-Box Testing:

 It is not possible to test each and every path of the loops in a program, which means exhaustive testing is impossible for large systems.
 This does not mean that white-box testing is not effective: selecting important logical paths and data structures for testing is practical and effective.
 Some conditions might remain untested, as it is not realistic to test every single one.
 The need to create a full range of inputs to test each path and condition makes the white-box testing method time-consuming.

7.3 TESTING STRATEGY

We divided the testing of the project, using the methods mentioned above, into small tasks.


From the two methods of testing, namely Black Box Testing and White Box Testing, we
are going to use:
White Box Testing for,
 Unit Testing
 Module Testing
 Sub-System Testing.
and Black Box Testing for,
 System Testing
 Acceptance Testing.
As mentioned in our project scheduling and planning, there are a total of four test suites in the training phase.

7.4 TEST SUITES

7.4.1 Test Suite 1

Fig 7.1 Test suite 1: mAP: 0.64


7.4.2 Test Suite 2

Fig 7.2 Test suite 2: mAP: 0.60

7.4.3 Test Suite 3

Fig 7.3 Test suite 3: mAP: 0.74


7.4.4 Test Suite 4

Fig 7.4 Test suite 4: mAP: 0.78

7.5 TESTING: CHALLENGES & SOLUTIONS

 Detecting the face mask in images with the following characteristics:
1. Side-facing (profile) faces
2. Vertically half-visible frontal faces
3. Subject wearing a cap, spectacles or goggles
4. Masks made from a handkerchief or other fancy/designer masks
Solution:
1. Gathered training images with the above characteristics.
 System utilization and optimization:
1. Reducing training time
2. Increasing GPU utilization
3. Maintaining high GPU memory usage
4. Maintaining a high FPS rate of 25–30
Solution:
1. Enhanced the code to perform most of the numerical calculations on the GPU, to take maximum advantage of parallel processing.
2. Avoided unnecessary code connected to I/O peripherals (e.g. status display) during training and detection.


 Another big challenge was to maintain the clarity of the input and output images. To overcome this, we normalized the bounding boxes to [0, 1] and then expanded them back according to the output image size.
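
A small sketch of rescaling a normalized box back to the output image size (a trivial helper; the function and parameter names are ours):

```python
def denormalize(box, img_w, img_h):
    cx, cy, w, h = box                                   # values in [0, 1]
    return cx * img_w, cy * img_h, w * img_w, h * img_h
```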

CHAPTER 8
LIMITATIONS AND
FUTURE
ENHANCEMENT
 LIMITATIONS
 FUTURE ENHANCEMENTS

8.1 LIMITATIONS

1. Very distant faces cannot be detected.
2. Only moderate results are obtained for mask types other than medical masks such as:
 Surgical masks
 N-95 masks
 Commonly used masks
3. Difficulty in detecting horizontal and inverted faces.
4. Problems in detecting half-worn masks.
5. The system sometimes outputs MASK when the face is covered by a hand.
6. Unrealistic faces, with or without a face mask, cannot be detected, for example:
 Animated characters
 Emojis

8.2 FUTURE ENHANCEMENTS

 The first step towards future enhancement would be to improve accuracy when detecting uncommon and fancy masks.
 Overcome the limitation in detecting horizontal and inverted faces, as well as the inefficiency in detecting half-worn masks.
 A software application could be designed that sends alerts (SMS, email or push notification) whenever the software detects a face without a mask.

CHAPTER 9
CONCLUSION
 CONCLUSION

9.1 CONCLUSION

An accurate and efficient face-mask detection system has been developed, achieving promising results (mAP of 0.78 on the final test suite). The project uses recent techniques from the fields of computer vision and deep learning. A custom dataset was created using labelImg and DarkLabel. The system can be used for real-time face-mask detection in airports, hospitals, offices, etc.


BIBLIOGRAPHY

REFERENCES

1. https://en.wikipedia.org/wiki/Torch_(machine_learning)
2. https://numpy.org/
3. https://en.wikipedia.org/wiki/Python_Imaging_Library
4. https://pythonhosted.org/keras-tqdm/
5. https://docs.python.org/3/library/argparse.html
6. https://www.windowssearch-
exp.com/search?q=matplot+library&qpvt=matplot+library
7. https://github.com/Robpol86/terminaltables
8. https://robpol86.github.io/terminaltables/
9. https://github.com/pytorch/vision
10. https://www.tensorflow.org/tensorboard?hl=ru
11. https://academic.microsoft.com/topic/119857082
12. https://intellipaat.com/blog/tutorial/artificial-intelligence-tutorial/ai-vs-ml-vs-dl/
13. https://en.wikipedia.org/wiki/Artificial_neuron
14. https://towardsdatascience.com/an-introduction-to-convolutional-neural-networks-
eb0b60b58fd7
15. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-
networks-the-eli5-way-3bd2b1164a53
16. https://medium.com/@amrokamal_47691/yolo-yolov2-and-yolov3-all-you-want-
to-know-7e3e92dc4899
17. http://www.thresh.net/
18. https://deepai.org/publication/activation-functions-comparison-of-trends-in-
practice-and-research-for-deep-learning
19. https://towardsdatascience.com/calculating-loss-of-yolo-v3-layer-8878bfaaf1ff
20. https://towardsdatascience.com/understanding-different-loss-functions-for-neural-
networks-dd1ed0274718
21. https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b


22. https://towardsdatascience.com/review-yolov3-you-only-look-once-object-
detection-eab75d7a1ba6
23. https://pjreddie.com/darknet/yolo/
24. https://arxiv.org/pdf/1506.02640.pdf
25. https://arxiv.org/pdf/1612.08242.pdf
26. https://pjreddie.com/media/files/papers/YOLOv3.pdf
27. https://towardsdatascience.com/deep-learning-in-science-fd614bb3f3ce
28. https://github.com/X-zhangyang/Real-World-Masked-Face-Dataset
29. http://shuoyang1213.me/WIDERFACE/
30. https://github.com/eriklindernoren/PyTorch-YOLOv3
31. https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-
28b1b93e2088
32. https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/
33. https://cs231n.github.io/convolutional-networks/
34. https://arxiv.org/pdf/1311.2524.pdf
35. https://towardsdatascience.com/setup-an-environment-for-machine-learning-and-
deep-learning-with-anaconda-in-windows-5d7134a3db10

COURSES

 Machine Learning by Andrew Ng:


https://www.youtube.com/playlist?list=PLLssT5z_DsK-
h9vYZkQkYNWcItqhlRJLN
 Convolutional Neural Networks by Andrew Ng:
https://www.youtube.com/playlist?list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-
KnDzF
 OpenCV:
https://www.youtube.com/playlist?list=PLQVvvaa0QuDdttJXlLtAJxJetJcqmqlQq
 PyTorch:
https://www.youtube.com/playlist?list=PLQVvvaa0QuDdeMyHEYc0gxFpYwHY
2Qfdh
 YOLO v3:
https://www.youtube.com/playlist?list=PLbMqOoYQ3MxxArhAqvki_WoWBTC
c8fDHG
