
PROJECT REPORT

On

FACE-MASK DETECTION USING YOLO V3


ARCHITECTURE

Submitted by

Nisarg Pethani (IU1641050045)


Harshal Vora (IU1641050063)

In fulfillment for the award of the degree

Of

BACHELOR OF TECHNOLOGY
In

COMPUTER ENGINEERING

INSTITUTE OF TECHNOLOGY AND ENGINEERING


INDUS UNIVERSITY CAMPUS, RANCHARDA, VIA-THALTEJ
AHMEDABAD-382115, GUJARAT, INDIA,
WEB: www.indusuni.ac.in
MAY 2020
PROJECT REPORT
ON

FACE-MASK DETECTION USING YOLO V3

ARCHITECTURE

In the partial fulfillment of the requirement


for the degree of
Bachelor of Technology
in
Computer Engineering

PREPARED BY
Nisarg Pethani (IU1641050045)
Harshal Vora (IU1641050063)

UNDER GUIDANCE OF
Internal Guide
Mr. Hiren Mer
Assistant Professor,
Department of Computer Engineering,
I.T.E, Indus University, Ahmedabad

SUBMITTED TO
INSTITUTE OF TECHNOLOGY AND ENGINEERING
INDUS UNIVERSITY CAMPUS, RANCHARDA, VIA-THALTEJ
AHMEDABAD-382115, GUJARAT, INDIA,
WEB: www.indusuni.ac.in
MAY 2020
CANDIDATE’S DECLARATION

I declare that the final semester report entitled “Face-Mask Detection using YOLO V3
Architecture” is my own work conducted under the supervision of the guide Mr. Hiren
Mer.

I further declare that, to the best of my knowledge, the report for the B.Tech final semester
does not contain any part of work that has been submitted for the award of a B.Tech
degree either in this university or any other university without proper citation.

___________________________________
Candidate’s Signature

Nisarg Pethani (IU1641050045)

___________________________________
Guide: Mr. Hiren Mer
Assistant Professor
Department of Computer Engineering,
Indus Institute of Technology and Engineering
INDUS UNIVERSITY– Ahmedabad,
State: Gujarat
CANDIDATE’S DECLARATION

I declare that the final semester report entitled “Face-Mask Detection using YOLO V3
Architecture” is my own work conducted under the supervision of the guide Mr. Hiren
Mer.

I further declare that, to the best of my knowledge, the report for the B.Tech final semester
does not contain any part of work that has been submitted for the award of a B.Tech
degree either in this university or any other university without proper citation.

___________________________________
Candidate’s Signature

Harshal Vora (IU1641050063)

___________________________________
Guide: Mr. Hiren Mer
Assistant Professor
Department of Computer Engineering,
Indus Institute of Technology and Engineering
INDUS UNIVERSITY– Ahmedabad,
State: Gujarat
INDUS INSTITUTE OF TECHNOLOGY AND ENGINEERING
COMPUTER ENGINEERING
2019 -2020

CERTIFICATE

Date: May 10th, 2020

This is to certify that the project work entitled “Face-Mask Detection using YOLO V3
Architecture” has been carried out by Nisarg Pethani, Harshal Vora under my
guidance in partial fulfillment of degree of Bachelor of Technology in COMPUTER
ENGINEERING (Final Year) of Indus University, Ahmedabad during the academic
year 2019 - 2020.

___________________________ ________________________________
Mr. Hiren Mer Dr. Seema Mahajan
Assistant Professor, Head of the Department,
Department of Computer Engineering, Department of Computer Engineering,
I.T.E, Indus University I.T.E, Indus University
Ahmedabad Ahmedabad
ACKNOWLEDGEMENT

Towards the successful completion of our B.Tech in Computer Engineering final year project,
we feel greatly obliged to certain special people.

We are thankful and would like to express our gratitude to our internal guide Mr. Hiren Mer for
his conscientious guidance and for diligently helping us in this endeavor. We are grateful to him for
providing precise milestones to be achieved for our final year project. We also extend our
gratitude to all the teachers who taught us throughout our Engineering studies and thank them for the
knowledge they imparted to us, and for providing suggestions on the existing features
of the project and how they could be improved. Finally, we thank all those who
indirectly helped or contributed towards the completion of our final year project.

- Nisarg Pethani
- Harshal Vora

TABLE OF CONTENTS
Title Page No
ABSTRACT................................................................................................... v
LIST OF FIGURES........................................................................................ vi
LIST OF TABLES......................................................................................... ix
ABBREVIATIONS........................................................................................ x
CHAPTER 1 INTRODUCTION................................................................... 1
1.1 Project Summary.......................................................................... 2
1.2 Project Purpose............................................................................. 2
1.3 Project Scope................................................................................ 3
1.4 Objectives........................................................................ 3
1.5 Technology and Literature Overview.......................................... 4
1.5.1 Python........................................................................... 4
1.5.2 PyTorch......................................................................... 5
1.5.3 PyCharm........................................................................ 5
1.5.4 LabelImg....................................................................... 5
1.5.5 DarkLabel...................................................................... 6
1.6 Synopsis....................................................................................... 6
CHAPTER 2 PROJECT MANAGEMENT................................................... 7
2.1 Project Planning Objectives......................................................... 8
2.1.1 Project Development approach..................................... 8
2.1.2 Resource........................................................................ 8
2.1.2.1 Human Resource............................................ 8
2.1.2.2 Environment Resource................................... 8
2.2 Project Scheduling....................................................................... 8
2.3 Timeline Chart............................................................................. 9
CHAPTER 3 SYSTEM REQUIREMENTS.................................................. 10
3.1 Hardware Requirement................................................................ 11
3.2 Software Requirement.................................................................. 11
3.3 Environment Setup....................................................................... 14
CHAPTER 4 NEURAL NETWORK............................................................ 15
4.1 AI vs ML vs DL........................................................................... 16
4.1.1 Artificial Intelligence.................................................... 16

4.1.2 Machine Learning......................................................... 16


4.1.3 Deep Learning............................................................... 16
4.2 Neural Network............................................................................ 17
4.3 Convolutional Neural Network.................................................... 19
4.3.1 Kernel............................................................................ 20
4.3.2 Pooling.......................................................................... 21
4.3.2.1 Max Pooling................................................... 21
4.3.2.2 Average Pooling............................................. 22
4.4 Related Works.............................................................................. 22
4.4.1 Classification + Regression........................................... 22
4.4.2 Two-Stage Method........................................................ 23
4.4.3 Unified Method............................................................. 23
CHAPTER 5 YOLO...................................................................................... 25
5.1 Introduction.................................................................................. 26
5.2 Related Terms.............................................................................. 26
5.2.1 IOU................................................................................ 26
5.2.2 Anchor Box / Bounding Box......................................... 27
5.2.3 mAP............................................................................... 27
5.2.3.1 Recall.............................................................. 27
5.2.3.2 Precision......................................................... 28
5.2.3.3 mAP................................................................ 28
5.2.4 Threshold....................................................................... 29
5.2.4.1 Conf. Threshold.............................................. 29
5.2.4.2 NMS Threshold.............................................. 29
5.2.5 Activation Function....................................................... 29
5.2.5.1 Sigmoid Function........................................... 29
5.2.5.2 ReLU Function............................................... 30
5.2.5.3 LReLU Function............................................ 31
5.2.6 Loss Function................................................................ 32
5.2.6.1 MSE Loss....................................................... 32
5.2.6.2 BCE Loss....................................................... 33
5.3 Architecture.................................................................................. 33
5.3.1 Convolution Layer........................................................ 34


5.3.2 Shortcut Layer............................................................... 35


5.3.3 Residual Block.............................................................. 35
5.3.4 Upsample Layer............................................................ 35
5.3.5 YOLO Layer................................................................. 35
5.4 Approach: Standard YOLO Vs Self-Modified YOLO................ 35
5.5 Approach...................................................................................... 36
5.5.1 Detection Process.......................................................... 37
5.5.1.1 Bounding Box Evaluation.............................. 37
5.5.2 Thresholding................................................................. 39
5.5.3 Non-Max Suppression................................................... 40
5.5.4 Bounding Box Labelling............................................... 40
5.5.5 Final Results.................................................................. 41
CHAPTER 6 DETAILED DESCRIPTION AND IMPLEMENTATION....... 42
6.1 Dataset.......................................................................................... 43
6.1.1 Raw dataset & Labelling............................................... 43
6.1.1.1 LabelImg........................................................ 43
6.1.1.2 DarkLabel....................................................... 43
6.1.2 Training Dataset............................................................ 44
6.1.2.1 Image File & .txt File................... 44
6.2 Model Description........................................................................ 45
6.2.1 Configuration File......................................................... 45
6.2.1.1 Description..................................................... 45
6.2.1.2 Parsing............................................................ 47
6.2.2 Model Making............................................................... 48
6.2.3 .data File........................................................................ 49
6.2.4 .names File.................................................................... 49
6.2.5 train.txt File................................................................... 50
6.2.6 validate.txt File.............................................................. 50
6.3 Training........................................................................................ 50
6.3.1 Loss Calculation............................................................ 50
6.3.2 Training Process............................................................ 51
6.4 Detection...................................................................................... 51
6.4.1 Standard YOLO Vs Self-Modified YOLO................... 51


6.4.2 Real Time Detection..................................................... 53


6.4.3 Detection In video......................................................... 54
6.4.4 Detection In Image........................................................ 54
6.6 Directory Structure....................................................................... 55
CHAPTER 7 TESTING................................................................................. 56
7.1 Black Box Testing........................................................................ 57
7.2 White Box Testing....................................................................... 58
7.3 Testing Strategy........................................................................... 58
7.4 Test Suites.................................................................................... 59
7.4.1 Test Suite 1.................................................................... 59
7.4.2 Test Suite 2.................................................................... 60
7.4.3 Test Suite 3.................................................................... 60
7.4.4 Test Suite 4.................................................................... 61
7.5 Testing: Challenges & Solution................................................... 61
CHAPTER 8 LIMITATIONS AND FUTURE ENHANCEMENT.............. 63
8.1 Limitations................................................................................... 64
8.2 Future Enhancements................................................................... 64
CHAPTER 9 CONCLUSION........................................................................ 65
9.1 Conclusion.................................................................................... 66
BIBLIOGRAPHY.......................................................................................... 67


ABSTRACT

Object Detection is one of the most emerging and widely studied fields of computer
vision. The goal of object detection is to find objects of certain classes in a given image,
along with their locations, and assign each a respective class label. With the help of
deep learning, the usage and efficiency of object detection systems has increased
tremendously. Our project incorporates state-of-the-art techniques for object detection
that can also be used for real-time object detection.
A major drawback of many object detection pipelines is their dependency on other
computer vision approaches before deep learning is applied, which results in a loss of
performance in the system. In this project we make use of deep learning to solve the
problem of object detection in an end-to-end manner. The network is trained on a self-
developed dataset. The resulting module is very fast and accurate and can also be used for
real-time object detection.


LIST OF FIGURES

Figure No Title Page No.

Figure 1.1 Classification vs Localization vs Detection 3

Figure 2.1 Gantt Chart for Backend System 9

Figure 4.1 AI vs ML vs DL 16

Figure 4.2 Biological Neuron & Artificial Neuron 17

Figure 4.3 Neural Network 18

Figure 4.4 Convolutional Neural Network 19

Figure 4.5 Convolutional Process 20

Figure 4.6 Pooling Process 21

Figure 4.7 Classification + Regression 22

Figure 4.8 Two-Stage Method: Stage 1 23

Figure 4.9 Two-Stage Method: Stage 2 23

Figure 4.10 Unified Method 24

Figure 5.1 Intersect Over Union (IOU) 27

Figure 5.2 Bounding Box 27

Figure 5.3 Precision & Recall 28

Figure 5.4 Sigmoid Activation Function 30

Figure 5.5 ReLU Activation Function 30

Figure 5.6 Leaky ReLU Activation Function 31

Figure 5.7 Yolo v3 Architecture 33

Figure 5.8 Bounding Box Prediction 38

Figure 5.9 Detection Process 39


Figure 5.10 Thresholding 39

Figure 5.11 Non-Max Suppression 40

Figure 5.12 Bounding Box Labeling 41

Figure 5.13 Final Result 41

Figure 6.1 Sample Image File 44

Figure 6.2 .txt File 45

Figure 6.3 Configuration File: Network Information 45

Figure 6.4 Configuration File: Convolutional Layer Information 46

Figure 6.5 Configuration File: Route Layer Information 46

Figure 6.6 Configuration File: Upsample Layer Information 46

Figure 6.7 Configuration File: Shortcut Layer Information 46

Figure 6.8 Configuration File: YOLO Layer Information 47

Figure 6.9 Configuration File Parsing 47

Figure 6.10 YOLO Architecture Making Procedure 48

Figure 6.11 YOLO Architecture as Module_list 49

Figure 6.12 mask_dataset.data File 49

Figure 6.13 mask_dataset.names File 49

Figure 6.14 mask_dataset_train.txt File 50

Figure 6.15 mask_dataset_validate.txt File 50

Figure 6.16 Loss Calculation 51

Figure 6.17 Training Process 51

Figure 6.18 Standard Approach 52

Figure 6.19 Self-Modified Approach 52


Figure 6.20 Standard Approach 52

Figure 6.21 Self-Modified Approach 52

Figure 6.22 Standard Approach 52

Figure 6.23 Self-Modified Approach 52

Figure 6.24 Real Time Face-Mask Detection 53

Figure 6.25 Real Time Face-Mask Detection 53

Figure 6.26 Face-Mask Detection in Video 54

Figure 6.27 Face-Mask Detection in Image 54

Figure 6.28 Project File Structure 55

Figure 7.1 Test suite 1: mAP: 0.64 59

Figure 7.2 Test suite 2: mAP: 0.60 60

Figure 7.3 Test suite 3: mAP: 0.74 60

Figure 7.4 Test suite 4: mAP: 0.78 61


LIST OF TABLES

Table No Title Page No.

Table 1.1 Python Advantages and Disadvantages 4

Table 1.2 Synopsis 6

Table 3.1 Hardware Requirements 11

Table 3.2 Software Requirements 11

Table 3.3 Used Libraries of Python with Description 12


Table 5.1 Standard YOLO Approach Vs Self-Modified YOLO Approach 36

Table 6.1 Standard YOLO Approach Vs Self-Modified YOLO Approach 52


ABBREVIATIONS

Abbreviations used throughout this document are:

AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
NLP Natural Language Processing
YOLO You Only Look Once
PIL Python Imaging Library
CNN Convolutional Neural Network
RCNN Region-based Convolutional Neural Network
SSD Single Shot MultiBox Detector
IOU Intersection Over Union
mAP Mean Average Precision
NMS Non-Max Suppression
ReLU Rectified Linear Unit
LReLU Leaky Rectified Linear Unit
MSE Mean Square Error
BCE Binary Cross Entropy
FPS Frames Per Second
IO Input / Output



CHAPTER 1
INTRODUCTION
 PROJECT SUMMARY
 PROJECT PURPOSE
 PROJECT SCOPE
 PROJECT OBJECTIVES
 TECHNOLOGY AND
LITERATURE OVERVIEW
 SYNOPSIS

1.1 PROJECT SUMMARY

The most complicated problem in this project is to detect whether a person is wearing a
face mask or not, and that involves both classification and localization.
 Image classification, which involves predicting the class of an image.
 The more complicated problem is image localization, where the image has a
single object and the model must predict the class of the object as well as its
location and put a bounding box around it.

An overview of the problem is shown in Fig 1.1.

Fig 1.1 Classification vs Localization vs Detection

Here, in our project the input to the model will be an image or a video (mostly real-time)
and the output will be a bounding box corresponding to each person's face in the
image/video, along with an indication of whether that person is wearing a face mask or not.

1.2 PROJECT PURPOSE

Face Mask detection is an important aspect in the Health care industry and it cannot be
taken lightly.
This project is to help identify face masks as an object in video surveillance cameras
across different places like hospitals, emergency departments, out-patient facilities,
residential care facilities, emergency medical services, and home health care delivery to


provide safety to doctors and patients and reduce the outbreak of disease. The detection of
a face mask needs to happen in real time, so that the necessary actions in case of any
non-compliance can be taken on the spot.

1.3 PROJECT SCOPE

 Airports:
The Face Mask Detection System can be used at airports to detect travelers
without masks. Face data of travelers can be captured in the system at the
entrance. If a traveler is found to be without a face mask, their picture is sent to
the airport authorities so that they could take quick action. If the person’s face is
already stored, like the face of an Airport worker, it can send the alert to the
worker’s phone directly
 Hospitals:
Using Face Mask Detection System, Hospitals can monitor if their staff is wearing
masks during their shift or not. If any health worker is found without a mask, they
will receive a notification with a reminder to wear a mask. Also, for quarantined
people who are required to wear a mask, the system can keep watch, detect whether
the mask is present or not, and automatically send a notification or report to the
authorities.
 Offices:
The Face Mask Detection System can be used at office premises to detect if
employees are maintaining safety standards at work. It monitors employees
without masks and sends them a reminder to wear a mask. The reports can be
downloaded or sent by email at the end of the day to capture people who are not
complying with the regulations or requirements.

1.4 OBJECTIVES

It is not feasible for a human to detect face masks in real time, as there can be hundreds
of instances in a given frame; it would also be very time-consuming and inefficient for a
human to find every subject with or without a mask.


For this reason, we have to build a powerful model that can overcome the problem
of real-time detection and the inefficiency of a human.
The model should also be capable of performing face-mask detection on a real-time
surveillance camera feed, any video, or a set of images.

1.5 TECHNOLOGY AND LITERATURE OVERVIEW

The subsections below present an overview of the technologies that are used
in this project.

1.5.1 Python

Python is an interpreted, object-oriented, high-level, general-purpose


programming language which provides high support for machine learning
& Deep learning algorithms because of its library support.
Python is very simple and easy to learn. It has a syntax that is very easy to
learn and is very easily readable and it is very easily and efficiently
maintainable.
Python supports modules and has a lot of packages, which encourages
modularity and allows code to be reused.

Some of the features of python programming:


 Support for ML & DL Libraries
 Extensible in C and C++
 Interactive
 Dynamic
 Object-oriented

Table 1.1 Python Advantages and Disadvantages

Advantages Disadvantages
Vast libraries support Slow speed
Improved Productivity Not memory efficient
IOT opportunities Weak in Mobile computing


Portable, Free and Open source Design Restrictions


Dynamically typed Database Access
Embeddable Runtime errors

1.5.2 PyTorch

PyTorch is an open-source machine learning library based on the Torch


library. It is highly used in applications such as computer vision and
natural language processing (NLP). It was primarily developed by
Facebook’s AI Research Lab. It is Free and Open Source released under
the BSD license. PyTorch also has a C++ interface.

PyTorch has two main high-level features:


 Tensor computing (like NumPy) with strong acceleration via
graphics processing units (GPUs).
 Deep Neural Networks built on a tape-based automatic
differentiation system.
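
As an illustration of these two features, a minimal sketch (not taken from the project code) that runs a tensor computation on the GPU when one is available and differentiates a simple expression with autograd:

import torch

# Tensor computing: create tensors and move them to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.rand(3, 3, device=device)
b = torch.rand(3, 3, device=device)
c = a @ b  # matrix multiplication, accelerated on the GPU when present

# Tape-based automatic differentiation: operations are recorded and replayed backward
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x
y.backward()
print(c.shape, x.grad)  # x.grad = 2*x + 3 = tensor(7.)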

1.5.3 PyCharm

PyCharm is an integrated development environment (IDE) which is used


in computer programming. Although it supports most of the modern
programming languages, it is mainly used for python programming. It was
developed by a Czech company that goes by the name JetBrains.
Some of its functionalities are code analysis, a graphical debugger, an
integrated unit tester, and integration with version control systems; it also
supports web development with Django as well as data science with Anaconda.
PyCharm is cross-platform, which means it works on Windows, macOS
and Linux.

1.5.4 LabelImg

LabelImg is a graphical annotation tool. It is mainly written in Python and


uses Qt for its graphical interface. Annotations can be saved as a .txt file in


YOLO format, or as a .xml file in the PASCAL VOC format, which is used by
ImageNet.

1.5.5 DarkLabel

DarkLabel is a video annotation tool that makes labeling an object in a


video very simple and efficient. DarkLabel supports rectangular object
annotations and uses linear interpolation as a bounding-shape propagation
technique between frames. It is very easy to handle, with almost no learning
curve, and it is very time efficient and accurate.

1.6 SYNOPSIS

Table 1.2 Synopsis


Project Title Face-mask Detection
Daily work Approximately 5 hours
Time Duration Approximately 3.5 months
Software Specification Python, PyTorch, LabelImg, DarkLabel
Start Date January 17th, 2020
End Date May 6th, 2020



CHAPTER 2
PROJECT
MANAGEMENT
 PROJECT PLANNING
OBJECTIVE
 PROJECT SCHEDULING
 TIMELINE CHART

2.1 PROJECT PLANNING OBJECTIVES

The project is developed at Rajkot and the time duration for completing the project is
from 15th January, 2020 to May 5th, 2020.
During the project development period, we have submitted a report and presentations to
the internal guide on regular intervals whenever required.

2.1.1 Project Development approach

Our project is Face-mask Detection using Deep Learning Algorithm.


The motivation for this project is that machine learning and deep learning
are very fast-growing subjects in the field of computer vision.

2.1.2 Resource

2.1.2.1 Human Resource

The human resources required are


1. Project Guides.
2. Developers.

2.1.2.2 Environment Resource

The environment that supports the software project, often called the
software engineering environment, includes both software and hardware.

2.2 PROJECT SCHEDULING

Project scheduling is one of the most important aspects of any project. Any project must
have a precise schedule before developing it.
When a project developer works on a scheduled project, it is more advantageous for
him/her compared to an unscheduled project. It gives us a timeline and the motivation to
finish a particular activity. Scheduling gives us an idea about the project's length, its cost,
and its expected duration of completion, and we can also find the shortest way to complete
the project with the least overall cost.


The project schedule describes dependency between activities. It states the estimated time
required to reach each milestone and allocation of people to activities.

2.3 TIMELINE CHART

The overall project is estimated to be completed in approximately 4 months,
which is around 110 days. That includes the learning phase, the requirements
specification for the project, the development phases, and the testing phase with an
integration phase at the end.
Fig 2.1 is the Gantt chart for the same, followed by a table on the project scheduling
timeline for the object detection project, which provides a brief description of the sprints
of the development of the project.

Fig 2.1 Gantt Chart for Backend System



CHAPTER 3
SYSTEM
REQUIREMENTS
 HARDWARE REQUIREMENT
 SOFTWARE REQUIREMENT
 ENVIRONMENT SETUP

3.1 HARDWARE REQUIREMENT

The total amount of data that will process through this hardware is approximately 10GB.
Table 3.1 denotes the hardware required to process the project.

Table 3.1 Hardware Requirements (Used)

Requirement Specification
RAM 32 GB DDR4
CPU Intel Core i9 9th Gen 9900K
GPU Nvidia GeForce RTX 2080
Memory ~ 5 GB
CPU CORE Octa Core

3.2 SOFTWARE REQUIREMENT

We developed the whole project, including the image processing and machine
learning components, completely in the Python programming language. Table 3.2 denotes
the software required for the project.

Table 3.2 Software Requirements

Requirement Specification
Platform Python
IDE PyCharm
Technology Image and Video Processing, Deep
Learning
Libraries Torch, NumPy, PIL, tqdm, argparse, os,
Matplotlib, terminaltables, TorchVision,
TensorBoard, etc.


One of the advantages of Python is its vast library support. We used various libraries of
Python for this project. Table 3.3 shows the libraries we used during the project and the
descriptions of those libraries.

Table 3.3 Used Libraries of Python with Description

Library Description
Torch Torch is an open-source machine learning library, a scientific
computing framework, and a scripting language based on the Lua
programming language. It provides a wide range of algorithms for
deep learning and uses the scripting language LuaJIT, and an
underlying C implementation.
The core package of Torch is torch. It provides a flexible N-
dimensional array or Tensor, which supports basic routines for
indexing, slicing, transposing, type-casting, resizing, sharing storage
and cloning. [1]
NumPy NumPy is the fundamental package for scientific computing with
Python. NumPy is a library for the Python programming language,
adding support for large, multi-dimensional arrays and matrices,
along with a large collection of high-level mathematical functions to
operate on these arrays. [2]
PIL Python Imaging Library (abbreviated as PIL) (in newer versions
known as Pillow) is a free and open-source additional library for the
Python programming language that adds support for opening,
manipulating, and saving many different image file formats. It is
available for Windows, Mac OS X and Linux. [3]
tqdm TQDM supports nested progress bars. If you have Keras fit and
predict loops within an outer TQDM loop, the nested loops will
display properly. TQDM supports Jupyter/IPython notebooks. [4]
argparse The argparse module makes it easy to write user-friendly command-
line interfaces. It parses the defined arguments from the sys.argv.
The argparse module also automatically generates help and usage
messages, and issues errors when users give the program invalid


arguments.
A parser is created with ArgumentParser and a new parameter is
added with add_argument(). Arguments can be optional, required,
or positional. [5]
Os The OS module in Python provides functions for interacting with
the operating system.
Matplotlib Matplotlib is a comprehensive library for creating static, animated,
and interactive visualizations in Python. It is a plotting library for
the Python programming language. [6]
terminaltables Easily draw tables in terminal/console applications from a list of
lists of strings.
Multi-line rows: add newlines to table cells and terminaltables will
handle the rest.
Table titles: show a title embedded in the top border of the table.[7][8]
TorchVision The torchvision package consists of popular datasets, model
architectures, and common image transformations for computer
vision. Some of the popular packages that are present in
TorchVision are torchvision.datasets, torchvision.io,
torchvision.models, torchvision.ops, torchvision.transforms,
torchvision.utils, etc. [9]
TensorBoard TensorBoard provides the visualization and tooling needed for
machine learning experimentation:
 Tracking and visualizing metrics such as loss and accuracy
 Visualizing the model graph (ops and layers)
 Viewing histograms of weights, biases, or other tensors as
they change over time
 Projecting embeddings to a lower-dimensional space
 Displaying images, text, and audio data
 Profiling TensorFlow programs [10]
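
For example, the argparse usage described in the table above looks roughly like the following sketch; the flag names here are hypothetical and not necessarily the ones used in this project:

import argparse

# A parser is created with ArgumentParser, and parameters are added with add_argument()
parser = argparse.ArgumentParser(description="Face-mask detection (illustrative flags)")
parser.add_argument("--weights", type=str, required=True, help="path to trained weights")
parser.add_argument("--conf_threshold", type=float, default=0.5, help="confidence threshold")
parser.add_argument("--source", type=str, default="0", help="image, video file, or webcam index")

# In a real run, parse_args() reads sys.argv; here we pass a list for demonstration
args = parser.parse_args(["--weights", "yolov3_mask.weights"])
print(args.weights, args.conf_threshold, args.source)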

3.3 ENVIRONMENT SETUP


1. Download Anaconda3-2019.03-Windows-x86_64

2. Update Anaconda with following commands:


 conda update conda
 conda update anaconda
 conda update python
 conda update --all

3. Install & Update Nvidia GeForce drivers (Driver version: 442.19)

4. Install CUDA toolkit (CUDA version: 10.0)

5. Install cuDNN (Archive version: cudnn-10.0-windows10-x64-v7.6.0.64.zip)

6. Create appropriate Environment variables

7. Create environment for PyTorch using following command:


 conda create -n pytorch pip python

8. Install following requirements using pip install command:


 numpy: 1.18.1
 pillow: 6.2.2
 torch: 1.4.0
 tqdm
 terminaltables
 torchvision
 matplotlib
 argparse



CHAPTER 4
NEURAL NETWORK
 AI VS ML VS DL
 NEURAL NETWORK
 CONVOLUTIONAL NEURAL
NETWORK
 RELATED WORKS

4.1 AI VS ML VS DL

AI, ML and DL are interconnected in such a way that DL is a subset of ML which is in


turn a subset of AI. Their respective relations can be shown in Fig 4.1

Fig 4.1 AI vs ML vs DL

4.1.1 Artificial Intelligence

Artificial Intelligence (AI) is the broad discipline of creating
intelligent machines. It is the overarching discipline that covers anything
related to making machines smart, whether it is a robot, a refrigerator, a
car, or a software application.

4.1.2 Machine Learning

Machine Learning (ML), a subset of artificial intelligence (AI), refers to
systems that can learn by themselves: systems that get smarter and smarter
over time without human intervention.
Machine Learning is the study of computer algorithms that improve
automatically with experience. Machine Learning algorithms build a
mathematical model that is based on the “training data”, to make
predictions, or decisions without being explicitly programmed to do so. [11]

4.1.3 Deep Learning

Deep Learning (DL) is ML but applied to large data sets.


Deep Learning works in a layered architecture and uses the artificial neural
network, a concept inspired by the biological neural network.
Deep Learning algorithms are trained to identify patterns and classify
various types of information to give the desired output when it receives an
input. [12]

4.2 NEURAL NETWORK

A neural network is a massively parallel distributed processor made up of simple
processing units, inspired by the biological neural network, which has a natural
propensity for storing experiential knowledge and making it available for use.
It is just like our brain because of following two reasons:
• Knowledge is gained by the network from its surrounding through a learning
process.
• Interneuron connection strengths, which are generally known as synaptic weights
are used as memory to store the knowledge that is gained through the learning
process.

Neural networks are multi-layer networks of neurons that are used to
classify things and make predictions.

Artificial neurons are elementary units in an artificial neural network. An artificial


neuron is a mathematical function conceived as a model of biological neurons. [13] Fig 4.2
Shows the biological neuron on left and artificial neuron on the right.

Fig 4.2 Biological Neuron & Artificial Neuron


The working of an artificial neuron is as follows:

 First, the inputs are given to the perceptron, the basic artificial neuron.
 Then, each input is multiplied by its weight.
 The obtained values are summed and a bias is added.
 An activation function is then applied to get the output. Some of the popular
activation functions are sigmoid, hyperbolic tangent (tanh), rectified linear unit
(ReLU) and more.
 At last the output is triggered as 0 or 1.
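
As a minimal sketch of these steps (the input, weight and bias values below are hypothetical):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, 0.3, 0.2])      # hypothetical inputs to the perceptron
weights = np.array([0.4, 0.7, 0.2])     # one weight per input
bias = 0.1

z = np.dot(inputs, weights) + bias      # multiply, sum, then add the bias
activation = sigmoid(z)                 # apply the activation function
output = 1 if activation >= 0.5 else 0  # trigger the output as 0 or 1
print(z, activation, output)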

As artificial neurons are the elementary units of an artificial neural network, Fig 4.3 shows an
artificial neural network where each circle represents an artificial neuron.

Fig 4.3 Neural Network

Here,
 The first layer represents the input layer.
 The last layer represents the output layer (i.e. the prediction).
 All layers in between are hidden layers.
 Each circle represents an artificial neuron, as described above.


4.3 CONVOLUTIONAL NEURAL NETWORK

A Convolutional neural network (CNN) is a neural network that has one or more
convolutional layers and are used mainly for image processing, classification,
segmentation and also for other auto correlated data. The most common use for CNNs is
image classification. [14]

 A Convolutional Neural Network (CNN) consists of one or more convolutional


layers that are often present with a subsampling step and then they are followed by
one or more fully connected layers as in standard Multi-layer neural network.
 The architecture of a CNN is designed to take advantage of the 2D structure of
an input image (or other 2D input, such as a speech signal).
 The above mentioned is obtained with local connections and with tied weights
which are then followed by some sort of pooling which further results in
translation invariant features.
 Another benefit of Convolutional Neural Networks is that they are a lot easier to
train compared to other networks and they have very few parameters as compared
to fully connected networks with the same number of hidden units.

Fig 4.4 Convolutional Neural Network


The role of convolutional neural network is to transform the images into a format that is
easier to process, without losing the features which are necessary for getting a good
prediction.
The above mentioned is important when our goal is to design an architecture that is not
only good at learning features but is also scalable to massive datasets. Fig 4.4 shows the
Convolutional Neural Network.

4.3.1 The Kernel

The element which is involved in the process of carrying out the


convolution operation in the first part of the convolutional layer is called
the Kernel/Filter. [15]

Fig 4.5 Convolutional Process

In Fig. 4.5 the left section is a 5 × 5 × 1 matrix, which is the input
image.
In Fig. 4.5 the right section is a 3 × 3 × 1 matrix, which is the
kernel. It is represented here as K.

 Image Dimensions = 5 (Height) × 5 (Breadth) × 1 (Number of


channels, e.g. RGB).

 Kernel/Filter, K =

Here, the kernel will shift 9 times because Stride Length = 1, every time
performing a matrix multiplication operation between K and the portion P


of the image over which the kernel is hovering. The filter will keep on
moving to the right with some stride value until it parses the complete
width. Then it will move down to the left most beginning of the image
where it will again continue its journey to the end until the complete image
is traversed.
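
A minimal NumPy sketch of this convolution process; the 5 × 5 image and 3 × 3 kernel values are hypothetical, and with stride 1 the kernel shifts 9 times, producing a 3 × 3 feature map:

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image (no padding) and build the feature map
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)               # 5 x 5 x 1 input image
kernel = np.array([[1., 0., 1.], [0., 1., 0.], [1., 0., 1.]])  # 3 x 3 x 1 kernel K (hypothetical)
print(convolve2d(image, kernel))                               # 3 x 3 feature map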

4.3.2 Pooling Layer:

The function of the pooling layer is to reduce the spatial size of the
convolved feature. Because of this the computational power required to
process the data will decrease gradually through dimensionality reduction.
Also, it is useful for finding out the dominant features which are
independent of rotation and position thereby maintaining the process of
effectively training the model.

Pooling are of two types:


 Max Pooling
 Average pooling

Fig 4.6 Pooling Process

4.3.2.1 Max Pooling:

Max pooling works as a noise reducer. It removes the noisy


activations and performs de-noising along with dimensionality
reduction.


4.3.2.2 Average Pooling:

Average pooling simply performs dimensionality reduction as a noise-suppressing
mechanism. Hence, we can conclude that max pooling generally performs better than
average pooling.
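
A short sketch of both pooling operations using PyTorch's functional API on a hypothetical 4 × 4 feature map (shaped batch × channels × height × width):

import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 6., 1., 2.],
                    [7., 2., 8., 3.],
                    [4., 9., 0., 1.]]]])

# 2 x 2 windows with stride 2 halve each spatial dimension
max_pooled = F.max_pool2d(x, kernel_size=2, stride=2)  # keeps the strongest activation per window
avg_pooled = F.avg_pool2d(x, kernel_size=2, stride=2)  # averages each window
print(max_pooled)
print(avg_pooled)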

4.4 RELATED WORKS

There have been many works in the field of object detection using computer vision
techniques, which include the sliding window algorithm, deformable part models, etc. However,
all of them lack the accuracy that is provided by deep learning methods. There are two
main broad classes of methods:

 Two-stage detection (RCNN, Fast RCNN, Faster RCNN)


 YOLO and SSD

The major concepts that are used in the above techniques is shown below:

4.4.1 Classification + Regression

In this method the bounding box is predicted using regression and the class
that is present within the bounding box will be predicted with the help of
classification. The example of this architecture is shown in the image
below in Fig. 4.7

Fig 4.7 Classification + Regression


4.4.2 Two-Stage Method

In this method the region proposals are extracted with the help of some
other computer vision technique and then resized to the fixed input size of
the classification network, which will then work as a
feature extractor. An SVM will then be trained to classify the object and
the background, with one SVM for each class. A
bounding box regressor is also trained, which will output corrections for
some proposal boxes. The idea of the above is shown in Fig. 4.8 and Fig. 4.9.
This method is extremely effective, but on the other hand it is also
computationally very expensive.

Fig 4.8 Two-Stage Method: Stage 1

Fig 4. 9 Two-Stage Method: Stage 2

4.4.3 Unified Method

The difference in this method is that instead of producing the region


proposals, we will use a pre-defined set of boxes to look for our objects.
Using the convolutional feature maps from the later layers in the


network, we will run another network over these feature maps to predict
the class scores and the bounding box offsets. The overview idea of the
above is shown in Fig. 4.10

The steps are mentioned below:


 Train a CNN with classification and Regression objective
 Then gather activations from the later layers to infer
classification and localization with a fully connected layer or a
convolutional layer.
 During the training use IOU to relate the predictions with our
ground truth bounding box.
 While doing inference, use the non-max suppression to filter
multiple boxes around the same object

The more important techniques that follow this strategy are: SSD (which uses
different activation maps for the prediction of classes and the bounding
boxes) and YOLO (used in this project), which uses a single activation map
for predicting classes and bounding boxes. Here, we use multiple scales to
achieve a higher mAP (Mean Average Precision) by detecting objects that
vary in size with very high accuracy.

Fig 4.10 Unified Method



CHAPTER 5
YOLO
 INTRODUCTION
 RELATED TERMS
 ARCHITECTURE
 APPROACH: STANDARD YOLO
VS SELF-MODIFIED YOLO
 APPROACH

5.1 INTRODUCTION

There are currently 3 versions of the YOLO algorithm that are being used in practice.
Each version has its advantages and disadvantages. But YOLO v3 is right now the most
popular Real-time object detection algorithm being used around the globe.

YOLO v3 (You Only Look Once) is one of the fastest algorithms currently
being used worldwide. Even though it is not the most accurate algorithm out there, it
is a very good choice when there is a need for real-time object detection without losing
too much accuracy.

YOLO v3 consists of 53 layers while YOLO v2 consists of only 19 layers. Because of
the additional layers, YOLO v3 is slightly slower than YOLO v2, but in terms of
accuracy YOLO v3 is much better than YOLO v2.

Here, we have used the standard YOLO v3 algorithm with a change in the Non-Max
suppression process.

5.2 RELATED TERMS

5.2.1 IOU

 IOU is computed as the Area of Intersection divided by the Area of Union
of the two boxes.
 IOU must be ≥ 0 and ≤ 1.
 A prediction that matches the ground truth box well has IOU ≈ 1.
 In the left image of Fig 5.1, the IOU is very low.


Fig 5.1 Intersect Over Union (IOU)
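
A minimal sketch of this IOU computation for two axis-aligned boxes given in corner form; the coordinates below are hypothetical:

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) corner coordinates
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)      # area of intersection

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                        # area of union
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39, always between 0 and 1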

5.2.2 Anchor Box / Bounding Box

The bounding box is a rectangle that is drawn in such a way that it covers
the entire object and fits it perfectly. There exists a bounding box for every
instance of the object in the image. And for the box, 4 numbers are
predicted which are as follows:

 center_X, center_Y, width, height

Fig 5.2 Bounding Box

5.2.3 mAP

5.2.3.1 Recall

 Recall is the ratio of true positives (true predictions) to the total number of
ground truth positives (total number of actual objects) [16]
 How many relevant items are selected?


 The recall is the measure of how accurately we detect all the


objects in the data.

 Recall = TP / (TP + FN)

5.2.3.2 Precision

 Precision is the ratio of true positives (true predictions) (TP) to the
total number of predicted positives (total predictions) [16]
 How many selected items are relevant?

 Precision = TP / (TP + FP)

Fig 5.3 Precision & Recall

5.2.3.3 mAP

 Average precision is calculated by taking the area under the


precision-recall curve.
 Average Precision combines both precision and recall together
 Mean Average Precision is the mean of the AP calculated for all
the classes. [16]
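
A small sketch of how these quantities can be computed; the counts and the precision-recall points below are hypothetical, and AP is approximated as the area under that curve:

import numpy as np

tp, fp, fn = 80, 20, 40            # hypothetical true positives, false positives, false negatives
precision = tp / (tp + fp)         # how many selected items are relevant  -> 0.80
recall = tp / (tp + fn)            # how many relevant items are selected  -> 0.67

# Average Precision: area under a (hypothetical) precision-recall curve;
# mAP is then the mean of the per-class AP values.
recalls = np.array([0.0, 0.2, 0.4, 0.6, 0.8])
precisions = np.array([1.0, 0.9, 0.8, 0.7, 0.6])
ap = np.trapz(precisions, recalls)
print(precision, recall, ap)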


5.2.4 Threshold

5.2.4.1 Conf. Threshold

 Confidence Threshold is a base probability value above which the


detection made by the algorithm will be considered as an object.
Most of the time it is predicted by a classifier. [17]

5.2.4.2 NMS Threshold

 While performing non-max suppression, which bounding boxes


should be merged to a single bounding box is decided by the
nms_threshold during the computation of IOU between those
bounding boxes.

5.2.5 Activation Function

5.2.5.1 Sigmoid Function

 The Sigmoid Activation Function is sometimes known as the


logistic function or squashing function.
 Research carried out on sigmoid functions has resulted in three
variants of the sigmoid activation function, which are used in deep
learning applications. The sigmoid function is mostly used in
feedforward neural networks.
 It is a bounded differentiable real function, defined for real input
values, with positive derivatives everywhere and some degree of
smoothness.
 The sigmoid function is given by the Formula 5.2.1

 (5.2.1)

 The sigmoid function appears in the output layers of the DL


architectures, and they are useful for predicting probability-based
output. [18]


Fig 5.4 Sigmoid Activation Function

5.2.5.2 Rectified Linear Unit (ReLU) Function

 ReLU is the most widely used activation function for deep learning
applications with the most accurate results. It is faster compared to
many other Activation Functions. ReLU represents a nearly Linear
function and hence it preserves the properties of the linear function
that made it easy to optimize with gradient descent methods. The
ReLU activation function performs a threshold operation to each
input element where values less than zero are set to zero. [18]
 The ReLU is given by Formula 5.2.2

 f(x) = max(0, x) (5.2.2)

Fig 5.5 ReLU Activation Function


5.2.5.3 Leaky ReLU (LReLU) Function

 The leaky ReLU, was introduced to sustain and keep the weights
updates alive during the entire propagation process. A parameter
named alpha was introduced as a solution to ReLU’s dead neuron
problem so that the gradients will not be zero at any time during
training.
 LReLU computes the gradient with a very small constant value alpha
(around 0.01) for the negative part of the input. Thus LReLU is
computed as:

 f(x) = x if x > 0, and f(x) = αx otherwise (5.2.3)

 The LReLU has a similar result as compared to standard ReLU


with an exception that it will have non-zero gradients over the
entire duration and hence suggesting that there is no significant
result improvement except in sparsity and dispersion when
compared to standard ReLU and other activation functions. [18]

Fig 5.6 Leaky ReLU Activation Function
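
The three activation functions above can be written in a few lines of NumPy; a minimal sketch, with alpha = 0.01 for the leaky ReLU as mentioned above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes values into (0, 1), Formula 5.2.1

def relu(x):
    return np.maximum(0.0, x)             # zero for negative inputs, Formula 5.2.2

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small non-zero slope for negatives, Formula 5.2.3

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x), relu(x), leaky_relu(x))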


5.2.6 Loss Function

A Loss Function is a method of evaluating how well our algorithm models


our dataset. If the difference between the actual values and the predicted values
is very high, then the loss function will output a very high number. If the
difference is small, then it will output a lower number. When we make a
change in the algorithm to improve the model, our loss function will
tell us whether we are moving in the right direction or not.

Loss Function in YOLO v3:


There are 3 detection layers in the YOLO algorithm. Each of these 3 layers
is responsible for the calculation of loss at three different scales. Then the
losses that are calculated at the 3 scales are then summed up for
Backpropagation. Every layer of YOLO uses 7 dimensions to calculate the
Loss. The first 4 dimensions correspond to center_X, center_Y, width,
height of the bounding box. The next dimension corresponds to the
objectness score of the bounding box and the last 2 dimensions correspond
to the one-hot encoded class prediction of the bounding box.

Here, the following 4 losses will be calculated:

 MSE of center_X, center_Y, width and height of bounding box


 BCE of objectness score of a bounding box
 BCE of no objectness score of a bounding box
 BCE of multi-class predictions of a bounding box. [19]

There are many different types of Loss Functions but the ones that are
used here are
 Mean Square Error/Quadratic Loss/ L2 Loss
 Binary Cross Entropy

5.2.6.1 Mean Squared Error Loss (MSE)

 (5.2.4)


 Mean Squared Error is calculated as the average of squared


difference between predictions and actual observations. It is only
affected by the average value of error without worrying about their
direction. However, because of squaring, the predictions which
are already far from the actual value are affected heavily in
comparison to less deviated predictions. MSE has very effective
mathematical properties due to which it is easier to calculate
gradients in it. [20]

5.2.6.2 Binary Cross Entropy Loss (BCE)

 (5.2.5)

 BCE loss is useful in the tasks of binary classification. In the BCE


loss function we only need one output node to classify the data into
two classes. The output value will be passed into a Sigmoid
Activation function and the range of the output will be (0-1). [20]
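
A minimal PyTorch sketch of the two loss functions as they might be applied to box coordinates and an objectness score; the prediction and target values below are hypothetical:

import torch
import torch.nn as nn

mse = nn.MSELoss()   # used for the box coordinate terms
bce = nn.BCELoss()   # used for the objectness and class terms

pred_box = torch.tensor([0.45, 0.52, 0.30, 0.60])  # center_X, center_Y, width, height
true_box = torch.tensor([0.50, 0.50, 0.25, 0.55])

pred_obj = torch.sigmoid(torch.tensor([1.2]))      # objectness score squashed into (0, 1)
true_obj = torch.tensor([1.0])

print(mse(pred_box, true_box))  # mean squared error over the 4 box attributes
print(bce(pred_obj, true_obj))  # binary cross entropy for the objectness score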

5.3 ARCHITECTURE

The network that is used in this project is based on YOLO V3. And the architecture is
shown in Fig 5.7

Fig 5.7 Yolo v3 Architecture


YOLO v3 works on Darknet-53. This means it has 53 layers in its network which are
trained on ImageNet. For the task of detection, another 53 layers are stacked on top,
giving YOLO v3 a 106-layer fully convolutional underlying architecture.

The newer architecture of YOLO consists of residual skip connections and upsampling. It
makes detection at 3 different scales.

YOLO is a fully convolutional network and the output is eventually obtained by applying
a 1×1 kernel on the feature map. In YOLO v3, detection occurs by applying a 1×1
detection kernel on feature maps of 3 different sizes at three different places in the
network.

There are in total 5 types of layers that are used as building blocks of the YOLO v3 algorithm.
They are explained below.

5.3.1 Convolution Layer

A convolution layer consists of a set of filters whose parameters need to be


learned. The height and width of the filters are smaller than those of the
input volume.
Here in YOLO v3 the shape of the detection kernel will be calculated
based on the formula 5.3.1

Shape of the detection kernel = 1 × 1 × (B × (5 + C)) (5.3.1)

Where,
 B is the number of bounding boxes that can be predicted by a
single cell.
 The number “5” is for the 4 bounding box attributes and one object
confidence
 C will determine the number of classes. [21]

For this project, B = 3, C = 2 (MASK and NO_MASK). Hence, the kernel


size will be 1 × 1 × 21. The feature map that will be produced by this


kernel will have identical height and width to the previous feature map and
will have the detection attributes along the depth as described above.
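
As a quick check of Formula 5.3.1 for this project's settings, a one-line sketch:

def detection_kernel_depth(num_boxes, num_classes):
    # Formula 5.3.1: depth of the 1 x 1 detection kernel is B x (5 + C)
    return num_boxes * (5 + num_classes)

# B = 3 boxes per cell, C = 2 classes (MASK, NO_MASK)  ->  1 x 1 x 21 kernel
print(detection_kernel_depth(3, 2))  # 21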

5.3.2 Shortcut Layer

A shortcut layer is a skip connection similar to the one that is used in the
Resnet.
The output of the shortcut layer is obtained by adding the feature map of the
previous layer and the feature map of the layer indicated by the 'from' parameter
(defined in the configuration file), counted backward from the shortcut layer.

5.3.3 Residual Block

A building block of ResNet is called a residual block or also known as


identity block.
A residual block simply means that the activation of a layer is fast-forwarded to a
deeper layer in the neural network.

5.3.4 Upsample Layer

The working of the upsample layer is pretty simple. It upsamples the
feature map from the previous layer by a factor of 'stride' using bilinear
upsampling.
Upsampling is needed because, as we go deeper into the network, the
size of the feature map keeps decreasing, and upsampling helps make the
feature map bigger so that it can be combined with other layers.

5.3.5 YOLO Layer

The YOLO layer corresponds to the detection layer that was discussed
before. The configuration for the YOLO layer describes 9 anchors, but only the
anchors which are indexed by the attributes of the 'mask' tag are used.

5.4 APPROACH: STANDARD YOLO VS SELF-MODIFIED YOLO

Table 5.1 shows the main difference between the standard YOLO approach and the self-
modified YOLO approach which we have used in this project.


Table 5.1 Standard YOLO Approach Vs Self-Modified YOLO Approach

Standard YOLO Approach:
The flow of standard YOLO is as follows:
 Object Detection Process
o Localization
o Class Prediction
 Thresholding
 Non-max suppression with respect to class

Self-Modified YOLO Approach:
The flow of self-modified YOLO is as follows:
 Object Detection Process
o Localization
o Class Prediction
 Thresholding
 Non-max suppression irrespective of class label
 Bounding Box Labelling

The main reasons for using the self-modified YOLO approach are as follows:

 In many object detection settings there is a chance that an
object is present inside the bounding box of another object, and so, according to the
standard YOLO, both the inner and the outer object could be detected because non-max
suppression is applied with respect to the object class label.

 But in face-mask detection the main object is a face, and in a real-life situation it is
impossible for the face of one person to be inside another person's face bounding
box; that is why there is no harm in doing non-max suppression irrespective of
the object class label. A minimal sketch of this class-agnostic suppression follows.
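
The sketch below uses torchvision's nms operator, which only considers box geometry and confidence score (not class labels); the boxes, scores and IOU threshold here are hypothetical:

import torch
from torchvision.ops import nms

def class_agnostic_nms(boxes, scores, iou_threshold=0.4):
    # boxes: (N, 4) corners (x1, y1, x2, y2); scores: (N,) confidences
    # Returns indices of the boxes that survive suppression, highest score first
    return nms(boxes, scores, iou_threshold)

# Two heavily overlapping detections (even with different class labels) collapse to one box
boxes = torch.tensor([[10., 10., 50., 50.], [12., 12., 52., 52.], [100., 100., 140., 140.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(class_agnostic_nms(boxes, scores))  # tensor([0, 2])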

5.5 APPROACH

Here, we will discuss the working of the YOLO algorithm and how the algorithm detects
the object in the image.

In our project, we have an input image of 416 × 416.


5.5.1 Detection Process

 YOLO v3 makes detections at 3 scales, obtained by down-sampling the dimensions of the input image by factors of 32, 16 and 8 respectively.
 The first detection is made by the 82nd layer, as shown in Fig 5.7. The first 81 layers down-sample the image such that, by the time it reaches the 81st layer, the network has a stride of 32. So, for an input image of 416 × 416, the resulting feature map is 13 × 13. A detection is made here using the 1 × 1 detection kernel, giving a detection feature map of 13 × 13 × 21.
 Next, the feature map from the 79th layer is passed through a few convolutional layers before being upsampled by 2× to 26 × 26. This feature map is then concatenated with the feature map from layer 61, and the combined feature map is passed through a few 1 × 1 convolutional layers to fuse the information from layer 61. The second detection is made by the 94th layer, which outputs a detection feature map of 26 × 26 × 21.
 The same procedure is followed again: the feature map from the 91st layer is passed through a few convolutional layers before being concatenated with the feature map from the 36th layer. As before, a few 1 × 1 convolutional layers follow to combine the information from layer 36. The final (3rd) detection is made at the 106th layer, which gives a feature map of size 52 × 52 × 21.
The 13 × 13 scale is responsible for detecting larger objects, the 26 × 26 scale for medium-sized objects, and the 52 × 52 scale for smaller objects.
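
A quick sanity check of the three grid sizes for a 416 × 416 input (plain arithmetic, not project code):

```python
input_size = 416
for stride in (32, 16, 8):
    grid = input_size // stride
    print(f"stride {stride}: {grid} x {grid} grid, {grid * grid * 3} boxes")
# stride 32: 13 x 13 grid, 507 boxes
# stride 16: 26 x 26 grid, 2028 boxes
# stride 8: 52 x 52 grid, 8112 boxes
```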

5.5.1.1 Bounding Box Evaluation

There are 13 × 13 × 21 = 3549, 26 × 26 × 21 = 14196 and 52 × 52 × 21 = 56784 output values across the three scales.

Of these, 7 values (4 box attributes, 1 objectness score and 2 class scores) are used for each bounding box.


So, there are:
 13 × 13 × 3 = 507
 26 × 26 × 3 = 2028
 52 × 52 × 3 = 8112
detections, i.e. 10,647 detections in total, which are evaluated as follows:

 Predicted Box (Blue)


 Prior Box (Black Dotted)

Fig 5.8 Bounding Box Prediction

Here,
bx, by are the x, y center coordinates and bw, bh are the width and height of our prediction;
tx, ty, tw, th are what the network outputs;
cx and cy are the top-left coordinates of the grid cell;
pw and ph are the anchor dimensions for the box. [22]
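
The transformation shown in Fig 5.8 can be written as follows (a sketch of the standard YOLO v3 box decoding; the function and tensor names are ours):

```python
import torch

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = torch.sigmoid(tx) + cx      # center is offset from the grid cell's top-left corner
    by = torch.sigmoid(ty) + cy
    bw = pw * torch.exp(tw)          # width/height scale the anchor (prior) dimensions
    bh = ph * torch.exp(th)
    return bx, by, bw, bh
```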

During training, MSE loss is used for the box coordinates, and the objectness score is predicted using logistic regression. Its target value is 1 if the bounding box prior overlaps a ground-truth object more than any other bounding box prior does.


Only one bounding box prior is assigned to each ground-truth object.

Fig 5.9 Detection Process

5.5.2 Thresholding

 The YOLO algorithm outputs 10,647 boxes, most of which are irrelevant or redundant. Hence, we have to filter out the unneeded boxes.
 We discard all boxes that have a low probability of containing an object. This is done with a confidence threshold: only boxes whose objectness probability exceeds the threshold are kept.
 This step gets rid of anomalous detections.
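
A minimal sketch of this filtering step (assuming a predictions tensor of shape [10647, 7] whose fifth column is the objectness score; the column layout and the threshold value of 0.5 are assumptions):

```python
import torch

predictions = torch.rand(10647, 7)   # [cx, cy, w, h, objectness, p_mask, p_no_mask]
conf_thresh = 0.5
predictions = predictions[predictions[:, 4] > conf_thresh]   # keep confident boxes only
```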

Fig 5.10 Thresholding


5.5.3 Non-Max Suppression

Even after such thresholding, we end up with many boxes for each detected object, but we only need one box per object. This final box is selected using non-max suppression.
Non-max suppression makes use of a concept called “intersection over union”, or IoU. It takes two boxes as input and, as the name implies, computes the ratio of the area of their intersection to the area of their union.

Having defined the IoU, non-max suppression works as follows.

Repeat until there are no boxes left to process:

 Select the box with the highest probability of detection.
 Remove all the boxes that have a high IoU with the selected box.
 Mark the selected box as “processed”.

This type of filtering ensures that only one bounding box is returned per detected object.
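
A minimal sketch of class-agnostic non-max suppression along these lines (boxes given as [x1, y1, x2, y2]; it ignores the class label, in line with the self-modified approach, and the IoU threshold of 0.4 is illustrative rather than the project's exact value):

```python
import torch
from torchvision.ops import box_iou

def nms_class_agnostic(boxes, scores, iou_thresh=0.4):
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())                 # the "processed" box
        if order.numel() == 1:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thresh]    # drop boxes that overlap the best box
    return keep
```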

Fig 5.11 Non-Max Suppression

5.5.4 Bounding Box Labeling

In the process of non-max suppression we have neglected the class label. To assign the class label, while merging we check whether any of the merged bounding boxes has the class label MASK.

 If YES:
The final merged bounding box is labeled MASK.
 Otherwise:
All of the merged bounding boxes carry the NO_MASK label, so the final merged bounding box is labeled NO_MASK.
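
A sketch of this labeling rule (assuming cluster_labels contains the predicted class labels of all boxes merged into one final detection; the class indices MASK = 0 and NO_MASK = 1 are an assumption, not taken from the project files):

```python
def merged_label(cluster_labels, mask_id=0, no_mask_id=1):
    # if any merged box was predicted as MASK, the final box is labeled MASK
    return mask_id if mask_id in cluster_labels else no_mask_id
```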

Fig 5.12 Bounding Box Labeling

5.5.5 Final Result

Fig 5.13 Final Result

CHAPTER 6
DETAILED
DESCRIPTION AND
IMPLEMENTATION
 DATASET
 MODEL DESCRIPTION
 TRAINING
 DETECTION
 DIRECTORY STRUCTURE

6.1 DATASET

For this project, a dataset with two classes (MASK and NO_MASK) was obtained in the following manner.

Masked-faces dataset:
 The Baidu face-mask detection dataset was downloaded, consisting of approximately 4000 images.
 A video dataset of around 45 videos was gathered from friends and family.
 Finally, additional videos were obtained from YouTube.

NO_MASK dataset:
 The WIDER FACE face-detection benchmark provided approximately 14000 NO_MASK images.

6.1.1 Raw Dataset & Labelling

The data we downloaded was in raw video or image form and had to be labelled before it could be used as input to the YOLO training process. The following tools were used to produce the label files.

6.1.1.1 LabelImg

The data downloaded from the Baidu dataset is of image type and was labelled with the labelImg tool, which produces a text file containing the following values:
 Label Center_X Center_Y Width Height
 Where Center_X, Center_Y, Width and Height are normalized values in the range 0–1.

6.1.1.2 DarkLabel

We used the DarkLabel tool to label our self-made videos and the videos downloaded from YouTube.


The DarkLabel tool produces an output text file in the following format:
 FRAME#,N[,CX,CY,W,H,LABEL]
 Where
FRAME#: frame number
N: number of bounding boxes
CX: Center_X
CY: Center_Y
W: Width
H: Height
 Center_X, Center_Y, Width and Height are not normalized to the 0–1 range that YOLO requires as input.

To extract each required frame and its corresponding label text file (with normalized values), we wrote a Python script; a sketch is shown below.
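
A minimal sketch of such a conversion (the file paths, the class-name mapping and the exact DarkLabel field handling are illustrative assumptions, not the script actually used):

```python
import cv2

def darklabel_to_yolo(video_path, label_path, out_dir):
    class_map = {"mask": 0, "no_mask": 1}          # assumed label names
    cap = cv2.VideoCapture(video_path)
    img_w = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    img_h = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    with open(label_path) as f:
        for line in f:
            parts = line.strip().split(",")
            frame_no, n = int(parts[0]), int(parts[1])
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no)
            ok, frame = cap.read()
            if not ok:
                continue
            cv2.imwrite(f"{out_dir}/{frame_no:06d}.jpg", frame)
            with open(f"{out_dir}/{frame_no:06d}.txt", "w") as out:
                for i in range(n):
                    cx, cy, w, h = map(float, parts[2 + 5 * i: 6 + 5 * i])
                    label = class_map[parts[6 + 5 * i]]
                    # normalize pixel values to the 0-1 range YOLO expects
                    out.write(f"{label} {cx / img_w} {cy / img_h} {w / img_w} {h / img_h}\n")
    cap.release()
```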

6.1.2 Training Dataset

The image and its corresponding text file must share the same name. The data was shuffled and divided into two parts: 80% for training and 20% for validation.
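
A minimal sketch of the shuffle and split (the output file names follow the .txt files shown in Fig 6.14 and Fig 6.15; the image directory layout is an assumption):

```python
import glob
import random

image_paths = glob.glob("data/mask_dataset/images/*.jpg")   # assumed location
random.seed(0)
random.shuffle(image_paths)

split = int(0.8 * len(image_paths))
with open("mask_dataset_train.txt", "w") as f:
    f.write("\n".join(image_paths[:split]))       # 80% for training
with open("mask_dataset_validate.txt", "w") as f:
    f.write("\n".join(image_paths[split:]))       # 20% for validation
```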

6.1.2.1 Image File & .txt File

Fig 6.1 Sample Image File


For visualization purposes, the bounding boxes are drawn in Fig 6.1, and the corresponding .txt file containing the bounding-box values and labels is shown in Fig 6.2.

Fig 6.2 .txt File

6.2 MODEL DESCRIPTION

6.2.1 Configuration File

6.2.1.1 Description

Fig 6.3 to Fig 6.8 show the YOLO configuration file contents.

Fig 6.3 Configuration File: Network Information


Fig 6.4 Configuration File: Convolutional Layer Information

Fig 6.5 Configuration File: Route Layer Information

Fig 6.6 Configuration File: Upsample Layer Information

Fig 6.7 Configuration File: Shortcut Layer Information


Fig 6.8 Configuration File: YOLO Layer Information

6.2.1.2 Parsing

 Fig 6.9 shows how the yolov3.cfg file is parsed to collect the YOLO architecture information into the module_def list.
 Each list element holds one layer's information as a dictionary.

Fig 6.9 Configuration File Parsing
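
A minimal sketch of such a parser (in the spirit of the reference PyTorch-YOLOv3 implementation [30]; the function and variable names are illustrative):

```python
def parse_cfg(path):
    module_defs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                                   # skip blanks and comments
            if line.startswith("["):                       # a new block, e.g. [convolutional]
                module_defs.append({"type": line[1:-1].strip()})
            else:
                key, value = line.split("=", 1)
                module_defs[-1][key.strip()] = value.strip()
    return module_defs
```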


6.2.2 Model Making

Fig 6.10 shows how the YOLO architecture is built from the module_def list.

Fig 6.10 YOLO Architecture Making Procedure


Fig 6.11 YOLO Architecture as Module_list

Fig 6.11 shows the YOLO architecture information stored in the form of a ModuleList:
 The ModuleList contains a list of modules.
 Each of these modules is one layer of the YOLO architecture.
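
A condensed sketch of how the module list can be built from the parsed configuration (only the convolutional case is spelled out; route, shortcut, upsample and YOLO blocks follow the same pattern; this is illustrative, not the full project code):

```python
import torch.nn as nn

def create_modules(module_defs):
    module_list = nn.ModuleList()
    channels = [3]                                  # RGB input
    for mdef in module_defs:
        block = nn.Sequential()
        if mdef["type"] == "convolutional":
            filters = int(mdef["filters"])
            bn = int(mdef.get("batch_normalize", 0))
            block.add_module("conv", nn.Conv2d(channels[-1], filters,
                                               kernel_size=int(mdef["size"]),
                                               stride=int(mdef["stride"]),
                                               padding=int(mdef["size"]) // 2,
                                               bias=not bn))
            if bn:
                block.add_module("bn", nn.BatchNorm2d(filters))
            if mdef.get("activation") == "leaky":
                block.add_module("leaky", nn.LeakyReLU(0.1))
            channels.append(filters)
        # route / shortcut / upsample / yolo blocks are handled similarly
        module_list.append(block)
    return module_list
```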

6.2.3 .data File

Fig 6.12 mask_dataset.data File

6.2.4 .names File

Fig 6.13 mask_dataset.names File


6.2.5 train.txt File

Fig 6.14 mask_dataset_train.txt File

6.2.6 validate.txt File

Fig 6.15 mask_dataset_validate.txt File

6.3 TRAINING

6.3.1 Loss Calculation

As shown in Fig 6.16, the following 4 losses are calculated:

 MSE of center_X, center_Y, width and height of the bounding box
 BCE of the objectness score of a bounding box
 BCE of the no-objectness score of a bounding box
 BCE of the multi-class predictions of a bounding box
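
A condensed sketch of how these four terms can be combined (using standard PyTorch loss modules; obj_mask and noobj_mask select the cells with and without an assigned ground-truth box, and all tensor names are illustrative, not the exact project code):

```python
import torch.nn as nn

mse, bce = nn.MSELoss(), nn.BCELoss()

def yolo_loss(pred_box, tgt_box, pred_conf, tgt_conf, pred_cls, tgt_cls,
              obj_mask, noobj_mask):
    loss_box = mse(pred_box[obj_mask], tgt_box[obj_mask])          # x, y, w, h
    loss_obj = bce(pred_conf[obj_mask], tgt_conf[obj_mask])        # objectness
    loss_noobj = bce(pred_conf[noobj_mask], tgt_conf[noobj_mask])  # no-objectness
    loss_cls = bce(pred_cls[obj_mask], tgt_cls[obj_mask])          # class predictions
    return loss_box + loss_obj + loss_noobj + loss_cls
```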


Fig 6.16 Loss Calculation

6.3.2 Training Process

Fig 6.17 Training Process

6.4 DETECTION

6.4.1 Standard YOLO Vs Self-Modified YOLO

Table 6.1 compares the face-mask detection results of the standard YOLO approach and the self-modified YOLO approach.


Table 6.1 Standard YOLO Approach Vs Self-Modified YOLO Approach

Standard YOLO Approach Self-Modified YOLO Approach

Fig 6.18 Standard Approach Fig 6.19 Self-Modified Approach

Fig 6.20 Standard Approach Fig 6.21 Self-Modified Approach

Fig 6.22 Standard Approach Fig 6.23 Self-Modified Approach


6.4.2 Real Time Detection

Frames per second (FPS) and the current time are displayed during real-time face-mask detection, as shown in Fig 6.24 and Fig 6.25.

Fig 6.24 Real Time Face-Mask Detection

Fig 6.25 Real Time Face-Mask Detection


6.4.3 Detection In video

Fig 6.26 Face-Mask Detection in Video

6.4.4 Detection In Image

Fig 6.27 Face-Mask Detection in Image


6.5 DIRECTORY STRUCTURE

Fig 6.28 Project File Structure

CHAPTER 7
TESTING
 BLACK BOX TESTING
 WHITE BOX TESTING
 TESTING STRATEGY
 TEST SUITES
 TESTING: CHALLENGES &
SOLUTIONS

7.1 BLACK BOX TESTING

Black-box testing treats the system as a ‘black box’, so it does not explicitly use knowledge of the internal structure or code. In other words, the test engineer need not know the internal working of the ‘black box’, i.e. the application. The main focus in black-box testing is on the functionality of the system as a whole. The term ‘behavioral testing’ is also used for black-box testing, and white-box testing is sometimes called ‘structural testing’. Behavioral test design is slightly different from black-box test design because the use of internal knowledge is not strictly forbidden, although it is still discouraged.
Each testing method has its own advantages and disadvantages; there are some bugs that cannot be found using only black-box or only white-box testing. The majority of applications are tested by the black-box method, so we need to cover the majority of test cases so that most of the bugs are discovered this way. Black-box testing occurs throughout the software development and testing life cycle, i.e. in the unit, integration, system, acceptance and regression testing stages.

Advantages of Black Box Testing


 Since the tester and developer are independent of each other, testing is balanced
and unprejudiced.
 Tester can be non-technical.
 There is no need for the tester to have detailed functional knowledge of the system.
 Tests will be done from an end user's point of view, because the end user should
accept the system. (This testing technique is sometimes also called Acceptance
testing.)
 Testing helps to identify vagueness and contradictions in functional specifications.
 Test cases can be designed as soon as the functional specifications are complete.

Disadvantages of Black Box Testing


 Test cases are challenging to design without having clear functional
specifications.
 It is difficult to identify tricky inputs if the test cases are not developed based on
specifications.


 It is difficult to identify all possible inputs in limited testing time. As a result,


writing test cases may be slow and difficult.
 There are chances of having unidentified paths during the testing process.
 There is a high probability of repeating tests already performed by the
programmer.

7.2 WHITE BOX TESTING

White-box testing is also called structural or glass-box testing. It involves looking at the structure of the code: when the internal structure of a product is known, tests can be conducted to ensure that the internal operations are performed according to the specification and that all internal components have been adequately exercised.

Why do we perform white-box testing?

To ensure:
 That all independent paths within a module have been exercised at least once.
 That all logical decisions are verified on both their true and false values.
 That all loops are executed at their boundaries and within their operational bounds, and that internal data structures are valid.

Limitations of White-Box Testing:

 It is not possible to test each and every path of the loops in a program, which means exhaustive testing is impossible for large systems.
 This does not mean that white-box testing is not effective: selecting important logical paths and data structures for testing is practical and effective.
 Some conditions might remain untested, as it is not realistic to test every single one.
 The need to create a full range of inputs to test each path and condition makes the white-box testing method time-consuming.

7.3 TESTING STRATEGY

We divided the testing of the project, using the methods mentioned above, into small tasks.


From the two methods of testing, namely Black Box Testing and White Box Testing, we
are going to use:
White Box Testing for,
 Unit Testing
 Module Testing
 Sub-System Testing.
and Black Box Testing for,
 System Testing
 Acceptance Testing.
As mentioned in our project scheduling and planning, there are a total of four test suites in the training phase.

7.4 TEST SUITES

7.4.1 Test Suite 1

Fig 7.1 Test suite 1: mAP: 0.64


7.4.2 Test Suite 2

Fig 7.2 Test suite 2: mAP: 0.60

7.4.3 Test Suite 3

Fig 7.3 Test suite 3: mAP: 0.74


7.4.4 Test Suite 4

Fig 7.4 Test suite 4: mAP: 0.78

7.5 TESTING: CHALLENGES & SOLUTIONS

 Detecting the face mask in images with the following characteristics:
1. Side-facing (profile) faces
2. Vertically half-visible frontal faces
3. Subject wearing a cap, spectacles or goggles
4. Masks made from a handkerchief or other fancy/designer masks
Solution:
1. Gathered training images with the above characteristics.
 System utilization and optimization:
1. Reducing training time
2. Increasing GPU utilization
3. Maintaining high GPU memory usage
4. Maintaining a high FPS rate of 25–30
Solution:
1. Enhanced the code to perform most of the numerical calculations on the GPU, to take maximum advantage of parallel processing.
2. Avoided unnecessary code connected to I/O peripherals (e.g. status display) during training and detection.


 Another big challenge was to maintain the clarity of the input and output images. To overcome this, we normalized the bounding boxes to [0, 1] and then expanded them back according to the output image size.
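
A small sketch of rescaling a normalized box back to the output image size (a trivial helper; the function and parameter names are ours):

```python
def denormalize(box, img_w, img_h):
    cx, cy, w, h = box                                   # values in [0, 1]
    return cx * img_w, cy * img_h, w * img_w, h * img_h
```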

CHAPTER 8
LIMITATIONS AND
FUTURE
ENHANCEMENT
 LIMITATIONS
 FUTURE ENHANCEMENTS

8.1 LIMITATIONS

1. Very distant faces cannot be detected.
2. Only moderate results are obtained for mask types other than medical masks such as:
 Surgical masks
 N-95 masks
 Commonly used masks
3. Difficulty in detecting horizontal and inverted faces.
4. Problems in detecting half-worn masks.
5. The system sometimes outputs MASK when the face is covered by a hand.
6. Unrealistic faces, with or without a face mask, cannot be detected, for example:
 Animated characters
 Emojis

8.2 FUTURE ENHANCEMENTS

 The first step towards future enhancement would be to improve accuracy when detecting uncommon and fancy masks.
 Overcome the limitation in detecting horizontal and inverted faces, as well as the inefficiency in detecting half-worn masks.
 A software application could be designed that sends alerts (SMS, email or push notification) whenever the software detects a face without a mask.

CHAPTER 9
CONCLUSION
 CONCLUSION

9.1 CONCLUSION

An accurate and efficient face-mask detection system has been developed, achieving promising results (mAP of 0.78 on the final test suite). The project uses recent techniques from the fields of computer vision and deep learning. A custom dataset was created using labelImg and DarkLabel. The system can be used for real-time face-mask detection in airports, hospitals, offices, etc.


BIBLIOGRAPHY

REFERENCES

1. https://en.wikipedia.org/wiki/Torch_(machine_learning)
2. https://numpy.org/
3. https://en.wikipedia.org/wiki/Python_Imaging_Library
4. https://pythonhosted.org/keras-tqdm/
5. https://docs.python.org/3/library/argparse.html
6. https://www.windowssearch-
exp.com/search?q=matplot+library&qpvt=matplot+library
7. https://github.com/Robpol86/terminaltables
8. https://robpol86.github.io/terminaltables/
9. https://github.com/pytorch/vision
10. https://www.tensorflow.org/tensorboard?hl=ru
11. https://academic.microsoft.com/topic/119857082
12. https://intellipaat.com/blog/tutorial/artificial-intelligence-tutorial/ai-vs-ml-vs-dl/
13. https://en.wikipedia.org/wiki/Artificial_neuron
14. https://towardsdatascience.com/an-introduction-to-convolutional-neural-networks-
eb0b60b58fd7
15. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-
networks-the-eli5-way-3bd2b1164a53
16. https://medium.com/@amrokamal_47691/yolo-yolov2-and-yolov3-all-you-want-
to-know-7e3e92dc4899
17. http://www.thresh.net/
18. https://deepai.org/publication/activation-functions-comparison-of-trends-in-
practice-and-research-for-deep-learning
19. https://towardsdatascience.com/calculating-loss-of-yolo-v3-layer-8878bfaaf1ff
20. https://towardsdatascience.com/understanding-different-loss-functions-for-neural-
networks-dd1ed0274718
21. https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b


22. https://towardsdatascience.com/review-yolov3-you-only-look-once-object-
detection-eab75d7a1ba6
23. https://pjreddie.com/darknet/yolo/
24. https://arxiv.org/pdf/1506.02640.pdf
25. https://arxiv.org/pdf/1612.08242.pdf
26. https://pjreddie.com/media/files/papers/YOLOv3.pdf
27. https://towardsdatascience.com/deep-learning-in-science-fd614bb3f3ce
28. https://github.com/X-zhangyang/Real-World-Masked-Face-Dataset
29. http://shuoyang1213.me/WIDERFACE/
30. https://github.com/eriklindernoren/PyTorch-YOLOv3
31. https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-
28b1b93e2088
32. https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/
33. https://cs231n.github.io/convolutional-networks/
34. https://arxiv.org/pdf/1311.2524.pdf
35. https://towardsdatascience.com/setup-an-environment-for-machine-learning-and-
deep-learning-with-anaconda-in-windows-5d7134a3db10

COURSES

 Machine Learning by Andrew Ng:


https://www.youtube.com/playlist?list=PLLssT5z_DsK-
h9vYZkQkYNWcItqhlRJLN
 Convolutional Neural Networks by Andrew Ng:
https://www.youtube.com/playlist?list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-
KnDzF
 OpenCV:
https://www.youtube.com/playlist?list=PLQVvvaa0QuDdttJXlLtAJxJetJcqmqlQq
 PyTorch:
https://www.youtube.com/playlist?list=PLQVvvaa0QuDdeMyHEYc0gxFpYwHY
2Qfdh
 YOLO v3:
https://www.youtube.com/playlist?list=PLbMqOoYQ3MxxArhAqvki_WoWBTC
c8fDHG
