
Lauri Solin

Face mask wearing detection with neural network

Metropolia University of Applied Sciences


Bachelor of Engineering
Information and Communication Technology
Bachelor’s Thesis
17 May 2021
Abstract

Author Lauri Solin


Title Face mask wearing detection with neural network

Number of Pages 62 pages + 3 appendices


Date 17 May 2021

Degree Bachelor of Engineering

Degree Programme Information and Communication Technology

Professional Major Smart Systems

Instructors
Sami Sainio, Senior Lecturer

Face mask wearing has been recommended by public health experts (CDC) as a good public health practice since the start of the COVID-19 pandemic in the spring of 2020. As a result, the purpose of this project was to develop a smart system which can determine whether pedestrians are wearing a face mask or not. Use cases could include access control or a monitoring system with a camera in public spaces. A system like this could be installed at grocery store entrances or at entry points to public transportation such as train stations or metro stations.

In order to monitor mask usage, automated systems should be used. To this end TensorFlow
object detection API was used to train an object detector to detect unmasked faces and face
masks from images. The TensorFlow library and RTX 2070S GPU (8 GB) were used in
training. Scripting was done using Python and Jupyter Notebooks. The Anaconda environ-
ment was used for development. The chosen dataset FMLD (face mask label dataset) was
downloaded and used. It contained three classes: incorrectly worn mask, unmasked face
and face mask. A large subset of 93% was sampled from the total FMLD dataset of 41,900
images.

During the first training attempt, the aim was to detect unmasked persons, face mask wearers, and incorrectly worn face masks. The face mask wearing detector was successfully trained and tested on the evaluation set. Based on the results, the minimum goals were achieved. The achieved mean average precision (mAP) at the IoU threshold of 0.5 was 70.5%. This represents a slight improvement of +3.3 percentage points mAP compared to a pre-trained MTCNN face detector.

During the second training attempt with changed hyperparameters and dropping the incor-
rectly worn masks from the dataset, the obtained performance was even higher. With this
second model, only face masks and unmasked faces were being detected. During evalua-
tion, the mean average precision was measured at 86.4% mAP@IoU=0.5. The improvement
compared to the baseline MTCNN face detector model was +19.29 percentage points mAP.

Keywords TensorFlow, face detection, face mask, supervised learning, computer vision, object detection, neural network, deep learning
Abstract

Author Lauri Solin

Title Face mask wearing detection using neural networks

Number of Pages 62 pages + 3 appendices

Date 17 May 2021

Degree Bachelor of Engineering

Degree Programme Information and Communication Technology

Professional Major Smart Systems

Instructors Sami Sainio, Senior Lecturer

The goal of the thesis was to develop a program for detecting face mask wearing, which could be used to monitor the progression of the COVID-19 pandemic. Other uses could include access control, for example in indoor spaces where face masks are mandatory. Different implementation approaches and existing face detection solutions were studied. The chosen solution was to retrain an object detection model to recognize face masks and unmasked faces in photographs.

The main tool used was the TensorFlow object detection API. Other tools were the TensorFlow library, Jupyter Notebook and the Python programming language for dataset pre-processing. The project was carried out on the Windows operating system in an Anaconda Python environment. The graphics card used was an Nvidia RTX 2070 Super (8 GB).

The chosen dataset was the FMLD (face mask label dataset) collected by other researchers, containing about 41,900 photographs. The classes in the dataset were: incorrectly worn face mask, unmasked face and face mask. The entire dataset could not be utilized; the sample used was about 93% of the original data. The mask wearing detection model was trained using the TensorFlow object detection API, and the minimum goals were achieved.

On the first training attempt, the mean average precision achieved by the object detection model was 70.5% mAP using an IoU threshold of 0.5. This was an improvement of +3.3 percentage points mAP over an MTCNN neural network model that had been pre-trained for face detection. On the first training attempt, the goal was to detect unmasked persons, face mask wearers and persons wearing a mask incorrectly.

On the second training attempt, using new hyperparameters and dropping incorrectly worn face masks from the dataset, an even better result was achieved. On this second attempt, only unmasked and masked faces were detected. The mean average precision achieved this way was 86.4% mAP@IoU=0.5. Measured this way, the improvement over the MTCNN baseline was +19.29 percentage points mAP.

Keywords neural networks, face detection, face mask, supervised learning, computer vision, object detection, deep learning, TensorFlow
Contents

List of Abbreviations

1 Introduction 1

2 Project plan 2

3 Research plan 4

4 Object detection and face detection 5

4.1 Pre-trained face detectors 9


4.1.1 Haar Cascade 9
4.1.2 MTCNN neural networks-based face detector 11

5 Theory of deep learning 13

5.1 Deep learning vs machine learning 13


5.2 Deep learning training pipeline 14
5.3 Activation functions 16
5.4 Convolution 17

6 Object detection terminology 18

6.1 Precision 18
6.2 Annotation and label 19
6.3 Recall 19
6.4 Mean average precision (mAP) 20
6.5 Intersection over Union (IOU) 21
6.6 IoU-threshold 22
6.7 Confidence score 24
6.8 Non-maximum suppression algorithm (NMS) 24
6.9 Objectness score (probability of containing object) 25
6.10 PASCAL performance metrics 25
6.11 COCO performance metrics 26

7 Deep learning-based object detectors 26


7.1 R-CNN architectures 26
7.2 SSD and anchor boxes 29
7.3 EfficientDet architecture 31

8 Proposed solution 34

9 Setting up the project environment 35

9.1 Tools, installation and setup 35


9.1.1 Tools 35
9.1.2 Installation 36
9.1.3 Setup 37

10 Project work tasks 39

11 Dataset pre-processing steps 40

12 First training run 44

12.1 Training settings and hyperparameters 44


12.2 Results achieved 45
12.3 Image inspection from test set 48
12.4 Suggested improvements 51

13 Second training run 52

13.1 Training settings and hyperparameters 52


13.2 Results achieved 52
13.3 Image inspection from test set 54

14 Conclusions 55

References 57

Appendices
Appendix 1. Tested requirements with version numbers
Appendix 2. First training run, EfficientDet D0, hyperparameters, 3 classes
Appendix 3. Second training run, EfficientDet D0, hyperparameters, 2 classes
List of Abbreviations

Bi-FPN Bi-directional feature pyramid network. A new variant of the feature pyramid
network used in the EfficientDet style object detectors.

CDC Centers for Disease Control and Prevention in the United States.

CNN Convolutional neural network. Typically contains convolutional layers, pool-


ing layers and possibly some fully connected layers.

CUDA NVIDIA CUDA® is a parallel computing platform and programming model


which allows software programs to be run on graphics cards’ CUDA cores
for improved parallel computing performance.

CUDA Toolkit For this project, the toolkit mostly provides GPU-accelerated
libraries that will help in training neural networks. However, the CUDA Toolkit also contains other components such as a C/C++ compiler. The CUDA Toolkit and CuDNN were both required for GPU-accelerated neural network training in TensorFlow.

CuDNN CuDNN is an NVIDIA CUDA® Deep Neural Network library. It is a GPU


accelerated library of primitives for neural networks training and it is com-
patible with TensorFlow.

MTCNN Multi-task cascaded convolutional network. A deep learning-based model


used for face detection.

NMS Non-maximum suppression algorithm. A post-processing algorithm for


pruning extraneous bounding box proposals.

R-CNN Region-based convolutional neural network. It is an object detection archi-


tecture, the predecessor to fast R-CNN and faster R-CNN architectures.
Sometimes it is used to describe the entire family of object detectors.
SSD Single shot multibox detector. A type of one-stage object detector, for ex-
ample SSD-MobileNet which uses MobileNet as a base convolutional neu-
ral network.

TFOD TensorFlow object detection API, an abbreviation composed by the author of this thesis.

TFOD-API TensorFlow object detection API. An abbreviation used by the thesis author.

THL Terveyden ja hyvinvoinnin laitos. Finnish institute for health and welfare.

XML Extensible Markup Language. A human readable and machine readable


markup language for documents. Filetype for PASCAL VOC annotation
files.

YOLO You Only Look Once. It is an object detection architecture in computer vision. It is neural network-based. The key advantage of it is that it is a single
neural network which can perform multiple object detection tasks of local-
izing and classifying in one forward pass. Several versions of YOLO have
been implemented, but the original architecture was first invented in 2015
by Redmon et al. [18.]

1 Introduction

The COVID-19 virus outbreak became a pandemic in the spring of 2020. In the early
days of the pandemic, some acclaimed health experts recommended face mask wearing
since a vaccine did not exist yet at the time. According to a Korean professor of infectious
diseases, Kim Woo-Joo, Korea University College of Medicine, face mask wearing was
effective in South Korea. [1.] In addition, the American Centers for Disease Control and
Prevention (CDC) and the world-renowned infectious diseases expert Dr. Anthony Fauci
had recommended face mask wearing from relatively early on at the start of the pan-
demic. [2]

In public settings, such as indoors at a grocery store or in a train, the most common modes of COVID-19 transmission seemingly are droplets (from coughing or sneezing), direct contact, or aerosolized viral particles from breathing. Direct contact means accidentally transmitting the virus, for example by wiping the nose, shaking somebody's hand and thereby transmitting the virus to the other person. According to Professor Kim Woo-Joo, the droplet size limit is around 5 microns (5,000 nanometers), and anything smaller is considered an aerosol. Droplets typically fall to the ground more quickly, whereas aerosols can linger in the air for longer due to airflow. [1.]

It was established that FFP-3 filtration level respirators with corresponding eye protec-
tion, worn correctly, would be effective protection against the coronavirus. FFP-3 filtration
level respirators are mostly used in the care of COVID-19 patients in a hospital setting, where the aerosol-based virus infection risk can be higher from medical procedures such as intubation. [3.] The most commonly available face masks are of the surgical mask variety, the filtration level of which is significantly worse than that of FFP-2 or FFP-3 respirators against aerosolized coronavirus particles. As of March 2021, the Finnish Institute for Health and Welfare (THL) recommends widespread face mask wearing. [4.]

Given the current circumstances, the goal of the project was to develop software to monitor face mask wearing habits in public spaces. The benefit would be greater compliance with face mask mandates. Assuming good enough accuracy, such a system could slow the spread of COVID-19 by preventing the access of unmasked people to an indoor space, for example.

2 Project plan

The plan was to design a system that could detect from a photo of pedestrians if they
were wearing face masks or not. The motive was to increase the public’s awareness and
remind people to wear a mask consistently when entering indoor places like a shopping
center or a grocery store. The American CDC recommends public face mask wearing to
everyone aged 2 and above. [5]

A practical purpose could be monitoring mask wearing or reminding people about mask wearing in public spaces. An added bonus would be that the system could congratulate people for mask wearing, providing a morale boost. Similar warning message-based systems exist in Finland for traffic and speeding (see figure 1). In such a traffic system, there is a speed detector and a traffic-sign-style display at the side of the road which shows the current speed, or a frowning sad face if the motorist is found speeding. Another practical use could be access control to prevent unmasked persons from entering a worksite.

Figure 1. Speeding warning indicator. Copied from Karjalainen.fi (2014) [6]

The final system would need to have a camera and some trigger to take a picture. The
system would also need to find faces from the picture and analyze the recorded data.
Grocery stores, supermarkets and shopping malls typically already have automatic doors
at the entrance. An automatic door opening signal, for example from the door’s motion
detecting sensor would trigger the camera to take a photo from the outside. Photos would
only be taken from people entering the mall from the outside, to ensure that frontal faces
would be as visible as possible. [7.]

The initial plan was to use a pre-trained face detector to locate the faces from the image,
and then train a binary image classifier neural network to classify the faces as masked
or unmasked. The problem with this type of design would be that in case of bad input
data from the face detector, the classifier part would either receive bad data, or simply
not receive any data at all if the detector is not sensitive enough to detect all faces. A
backup plan was to retrain an object detector system towards face mask detection and
face detection (i.e., face mask wearing detection).

The backup plan to train an object detector towards face mask and face detection was
influenced by a study from the researchers Singh et al. They trained and tested YOLO and faster R-CNN object detectors for face mask and face detection purposes. The intended
purpose of their detector system was to perform monitoring of face mask wearing. In
theory, the backup plan was thus achievable. [8, pp. 1-2]

Face detection turned out to be a complex task in the research phase. Face detection from a raw image, when using deep learning, is considered a multi-task problem. It means that, for example, the deep learning face detector must perform a regression task when finding a bounding box for a face, while another neural network performs classification. Another subproblem is searching for multiple faces in an image when the faces can be of differing sizes. It is not known in advance how many face mask wearers or regular faces there are in the input image. [9.]

One solution for face detection was recently popularized in the MTCNN architecture. MTCNN has a proposal neural network to generate candidate windows of different sizes for the subregions which might contain faces; these are then further refined by other neural networks. Classification of the candidate face is necessary at least to the extent of separating objects from the background of the image. Classification uses a different loss function than the regression problem; bounding box refinement is the regression problem. The difference in the subtasks to achieve and their loss functions is what makes face detection a multi-task problem. Regression problems such as facial landmark localization and bounding box refinement can use loss functions such as Euclidean loss. The face classification task can use, for instance, cross-entropy loss. [9; 10] However, the end result of face detection is always one or more
bounding boxes containing faces, assuming there is at least one. This is depicted in
figure 2 below.

Figure 2. MTCNN face detector predicts bounding box and facial landmarks. Copied from de
Paz Centeno (2021) [11]

3 Research plan

In the beginning of this project, more research had to be conducted about effective face
detection methods, and later about object detection methods. Computer vision and face
detection from images is a vast field of study but the research phase had to focus on
practically implementable methods which could show signs of effectiveness. The key
questions to be decided and determined in the research phase of the thesis were:

• How to do face detection in general?


• Could a pre-trained face detector be effective at face mask detection?
• Can the selected model be effective at face mask detection and face de-
tection?
• Is the selected model available in a framework?
• Does a framework exist to train the selected model?
• Are there datasets available that are compatible with the model and frame-
work that can be used for face mask wearing detection?

For the project implementation, there were two options to pursue: using a pre-trained face detector together with a trained binary image classifier to classify faces as masked or unmasked, or training an object detector model for the task of detecting masked and unmasked faces. Both avenues were considered, and it was decided to train the object detector.

4 Object detection and face detection

Face detection belongs to the category of object detection in computer vision. Face detection means that the system should analyze an image of a given resolution and find as many faces as possible from the picture, if it contains any. [9.] Object detection does the same,
except that the class to detect is something other than faces, and there can be multiple
classes. [12.] Object detection requires finding and locating multiple objects if they are
present in the image. Figure 3 below illustrates the differences of tasks to achieve in the
object recognition field of computer vision. Semantic segmentation is a pixel-based task
to classify and locate categories from an image, and it divides the image to regions. The
difference between semantic and instance segmentation is that instance segmentation
allows counting and marking of individual objects. Instance segmentation utilizes object
masks which color all the pixels of an object in the training data, and the model must
predict the boundaries of the objects as closely as possible. Localization with classifica-
tion differs from object detection in that only one object instance is being searched for,
whereas many objects of multiple classes can exist in object detection. Face detection
can be solved using object detection techniques. [13.] An example purpose of object
detection is an advanced driver assistance system, which would need to detect bicyclists,
pedestrians, and cars, to name a few examples. [14.]

Figure 3. The categories of tasks in the object recognition field of computer vision. Copied from
Michelucci (2019) [13].

Much of the challenge in implementing face detection is that the number of faces in the
picture is not known in advance and neither is the size of the faces in relation to the
whole image. The bounding box also must be drawn on a detected face, so some level
of classification is necessary to distinguish face from irrelevant background details. [9;
12] Human faces do have a considerable variety in terms of skin color, eye color and
hair style. The viewing angle can complicate face detection because then the full-frontal
face would not be visible. A side profile is considered more difficult to detect than a full-
frontal face. [9; 15, pp.5525-5527]

Additional difficulty comes from the variety of poses, viewing angles, and extreme facial expressions that pedestrians can have. Occluded and partially observed faces are also more challenging to detect. [15, pp.5525-5527] Full-frontal face detection has in recent times been considered readily solvable. Most current smartphones have embedded software that performs face detection and bounding box drawing when the user takes selfies with the front camera or photos of people with the back camera. [16.]

Face localization is similar to face detection, except that only one face is expected to be found in a given picture. Face localization might therefore be sufficient for applications such as finding the face in a portrait and drawing a bounding box on it. [12] In terms of the problem domain, face localization would suffice as a solution to access control enforcement for face mask wearing. Nevertheless, if multiple persons must be checked at the same time from an image, face detection is necessary.

Face detection has been a well-studied field within computer vision. Interest has been devoted to it primarily because face detection is a necessary first step in any facial recognition system. Considerable interest has also been devoted to the subject because of interest from the fields of law enforcement, border control and access control systems. Face detection and face tracking are also needed in capturing and analyzing faces from video footage. [17; 16] The so-called inference time of the face detector should be low. Low inference time allows higher frames-per-second operation. Inference time and mean average precision are typically the important performance metrics of object detection models. Typically, however, only one of the two, inference time or mean average precision, can be optimized, especially in applications such as real-time object detection from a video feed. [18.]

The detection of face mask wearing in the current COVID-19 pandemic is problematic because the usual facial landmarks used in face detection, e.g. the nose and the corners of the mouth, are not visible. The current pretrained face detectors in open-source frameworks, such as OpenCV, may not perform equally well in localizing and detecting face mask wearing persons in images, although unmasked persons would of course be detected as usual. As COVID-19 was a new phenomenon, detecting face mask wearing had not been a priority for datasets and face detectors.

Some new face mask wearing detectors had been trained very recently by other researchers during 2020. An example is the AIZooTech face mask detector, a well-performing, recently trained face mask wearing detector by Chinese developers. The detector can monitor whether a person is wearing a face mask or not. According to the developers' GitHub, it is based on a deep learning object detection model. [19.]

The researchers Batagelj et al. had studied the performance of existing face detectors
and some newly developed face mask wearing detectors (such as AIZooTech) on the
WIDERFACE dataset and the MAFA dataset, and a combination of the WIDERFACE
and MAFA datasets. Batagelj et al. sampled 4,935 images from MAFA, 2,217 images
from WIDERFACE and a number of images combined from MAFA and WIDERFACE.
These sampled subsets and the combination dataset are shown in table 1 below. The
combination dataset is the DB column. It was not explicitly clear how many images were
in the DB column of the table, but the researchers claim [20, p. 13] that the DB column
contains “the combined set of all the MAFA and Wider Face images.” The researchers
used a bootstrapping process in the testing procedure for sampling purposes. [20, pp.10-
15]

The MAFA dataset is a less well-known dataset (compared to WIDERFACE) of masked faces that existed before the COVID-19 pandemic. The entire MAFA dataset itself contains 30,811 images, and the WIDERFACE dataset is of roughly similar size. [21] The results of the tests that Batagelj et al. conducted on face detectors are shown in table 1 below. [20, pp.10-15]

Table 1. Performance of pre-trained face detectors. Copied from Batagelj et al (2021) [20,
pp.13-14]

Notably, the MTCNN face detector implementation was not able to detect masked faces very well, as can be seen in the MAFA column, where the average precision is 43.47%. Measured in average precision, the performance of MTCNN on MAFA drops to roughly half of its performance on WIDERFACE. In table 1, Batagelj et al. refer to AP as the PASCAL VOC 2012 mean average precision metric with an IoU threshold of 0.5. [20, pp.10-14] The mean average precision and the IoU threshold will be explained later in chapter 6.

The WIDERFACE dataset contains mostly regular faces without face masks, but it is still
considered a challenging face detection dataset with a wide variety in the faces to detect.
The entire WIDERFACE dataset contains 32,203 images. [22.]

Batagelj et al. constructed their own FMLD dataset as well, by sampling images and annotations from the MAFA dataset's face mask images and from the WIDERFACE dataset for their own proposed solution to the face mask wearing detection problem. The FMLD dataset itself was not downloadable, but PASCAL VOC XML annotations and lists of filenames for the dataset were provided on their GitHub repository. [23; 20] Based on the test results from Batagelj et al., it was estimated that the pretrained Haar Cascade style detector and the MTCNN face detector would have insufficient average precision in face mask detection specifically and would thus be unsuitable.

A popular object detection benchmark is MS COCO (common objects in context) object detection, created by Microsoft. The exact details of the dataset and the number of object classes have varied over the years. In the COCO 2017 benchmark there were 80 categories (classes) of objects to classify and detect from images, and approximately 164,000 images. Notably, the COCO dataset does have a (human) person class as a labelled object detection class, but there is no dedicated face class. [24; 25] Fine-tuning a face detector model would have been the most preferred option for a system intended to detect both unmasked and masked faces, as opposed to a general-purpose object detector.

4.1 Pre-trained face detectors

OpenCV library’s Haar Cascade face detector was investigated and a third-party library
implementation of an MTCNN face detector was studied for viability toward face mask wearing detection. [26; 11] Their working principles were examined.

4.1.1 Haar Cascade

Researchers Paul Viola and Michael J. Jones invented a novel algorithm for object detection in 2001. In their acclaimed and highly cited paper, they focused more specifically on face detection. [27.] The Haar Cascade style face detector, as implemented in the OpenCV Python library, is based on the machine learning-based solution that Viola and Jones developed.

The structure of the Haar Cascade classifier is a degenerate decision tree. The image input comes to the decision tree and is analyzed sequentially for features. Only if a feature is accepted in the decision process will a new feature further on in the tree be tested against the input. The features are based on rectangular areas of different sizes applied to sections of the image to try to decide whether there could be a face in that region. [27; 28] In practice, a feature is tested on a sub-window basis to see which sub-windows will be passed along in the degenerate decision tree, and the negative regions of the image (where there are no faces) are rejected. Figure 4 below shows a basic diagram of this. [28]

Figure 4. Haar Cascade classifier processes the sub-windows of the image. Copied from Viola
and Jones (2001)[28]

Difference calculation is performed between the sums of pixel intensities of the shaded
area and the white area of the filter to see how well the sub-window conforms to a pos-
sible location of a face. Figure 5 below shows an example of a rectangular Haar-feature
being applied to the image. [28] Viola and Jones also improved the efficiency of calcu-
lating the features and sums of pixel intensities for the rectangular area with their novel
integral image method. The integral image method makes calculating the feature pixel
intensities of the rectangular areas efficient. [27; 28]

Viola and Jones had to research which features were the most effective for detecting faces in the degenerate decision tree style detector. All tested features were rectangular. However, 180,000 possible features existed for each sub-window. The researchers used the AdaBoost training algorithm in their study to determine the most efficient types of features for face detection. Their original research used a dataset with small 24x24 resolution images. [28.]

Figure 5. Haar Cascade features tested on a face. Copied from OpenCV. [29]

In preliminary testing in the very early stages of the thesis work, the Haar Cascade face detector from OpenCV was tested on images of masked COVID-19 protesters downloaded from unsplash.com. The detection results were worse than what Jason Brownlee obtained in his blog post testing the Haar Cascade face detector for regular face detection. [9.] The Haar Cascade face detector was not pursued further.
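For reference, the kind of preliminary test described above can be run with a few lines of OpenCV; the sketch below is illustrative only, with a placeholder image file name, example tuning values, and the frontal face cascade file that ships with the OpenCV Python package.

import cv2

# Load the pre-trained frontal face cascade bundled with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

# Read a test image (placeholder file name) and convert it to grayscale,
# since the Haar features are computed from pixel intensities.
image = cv2.imread("test_image.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale slides the detector over the image at multiple scales.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw the detected bounding boxes and save the result.
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("test_image_detections.jpg", image)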

4.1.2 MTCNN neural networks-based face detector

MTCNN stands for multi-task cascaded convolutional neural network. It is a deep learning-based approach to solving the face detection problem. In MTCNN, an image pyramid of differently scaled versions of the input image is created first. Then a P-Net proposal network generates promising regions of the image which might contain faces. The next stage is the R-Net refining network, which creates bounding box proposals, and the third stage is the O-Net output network, which creates facial landmarks. Figure 6 below shows the basic process of the MTCNN face detector. [9.]

Figure 6. MTCNN face detection. Copied from Zhang et al (2016) [10]

NMS stands for non-maximum suppression, and it is an algorithm that is run to discard the worse bounding box proposals in cases where multiple bounding boxes are predicted for the same object. Only the most confidently proposed bounding box is retained. It is an algorithm that is used in other object detector architectures as well. A more detailed explanation is given in chapter 6.

The multiple tasks that the overall network performs are face classification, bounding box regression, and facial landmark localization for each detected face. The input image is processed in a pipeline manner, first through P-Net, then R-Net, and finally O-Net. [9; 10] The downside of this architecture is that there seems to be no end-to-end training process for fine-tuning or training this type of complex multi-task network in an available deep learning framework. [10] End-to-end training means that the whole system can be trained in one attempt at the same time, instead of separately freezing the learning process for some of the neural networks. An example of end-to-end training in practice is the training of an image classifier neural network in the Keras deep learning library. [30.] A third-party implementation of a pre-trained face detecting MTCNN neural network was available, but it was not pursued any further due to the research into face detector effectiveness conducted by Batagelj et al., as described in table 1 earlier in chapter 4. [20]
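For completeness, the third-party MTCNN implementation mentioned above can be used roughly as sketched below; the image file name is a placeholder, and the dictionary keys follow the library's documented output format.

import cv2
from mtcnn import MTCNN

# Load an image (placeholder file name); the library expects an RGB array.
image = cv2.cvtColor(cv2.imread("test_image.jpg"), cv2.COLOR_BGR2RGB)

# The pre-trained P-Net, R-Net and O-Net weights are bundled with the package.
detector = MTCNN()
detections = detector.detect_faces(image)

# Each detection contains a bounding box, a confidence score and facial landmarks.
for det in detections:
    x, y, w, h = det["box"]
    print("face at", (x, y, w, h), "confidence", det["confidence"])
    print("landmarks:", det["keypoints"])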

5 Theory of deep learning

5.1 Deep learning vs machine learning

Deep learning and machine learning share some commonalities but there are differences
as well, according to the definition given by François Chollet. Machine learning uses data
and the ground truths to generate rules about the process. These rules can then be
applied to unfamiliar new datasets to process them and hopefully get correct and sensi-
ble results. Machine learning requires input data, true labels (ground truth), and a way to
measure the success of the progress of the task. Then a feedback signal can be given
back to adjust the process based on the difference from the predictions and true labels,
and the algorithm can be adjusted. According to Chollet, deep learning belongs to the
category of machine learning, being a subset of it as figure 7 below illustrates. [31.]

Figure 7. AI, machine learning, and deep learning. Copied from Chollet (2017) [31.]

Deep learning differs from machine learning in the sense that it typically uses deep neural networks as the predicting model. The model contains weights and bias values; the model takes the data as input together with the true target labels, and then adjusts the process during training time. Then, during operation, the model takes only the input data as input and uses the weights and biases to generate predictions about the data. Machine learning contains some methods that exhibit only “shallow learning”, which learn only one- or two-layered representations of the data, in contrast to deep learning. [31.]

5.2 Deep learning training pipeline

Chollet describes a key advantage that neural networks have in deep learning. In deep
learning, the neural network directly learns the required features to solve the problem,
so that the problem-solving process is simplified. In machine learning, typically more
feature engineering is required than in deep learning. [31.] This has certainly been the
recent experience in categories of deep learning such as image classification, in which
neural networks have achieved completely superior performance to other machine learn-
ing methods. The supervised learning process for deep learning is also shown in figure
8 below.

Figure 8. Deep learning pipeline for training. Copied from Chollet (2017) [31]

Note that in Figure 8 above, the weight update is backpropagation. In supervised learning
with neural networks, the loss function is used to monitor the task’s progress and the
loss function needs to be minimized. The method to monitor the progress of the deep
learning task is to use the gradient descent of the loss function, and backpropagation to
update the weights accordingly. [31.]

Different loss functions are available based on the task to accomplish, but they take as input the predictions made in the forward pass of the network and compare them to the true targets. An example of a loss function is given below. It is called the mean squared error and can be used in regression problems. [18.]

E(W, b) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \qquad (1)

where \hat{y}_i = prediction, y_i = true label, b = bias vector, E(W, b) = loss function, W = weights matrix

Formula 1 above is copied from Elgendy (2021) [18].

The gradient is the derivative of the loss function with respect to the weights. The gradient indicates the direction in which to move the weights towards better weights. The weights of the neural network are then adjusted in backpropagation to minimize the loss function. This is possible because the gradient is computed as the partial derivatives of the loss function with respect to each weight, so that a new weight value can be computed by taking a small step toward a better weight. The learning rate controls the step size. This procedure can be described by the formulas below. [32.]

W_{n+1} = W_n - \gamma \nabla J(W_n) \qquad (2)

\nabla J(W) = \begin{pmatrix} \partial J(W) / \partial W_1 \\ \vdots \\ \partial J(W) / \partial W_{n_x} \end{pmatrix} \qquad (3)

where \gamma = learning rate, W_n = old weight, W_{n+1} = new weight, \nabla J(W) = gradient

Formulas 2 and 3 above are copied from Michelucci (2018) [32].

Each new weight update can then be done during the backpropagation phase using the chain rule of calculus. When the loss function is minimized, the model performance will increase, and it will be more accurate. It should, however, be noted that gradient descent and backpropagation do not guarantee that the global minimum of the loss function will be reached, unless the loss function is convex, which is not the general case. [33.] A popular and effective implementation of gradient descent used in practice is stochastic minibatch gradient descent (SGD).
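To make formulas 1 to 3 concrete, the short sketch below performs one gradient descent step for a linear model under the mean squared error loss; the toy data, the learning rate value and the use of NumPy are illustrative assumptions rather than part of the cited sources.

import numpy as np

# Toy data: N samples with a single feature, and a linear ground truth.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

W = np.zeros(1)          # weights (here a single weight)
b = 0.0                  # bias
learning_rate = 0.01     # gamma in formula 2

def predictions(X, W, b):
    return X @ W + b     # y_hat

def mse_loss(y_hat, y):
    return np.mean((y_hat - y) ** 2)        # formula 1

# One training step: forward pass, gradient of the loss, weight update.
y_hat = predictions(X, W, b)
grad_W = 2.0 / len(y) * X.T @ (y_hat - y)   # partial derivatives w.r.t. W (formula 3)
grad_b = 2.0 / len(y) * np.sum(y_hat - y)

W = W - learning_rate * grad_W              # formula 2
b = b - learning_rate * grad_b

print("loss after one step:", mse_loss(predictions(X, W, b), y))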

5.3 Activation functions

Activation functions are used to add non-linearity to the output of a neuron or filter in neural networks. In convolutional neural networks (CNNs), which are typically used in image classification tasks, the filters (kernels) act as the neurons of a layer. When an activation function is applied to a convolutional layer, it is typically applied elementwise to the feature maps that are output from the convolution operation. [34.] Figure 9 below depicts the SWISH (SiLU) activation function. It is used by default in the EfficientDet D0 object detector model.

Figure 9. Plot and function of SWISH activation function. Copied from Ravichandiran (2019) [35]
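Since SWISH is simply the input multiplied by its own sigmoid, it can be written in a few lines. The sketch below is a plain NumPy illustration rather than the TensorFlow implementation actually used by the TFOD-API.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # SWISH / SiLU: the input scaled by its own sigmoid.
    return x * sigmoid(x)

# Negative inputs are damped rather than cut off completely as in ReLU.
print(swish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))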

5.4 Convolution

Convolutional neural networks use convolutional layers such as Conv2D (two-dimen-


sional sliding of the kernel across the input, to the right and downwards). The primary
advantages of convolutional layers are that there is weight sharing compared to fully
connected neural networks, and that the ConvLayers can extract features from the image
while maintaining the spatial information. This spatial information would be lost when using a fully connected network and flattening the image into a vector fed into the network.
[18.]

Weight sharing means that in larger-resolution image classification problems (larger than the 28x28 MNIST digit problem), the number of trainable parameters remains reasonable. When the image resolution exceeds such small sizes, convolutions are practically required. Convolution is helped by weight sharing because the same weights of a single filter are used to convolve the entire input. [36.]

Convolution is a mathematical operation which involves sliding the kernel (filter) across the image and convolving the input into an output. The process is achieved with elementwise multiplication and summation. Each filter can learn different features from the image during the training process, and the weights are the values of the filters. Typically, the convolutional layers closest to the input learn features with low semantic information, like edge detection, whereas the upper layers further down the network learn features with high semantic information, such as eyes and noses. The basic convolution is shown below in figure 10. [18.]

Figure 10. Convolution applied to an image by a filter, producing a feature map. Copied from
Elgendy (2021) [18]

In figure 10 above, a single kernel is being used to convolve the input image. The input
image in this example was a single color channel black and white picture.
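The elementwise multiply-and-sum described above can also be written out directly. The following is a minimal sketch of a "valid" 2D convolution (strictly speaking cross-correlation, as in most deep learning libraries) of a single-channel image with one 3x3 filter, using NumPy purely for illustration.

import numpy as np

def conv2d(image, kernel):
    # "Valid" convolution: slide the kernel over every position where it fully fits.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(window * kernel)  # elementwise multiply and sum
    return feature_map

image = np.random.rand(6, 6)               # single-channel 6x6 input
edge_kernel = np.array([[-1.0, 0.0, 1.0],  # a simple vertical edge detection filter
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])
print(conv2d(image, edge_kernel).shape)    # (4, 4) feature map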

6 Object detection terminology

In order to better understand object detection and face detection, some terms have to be defined and explained. In object detection terminology, when discussing the results of object detection, the true negative is of no concern (out of true positives, false positives, true negatives and false negatives). By definition, a true negative in object detection would mean that the detector detects the background successfully, but this is the exact opposite of the goal of the task, which is finding objects from the background.

6.1 Precision

In object detection, precision and recall are the key terms which describe performance, as opposed to accuracy as in image classification tasks. Precision is governed by the formula below. Precision tells how good the object detector is at making correct detections. In fact, precision is sometimes called by its other name, the positive predictive value. In face detection, a true positive would be a correctly detected face. A false positive would be something that was detected, which was not actually a face (for example a cat), yet it was still detected from the background. Typically, precision is given as a value on the interval [0,1], but it is sometimes described in research papers on the interval [0,100] as a percentage. [37; 38, pp.8-10]

precision = \frac{T_p}{T_p + F_p} = \frac{T_p}{\text{allDetections}} \qquad (4)

where T_p = true positives, F_p = false positives

In an excellent object detector model, both the precision and the recall would be close to the maximum value of 1.0; however, this is rarely achieved in practice. Trade-offs can be made between precision and recall performance in object detectors depending on the problem context. [38, pp.8-10]

6.2 Annotation and label

Annotations are the ground truths for the image dataset in object detection. Annotations in the PASCAL VOC format are XML files with a one-to-one mapping to the input images. The annotations contain the bounding box coordinate information and the class label (class name) in the XML. [39.] Essentially, the XML file contains the required information about a specific image as textual information.
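Because the FMLD annotations are provided in the PASCAL VOC XML format, a single annotation file can be read with the Python standard library as sketched below; the file name is a placeholder, but the element names (object, name, bndbox, xmin, ...) are the standard PASCAL VOC fields.

import xml.etree.ElementTree as ET

# Parse one PASCAL VOC annotation file (placeholder file name).
root = ET.parse("example_annotation.xml").getroot()

filename = root.findtext("filename")
for obj in root.findall("object"):
    label = obj.findtext("name")                  # class label, e.g. "face_mask"
    box = obj.find("bndbox")
    xmin = int(float(box.findtext("xmin")))       # bounding box corner coordinates
    ymin = int(float(box.findtext("ymin")))
    xmax = int(float(box.findtext("xmax")))
    ymax = int(float(box.findtext("ymax")))
    print(filename, label, (xmin, ymin, xmax, ymax))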

6.3 Recall

Recall, using face detection as an example, describes the basic detection performance. Recall is also known as the true positive rate, or sensitivity. Recall is defined as follows: [38, pp.8-10]

recall = \frac{T_p}{T_p + F_n} = \frac{T_p}{\text{allGroundTruths}} \qquad (5)

where T_p = true positives, F_n = false negatives.

In face detection, false negatives would be actual faces that existed in the image but which were never detected by the model. [37] Typically, recall is given as a value on the interval [0,1], but occasionally in research papers it is given on the interval [0,100] as a percentage, especially when concerning average recall. Recall measures the ability to extract all the relevant cases from the total cases. [38, pp.8-10] Recall thus captures the aspect of the detector's performance that describes how well the detector extracts objects from the background.
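Formulas 4 and 5 translate directly into code; the small sketch below computes precision and recall from raw true positive, false positive and false negative counts, with the example numbers chosen arbitrarily for illustration.

def precision(tp, fp):
    # Formula 4: share of detections that were correct.
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    # Formula 5: share of ground truths that were found.
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Example: 80 correctly detected faces, 10 spurious detections, 20 missed faces.
print("precision:", precision(80, 10))   # 0.888...
print("recall:   ", recall(80, 20))      # 0.8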

6.4 Mean average precision (mAP)

Mean average precision (mAP) is the main metric of an object detector's performance.
Mean average precision is a snapshot metric of the precision-recall curve of an object
detector.

As an example, assume an object detector which detects just a single class (faces); the detection results would be arranged into the precision-recall curve format. The formal definition of average precision is the area under the precision-recall curve (for the face class). Each class of the object detector would have its own precision-recall curves at different IoU thresholds. The model's detections during evaluation are first transformed into the precision-recall curve format. An example plot of the precision-recall curve is shown below in figure 11. Then the average precision is calculated for the face class as the definite integral. This precision-recall curve is only for the IoU=0.5 level, however. Using the COCO metrics, it is required to calculate the average of the ”average precisions” at the required IoU thresholds over all the corresponding precision-recall curves. That average will then be the mean average precision (mAP) for this example, since there is only one class. [40; 41; 18]

For object detectors with more classes, it is required to first compute the IoU threshold-based AP values for each class. For any class, the average precision is the mean of the AP values at the necessary IoU thresholds; let this number be the class AP. The mean is then calculated over all the class APs, the number of which depends on the number of classes to detect. [40; 41; 42; 18]

Figure 11. Example precision-recall curve. Copied from Elgendy (2021) [18].

Interpreting the precision-recall curve at recall=1.0 means that, when all true positives were detected, the precision was at the level of 0.6. When recall=0.6, and therefore 60% of the true positives were detected, the precision remained at 0.9. [43]

In practice, the mAP is computed automatically by pycocotools and the TFOD-API during testing/evaluation. Furthermore, the COCO style of calculating the average precision is, strictly speaking, not the definite integral; instead, pycocotools uses a discrete sum over the 101-point interpolated average precision of the precision-recall curve. [44]
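As an illustration of the 101-point interpolation mentioned above, the sketch below computes an average precision value from a precision-recall curve in the spirit of the COCO evaluation: the precision values are first made monotonically decreasing and are then sampled at 101 evenly spaced recall levels. The exact pycocotools procedure differs in its details, and the example curve values are made up for demonstration.

import numpy as np

def interpolated_ap(recalls, precisions):
    # Precision envelope: at each recall level use the best precision
    # achievable at that recall or higher.
    precisions = np.maximum.accumulate(precisions[::-1])[::-1]
    # Sample the envelope at 101 recall levels 0.00, 0.01, ..., 1.00.
    recall_levels = np.linspace(0.0, 1.0, 101)
    sampled = np.interp(recall_levels, recalls, precisions)
    return sampled.mean()

# A made-up precision-recall curve (recall must be increasing).
recalls = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precisions = np.array([1.0, 0.95, 0.9, 0.85, 0.7, 0.6])
print("AP:", interpolated_ap(recalls, precisions))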

6.5 Intersection over Union (IOU)

In face detection, IoU (intersection over union) means the comparison of the predicted
bounding box to the ground truth bounding box. IoU is used to determine the correct-
ness of true positives, false positives and false negatives. IoU is most often given on the
interval from [0,1] where 1.0 is a perfectly correct prediction. [45; 18]

IoU = \frac{\text{AreaOfOverlap}}{\text{AreaOfUnion}} \qquad (6)

Figure 12. IoU graphical interpretation. Copied from Khandelwal (2020) [37].

Figure 12 above shows the graphical interpretation of the IoU formula. It shows the
model’s predicted bounding box compared to the ground truth bounding box. The red
bounding box is the predicted bounding box, and the green bounding box is the ground
truth bounding box. [37.]

To be precise, IoU (also known as the Jaccard index) can only be computed between the model's predicted bounding box and a ground truth bounding box of the same class. Object detectors must perform both the localization task for the bounding box and the classification task for the class label estimate. [38, p.8]
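Formula 6 can be implemented for axis-aligned boxes in a few lines; the sketch below assumes boxes given as (xmin, ymin, xmax, ymax) tuples, which matches the PASCAL VOC annotation convention.

def iou(box_a, box_b):
    # Boxes are (xmin, ymin, xmax, ymax).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction that overlaps the ground truth partially.
print(iou((10, 10, 60, 60), (30, 30, 80, 80)))   # about 0.22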

6.6 IoU-threshold

The IoU threshold is a tunable hyperparameter which judges the correctness of a prediction.

• True positive, if IoU >= IoU_threshold
• False positive, if IoU < IoU_threshold
• False negative, if a present ground truth could not be detected
• True negative, if irrelevant background in an image is correctly identified and ignored.

[37.]

The IoU threshold is often used in the context of the evaluation of the model based on
test set data, but it does affect the model’s weights. Otherwise, the training loop in Ten-
sorFlow would not be able to judge the correctness of detections if the IoU threshold did
not have an effect on model’s weights.

The procedure used in the object detector to determine false positives, true positives and, importantly, the false negatives is described more precisely in pseudocode in listing 1 below. To determine when a false negative happened, another threshold (a hyperparameter) to compare the confidence score against is required. If the confidence score is below that threshold initially, then the detection is immediately suppressed. Ground truths missed by the detector are generally considered to be false negatives. [46.]

for each detection that has a confidence score > threshold:
    among the ground-truths, choose one that belongs to the same class and has
    the highest IoU with the detection

    if no ground-truth can be chosen or IoU < threshold (e.g., 0.5):
        the detection is a false positive
    else:
        the detection is a true positive

Listing 1. Pseudocode describing the effects of hyperparameters in object detection. Copied from Zeng, Nick (2018) [46].
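A minimal runnable Python version of the matching logic in listing 1 is sketched below. It reuses the iou function from chapter 6.5 and assumes detections and ground truths are given as simple dictionaries, which is an illustrative format rather than the TFOD-API's internal one.

def match_detections(detections, ground_truths,
                     score_threshold=0.5, iou_threshold=0.5):
    # detections: [{"box": (...), "label": str, "score": float}, ...]
    # ground_truths: [{"box": (...), "label": str}, ...]
    tp, fp = 0, 0
    matched = set()
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if det["score"] <= score_threshold:
            continue   # low-confidence detections are suppressed entirely
        # Pick the unmatched same-class ground truth with the highest IoU.
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched or gt["label"] != det["label"]:
                continue
            overlap = iou(det["box"], gt["box"])
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            tp += 1
            matched.add(best_idx)   # each ground truth can be matched only once
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)   # missed ground truths
    return tp, fp, fn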

To re-iterate, false negatives typically occur only from missed ground truths which were
not detected. The false positives category of the object detector results can occur in
multiple ways. [47, p.26]

• The IoU was below the threshold value and the predicted label was the
correct class label, but because the IoU was too low, then the localization
was not good enough.
• Irrelevant background was detected as an object.
• An object is localized correctly, but the predicted label was the wrong class
label. For example, this could happen in a problem context where the object
detector must detect multiple classes of objects such as the following: 1)
correctly worn face mask, 2) incorrectly worn face mask, 3) unmasked face.
It could happen that similarities of classes confused the object detector.
[47, p.26]

6.7 Confidence score

Confidence score is a component of the model’s prediction output. It is also compared


against a threshold value (hyperparameter) to determine the false negative. When the
model has made a prediction in object detection, the prediction will contain the confi-
dence score, bounding box estimated coordinates, and a class estimate as the compo-
nents of the prediction. The confidence score describes how sure the model is about its
prediction being correct. It is a probability in range [0, 1.0]. [46; 38, pp.7-8] When the
confidence score is too low, the object detector model decided that there is nothing to be
detected in that situation.

6.8 Non-maximum suppression algorithm (NMS)

Non-maximum suppression is a post-processing algorithm in the object detection model.


Its purpose is to prune off extraneous bounding box proposals, which were generated
around the same object. The NMS algorithm is described shortly as follows in “Deep
Learning for Vision Systems” by Mohamed Elgendy. NMS can be run during inference
time to aid in making better final predictions for bounding boxes. [18.] The NMS algorithm
does not guarantee certain performance but it does introduce more hyperparameters to
tune. In sum, it can be effective in reducing false positive predictions.

• Discard bounding boxes that are below a tunable confidence threshold


• From remaining bounding boxes select box with highest probability (confi-
dence)
• For those boxes with the same classification result as the selected highest
confidence box, calculate the overlap of the area, called intersection over
union. Then average together the boxes with a high enough overlap.
• Suppress any bounding boxes with low intersection over union value below
the NMS threshold (another tunable parameter)
[18.]

Listing 2. Description of the NMS algorithm. [18]
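A simplified sketch of the procedure in listing 2 is given below for a single class. It keeps the highest-confidence box and suppresses overlapping boxes instead of averaging them, which is the more common variant, and it again reuses the iou function from chapter 6.5; the threshold values are example settings.

def non_maximum_suppression(boxes, scores, iou_threshold=0.5, score_threshold=0.3):
    # boxes: list of (xmin, ymin, xmax, ymax); scores: matching confidence scores.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    while candidates:
        best = candidates.pop(0)        # highest remaining confidence
        kept.append(best)
        # Suppress every remaining box that overlaps the kept box too much.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept

boxes = [(10, 10, 60, 60), (12, 12, 58, 62), (100, 100, 150, 150)]
scores = [0.9, 0.75, 0.8]
print(non_maximum_suppression(boxes, scores))   # [0, 2]: the duplicate box is pruned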



Figure 13 below shows how the extra bounding boxes are pruned off in the NMS algo-
rithm. In the best case the algorithm produces the correct single bounding box for the
single object. [18.]

Figure 13. Visualization of the NMS post processing algorithm. Copied from Elgendy (2021). [18]

6.9 Objectness score (probability of containing object)

The region proposal algorithm or region proposal neural network (RPN) generates
bounding boxes which can have varying objectness scores. It is a measure of how confident the region proposing subsystem is that the bounding box contains an object. It is given as a value in [0,1]. The term is used in, for example, the faster R-CNN
family of object detectors. [18.]

6.10 PASCAL performance metrics

The PASCAL object detection benchmark measures object detectors by their mAP
scores at IoU=0.5. This metric mAP at IoU=0.5 is a special case of the COCO perfor-
mance metrics, so in principle in this regard COCO and PASCAL performance figures
can be compared between each other. The mean average precision at the IoU threshold
of 0.5 is the most common performance figure used in research and literature, and at
least it will be given. [48, p.273; 49]

6.11 COCO performance metrics

In the MS COCO object detection benchmark, the key metric that is used for comparison
is the mAP at IoU [0.5: 0.05: 0.95]. It is the mean average precision at multiple IoU-
thresholds with a step size of 0.05. Other metrics are available as well, such as the average recall and the mean average precision values for small (area < 32x32), medium (32x32 <= area <= 96x96) and large (area > 96x96) objects in terms of pixels. [48, p.273; 49].
COCO metrics are given for the mAP as the area under the precision-recall curve be-
tween [0,1], not as a percentage. The COCO metrics show the mean average precision
(mAP) as AP. [49.]
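For reference, the COCO metrics described above can be computed with pycocotools roughly as sketched below, given ground truth annotations and detections exported to the COCO JSON format; the file names are placeholders. In this project the TFOD-API invokes the same evaluation internally during evaluation runs.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground truth annotations and the model's detections in COCO JSON format
# (placeholder file names).
coco_gt = COCO("ground_truth_annotations.json")
coco_dt = coco_gt.loadRes("detections.json")

# "bbox" selects bounding box evaluation (as opposed to segmentation or keypoints).
evaluator = COCOeval(coco_gt, coco_dt, "bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP at IoU=0.50:0.95, AP@0.5, AP@0.75, AR, etc.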

7 Deep learning-based object detectors

It is difficult to determine many overarching general principles across different styles of


architectures in object detectors. Object detector architectures are divided into one-stage
and two-stage detectors. Typically, convolutional neural networks are used as a base
network module somewhere in the architecture. Base networks are variations of com-
monly used CNNs for image classification like ResNet, VGG19 and MobileNet. Common
types of modern object detector architectures are YOLO, the R-CNN family (meta-archi-
tecture), EfficientDet and the SSD architectures. [18.] Object detection is solved some-
what differently by these object detectors. Fully detailed descriptions of all these object
detectors are beyond the scope of this thesis. But a general overview can be given of
the R-CNN, faster R-CNN and ultimately the chosen EfficientDet architecture.

7.1 R-CNN architectures

In two-stage detectors, there is a region-proposal stage, which identifies interesting ar-


eas from the input image, and a second stage, which attempts to classify those candidate
regions of the image. The candidates are called regions of interest (RoI). Examples of
the types of two-stage detectors are R-CNN, fast R-CNN and faster R-CNN architectures
for object detection. This is the order in which they were developed chronologically. [14;
50; 18] R-CNN stands for region-based convolutional neural network. [18; 50]

In the region proposal stage, some method for region generation is required, e.g., either
a search algorithm, such as the selective search algorithm, or a region proposal neural
network (RPN). The RPN generates regions of interest, which are then processed by the
second stage in classification. Selective search is a type of greedy search algorithm that
was used in early versions of R-CNN type of networks but has been superseded because
selective search was the bottleneck in terms of processing time. It is a CPU-bound op-
eration. Selective search attempts to cluster the image into blob-like segments by com-
bining regional similarities and this would result in region proposals. The benefit of se-
lective search is that it is a defined algorithm as opposed to a neural network, whose
exact behavior can be more obscure to analysis. However, the excessive processing
time of selective search is the clear downside. [51, pp.3215-3217; 18; 50; 48, pp.274-
276] Faster R-CNN is the latest and the best in the R-CNN family of detectors and its
mean average precision scores are state-of-the-art. But the fine-tuning and re-training
on new datasets is more complicated compared to one stage detectors. Inference time
is longer than that of one-stage detectors. Faster R-CNN can only process images at 5
fps, running on a graphics card, which is too slow for real-time use cases such as detec-
tion from a security camera. [50; 18; 48, pp.274-276]

The simplest example to showcase is the original R-CNN model, invented in 2014 by Ross Girshick et al. Fast R-CNN and faster R-CNN are improvements upon it. Figure 14 below shows the basic architecture of the R-CNN quite clearly. The SVM refers to the support vector machine, which acts as the classifier module in the R-CNN. The bounding box regressor is the localization module. The ConvNet is the feature extractor module, which takes as input the selective search regions of interest of the raw image, warped to fit exactly into the CNN input layer. When considering the R-CNN as a baseline for processing time, the fast R-CNN variant represents a 25x speedup, and the faster R-CNN represents a 250x speedup. [18.]

Figure 14. The original R-CNN model for object detection (Ross Girshick et al.). Copied from Elgendy (2021) [18]

The current version is called faster R-CNN. It is a more complex design and differs from
the earlier ones. Faster R-CNN is primarily composed of two modules, RPN and fast R-
CNN. The RPN network replaces the selective search algorithm. The RPN network is an
attention-focusing module in the whole system to find interesting areas for further analy-
sis. The base network CNN is used for feature extraction, for example VGG-16. And the
weights of the base network are shared between the RPN and fast R-CNN modules as
shown in figure 15 below. [18.]

Figure 15. Overall architecture of Faster R-CNN object detector. Copied from Elgendy (2021) [18]

The RPN outputs objectness scores and corresponding bounding boxes. The RPN net-
work does in-fact have an internal binary classifier allowing only high objectness score
region proposals to proceed, but it does not yet make the final classification. Notice also
that the CNN VGG-16 is directly taking the input image, and the region proposals are
only generated from the feature maps of the CNN, instead of from raw image. Using the
RPN and generating regions from the feature maps results in big improvement over the
earlier R-CNN versions. There is also a layer called RoI-pooling (region of interest) be-
fore the RPN outputs can be sent to the final classification and bounding box stages,
because the input must have fixed dimensions. Then each region of interest is classified,
and the final bounding box is drawn. It can be noted that for an object detection problem
involving only a single class (such as regular face detection) but allowing multiple object
instances, the RPN would suffice in solving that kind of a problem. [18.] Training the faster R-CNN was attempted using the TFOD-API, but it was not possible in practice, at least with a reasonable batch size of 8; it crashed due to a GPU memory exhausted error.

7.2 SSD and anchor boxes

Examples of modern and effective one-stage detector architectures are YOLO (you only
look once) and SSD (single shot multibox detector). In a one-stage detector the whole
input image is analyzed and processed in a single forward pass through the neural
network, which directly produces the bounding box and class predictions. In one-stage
detectors there is no explicit region proposal stage; it is built into the network. One-stage
object detectors have enjoyed good success in real-time object detection, such as
detecting from a webcam, where short inference time is critical. [18; 48, pp.277-278]
Compared to SSD, faster R-CNN trades runtime speed for accuracy. [12] According to
Mohamed Elgendy, “One stage detectors skip the region proposal stage and detect over
the dense sampling of possible locations.” [18.]

Single-stage detectors typically use anchor boxes (priors, prior boxes) that are referenced
to the input image. This anchor box design allows one-stage detectors such as SSD to
process and refine the bounding boxes efficiently during training and to make bounding
box proposals during inference. [39.] The anchor boxes are pre-generated, fixed-location
boxes of differing aspect ratios arrayed on a grid. Several anchor boxes of different aspect
ratios exist at each feature map cell, and the anchor boxes are tiled through the network
in a convolutional manner. [52.] The anchor box design is also applied in the TFOD-API
implementation of the EfficientDet object detectors according to the pipeline.config file.
Whether the anchor boxes work identically in EfficientDet and SSD detectors could not
be verified due to insufficient sources; the EfficientDet research paper does not elaborate
on the anchor box method used. [53.]

Internally in SSD, the bounding boxes are represented and refined as deltas from the
fixed anchor boxes’ coordinates. For every anchor box, at every location of a feature map,
the model predicts four-valued coordinate deltas for the bounding box and confidence
scores for all classes (C1, C2 … Cp). The class with the best confidence score becomes
the predicted class. The detector elements are essentially convolutional kernels convolved
across the feature maps. [52.] Figure 16 below shows feature maps together with anchor
boxes.

Figure 16. Anchor boxes. Image copied from SSD: Single Shot Multibox Detector, Liu et al (2016)
[52]

In figure 16, feature maps of two different scales (8x8 and 4x4) are shown. This allows
detection of different-sized objects: for example, since the dog is bigger than the cat, the
dog is only detected in the 4x4 feature map. Detections happen at multiple scales along
the depth of the SSD detector’s feature layers, and finally the NMS algorithm prunes off
extraneous bounding boxes. [52; 18] Tweaking the anchor boxes appears to be a feature
engineering aspect of improving the detection performance of SSD detectors.
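
To make the delta encoding concrete, the following is a minimal sketch (not taken from
any particular SSD implementation, and ignoring the box_coder scale factors seen later
in pipeline.config) of how a predicted four-valued delta could be decoded against one
anchor box using the common centre-offset and log-size convention.

import numpy as np

def decode_box(anchor, deltas):
    # anchor and the returned box are (cx, cy, w, h) in relative image coordinates;
    # deltas are the four predicted values (dcx, dcy, dw, dh) for this anchor.
    acx, acy, aw, ah = anchor
    dcx, dcy, dw, dh = deltas
    cx = acx + dcx * aw            # shift the centre relative to the anchor size
    cy = acy + dcy * ah
    w = aw * np.exp(dw)            # refine width and height in log-space
    h = ah * np.exp(dh)
    return np.array([cx, cy, w, h])

# a square anchor at the image centre and a small predicted refinement
print(decode_box((0.5, 0.5, 0.2, 0.2), (0.1, -0.05, 0.2, 0.0)))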

7.3 EfficientDet architecture

The EfficientDet D0 architecture is a state-of-the-art object detector invented in 2020 by
researchers at Google Research’s Brain Team. The nomenclature D0 refers to a specific
image resolution (512x512) and model scaling; other EfficientDet variants are also
available as larger neural networks. EfficientDet is a one-stage object detector and uses
the relatively modern EfficientNet CNN as its backbone network. The key innovations
brought about by the EfficientDet architecture were essentially compound scaling and
bi-directional feature pyramid network layers (bi-FPN). [54.]

EfficientDet D0 was available in the TensorFlow object detection API and found to be
working when training it, so it was chosen as the object detector to be finetuned for face
mask detection and face detection on the FMLD dataset. [54]

Furthermore, the base network EfficientNet is itself a state-of-the-art CNN which has
achieved top results in the ImageNet classification benchmark in 2020 and 2021. [55]
Compound scaling means that the original researchers discovered a method to scale up
all the dimensions of the CNN according to a fixed ratio in an efficient manner to improve
object detection performance. The options for sizing up and scaling neural networks are
to change the CNN width (channels of layers), depth (number of layers) or image input
resolution. [53; 54] The same compound scaling also improved image classification
performance in general. For example, without compound scaling, ResNet1000 achieves
roughly the same image classification accuracy as ResNet101 despite its much greater
network depth, which is very inefficient in terms of network architecture and the number
of trainable parameters. Table 2 below shows the improvements brought by compound
scaling. [56.]

Table 2. Image classification improvements brought by compound scaling. Table copied from
Arora (2020) [56]
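
As a small illustration of the idea, the sketch below applies the compound scaling rule
with the depth, width and resolution coefficients reported in the EfficientNet paper
(quoted from memory, so the exact numbers should be treated as an assumption rather
than as values verified for this project).

# compound scaling sketch: alpha scales depth, beta scales width, gamma scales resolution
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    # return the (depth, width, resolution) multipliers for scaling coefficient phi
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")

# the paper constrains alpha * beta**2 * gamma**2 to roughly 2, so each unit
# increase of phi roughly doubles the computational cost of the network
print(ALPHA * BETA ** 2 * GAMMA ** 2)   # about 1.92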

The bi-directional feature pyramid network (bi-FPN) is used to improve multi-scale feature
fusion in CNN networks. As in a normal CNN, the layers closer to the input image learn
fine-grained features, whereas deeper layers learn larger-scale, higher-level features from
the image. Multi-scale feature fusion attempts to combine similarities between the scales
of features as they exist at different points along the depth of the network; the purpose is
to aggregate the features extracted from the different scales of the feature maps. To be
clear, this multi-scale feature fusion was done to some extent by the earlier design of
feature pyramid networks, but bi-FPN improves on it. Figure 17 shows the bi-FPN on the
right-hand side. [53.]

Figure 17. Bi-directional feature pyramid network along with earlier variations of the feature pyr-
amid networks. Image copied from “EfficientDet: Scalable and Efficient Object Detec-
tion”, Tan et al (2020) [53]

In the design of the bi-FPN, the Google Research team took inspiration from the earlier
PANet (Path Aggregation Network) style of feature pyramid networks, but they optimized
it by removing connections that did not contribute much to performance, which also
shortened the training time. Furthermore, bi-FPN layers utilize weighted feature fusion.
Whereas earlier variants of FPNs treated the cross-scale input connections equally with
summation, the bi-FPN layers use learnable weights to modify the inputs. The researchers
called this fast normalized fusion, and its formula is further described in their paper. [53.]

The overall design of EfficientDet can be seen in figure 18 below. Note the EfficientNet
backbone network and how the connections from different depths of the backbone
(P3, P4, P5, P6, P7) are sent to the bi-FPN layers. At the end, there are the class
prediction and box prediction nets. EfficientDet uses depthwise separable convolutions
instead of regular convolutions in the model, and batch normalization and an activation
function are applied at each convolution. The SWISH activation function is used. [53.]

Figure 18. The overall architecture of EfficientDet. Image copied from “EfficientDet: Scalable and
Efficient Object Detection”, Tan et al (2020) [53]
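
As a rough sketch of the convolution pattern described above (depthwise separable
convolution followed by batch normalization and SWISH), and not the actual EfficientDet
code, a single such block could be written in Keras for example as follows.

import tensorflow as tf

def conv_block(x, filters):
    # depthwise separable convolution, then batch normalization, then swish activation
    x = tf.keras.layers.SeparableConv2D(filters, kernel_size=3,
                                        padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.Activation(tf.nn.swish)(x)

inputs = tf.keras.Input(shape=(512, 512, 3))     # the D0 input resolution
outputs = conv_block(inputs, filters=64)
tf.keras.Model(inputs, outputs).summary()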

8 Proposed solution

Finding applicable datasets containing masked faces and regular faces was difficult. The
Indian researchers Singh et al. had performed a COVID-19 face mask wearing detection
study and constructed their own dataset. [8, p.8] The same dataset could not be used
here because its annotations were in the Darknet format and YOLO was not available in
TFOD-API; it was not known how to convert the Darknet annotations into tf-records or
whether tools existed for this.

The FMLD (face mask labeled dataset) dataset was chosen. It had been created by
Batagelj et al. [23] This dataset contained a few erroneously annotated images that
caused a crash in TensorFlow, as well as some other problems. However, these problems
were solved in the dataset pre-processing phase.

It was decided to retrain or finetune an existing object detection model in TFOD-API to
create the face mask wearing detector. The goal was to create a face mask detector and
face detector similar to that of the researchers Singh et al. According to them, the general
form of the solution to the problem of face mask wearing detection should be an object
detection-based model such as the YOLO neural network. [8, p.8]

The FMLD dataset’s creators Batagelj et al. took an entirely different approach from that
of Singh et al. Batagelj et al. created a custom two-phase detection and classification
pipeline, which used a separate RetinaFace detector and a ResNet classifier, whereas
Singh et al. solved the problem directly and concisely using object detector models
(YOLOv3 and faster R-CNN) trained for face mask wearing detection. For reasons of
simplicity, it was decided to follow the approach of Singh et al. [8; 20, pp. 15-21]

The deep learning framework TensorFlow object detection API was chosen. TFOD-API
showed promise as a starting framework for beginners in object detection. It comes with
several pretrained models trained on the COCO object detection dataset. In the end,
EfficientDet D0 was used to obtain the results.

The decision to finetune an existing pre-trained object detection model was influenced
by Jason Brownlee’s technical blog as well as by Umberto Michelucci, author of
“Advanced Applied Deep Learning: Convolutional Neural Networks and Object
Detection”. [13; 57] According to Brownlee, “object detection is a challenging computer
vision task that involves predicting both where the objects are in the image and what type
of objects were detected.” [57.]

Similarly, Umberto Michelucci suggests the following:

If you followed the previous sections, you understand that developing your own
models for YOLO from scratch is not feasible for a beginner (and for almost all
practitioners), so, as we have done in previous chapters, we need to use pre-
trained models to use object detection in your projects. [13]

9 Setting up the project environment

9.1 Tools, installation and setup

9.1.1 Tools

TFOD-API is a high-level interface to use the TensorFlow library for training object de-
tection models. TFOD-API must be installed separately on top of the regular TensorFlow
library. Python scripts were used in the dataset pre-processing stage to combine the
FMLD dataset from the provided XML annotations. Microsoft Excel was used in finding
bad data from the dataset during dataset preprocessing.

9.1.2 Installation

The installation procedure of the object detection API on the local PC was not without
problems. Evidently, the TFOD-API installation instructions were outdated in a few
regards. The installation was done for Windows 10. The official TensorFlow object
detection API install guide was mostly correct, but some problems were encountered in
the process, although they were also solved. [58.] The official install guide is available at
<https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/install.html>.
The tested library versions are listed in Appendix 1.

Problem: TensorFlow inadvertently updates from 2.2.0 to 2.4.1

The inadvertent update happens when following the official install tutorial and attempting
to use the TensorFlow 2.2.0 version.

Solution: Let the requirements install and let TensorFlow update to version 2.4.1

Allowing the updates to proceed worked. Correct and compatible versions of the NVIDIA
CuDNN and CUDA toolkit must be downloaded and installed if the GPU is to be used for
training the models. The correct and compatible versions, CUDA Toolkit 11.0 and CuDNN
8.0.4 for TensorFlow 2.4.1, were downloaded and reinstalled.

Problem: Pycocotools is unable to utilize coco_detection_metrics

When attempting to train the model, pycocotools could not be used, and training crashed
when “coco_detection_metrics” was specified in the pipeline.config file.

Solution 1: Pip uninstall pycocotools and try again

Download and install the Visual C++ 2015 build tools and add them to the PATH variable.
Then execute the pip install command as described in the TFOD-API install guide. Some
necessary files will then be compiled for version 2.0 of pycocotools, and the script will
explicitly report a successful build and installation. [58.]

Solution 2: Install pycocotools for windows using pip

Alternatively, installing pycocotools-windows will also work. Evaluation results could then
be obtained using “coco_detection_metrics”.

Continuing the installation and setup, there were two quick tests that were performed to
verify the installation was successful. The first command is simply executed from the
Anaconda prompt. Let’s call it the test 1 command. The command verifies that GPU-
support is enabled for training neural networks. [58.]

python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

The successful output of the command shows CUDA and CuDNN libraries successfully
opened. Refer to Appendix 1 to see the output of the successful test 1 command.

The second test command performs the TFOD-API tests. It should be done after having
compiled the protocol buffers with protoc and after having installed the TensorFlow
object detection API. The command is executed from within TensorFlow/models/research/
in the Anaconda prompt; let’s call it the test 2 command. [58.]

python object_detection/builders/model_builder_tf2_test.py

The successful output should show CUDA-related DLL libraries being successfully
opened at the start using TFOD-API. Refer to Appendix 1 to see the successful output
of the command.

9.1.3 Setup

The third and final test was retraining a pre-trained model on a small, annotated dataset
of images. This requires the installation of an additional tool for labelling the images of a
dataset; the utility program used was LabelImg. A dataset of around 50-100 pictures of
faces and face masks was created to verify that the environment works properly. These
images were annotated in LabelImg according to the PASCAL VOC format. [59.] To be
clear, this was not the final dataset, but it was used to verify that TFOD-API worked in
practice.

When annotating pictures, all the regular faces and face mask wearing faces should be
annotated. Leaving some faces un-annotated hinders the model’s detection performance
during training. The best practice is to annotate everything you expect to be detected.
Tiny (far away) and very blurred faces could be left un-annotated. [60.]

A Jupyter Notebook was prepared to help in the setup and model training phase. Exam-
ple code from Nicholas Renoitte was used as a basis. His code is not strictly necessary
to train the object detector as everything can be done from the Anaconda prompt. But
using a notebook made it easier to format the commands correctly. [61; 59] The next
step was to create a labelmap file of filetype PBTXT. It contains the case-sensitively
correct names of the class labels mapped to integers. ID 0 should not be used as a class
ID in TensorFlow because it denotes the background. The labelmap only needs to be
created once per dataset if there are no changes. [59.]
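
For illustration, a labelmap of this form would cover the three FMLD classes used later
in this project; the actual file used is not reproduced here, but the TFOD-API labelmap
format looks like the following.

item {
  id: 1
  name: 'masked_face'
}
item {
  id: 2
  name: 'unmasked_face'
}
item {
  id: 3
  name: 'incorrectly_masked_face'
}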

After this, tf-records files must be generated from the dataset images, the annotation
files and the labelmap. A conversion script to do this was downloaded from the TFOD-API
tutorial website (generate_tfrecord.py). A TensorFlow tf-record is a serialized file that will
be used by the TFOD-API training loop when training starts. [38, p.6; 59; 64] The dataset
images and annotations are converted into the tf-records format. [59] When generating
them, it is recommended to output the optional CSV files along with the end-result
tf-records, because the CSV files are more human-readable and easily show whether the
conversion worked. Finally, the model should be trained for a few hundred steps simply
to verify that the environment works and that tf-events files and checkpoint (.ckpt) files
are generated from the training run into the model folder.
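
In addition to checking the CSV files, a quick way to verify that a generated tf-record is
readable is to iterate over it with TensorFlow and count the serialized examples. The
sketch below is only an example; the record path is illustrative and not a file name used
in the project.

import tensorflow as tf

record_path = "Tensorflow/workspace/annotations/train.record"   # example path
count = 0
for raw in tf.data.TFRecordDataset([record_path]):
    tf.train.Example.FromString(raw.numpy())   # parses one serialized example
    count += 1
print(f"{count} examples found in {record_path}")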

10 Project work tasks

After the environment installation and setup, the remaining work tasks in the project are
listed below for the sake of clarity. Once a basic Anaconda environment for TensorFlow
had been installed and the TFOD-API verified to work, the further steps were as follows:

1. Download some pre-trained models from TFOD-API Model Zoo GitHub. The Ef-
ficientDet D0 model was downloaded. Also, the pipeline.config file of the pre-
trained model is copied into a new trainable_model folder where the model’s new
training data will be written to (tf-events files and .ckpt files). The pipeline.config
file contains many of the important hyperparameters of the model that can be
changed in TFOD-API, without editing the source code directly. [62; 63]

2. Download the original WIDERFACE and MAFA datasets. Download the FMLD
dataset’s files from the dataset creators’ GitHub repository. [23] The FMLD da-
taset did not come as a ready package from the Github repository, so it must be
assembled. The FMLD dataset’s creators did provide the XML annotations. Py-
thon script was used to find a one-to-one mapping from annotations to images
and to copy the matched files in order to construct the FMLD dataset.

3. Shuffle the dataset and split it to training and test sets. It was done using a ready-
made TFOD-API tutorial Python script. [64] Technically, using a validation set in
TFOD is optional. To use validation during training time, a separate evaluation
job must be launched from another Anaconda prompt to monitor the progress
using the validation set. [65.]
4. Generate the labelmap and tf-records for the FMLD dataset. Change and exper-
iment with hyperparameters in pipeline.config. Run training from the Anaconda
environment prompt.
5. Run evaluation of the model using the test set. Show performance metrics in
Tensorboard. The same performance metrics are shown in the Anaconda prompt
after having run the evaluation as well.

The screenshot below (see figure 19) shows what the model training looks like while
training is still in progress. Notably the printout shows the current level of the loss func-
tion.

Figure 19. SSD-MobileNet detector being trained. Ultimately the EfficientDet model was used
instead of MobileNet, though.

11 Dataset pre-processing steps

The FMLD dataset, as provided, was in some ways problematic and required preprocessing
steps. Firstly, the original WIDERFACE dataset from which FMLD sampled images
contained some bizarre and, for face detection purposes, ethically questionable images.
Shocking categories included real car accident pictures of injured people, street fights,
and distressed people in a flooded area fleeing to safety. The easiest way to clean the
dataset was to search recursively through the XML annotations of the FMLD dataset and
remove these categories of XML annotation files entirely. Finally, the dataset was
reassembled by mapping the annotations to images. Not all of the images could be
individually inspected since there were so many of them. The FMLD annotations provided
the baseline from which images were further removed, as will be described next. The
removed categories of pictures were as follows:

• car, march, riot, rescue, raid, streetfight
• row_boat, angler, swim, parade, traffic, election campaign

The images from the removed categories were often somewhat irrelevant to face
detection, especially the traffic category, which contained many cars and mopeds. The
parade, traffic and march categories had mostly tiny faces in huge panoramic photos of
large parades outside; indoor images at a reasonable resolution would have been
preferable. The election campaign category contained many pictures of political posters
which were annotated as human faces.
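
A minimal sketch of how this kind of category-based cleaning could be scripted is shown
below. The folder layout and the assumption that the category name appears in the
annotation file name are illustrative only and not the exact script used in the project.

from pathlib import Path

REMOVED_CATEGORIES = ["car", "march", "riot", "rescue", "raid", "streetfight",
                      "row_boat", "angler", "swim", "parade", "traffic", "election"]

annotation_dir = Path("FMLD/annotations")            # example location
for xml_file in annotation_dir.rglob("*.xml"):       # search the annotations recursively
    name = xml_file.name.lower()
    if any(category in name for category in REMOVED_CATEGORIES):
        xml_file.unlink()                            # drop the whole annotation file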

The second step in preprocessing was to remove from the remaining images those which
had annotations for the “invalid_face” class label. The dataset did not convert into the
tf-records format very well with this extra class, and it seemed redundant. A method to
ignore specific classes was not found in TFOD-API, so deleting the class was the easiest
way to proceed. The simplest method was to search the dataset recursively and remove
those annotation files. The remaining classes in the labelmap and dataset were
'masked_face', 'unmasked_face' and 'incorrectly_masked_face'. However, yet more
problems were encountered with the dataset.

Problem: The tf-records generating script could not parse XML annotations

The original generate_tfrecord.py script was not able to parse the FMLD dataset’s XML
annotation files, although the author’s own small annotated dataset had converted without
problems. This prevented the FMLD dataset from being converted to the tf-records format.

Solution: Edit the generate_tfrecord.py script

The script was edited so that it could parse the XML files in the FMLD dataset, which fixed
the problem. The bounding box coordinates for each object were located at a different
index in this dataset’s XML files. The fixed lines in the generate_tfrecord.py script are as
follows:

value = (root.find('filename').text,
         int(root.find('size')[0].text),
         int(root.find('size')[1].text),
         member[0].text,
         int(float(member[6][0].text)),   # use index 6 for the bndbox element
         int(float(member[6][1].text)),   # FMLD XML stores coordinates as floats,
         int(float(member[6][2].text)),   # so cast through float instead of int
         int(float(member[6][3].text)))

Listing 3. The fix for generate_tfrecord.py

Another problem was a crash in TensorFlow. Evidently, some annotation files had
bounding box coordinates outside the actual image resolution. The error message from
the crash is shown below.

raise errors.InvalidArgumentError(
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected
'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'maximum
box coordinate value is larger than 1.100000: '
1.1067415

Problem: Crash occurred in TensorFlow during training because of bad bounding box
coordinates

The error was found to be caused by bounding box coordinates in the annotations lying
outside the image resolution.

Solution: Remove the offending files with bounding box coordinates outside the image

The CSV files from the dataset were searched in Excel for bounding boxes outside the
image resolution, and all the offending annotation files were removed. Roughly 12-13
files were deleted in this way. No further problems of this type were encountered.
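
The same check could also have been automated with pandas instead of Excel. In the
sketch below, the CSV file name is an example and the column names (filename, width,
height, xmin, ymin, xmax, ymax) are assumed to follow the CSV output of the conversion
script.

import pandas as pd

df = pd.read_csv("train_labels.csv")   # example CSV produced alongside the tf-records
bad = df[(df["xmax"] > df["width"]) | (df["ymax"] > df["height"])
         | (df["xmin"] < 0) | (df["ymin"] < 0)]
print(bad["filename"].unique())        # images whose annotation files should be removed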

The third problem was that during training, TFOD printed a warning about incorrect sRGB
color profiles for some of the images in the dataset. It was difficult to pinpoint the affected
files. The warning message is shown below.

W tensorflow/core/lib/png/png_io.cc:88] PNG warning: iCCP: known incorrect sRGB profile
W tensorflow/core/lib/png/png_io.cc:88] PNG warning: iCCP: cHRM chunk does not match sRGB

Problem: TensorFlow warning about incorrect sRGB profile

The problem appeared to be caused by corrupted images in the original dataset when
they were read by the libpng library.

Solution: Put all the dataset images through an image converter

All the remaining dataset images were run through an image converter from .jpg to .jpg,
i.e. re-encoded. The warning indicated that corrupted images were present in the dataset,
but finding the exact corrupted images individually was not feasible at this time. Linux
users were advised to preprocess images in bulk using a utility program called
ImageMagick. [66] The image converter used was Image Converter by Johnny Westlake,
downloaded from the Microsoft Store. This fixed all the warnings.
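
On any platform, a small Pillow script that simply re-saves every image would presumably
have had the same effect as the converter, since re-encoding normally drops the offending
colour profile chunks. This was not the method actually used here, so it is only a sketch.

from pathlib import Path
from PIL import Image

image_dir = Path("FMLD/images")                     # example location
for path in image_dir.rglob("*.jpg"):
    with Image.open(path) as im:
        im.convert("RGB").save(path, "JPEG")        # re-encode without the bad ICC chunk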

After fixing these problems, the whole dataset was shuffled and divided into train and
test sets again to maintain better class balance, and the tf-records were generated again.
Overall, the training set contained 32,412 images and the test set contained 6,639 images,
so the proportions were 83% training set and 17% test set. The number of training set
bounding boxes was as follows: masked_face, 23,616; unmasked_face, 28,306;
incorrectly_masked_face, 1,174.

The test set bounding boxes were as follows: masked_face, 4,868; unmasked_face,
5,769; incorrectly_masked_face, 253. Notably, incorrectly_masked_face was a very
underrepresented class in the dataset.

12 First training run

12.1 Training settings and hyperparameters

The TensorFlow object detection API is easy to start training with, because in principle,
once the project’s folder structure and the required files are in order and in place, the
training can simply be started from the Anaconda prompt as follows.

python Tensorflow/models/research/object_detection/model_main_tf2.py
--model_dir=Tensorflow/workspace/trainable_model/Efficient_Det_D0_smallres
--pipeline_config_path=Tensorflow/workspace/trainable_model/Efficient_Det_D0_smallres/pipeline.config
--num_train_steps=15000
--alsologtostderr

The hyperparameters, training options and evaluation options were configured in the
EfficientDet D0 model’s pipeline.config file. A fairly basic version of pipeline.config was
used. The entire pipeline.config file is available in Appendix 2. The modified settings
were as follows:

• batch_size = 8, num_classes = 3, num_steps = 15000
• fine_tune_checkpoint_type = "detection"
• In the eval_config section, num_visualizations = 3000 was added, and in the
eval_input_reader section, shuffle = true was added.

The learning schedule of cosine decay was kept, but its parameters were changed. Be-
low is an excerpt of the pipeline.config.

learning_rate {
cosine_decay_learning_rate {
learning_rate_base: 0.008
total_steps: 15000
warmup_learning_rate: 0.001
warmup_steps: 2000
}
}

The learning schedule defines how the learning rate should change during training; cosine
decay was used. The warmup_learning_rate is the starting learning rate, which ramps up
to the learning_rate_base during the warmup_steps. After that, cosine decay is applied
until total_steps have been trained. Figure 20 below shows an example plot of cosine
decay.

Figure 20. Cosine decay for learning rate plotted against steps trained. Copied from Correa
(2019) [67]
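
A minimal sketch of the schedule using the values above is shown below. The exact
formula inside TFOD-API may differ in detail, so this is only meant to illustrate the shape
of the warmup and decay.

import math

BASE_LR, WARMUP_LR = 0.008, 0.001
TOTAL_STEPS, WARMUP_STEPS = 15000, 2000

def learning_rate(step):
    if step < WARMUP_STEPS:
        # linear ramp from warmup_learning_rate to learning_rate_base
        return WARMUP_LR + (BASE_LR - WARMUP_LR) * step / WARMUP_STEPS
    # cosine decay from learning_rate_base towards zero over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1 + math.cos(math.pi * progress))

for step in (0, 2000, 7500, 15000):
    print(step, round(learning_rate(step), 5))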

The EfficientDet D0 model was trained for 15,000 steps, and the total training time was
roughly 2 hrs 30 min. The default loss functions and loss function weighting were used
in the training. The total loss function in EfficientDet D0 was formulated as follows:

Loss_total = Loss_cls + Loss_loc + Loss_reg

Loss_loc was the smooth L1 (Huber) localization loss, Loss_cls was the weighted sigmoid
focal classification loss, and Loss_reg comprised the regularization-related components
of the model, such as the L2 regularizers and batch normalization layers.
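
For reference, minimal per-element sketches of the two main loss terms are given below,
using the gamma = 1.5 and alpha = 0.25 values from the pipeline.config in Appendix 2.
The reduction over anchors and classes that the real implementation performs is omitted.

import numpy as np

def sigmoid_focal_loss(p, y, gamma=1.5, alpha=0.25):
    # focal loss for one predicted probability p and one binary label y
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

def smooth_l1(error, delta=1.0):
    # Huber / smooth L1 loss for one bounding box coordinate error
    return 0.5 * error ** 2 if abs(error) < delta else delta * (abs(error) - 0.5 * delta)

print(sigmoid_focal_loss(0.9, 1), smooth_l1(0.3))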

12.2 Results achieved

The performance of the trained EfficientDet D0 model was evaluated using the entire
test set at the end of the training. Because the model was configured with COCO metrics
in the pipeline.config file, the performance metrics are shown in the COCO-style format.
The performance metrics for EfficientDet D0, measured against the test set, are shown
in figure 21 below. The key figures from the COCO metrics are mAP@IoU=[0.5:0.05:0.95]
over all detection sizes, and mAP@IoU=0.5 over all detection sizes; the latter is the
PASCAL VOC object detection benchmark performance metric. [49]

Comparing the achieved result to the performance of the MTCNN that Batagelj et al. had
tested, this EfficientDet D0 appeared to give a very slight performance improvement of
about +3.3 percentage points mAP over MTCNN. The comparison was 70.5% AP vs
67.11% AP, respectively, referring to table 1 presented earlier in the thesis. [20, pp.8-10]

In principle, both MTCNN and this trained EfficientDet D0 model were shown similar
images, except that in this project only a 93% subset of the original FMLD dataset could
be used. Also, the MTCNN that Batagelj et al. used for testing was a pre-trained face
detector. Most likely, some improvement in mAP was still achieved. [20, pp.8-10]

Figure 21. First training run with COCO performance metrics shown

Other remarks about the performance metrics were that EfficientDet D0 was worse at
detecting smaller faces than medium and large faces, as shown in the figure above by
the mAP scores for area=small, area=medium and area=large.

Average recall (AR) achieved by EfficientDet D0 was not particularly good, being only
0.38, 0.57 and 0.64 for small, medium and large objects, respectively. The average recall
values shown are the highest recall achievable by the model. Based on the AR results,
this EfficientDet could not really be used in the access control setting, where recall would
have to be very high (closer to 1.0) to detect a person using the system who is about to
enter through the door. According to the COCO benchmark website, average recall is
described as follows:

AR is the maximum recall given a fixed number of detections per image, averaged
over categories and IoUs. AR is related to the metric of the same name used in
proposal evaluation but is computed on a per-category basis. [49]

Note also that figure 21 with the COCO performance metrics shows the loss function
values. The overall loss was total_loss = 0.299, and the classification loss was the only
major component of the model’s total loss. According to the TFOD-API tutorial, getting
the loss down to about 0.5-1.0 usually indicates that “fair” detection results can be
achieved. [63]

A slight improvement of +4.5 percentage points mAP@IoU=[0.5:0.95] was noted for the
trained EfficientDet D0 versus the COCO benchmark 2017 pre-trained weights of
EfficientDet D0. The comparison was 38.1% mAP for the trained EfficientDet vs 33.6%
mAP for the pretrained EfficientDet. Most likely, having a smaller number of class labels
in face mask wearing detection helped improve the detection results. [68.]

12.3 Image inspection from test set

Image comparisons from the test set evaluation were visually inspected in Tensorboard,
into which 3,000 test set images were loaded. Examples of model predictions vs ground
truths are shown in figures 22, 23, 24 and 25 below. The model’s predictions with
confidence scores and bounding boxes are on the left-hand side, and the ground truths
are on the right-hand side. The model had difficulties in detecting smaller faces in general,
as shown in figure 22 below. In this case the confidence scores of the model’s detections
were below the required 50% threshold and thus resulted in false negatives. Figure 23
shows better detection results for the teachers’ faces, with the respective confidence
scores above the threshold value.

Figure 22. On the left side, the model did not detect any faces resulting in false negatives

Larger faces appeared to be detected better, as is seen in the picture of teachers below
in figure 23.

Figure 23. In the image above, the larger faces were correctly detected on the left side.

The picture below with the hospital staff (see figure 24) shows that sometimes the dataset
annotations were sub-optimal for object detection training. The images were not
completely annotated; the ground truth bounding box for the clearly masked face of the
female nurse is missing in the right-hand side ground truth picture. This is considered a
mistake in the annotation process.

Figure 24. Occasionally there were missing ground truth bounding boxes for faces.

Note how in the above comparison in figure 24 the object detector correctly detected the
masked female nurse in the background on the left-hand side, but it was not annotated
as ground truth in the right-hand side image even though it should have been.

An example is shown below where face mask detection worked quite nicely (see figure
25).

Figure 25. Face mask wearer was detected with confidence score of 75% on the left-hand side.

12.4 Suggested improvements

Improving the detection of small objects remains a challenge for one-stage detectors in
the field of object detection. It has been empirically demonstrated in the COCO
benchmarks that one-stage detectors like EfficientDet typically have five-fold worse mAP
for small object detection compared to large objects. This is shown in table 3 below,
which reports the EfficientDet D0 performance: the AP for small objects was 12.0%,
whereas the AP for large objects was 51.2%. [69]

Table 3. EfficientDet and YOLOv4 results in COCO 2017 average precision. Table copied from
Solawetz (2020) [69]

Possible fixes for improving the object detection performance of EfficientDet could be
the following.

Use rotational image data augmentation to tilt the image by +/- X degrees; a small value
such as 5 degrees would be useful. However, this option was not available as a feature
in the object detection API at the time. Rotational augmentation has been used in the
object detection context and has improved results. [70]

Train for even more steps and lower the initial learning rate. The first training run of 2
hours and 30 minutes was quite short for object detector training.

Use a weight of 1.1-1.2 for the classification loss component of the total loss function.
The evaluation results showed that the biggest component of the total loss was the
classification loss, so tackling it directly should be attempted. [65.]

Remove all the incorrectly_masked_face class images. This would turn the classification
task into a more reasonable binary classification and immediately simplify the object
detection task. It could bring an improvement because the incorrectly_masked_face
class was heavily underrepresented even in the original FMLD dataset. [69.]

Tweak the multiscale_anchor_generator and NMS algorithm in pipeline.config to hopefully
detect the small faces better. [69] Aspect ratios could be changed to 0.2, 0.25 and 0.5 to
correspond more closely with the expected shapes of human faces, as illustrated below. [65]
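
As an illustration of the last suggestion, the change would only touch the anchor
generator block of pipeline.config, roughly as sketched below (compare with the original
values in Appendix 2; the other fields are left unchanged).

anchor_generator {
  multiscale_anchor_generator {
    min_level: 3
    max_level: 7
    anchor_scale: 4.0
    aspect_ratios: 0.2
    aspect_ratios: 0.25
    aspect_ratios: 0.5
    scales_per_octave: 3
  }
}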

13 Second training run

13.1 Training settings and hyperparameters

The second training run attempted to improve on the earlier results using the fixes
suggested after the first training run. It used a more heavily modified pipeline.config file
compared to the first run. The NMS and anchor box hyperparameters were changed,
together with the image data augmentation options. The optimizer was changed to Adam,
and the learning rate was reduced. Also, the dataset was changed to contain only the
“masked_face” and “unmasked_face” classes, dropping the “incorrectly_masked_face”
class and simplifying the problem. The second training run pipeline.config is given in
Appendix 3. The model was trained for 25,000 steps.

13.2 Results achieved

After training the model for 25,000 steps, the model was evaluated using the test set.
Again, this test set only contained the “masked_face” and “unmasked_face” classes. The
results are shown in figure 26 below.

Figure 26. Second training run with COCO performance metrics using only masked_face and
unmasked_face classes.

Looking at the figure 26 performance metrics, mAP@IoU=0.5 was measured at 0.864,
an increase of +15.9 percentage points mAP compared to the first training run of
EfficientDet D0. The average recall scores improved for medium and large objects but
slightly decreased for small objects. Likewise, the mAP scores improved significantly for
medium and large objects but slightly decreased for small objects (faces). This level of
performance is a clear improvement on the first training run, though still a mediocre level
of performance for object detection in general. In effect, this second model trades better
detection of medium and large faces for weaker detection of small faces.

The trained EfficientDet achieved a significant +19.29 percentage points mAP improve-
ment over the MTCNN pre-trained face detector as tested by Batagelj et al., again refer-
ring to table 1 in chapter 4 of this thesis. The comparison is 86.4% AP for the EfficientDet
D0 vs 67.11% AP for MTCNN.

13.3 Image inspection from test set

Figure 27 below shows some image samples from the second training run’s evaluation
process. 3,000 images from the test set were again reviewed in Tensorboard. The
model’s predictions are on the left-hand side and the ground truths are on the right-hand
side. Some improved performance was noticed visually.

Figure 27. Example face detections and face mask detections from test set.

14 Conclusions

In summary, during this project it was attempted to develop a software solution to detect
face mask usage from pictures. The objective was to create a monitoring system. During
the literature review phase of the project, two avenues to this end were investigated:
direct face detection and object detection-based methods. During the research stage it
was discovered that the pretrained MTCNN and Haar Cascade face detectors would be
less effective, so the focus in the project turned towards object detection methods similar
to the approach of Singh et al. [8; 20], although Batagelj et al. did manage to achieve
excellent results using the state-of-the-art RetinaFace detector as part of their solution.
[20]

While studying the object detection field, it was decided to utilize the high-level TFOD-API
to train the neural network model because of its ease of use. The PASCAL VOC annotated
FMLD dataset from Batagelj et al. was chosen because it was compatible with TFOD-API
and applicable to the project. [20] The basics of object detection terminology and
architectures were examined mostly using online sources and e-books, with Mohamed
Elgendy’s e-book being particularly illuminating. [18] Articles from Padilla et al. and
Ouyang et al. also provided helpful background information about object detection
terminology. [38; 48]

Overall, the project achieved success in that an object detector was trained and tested
for face mask wearing detection. Reasonable results were achieved considering the
limited computing resources available. Based on the performance metrics achieved on
the test set, small detection performance gains were most likely obtained compared to
the pretrained MTCNN face detector tested by Batagelj et al. [20, pp.13-14] It should be
pointed out that state-of-the-art face mask wearing detectors, e.g. AIZooTech and
RetinaFace, still achieve higher performance than the current system.

It appears that the mean average precision scores achieved during the second training
run with the EfficientDet model exceeded the performance that Singh et al. had obtained
in their study. They also used a two-class object detector trained for masked and
unmasked face detection on their custom dataset. [8, pp.14]

Deep learning-based object detection appears to be a very active research field as evi-
denced by the EfficientDet detector, for instance. Most of the practically oriented e-books
that the author came across on the subject were written for the PyTorch platform for deep
learning as opposed to TensorFlow. Object detection is a complex problem to solve in
the object recognition and computer vision field.

Object detection from images is quite sensitive to the quality of the images and the quality
of the annotations. Excessive blur, out-of-focus pictures, darkness and brightness,
occlusion of faces and non-controlled backgrounds in the images of pedestrians make
the training of face detectors more difficult. [50.] Using the TFOD-API turned out to be
efficient. The major problems in the project mostly involved dataset pre-processing and
the actual installation of TensorFlow on the local machine so that it could be utilized.
Using a more commonly used and well-known object detector model such as faster
R-CNN would have been preferred, but its GPU memory usage was too high, although it
appeared later that reducing the batch size to 2 would have made training it possible.

While the TFOD-API is a good framework for beginners in object detection, it was at
times difficult to find accurate information and documentation about hyperparameters
and training configuration options. The official TFOD-API documentation was not very
accessible. TFOD-API is also a very high-level API and hides many of the complexities
of object detection underneath the TensorFlow code. A possible improvement to the
project would have been to attempt to increase the average recall to about the level of
0.8-0.9. Exporting the trained TensorFlow object detector’s inference graph would have
been the next step in the project. Doing so would have allowed saving the model’s
weights and using the model for inference in any kind of Python program, outside the
whole TensorFlow object detection API training environment.
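
For reference, exporting would roughly amount to running the exporter script that ships
with TFOD-API from the Anaconda prompt, along the following lines. The paths follow
the folder layout used earlier, the output directory is only an example, and the exact flags
should be checked against the installed version.

python Tensorflow/models/research/object_detection/exporter_main_v2.py
--input_type image_tensor
--pipeline_config_path=Tensorflow/workspace/trainable_model/Efficient_Det_D0_smallres/pipeline.config
--trained_checkpoint_dir=Tensorflow/workspace/trainable_model/Efficient_Det_D0_smallres
--output_directory=Tensorflow/workspace/exported_models/efficientdet_d0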

References

1. Woo-Joo, Kim. 2020. Sinun olisi hyvä kuunnella tätä johtavaa COVID-19
asiantuntijaa Etelä-Koreasta. Online. ASIAN BOSS. 28 March 2020. Asian
Boss, YouTube. <https://youtu.be/gAk7aX5hksU>. Accessed 28 March 2020.

2. Reuters Staff. 2020. Fact check: Outdated video of Fauci saying “there’s no
reason to be walking around with a mask”. Online. 8 October 2020. Reuters.
<https://www.reuters.com/article/uk-factcheck-fauci-outdated-video-masks-
idUSKBN26T2TR >. Accessed 24 March 2021.

3. Finnish Institute of Occupational Health. 2021. Suu-nenäsuojaimet, silmiensu-


ojaimet ja FFP-luokan hengityksensuojaimet hoitotyössä. Online. 21 January
2021. Finnish Institute of Occupational Health. <https://hyvatyo.ttl.fi/korona-
virus/ohje-suu-ja-nenasuojus>. Accessed 24 March 2021.

4. Finnish Institute for Health and Welfare. 2020. Hengityksensuojaimien käyttö.


Online. 27 October 2020. <https://thl.fi/fi/web/infektiotaudit-ja-rokotukset/taudit-
ja-torjunta/infektioiden-ehkaisy-ja-torjuntaohjeita/hengityksensuojaimien-kaytto
>. Accessed 24 March 2021.

5. CDC. 2021. Guidance for Wearing Masks. Online. 19 April 2021 (updated).
<https://www.cdc.gov/coronavirus/2019-ncov/prevent-getting-sick/cloth-face-
cover-guidance.html>. Accessed 25 March 2021.

6. Karjalainen. Online. Karjalainen news. 13 October 2014. <https://www.kar-


jalainen.fi/uutiset/uutis-alueet/kotimaa/item/58077>. Accessed 25 March 2021.

7. Ashish. 2020. How do automatic doors work. Online. 1 August 2020. Science-
ABC. <https://www.scienceabc.com/innovation/automatic-sliding-doors-working-
motion-detector-pressure-sensor-infrared.html>. Accessed 24 March 2021.

8. Singh, Sunil; Ahuja, Umang; Kumar, Munish; Kumar, Krishan; Sachdeva, Moni-
ka. 2021. Face mask detection using YOLOv3 and faster R-CNN models:
COVID-19 environment. Multimedia Tools and Applications 2021. SpringerLink.
Accessed 5 March 2021.

9. Brownlee, Jason. How to Perform Face Detection with Deep Learning. Online.
24 August 2020 (updated). <https://machinelearningmastery.com/how-to-per-
form-face-detection-with-classical-and-deep-learning-methods-in-python-with-
keras/>. Accessed 13 September 2020.

10. Zhang, Kaipeng; Zhang, Zhanpeng; Li, Zhifeng; Qiao, Yu. 2016. Joint Face De-
tection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE
Signal Processing Letters, 23(10), pp.1499-1503. <https://arxiv.org/ftp/arxiv/pa-
pers/1604/1604.02878.pdf>. Accessed 20 March 2021.

11. de Paz Centeno, Iván. MTCNN face detection implementation for TensorFlow,
as a PIP package. Online. 14 January 2021 (updated).
<https://github.com/ipazc/mtcnn >. Accessed 25 March 2021.

12. Shanmugamani, Rajalingappaa. 2018. Deep Learning for Computer Vision.


Packt Publishing. Electronic book. O’Reilly Safari Online. Accessed 25 March
2021.

13. Michelucci, Umberto. 2019. Advanced Applied Deep Learning: Convolutional


Neural Networks and Object Detection. Apress. Electronic book. O’Reilly Safari
Online. Accessed 26 March 2021.

14. Mathworks. What is object detection? Online. <https://se.mathworks.com/dis-


covery/object-detection.html>. Accessed 9 April 2021

15. Yang, Shuo; Luo, Ping; Loy, Chen Change; Tang, Xiaoou. 2016. Wider Face: A
Face Detection Benchmark. 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) 2016, pp. 5525-5533. Accessed 26 March 2021.

16. Liu, Haowei. 2014. Face technologies on mobile devices. Waltham, MA, USA:
Elsevier Inc. Electronic book. O’ Reilly Safari Online. Accessed 25 March 2021.

17. Dwivedi, Divyansh. Face Detection for Beginners. Online. 27 April 2018.
<https://towardsdatascience.com/face-detection-for-beginners-e58e8f21aad9>.
Accessed 25 March 2021.

18. Elgendy, Mohamed. 2021. Deep Learning for Vision Systems. Shelter Island,
NY, USA: Manning Publications Co. Electronic book. O’Reilly Safari Online. Ac-
cessed 12 April 2021.

19. AIZOOTech. 24 March 2021 (updated). <https://github.com/AIZOOTech/Face-


MaskDetection>. Accessed 17 April 2021.

20. Batagelj, Borut; Peer, Peter; Štruc, Vitomir; Dobrišek, Simon. 2021. How to Cor-
rectly Detect Face-Masks for COVID-19 from Visual Information? Applied Sci-
ences, 2021, 11(5):2070. <https://www.mdpi.com/2076-3417/11/5/2070/htm>.
Accessed 1 April 2021.

21. Shiming Ge; Jia Li; Qiting Ye; Zhao Luo. Detecting Masked Faces in the Wild
with LLE-CNNs. Online. 2017. <https://imsg.ac.cn/research/maskedface.html>
Accessed 1 April 2021.

22. Multimedia Laboratory, Department of Information Engineering, The Chinese


University of Hong Kong. 2015. WIDER FACE: A Face Detection Benchmark.
Online. 19 November 2015. <http://shuoyang1213.me/WIDERFACE/>. Ac-
cessed 30 March 2021.

23. Batagelj, Borut; Peer, Peter; Štruc, Vitomir; Dobrišek, Simon. 2021. Face Mask
Label Dataset (FMLD). Online. 2021. <https://github.com/borutb-fri/FMLD>. Ac-
cessed 7 April 2021.

24. COCO Consortium. 2020. COCO object detection benchmark. Online.


<https://cocodataset.org/#detection-2020>. Accessed 3 April 2021.

25. COCO. Online. 9 April 2021 (updated). Tensorflow. <https://www.tensor-


flow.org/datasets/catalog/coco>. Accessed 8 May 2021.

26. OpenCV. Cascade classifier. Online. <https://docs.opencv.org/3.4/db/d28/tuto-


rial_cascade_classifier.html>. Accessed 1 March 2021.

27. Pound, Mike. 2018. Detecting Faces (Viola Jones Algorithm). Online. Com-
puterphile. <https://youtu.be/uEJ71VlUmMQ>. Accessed 20 March 2021.

28. Viola, Paul & Jones, Michael. 2001. Rapid Object Detection using a Boosted
Cascade of Simple Features. Proceedings of the 2001 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, CVPR 2001, 2001,
pp. I-I. <https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-
01.pdf>. Accessed 27 March 2021.

29. OpenCV. Cascade classifier. Online. <https://docs.opencv.org/3.4/db/d28/tuto-


rial_cascade_classifier.html>. Accessed 1 March 2021.

30. Ng, Andrew. What Is End-to-end Deep Learning? Online. Coursera.


<https://www.coursera.org/lecture/machine-learning-projects/what-is-end-to-
end-deep-learning-k0Klk>. Accessed 4 May 2021.

31. Chollet, François. 2017. Deep Learning with Python. Shelter Island, NY, USA:
Manning Publications. Electronic book. O’Reilly Safari Online. Accessed 23
April 2021.

32. Michelucci, Umberto. 2018. Applied Deep Learning: A Case-Based Approach to


Understanding Deep Neural Networks. Apress. Electronic book. O’Reilly Safari
Online. Accessed 22 April 2021.

33. Géron, Aurélien. 2017. Hands-On Machine Learning with Scikit-Learn and Ten-
sorFlow. Sebastobol, CA, USA: O’Reilly Media Inc. Electronic book. O’Reilly
Safari Online. Accessed 1 May 2021.

34. Singhal, Harsh. 2017. Convolutional Neural Network with TensorFlow imple-
mentation. Online. 17 June 2017. Medium. <https://medium.com/data-science-
group-iitr/building-a-convolutional-neural-network-in-python-with-tensorflow-
d251c3ca8117>. Accessed 23 April 2021.

35. Ravichandiran, Sudharsan. 2019. Hands-On Deep Learning Algorithms with Py-
thon. Birmingham, UK: Packt Publishing Ltd. Electronic book. O’Reilly Safari
Online. Accessed 23 April 2021.

36. Ng, Andrew. Convolutional Neural Networks. Online. Coursera.


<https://www.coursera.org/lecture/convolutional-neural-networks/why-convolu-
tions-Xv7B5>. Accessed 24 April 2021.

37. Khandelwal, Renu. 2020. Evaluating performance of an object detection model.


Online. 6 January 2020. Towardsdatascience. <https://towardsdatasci-
ence.com/evaluating-performance-of-an-object-detection-model-
137a349c517b>. Accessed 21 April 2021.

38. Padilla, Rafael; Passos, Wesley L.; Dias, Thadeu L.B.; Netto, Sergio L.; da
Silva, Eduardo A.B. 2021. A Comparative Analysis of Object Detection Metrics
with a Companion Open-Source Toolkit. Electronics, 10, (3): 279.


<https://www.mdpi.com/2079-9292/10/3/279/htm>. Accessed 16 April 2021.

39. Solawetz, Jacob; Dwyer, Brad. 2020. Glossary of Common Computer Vision
Terms. Online. 5 October 2020. <https://blog.roboflow.com/glossary/>. Ac-
cessed 15 April 2021.

40. Arlen, Timothy C. 2018. Understanding the mAP evaluation metric for object de-
tection. Online. 1 March 2018. Medium. <https://medium.com/@timothycar-
len/understanding-the-map-evaluation-metric-for-object-detection-
a07fe6962cf3>. Accessed 23 April 2021.

41. Kumar, Harshit. 2019. Evaluation metrics for object detection and segmenta-
tion: mAP. Online. 20 September 2019.
<https://kharshit.github.io/blog/2019/09/20/evaluation-metrics-for-object-detec-
tion-and-segmentation>. Accessed 16 April 2021.

42. Reddy, Yeshwanth; Ayyadevara, Kishore V. 2020. Modern Computer Vision


with PyTorch. Birmingham, UK: Packt Publishing Ltd. Electronic book. O’Reilly
Safari Online. Accessed 16 April 2021.

43. Kurbiel, Thomas. 2020. Gaining an intuitive understanding of Precision, Recall


and Area Under Curve. Online. 28 April 2020. Towardsdatascience. <https://to-
wardsdatascience.com/gaining-an-intuitive-understanding-of-precision-and-re-
call-3b9df37804a7>. Accessed 15 April 2021.

44. Hui, Jonathan. 2018. mAP (mean Average Precision) for Object Detection.
Online. 7 March 2018. Medium. <https://jonathan-hui.medium.com/map-mean-
average-precision-for-object-detection-45c121a31173>. Accessed 16 April
2021.

45. Michelucci, Umberto. 2019. Advanced Applied Deep Learning: Convolutional


Neural Networks and Object Detection. Apress. Electronic book. O’Reilly Safari
Online. Accessed 26 March 2021.

46. Zeng, Nick. 2018. An Introduction to Evaluation Metrics for Object Detection.
Online. 16 December 2018. <https://blog.zenggyu.com/en/post/2018-12-16/an-
introduction-to-evaluation-metrics-for-object-detection/>. Accessed 15 April
2021.

47. Yan, Siqi. 2020. On the Robustness of Object and Face Detection: False Posi-
tives, Attacks and Adaptability. Doctoral thesis. School of Information Technolo-
gy and Electrical Engineering: The University of Queensland. UQ eSpace. Ac-
cessed 6 May 2021.

48. Liu, Li; Ouyang, Wanli; Wang, Xiaogang; Fieguth, Paul; Liu, Xinwang; Pie-
tikäinen, Matti. 2020. Deep Learning for Generic Object Detection: A Survey. In-
ternational Journal of Computer Vision 128, 2020, pp.261-318.
<https://doi.org/10.1007/s11263-019-01247-4>. Accessed 16 April 2021.

49. COCO Consortium. 2020. COCO dataset evaluation metrics. Online.


<https://cocodataset.org/#detection-eval>. Accessed 15 April 2021.

50. Dadhich, Abhinav. 2018. Practical Computer Vision. Packt Publishing. Electron-
ic book. O’Reilly Safari Online. Accessed 22 April 2021.

51. Zhao, Zhong-Qiu; Zheng, Peng; Xu, Shou-Tao; Wu, Xindong. 2019. Object De-
tection With Deep Learning: A Review. IEEE Transactions on neural networks
and learning systems, volume 30, issue 11, November 2019, pp. 3215-3217.

52. Liu, Wei; Anguelov, Dragomir; Erhan Dumitru; Szegedy, Christian; Reed, Scott;
Fu, Cheng-Yang; Berg, Alexander C. 2016. SSD: Single Shot MultiBox Detec-
tor. In: Leibe B., Matas J., Sebe N., Welling M. (eds.). Computer Vision – ECCV
2016. ECCV 2016. Lecture Notes in Computer Science, vol 9905. Springer,
Cham., pp. 21-37. < https://doi-org.ezproxy.metropolia.fi/10.1007/978-3-319-
46448-0_2> <https://arxiv.org/pdf/1512.02325>. Accessed 1 May 2021.

53. Tan, Mingxing; Pang, Ruoming; Le, Quoc V. 2020. EfficientDet: Scalable and
Efficient Object Detection. 2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2020, pp. 10778-10787.
<https://arxiv.org/abs/1911.09070>. Accessed 18 April 2021.

54. Arora, Aman. 2021. EfficientDet – Scalable and efficient object detection.
Online. 11 January 2021. < https://amaarora.github.io/2021/01/11/effi-
cientdet.html#compound-scaling >. Accessed 18 April 2021.

55. Papers with code. ImageNet Leaderboard. Online. < https://paperswith-


code.com/sota/image-classification-on-imagenet >. Accessed 18 April 2021.

56. Arora, Aman. 2020. EfficientNet: Rethinking Model Scaling for Convolutional
Neural Networks. Online. 13 August 2020.
<https://amaarora.github.io/2020/08/13/efficientnet.html#comparing-conven-
tional-methods-with-compound-scaling>. Accessed 18 April 2021.

57. Brownlee, Jason (PhD). 2020. How to Train an Object Detection Model with
Keras. Online. 2 September 2020. <https://machinelearningmastery.com/how-
to-train-an-object-detection-model-with-keras/>. Accessed 1 March 2021.

58. Vladimirov, Lyudmil. Tensorflow object detection API install guide. Online. Ten-
sorflow. <https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/lat-
est/install.html>. Accessed 31 March 2021.

59. Renoitte, Nicholas. 2020. Real Time Face Mask Detection with Tensorflow and
Python | Custom Object Detection w/ MobileNet SSD. Online. 1 November
2020.
<https://youtu.be/IOI0o3Cxv9Q?list=PL0pM29ykw_wDs3sEJTk8r4U1iDDz2w3
Re>. Accessed 28 March 2021.

60. Annotation Best Practices for Object Detection. Online. <https://nanon-


ets.github.io/tutorials-page/docs/annotate>. Accessed 9 April 2021.

61. Renoitte, Nicholas. Realtime object detection. Online. 31 October 2020.


<https://github.com/nicknochnack/RealTimeObjectDetection/blob/main/Tuto-
rial.ipynb>. Accessed 28 March 2021.

62. Morgunov, Anton. 2021. How to Train Your Own Object Detector Using Tensor-
Flow Object Detection API. Online. 4 March 2021. NeptuneAI. <https://nep-
tune.ai/blog/how-to-train-your-own-object-detector-using-tensorflow-object-de-
tection-api>. Accessed 9 April 2021.

63. Vladimirov, Lyudmil. Configuring a training job. Online. Tensorflow. <https://ten-


sorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#con-
figuring-a-training-job>. Accessed 30 March 2021.

64. Vladimirov, Lyudmil. Partition the dataset. Online. Tensorflow. <https://tensor-


flow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#partition-
the-dataset>. Accessed 30 March 2021.

65. Morgunov, Anton. 2021. TensorFlow Object Detection API: Best Practices to
Training, Evaluation & Deployment. Online. 7 May 2021. NeptuneAI.
<https://neptune.ai/blog/tensorflow-object-detection-api-best-practices-to-train-
ing-evaluation-deployment>. Accessed 9 April 2021.

66. Randers-Pehrson, Glenn. 2014. Libpng warning: iCCP: known incorrect sRGB
profile. Online. 30 March 2014. Stackoverflow. < https://stackover-
flow.com/questions/22745076/libpng-warning-iccp-known-incorrect-srgb-profile
>. Accessed 14 April 2021.

67. Correa, Sebastian. 2019. Cosine learning rate decay. Online. 2 September
2019. Medium. <https://medium.com/@scorrea92/cosine-learning-rate-decay-
e8b50aa455b>. Accessed 22 April 2021.

68. TensorFlow 2 Detection Model Zoo. Online. 7 May 2021 (updated). Tensorflow.
<https://github.com/tensorflow/models/blob/master/research/object_detec-
tion/g3doc/tf2_detection_zoo.md>. Accessed 27 March 2021.

69. Solawetz, Jacob. 2020. Tackling the Small Object Problem in Object Detection.
Roboflow. Online. 19 August 2020. Roboflow. < https://blog.roboflow.com/de-
tect-small-objects/>. Accessed 26 April 2021.

70. Zoph, Barret; Cubuk, Ekin D.; Ghiasi, Golnaz; Lin, Tsung-Yi; Shlens, Jonathon;
Le, Quoc V. 2020. Learning Data Augmentation Strategies for Object Detection.
In: Vedaldi A., Bischof H., Brox T., Frahm JM. (eds) Computer Vision – ECCV
2020. ECCV 2020. Lecture Notes in Computer Science, vol 12372, pp 566-583.
Springer, Cham. <https://doi.org/10.1007/978-3-030-58583-9_34>. Accessed
28 April 2021.
Appendix 1

Tested requirements with version numbers

• Tensorflow 2.4.1
• pycocotools 2.0 (alternatively pycocotools windows 2.0.0.2)
• CUDA toolkit 11.0 (from Nvidia website)
• CuDNN 8.0.4 (from Nvidia website)
• LabelImg labeling tool 1.8.3 (useful utility program for labeling a small test
dataset)
• RTX 2070S graphics card
• Geforce 465.89 gaming driver
• protoc protobuffer compiler 3.15.6 was used to compile protocolbuffers for
Tensorflow object detection API
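
For reference, the Python packages listed above could be installed into a fresh Anaconda environment roughly as sketched below. This is only an illustrative sketch, not the exact procedure used in this work: the environment name, the Python version and the pycocotools-windows and labelImg package names are assumptions, and the CUDA Toolkit, cuDNN and graphics driver still have to be installed separately from the NVIDIA website.

conda create -n tfod python=3.8
conda activate tfod
pip install tensorflow==2.4.1
pip install pycocotools-windows==2.0.0.2
pip install labelImg==1.8.3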

Test 1 to verify the environment works

After having installed the CUDA Toolkit and cuDNN, this test 1 command is executed from the Anaconda environment prompt to verify that GPU support is functioning. [58]

python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
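
If the environment is set up correctly, the command prints a tf.Tensor value. As an additional, optional check (not part of the original test), the GPUs visible to TensorFlow can also be listed; an empty list would indicate that GPU support is not working:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"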

Test 2 to verify the environment works

From within the TensorFlow/models/research/ folder in the Anaconda environment prompt, this test 2 command is executed to verify the TFOD API installation [58].

python object_detection/builders/model_builder_tf2_test.py
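
If the installation is correct, the test suite should finish with an OK status. As a further, optional sanity check (not part of the installation guide), the object_detection package can be imported directly to confirm that it is available on the Python path:

python -c "import object_detection; print(object_detection.__file__)"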
Appendix 2

First training run, EfficientDet D0, hyperparameters, 3 classes

model {
ssd {
num_classes: 3
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 512
max_dimension: 512
pad_to_max_dimension: true
}
}
feature_extractor {
type: "ssd_efficientnet-b0_bifpn_keras"
conv_hyperparams {
regularizer {
l2_regularizer {
weight: 4e-05
}
}
initializer {
truncated_normal_initializer {
mean: 0.0
stddev: 0.03
}
}
activation: SWISH
batch_norm {
decay: 0.99
scale: true
epsilon: 0.001
}
force_use_bias: true
}
bifpn {
min_level: 3
max_level: 7
num_iterations: 3
num_filters: 64
}
}
box_coder {
faster_rcnn_box_coder {
y_scale: 1.0
x_scale: 1.0
height_scale: 1.0
width_scale: 1.0
}
}
matcher {
argmax_matcher {
matched_threshold: 0.5
unmatched_threshold: 0.5
ignore_thresholds: false
negatives_lower_than_unmatched: true
force_match_for_each_row: true
use_matmul_gather: true
}
}
similarity_calculator {
iou_similarity {
}
}
box_predictor {
weight_shared_convolutional_box_predictor {
conv_hyperparams {
regularizer {
l2_regularizer {
weight: 4e-05
}
}
initializer {
random_normal_initializer {
mean: 0.0
stddev: 0.01
}
}
activation: SWISH
batch_norm {
decay: 0.99
scale: true
epsilon: 0.001
}
force_use_bias: true
}
depth: 64
num_layers_before_predictor: 3
kernel_size: 3
class_prediction_bias_init: -4.6
use_depthwise: true
}
}
anchor_generator {
multiscale_anchor_generator {
min_level: 3
max_level: 7
anchor_scale: 4.0
aspect_ratios: 1.0
aspect_ratios: 2.0
aspect_ratios: 0.5
scales_per_octave: 3
}
}
post_processing {
batch_non_max_suppression {
score_threshold: 1e-08
iou_threshold: 0.5
max_detections_per_class: 100
max_total_detections: 100
}
score_converter: SIGMOID
}
normalize_loss_by_num_matches: true
loss {
localization_loss {
weighted_smooth_l1 {
}
}
classification_loss {
weighted_sigmoid_focal {
gamma: 1.5
alpha: 0.25
}
}
classification_weight: 1.0
localization_weight: 1.0
}
encode_background_as_zeros: true
normalize_loc_loss_by_codesize: true
inplace_batchnorm_update: true
freeze_batchnorm: false
add_background_class: false
}
}
train_config {
batch_size: 8
data_augmentation_options {
random_horizontal_flip {
}
}
data_augmentation_options {
random_scale_crop_and_pad_to_square {
output_size: 512
scale_min: 0.1
scale_max: 2.0
}
}
sync_replicas: true
optimizer {
momentum_optimizer {
learning_rate {
cosine_decay_learning_rate {
learning_rate_base: 0.008
total_steps: 15000
warmup_learning_rate: 0.001
warmup_steps: 2000
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
fine_tune_checkpoint: "Tensorflow/workspace/pretrained_models/efficientdet_d0_coco17_tpu-32/checkpoint/ckpt-0"
num_steps: 15000
startup_delay_steps: 0.0
replicas_to_aggregate: 8
max_number_of_boxes: 100
unpad_groundtruth_tensors: false
fine_tune_checkpoint_type: "detection"
use_bfloat16: true
fine_tune_checkpoint_version: V2
}
train_input_reader {
label_map_path: "Tensorflow/workspace/annotations/label_map.pbtxt"
tf_record_input_reader {
input_path: "Tensorflow/workspace/annotations/train.record"
}
}
eval_config {
num_visualizations: 3000
metrics_set: "coco_detection_metrics"
use_moving_averages: false
batch_size: 1
}
eval_input_reader {
label_map_path: "Tensorflow/workspace/annotations/label_map.pbtxt"
shuffle: true
num_epochs: 1
tf_record_input_reader {
input_path: "Tensorflow/workspace/annotations/test.record"
}
}
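
A pipeline configuration such as the one above is consumed by the TensorFlow Object Detection API training script model_main_tf2.py. The command below is only a sketch of such a training run; the model directory name is a placeholder and not necessarily the one used in this work:

python Tensorflow/models/research/object_detection/model_main_tf2.py --pipeline_config_path=Tensorflow/workspace/models/efficientdet_d0_run1/pipeline.config --model_dir=Tensorflow/workspace/models/efficientdet_d0_run1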
Appendix 3

Second training run, EfficientDet D0, hyperparameters, 2 classes

model {
ssd {
num_classes: 2
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 512
max_dimension: 512
pad_to_max_dimension: true
}
}
feature_extractor {
type: "ssd_efficientnet-b0_bifpn_keras"
conv_hyperparams {
regularizer {
l2_regularizer {
weight: 4e-05
}
}
initializer {
truncated_normal_initializer {
mean: 0.0
stddev: 0.03
}
}
activation: SWISH
batch_norm {
decay: 0.99
scale: true
epsilon: 0.001
}
force_use_bias: true
}
bifpn {
min_level: 3
max_level: 7
num_iterations: 3
num_filters: 64
}
}
box_coder {
faster_rcnn_box_coder {
y_scale: 1.0
x_scale: 1.0
height_scale: 1.0
width_scale: 1.0
}
}
matcher {
argmax_matcher {
matched_threshold: 0.5
unmatched_threshold: 0.5
ignore_thresholds: false
negatives_lower_than_unmatched: true
force_match_for_each_row: true
use_matmul_gather: true
}
}
similarity_calculator {
iou_similarity {
}
}
box_predictor {
weight_shared_convolutional_box_predictor {
conv_hyperparams {
regularizer {
l2_regularizer {
weight: 4e-05
}
}
initializer {
random_normal_initializer {
mean: 0.0
stddev: 0.01
}
}
activation: SWISH
batch_norm {
decay: 0.99
scale: true
epsilon: 0.001
}
force_use_bias: true
}
depth: 64
num_layers_before_predictor: 3
kernel_size: 3
class_prediction_bias_init: -1.0
use_dropout: true
dropout_keep_probability: 0.8
use_depthwise: true
}
}
anchor_generator {
multiscale_anchor_generator {
min_level: 3
max_level: 7
anchor_scale: 2.0
aspect_ratios: 0.25
aspect_ratios: 0.5
aspect_ratios: 1.0
aspect_ratios: 1.5
scales_per_octave: 4
normalize_coordinates: true
}
}
post_processing {
batch_non_max_suppression {
score_threshold: 0.01
iou_threshold: 0.4
max_detections_per_class: 100
max_total_detections: 200
}
score_converter: SIGMOID
}
normalize_loss_by_num_matches: true
loss {
localization_loss {
weighted_smooth_l1 {
}
}
classification_loss {
weighted_sigmoid_focal {
gamma: 1.5
alpha: 0.25
}
}
classification_weight: 1.1
localization_weight: 1.0
}
encode_background_as_zeros: true
normalize_loc_loss_by_codesize: true
inplace_batchnorm_update: true
freeze_batchnorm: false
add_background_class: false
}
}
train_config {
batch_size: 8
data_augmentation_options {
random_horizontal_flip {
}
}
data_augmentation_options {
random_scale_crop_and_pad_to_square {
output_size: 512
scale_min: 0.99
scale_max: 1.2
}
}
data_augmentation_options {
random_adjust_brightness {
max_delta: 0.01
}
}
data_augmentation_options {
random_adjust_contrast {
min_delta: 0.99
max_delta: 1.01
}
}
sync_replicas: true
optimizer {
adam_optimizer {
learning_rate {
cosine_decay_learning_rate {
learning_rate_base: 0.0005
total_steps: 25000
warmup_learning_rate: 1e-05
warmup_steps: 2500
}
}
}
use_moving_average: false
}
fine_tune_checkpoint: "Tensorflow/workspace/pretrained_models/efficientdet_d0_coco17_tpu-32/checkpoint/ckpt-0"
num_steps: 25000
startup_delay_steps: 0.0
replicas_to_aggregate: 8
max_number_of_boxes: 200
unpad_groundtruth_tensors: false
fine_tune_checkpoint_type: "detection"
use_bfloat16: false
fine_tune_checkpoint_version: V2
}
train_input_reader {
label_map_path: "Tensorflow/workspace/annotations/label_map.pbtxt"
tf_record_input_reader {
input_path: "Tensorflow/workspace/annotations/train.record"
}
}
eval_config {
num_visualizations: 3000
metrics_set: "coco_detection_metrics"
use_moving_averages: false
batch_size: 1
}
eval_input_reader {
label_map_path: "Tensorflow/workspace/annotations/label_map.pbtxt"
shuffle: true
num_epochs: 1
tf_record_input_reader {
input_path: "Tensorflow/workspace/annotations/test.record"
}
}
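
As with the first run, this configuration is passed to model_main_tf2.py for training. Evaluation against the test record can be run with the same script by additionally pointing --checkpoint_dir at the training output directory; the paths below are placeholders, not the exact ones used in this work:

python Tensorflow/models/research/object_detection/model_main_tf2.py --pipeline_config_path=Tensorflow/workspace/models/efficientdet_d0_run2/pipeline.config --model_dir=Tensorflow/workspace/models/efficientdet_d0_run2 --checkpoint_dir=Tensorflow/workspace/models/efficientdet_d0_run2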
