
Available online at www.sciencedirect.com

ScienceDirect

www.elsevier.com/locate/procedia

Procedia Computer Science 183 (2021) 768–775

10th International Conference of Information and Communication Technology (ICICT-2020)

Research and implementation of social distancing monitoring technology based on SSD
Jingchen Qin*, Ning Xu

School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China

Abstract

In the fight against Covid-19, social distancing has proven to be a very effective measure to mitigate the spread of the disease. As resumption of work, production and classes accelerates, it is necessary to limit people's social distance to reduce the rate of the virus spread. To solve this problem, a method for monitoring social distancing based on SSD object detection technology is proposed in this study. This method utilizes the SSD300 model to detect people in a video or picture and labels a Red Line as a warning on people whose distances are less than the default one, implementing real-time social distancing monitoring; the mAP reaches 88.44%.
© 2021 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 10th International Conference of Information and Communication Technology (ICICT-2020).
Keywords: SSD; object detection; social distancing monitoring

1. Introduction

Covid-19 is currently pandemic worldwide. Current policies of at least 1 meter physical distancing are associated with a large reduction in infection, and distances of 2 meters might be more effective[1]. Vision-based object detection and tracking technology can be used to help monitor social distancing, mitigating the spread of the disease.

With the development of artificial intelligence technology, vision-based object detection and tracking technology has gradually penetrated into many aspects of people's lives, such as video surveillance, human-computer interaction and behavior understanding, using deep neural networks. Many deep neural networks have demonstrated strong robustness in the field of object detection by autonomously learning object features[2]. The current real-time image detection models based on deep learning mainly use algorithm frameworks such as SSD and YOLO. SSD was proposed by Wei

* Corresponding E-mail address: dqdgoliver@163.com

1877-0509 © 2021 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Selection and peer-review under responsibility of the scientific committee of the 10th International Conference of Information and Communication Technology, ICICT2020

10.1016/j.procs.2021.02.127

Liu in 2016. It has fast recognition speed and high accuracy, and is suitable for the multi-target recognition field[3]. This article attempts to expand its functions for social distancing monitoring in the fight against Covid-19.

2. Overview of object detection

Object detection is one of the research hotspots in the field of computer vision. The goal of object detection is to determine whether there is an object of a specific category (such as cars, people, cats and dogs) in a given image; if it exists, return the spatial location and coverage of each object, such as a bounding box.
Traditional object detection is generally divided into three stages: first, use sliding windows of different sizes to frame parts of the picture as proposed regions; then extract relevant visual features of these regions; finally, use a traditional machine learning classifier for classification and recognition[4]. Traditional object detection methods have the following disadvantages: recognition accuracy is not high; the amount of calculation is relatively large and the running speed is slow; and multiple correct recognition results may appear for the same object[5].
Current object detection algorithms based on neural networks are mainly divided into two categories: two-stage and one-stage. Two-stage algorithms are mainly the series of R-CNN algorithms, which first generate candidate boxes and then classify and regress them through a CNN (Convolutional Neural Network). One-stage algorithms directly generate the classification probability and position coordinates of the object in a single stage; they mainly include YOLO, SSD, etc. Due to the differences in processing methods, these two types of algorithms have different detection performance: two-stage algorithms are dominant in classification and positioning accuracy, while one-stage algorithms are dominant in processing speed[6].
R-CNN series algorithms have high accuracy but suffer from excessive calculation; even Faster R-CNN can only run at a frame rate of 7 FPS (frames per second)[7], which makes real-time detection difficult. Small objects often appear in surveillance footage, and the SSD algorithm has an excellent detection effect on small objects thanks to its unique network architecture. Therefore, this paper focuses on the SSD algorithm, which balances real-time performance and accuracy, to realize real-time social distancing monitoring.

3. SSD algorithm principle

The main idea of SSD (Single Shot MultiBox Detector) is to use a CNN to extract features and evenly perform dense sampling at different positions in the picture[8]. Different scales and aspect ratios can be used for sampling. Object classification and bounding-box regression are performed simultaneously.

3.1. SSD framework

The overall architecture of the SSD model is shown in Fig. 1. The backbone network adopted by SSD is the VGG network. SSD draws on the 'anchor' mechanism of Faster R-CNN, where the 'anchor' is a box with adjustable position and size. SSD uses multi-scale feature maps of different depths to predict objects of different sizes. For small objects, it uses shallower feature maps with higher resolution and sets smaller prior boxes on them; larger prior boxes are set on the deeper feature maps to detect larger objects.

Fig. 1. SSD architecture[7]



We take the features of the third convolution of conv4, fc7, and the second convolutions of conv6, conv7, conv8 and conv9 for further convolution to obtain the prediction results. We call these the effective feature layers. Each n×n feature map has n×n feature map cells, each of which generates several default boxes of fixed size. The number of default boxes generated by a single feature map cell varies between feature maps. The shape of a default box is determined by its scale (the ratio of the default box side length to the original image side length) and aspect ratio. With each feature map cell as the center, a series of default boxes is generated. The following introduces the rules SSD uses to generate default boxes.
The minimum and maximum side lengths of the square default boxes are min_size and √(min_size × max_size). Setting an aspect ratio ar additionally generates 2 rectangles with length and width:

√ar × min_size,   min_size / √ar    (1)
The aspect ratio is artificially set to 2 or 3. The min_size and max_size of the default box are determined by the
corresponding scale of each feature map. The scale calculation formula is as follows:

s_k = s_min + ((s_max − s_min) / (m − 1)) × (k − 1),   k ∈ [1, m]    (2)
Set s_min = 0.2 and s_max = 0.9; m in the formula refers to the number of effective feature layers used for prediction. For example, SSD300 uses 6 feature layers for prediction as mentioned before, so m = 6. It can be seen from the default box sizes used by each feature layer that SSD uses low-level feature maps to detect small objects and high-level feature maps to detect large objects. This is the most prominent contribution of SSD.
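The scale formula (2) and the default-box shapes of (1) can be sketched in a few lines (a minimal sketch; the function names are ours, and the extra square box of side √(min_size × max_size) mentioned above is omitted for brevity):

```python
import math

def ssd_scales(s_min=0.2, s_max=0.9, m=6):
    """Scale s_k for each of the m effective feature layers (Eq. 2)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_box_sizes(scale, aspect_ratios=(2, 3), image_size=300):
    """Side lengths (w, h) in pixels of the default boxes for one feature layer."""
    min_size = scale * image_size
    boxes = [(min_size, min_size)]  # square box of side min_size
    for ar in aspect_ratios:
        # Eq. 1: sqrt(ar)*min_size by min_size/sqrt(ar), plus the transposed box
        boxes.append((min_size * math.sqrt(ar), min_size / math.sqrt(ar)))
        boxes.append((min_size / math.sqrt(ar), min_size * math.sqrt(ar)))
    return boxes

print([round(s, 2) for s in ssd_scales()])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Note that the two rectangles of a given aspect ratio preserve the area of the square box, so boxes on one layer differ only in shape, not in scale.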
It is worth mentioning that the reason an object can be detected from a feature map is that the convolution actually preserves the spatial position information of the original input picture[9]. Features at a certain position in the input picture are reflected at the same position on the feature map. Therefore, we can map each feature map cell to the corresponding area of the input picture and convolve the feature map with a 3×3 convolution kernel to make predictions. Then we compare the prior boxes with the ground truth boxes' annotations to train the filters' parameters, adjusting the classification and position of each prior box closer to the ground truth boxes.

3.2. Prior box regression model

After obtaining the default boxes, SSD will first match them with the ground truth boxes of the input picture.
Only the default box with the highest and qualified confidence level is used as the prior box. Then we do bounding-
box regression on these prior boxes.
Each effective feature layer is convolved once with num_priors×4 convolution kernels and once with num_priors×num_classes convolution kernels. Here num_priors, num_classes and 4 represent the number of prior boxes generated per feature map cell, the number of sample classes, and the dimension of [x, y, w, h], respectively.
The first convolution obtains the offsets (adjusted values) of the position and size of the prior box. The position and size of the prior box are denoted by (d_cx, d_cy, d_w, d_h), and the ground truth box's position and size are denoted by (g_cx, g_cy, g_w, g_h). The offset between the prior box and the ground truth box is (l_cx, l_cy, l_w, l_h):

l_cx = (g_cx − d_cx) / d_w,  l_cy = (g_cy − d_cy) / d_h,  l_w = log(g_w / d_w),  l_h = log(g_h / d_h)    (3)
We call the above process of obtaining the regression targets from the ground truth box encode, and the reverse process of obtaining the size and position (b_cx, b_cy, b_w, b_h) of the prediction box from the prediction result (l_cx, l_cy, l_w, l_h) is called decode:

b_cx = d_w × l_cx + d_cx,  b_cy = d_h × l_cy + d_cy,  b_w = d_w × exp(l_w),  b_h = d_h × exp(l_h)    (4)

When predicting, we need to decode the results of the convolution on the effective feature layers to obtain the predicted positions; during training, we encode the information of the ground truth boxes, and the encoded values are used to calculate the loss function L together with the prediction results.
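The encode and decode transforms can be sketched in plain Python (a minimal sketch of equations (3) and (4); production implementations typically also divide the offsets by fixed variance terms, which we omit here):

```python
import math

def encode(d, g):
    """Offsets (l_cx, l_cy, l_w, l_h) of ground truth g relative to prior box d (Eq. 3).
    Boxes are given as (cx, cy, w, h)."""
    d_cx, d_cy, d_w, d_h = d
    g_cx, g_cy, g_w, g_h = g
    return ((g_cx - d_cx) / d_w,
            (g_cy - d_cy) / d_h,
            math.log(g_w / d_w),
            math.log(g_h / d_h))

def decode(d, l):
    """Recover the predicted box from prior box d and predicted offsets l (Eq. 4)."""
    d_cx, d_cy, d_w, d_h = d
    l_cx, l_cy, l_w, l_h = l
    return (d_w * l_cx + d_cx,
            d_h * l_cy + d_cy,
            d_w * math.exp(l_w),
            d_h * math.exp(l_h))

# decode is the exact inverse of encode for any prior box:
prior = (100.0, 100.0, 50.0, 80.0)
truth = (110.0, 90.0, 60.0, 70.0)
assert all(abs(a - b) < 1e-9 for a, b in zip(decode(prior, encode(prior, truth)), truth))
```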

The second convolution on the effective feature layers is used to predict the categories of the prior boxes. Fig. 2 shows the structure in which SSD divides the effective feature layer into three lines: classification prediction, bounding-box regression, and generating prior boxes. We define the weighted sum of localization loss and confidence loss as the loss function L, and reduce the loss by training the model until the loss function no longer drops significantly.

Fig. 2. SSD processing flow

The above is the basic principle of feature extraction, training and prediction of the SSD model.

3.3. Dataset making

To train deep neural networks, a dataset is essential. The dataset can be obtained by manual annotation or from a public dataset. A manually labelled dataset suits specific application environments better, but it takes time and effort. This article selects the PASCAL VOC dataset for making the people dataset we need.

The VOC2007 dataset contains 9963 labelled images covering four main categories (people, animals, vehicles, indoor items) and twenty sub-categories[10]. We need to process it into a dataset containing only people, which is very important, otherwise too many negative samples will reduce the accuracy of the model. We discuss this conclusion in section 3.5.
The VOC2007 folder contains three sub-folders: JPEGImages, Annotations and ImageSets. The JPEGImages folder stores all pictures in jpg format. The Annotations folder contains the annotation information of all pictures, including each object's classification and location; each annotation file's name corresponds to a picture stored in the JPEGImages folder, and its format is xml. The ImageSets folder stores several text files listing the sample names of the training set, test set, training-validation set and validation set. The directory structure of the VOC2007 dataset folder is as follows:
/VOC2007 --------/Annotations
         --------/ImageSets------/Main
         --------/JPEGImages
We use the following process to obtain the required people dataset. Create a folder VOCperson with the same directory structure as the VOC2007 dataset folder. Traverse the xml files under the Annotations folder; keep each file that contains <name>person</name> and delete the rest. Keep the corresponding pictures in the JPEGImages folder according to the remaining annotation file names, and delete the rest. Put the extracted jpg images into the newly created JPEGImages folder, and put the extracted annotation files (whose names and number should match the pictures in the new JPEGImages folder) into the new Annotations folder. In the end we obtained 2095 pictures for training, validation and testing.
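Assuming the standard VOC layout described above, the filtering step might look like this (a sketch; the function name is ours, and it keeps each qualifying annotation file whole rather than stripping its non-person objects):

```python
import os
import shutil
import xml.etree.ElementTree as ET

def extract_person_dataset(voc_dir, out_dir):
    """Copy into out_dir only the images (and annotations) that contain a person."""
    os.makedirs(os.path.join(out_dir, "Annotations"), exist_ok=True)
    os.makedirs(os.path.join(out_dir, "JPEGImages"), exist_ok=True)
    ann_dir = os.path.join(voc_dir, "Annotations")
    kept = 0
    for fname in sorted(os.listdir(ann_dir)):
        tree = ET.parse(os.path.join(ann_dir, fname))
        names = [obj.findtext("name") for obj in tree.findall("object")]
        if "person" in names:  # keep only annotations that mention a person
            stem = os.path.splitext(fname)[0]
            shutil.copy(os.path.join(ann_dir, fname),
                        os.path.join(out_dir, "Annotations", fname))
            shutil.copy(os.path.join(voc_dir, "JPEGImages", stem + ".jpg"),
                        os.path.join(out_dir, "JPEGImages", stem + ".jpg"))
            kept += 1
    return kept
```

Run as `extract_person_dataset("VOC2007", "VOCperson")`; the return value is the number of person images kept.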

3.4. Training process

The SSD model is based on the TensorFlow[11] deep learning framework. Our experimental environment is as follows: Intel i5-8400 CPU, 8 GB RAM, NVIDIA GTX 1060 3 GB graphics card, tensorflow-gpu 1.13.2.
Before training, put the pictures and label files into the project. 10% of the dataset is used for validation, and 90% is used for training. Set the classes in voc_classes.txt in the model_data folder to person, and set num_classes=2 (person and background) in train.py. Fully shuffle the dataset to avoid certain features appearing in clusters. Put

ssd_weights.h5 into the corresponding folder as the initial weights for training. Build the SSD300 model with the Keras Functional API, and specify the loss, optimizer, batch_size and number of epochs. Once the SSD model is compiled, start training the model[12].

During training, monitor indicators such as the loss on the training data and the validation data, and adopt a suitable learning rate decay strategy: first train with a larger learning rate; when the indicators on the training set and validation set no longer improve, end that phase (Early Stopping), and then set the learning rate to 10⁻⁵ to continue training. This ensures adequate training. The whole training process takes about two hours.
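The two-phase schedule can be sketched framework-agnostically (a simplified sketch: train_step and val_loss stand in for one Keras training epoch and its validation pass, and the coarse learning rate of 1e-4 is our assumption, since the source leaves the first rate ambiguous):

```python
from itertools import chain, repeat

def two_phase_training(train_step, val_loss, lr_coarse=1e-4, lr_fine=1e-5,
                       patience=3, max_epochs=100):
    """Run two training phases; each phase stops early once the validation
    loss stops improving (Early Stopping), then the learning rate drops."""
    history = []
    for lr in (lr_coarse, lr_fine):
        best, wait = float("inf"), 0
        for _ in range(max_epochs):
            train_step(lr)          # one epoch of training at this learning rate
            loss = val_loss()       # validation loss after the epoch
            history.append(loss)
            if loss < best - 1e-4:  # still improving: reset the patience counter
                best, wait = loss, 0
            else:
                wait += 1
                if wait >= patience:
                    break           # indicators no longer change: end this phase
    return history

# Toy run: losses plateau at 0.9, then (after the LR drop) at 0.5.
losses = chain([1.0, 0.9, 0.9, 0.9, 0.9, 0.5], repeat(0.5))
print(len(two_phase_training(lambda lr: None, lambda: next(losses))))  # 9
```

In Keras the same behavior is usually obtained with the EarlyStopping callback and two successive `compile`/`fit` calls.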

3.5. Experimental result

Use TensorBoard to visualize the training results after training. In section 3.3 we argued that it is essential to process VOC2007 into a dataset containing only people, so the model was trained with the processed dataset and the unprocessed dataset separately, and we compare the training results of the two experiments. It can be seen from Fig. 3 that although the training set is smaller, the loss of training with the processed dataset decreased from 1.9 to 1.8.

(a) loss before data set processing (b) loss after data set processing

Fig. 3. Comparison of loss before and after data set processing.

Next, we need to calculate the positioning accuracy of the model. We first run detection on all the test set pictures to obtain the positions of the prediction boxes, then read the positions of the ground truth boxes from the annotation files. From each prediction box and ground truth box we calculate the Intersection over Union (IoU): S_I / S_U, where S_I is the overlapping area between the prediction box and the ground truth box, and S_U is the total area occupied by the prediction box and the ground truth box. Whether a sample is correctly classified is determined by whether its IoU is greater than the threshold. Samples correctly classified as positive are TP (True Positives), samples correctly classified as negative are TN (True Negatives), samples incorrectly classified as positive are FP (False Positives), and samples incorrectly classified as negative are FN (False Negatives). Precision is the proportion of TP among the samples classified as positive, and Recall is the proportion of TP among all the positive samples. Precision and Recall are respectively:

Precision = TP / (TP + FP),   Recall = TP / (TP + FN)    (5)

The mAP value is a measure of the accuracy of an object detection model[13]. Taking different confidence thresholds gives different Precision and Recall values, and the area under the resulting P-R curve is the AP; averaged over the classes (here only the person class), this gives the mAP. According to our calculations, the trained model's mAP = 88.44%, as shown in Fig. 4.
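The IoU computation and the Precision/Recall formulas of (5) can be sketched as follows (a minimal sketch on axis-aligned boxes; the function names are ours):

```python
def box_area(b):
    """Area of a box given as (x1, y1, x2, y2)."""
    return (b[2] - b[0]) * (b[3] - b[1])

def iou(box_a, box_b):
    """Intersection over Union S_I / S_U of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    s_i = max(0, ix2 - ix1) * max(0, iy2 - iy1)     # overlapping area S_I
    s_u = box_area(box_a) + box_area(box_b) - s_i   # total covered area S_U
    return s_i / s_u

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)  (Eq. 5)."""
    return tp / (tp + fp), tp / (tp + fn)

print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # 0.1429 (overlap 1, union 7)
```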

Fig. 4. The mAP of the model.

4. Social distancing monitoring

After training the SSD model, we expand its function to monitor the social distancing of people. According to section 3.2, the model obtains the positions of the prediction boxes when detecting people in a picture. In the SSD.py file, we created an empty dictionary. When the model detects a person and generates a prediction box, we save the coordinates of the top left corner of the prediction box in the dictionary. If the length of the dictionary is greater than 1, at least two people have been detected in the picture. We use the itertools.combinations() function to enumerate every pair of prediction boxes and calculate the Euclidean distance √((a₁ − b₁)² + (a₂ − b₂)²) between their top left vertices a and b. If the distance is less than a certain value (100 in this study), the two persons are judged to be too close. The complete block diagram of the algorithm is shown in Fig. 5.
The steps in Fig. 5 are summarized as follows. The model receives the input picture, which can also be a frame of video. Then the SSD object detection algorithm detects and locates the people in the picture and draws green boxes on them. The Euclidean distance between the top left vertices of the prediction boxes is calculated as the distance between people; if the distance is less than a certain value (we set 100), the two people are judged to be too close. Finally, a Red Line is drawn connecting the top left vertices of the prediction boxes that are too close, warning that the social distance between the two people is dangerous.
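The pair-wise distance check can be sketched as follows (a minimal sketch; the function name is ours, and a pair is flagged when the distance between top left corners falls below the threshold):

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python >= 3.8)

def too_close_pairs(top_left_corners, threshold=100):
    """Return index pairs of prediction boxes whose top left corners are
    closer than `threshold` pixels; these pairs get a warning Red Line."""
    pairs = []
    for (i, p), (j, q) in combinations(enumerate(top_left_corners), 2):
        if dist(p, q) < threshold:
            pairs.append((i, j))
    return pairs

corners = [(10, 10), (50, 40), (400, 300)]
print(too_close_pairs(corners))  # [(0, 1)] - only the first two people are too close
```

In the real pipeline a line would then be drawn between each flagged pair, e.g. with OpenCV's cv2.line on the frame.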

Fig. 5. Monitoring process diagram.

The detection results are shown in Fig. 6, and the two images are from VOC2007.

(a) two of three people are too close (b) three of five people are too close

Fig. 6. Prediction results.

This study also ran an experiment on video, and the detection frame rate reached about 8 frames per second; results are shown in Fig. 7. The real-time frame rates are displayed in the upper left corner of the pictures: 8.06 and 7.03 fps respectively.

(a) 8.06 fps, detecting that two people are too close (b) 7.03 fps, detecting that four people are too close

Fig. 7. Video detection result graphs

5. Conclusions

This study shows that the SSD object detection algorithm offers both high accuracy and high speed in the application of social distancing monitoring in public places. People often appear relatively small in surveillance video, which plays directly to SSD's strength in detecting small objects. This will help to slow the spread of the epidemic.

This study uses the original SSD model; if the model architecture is modified and optimized, the recognition accuracy for small objects can be further improved. In addition, the model can be deployed on a dedicated GPU platform (such as the NVIDIA Jetson TX2) to further improve the detection frame rate and real-time performance.

References

1. Chu D.K. et al. Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: a systematic review and meta-analysis. Lancet, 2020. https://doi.org/10.1016/S0140-6736(20)31142-9

2. Zhang Ning. Development of Dynamic Object Detection System Based on UAV Platform[D]. Zhejiang University,2018. (In Chinese)
3. LI Xiaoning, LEI Tao, ZHONG Jiandan. Detecting method of small vehicle targets based on improved [J]. Journal of Applied Optics, 2020, 41(01):150-155. (In Chinese)
4. HU Menglong, SHI Yu. Research on Small Target Object Detection Algorithm Based on SSD Method [J]. Modern Information
Technology,2020,4(03):5-9. (In Chinese)
5. Yangjie Xie. Research and Implementation of SSD Target Detection Technology Based on FPGA[D].Xidian University,2019. (In Chinese)
6. Li Huaqing. The Research on Fast Detection Algorithm for Small Targets of Aerial Images Based on SSD[D]. Xidian University,2018. (In
Chinese)
7. Liu W , Anguelov D , Erhan D , et al. SSD: Single Shot MultiBox Detector[C]// European Conference on Computer Vision. Springer
International Publishing, 2016.
8. Lin T Y , Goyal P , Girshick R , et al. Focal Loss for Dense Object Detection[J]. IEEE Transactions on Pattern Analysis & Machine
Intelligence, 2017, PP(99):2999-3007.
9. He K., Zhang X., Ren S., Sun J. (2014) Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In: Fleet D.,
Pajdla T., Schiele B., Tuytelaars T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8691.
Springer, Cham
10. Everingham M , Gool L V , Williams C K I , et al. The Pascal Visual Object Classes (VOC) Challenge[J]. International Journal of
Computer Vision, 2010, 88(2):p.303-338.
11. Tensorflow.An Open Source Machine Learning Framework for Everyone [EB/OL].[2017-11-10].https://tensorflow.org
12. Chollet F.Keras[EB/OL].[2020-04-12].https://keras.io.
13. WEI Wei,PU Wei,LIU Yi. Application of Improved YOLOv3 in Aerial Target Detection [J]. Computer Engineering and
Applications,2020,56(07):17-23. (In Chinese)
