YOLO-based Threat Object Detection in X-ray Images
All content following this page was uploaded by Reagan L. Galvez on 18 May 2020.
Manila, Philippines
*
Bulacan State University
Malolos, Philippines
reagan_galvez@dlsu.edu.ph
Abstract—Manual detection of threat objects in an X-ray machine is a tedious task for baggage inspectors in airports, train stations, and establishments. Objects inside the baggage seen by the X-ray machine are commonly occluded and difficult to recognize when rotated. Because of this, there is a high chance of missed detection, particularly during rush hour. As a solution, this paper presents a You Only Look Once (YOLO)-based object detector for the automated detection of threat objects in an X-ray image. The study compared the performance of transfer learning and training from scratch on the IEDXray dataset, which is composed of scanned X-ray images of improvised explosive device (IED) replicas. The results indicate that training YOLO from scratch beats transfer learning in quick detection of threat objects. Training from scratch achieved a mean average precision (mAP) of 45.89% on 416×416 images, 51.48% on 608×608 images, and 52.40% on multi-scale images. On the other hand, transfer learning achieved an mAP of only 29.54%, and 29.17% with multi-scale images.

Keywords—automated detection, convolutional neural networks, threat object, transfer learning, X-ray image, YOLO

I. INTRODUCTION

Tight security in public places such as airport terminals, train stations, and commercial establishments is important nowadays due to increasing terrorist activities like bombing. Such security can prevent casualties that could happen at any time. In the Global Terrorism Index 2018 [1], the Philippines was listed among the top 10 countries most impacted by terrorism and was the lone Southeast Asian country included. To deal with this, one of the security measures implemented in these public places is baggage scanning using an X-ray machine. The machine projects the items inside the baggage onto a monitor, and the inspector analyzes whether the items pose a threat. Although this process is widely adopted not only in the Philippines but also in other countries, the possibility of missed detections during peak hours is high. This is due to the limited time available to scan and analyze the baggage, which can be addressed by using a fast object detector as decision support for threat object detection.

II. RELATED WORKS

In recent years, there has been an increasing amount of literature on object classification and detection [2], [3] using convolutional neural networks (CNN) after the huge success of AlexNet [4] in the image classification task. This established that features extracted by a CNN can classify objects more accurately than the handcrafted features used by [5], [6]. Many object detection algorithms [7]–[9] are trained and evaluated using the PASCAL VOC [10] or MS COCO [11] dataset due to the availability of annotated images. These have saved researchers a lot of time in gathering data and annotating each image. However, training on a different dataset may also be needed to evaluate the performance of the aforementioned object detection algorithms. There is a relatively small body of literature concerned with threat object detection in X-ray images, such as [12], [13]. The main reason is that most X-ray images for this kind of dataset are not publicly available. It is also difficult to generate X-ray images because X-ray machines are expensive compared with an ordinary camera that can capture various images easily. As a solution, this paper used a YOLO-based object detection algorithm to detect threat objects in X-ray images.

III. METHODOLOGY

A. YOLO Architecture

This study used the You Only Look Once (YOLOv3) architecture [14], an upgrade from [15], [16], to detect threat objects. YOLOv3 predicts bounding boxes using dimension clusters as anchor boxes. The network outputs 4 values $t_x$, $t_y$, $t_w$, $t_h$ for each bounding box. Given that $(c_x, c_y)$ is the offset of the grid cell from the top-left corner of the image, the center $(b_x, b_y)$ of the bounding box in (1) and (2) is predicted using the sigmoid function $\sigma$.

$$b_x = \sigma(t_x) + c_x \tag{1}$$

$$b_y = \sigma(t_y) + c_y \tag{2}$$

The bounding box width $b_w$ and height $b_h$ are computed using (3) and (4), where $p_w$ and $p_h$ are the width and height of the bounding box prior.

$$b_w = p_w e^{t_w} \tag{3}$$

$$b_h = p_h e^{t_h} \tag{4}$$

Table I shows an overview of the feature extractor used by YOLOv3, called Darknet-53. It has 53 convolutional layers trained on the ImageNet [17] dataset, which contains 1000 classes of images. Darknet-53 combines convolutional layers of sizes 1×1 and 3×3 with residual (shortcut) connections. The final layer uses average pooling to downsample the output, followed by a softmax activation function.
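
The bounding box decoding in (1)–(4) can be illustrated with a minimal Python sketch. It follows the YOLOv3 formulation in which the sigmoid is applied to the raw offsets before the grid cell offset is added, that is, b_x = σ(t_x) + c_x and b_y = σ(t_y) + c_y. The function name `decode_box`, the example inputs, and the choice of grid-cell units for the output are assumptions made for illustration and are not taken from the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, which maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode raw network outputs into a box, following (1)-(4).

    (c_x, c_y) is the grid cell offset from the top-left corner and
    (p_w, p_h) is the width and height of the anchor (prior) box.
    Units are whatever units the offsets and priors are given in
    (grid cells in this hypothetical example).
    """
    b_x = sigmoid(t_x) + c_x      # (1): center x stays inside the cell
    b_y = sigmoid(t_y) + c_y      # (2): center y stays inside the cell
    b_w = p_w * math.exp(t_w)     # (3): width scales the prior width
    b_h = p_h * math.exp(t_h)     # (4): height scales the prior height
    return b_x, b_y, b_w, b_h


# Hypothetical example: zero raw outputs place the center in the middle
# of cell (3, 5) and leave the prior size unchanged.
box = decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=5, p_w=2.0, p_h=4.0)
print(box)  # (3.5, 5.5, 2.0, 4.0)
```

Because the sigmoid bounds the predicted offset to (0, 1), the decoded center cannot leave the grid cell responsible for the prediction, while the exponential keeps the width and height positive.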
Fig. 2. Histogram.
C. Training
The model was trained using stochastic gradient descent (SGD) with Nesterov momentum [18], with a learning rate of 0.001 for 300 epochs. The other hyperparameters used in training are summarized in Table II. The batch size was decreased when using 608×608 images and when training from scratch (multi-scale) due to memory constraints. For simple transfer learning and transfer learning (multi-scale), the weights came from a YOLOv3 model that used spatial pyramid pooling, as described in [19]. The weights were frozen during training except for the 3 YOLO layers.
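
As a rough illustration of the optimizer named above, the following Python sketch applies a Nesterov momentum update to a toy one-dimensional objective. Only the learning rate of 0.001 comes from the text. The momentum coefficient of 0.9, the use of 300 update steps (the paper reports 300 epochs, not steps), the quadratic objective, and the function names are assumptions chosen for illustration, and this is not the training code used in the study.

```python
def nesterov_sgd(grad, w0, lr=0.001, momentum=0.9, steps=300):
    """Minimize a 1-D function with SGD and Nesterov momentum.

    The gradient is evaluated at the look-ahead point w + momentum * v
    rather than at w, which is what distinguishes Nesterov momentum
    from classical momentum.
    """
    w, v = w0, 0.0
    for _ in range(steps):
        lookahead = w + momentum * v                  # peek ahead along the velocity
        v = momentum * v - lr * grad(lookahead)       # update the velocity
        w = w + v                                     # apply the step
    return w


def grad_quadratic(w):
    """Gradient of the toy objective f(w) = (w - 3)^2 (hypothetical)."""
    return 2.0 * (w - 3.0)


w_final = nesterov_sgd(grad_quadratic, w0=0.0)
print(w_final)  # approaches the minimizer w = 3
```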
$$mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i \tag{6}$$
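
The evaluation metrics can be sketched in a few lines of Python: mAP as the mean of the per-class average precision values in (6), and precision and recall from the true positive, false positive, and false negative counts in (7) and (8). The function names, the convention of returning 0.0 when a denominator is zero, and all numeric inputs are hypothetical and are not results from this study.

```python
def precision(tp: int, fp: int) -> float:
    """Precision per (7): TP / (TP + FP). Returns 0.0 if nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall per (8): TP / (TP + FN). Returns 0.0 if there are no ground truths."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_average_precision(ap_per_class):
    """mAP per (6): the mean of the per-class AP values over C classes."""
    if not ap_per_class:
        raise ValueError("at least one class AP is required")
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical counts and per-class AP values, for illustration only.
print(precision(tp=8, fp=2))                       # 0.8
print(recall(tp=8, fn=8))                          # 0.5
print(mean_average_precision([0.5, 0.25, 0.75]))   # 0.5
```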
Fig. 1. Grayscale image of the IEDXray dataset.
Fig. 3. Performance comparison between training from scratch and using transfer learning.
Fig. 4. Performance comparison between training from scratch and using transfer learning (multi-scale).
Precision and recall are computed using (7) and (8), where $TP$, $FP$, and $FN$ are true positive, false positive, and false negative, respectively.

$$P = \frac{TP}{TP + FP} \tag{7}$$

$$R = \frac{TP}{TP + FN} \tag{8}$$

IV. RESULTS AND DISCUSSION

The first experiment compares the performance of training from scratch and transfer learning in detecting threat objects. The comparison in terms of training loss, mAP, precision, and recall is presented in Fig. 3. Interestingly, training from scratch outperformed transfer learning at evaluation. The average training loss when using transfer learning is 25.67, while for training from scratch it is only 2.40. The mAP in transfer learning has a higher starting point, but after 2 epochs it does not increase, and the average mAP is 21.50% until the last epochs. On the other hand, the mAP in training from scratch slowly increases, and after about 100 epochs it stops increasing until the last epochs, with an average mAP of 38.03%. The best mAP is 45.89% at 300 epochs.

In the next experiment, multi-scale training is added to transfer learning. The multi-scale parameter scales the training images from 320 to 608 pixels. Fig. 4 shows the performance comparison between training from scratch and using transfer learning (multi-scale) in terms of training loss, mAP, precision, and recall. Again, training from scratch outperforms transfer learning (multi-scale). It also shows that enabling the multi-scale parameter adds noise to the curves and does not improve the training. The average loss and mAP when using multi-scale are 26.15 and 20.40%, respectively.

In the last experiment, to see the effect of the input image size on detection accuracy when training from scratch, the image is scaled to different sizes. The first run increased the size from 416×416 to 608×608, and the other used multi-scale, which scales the image over the range of 320 to 608 pixels. The performance comparison in training from scratch using different input image sizes is shown in Fig. 5. The best mAP achieved in this experiment is 52.40%, which used multi-scale images.

Table III shows the performance of each experiment. Closer inspection of the table shows that training YOLOv3 from scratch generally outperforms the transfer learning approach. Increasing the input image side length by 46% (416×416 → 608×608) raises the mAP by about 12% in relative terms, while using multi-scale raises it by about 14%. Fig. 6 presents the accuracy and speed (average time per image) comparison of all the experiments conducted. The best tradeoff between accuracy and speed is achieved by training from scratch (multi-scale), which has 52.40% mAP and an inference time of 27.99 ms.

All experiments showed that YOLOv3 has difficulty in detecting thin objects like wires. Overall, these results indicate that although transfer learning is a must-try approach when training on a small dataset, it does not
Fig. 5. Performance comparison in training from scratch using different input image sizes (416×416 and 608×608).
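
The percentage gains quoted in the discussion of Table III can be checked with simple arithmetic. The sketch below takes the reported mAP values for training from scratch (45.89%, 51.48%, and 52.40%) and computes the relative changes. The helper name `relative_change_pct` is hypothetical, and reading the 12% and 14% figures as relative rather than absolute increases is an interpretation of the reported numbers.

```python
def relative_change_pct(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


base_map = 45.89  # reported mAP (%) at 416x416, training from scratch

# Growth in image side length, 416 -> 608 pixels.
print(round(relative_change_pct(416, 608)))            # 46

# Growth in pixel count for the same resize, shown for comparison.
print(round(relative_change_pct(416 * 416, 608 * 608)))  # 114

# Relative mAP gains for 608x608 and for multi-scale training.
print(round(relative_change_pct(base_map, 51.48)))     # 12
print(round(relative_change_pct(base_map, 52.40)))     # 14
```

In absolute terms these gains correspond to about 5.6 and 6.5 percentage points of mAP, respectively.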