

YOLO-based Threat Object Detection in X-ray Images

Reagan L. Galvez1#*, Elmer P. Dadios2#, Argel A. Bandala1#, and Ryan Rhay P. Vicerra2#
1Electronics and Communications Engineering Department
2Manufacturing Engineering and Management Department
#De La Salle University, Manila, Philippines
*Bulacan State University, Malolos, Philippines
reagan_galvez@dlsu.edu.ph

Abstract—Manual detection of threat objects in an X-ray machine is a tedious task for baggage inspectors in airports, train stations, and other establishments. Objects inside baggage seen by the X-ray machine are commonly occluded and difficult to recognize when rotated. Because of this, there is a high chance of missed detection, particularly during rush hour. As a solution, this paper presents a You Only Look Once (YOLO)-based object detector for the automated detection of threat objects in X-ray images. The study compared the performance of transfer learning and training from scratch on the IEDXray dataset, which is composed of scanned X-ray images of improvised explosive device (IED) replicas. The results of this research indicate that training YOLO from scratch beats transfer learning in detecting threat objects. Training from scratch achieved a mean average precision (mAP) of 45.89% on 416×416 images, 51.48% on 608×608 images, and 52.40% with multi-scale images. On the other hand, transfer learning achieved an mAP of only 29.54%, and 29.17% with multi-scale images.

Keywords—automated detection, convolutional neural networks, threat object, transfer learning, X-ray image, YOLO

I. INTRODUCTION

Tight security in public places such as airport terminals, train stations, and commercial establishments is important nowadays due to increasing terrorist activities such as bombings, and helps prevent casualties that can happen at any time. In the Global Terrorism Index 2018 [1], the Philippines was listed among the top 10 countries most impacted by terrorism and was the only Southeast Asian country included. To deal with this, one of the security measures implemented in these public places is baggage scanning using an X-ray machine. The machine projects the items inside the baggage onto a monitor, and an inspector analyzes whether the items pose a threat. Although this process is widely adopted not only in the Philippines but also in other countries, the possibility of missed detections during peak hours is high. This is due to the limited time available to scan and analyze each bag, a problem that can be addressed by using a fast object detector as decision support for threat object detection.

II. RELATED WORKS

In recent years, there has been an increasing amount of literature on object classification and detection [2], [3] using convolutional neural networks (CNN), following the huge success of AlexNet [4] in the image classification task. This established that the features extracted by a CNN can classify objects more accurately than the handcrafted features used in [5], [6]. Most object detection algorithms [7]–[9] are trained and evaluated on the PASCAL VOC [10] or MS COCO [11] datasets due to the availability of annotated images, which saves researchers considerable time in gathering and annotating data. However, training on a different dataset may also be needed to evaluate the performance of the aforementioned object detection algorithms. There is a relatively small body of literature concerned with threat object detection in X-ray images, such as [12], [13]. The main reason is that most X-ray images for this kind of dataset are not publicly available. It is also difficult to generate X-ray images because X-ray machines are expensive compared with an ordinary camera, which can capture various images easily. As a solution, this paper used a YOLO-based object detection algorithm to detect threat objects in X-ray images.

III. METHODOLOGY

A. YOLO Architecture

This study used the You Only Look Once (YOLOv3) architecture [14], an upgrade over [15], [16], to detect threat objects. YOLOv3 predicts bounding boxes using dimension clusters as anchor boxes. The network outputs four values t_x, t_y, t_w, t_h for each bounding box. Given that (c_x, c_y) is the offset of the grid cell from the top left corner of the image, the center (b_x, b_y) of the bounding box is predicted using the sigmoid function σ in (1) and (2).

b_x = σ(t_x) + c_x    (1)

b_y = σ(t_y) + c_y    (2)

The bounding box width b_w and height b_h are computed using (3) and (4), where p_w and p_h are the width and height of the bounding box prior.

b_w = p_w e^{t_w}    (3)

b_h = p_h e^{t_h}    (4)

Table I shows an overview of the feature extractor used by YOLOv3, called Darknet-53. It has 53 convolutional layers trained on the ImageNet [17] dataset, which contains 1000 classes of images. Darknet-53 is a combination of 1×1 and 3×3 convolutional layers with residual (shortcut) connections. The final layers use average pooling to downsample the output, followed by a softmax activation function.
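For illustration, the box decoding in (1)-(4) can be written as a short NumPy sketch. This is a minimal, simplified version; the grid offsets, anchor (prior) sizes, and stride values here are assumptions for the example, not the exact configuration used in this study.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h, stride):
    """Decode one raw YOLOv3 prediction into a box center/size per (1)-(4).

    (c_x, c_y): grid cell offset; (p_w, p_h): anchor (prior) size in pixels;
    stride: pixels per grid cell (32, 16, or 8 across YOLOv3's three scales).
    """
    b_x = (sigmoid(t_x) + c_x) * stride  # (1), scaled from grid units to pixels
    b_y = (sigmoid(t_y) + c_y) * stride  # (2)
    b_w = p_w * np.exp(t_w)              # (3)
    b_h = p_h * np.exp(t_h)              # (4)
    return b_x, b_y, b_w, b_h

# Example: raw outputs for the cell at grid position (7, 5) with a 116x90 prior.
print(decode_box(0.2, -0.1, 0.4, 0.3, c_x=7, c_y=5, p_w=116, p_h=90, stride=32))
```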

978-1-7281-3044-6/19/$31.00 ©2019 IEEE


TABLE I. DARKNET-53 [14]

B. IEDXray Dataset

The X-ray images used in the study were composed of scanned images of IED replicas without explosive material. Fig. 1 shows an exemplar image from the IEDXray dataset in grayscale, and Fig. 2 shows its corresponding histogram. The x-axis of the histogram shows the intensity of the image, while the y-axis shows the number of pixels at each intensity value. The intensities are concentrated between 200 and 255, which are the white pixels in the image.

Fig. 1. Grayscale image of the IEDXray dataset.

Fig. 2. Histogram.
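As an illustration of how the histogram in Fig. 2 is obtained, the sketch below counts pixels at each intensity level of a grayscale scan. The file name is hypothetical, and OpenCV/NumPy are assumed dependencies; any image loader would do.

```python
import cv2
import numpy as np

# Load a scan in grayscale; the file name is hypothetical.
img = cv2.imread("iedxray_sample.png", cv2.IMREAD_GRAYSCALE)

# Count pixels at each of the 256 intensity levels (the histogram in Fig. 2).
counts, _ = np.histogram(img, bins=256, range=(0, 256))

# Fraction of near-white pixels (intensities 200 to 255), the region
# where the IEDXray images concentrate.
print(counts[200:].sum() / counts.sum())
```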

C. Training

The model was trained using stochastic gradient descent (SGD) with Nesterov momentum [18], a learning rate of 0.001, and 300 epochs. The other hyperparameters used in training are summarized in Table II. The batch size was decreased when using 608×608 images and when training from scratch (multi-scale) due to memory constraints. For simple transfer learning and transfer learning (multi-scale), the weights came from a YOLOv3 model that used the spatial pyramid pooling described in [19]. The weights were frozen during training except for the 3 YOLO layers.

TABLE II. SUMMARY OF HYPERPARAMETERS

Model                              | Batch size | Image size
Trained from scratch               | 8          | 416×416
Trained from scratch (608×608)     | 4          | 608×608
Trained from scratch (multi-scale) | 4          | 320 to 608
Transfer learning                  | 8          | 416×416
Transfer learning (multi-scale)    | 8          | 320 to 608
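To make the training setup concrete, the following is a minimal PyTorch-style sketch of the transfer-learning configuration described above: SGD with Nesterov momentum at a learning rate of 0.001, with all weights frozen except the three YOLO layers. The stand-in model and the momentum value of 0.9 are assumptions for illustration; the paper does not publish its training code.

```python
import torch
from torch import nn

# Stand-in for a YOLOv3 network: a backbone plus three detection ("YOLO")
# heads. The real architecture is far larger; this is only a sketch.
model = nn.ModuleDict({
    "backbone": nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU(0.1)),
    "yolo_head_1": nn.Conv2d(32, 24, 1),
    "yolo_head_2": nn.Conv2d(32, 24, 1),
    "yolo_head_3": nn.Conv2d(32, 24, 1),
})

# Freeze everything except the 3 YOLO layers, as described above.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("yolo")

# SGD with Nesterov momentum at the paper's learning rate of 0.001.
# The momentum value itself is an assumption; the paper states only
# that Nesterov momentum [18] was used.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.001, momentum=0.9, nesterov=True,
)
```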
D. Evaluation

The YOLO-based threat object detector was evaluated using the PASCAL VOC detection metric at an Intersection over Union (IoU) threshold of 0.5. IoU in (5) is the ratio between the area of intersection (ground truth X_G ∩ predicted bounding box X_P) and the area of union (ground truth X_G ∪ predicted bounding box X_P).

IoU = A(X_G ∩ X_P) / A(X_G ∪ X_P)    (5)
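As a sketch of (5) for axis-aligned boxes, the function below computes IoU for boxes given in (x1, y1, x2, y2) corner format; the box format is an assumption, since the paper does not specify one.

```python
def iou(box_g, box_p):
    """Intersection over Union per (5) for boxes (x1, y1, x2, y2)."""
    # Intersection rectangle; zero area if the boxes do not overlap.
    ix1, iy1 = max(box_g[0], box_p[0]), max(box_g[1], box_p[1])
    ix2, iy2 = min(box_g[2], box_p[2]), min(box_g[3], box_p[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    return inter / (area_g + area_p - inter)

# A detection counts as a true positive at the PASCAL VOC threshold if
# iou(ground_truth, prediction) >= 0.5.
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))
```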
The mean average precision (mAP) in (6) is the average AP over the C classes, where AP is calculated by interpolating the precision-recall (PR) curve at a set of eleven equally spaced recall levels.

mAP = (1/C) Σ_{i=1}^{C} AP_i    (6)

Precision and recall are computed using (7) and (8), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.

P = TP / (TP + FP)    (7)

R = TP / (TP + FN)    (8)
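The sketch below ties (6)-(8) together: given precision-recall points for each class, it computes the eleven-point interpolated AP and averages over classes to get mAP. The per-class PR values are illustrative only.

```python
import numpy as np

def eleven_point_ap(recalls, precisions):
    """Interpolated AP per (6): mean of the max precision at 11 recall levels."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):  # recall levels 0.0, 0.1, ..., 1.0
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0

# mAP is the mean AP over classes; the PR points per class are illustrative.
pr_curves = {
    "battery": ([0.1, 0.5, 0.9], [1.0, 0.8, 0.4]),
    "mortar":  ([0.2, 0.6, 1.0], [1.0, 0.9, 0.7]),
}
print(np.mean([eleven_point_ap(r, p) for r, p in pr_curves.values()]))
```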
IV. RESULTS AND DISCUSSION

The first experiment seeks to compare the performance of training from scratch and using transfer learning in detecting threat objects. The performance comparison for this experiment in terms of training loss, mAP, precision, and recall is presented in Fig. 3. Interestingly, training from scratch outperformed transfer learning at evaluation. The average training loss when using transfer learning is 25.67, while training from scratch is only 2.40. The mAP in transfer learning starts higher, but after 2 epochs it stops increasing, averaging 21.50% through the last epoch. On the other hand, the mAP when training from scratch increases slowly and stops increasing after about 100 epochs, with an average mAP of 38.03%. The best mAP is 45.89% at 300 epochs.

Fig. 3. Performance comparison between training from scratch and using transfer learning.

In the next experiment, multi-scale training is added to transfer learning. The multi-scale parameter scales the training images from 320 to 608 pixels. Fig. 4 shows the performance comparison between training from scratch and using transfer learning (multi-scale) in terms of training loss, mAP, precision, and recall. Again, training from scratch outperforms transfer learning (multi-scale). The multi-scale parameter also adds noise to the curves without improving training. The average loss and mAP when using multi-scale are 26.15 and 20.40%, respectively.

Fig. 4. Performance comparison between training from scratch and using transfer learning (multi-scale).

In the last experiment, to see the effect of input image size on detection accuracy when training from scratch, the images were scaled to different sizes: first from 416×416 to 608×608, and then multi-scale, which scales images within the range of 320 to 608 pixels. The performance comparison for training from scratch using different input image sizes is shown in Fig. 5. The best mAP achieved in this experiment is 52.40%, obtained with multi-scale images.

Fig. 5. Performance comparison in training from scratch using different input image sizes, 416×416 and 608×608.

Table III shows the performance of each experiment. Closer inspection of the table shows that training YOLOv3 from scratch generally outperforms the transfer learning approach. Increasing the input image size by 46% (416×416 → 608×608) increases the mAP by 12%, while using multi-scale increases the mAP by 14%. Fig. 6 presents the accuracy and speed (average inference time per image) comparison of all the experiments conducted. The best tradeoff between accuracy and speed is achieved by training from scratch (multi-scale), with an mAP of 52.40% and an inference speed of 27.99 ms.
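As a side note on the multi-scale setting used in these experiments, the sketch below shows one common way YOLO-family trainers implement it: picking a random square resolution between 320 and 608 pixels for each batch. The step of 32 reflects YOLO's 32-pixel output stride and is an assumption; the paper states only the 320 to 608 range.

```python
import random

# Candidate square sizes from 320 to 608 in steps of 32 (an assumption:
# YOLO grids require input sizes divisible by the 32-pixel stride).
SIZES = list(range(320, 609, 32))  # 320, 352, ..., 608

def next_input_size():
    """Pick the training resolution for the next batch."""
    return random.choice(SIZES)

# Each batch would then be resized to (s, s) before the forward pass.
print(next_input_size())
```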

TABLE III. EVALUATION SUMMARY

Model                              | mAP    | AP (battery) | AP (mortar) | AP (wires) | Time (ms)
Trained from scratch               | 0.4589 | 0.3733       | 0.8466      | 0.1569     | 29.41
Trained from scratch (608×608)     | 0.5148 | 0.3986       | 0.9401      | 0.2058     | 39.72
Trained from scratch (multi-scale) | 0.5240 | 0.4627       | 0.9209      | 0.1885     | 27.99
Transfer learning                  | 0.2954 | 0.0383       | 0.8115      | 0.0363     | 28.22
Transfer learning (multi-scale)    | 0.2917 | 0.0216       | 0.8096      | 0.0439     | 30.19

Fig. 6. Accuracy vs. speed.

All experiments showed that YOLOv3 has difficulty in detecting thin objects such as wires. Overall, these results indicate that although transfer learning is a must-try approach when training on a small dataset, it does not always improve the accuracy of the model; in this case, no increase in mAP was observed.

V. CONCLUSION

In this study, a YOLO-based object detector was used to automate threat detection in X-ray images. Different experiments were conducted to determine the model that yields the highest accuracy while balancing inference speed. These experiments confirmed that, using a YOLOv3 model, the best mAP (52.40%) can be obtained by training the model from scratch instead of using the transfer learning technique on a small dataset. In addition, training on multi-scale images improved the detection performance by 14%, while increasing the image size improved the performance by only 12%.

REFERENCES

[1] Institute for Economics & Peace, “Global terrorism index 2018: measuring the impact of terrorism,” 2018. [Online]. Available: http://visionofhumanity.org/reports/. [Accessed: 30-Aug-2019].
[2] R. L. Galvez, A. A. Bandala, E. P. Dadios, R. R. P. Vicerra, and J. M. Z. Maningo, “Object detection using convolutional neural networks,” in TENCON 2018 - 2018 IEEE Region 10 Conference, 2018, pp. 2023–2027.
[3] R. L. Galvez, E. P. Dadios, A. A. Bandala, and R. R. P. Vicerra, “Threat object classification in X-ray images using transfer learning,” in 2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM), 2019.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in
Advances in neural information processing systems, 2012, pp.
1097–1105.
[5] N. Dalal and B. Triggs, “Histograms of oriented gradients for
human detection,” in 2005 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp.
886–893.
[6] P. Viola and M. Jones, “Rapid object detection using a boosted
cascade of simple features,” in Proceedings of the 2001 IEEE
Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR 2001), vol. 1, pp. I-511–I-518.
[7] R. Girshick, “Fast R-CNN,” in 2015 IEEE International
Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
[8] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards
real-time object detection with region proposal networks,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149,
Jun. 2017.
[9] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: object detection via
region-based fully convolutional networks,” in Advances in
neural information processing systems, 2016, pp. 379–387.
[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A.
Zisserman, “The PASCAL visual object classes (VOC) challenge,” Int.
J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[11] T.-Y. Lin et al., “Microsoft COCO: common objects in context,”
in European conference on computer vision, 2014, pp. 740–755.
[12] T. Morris, T. Chien, and E. Goodman, “Convolutional neural
networks for automatic threat detection in security X-Ray
images,” in 2018 17th IEEE International Conference on Machine
Learning and Applications (ICMLA), 2018, pp. 285–292.
[13] L. D. Griffin, M. Caldwell, J. T. A. Andrews, and H. Bohler,
“‘Unexpected item in the bagging area’: anomaly detection in X-
Ray security images,” IEEE Trans. Inf. Forensics Secur., vol. 14,
no. 6, pp. 1539–1553, Jun. 2019.
[14] J. Redmon and A. Farhadi, “YOLOv3: an incremental
improvement,” arXiv:1804.02767 [cs.CV], Apr. 2018.
[15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only
look once: unified, real-time object detection,” in Proceedings of
the IEEE conference on computer vision and pattern recognition,
2016, pp. 779–788.
[16] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” arXiv:1612.08242 [cs.CV], Dec. 2016.
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“ImageNet: a large-scale hierarchical image database,” in 2009
IEEE conference on computer vision and pattern recognition,
2009, pp. 248–255.
[18] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the
importance of initialization and momentum in deep learning,” in
Proceedings of the 30th International Conference on Machine
Learning, 2013, vol. 28, no. 3, pp. 1139–1147.
[19] Z. Huang and J. Wang, “DC-SPP-YOLO: Dense connection and
spatial pyramid pooling based YOLO for object detection,”
arXiv:1903.08589 [cs.CV], Mar. 2019.
