
Logo Detection and Brand Recognition with One-Stage Logo Detection Framework and Simplified ResNet50 Backbone
Sarwo1, Yaya Heryadi2, Widodo Budiharto3, Edy Abdulrachman4
Doctor of Computer Science, School of Computer Science
Bina Nusantara University, Jakarta, Indonesia
sarwo@binus.ac.id1, yayaheryadi@binus.edu2, wbudiharto@binus.edu3, ediA@binus.edu4

Abstract— Logo and brand name are two concepts that are typically studied in many course subjects. In an education context, automated logo detection and brand-name recognition from digital images or video are crucial learning tools for achieving learning outcomes. One technical issue in logo detection and brand-name recognition is the requirement to develop a model that achieves both fast recognition speed and high recognition accuracy. The one-stage detector is a breakthrough and innovative object-detection framework; however, the long duration and computing power required to carry out training and detection using a deep backbone architecture are often considered the main challenges of this framework. The objective of this study is to propose novel ResNet variant models using ResNet-50 as the basis. The empirical results showed that Model 2 achieved 0.408 mAP, the best average accuracy among the proposed variants, with a training time of 1.41 hours. The original ResNet-50 model, in contrast, achieved 0.556 mAP average accuracy with a training time of 1.91 hours. The detection testing of the proposed Model 2 ran at 23.47 fps, while the detection testing of the ResNet-50 model ran at 29.33 fps.

Keywords— Logo Detection, One-Stage Detector, ResNet50.

I. INTRODUCTION

Logo and brand name are two concepts that are typically studied in many course subjects, such as marketing (e.g., introducing a brand name and its logo, promoting products or services) [1], [2] and transportation (e.g., tracking the logo of an online motorcycle or taxi service). For learning purposes, automated logo detection and brand-name recognition from digital images or video are crucial for achieving learning outcomes.

Research on logo detection is a subtopic of research on object detection, which has been conducted since the 1990s. The research reported by LeCun [3] was probably one of the first studies in the field of object detection using a convolutional neural network model. This work was subsequently refined in succession by Regional Convolutional Neural Networks (R-CNN) [4], Fast R-CNN [5], Faster R-CNN [6], and research that combines object detection and segmentation, Mask R-CNN [7].

According to research by Lin et al. [8], approaches to object detection are broadly divided into three categories: (1) traditional object detectors, (2) one-stage detectors, and (3) two-stage detectors. Traditional object detectors predate deep learning; studies in this category include human detection using HOG features [9], template matching with RGB features as input [10], [11], and several other studies such as [12]–[17]. The second approach, the one-stage detector, combines the classification and detection tasks in a single step: a detector is applied over a regular, dense sampling of object locations, scales, and aspect ratios, as in SSD [18], YOLO [19], and RetinaNet [8]. The last is two-stage object detection, which works in a two-stage process. The first stage generates a set of candidate proposals that contain all objects of interest and filters out most negative locations; the second stage classifies the proposals produced by the first stage into foreground classes or background, as in R-CNN [20], Fast R-CNN [5], Faster R-CNN [6], Mask R-CNN [21], and others [6], [7], [22]–[25].

In addition to the challenges mentioned, many practical applications of logo detection require real-time operation, which has turned much research toward single-stage logo detectors. Although some one-stage logo detectors have been proposed that run faster than or comparably to two-stage detectors, the accuracy of one-stage logo detectors often falls below that of sophisticated two-stage object detectors. To close this gap, previous methods usually chose one of two opposite directions: starting from a fast logo-detector model and improving its accuracy by modifying the backbone model, or starting from a high-accuracy object-detector model and reducing its operating time. This study starts from a fast one-stage logo-detection model and improves both its accuracy and its operating time by proposing a simplified ResNet model.

II. RELATED METHOD

ResNet was first introduced by He et al. in 2015 [26]. The model was improved in [27] by adding a rule on each layer of the block. In the proposed ResNet model, the first two layers are similar to those of the GoogLeNet [28] model, namely: a 7 x 7 convolutional layer with 64 output channels and stride

Authorized licensed use limited to: Carleton University. Downloaded on July 28,2020 at 09:13:22 UTC from IEEE Xplore. Restrictions apply.
size 2, followed by a 3 x 3 maximum-pooling layer with stride 2. In contrast to GoogLeNet, a batch-normalization layer is added after each convolutional layer in the ResNet model. Several variants of the ResNet model are ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152, and ResNet-200. Despite its high performance on the image-classification task, one of ResNet's disadvantages is its deep structure, which causes long training times. Therefore, the objective of this study is to propose a model called the simplified ResNet, which is a simplified version of the original ResNet model. The main objective of this study, therefore, is to develop a ResNet model variant with fewer parameters, to reduce training and detection time, while retaining acceptable detection accuracy.
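As a quick sanity check on the stem just described, its layer arithmetic can be verified in a few lines of Python. The 224 x 224 input size and the padding values are illustrative assumptions (standard for ImageNet-style ResNets), not figures taken from this paper:

```python
import math

def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return math.floor((size + 2 * padding - kernel) / stride) + 1

# Stem as described: 7x7 conv, 64 channels, stride 2,
# then 3x3 max pooling with stride 2.
h = conv_out(224, kernel=7, stride=2, padding=3)   # 7x7 conv  -> 112
h = conv_out(h, kernel=3, stride=2, padding=1)     # 3x3 pool  -> 56

# Stem parameter count: the conv carries no bias because a
# batch-normalization layer (2 * 64 learnable parameters) follows it.
conv_params = 64 * 3 * 7 * 7        # out_channels * in_channels * kH * kW
bn_params = 2 * 64                  # scale (gamma) and shift (beta)
print(h, conv_params + bn_params)   # 56 9536
```

Halving the resolution twice in the stem is what keeps the cost of the deeper residual blocks manageable.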
Table 1. Limitations of ResNet models for a one-stage detector

No  Model       Problem
1   ResNet-18   Very low accuracy, below 0.5
2   ResNet-34   Very low accuracy, below 0.5
3   ResNet-50   Long training time; requires considerable memory and parameters
4   ResNet-101  Long training time; requires considerable memory and parameters
5   ResNet-152  Long training time; requires considerable memory and parameters

This paper proposes a simplified ResNet for use in a one-stage detector. The researchers propose four simplified ResNet models with different block combinations, each producing a different number of parameters. The main objective of this study is to improve the accuracy and speed of the models; the details can be seen in Figure 1 and Figure 2.
Figure 1. Proposed Model 1 and Model 2
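The parameter savings from dropping block repetitions can be made concrete with a small sketch. The channel widths below are read from the top stage of Figure 1, and the count ignores batch-normalization and shortcut-projection weights, so it is an approximation rather than a figure from the paper:

```python
def bottleneck_params(c_in, c_mid, c_out):
    """Weights of a 1x1 -> 3x3 -> 1x1 bottleneck block (biases and BN ignored)."""
    return (c_in * c_mid * 1 * 1      # 1x1 reduction
            + c_mid * c_mid * 3 * 3   # 3x3 convolution
            + c_mid * c_out * 1 * 1)  # 1x1 expansion

# One block with the channel widths shown in the top stage of Figure 1
# (1x1 down to 512, 3x3 at 512, 1x1 up to 1024).
one_block = bottleneck_params(1024, 512, 1024)
print(one_block)  # 3407872
```

One such block accounts for roughly 3.4 M weights, comparable in magnitude to the 3,348,480-parameter gap between Model 1 (25,740,705) and Model 2 (22,392,225) in Table 2.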

Figure 2. Proposed Model 3 and Model 4

Following Lin et al. [8], the proposed logo-detection framework in this study is a single unified network comprising three main components, namely:

1) Backbone: computes a feature map from the given input images. In this study, seven backbone candidates were explored, namely the four proposed models (Model 1, Model 2, Model 3, and Model 4) together with ResNet-18, ResNet-34, and ResNet-50. The ResNet models were chosen due to their high accuracy on the ImageNet challenge benchmark. This study trained each backbone candidate on the given logo dataset rather than using an off-the-shelf model as reported by Lin et al. [8]. It is expected that this training produces models that learn only the logos of the classes of interest.

2) Subnet-1: computes the convolutional logo classification.

3) Subnet-2: computes the convolutional bounding-box regression.

Similar to the framework proposed by Lin et al. [8], the objective of model training is to find the model parameters that optimize the focal loss (see Eq. 1) as the objective function. The starting point is the binary cross-entropy loss:

    CE(p, y) = { −log(p)        if y = 1
               { −log(1 − p)    otherwise        (1)

In the equation above, y ∈ {±1} represents the ground-truth logo class, and p ∈ [0, 1] is the predicted probability of the class y = 1. For notational convenience, p_t denotes the probability assigned to the ground-truth class:

    p_t = { p          if y = 1
          { 1 − p      otherwise                 (2)

so that the cross-entropy loss can be written compactly as:

    CE(p, y) = CE(p_t) = −log(p_t)               (3)

Lin et al. [8] improved on the cross-entropy loss to overcome the imbalance between the foreground (logo) and background classes during training by adding the modulating factor (1 − p_t)^γ, so that the equation above becomes the focal loss:

    FL(p_t) = −(1 − p_t)^γ log(p_t)              (4)

Following Lin et al. [8], the cross-entropy CE(p, y) is replaced by the focal loss FL(p_t) with a value of γ = 2.
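As a minimal scalar sketch of Eqs. (1)–(4) (illustrative only; in a real detector the loss is summed over all anchor predictions):

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Focal loss for binary classification, Eq. (4):
    FL(p_t) = -(1 - p_t)^gamma * log(p_t), with p_t as in Eq. (2)."""
    p_t = p if y == 1 else 1.0 - p            # Eq. (2)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# With gamma = 0 the focal loss reduces to the cross entropy of Eq. (3);
# with gamma = 2 (the value used here), easy, well-classified examples
# (p_t close to 1) are down-weighted, so hard examples dominate training.
ce = focal_loss(0.9, 1, gamma=0.0)   # plain cross entropy
fl = focal_loss(0.9, 1, gamma=2.0)   # down-weighted by (1 - 0.9)^2 = 0.01
```

The down-weighting is what lets a one-stage detector train over a dense anchor grid dominated by easy background locations.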
III. EXPERIMENT AND RESULT

3.1 Training and evaluation

This research used a dataset obtained from ROMYNY Logo 2016 [29], which consists of 20 classes. Interestingly, although each model converged toward its optimum parameter values during the training process, the average classification loss and average regression loss of Model 2 outperformed those of ResNet-18 and ResNet-34, and Model 2 trained faster than the other models (see Figure 4).

The authors conducted the experiments and summarized the training results, considering several factors such as the number of parameters, training time, and number of blocks. The authors also analyzed the influence of these factors on accuracy, regression loss, and classification loss. The measurements used were mean average precision (mAP), average classification loss, and average regression loss (see Table 2).
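The mAP measurement matches predicted boxes to ground-truth boxes by Intersection over Union (IoU). A minimal sketch of that computation, using an illustrative corner-coordinate box convention not specified in the paper:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted logo box counts as a true positive for mAP only when its
# IoU with a ground-truth box reaches the chosen threshold (0.5 here).
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / (100 + 100 - 50) ≈ 0.333
```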

Figure 3. Framework for logo detection
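The framework of Figure 3 (a backbone feeding a classification subnet and a box-regression subnet) can be sketched structurally as follows. The stride of 32, the 9 anchors per location, and the function names are illustrative assumptions (the anchor count follows the RetinaNet convention); only the 20-class count comes from the dataset used here:

```python
def backbone(image_hw, stride=32):
    """Stub backbone: maps an input image size to a feature-map grid size."""
    h, w = image_hw
    return (h // stride, w // stride)

def subnet_outputs(feat_hw, num_anchors=9, num_classes=20):
    """Output shapes of the two heads attached to the feature map."""
    h, w = feat_hw
    cls_shape = (h, w, num_anchors * num_classes)  # Subnet-1: classification
    box_shape = (h, w, num_anchors * 4)            # Subnet-2: box regression
    return cls_shape, box_shape

feat = backbone((512, 512))                  # -> (16, 16)
cls_shape, box_shape = subnet_outputs(feat)  # -> (16, 16, 180), (16, 16, 36)
```

For a 512 x 512 input this yields a 16 x 16 grid with 180 class scores and 36 box offsets per location; both subnets read the same backbone feature map, which is why a smaller backbone speeds up the whole detector.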

Figure 4. Classification and regression loss curves of each model

Table 2. Accuracy comparison of the simplified ResNet models and the original ResNet (mAP)

No  Model      mAP             Avg. Regression  Avg. Classification  Training Time  Number of
                               Loss             Loss                 (hours)        Parameters
1   Model 1    0.382 ± 0.1033  0.09             0.03                 1.43           25,740,705
2   Model 2    0.408 ± 0.1050  0.08             0.01                 1.41           22,392,225
3   Model 3    0.364 ± 0.1033  0.12             0.10                 1.40           21,271,969
4   Model 4    0.521 ± 0.0912  0.11             0.07                 1.91           22,392,225
5   ResNet-18  0.065 ± 0.0331  0.47             0.45                 1.40           20,200,097
6   ResNet-34  0.057 ± 0.0357  0.49             0.49                 1.48           30,315,681
7   ResNet-50  0.382 ± 0.0921  0.08             0.01                 1.91           36,797,857

From Table 2, Model 1 produced an accuracy of 0.382 mAP with a training time of 1.43 hours, while the measured regression loss and classification loss were 0.09 and 0.03, respectively. Compared with ResNet-50, Model 1 achieved a better training time, shorter by 0.48 hours, while the accuracy dropped from 0.556 mAP to 0.382 mAP, a difference of 0.174 mAP. Model 2 was then analyzed. This model was obtained by reducing the blocks of Model 1; its strength is that it has only 22,392,225 parameters yet produces a good accuracy of 0.408 mAP, a classification loss of 0.01, a regression loss of 0.08, and a required training time of 1.41 hours. This result is certainly better than the results of Model 1, ResNet-18, and ResNet-34, and not much different from the result of ResNet-50. From the series of experiments conducted, the authors concluded that the best model proposed in this study is Model 2.
3.2 Logo testing

At this stage, testing was carried out using the same seven models as in the training phase: Model 1, Model 2, Model 3, Model 4, ResNet-18, ResNet-34, and ResNet-50. The testing phase used mAP (mean average precision), with an IoU (Intersection over Union) threshold of 0.5, indicating that detected logos were accepted when their overlap accuracy ranged from 0.5 to 1. The results of the testing process can be seen in Figure 5 and Table 3.

Figure 5. Some samples of testing images using: (a) Model 1 (p = 1.00, 25.01 s), (b) Model 2 (0.98, 23.47), (c) Model 3 (0.95, 24.62), (d) Model 4 (0.95, 24.62), (e) ResNet-18 (error), (f) ResNet-34 (error), and (g) ResNet-50 (1.00, 27.58)

Table 3. Summary of average testing performance (mAP)

No  Model      Probability      Average time
1   Model 1    0.992 ± 0.0056   38.75 seconds
2   Model 2    0.954 ± 0.1274   36.45 seconds
3   Model 3    0.882 ± 0.0717   38.08 seconds
4   Model 4    0.998 ± 0.0028   40.94 seconds
5   ResNet-18  0.100 ± 0.1406   (error)
6   ResNet-34  (error)          (error)
7   ResNet-50  0.996 ± 0.056    41.18 seconds

IV. CONCLUSION

This paper presents a novel one-stage logo-detector framework in which the backbone of the detector is a simple, focused network; this differs from previously proposed frameworks, in which the backbone is an off-the-shelf model. The proposed framework uses a backbone that was trained in a supervised manner with gradient-descent training algorithms. The experimental results showed that Model 2 as the logo-detector backbone achieved an accuracy of 0.408 mAP, outperforming ResNet-18, which obtained an accuracy of 0.065 mAP, and ResNet-34, which obtained an accuracy of 0.057 mAP. However, the accuracy of Model 2 was still lower than that of ResNet-50, which was 0.556 mAP. Meanwhile, the training time required by Model 2 was 1.41 hours, 0.50 hours faster than the 1.91 hours required by ResNet-50. The testing results are shown in Table 3: Model 2 obtained an average result of 0.954 with a testing time of 36.45 seconds, while ResNet-50 obtained 0.996 with a testing time of 41.18 seconds. Model 2 was thus 4.73 seconds faster than ResNet-50, and this result was also better than the results of ResNet-18 and ResNet-34.
REFERENCES

[1] Y. Liao, X. Lu, C. Zhang, Y. Wang, and Z. Tang, “Mutual enhancement for detection of multiple logos in sports videos,” Proc. IEEE Int. Conf. Comput. Vis., pp. 4856–4865, 2017.

[2] G. Oliveira, X. Frazão, A. Pimentel, and B. Ribeiro, “Automatic graphic logo detection via Fast Region-based Convolutional Networks,” Proc. Int. Jt. Conf. Neural Networks, pp. 985–991, 2016.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 580–587, 2014.
[5] R. Girshick, “Fast R-CNN,” arXiv.org e-Print Arch., 2015.
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” NIPS, pp. 1–10, 2015.
[7] K. He and R. Girshick, “Mask R-CNN,” 2017.
[8] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” Proc. IEEE Int. Conf. Comput. Vis., pp. 2980–2988, 2017.
[9] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 886–893, 2005.
[10] O. H. Jafari and M. Y. Yang, “Real-time RGB-D based template matching pedestrian detection,” 2016.
[11] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” arXiv Prepr., pp. 1–15, 2013.
[12] Y. Zhang and D. Wang, “Logo detection and recognition based on classification of the characteristics of logos,” pp. 805–806, 2014.
[13] J. Glagolevs and K. Freivalds, “Logo detection in images using HOG and SIFT,” 2017 5th IEEE Work. Adv. Information, Electron. Electr. Eng., pp. 1–5, 2017.
[14] C. Wan, Z. Zhao, X. Guo, and A. Cai, “Tree-based shape descriptor for scalable logo detection,” 2013 IEEE Int. Conf. Vis. Commun. Image Process. (VCIP), 2013.
[15] S. Y. Arafat, S. A. Husain, I. A. Niaz, and M. Saleem, “Logo detection and recognition in video stream,” IEEE, pp. 163–168, 2010.
[16] T. A. Pham, M. Delalandre, and S. Barrat, “A contour-based method for logo detection,” 2011.
[17] J. Revaud, M. Douze, and C. Schmid, “Correlation-based burstiness for logo retrieval,” p. 965, 2012.
[18] W. Liu et al., “SSD: Single Shot MultiBox Detector,” arXiv, pp. 1–15, 2016.
[19] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 779–788, 2016.
[20] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 580–587, 2014.
[21] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” arXiv, 2017.
[22] Y. Bao, H. Li, X. Fan, R. Liu, and Q. Jia, “Region-based CNN for logo detection,” Proc. Int. Conf. Internet Multimed. Comput. Serv. (ICIMCS’16), pp. 319–322, 2016.
[23] Y. Zhang et al., “Deep learning for logo recognition,” Int. Conf. Intell. Syst. Des. Appl. (ISDA), vol. 245, no. 36, pp. 2051–2054, 2017.
[24] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” NIPS, 2016.
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv.org e-Print Arch., 2015.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” Lect. Notes Comput. Sci., vol. 9908 LNCS, pp. 630–645, 2016.
[28] C. Szegedy et al., “Going deeper with convolutions,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 1–9, 2015.
[29] F. Fotso, “Application of py_faster_rcnn in logo detection task: ZF & VGG16,” https://github.com/franckfotso/faster_rcnn_logo/blob/master/README.md, 2017.

