
Deep Learning for Logo Detection

Karel Paleček
Technical University of Liberec
Liberec, Czech Republic
Email: karel.palecek@tul.cz

This work was supported by the Technology Agency of the Czech Republic (Project No. TH03010018).

Abstract—We present a deep learning system for automatic logo detection in real world images. We base our detector on the popular framework of Faster R-CNN and compare its performance to other models such as Mask R-CNN or RetinaNet. We perform a detailed empirical analysis of various design and architecture choices and show how these can have a much higher influence than algorithmic tweaks or popular techniques such as data augmentation. We also provide a systematic detection performance comparison of various models on multiple popular datasets including FlickrLogos-32, TopLogo-10 and the recently introduced QMUL-OpenLogo benchmark, which allows for a direct comparison between recently proposed extensions. By careful optimization of the training procedure we were able to achieve significant improvements over the state of the art on all mentioned datasets. We apply our observations to build a detector for logos of the Red Bull brand in online media and images.

Keywords—Faster R-CNN, FlickrLogos-32, Logo detection, Mask R-CNN, QMUL-OpenLogo, RetinaNet, TopLogo-10

I. INTRODUCTION

Automatic logo detection and recognition is a challenging computer vision task with many applications. One of the typical purposes of such systems is measurement of brand visibility and prominence in online media or television broadcasts. The task can be formulated as a special case of generic object detection, albeit with some significant differences. Most notably, logo instances usually occupy only a small portion of the image, often being captured by accident, for example as a sponsor logo on an athlete's sportswear or a driver's vehicle. There is also a different kind of visual variability, for example when the logo's color design changes depending on the target background or when a pictorially designed logo appears as a three dimensional sculpture.

Recently, the field of object detection became dominated by methods based on deep convolutional networks [1]. There are two main categories of detectors: single-stage and two-stage detectors. In the former type, the image is passed through the network in a single pass, which outputs bounding box and class predictions for every position on a grid of predefined resolution. In order to capture objects of different sizes and aspect ratios, several bounding boxes are anchored to every point on the grid and each of them is classified independently. Also, instead of predicting absolute bounding box coordinates, the network only learns to correct these predefined anchor boxes. One of the most successful single-shot detectors proved to be RetinaNet [2], which is also evaluated in this study. In contrast, the two-stage detectors first prune the search space using a region proposal network (RPN) and only output predictions for selected regions of interest. Compared to single-stage detectors, they basically differ only by this intermediate RPN stage. The most well known examples of this category are Faster R-CNN and Mask R-CNN, both of which allow training in an end-to-end fashion. The latter also outputs pixel-level segmentation of object instances in addition to ordinary bounding boxes.

As has been shown [3], single-stage detectors are optimized for efficiency at the expense of accuracy compared to two-stage detectors, and vice versa. However, both classes of detectors are easily tunable and the trade-off can be varied. One source of errors, which is cleverly mitigated through the RPN stage in two-stage detectors, is the class imbalance between background samples and objects of interest. This issue has been tackled in [2], where Lin et al. proposed a weighting scheme that modifies the cross entropy loss such that easily classified regions contribute less to the overall loss value. For a detailed analysis and empirical comparison of several algorithms and network designs see e.g. [3].

II. DEEP LEARNING FOR LOGO DETECTION

One of the first applications of convolutional networks for automatic logo detection was [4]. Hoi et al. utilized the Fast R-CNN framework with external region proposals based on selective search (SS). In order to facilitate their research, they collected a large dataset called LOGO-net with over 130 thousand object annotations. However, the dataset is not public and the results are therefore not comparable.

Another example of automatic logo detection using deep learning is the work of Eggert et al. [5]. Similarly to Hoi et al., Eggert et al. also used an R-CNN detector with external SS region proposals. However, instead of collecting large datasets they heavily relied on artificial data augmentation. By randomly transforming images (e.g. perspective changes, color shifting and blurring), they were able to increase training set variability and achieved 78.1 % mean average precision (mAP) on the popular FlickrLogos-32 dataset.

Even with techniques such as transfer learning, convolutional networks often require sufficiently sized datasets with enough variability for training.
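The anchor-correction scheme mentioned in the introduction (the network predicts offsets relative to predefined anchor boxes rather than absolute coordinates) can be sketched as follows. This is not the paper's code; it is a minimal NumPy illustration of the box parameterization commonly used by Faster R-CNN-style detectors, with all names chosen for this example.

```python
import numpy as np

def decode_boxes(anchors, deltas):
    """Apply predicted offsets (tx, ty, tw, th) to anchor boxes.

    anchors: (N, 4) array of (cx, cy, w, h) anchor boxes
    deltas:  (N, 4) array of predicted corrections
    Returns an (N, 4) array of decoded boxes in the same (cx, cy, w, h) format.
    """
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]  # shift center by tx * anchor width
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]  # shift center by ty * anchor height
    w = anchors[:, 2] * np.exp(deltas[:, 2])           # scale width by exp(tw)
    h = anchors[:, 3] * np.exp(deltas[:, 3])           # scale height by exp(th)
    return np.stack([cx, cy, w, h], axis=1)

# A zero correction leaves the anchor unchanged.
anchor = np.array([[50.0, 50.0, 32.0, 32.0]])
print(decode_boxes(anchor, np.zeros((1, 4))))  # → [[50. 50. 32. 32.]]
```

Because the network only has to output small corrections, the regression target stays well-conditioned regardless of where in the image the object appears.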

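The re-weighted cross entropy of Lin et al. [2], which down-weights easily classified regions to counter the background/object class imbalance, is the focal loss. A minimal scalar sketch (illustrative only; the defaults alpha=0.25, gamma=2 are those reported in [2]):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted foreground probability p
    and ground truth label y in {0, 1}. Easy examples (p_t close to 1)
    are down-weighted by the (1 - p_t)**gamma modulating factor."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

easy = focal_loss(0.01, 0)  # confidently correct background sample
hard = focal_loss(0.90, 0)  # misclassified background sample
print(easy < 1e-4 < hard)   # → True
```

The abundant, well-classified background anchors thus contribute almost nothing to the total loss, so the rare object anchors dominate the gradient even without an RPN pruning stage.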
978-1-7281-1864-2/19/$31.00 ©2019 IEEE 609 TSP 2019
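Random image transformations of the kind used by Eggert et al. [5] (perspective changes, color shifting, blurring) can be sketched as below. This is an illustrative NumPy fragment, not the cited authors' pipeline; only a flip and a per-channel color shift are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Toy augmentation: horizontal flip plus a random per-channel
    color shift, clipped back to the valid [0, 255] range.
    (The cited works additionally use perspective warps and blurring.)"""
    out = img[:, ::-1, :].astype(float)           # horizontal flip
    shift = rng.uniform(-10, 10, size=(1, 1, 3))  # per-channel color shift
    return np.clip(out + shift, 0, 255)

img = rng.uniform(0, 255, size=(64, 64, 3))
aug = augment(img)
print(aug.shape)  # → (64, 64, 3)
```

Each transformed copy keeps its original bounding box annotation (flips mirror the box coordinates), so the training set grows without any extra labeling cost.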
Since manual annotation can be a costly process, artificial data augmentation is a prevailing theme in the logo detection literature. For example, in [6], [7] the authors segmented logo instances and placed them on real world images as a way to create additional artificial context. The data augmentation approach was taken to the extreme by Montserrat et al. [8], who trained a Faster R-CNN detector almost entirely on automatically collected synthetic images.

One common problem with logos is also their small size. Eggert et al. [9] therefore proposed to extend the Faster R-CNN anchor set to a more fine-grained resolution. They were able to improve the baseline mAP of 51 % to 67.1 %. However, a drawback of this solution is higher memory requirements and slower training time. Another way to target issues with the small size of objects was introduced in [10]. Lin et al. proposed the feature pyramid network (FPN) as a way to better exploit different levels of convolutional backbone feature maps in detectors such as Faster R-CNN. It adds a parallel feature map to each level of the backbone, but the data flow is reversed and goes from the top coarse-resolution maps to the bottom fine-grained ones. The anchor boxes and corresponding RPN are connected to every level of the FPN, one level for each considered object scale.

In this paper, we show that by careful optimization of the training procedure, we can obtain higher detection performance than with data augmentation and algorithmic and design changes in the detectors themselves. Specifically, we identify that by tuning the learning rate and batch size and utilizing FPN, one can obtain state of the art results even on existing datasets without the need for data expansion via heavy augmentation or generation of synthetic images. We build the detector for our target application of Red Bull logo recognition in online and printed media.

III. DATA

In order to allow a direct comparison of our results against previous studies, we employ the four most commonly appearing publicly available datasets. They are listed in Tab. I along with their class and image statistics. The numbers of images in parentheses correspond to the sizes of the training and testing subsets, respectively.

TABLE I. OVERVIEW OF DATASETS

dataset          # classes   # instances   # images
FlickrLogos-32          32          3405   2240 (1280:960)
Logos-32plus            32         12302   7830 (6870:960)
TopLogo-10              10           860   698 (400:298)
QMUL-OpenLogo          352         50507   27083 (18958:8125)
Newton-Redbull           1           395   237 (191:46)

FlickrLogos32-v2 [11] is by far the most widely used dataset in the literature of automatic logo detection and recognition. It comprises 32 logo brands, each having 70 images, resulting in 3405 total objects. The dataset was designed for the development of methods based on keypoint matching and is somewhat limited compared to real world scenarios. Each image displays examples from only a single class, and logo instances are usually captured at relatively high resolution and without much visual degradation. Since for each class the data is split into 10:30:30 images for training, validation and testing sets respectively, it has become common practice to use the trainval set of 40 images for training and the rest for testing.

Logos-32plus was introduced in [12] as an extension of FlickrLogos-32. The main goal was to increase the class variability and the size of the training set in order to better model real world challenges and to allow training detectors based on deep neural networks. The testing subset stays the same, resulting in an average per-class trainval:test ratio of 400:30.

TopLogo-10 was collected by Su et al. [6]. It comprises logos of the top 10 most popular clothing brands, each having 40 training and 30 testing images, amounting to a total of 860 instances. Despite the smaller dataset volume, the logos are often of low resolution and captured in highly variable context, making the dataset more challenging than FlickrLogos-32.

QMUL-OpenLogo was introduced recently by [13]. It is a semi-supervised composition of several publicly available logo datasets and aggregates examples from the three above mentioned ones as well as several others (BelgaLogos, FlickrLogos-27, Logos-in-the-Wild, SportsLogo, WebLogo-2M-test). Due to the sheer dataset size and appearance variability, it is arguably the most realistic logo detection benchmark to date. Although designed to address an open detection setting, where the majority of classes is not supposed to have fine grained bounding box annotations, it also provides a fully annotated closed-set data split of over 27000 images for traditional evaluation, see Tab. I.

We also use our own dataset targeted at detection of a single logo of the Red Bull brand. In order to train the detector, we manually labeled 237 images, 47 of which serve for testing. The dataset was constructed directly for our target application and contains instances of the Red Bull logo as small as 10 pixels. Due to the often low visual quality of the samples this dataset is very challenging.

IV. EXPERIMENTS

Following the object detection literature, we use mean average precision (mAP) [%] as our main performance metric. It is calculated as the mean of precisions averaged over different recall levels for each logo class. There are two main versions, however. First, PASCAL VOC calculates this metric for detections overlapping with the corresponding ground truth annotations by at least 0.5 in terms of the intersection over union (IOU) metric. Second, in its MS COCO variant, it is averaged over multiple IOU thresholds IOU ∈ {0.5, 0.55, ..., 0.90, 0.95}. The latter yields lower numbers, since it also averages over stricter IOU requirements. We also report mean recall (mR) [%], i.e. per-class recall averaged over all classes in the corresponding dataset.

All of the experiments were performed on an NVIDIA GTX 1080 Ti GPU with 12 GB of VRAM using the CUDA 10.1 framework and the PyTorch 1.0 library.
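The IOU criterion behind both mAP conventions can be illustrated with a short sketch. This is a generic NumPy fragment, not the paper's evaluation code; it shows the overlap computation and the MS COCO threshold grid {0.5, 0.55, ..., 0.95}.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # horizontal overlap
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # vertical overlap
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# PASCAL VOC counts a detection as correct when IOU >= 0.5; the MS COCO
# mAP additionally averages results over this grid of stricter thresholds.
coco_thresholds = np.arange(0.5, 1.0, 0.05)  # 0.5, 0.55, ..., 0.95

det, gt = (0, 0, 10, 10), (5, 0, 15, 10)  # half-overlapping boxes
print(iou(det, gt))  # → 0.3333333333333333
```

The half-overlapping pair above passes none of the thresholds, which is why the COCO variant systematically reports lower numbers than VOC on the same detections.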

The code along with experiment configuration files will be published online.

A. Optimization of the detector

In the first set of experiments, we show how optimization choices and designs influence the final detection performance in terms of mAP. Table II summarizes the results. We start with a Faster R-CNN model with the ResNet-50 convolutional backbone and optimize it using plain stochastic gradient descent (SGD) on the FlickrLogos-32 dataset. The model is trained on 10 GPUs for 2000 iterations with a total batch size of 20 images and a learning rate of 0.02. In this setup, the entire training and testing procedure takes only about 30 minutes. The biggest single boost in performance is gained by utilizing momentum in SGD, increasing mAP by more than 7 %. On the other hand, decreasing the learning rate during training (learning rate scheduling) does not seem to have a strong effect on mAP. We tried random search over optimal learning rate drops after different amounts of iterations, but the effect was modest. Another slight gain in performance was observed by adding the commonly used horizontal flipping data augmentation. Also, throughout the experiments, the feature pyramid network (FPN) consistently performed better than exploiting only a single feature map of the convolutional backbone. The best score was achieved by training on a single GPU, increasing the total number of iterations and lowering the learning rate and batch size accordingly. For Faster R-CNN with ResNet-50, the optimal settings were 20000 iterations, a batch size of 2 and a learning rate of 0.005 that was divided by 10 after 12000 and 16000 iterations.

TABLE II. OPTIMIZATION OF RESNET-50 ON FLICKRLOGOS-32

model                                  mAP (VOC)
baseline SGD                                77.1
+ momentum                                  84.6
+ scheduling                                84.7
+ hor. flipping                             85.7
+ FPN                                       87.0
+ tuned batch size and learning rate        88.3

B. Model selection and backbone

After optimizing the training procedure, we evaluate the influence of the type of model and the convolutional backbone used to extract feature maps. We test three main types of models: Faster R-CNN (FRCNN), Mask R-CNN (MRCNN) and RetinaNet (RN), and four different backbones: VGG-16, ResNet-50 (R-50), ResNet-101 (R-101), and ResNeXt-101 (X-101). For a description and evaluation of various convolutional network architectures in ImageNet classification see e.g. [14]. Our detection results are summarized in Tab. III. One can observe that for all detector models, the heaviest X-101 backbone consistently achieves the highest mAP scores. However, in the case of the region-based methods (Faster and Mask R-CNN), the difference is not very significant. The only significant improvement in terms of mAP was observed for RetinaNet, whose performance otherwise lagged behind the other two detectors. However, its lower precision numbers are caused by a higher false positive ratio. RetinaNet-based detectors therefore seem to be suitable especially for recall-oriented applications.

TABLE III. ARCHITECTURES AND BACKBONES ON FLICKRLOGOS-32

model            mAP (VOC)   mR (VOC)   training time
FRCNN / VGG-16        87.9       90.6       4.5 hours
FRCNN / R-50          88.3       90.5         2 hours
FRCNN / R-101         88.1       90.1       2.8 hours
FRCNN / X-101         88.8       91.4         7 hours
MRCNN / R-50          87.9       90.0         2 hours
MRCNN / R-101         88.1       90.3      3.25 hours
MRCNN / X-101         88.1       90.5      7.25 hours
RN / R-50             80.4       90.3      1.75 hours
RN / X-101            87.6       91.9        15 hours

C. Image size

Before feeding an image into the detector, it is first resized such that the smaller of its width and height is not less than 800 pixels and the larger side does not exceed 1333 pixels. The only other data preprocessing we employ is subtraction of the mean RGB pixel as calculated on the ImageNet database, with values [102.9801, 115.9465, 122.7717]. By examining the results we found that one of the main sources of errors are false negatives caused by small objects. One way of tackling this issue is to augment the anchor set as per [9]. In our case, simply resizing the image often improved the mAP by an additional percent or two. An example of how mAP depends on image size for the R-50 and X-101 backbones is shown in Fig. 1. The minimum image sizes were set to 320, 600, 800 (default), 960, and 1200 px. In most experiments, the optimal image size was 960 px, but it varied depending on the model and the distribution of logo sizes as captured in the corresponding datasets. One drawback of this approach is the increased memory and time requirements, but this is also a problem with the extended anchor set.

[Fig. 1. Mean average precision versus image size on FlickrLogos-32.]
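The resize rule described above (shorter side scaled to the minimum size unless the longer side would exceed the maximum) together with the mean-pixel subtraction can be sketched as follows. This is an illustrative fragment under the stated constraints, not the paper's preprocessing code; the function name and rounding are assumptions.

```python
import numpy as np

# Mean pixel values quoted in the text, as computed on ImageNet.
PIXEL_MEAN = np.array([102.9801, 115.9465, 122.7717])

def target_size(h, w, min_size=800, max_size=1333):
    """Scale so the shorter side reaches min_size, unless that would push
    the longer side past max_size, in which case the longer side is capped."""
    scale = min_size / min(h, w)
    if scale * max(h, w) > max_size:   # cap the longer side instead
        scale = max_size / max(h, w)
    return round(h * scale), round(w * scale)

print(target_size(600, 800))   # → (800, 1067)
print(target_size(500, 1500))  # → (444, 1333)

# Mean subtraction broadcasts the per-channel mean over an H x W x 3 image.
img = np.full((2, 2, 3), 128.0)
normalized = img - PIXEL_MEAN
```

Raising min_size (e.g. to 960) enlarges small logos before they reach the detector, which is why the simple resize often recovers much of what the extended anchor set of [9] provides.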
D. Results achieved on popular datasets

Based on the empirical observations above, we evaluated Faster R-CNN detectors on the other datasets as well. The results are summarized in Tab. IV. In the second and third columns we report the convolutional backbone and the minimum image size with which the result was achieved, respectively. By comparing our results to related works in Tab. V, we observe that our detector outperforms the state of the art on all datasets, often by a wide margin. Notice that instead of the object level, researchers sometimes evaluate their detectors in an image retrieval mode (IR), where precision (P) and recall (R) are calculated at the image level. In this mode, an entire image is classified as positive or negative according to whether or not it contains an instance of a particular logo class. Since the detections are not required to overlap with the ground truth, results reported in this way can be much higher. We perform such an evaluation to compare our results with [12]. Only in the case of Logos-32plus have we achieved slightly lower precision than [12], albeit at a 6 % higher recall, making the F1 measure 2.7 % higher. However, we can outperform [12] simply by increasing the minimum detection confidence, obtaining 99.0 % precision at 92.1 % recall.

TABLE IV. BEST ACHIEVED RESULTS ON VARIOUS DATASETS

                          img.     VOC           COCO
dataset          CNN      size   mAP    mR     mAP    mR
FlickrLogos-32   R-50      960   89.5   91.6   65.8   70.1
Logos-32plus     X-101     800   93.5   96.1   66.9   72.6
TopLogo-10       X-101     960   77.8   83.7   48.0   55.6
QMUL-OpenLogo    X-101    1200   76.7   88.3   50.8   61.7
Newton-Redbull   X-101     800   82.1   90.1   53.8   61.5

TABLE V. COMPARISON WITH STATE OF THE ART

dataset          metric           state of the art   ours
FlickrLogos-32   P@R (IR) [12]    97.6@67.6          97.9@93.6
                 mAP (VOC) [7]    85.4               89.5
Logos-32plus     P@R (IR) [12]    98.9@90.6          98.2@96.5
TopLogo-10       mAP (VOC) [6]    41.8               77.8
QMUL-OpenLogo    mAP (VOC) [13]   48.3               76.7

E. Discussion

We also tried several techniques to improve the detection performance, but they did not prove effective. Most notably, we were not able to achieve higher mAP by utilizing data augmentation other than horizontal flipping, e.g. random rotation, color shift or blur. Contrary to other studies [5]–[7], our mAP decreased no matter how mild or strong the augmentations had been. Since the artificial images essentially extend the original dataset, it is necessary to prolong the training procedure, otherwise the mAP drops even more. However, the addition of Logos-32plus did improve the mAP on the FlickrLogos-32 test set by an additional 4 %. It therefore seems that these simple forms of augmentation are not able to simulate the high variability in logo appearance, e.g. a completely different color composition or deformable geometric transformations.

V. CONCLUSION

We have performed an empirical analysis of several methods for automatic logo detection. We have shown how specific design choices of the optimization algorithm, batch size and learning rate scheduling can have a strong impact on the final detection performance in terms of mean average precision. Also, an empirical evaluation of three main types of detectors with four different backbones was provided. This allows for a direct comparison between various proposed models. In these experiments, we observed that Faster R-CNN generally performs better than Mask R-CNN and single-shot detectors represented by RetinaNet. We applied these observations to improve the score of our initial detector targeted at detection of the Red Bull logo from about 45 % to over 82 % mAP (VOC). Possible enhancements include various other techniques such as cyclical learning rates or extension of the anchor set. Our future work will focus on logo detection in videos.

REFERENCES

[1] L. Liu, W. Ouyang, X. Wang, P. W. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, "Deep learning for generic object detection: A survey," CoRR, vol. abs/1809.02165, 2018. [Online]. Available: http://arxiv.org/abs/1809.02165
[2] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 2999–3007.
[3] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, "Speed/accuracy trade-offs for modern convolutional object detectors," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 3296–3297.
[4] S. C. H. Hoi, X. Wu, H. Liu, Y. Wu, H. Wang, H. Xue, and Q. Wu, "Logo-net: Large-scale deep logo detection and brand recognition with deep region-based convolutional networks," CoRR, vol. abs/1511.02462, 2015.
[5] C. Eggert, A. Winschel, and R. Lienhart, "On the benefit of synthetic data for company logo detection," in Proceedings of the 23rd ACM International Conference on Multimedia, ser. MM '15. New York, NY, USA: ACM, 2015, pp. 1283–1286.
[6] H. Su, X. Zhu, and S. Gong, "Deep learning logo detection with data expansion by synthesising context," in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), March 2017, pp. 530–539.
[7] D. Mas Montserrat, Q. Lin, J. Allebach, and E. Delp, "Training object detection and recognition CNN models using data augmentation," Electronic Imaging, vol. 2017, pp. 27–36, 2017.
[8] D. Mas Montserrat, Q. Lin, J. Allebach, and E. J. Delp, "Logo detection and recognition with synthetic images," Electronic Imaging, vol. 2018, pp. 3371–3377, 2018.
[9] C. Eggert, D. Zecha, S. Brehm, and R. Lienhart, "Improving small object proposals for company logo detection," in Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, ser. ICMR '17. New York, NY, USA: ACM, 2017, pp. 167–174.
[10] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 936–944.
[11] S. Romberg, L. G. Pueyo, R. Lienhart, and R. van Zwol, "Scalable logo recognition in real-world images," in Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ser. ICMR '11. New York, NY, USA: ACM, 2011, pp. 25:1–25:8.
[12] S. Bianco, M. Buzzelli, D. Mazzini, and R. Schettini, "Deep learning for logo recognition," Neurocomputing, vol. 245, pp. 23–30, 2017.
[13] H. Su, X. Zhu, and S. Gong, "Open logo detection challenge," in British Machine Vision Conference BMVC 2018. Northumbria University, Newcastle, UK: BMVA Press, 2018, p. 16.
[14] A. Canziani, A. Paszke, and E. Culurciello, "An analysis of deep neural network models for practical applications," CoRR, vol. abs/1605.07678, 2016.

