Karel Paleček
Department name of organization
Technical University of Liberec
Liberec, Czech Republic
Email: karel.palecek@tul.cz
Abstract: We present a deep learning system for automatic logo detection in real world images. We base our detector on the popular framework of Faster R-CNN and compare its performance to other models such as Mask R-CNN or RetinaNet. We perform a detailed empirical analysis of various design and architecture choices and show how these can have a much higher influence than algorithmic tweaks or popular techniques such as data augmentation. We also provide a systematic detection performance comparison of various models on multiple popular datasets including FlickrLogos-32, TopLogo-10 and the recently introduced QMUL-OpenLogo benchmark, which allows for a direct comparison between recently proposed extensions. By careful optimization of the training procedure we were able to achieve significant improvements of the state of the art on all mentioned datasets. We apply our observations to build a detector to detect logos of the Red Bull brand in online media and images.

Keywords: Faster R-CNN, FlickrLogos-32, Logo detection, Mask R-CNN, QMUL-OpenLogo, RetinaNet, TopLogo-10

I. INTRODUCTION

Automatic logo detection and recognition is a challenging computer vision task with many applications. One of the typical purposes of such systems is measurement of brand visibility and prominence in online media or television broadcasts. The task can be formulated as a special case of generic object detection, albeit with some significant differences. Most notably, logo instances usually occupy only a small portion of the image, often being captured by accident, for example as a sponsor logo on an athlete's sportswear or a driver's vehicle. There is also a different kind of visual variability, for example when the logo's color design changes depending on the target background or when a pictorially designed logo appears as a three dimensional sculpture.

Recently, the field of object detection became dominated by methods based on deep convolutional networks [1]. There are two main categories of detectors: single-stage and two-stage detectors. In the former type, the image is passed through the network in a single pass, and the network outputs bounding box and class predictions for every position on a grid of predefined resolution. In order to capture objects of different sizes and aspect ratios, several bounding boxes are anchored to every point on the grid and each of them is classified independently. Also, instead of predicting absolute bounding box coordinates, the network only learns to correct these predefined anchor boxes. One of the most successful single shot detectors has proved to be RetinaNet [2], also evaluated in this study. In contrast, the two-stage detectors first prune the search space using a region proposal network (RPN) and only output the predictions for selected regions of interest. In comparison to single-stage detectors, they basically only differ by adding this intermediate RPN stage. The most well known examples of this category are Faster R-CNN and Mask R-CNN, both of which allow training in an end-to-end fashion. The latter also outputs pixel-level segmentation of object instances in addition to ordinary bounding boxes.

As has been shown [3], compared to two-stage detectors, the single-stage detectors are optimized for efficiency at the expense of accuracy, and vice versa. However, both classes of detectors are easily tunable and the trade-off can be varied. One source of errors, which is cleverly mitigated through the RPN stage in two-stage detectors, is the class imbalance between background samples and objects of interest. This issue has been tackled by [2], where Lin et al. proposed a weighting scheme that modifies the cross entropy loss, such that the easily classified regions contribute less to the overall loss value. For a detailed analysis and empirical comparison of several algorithms and network designs see e. g. [3].

II. DEEP LEARNING FOR LOGO DETECTION

One of the first applications of convolutional networks for automatic logo detection was [4]. Hoi et al. utilized the Fast R-CNN framework with external region proposal based on selective search (SS). In order to facilitate their research, they collected a large dataset called LOGO-net with over 130 thousand object annotations. However, the dataset is not public and the results are therefore not comparable.

Another example of automatic logo detection using deep learning is the work of Eggert et al. [5]. Similarly to Hoi et al., Eggert et al. also used an R-CNN detector with external SS region proposal. However, instead of collecting large datasets they heavily relied on artificial data augmentation. By randomly transforming images (e. g. perspective changes, color shifting and blurring), they were able to increase training set variability and achieved 78.1 % of mean average precision (mAP) on the popular FlickrLogos-32 dataset.

Even with techniques such as transfer learning, convolutional networks often require sufficiently sized datasets with enough variability for training. Since manual annotation can

This work was supported by the Technology Agency of the Czech Republic (Project No. TH03010018)
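
The weighting scheme of Lin et al. [2] mentioned in the Introduction can be illustrated with a minimal sketch. It assumes the binary (object vs. background) case for a single region and shows only the modulating factor applied to the cross entropy. The function names and the value gamma = 2.0 are illustrative assumptions and are not taken from this paper.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def cross_entropy(p: float, y: int) -> float:
    """Binary cross entropy for one region.

    p is the predicted probability that the region contains an object,
    y is the ground truth label (1 = object, 0 = background).
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(max(p_t, EPS))


def focal_loss(p: float, y: int, gamma: float = 2.0) -> float:
    """Cross entropy scaled by (1 - p_t)^gamma.

    Regions that are already classified well (p_t close to 1) are
    down-weighted, so they contribute less to the overall loss.
    gamma = 0 recovers the plain cross entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    return (1.0 - p_t) ** gamma * cross_entropy(p, y)


# An easily classified background region versus a poorly classified object.
easy_background = focal_loss(0.05, 0)
hard_object = focal_loss(0.30, 1)
```

With such a factor, the many easy background regions are strongly suppressed relative to the few hard examples, which is the effect described above for the class imbalance problem.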
TABLE II. OPTIMIZATION OF RESNET-50 ON FLICKRLOGOS-32
TABLE IV. BEST ACHIEVED RESULTS ON VARIOUS DATASETS

dataset          CNN     img. size   VOC mAP   VOC mR   COCO mAP   COCO mR
FlickrLogos-32   R-50    960         89.5      91.6     65.8       70.1
Logos-32plus     X-101   800         93.5      96.1     66.9       72.6
TopLogo-10       X-101   960         77.8      83.7     48.0       55.6
QMUL-OpenLogo    X-101   1200        76.7      88.3     50.8       61.7
Newton-Redbull   X-101   800         82.1      90.1     53.8       61.5

TABLE V. COMPARISON WITH STATE OF THE ART

dataset          metric      state of the art   ours
FlickrLogos-32   P@R (IR)    [12] 97.6@67.6     97.9@93.6
FlickrLogos-32   mAP (VOC)   [7] 85.4           89.5
Logos-32plus     P@R (IR)    [12] 98.9@90.6     98.2@96.5
TopLogo-10       mAP (VOC)   [6] 41.8           77.8
QMUL-OpenLogo    mAP (VOC)   [13] 48.3          76.7

V. CONCLUSION

We have performed an empirical analysis of several methods for automatic logo detection. We have shown how specific design choices of optimization algorithm, batch size and learning rate scheduling can have a strong impact on the final detection performance in terms of mean average precision. Also, an empirical evaluation of three main types of detectors with four different backbones was provided. This allows for a direct comparison between various proposed models. In these experiments, we observed that Faster R-CNN generally performs better than Mask R-CNN and single shot detectors represented by RetinaNet. We applied the observations to improve the score of our initial detector targeted at detection of the Red Bull logo from about 45 % to over 82 % of mAP (VOC). Possible enhancements include various other techniques such as cyclical learning rates or extension of the anchor set. Our future work will focus on logo detection in videos.
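
Cyclical learning rates are named above only as a possible enhancement and were not evaluated here. As a rough sketch of what such a schedule could look like, the following assumes the common triangular variant, in which the learning rate rises linearly from a lower to an upper bound and falls back again. The function name and all numeric values (base_lr, max_lr, step_size) are hypothetical and are not settings used in this work.

```python
import math


def triangular_lr(iteration: int,
                  base_lr: float = 0.001,
                  max_lr: float = 0.01,
                  step_size: int = 2000) -> float:
    """Triangular cyclical learning rate.

    One full cycle spans 2 * step_size iterations: the rate climbs
    linearly from base_lr to max_lr during the first half and descends
    back to base_lr during the second half.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Learning rates at the start, quarter, peak and end of the first cycle.
schedule = [triangular_lr(i) for i in (0, 1000, 2000, 4000)]
```

In a training loop, the returned value would be assigned to the optimizer before each iteration; whether this improves mAP on the logo datasets above is an open question.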