Abstract—Automatic License Plate Recognition (ALPR) is an important task with many applications in Intelligent Transportation and Surveillance systems. As in other computer vision tasks, Deep Learning (DL) methods have recently been applied in the context of ALPR, focusing on country-specific plates, such as American, European, Chinese, Indian and Korean. However, either they are not a complete DL-ALPR pipeline, or they are commercial, utilize private datasets and lack detailed information. In this work, we propose an end-to-end DL-ALPR system for Brazilian license plates based on state-of-the-art Convolutional Neural Network architectures. Using a publicly available dataset with Brazilian plates [1], the system was able to correctly detect and recognize all seven characters of a license plate in 63.18% of the test set, and 97.39% when considering at least five correct characters (partial match). Considering the segmentation and recognition of each character individually, we are able to segment 99% of the characters and correctly recognize 93% of them.

I. INTRODUCTION

Automatic License Plate Recognition (ALPR) is an important task in Intelligent Transportation and Surveillance, with many practical and relevant applications such as automatic traffic law enforcement, detection of stolen vehicles, toll violation detection and traffic flow control. It has been studied in the past decades and still is an open problem, due to the large variability in image acquisition conditions (illumination, capture angle, distance from camera, etc.) and in license plate (LP) layouts, which vary from one country to another.

The ALPR problem can be divided into three subtasks: License Plate Detection (LPD), License Plate Segmentation (LPS) and Character Recognition (CR). These subtasks compose the common pipeline of ALPR systems found in the literature [2]–[4], and many works focus on only one or two of them [1], [5]–[7]. Furthermore, LPS and CR are highly related to Optical Character Recognition (OCR), a well-known research field in computer vision with several established solutions [8]–[11].

Even though a single CR approach could work for a variety of ALPR systems, the same is not true for LPD and LPS. An ideal LPD method should be capable of extracting cars and license plates regardless of the camera setup (static or mobile) and environmental conditions. However, many surveillance systems assume the camera to be static, performing poorly with mobile cameras. The opposite is also true: systems that consider the camera to be mobile do not take advantage of the small angle and scale variance found in static camera setups. In LPS, the layout and number of characters usually change from one country to another – and sometimes even inside the same country, as in the USA. This variation in conditions and layout motivated the creation of many small license plate databases around the world, such as Greek, Chinese, Indian, American, European and Brazilian ones. Here we used the Brazilian SSIG Database [1], which consists of 2,000 high-definition pictures of 101 different cars taken from a static camera.

With the rise of Deep Learning (DL) techniques [12]–[15], the accuracy of many pattern recognition tasks has greatly improved. In computer vision tasks, it is common to see methods that explore some kind of Convolutional Neural Network to achieve state-of-the-art performance. However, a drawback of DL techniques is that they need large amounts of data to be trained from scratch. The more data available, the deeper the network can be, allowing progressively more complex patterns to be recognized. This is particularly a problem for ALPR systems, since no large database is available (in particular for Brazilian license plates).

One of our main contributions is the introduction of a complete DL-ALPR system based on state-of-the-art regional CNNs. The system comprises two considerably fast and small YOLO-based networks, operating in a cascaded mode, with weights transferred from extensively trained networks. As far as we know, the ALPR system presented here is the first DL-based method focused on Brazilian license plates.

To overcome the lack of annotated car labels in the SSIG database, another contribution is the detection of cars in mostly frontal views without using any external dataset to aid car detection. We simply took advantage of the fully annotated LP regions present in the dataset to estimate good training regions around them, generating a full set of weakly labeled data. Since the estimated regions comprise the car headlights,
license plate and front tires, there is enough information for our Deep CNN to find more than 99% of them on the validation set. As the estimated frontal views are on average 19× larger than license plates, detecting them is easier, faster and less prone to false positives than LP detection itself.

The final important contribution is a carefully designed and trained network to perform character detection and recognition, achieving very good recall and precision rates with an average execution time of 2.2 ms on our GPU.

This work is organized as follows: Section II provides a brief description of traditional methods for ALPR in all three subtasks. The proposed system is presented in Section III, detailing the car frontal view detection approach, network architectures, and training procedures. The overall ALPR performance, state-of-the-art comparisons and independent evaluations for each subtask are shown in Section IV. Finally, Section V presents our conclusions and future work plans.

II. RELATED WORK

This section reviews a few basic concepts about deep learning and its relationship with ALPR. It also presents other computer vision and pattern recognition methods commonly used for ALPR (or its subtasks).

A. Deep Learning (DL)

In 2014, Yosinski et al. [16] studied the transferability of weights from largely trained networks to new networks. For example, the filters found by a large network trained for days on a huge dataset of millions of images can be transferred to a smaller network and refined for a specific domain. This motivated us to employ DL on ALPR using an open dataset. Furthermore, in the past two years, different regional CNN methods for fast and precise object detection were developed, such as YOLO [14] and Faster-RCNN [13]. However, to the best of our knowledge, only [2] and [17] have used these networks in the context of ALPR. In the former, the training details and datasets are not available, since it is part of a commercial product; in the latter, only LPD is performed and no detailed information is available in English (the paper is in Korean).

B. Automatic License Plate Recognition (ALPR)

Most ALPR-related methods tackle only a subset of the required tasks: LPD, which aims to detect the region that contains the plate; LPS, which focuses on segmenting individual characters within a detected plate; and CR, which recognizes each segmented character and provides the final plate string. Some methods that tackle these subtasks are reviewed next.

License Plate Detection (LPD). Detecting the car before the plate is a common strategy in the LPD pipeline, aiming to reduce the search region and the number of false positives. To perform these detections, approaches using Histogram of Oriented Gradients (HOG) as feature descriptors and Support Vector Machines (SVM) for classification are popular [5], [6], [18]. Another feature descriptor successfully used in this context is Local Binary Patterns, employed by OpenALPR [19] to detect European and USA license plates. In [2], the authors used three deep CNNs to perform LPD, LPS and CR, respectively. Their results outperform OpenALPR by a substantial margin. However, this is a commercial method, and no detailed description of the network architectures and training procedures was given. Moreover, they used large private databases to train the networks, while OpenALPR is a totally open source project. Having such databases is a considerable advantage, because DL performance is directly connected to the amount of data used.

License Plate Segmentation (LPS). Individual character segmentation is usually performed by some kind of binarization method, in order to separate the foreground from the background, followed by connected components analysis to extract the bounding boxes [20]. This is highly dependent on the chosen binarization method and on the non-uniformity of the plate (e.g. due to a textured background, low contrast between the background and the characters, or uneven illumination). Furthermore, those methods tend to fail (or require additional post-processing) when two characters are connected or one character is broken or divided. In [21], the authors extract Extremal Regions [8], a well-known technique for text segmentation [9], [10].

Character Recognition (CR). This is a crucial task in ALPR systems, since a single recognition error can invalidate an entire LP. Moreover, some characters have a similar shape (e.g. O/0, I/1, A/4, K/X, C/G, D/O, 2/Z, 5/S, 8/B) and may become even more similar when distorted [4], [20]. Thus, having a CR method with high accuracy is mandatory. In [18], the authors used one SVM per character, leading to a total of 36 SVMs. Gou et al. [21] used HoG features with Restricted Boltzmann Machines (RBM), which are similar to Deep Belief Networks [12], to recognize Chinese characters – which are generally more complex than Roman characters. They argue that RBMs are more reliable than SVMs for this task. Deep CNNs were used for character recognition in [3] and [2]. In [3], the authors used a 16-layer CNN based on Spatial Transformer Networks [22], aiming to make CR less sensitive to spatial transformations.

In general, we noted that the use of DL in ALPR systems is still limited, and just a few works in fact explore it in an end-to-end fashion. Most of the methods found in the literature rely on hand-engineered features, like HoG and LBP. The data-driven features provided by DL techniques are generally more discriminative and can be coupled with classification in a single network. Furthermore, recent networks like Faster R-CNN and YOLO are capable of performing detection without the need for pyramids of images and features to test multiple scales, achieving high accuracy rates at limited execution times.

III. THE PROPOSED SYSTEM

As mentioned earlier, the traditional ALPR pipeline typically involves car detection. Although it might be an optional task for some methods, it is essential in our approach: all images in the dataset have high resolution (1920×1080 pixels)
Fig. 1. Proposed System pipeline: image acquisition, Front-View / License Plate Detection Network (FV/LPD-NET), License Plate Segmentation / Character Recognition Network (LPS/CR-NET), and a final heuristic step.
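The cascade of Fig. 1, followed by the letter/digit heuristic described later in this section, can be sketched as follows. This is a hypothetical illustration, not the authors' code: the detector callables are placeholders, and the confusion table is built only from the letter/digit pairs listed in Section II (O/0, I/1, A/4, 2/Z, 5/S, 8/B).

```python
# Letter<->digit confusion pairs taken from the similar-shape list in
# Section II; any character without an entry is kept unchanged.
LETTER_TO_DIGIT = {"O": "0", "I": "1", "A": "4", "Z": "2", "S": "5", "B": "8"}
DIGIT_TO_LETTER = {"0": "O", "1": "I", "4": "A", "2": "Z", "5": "S", "8": "B"}


def fix_plate(raw):
    """Heuristic step: force the three-letters + four-digits layout
    of Brazilian plates by swapping commonly confused characters."""
    assert len(raw) == 7
    letters = "".join(DIGIT_TO_LETTER.get(c, c) for c in raw[:3])
    digits = "".join(LETTER_TO_DIGIT.get(c, c) for c in raw[3:])
    return letters + digits


def alpr_pipeline(image, detect_frontal_views, detect_plate, detect_chars):
    """Cascade: frontal views -> one plate per view -> ordered characters.
    The three detector arguments are placeholder callables standing in for
    the FV/LPD-NET and LPS/CR-NET networks."""
    plates = []
    for fv_box in detect_frontal_views(image):
        lp_box = detect_plate(image, fv_box)
        if lp_box is None:
            continue
        # detect_chars returns (x_position, class_label) pairs; sorting by
        # x reads the seven characters left to right.
        chars = sorted(detect_chars(image, lp_box))
        plates.append(fix_plate("".join(label for _, label in chars)))
    return plates
```

The heuristic is safe precisely because the Brazilian layout is fixed: a "B" recognized in the digit half can only mean "8", and vice versa.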
and the cars do not occupy a large portion of the scene (they might be relatively far from the camera). Moreover, the average size of LPs is 0.26% of the full image area. Thus, the computational resources demanded to analyze an image of that size looking for such a small object are considerably high.

Taking into account that we are using DL techniques, known to be computationally expensive, avoiding small objects becomes even more important when one seeks a real-time application. Shrinking the image is not an alternative, because the plate may become excessively small, to the point of being undetectable. Considering that the available database does not provide annotated car labels or bounding boxes, detecting cars using only its training information turns out to be a challenging task.

We propose a simple method to extract the frontal portion of the cars based only on the annotated LP. Basically, assuming the LP is always attached to a car (and roughly aligned horizontally), we decided to use the region around it to train a detection network, as described in Section III-A.

One of our biggest concerns was to come up with a real-time ALPR system, without the need for expensive hardware. Therefore, we need a deep network capable of performing detection and recognition in a very short time. In order to accomplish those objectives, we chose to use a YOLO-based network [14]. As far as we know, these are the fastest networks reported in the literature – even faster than Faster-RCNN [13], which performs object detection/recognition using region proposals – without the need for (costly) image pyramids and sliding windows.

In our tests, the FAST-YOLO network was capable of performing 20-class detection and classification in an 800×600 image in less than 5.5 ms¹, or around 180 frames per second (FPS). The precision reported for this network was 57.1% on the VOC [23] database, which is not much when compared to other, deeper networks like YOLO (76.8%) and Faster-RCNN (76.4%). However, YOLO has 3× more parameters than FAST-YOLO, which makes it slower and data-hungry for training.

¹Hardware used: Intel Xeon E5620 2.4GHz processor, 12GB of RAM, and an NVIDIA TITAN X GPU.

As our intention is to tackle a two-class problem (car frontal views and plates) instead of twenty, we hypothesized that refining the FAST-YOLO network would be enough. Its architecture and the training procedure adopted are described in Section III-B.

For the detection and recognition of characters, we built a new network inspired by the YOLO architecture, with fundamental technical differences to accommodate 35 classes (0-9, A-Z except the letter O, which is detected jointly with the number 0) and to output a feature map with the same aspect ratio as the LP (whose width is three times larger than its height). Please refer to Section III-C for more details.

Finally, we perform a post-processing step in order to swap highly confused letters for digits and digits for letters, since a Brazilian license plate is formed by exactly three letters followed by four digits.

A. Frontal-View Extraction

One problem with DL techniques is that they struggle to detect small objects [24], and this indeed happened in our experiments. We trained a FAST-YOLO network to detect only LPs, and the result was a low 82% recall (which tends to be further reduced after CR).

The YOLO and FAST-YOLO networks have fixed input image dimensions, which in both cases are defined as 416 × 416. Since all images in the SSIG Database have 1920 × 1080 pixels, a shrinking step is required. When the car is sufficiently far from the camera, that step reduces the LP size to the point of being undetectable.

In order to overcome this problem, we extracted larger regions around the LP, namely car Frontal-Views (FV), by scaling and translating the annotated LP bounding box. The parameters for this spatial transformation were estimated using the following procedure:

1) Randomly selected a set of Nlp images from the training dataset, excluding images of buses and trucks, because their fronts are too large;
2) Manually labeled these images with a rectangular bounding box around the license plate, where each bounding box has the smallest region possible comprising the car headlights and tires (refer to the yellow dashed rectangle in Fig. 2);
3) Considering a 4-dimensional vector p containing the top-left point (px, py), width pw and height ph of the
annotated LP, we want to find two translation parameters (αx, αy) and two scaling parameters (αw, αh) that relate the LP bounding box p to the FV bounding box v. This transformation is described by the function

v = f(p, α) = (px + αx·pw, py + αy·ph, αw·pw, αh·ph)    (1)

where α contains the scaling and translation parameters.

4) To find α, we maximized the sum of the Intersection over Union (also known as Jaccard similarity) over the Nlp labeled images:

α* = argmaxα Σi=1..Nlp IoU(f(pi, α), vi)    (2)

5) With the estimated parameters α*, the FV bounding boxes for all remaining images in the dataset were labeled automatically, generating a large but weakly labeled set. An example of the weak labeling process is illustrated in Fig. 2.

Despite discarding bus and truck images when estimating the transformation parameters, the network trained on these labels was still capable of detecting these kinds of vehicles.

Related approaches can be found in [25], [26], where plate regions were used to estimate car frontal views in order to recognize the car model/manufacturer.

on a good GPU. Thus, we hypothesized that the FAST-YOLO network customized for 2 classes could have the capacity to accommodate both tasks in a single network when executed in a cascaded way.

The FAST-YOLO architecture used is shown in Table I. The only modification to the original was made in layer 15, where we reduced the number of filters from 125 to 35 in order to output 2 classes instead of 20 (please refer to [14] for more information about the detection layer).

TABLE I
FRONTAL-VIEW AND LICENSE PLATE DETECTION NETWORK (FV/LPD-NET): BASICALLY A FAST-YOLO NETWORK ADJUSTED TO OUTPUT 2 OBJECT CLASSES.

no | Layer     | Filters | Size    | Input            | Output
1  | conv      | 16      | 3×3 / 1 | 416 × 416 × 3    | 416 × 416 × 16
2  | max       |         | 2×2 / 2 | 416 × 416 × 16   | 208 × 208 × 16
3  | conv      | 32      | 3×3 / 1 | 208 × 208 × 16   | 208 × 208 × 32
4  | max       |         | 2×2 / 2 | 208 × 208 × 32   | 104 × 104 × 32
5  | conv      | 64      | 3×3 / 1 | 104 × 104 × 32   | 104 × 104 × 64
6  | max       |         | 2×2 / 2 | 104 × 104 × 64   | 52 × 52 × 64
7  | conv      | 128     | 3×3 / 1 | 52 × 52 × 64     | 52 × 52 × 128
8  | max       |         | 2×2 / 2 | 52 × 52 × 128    | 26 × 26 × 128
9  | conv      | 256     | 3×3 / 1 | 26 × 26 × 128    | 26 × 26 × 256
10 | max       |         | 2×2 / 2 | 26 × 26 × 256    | 13 × 13 × 256
11 | conv      | 512     | 3×3 / 1 | 13 × 13 × 256    | 13 × 13 × 512
12 | max       |         | 2×2 / 1 | 13 × 13 × 512    | 13 × 13 × 512
13 | conv      | 1024    | 3×3 / 1 | 13 × 13 × 512    | 13 × 13 × 1024
14 | conv      | 1024    | 3×3 / 1 | 13 × 13 × 1024   | 13 × 13 × 1024
15 | conv      | 35      | 1×1 / 1 | 13 × 13 × 1024   | 13 × 13 × 35
16 | detection |         |         |                  |
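Equations (1) and (2) can be illustrated with a short sketch. The exhaustive search over a candidate list is an assumption made here for illustration only; the paper does not state which optimizer was used to maximize Eq. (2).

```python
# Sketch of Eqs. (1)-(2): the frontal-view (FV) box is obtained from the
# LP box by a fixed scale/translation alpha, and alpha is fitted by
# maximizing the summed IoU against the Nlp manually labeled FV boxes.

def transform(p, alpha):
    """Eq. (1): map an LP box p = (px, py, pw, ph) to an FV box."""
    px, py, pw, ph = p
    ax, ay, aw, ah = alpha
    return (px + ax * pw, py + ay * ph, aw * pw, ah * ph)


def iou(a, b):
    """Intersection over Union (Jaccard similarity) of (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def fit_alpha(lp_boxes, fv_boxes, candidates):
    """Eq. (2): pick the candidate alpha with the highest summed IoU
    between transformed LP boxes and the labeled FV boxes."""
    return max(candidates,
               key=lambda alpha: sum(iou(transform(p, alpha), v)
                                     for p, v in zip(lp_boxes, fv_boxes)))
```

Once α* is fitted on the Nlp hand-labeled images, `transform` alone labels every remaining image, which is what makes the resulting FV set weakly labeled.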
The curves were extracted using a fixed FV threshold λfv = 0.2, and varying the LP threshold λpl uniformly in the range [0, 1]. As can be observed, our method is not very sensitive to threshold variations, except for very high or very low values.

…available yet, but the authors kindly provided their results using the SSIG Database.
⁴State-of-the-art commercial system, which has a cloud service available at https://www.sighthound.com/products/cloud. The results presented here were collected in June 2017.
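The threshold sweep described above can be sketched as follows. How detections are matched to ground truth (the `is_true_positive` flags) is abstracted away here, and the function name is illustrative, not the authors' code.

```python
# Illustrative sketch of sweeping the LP detection threshold: for each
# value of lambda_pl, detections whose confidence falls below it are
# discarded and precision, recall and F-measure are recomputed.

def f_measure_curve(detections, num_ground_truth, thresholds):
    """detections: list of (confidence, is_true_positive) pairs."""
    curve = []
    for t in thresholds:
        kept = [tp for conf, tp in detections if conf >= t]
        tp = sum(kept)  # True counts as 1
        precision = tp / len(kept) if kept else 0.0
        recall = tp / num_ground_truth if num_ground_truth else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall > 0 else 0.0)
        curve.append((t, f))
    return curve
```

Very low thresholds flood the evaluation with false positives (precision drops) and very high ones discard correct detections (recall drops), which is consistent with the sensitivity observed only at the extremes of the [0, 1] range.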
[Figure: F-measure × IoU curves, comparing "Ours" against "Gonçalves 2016".]

Our final evaluation is related to execution time. In Table IV we show the average time needed for each network to process an input, and the total time for the whole system (assuming that a single vehicle is being processed). Our ALPR system runs at 76 FPS using a high-end GPU, and it can achieve around 9 FPS with a cheaper mid-end GPU, which is also feasible in several applications, such as parking and toll monitoring systems.

TABLE IV
EXECUTION TIME: NETWORK FEED-FORWARD TIMES FOR HIGH- AND MID-END GPUS.

Network | Time (FPS), High-end GPU (NVIDIA TITAN X) | Time (FPS), Mid-end GPU (GeForce GT 750M)
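Since the cascade processes a single vehicle sequentially, the whole-system figures above follow from summing the per-network feed-forward times and taking the reciprocal. A minimal sketch of that arithmetic, with made-up times (the per-network rows of Table IV are not reproduced here):

```python
def system_fps(times_ms):
    """Sequential cascade: total latency is the sum of each network's
    feed-forward time in milliseconds; throughput (FPS) is its reciprocal."""
    total_ms = sum(times_ms)
    return 1000.0 / total_ms
```

For example, per-network times summing to about 13 ms would yield the reported ~76 FPS on the high-end GPU.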
objects inside objects, for instance a LP attached to a car or a character in the LP, or to explore the expected spatial distribution of characters within a plate for LPS.

Finally, we want to include car manufacturer/model recognition in the ALPR pipeline. For this, the above-mentioned parts-based models can be of great help.

ACKNOWLEDGMENT

The authors would like to thank the funding agencies CAPES and CNPq, as well as NVIDIA Corporation for the donation of the Titan X Pascal GPU used for this research.

REFERENCES

[1] G. R. Gonçalves, S. P. G. da Silva, D. Menotti, and W. R. Schwartz, "Benchmark for license plate character segmentation," Journal of Electronic Imaging, vol. 25, no. 5, pp. 1–5, 2016.
[2] S. Z. Masood, G. Shu, A. Dehghan, and E. G. Ortiz, "License Plate Detection and Recognition Using Deeply Learned Convolutional Neural Networks," 2017. [Online]. Available: http://arxiv.org/abs/1703.07330
[3] V. Jain, Z. Sasindran, A. Rajagopal, S. Biswas, H. S. Bharadwaj, and K. R. Ramakrishnan, "Deep automatic license plate recognition system," in Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP '16). New York, NY, USA: ACM Press, 2016, pp. 1–8.
[4] C.-N. E. Anagnostopoulos, "License Plate Recognition: A Brief Tutorial," IEEE Intelligent Transportation Systems Magazine, vol. 6, no. 1, pp. 59–67, 2014.
[5] J. W. Hsieh, L. C. Chen, and D. Y. Chen, "Symmetrical SURF and its applications to vehicle detection and vehicle make and model recognition," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 1, pp. 6–20, 2014.
[6] R. F. Prates, G. Cámara-Chávez, W. R. Schwartz, and D. Menotti, "Brazilian License Plate Detection Using Histogram of Oriented Gradients and Sliding Windows," International Journal of Computer Science and Information Technology, vol. 5, no. 6, pp. 39–52, 2013.
[7] Y. Yuan, W. Zou, Y. Zhao, X. Wang, X. Hu, and N. Komodakis, "A Robust and Efficient Approach to License Plate Detection," IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1102–1114, Mar. 2017.
[8] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust Wide Baseline Stereo from Maximally Stable Extremal Regions," in Proceedings of the British Machine Vision Conference 2002. British Machine Vision Association, 2002, pp. 36.1–36.10.
[9] H. Cho, M. Sung, and B. Jun, "Canny Text Detector: Fast and Robust Scene Text Localization Algorithm," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jun. 2016, pp. 3566–3573.
[10] M.-C. Sung, B. Jun, H. Cho, and D. Kim, "Scene text detection with robust character candidate extraction method," in 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, Aug. 2015, pp. 426–430.
[11] Q. Ye and D. Doermann, "Text detection and recognition in imagery: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 7, pp. 1480–1500, Jul. 2015.
[12] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, Jul. 2006.
[13] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jun. 2016, pp. 779–788.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jun. 2016, pp. 770–778.
[16] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14). Cambridge, MA, USA: MIT Press, 2014, pp. 3320–3328.
[17] D. Lee, S. Yoon, J. Lee, and D. S. Park, "Real-Time License Plate Detection Based on Faster R-CNN," KIPS Transactions on Software and Data Engineering, vol. 5, no. 11, pp. 511–520, Nov. 2016.
[18] G. R. Gonçalves, D. Menotti, and W. R. Schwartz, "License plate recognition based on temporal redundancy," in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Nov. 2016, pp. 2577–2582.
[19] M. Hill, "OpenALPR," https://github.com/openalpr/openalpr.
[20] S. Du, M. Ibrahim, M. Shehata, and W. Badawy, "Automatic license plate recognition (ALPR): A state-of-the-art review," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 2, pp. 311–325, Feb. 2013.
[21] C. Gou, K. Wang, Y. Yao, and Z. Li, "Vehicle License Plate Recognition Based on Extremal Regions and Restricted Boltzmann Machines," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. 1096–1107, Apr. 2016.
[22] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2017–2025. [Online]. Available: http://papers.nips.cc/paper/5854-spatial-transformer-networks.pdf
[23] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, Jun. 2010.
[24] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 2110–2118.
[25] A. Psyllos, C. Anagnostopoulos, and E. Kayafas, "Vehicle model recognition from frontal view image measurements," Computer Standards & Interfaces, vol. 33, no. 2, pp. 142–151, Feb. 2011.
[26] H. Yang, L. Zhai, Z. Liu, L. Li, Y. Luo, Y. Wang, H. Lai, and M. Guan, "An Efficient Method for Vehicle Model Identification via Logo Recognition," in 2013 International Conference on Computational and Information Sciences. IEEE, Jun. 2013, pp. 1080–1083.
[27] M. Enzweiler and D. M. Gavrila, "Monocular pedestrian detection: Survey and experiments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2179–2195, 2009.
[28] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in European Conference on Computer Vision, 2014, pp. 834–849.
[29] H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, and D. Metaxas, "SPDA-CNN: Unifying Semantic Part Detection and Abstraction for Fine-Grained Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jun. 2016, pp. 1143–1152.