
LPRNet: License Plate Recognition via Deep Neural Networks

Sergey Zherzdev (ex-Intel*)
IOTG Computer Vision Group
sergeyzherzdev@gmail.com

Alexey Gruzdev (Intel)
IOTG Computer Vision Group
alexey.gruzdev@intel.com

* This work was done when Sergey was an Intel employee.
arXiv:1806.10447v1 [cs.CV] 27 Jun 2018

Abstract

This paper proposes LPRNet - an end-to-end method for Automatic License Plate Recognition without preliminary character segmentation. Our approach is inspired by recent breakthroughs in Deep Neural Networks and works in real-time with recognition accuracy up to 95% for Chinese license plates: 3 ms/plate on an nVIDIA® GeForce™ GTX 1080 and 1.3 ms/plate on an Intel® Core™ i7-6700K CPU.
LPRNet consists of a lightweight Convolutional Neural Network, so it can be trained in an end-to-end way. To the best of our knowledge, LPRNet is the first real-time License Plate Recognition system that does not use RNNs. As a result, the LPRNet algorithm may be used to create embedded solutions for LPR that feature high accuracy even on challenging Chinese license plates.

1. Introduction

Automatic License Plate Recognition is a challenging and important task used in traffic management, digital security surveillance, vehicle recognition and parking management of big cities. The task is a complex problem due to many factors, which include but are not limited to: blurry images, poor lighting conditions, variability of license plate numbers (including special characters, e.g. logograms for China and Japan), physical impact (deformations) and weather conditions (see some examples in Fig. 1).

Figure 1. Example of LPRNet recognitions

A robust Automatic License Plate Recognition system needs to cope with a variety of environments while maintaining a high level of accuracy; in other words, the system should work well in natural conditions.
This paper tackles the License Plate Recognition problem and introduces the LPRNet algorithm, which is designed to work without pre-segmentation and subsequent recognition of characters. In the present paper we do not consider the License Plate Detection problem; however, for our particular case it can be done through an LBP-cascade.
LPRNet is based on a Deep Convolutional Neural Network. Recent studies proved the effectiveness and superiority of Convolutional Neural Networks in many Computer Vision tasks such as image classification, object detection and semantic segmentation. However, running most of them on embedded devices still remains a challenging problem.
LPRNet is a very efficient neural network, which takes only 0.34 GFLOPs to make a single forward pass. Our model also runs in real-time on an Intel Core i7-6700K SkyLake CPU with high accuracy on challenging Chinese license plates and can be trained end-to-end. Moreover, LPRNet can be partially ported to FPGA, which can free up CPU power for other parts of the pipeline. Our main contributions can be summarized as follows:

• LPRNet is a real-time framework for high-quality license plate recognition supporting template- and character-independent variable-length license plates, performing LPR without character pre-segmentation, trainable end-to-end from scratch for different national license plates.
• LPRNet is the first real-time approach that does not use Recurrent Neural Networks and is lightweight enough to run on a variety of platforms, including embedded devices.
• Application of LPRNet to real traffic surveillance video shows that our approach is robust enough to handle difficult cases, such as perspective and camera-dependent distortions, hard lighting conditions, change of viewpoint, etc.

The rest of the paper is organized as follows. Section 2 describes the related work. In sec. 3 we review the details of the LPRNet model. Sec. 4 reports the results on Chinese License Plates and includes an ablation study of our algorithm. We summarize and conclude our work in sec. 5.

2. Related Work

In earlier works on general LP recognition, such as [1], the pipeline consists of character segmentation and character classification stages:

• Character segmentation typically uses different hand-crafted algorithms combining projections, connectivity and contour-based image components. It takes a binary image or an intermediate representation as input, so character segmentation quality is highly affected by input image noise, low resolution, blur or deformations.
• Character classification typically utilizes one of the optical character recognition (OCR) methods adapted to the LP character set.

Since classification follows character segmentation, end-to-end recognition quality depends heavily on the applied segmentation method. In order to solve the problem of character segmentation, end-to-end solutions based on Convolutional Neural Networks (CNNs) were proposed, taking the whole LP image as input and producing the output character sequence.
The segmentation-free model in [2] is based on variable-length sequence decoding driven by connectionist temporal classification (CTC) loss [3, 4]. It uses hand-crafted LBP features built on a binarized image as CNN input to produce character class probabilities. Applied to all input image positions via a sliding window, the CNN produces the input sequence for a bi-directional Long Short-Term Memory (LSTM) [5] based decoder. Since the decoder output and target character sequence lengths are different, CTC loss is used for pre-segmentation-free end-to-end training.
The model in [6] mostly follows the approach described in [2], except that the sliding window method is replaced by spatial splitting of the CNN output into the RNN input sequence (a "sliding window" over the feature map instead of the input).
In contrast, [7] uses a CNN-based model on the whole LP image to produce a global LP embedding, which is decoded into an 11-character-length sequence via 11 fully connected model heads. Each of the heads is trained to classify the i-th target string character (which is assumed to be padded to a predefined fixed length), so the whole recognition can be done in a single feed-forward pass. It also utilizes the Spatial Transformer Network (STN) [8] to reduce the effect of input image deformations.
The algorithm in [9] makes an attempt to solve both the license plate detection and license plate recognition problems with a single Deep Neural Network.
Recent work [10] exploits a synthetic data generation approach based on Generative Adversarial Networks [11] to obtain a large, representative license plate dataset.
In our approach, we avoided using hand-crafted features over a binarized image - instead we used raw RGB pixels as CNN input. The LSTM-based sequence decoder working on the outputs of a sliding-window CNN was replaced with a fully convolutional model whose output is interpreted as a sequence of character probabilities for CTC loss training and greedy or prefix search string inference. For better performance, the pre-decoder intermediate feature map was augmented with the global context embedding as described in [12]. The backbone CNN model was also reduced significantly using a low-computation-cost basic building block inspired by SqueezeNet Fire Blocks [13] and the Inception Blocks of [14, 15, 16]. Batch Normalization [17] and Dropout [18] techniques were used for regularization.
The LP image input size affects both the computational cost and the recognition quality [19]; as a result, there is a trade-off between using high [6] or moderate [7, 2] resolution.

3. LPRNet

3.1. Design architecture

In this section we describe our LPRNet network architecture design in detail.
Recent studies tend to use parts of powerful classification networks such as VGG, ResNet or GoogLeNet as a "backbone" for their tasks by applying transfer learning techniques. However, this is not the best option for building fast and lightweight networks, so in our case we redesigned the main "backbone" network, applying recently discovered architecture tricks.
The basic building block of our CNN model backbone (Table 2) was inspired by SqueezeNet Fire Blocks [13] and the Inception Blocks of [14, 15, 16]. We also followed research best practices and used Batch Normalization [17] and ReLU activation after each convolution operation.

Layer Type   Parameters/Dimensions
Input        Cin × H × W feature map
Convolution  #Cout/4  1x1  stride 1
Convolution  #Cout/4  3x1  stride_h=1, pad_h=1
Convolution  #Cout/4  1x3  stride_w=1, pad_w=1
Convolution  #Cout    1x1  stride 1
Output       Cout × H × W feature map
Table 2. Small Basic Block
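To make the block structure concrete, here is a minimal sketch (not the authors' released code) of the Small Basic Block from Table 2, using the TensorFlow Keras API; the 'same' padding and the placement of Batch Normalization and ReLU after every convolution are assumptions based on the description above.

```python
import tensorflow as tf

def small_basic_block(x, c_out):
    """Sequential 1x1 -> 3x1 -> 1x3 -> 1x1 convolutions, as listed in Table 2."""
    def conv_bn_relu(t, filters, kernel):
        # Batch Normalization [17] and ReLU follow each convolution (Sec. 3.1).
        t = tf.keras.layers.Conv2D(filters, kernel, padding="same")(t)
        t = tf.keras.layers.BatchNormalization()(t)
        return tf.keras.layers.ReLU()(t)

    x = conv_bn_relu(x, c_out // 4, (1, 1))  # squeeze channels to C_out/4
    x = conv_bn_relu(x, c_out // 4, (3, 1))  # 3x1 convolution: vertical context
    x = conv_bn_relu(x, c_out // 4, (1, 3))  # 1x3 convolution: horizontal context
    return conv_bn_relu(x, c_out, (1, 1))    # expand back to C_out channels
```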
In a nutshell, our design consists of:

• location network with Spatial Transformer Layer [8] (optional)
• light-weight convolutional neural network (backbone)
• per-position character classification head
• character probabilities for further sequence decoding
• post-filtering procedure

First, the input image is preprocessed by the Spatial Transformer Layer, as proposed in [8]. This step is optional, but it allows exploring how one can transform the input image to obtain better characteristics for recognition. The original LocNet architecture (see Table 1) was used to estimate the optimal transformation parameters.

Layer Type     Parameters
Input          94x24 pixels RGB image
AvgPooling     #32 3x3 stride 2   |   —
Convolution    #32 5x5 stride 3   |   #32 5x5 stride 5
Concatenation  channel-wise
Dropout        0.5 ratio
FC             #32 with TanH activation
FC             #6 with scaled TanH activation
Table 1. LocNet architecture

The backbone network architecture is described in Table 3. The backbone takes a raw RGB image as input and calculates spatially distributed rich features. A wide convolution (with a 1 × 13 kernel) utilizes the local character context instead of an LSTM-based RNN. The backbone subnetwork output can be interpreted as a sequence of character probabilities whose length corresponds to the input image pixel width. Since the decoder output and the target character sequence are of different lengths, we apply CTC loss [20] for segmentation-free end-to-end training. CTC loss is a well-known approach for situations where input and output sequences are misaligned and have variable lengths. Moreover, CTC provides an efficient way to go from probabilities at each time step to the probability of an output sequence. A more detailed explanation of CTC loss can be found in [20].

Layer Type         Parameters
Input              94x24 pixels RGB image
Convolution        #64  3x3  stride 1
MaxPooling         #64  3x3  stride 1
Small basic block  #128 3x3  stride 1
MaxPooling         #64  3x3  stride (2, 1)
Small basic block  #256 3x3  stride 1
Small basic block  #256 3x3  stride 1
MaxPooling         #64  3x3  stride (2, 1)
Dropout            0.5 ratio
Convolution        #256 4x1  stride 1
Dropout            0.5 ratio
Convolution        #(class number)  1x13  stride 1
Table 3. Backbone Network Architecture
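To make the training objective concrete, here is a minimal sketch (not the authors' code) of applying CTC loss to the backbone output with TensorFlow; the batch-major shapes, the dense label encoding and the blank-at-last-index convention are our assumptions.

```python
import tensorflow as tf

def ctc_training_loss(logits, labels, label_length):
    """logits: (batch, T, num_classes) class scores over the T horizontal positions
    of the backbone output, with the last class reserved for the CTC blank.
    labels: (batch, max_label_len) int32 character indices; label_length: (batch,)."""
    batch = tf.shape(logits)[0]
    time_steps = tf.shape(logits)[1]
    logit_length = tf.fill([batch], time_steps)  # every plate uses the full feature-map width
    loss = tf.nn.ctc_loss(
        labels=labels,
        logits=logits,
        label_length=label_length,
        logit_length=logit_length,
        logits_time_major=False,   # our logits are batch-major
        blank_index=-1,            # last class acts as the CTC blank
    )
    return tf.reduce_mean(loss)
```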

To further improve performance, the pre-decoder intermediate feature map was augmented with the global context embedding, as in [12]. It is computed via a fully-connected layer over the backbone output, tiled to the desired size and concatenated with the backbone output. In order to adjust the depth of the feature map to the number of character classes, an additional 1 × 1 convolution is applied.
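A minimal sketch of this global-context augmentation, following our reading of the description above and of ParseNet [12]; the global average pooling before the fully-connected layer and the embedding size are assumptions, not values from the paper.

```python
import tensorflow as tf

def add_global_context(feature_map, context_dim=256):
    """feature_map: (batch, H, W, C) pre-decoder feature map from the backbone."""
    h, w = tf.shape(feature_map)[1], tf.shape(feature_map)[2]
    # Fully-connected layer over the (globally pooled) backbone output -> global embedding.
    pooled = tf.keras.layers.GlobalAveragePooling2D()(feature_map)
    context = tf.keras.layers.Dense(context_dim, activation="relu")(pooled)
    # Tile the embedding to the spatial size of the feature map ...
    context = tf.tile(tf.reshape(context, [-1, 1, 1, context_dim]), [1, h, w, 1])
    # ... and concatenate it with the original features along the channel axis.
    return tf.concat([feature_map, context], axis=-1)
```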
For the decoding procedure at the inference stage we considered two options: greedy search and beam search. While greedy search takes the maximum of the class probabilities at each position, beam search maximizes the total probability of the output sequence [3, 4].
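The two decoding modes can be illustrated with the following sketch (our own conventions: the greedy decoder works on a single plate, and the blank label is assumed to be the last class, matching TensorFlow's CTC ops).

```python
import numpy as np
import tensorflow as tf

def greedy_decode(probs):
    """probs: (T, num_classes) per-position class probabilities for one plate."""
    blank = probs.shape[1] - 1            # last class is treated as the CTC blank
    best = np.argmax(probs, axis=1)       # most probable class at each position
    decoded, prev = [], blank
    for cls in best:
        if cls != blank and cls != prev:  # collapse repeated labels, drop blanks
            decoded.append(int(cls))
        prev = cls
    return decoded

def beam_decode(logits, beam_width=10, top_paths=5):
    """logits: (T, batch, num_classes) time-major scores, as expected by TensorFlow."""
    seq_len = tf.fill([tf.shape(logits)[1]], tf.shape(logits)[0])
    decoded, log_probs = tf.nn.ctc_beam_search_decoder(
        logits, seq_len, beam_width=beam_width, top_paths=top_paths)
    return decoded, log_probs             # top-N candidate sequences with their scores
```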
For post-filtering we use a task-oriented language model implemented as a set of target-country LP templates. Note that post-filtering is applied together with beam search. The post-filtering procedure gets the top-N most probable sequences found by beam search and returns the first one that matches the set of predefined templates, which depends on the country's LP regulations.
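A hedged sketch of such a template-based post-filter; the regular expressions below are purely illustrative placeholders, not an actual national LP specification, and the fallback to the best raw candidate is our own choice.

```python
import re

# Illustrative templates only; a real deployment would encode the target
# country's LP regulations here.
TEMPLATES = [
    re.compile(r"^[A-Z]{2}\d{4}$"),
    re.compile(r"^[A-Z]\d{5}$"),
]

def post_filter(candidates):
    """candidates: top-N decoded strings, ordered from most to least probable."""
    for text in candidates:
        if any(t.match(text) for t in TEMPLATES):
            return text          # first (most probable) candidate matching a template
    return candidates[0]         # assumption: fall back to the best beam-search result
```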
3.2. Training details

All training experiments were done with the help of TensorFlow [21].
We train our model with the Adam optimizer using a batch size of 32, an initial learning rate of 0.001 and a gradient noise scale of 0.001. We drop the learning rate by a factor of 10 after every 100k iterations and train our network for 250k iterations in total.
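For concreteness, the schedule above could be expressed roughly as in the sketch below; only the hyper-parameter values come from the text, while the piecewise schedule and the way gradient noise is injected are our assumptions.

```python
import tensorflow as tf

BATCH_SIZE = 32
TOTAL_ITERS = 250_000

# Initial learning rate 0.001, dropped by a factor of 10 after every 100k iterations.
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[100_000, 200_000], values=[1e-3, 1e-4, 1e-5])
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

def apply_gradients_with_noise(grads_and_vars, noise_scale=1e-3):
    """Add Gaussian noise (scale 0.001) to the gradients before the Adam update."""
    noisy = [(g + tf.random.normal(tf.shape(g), stddev=noise_scale), v)
             for g, v in grads_and_vars if g is not None]
    optimizer.apply_gradients(noisy)
```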
In our experiments we use data augmentation by random affine transformations, e.g. rotation, scaling and shift.
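A small sketch of one such augmentation step using OpenCV; the rotation, scale and shift ranges below are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np

def random_affine(plate_img, max_angle=5.0, max_shift=2.0, scale_range=(0.95, 1.05)):
    """Apply a random rotation, scaling and shift to a cropped plate image."""
    h, w = plate_img.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    scale = np.random.uniform(*scale_range)
    # Rotation and scaling around the plate center ...
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    # ... plus a small random translation in pixels.
    m[:, 2] += np.random.uniform(-max_shift, max_shift, size=2)
    return cv2.warpAffine(plate_img, m, (w, h), borderMode=cv2.BORDER_REPLICATE)
```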
It is worth mentioning that applying LocNet from the beginning of training leads to degradation of results, because LocNet cannot get reasonable gradients from a recognizer which is typically too weak during the first few iterations. So, in our experiments, we turn LocNet on only after 5k iterations.
All other hyper-parameters were chosen by cross-validation over the target dataset.

4. Results of the Experiments

The LPRNet baseline network, from which we started our experiments with different architectures, was inspired by [2]. It is mainly based on Inception blocks followed by a bidirectional LSTM (biLSTM) decoder and trained with CTC loss. We first performed some experiments aimed at replacing biLSTM with biGRU cells, but we did not observe any clear benefits of using biGRU over biLSTM.
Then, we focused on eliminating the complicated biLSTM decoder, because most modern embedded devices still do not have sufficient compute and memory to execute biLSTM efficiently. Importantly, our LSTM is applied to a spatial sequence rather than to a temporal one, so all LSTM inputs are known upfront both at the training stage and at the inference stage. Therefore we believe that the RNN can be replaced by spatial convolutions without a significant drop in accuracy. The RNN-less model with some backbone modifications is referred to as LPRNet basic; it was described in detail in sec. 3.
To improve runtime performance we also modified LPRNet basic by using 2 × 2 strides for all pooling layers. This modification (the LPRNet reduced model) significantly reduces the size of intermediate feature maps and the total inference computational cost (see the GFLOPs column of Table 4).

4.1. Chinese License Plates dataset

We tested our approach on a private dataset containing various images of Chinese license plates collected from different security and surveillance cameras. The dataset was first run through the LBP-based detector to get bounding boxes for each license plate. Then, all license plates were manually labeled. In total, the dataset contains 11696 cropped license plate images, which are split 9:1 into training and validation subsets respectively.
Automatically cropped license plate images were used for training to make the network more robust to detection artifacts, because in some cases plates are cropped with some background around the edges, while in other cases they are cropped too close to the edges with no background at all, or even with some parts of the license plate missing.
Table 4 shows recognition accuracies achieved by different models.

Method           Recognition Accuracy, %   GFLOPs
LPRNet baseline  94.1                      0.71
LPRNet basic     95.0                      0.34
LPRNet reduced   94.0                      0.163
Table 4. Results on Chinese License Plates.

4.2. Ablation study

It is of vital importance to conduct an ablation study to identify the correlation between the various enhancements and the respective accuracy/performance improvements. This helps other researchers adopt ideas from the paper and reuse the most promising architecture approaches. Table 5 shows a summary of architecture approaches and their impact on accuracy.
Approach         |  1  |  2  |  3  |  4   |  5  |  6  |  7  |  8
Global context   |     |     |     |  ✓   |  ✓  |  ✓  |  ✓  |  ✓
Data augm.       |  ✓  |  ✓  |  ✓  |      |  ✓  |  ✓  |  ✓  |  ✓
STN-alignment    |     |  ✓  |  ✓  |      |     |  ✓  |  ✓  |  ✓
Beam Search      |     |     |  ✓  |      |     |     |  ✓  |  ✓
Post-filtering   |     |     |  ✓  |      |     |     |     |  ✓
Accuracy, %      | 53.4| 58.6| 59.0| 62.95| 91.6| 94.4| 94.4| 95.0
Table 5. Effects of various tricks on LPRNet quality.
As one can see, the largest accuracy gain (36%) was achieved by using the global context. The data augmentation techniques also help to improve accuracy significantly (by 28.6%). Without using data augmentation and the global context we could not train the model from scratch.
The STN-based alignment subnetwork provides a noticeable improvement of 2.8-5.2%. Beam Search with post-filtering further improves recognition accuracy by 0.4-0.6%.
4.3. Performance analysis

The LPRNet reduced model was ported to various hardware platforms including CPU, GPU and FPGA. The results are presented in Table 6.

Target platform                           1 LP processing time
GPU + cuDNN                               3 ms
CPU (using Caffe [22])                    11-15 ms
CPU + FPGA (using DLA [23])               4 ms*
CPU (using IE from Intel OpenVINO [24])   1.3 ms
Table 6. Performance analysis.
* The LPRNet reduced model was used.
Here GPU is an nVIDIA® GeForce™ 1080, CPU is an Intel® Core™ i7-6700K SkyLake, FPGA is an Intel® Arria™ 10, and IE stands for the Inference Engine from Intel® OpenVINO.

5. Conclusions and Future Work

In this work, we have shown that one can utilize fairly small convolutional neural networks for License Plate Recognition. The LPRNet model was introduced, which can be used for challenging data, achieving up to 95% recognition accuracy. Architecture details and their motivation were presented, and an ablation study was conducted.
We showed that LPRNet can perform inference in real-time on a variety of hardware architectures including CPU, GPU and FPGA. We have no doubt that LPRNet could attain real-time performance even on more specialized embedded low-power devices.
LPRNet can likely be compressed using modern pruning and quantization techniques, which would potentially help to reduce the computational complexity even further.
As a future direction of research, LPRNet can be extended by merging a CNN-based detection part into our algorithm, so that both detection and recognition tasks are evaluated as a single network, in order to outperform the LBP-based cascaded detector quality.

6. Acknowledgements

We would like to thank the Intel® IOTG Computer Vision (ICV) Optimization team for porting the model to the Intel Inference Engine of OpenVINO, as well as the Intel® IOTG Computer Vision (ICV) OVX FPGA team for porting the model to the DLA. We also would like to thank the Intel® PSG DLA and Intel® Computer Vision SDK teams for their tools and support.

References

[1] C. N. E. Anagnostopoulos, I. E. Anagnostopoulos, I. D. Psoroulas, V. Loumos, and E. Kayafas, "License Plate Recognition From Still Images and Video Sequences: A Survey," IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 3, pp. 377-391, Sep. 2008.

[2] H. Li and C. Shen, "Reading Car License Plates Using Deep Convolutional Neural Networks and LSTMs," arXiv:1601.05610 [cs], Jan. 2016.

[3] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, 2012th ed. Heidelberg; New York: Springer, Feb. 2012.

[4] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369-376.

[5] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, Nov. 1997.

[6] T. K. Cheang, Y. S. Chong, and Y. H. Tay, "Segmentation-free Vehicle License Plate Recognition using ConvNet-RNN," arXiv:1701.06439 [cs], Jan. 2017.

[7] V. Jain, Z. Sasindran, A. Rajagopal, S. Biswas, H. S. Bharadwaj, and K. R. Ramakrishnan, "Deep Automatic License Plate Recognition System," in Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, ser. ICVGIP '16. New York, NY, USA: ACM, 2016, pp. 6:1-6:8.

[8] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial Transformer Networks," arXiv:1506.02025 [cs], Jun. 2015.

[9] H. Li, P. Wang, and C. Shen, "Towards End-to-End Car License Plates Detection and Recognition with Deep Neural Networks," ArXiv e-prints, Sep. 2017.

[10] X. Wang, Z. Man, M. You, and C. Shen, "Adversarial Generation of Training Examples: Applications to Moving Vehicle License Plate Recognition," ArXiv e-prints, Jul. 2017.

[11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Networks," ArXiv e-prints, Jun. 2014.

[12] W. Liu, A. Rabinovich, and A. C. Berg, "ParseNet: Looking Wider to See Better," arXiv:1506.04579 [cs], Jun. 2015.

[13] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360 [cs], Feb. 2016.

[14] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," arXiv:1602.07261 [cs], Feb. 2016.

[15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," arXiv:1409.4842 [cs], Sep. 2014.

[16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," arXiv:1512.00567 [cs], Dec. 2015.

[17] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," arXiv:1502.03167 [cs], Feb. 2015.

[18] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929-1958, Jan. 2014.
[19] S. Agarwal, D. Tran, L. Torresani, and H. Farid, "Deciphering Severely Degraded License Plates," San Francisco, CA, 2017.

[20] A. Hannun, "Sequence Modeling with CTC," Distill, 2017, https://distill.pub/2017/ctc.

[21] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," arXiv:1603.04467 [cs], Mar. 2016.

[22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," arXiv preprint arXiv:1408.5093, 2014.

[23] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL(TM) Deep Learning Accelerator on Arria 10," arXiv:1701.03534 [cs], Jan. 2017.

[24] "Intel OpenVINO Toolkit | Intel Software." [Online]. Available: https://software.intel.com/en-us/articles/OpenVINO-InferEngine
