
2019 22nd International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS)

A Sketch Classifier Technique with Deep Learning Models Realized in an Embedded System
Tsung-Han Tsai, Po-Ting Chi, Kuo-Hsing Cheng
Dept. of Electrical Engineering, National Central University, Taoyuan, Taiwan
han@ee.ncu.edu.tw, 106521049@cc.ncu.edu.tw, cheng@ee.ncu.edu.tw

Abstract—Since 2011, growth in the amount of information, innovation in learning algorithms, and improvements in computer technology have made the application of artificial intelligence feasible in a wide range of fields. This paper presents a sketch classifier technique with deep learning models. We use the depth-wise convolution layer to lighten the deep neural network. The results show that computation is reduced to approximately 1/5. We use the Google Quick Draw dataset to train and evaluate the network, which achieves 98% accuracy in 10 categories and 85% accuracy in 100 categories. Finally, we realize it on the STM32F469I Discovery development board for demonstration. The system achieves real-time sketch classification.

Keywords—Deep Learning, Neural Network, Embedded System, Sketch Classification

I. INTRODUCTION
Machine learning is a class of algorithms that allows computers to make predictions or classifications using mathematical models. This approach requires a large amount of raw data, together with human-labeled ground truth, to train, adjust, and select the corresponding mathematical models. At the same time, the model's performance is evaluated on validation data to determine whether the model is suitable for the task.

Deep learning algorithms have become extremely popular since 2012, when the AlexNet model [1] significantly improved the classification accuracy in the ILSVRC 2012 competition. Different models and training methods are proposed every year. For example, in the ImageNet classification task, the recognition rate of deep neural networks has exceeded the human average (a 5.1% error rate), reaching a new milestone.

Compared with traditional algorithms, the deep neural network model is more robust and can distinguish positive and negative samples more accurately in various complex scenes. It can solve various problems such as object detection and face recognition, and it can be used as a feature extractor and a classifier for images and audio at the same time. The disadvantage is that as the size of a deep neural network model increases with the complexity of the output and input, the number of parameters increases significantly. To meet the trend of so-called edge computing for deep learning, realization on embedded systems plays an important role. How to simplify the network parameters and reduce the amount of computation will be the main focus of deep neural networks implemented on embedded platforms.

In addition to common toys, parents expect more toys to stimulate children's motor development and logical thinking. At present, there are many electronic products for preschool education, such as digital drawing panels, word cards, puzzle games, and other digital learning materials. The use of digital products has become one of the most common activities of children in this generation. Preschool children use digital products earlier and earlier, and most parents reward their children by letting them use their phones or other digital products when they perform well.

The goal is to teach children the shapes of various objects through the game and to promote children's creativity and learning ability. Since deep learning techniques have been applied to image classification, preschool learning material can be combined with cutting-edge technology, using a DNN to train a model that recognizes the sketch the user draws on the digital drawing panel. Models trained on a huge amount of data can achieve significantly higher accuracy over a large number of categories. The product can use different game rules, such as identifying which category is closest to the drawn sketch or asking the user to draw a sketch of a specific category.

In this work, we implement the sketch recognition system on the STM32F469I Discovery development board. The block diagram is shown in Fig. 1. The touch control device is more intuitive than the traditional button and rod

Figure 1. The hardware block of the touch panel sketching system

978-1-7281-0073-9/19/$31.00 ©2019 IEEE

interface, providing a direct, convenient, and accurate way to interact with applications. Touch devices are more attractive than interfaces that require prior familiarity with input devices. In this work, the touchpad can be used directly to draw with fingers or a stylus, and a deep neural network provides a real-time, high-precision identification system that is easy for children to use.

II. RELATED WORKS
Handwriting recognition is a very important benchmark in machine learning. MNIST [2], the handwritten digit recognition dataset, is used to validate many models. In earlier research, feature extraction methods such as principal component analysis (PCA) and radial basis functions (RBF) were combined with various linear classifiers such as the single-layer perceptron, as well as K-nearest-neighbors [3], SVM [4], and so on, for handwritten digit recognition, where the V-SVM [5] method has been able to achieve more than 99% classification accuracy.

Although the MNIST dataset has 60,000 training images, each digit has only about 6,000. The handwritten digit recognition dataset is therefore not complex enough to evaluate modern algorithms, which means the performance of neural network models cannot be measured well. Thus, we use the Quick Draw dataset [6] provided by Google. It contains more than 50 million sketches in 345 categories, with an image size of 28 x 28.

With enough data, we can evaluate the model reduction method more accurately. In order to implement the network on the embedded system, we do not consider architectures that increase the amount of computation, such as ResNet [7] or LSTM [8]. To improve efficiency under a low computation budget, we also do not use low-bit fixed-point execution, as used in BinaryNet [9] or XNOR-Net [10].

III. EXPERIMENTAL
The neural network model used in this work is a fully convolutional network, as shown in Table 1. To reduce the number of parameters and the amount of computation, this work uses neither batch normalization layers nor fully connected layers, nor does it scale images to different resolutions.

The original network with traditional convolution layers has 104976 single-precision floating-point parameters, which corresponds to approximately 10M multiplications and additions. Each multiply-add, together with other CPU processing, requires 5 to 15 cycles on the STM development board, depending on the instruction and the memory access time.

Table 1. The base model structures
Layer   Size-in    Size-out   Kernel      Param    FLOPS
Conv1   28x28x1    28x28x16   3x3x1,16    144      113K
Conv2   28x28x16   28x28x16   3x3x16,16   2304     1.8M
Pool1   28x28x16   14x14x16   2x2         -        -
Conv3   14x14x16   14x14x32   3x3x16,32   4608     900K
Pool2   14x14x32   7x7x32     2x2         -        -
Conv4   7x7x32     7x7x64     3x3x32,64   18432    900K
Conv5   7x7x64     5x5x64     3x3x64,64   36864    1.8M
Conv6   5x5x64     3x3x64     3x3x64,64   36864    900K
Conv7   3x3x64     1x1x10     3x3x64,10   5760     51K
Total                                     104976   6.5M

In order to reduce the number of parameters and the amount of computation, we optimize the network with respect to the following factors:
1. Reduce the number of filter units
2. Increase the number of pooling layers
3. Reduce the number of input frames
4. Reduce the size of the convolution kernel
5. Balance the parameters of each layer and the amount of calculation

As a result, the model structure used in this work is the depth-wise separable convolution layer used in the MobileNet [11] model. It reduces each layer's computation cost to the fraction given in (1), where $N$ is the number of filters in the output layer and $D_k$ is the kernel size of the original convolution layer.

$$\frac{1}{N} + \frac{1}{D_k^2} \tag{1}$$

The reduced model is shown in Table 2. The parameter count and computation cost for 10-category classification can be reduced to 1/5 of those of the basic model.

The experiments are run on an Nvidia GeForce GTX 1080 Ti graphics card using the TensorFlow framework. We choose 10 categories (apple, book, car, dog, elephant, fish, guitar, house, knife, and lightbulb) to validate the model and to verify whether it can handle sketches that share similar features. The optimizer is stochastic gradient descent with a learning rate of 0.05 and a batch size of 600. Each category in the dataset is randomly split into training data and test data in a 20:1 ratio. We use SoftMax cross-entropy as the loss function, defined as (2):

$$\text{cross entropy loss} = -\sum_i y_i^{gt} \log y_i \tag{2}$$

Table 2. The reduction model structures
Layer                 Size-in    Size-out   Kernel       Param   FLOPS
Conv1                 28x28x1    28x28x16   3x3x1,16     144     113K
Conv2_depth           28x28x16   28x28x16   3x3x16,1     144     113K
Conv2_point           28x28x16   28x28x16   1x1x16,16    256     200K
Pool1                 28x28x16   14x14x16   2x2          -       -
Conv3_depth           14x14x16   14x14x16   3x3x16,1     144     28K
Conv3_point           14x14x16   14x14x32   1x1x16,32    512     100K
Pool2                 14x14x32   7x7x32     2x2          -       -
Conv4_depth           7x7x32     7x7x32     3x3x1,32     288     14K
Conv4_point           7x7x32     7x7x64     1x1x32,64    2048    100K
Conv5_depth           7x7x64     5x5x64     3x3x1,64     576     28K
Conv5_point           5x5x64     5x5x64     1x1x64,64    4096    200K
Conv6_depth           5x5x64     3x3x64     3x3x64,64    576     14K
Conv6_point           3x3x64     3x3x64     1x1x64,64    4096    100K
Conv7 (10 classes)    3x3x64     1x1x10     3x3x64,10    5760    51K
Total (10 classes)                                       18640   1M
Conv7 (100 classes)   3x3x64     1x1x100    3x3x64,100   57600   510K
Total (100 classes)                                      70480   1.5M
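The parameter columns of Tables 1 and 2 and the ratio in (1) can be checked with a few lines of arithmetic. The following is a minimal C sketch, not the authors' code: the function names are hypothetical, only the layer shapes are taken from the tables, and bias terms are assumed to be absent because the listed counts match weight-only totals.

```c
#include <stddef.h>

/* Hypothetical helper names; only the layer shapes come from Tables 1 and 2. */

/* Weights in a standard k x k convolution with c_in inputs and c_out filters. */
long standard_conv_params(long k, long c_in, long c_out)
{
    return k * k * c_in * c_out;
}

/* Weights in a depth-wise k x k convolution followed by a 1 x 1 point-wise one. */
long separable_conv_params(long k, long c_in, long c_out)
{
    return k * k * c_in + c_in * c_out;
}

/* Cost ratio from (1): 1/N + 1/Dk^2, with N filters and kernel size Dk. */
double separable_cost_ratio(long n_filters, long kernel_size)
{
    return 1.0 / (double)n_filters
         + 1.0 / (double)(kernel_size * kernel_size);
}

/* Sum the 10-class model: Conv1 and Conv7 stay standard, Conv2..Conv6 are
 * either standard (Table 1) or depth-wise separable (Table 2). */
long model_params(int use_separable)
{
    static const long c_in[]  = {16, 16, 32, 64, 64};  /* Conv2..Conv6 inputs  */
    static const long c_out[] = {16, 32, 64, 64, 64};  /* Conv2..Conv6 filters */
    long total = standard_conv_params(3, 1, 16);        /* Conv1 */
    for (size_t i = 0; i < sizeof c_in / sizeof c_in[0]; i++) {
        total += use_separable ? separable_conv_params(3, c_in[i], c_out[i])
                               : standard_conv_params(3, c_in[i], c_out[i]);
    }
    return total + standard_conv_params(3, 64, 10);     /* Conv7 */
}
```

With these shapes, model_params(0) gives 104976 and model_params(1) gives 18640, matching the totals of Tables 1 and 2. For Conv2 (N = 16, D_k = 3), the separable layer has 400 weights against 2304, which equals the ratio 1/16 + 1/9 from (1).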

Table 3. The confusion matrix of 10 categories
categories  apple  book   car    dog    elephant  fish   guitar  house  knife  lightbulb
Apple       0.978  0      0.008  0.001  0         0      0.002   0      0      0.002
Book        0.002  0.976  0.006  0      0.006     0      0.002   0.004  0.004  0
Car         0.004  0      0.984  0.004  0.005     0.002  0       0      0.002  0.002
Dog         0      0      0.012  0.906  0.064     0.006  0.004   0.002  0.006  0
Elephant    0.004  0.002  0.01   0.088  0.884     0.002  0.002   0.002  0.004  0.002
Fish        0.006  0.002  0.002  0.018  0.002     0.962  0.002   0      0.006  0
Guitar      0.002  0      0.002  0.006  0.006     0.004  0.95    0      0.018  0.012
House       0.002  0.004  0      0.002  0         0      0       0.984  0.008  0
Knife       0.002  0.006  0.006  0.008  0.002     0.004  0.006   0.008  0.954  0.004
Lightbulb   0.006  0.002  0.002  0.002  0.002     0.002  0.004   0.002  0.02   0.958

where $y_i^{gt}$ is the ground truth label of the data, $y_i$ is the prediction result of the model, and $i$ indexes the data in a batch. We choose ReLU as the activation function for all layers except the last layer, because it performs better on classification problems than tanh or sigmoid.

The confusion matrix of the 10-category classification is shown in Table 3. The accuracy is 98% for 10-category classification and 82% for 100-category classification.

When implementing the convolution framework on an embedded system, we need to fit the interface and requirements of the development board. The images in the dataset are 28 x 28 pixels, while the touch panel is 800 x 480 pixels. Therefore, the drawing area is designed to be 280 x 280 pixels, and bilinear interpolation is used to scale it to 28 x 28 pixels.

Two different convolution behaviors are written in C: one is the traditional two-dimensional convolution and the other is the depth-wise convolution. The pseudo code of a convolutional layer is given in Fig 2, and that of the depth-wise convolution layer in Fig 3. In addition to the maximum pooling layer and the activation function layer, the point-wise layer can be implemented with a traditional convolution layer by setting the kernel size to 1 x 1.

IV. CONCLUSION
In this paper, we presented a design for a sketch classifier technique with deep learning models. The main goal is to achieve the effect of edutainment and help children learn. We use a deep learning model and reduce its parameters to achieve a high-performance result. Additionally, we complete the full system design on the embedded system and present a real-time demonstration. We provide 10 categories for classification. The recognition rate is above 98% on average.

Because of the limited precision of the touch panel on the development board, sketches are hard to draw, but this can be solved when the system is implemented on a product with a larger drawing panel and a stylus.

In future work, we will try different network structures that can handle the full 345-category classification. By adding the order of the drawing strokes, the precision of the classifier can be improved.

    for (k = 0; k < kernel_width; k++)
      for (j = 0; j < kernel_height; j++)
        for (i = 0; i < kernel_filter; i++)
          for (x = 0; x < image_width; x++)
            for (y = 0; y < image_height; y++)
              for (c = 0; c < image_channel; c++)
                output[y][x][i] += input[y+j][x+k][c] * kernel[i][j][k][c];

Fig 2. Pseudo code of a convolution layer

    for (k = 0; k < kernel_width; k++)
      for (j = 0; j < kernel_height; j++)
        for (x = 0; x < image_width; x++)
          for (y = 0; y < image_height; y++)
            for (c = 0; c < image_channel; c++)
              output[y][x][c] += input[y+j][x+k][c] * kernel[j][k][c];

Fig 3. Pseudo code of a depth-wise convolution layer
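The loop nests of Fig 2 and Fig 3 can be turned into compilable C. The following is a minimal sketch under stated assumptions rather than the code running on the board: the function names, the flattened row-major `[height][width][channel]` layout, and the "valid" output size (no padding, as in the 7x7 to 5x5 layers of the tables) are all choices made here for illustration. Layers that keep the spatial size would need the input zero-padded first.

```c
#include <stddef.h>

/* Depth-wise 'valid' convolution: one kh x kw kernel per channel.
 * Layouts (an assumption of this sketch): input[h][w][c] flattened row-major,
 * kernel[kh][kw][c], output[h-kh+1][w-kw+1][c]. */
void depthwise_conv(const float *input, int h, int w, int c,
                    const float *kernel, int kh, int kw, float *output)
{
    int oh = h - kh + 1, ow = w - kw + 1;
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++)
            for (int ch = 0; ch < c; ch++) {
                float acc = 0.0f;
                for (int j = 0; j < kh; j++)
                    for (int k = 0; k < kw; k++)
                        acc += input[((y + j) * w + (x + k)) * c + ch]
                             * kernel[(j * kw + k) * c + ch];
                output[(y * ow + x) * c + ch] = acc;
            }
}

/* Point-wise (1 x 1) convolution: mixes c_in channels into c_out filters.
 * kernel[c_out][c_in], output[h][w][c_out]. */
void pointwise_conv(const float *input, int h, int w, int c_in,
                    const float *kernel, int c_out, float *output)
{
    for (int p = 0; p < h * w; p++)
        for (int f = 0; f < c_out; f++) {
            float acc = 0.0f;
            for (int ch = 0; ch < c_in; ch++)
                acc += input[p * c_in + ch] * kernel[f * c_in + ch];
            output[p * c_out + f] = acc;
        }
}
```

Calling depthwise_conv and then pointwise_conv on its output gives one depth-wise separable layer. Accumulating into a local variable and writing each output once avoids having to zero the output buffer beforehand, which the `+=` form of the pseudo code implicitly requires.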


REFERENCES
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet
Classification with Deep Convolutional Neural Networks. In NIPS,
2012.
[2] MNIST http://yann.lecun.com/exdb/mnist/ [Online; accessed 19-
January-2019]
[3] Altman, N. S. (1992). "An introduction to kernel and nearest-
neighbor nonparametric regression". The American Statistician. 46
(3): 175–185.
[4] Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks."
Machine learning 20.3 (1995): 273-297.
[5] Schölkopf, B., Smola, A. J., Williamson, R. C., et al., "New Support
Vector Algorithms", Neural Computation, vol. 12, no. 5, pp. 1207-1245,
2000.
[6] Quick, Draw! The Data. https://quickdraw.withgoogle.com/data
[Online; accessed 19-January-2019]
[7] He, Kaiming, et al. "Deep residual learning for image recognition."
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2016.
[8] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997
[9] Courbariaux, Matthieu and Bengio, Yoshua. Binarynet: Training
deep neural networks with weights and activations constrained to
+1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[10] Rastegari, Mohammad, Ordonez, Vicente, Redmon, Joseph, and
Farhadi, Ali. Xnor-net: Imagenet classification using binary
convolutional neural networks. arXiv preprint arXiv:1603.05279,
2016
[11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T.
Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient
convolutional neural networks for mobile vision applications,"
arXiv:1704.04861 [cs], Apr 2017.

Fig 4. The demonstration on the development board
