Abstract—Recent deep learning based approaches have achieved great success on handwriting recognition. Chinese characters are among the most widely adopted writing systems in the world. Previous research has mainly focused on recognizing

Most of them can be well recognized by most people; however, nowadays it is becoming more and more difficult for people to write them correctly, due to the overuse of keyboard or touch-screen based input methods. Compared with reading, writing
arXiv:1606.06539v1 [cs.CV] 21 Jun 2016
Fig. 1. Illustration of three online handwritten Chinese characters. Each color represents a stroke and the numbers indicate the writing order. The purpose of
this paper is to automatically recognize and draw (generate) real and cursive Chinese characters under a single framework based on recurrent neural networks.
proposed such as NADE [20], variational auto-encoder [21], DRAW [22], and so on. To better model the generating process, the generative adversarial network (GAN) [23] simultaneously trains a generator to capture the data distribution and a discriminator to distinguish real and generated samples in a min-max optimization framework. Under this framework, high-quality images can be generated with the LAPGAN [24] and DCGAN [25] models, which are extensions of the original GAN.

Recently, it was shown by [26] that realistic-looking Chinese characters can be generated with DCGAN. However, the generated characters are offline images which ignore the handwriting dynamics (temporal order and trajectory). To automatically generate the online (dynamic) handwriting trajectory, the recurrent neural network (RNN) with LSTM was shown to be very effective for English online handwriting generation [4]. The contribution of this paper is to study how to extend and adapt this technique to Chinese character generation, considering the differences between English and Chinese handwriting habits and the large number of categories of Chinese characters. As shown by [27], fake and regularly-written Chinese characters can be generated under the LSTM-RNN framework. However, a more interesting and challenging problem is the generation of real (readable) and cursive handwritten Chinese characters.

To reach this goal, we propose a conditional RNN-based generative model (equipped with GRUs or LSTMs) to automatically draw human-readable cursive Chinese characters. The character embedding is jointly trained with the generative model. Therefore, given a character class, different samples (belonging to the given class but with different writing styles) can be automatically generated by the RNN model conditioned on the embedding. In this paper, the tasks of automatically drawing and recognizing Chinese characters are both completed with RNNs, seen as either generative or discriminative models. Therefore, to verify the quality of the generated characters, we can feed them into the pre-trained discriminative RNN model to see whether they can be correctly classified or not. We find that most of the generated characters can be automatically recognized with high accuracy, which verifies the effectiveness of the proposed method in generating real and cursive Chinese characters.

The rest of this paper is organized as follows. Section II introduces the representation of online handwritten Chinese characters. Section III describes the discriminative RNN model for end-to-end recognition of handwritten Chinese characters. Section IV reports the experimental results on the ICDAR-2013 competition database. Section V details the generative RNN model for drawing recognizable Chinese characters. Section VI shows examples and analyses of the generated characters. Finally, Section VII draws the concluding remarks.

II. REPRESENTATION FOR ONLINE HANDWRITTEN CHINESE CHARACTER

Different from the static image based representation of offline handwritten characters, rich dynamic (spatial and temporal) information can be collected in the writing process of online handwritten characters, which can be represented as a variable-length sequence:

[[x1, y1, s1], [x2, y2, s2], ..., [xn, yn, sn]],   (1)

where xi and yi are the xy-coordinates of the pen movements and si indicates which stroke point i belongs to. As shown in Fig. 1, Chinese characters usually contain multiple strokes, and each stroke is produced by numerous points. Besides the character shape information, the writing order is also preserved in the online sequential data; this order is valuable and very hard to recover from a static image. Therefore, to capture the dynamic information for increasing recognition accuracy, and also to improve the naturalness of the generated characters, we directly make use of the raw sequential data rather than transforming them into an image-like representation.

A. Removing Redundant Points

Different people have different handwriting habits (e.g., regular, fluent, cursive, and so on), resulting in significantly different numbers of sampling points, even when they write the same character. To remove the redundant points, we propose a simple strategy to preprocess the sequence. Consider a particular point (xi, yi, si). We assume si = si−1 = si+1; otherwise, point i is the starting or ending point of a stroke, which is always preserved. Whether to remove point i depends on two conditions. The first condition is based on the distance of this point from its former point:

√((xi − xi−1)² + (yi − yi−1)²) < Tdist.   (2)

As shown in Fig. 2(a), if point i is too close to point i − 1, it should be removed. Moreover, point i should also be removed if it lies on a straight line connecting points i − 1 and i + 1. Let
Fig. 2. (a) Removing of redundant points. (b) Coordinate normalization. (c) Character before preprocessing. (d) Character after preprocessing.
Δxi = xi+1 − xi and Δyi = yi+1 − yi; the second condition is based on the cosine similarity:

(Δxi−1 Δxi + Δyi−1 Δyi) / ((Δxi−1² + Δyi−1²)^0.5 (Δxi² + Δyi²)^0.5) > Tcos.   (3)

If either of the conditions in Eq. (2) or Eq. (3) is satisfied, point i is removed. With this preprocessing, the shape information of the character is still well preserved, but each item (point) in the new sequence becomes more informative, since the redundant points have been removed.

B. Coordinate Normalization

Another influence comes from the variations in size or absolute values of the coordinates for characters captured with different devices or written by different people. Therefore, we must normalize the xy-coordinates into a standard interval. Specifically, as shown in Fig. 2(b), consider a straight line L connecting two points (x1, y1) and (x2, y2); the projections of this line onto the x-axis and y-axis are

px(L) = ∫L x dL = (1/2) len(L)(x1 + x2),
py(L) = ∫L y dL = (1/2) len(L)(y1 + y2),   (4)

where len(L) = √((x2 − x1)² + (y2 − y1)²) denotes the length of L. With this information, we can estimate the mean values by projecting all lines onto the x-axis and y-axis:

µx = ΣL∈Ω px(L) / ΣL∈Ω len(L),   µy = ΣL∈Ω py(L) / ΣL∈Ω len(L),   (5)

where Ω represents the set of all straight lines that connect two successive points within the same stroke. After this, we estimate the deviation (from the mean) of the projections:

dx(L) = ∫L (x − µx)² dL = (1/3) len(L) [(x2 − µx)² + (x1 − µx)² + (x1 − µx)(x2 − µx)].   (6)

The standard deviation on the x-axis can then be estimated as:

δx = √( ΣL∈Ω dx(L) / ΣL∈Ω len(L) ).   (7)

With µx, µy and δx estimated from one character, we can now normalize the coordinates by:

xnew = (x − µx)/δx,   ynew = (y − µy)/δx.   (8)

This normalization is applied globally to all the points in the character. Note that we do not estimate the standard deviation on the y-axis, and the y-coordinate is also normalized by δx. The reason for doing so is to keep the original ratio of height and width of the character, and also to keep the writing direction of each stroke. After coordinate normalization, each character is placed in a standard xy-coordinate system, while the shape of the character is kept unchanged.

C. Illustration

The characters before and after preprocessing are illustrated in Fig. 2(c) and (d) respectively. The character shape is well preserved, while many redundant points have been removed: the original character contains 96 points, while the processed character only has 44. This makes each point more informative, which benefits RNN modeling not just in terms of speed but also because the issue of long-term dependencies [28] is reduced as sequences become shorter. Moreover, as shown in Fig. 2(d), the coordinates of the new character are normalized. In the new coordinate system, the position (0, 0) is located in the central part of the character, and the deviations along the xy-axes are also normalized. Since in this paper we use a sequence-based representation, the preprocessing used here is different from traditional methods designed for image-based representations, such as equidistance sampling [3] and character shape normalization [29].

III. DISCRIMINATIVE MODEL: END-TO-END RECOGNITION WITH RECURRENT NEURAL NETWORK

The best established approaches for recognizing Chinese characters are based on transforming the sequence of Eq. (1) into some image-like representation [12], [13], [14] and then applying a convolutional neural network (CNN). For the purpose of fully end-to-end recognition, we apply a recurrent neural network (RNN) directly to the raw sequential data. Due to better utilization of temporal and spatial information, our RNN approach can achieve higher accuracy than previous CNN and image-based models.

A. Representation for Recognition

From the sequence of Eq. (1) (after preprocessing), we extract a six-dimensional representation for each straight line Li connecting two points i and i + 1:

Li = [xi, yi, Δxi, Δyi, I(si = si+1), I(si ≠ si+1)],   (9)

where Δxi = xi+1 − xi, Δyi = yi+1 − yi, and I(·) = 1 when the condition is true and zero otherwise. In each Li, the first two terms are the start position of the line, the 3rd-4th terms are the direction of the pen movement, and the last two terms indicate the status of the pen, i.e., [1, 0] means pen-down while [0, 1] means pen-up. With this representation, the character in Eq. (1) is transformed to a new sequence [L1, L2, ..., Ln−1]. To simplify the notations used in the following subsections, we will use [x1, x2, ..., xk] to denote a general sequence, but note that each item xi here is actually the six-dimensional vector shown in Eq. (9).

B. Recurrent Neural Network (RNN)

The RNN is a natural generalization of feedforward neural networks to sequences [30]. Given a general input sequence [x1, x2, ..., xk] where xi ∈ Rd (different samples may have different sequence lengths k), at each time-step of RNN modeling a hidden state is produced, resulting in a hidden sequence [h1, h2, ..., hk]. The activation of the hidden state at time-step t is computed as a function f of the current input xt and the previous hidden state ht−1, i.e., ht = f(xt, ht−1).

Fig. 3. The stacked bidirectional RNN for end-to-end recognition.
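The preprocessing of Section II and the line features of Eq. (9) can be sketched as follows. This is a minimal reading of Eqs. (2), (3) and (9); the function and variable names are ours, and as a simplification the predecessor in the two removal tests is taken from the already-filtered sequence rather than from the raw one:

```python
import math

def remove_redundant(points, t_dist, t_cos):
    """Drop interior points that are too close to their predecessor (Eq. (2))
    or that lie on a near-straight line through their neighbours (Eq. (3)).
    `points` is a list of (x, y, s) tuples, s being the stroke index."""
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        (xp, yp, sp), (x, y, s) = kept[-1], points[i]
        xn, yn, sn = points[i + 1]
        if not (sp == s == sn):            # stroke starts/ends are always preserved
            kept.append(points[i])
            continue
        dist = math.hypot(x - xp, y - yp)  # Eq. (2): too close to the former point
        dx0, dy0, dx1, dy1 = x - xp, y - yp, xn - x, yn - y
        norm = math.hypot(dx0, dy0) * math.hypot(dx1, dy1)
        cos = (dx0 * dx1 + dy0 * dy1) / norm if norm > 0 else 1.0
        if dist < t_dist or cos > t_cos:   # Eq. (3): nearly collinear -> redundant
            continue
        kept.append(points[i])
    kept.append(points[-1])
    return kept

def line_features(points):
    """Six-dimensional line features of Eq. (9), one per consecutive point pair."""
    feats = []
    for (x, y, s), (xn, yn, sn) in zip(points, points[1:]):
        pen = [1, 0] if s == sn else [0, 1]   # pen-down vs pen-up indicator
        feats.append([x, y, xn - x, yn - y] + pen)
    return feats
```

For example, three collinear interior points of one stroke collapse onto their endpoints, while a point at a stroke boundary survives both tests.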
[x1, x2, x3, x4, x5, x6]   [x1, x3, x5, x6]

is only shown once. In the testing process, two strategies can be used. First, we can directly feed the full sequence into the RNN for classification. Second, we can also apply sequential dropout to obtain multiple sub-sequences, and then make an ensemble-based decision by fusing the classification results from these sub-sequences. The comparison of these two approaches is discussed in the experimental section.
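The second test-time strategy can be sketched as follows; `classify` stands in for the trained network's softmax output, the names are ours, and the drop probability 0.3 matches the training-time value quoted later in this section:

```python
import random

def drop_subsequence(seq, p=0.3, rng=random):
    """Sequential dropout: keep each item of the input sequence with prob 1 - p."""
    kept = [x for x in seq if rng.random() >= p]
    return kept if kept else seq[:1]   # guard against dropping everything

def ensemble_predict(classify, seq, n_subseq=30, p=0.3, rng=random):
    """Average the class probabilities over n_subseq random sub-sequences."""
    probs = None
    for _ in range(n_subseq):
        p_i = classify(drop_subsequence(seq, p, rng))
        probs = p_i if probs is None else [a + b for a, b in zip(probs, p_i)]
    return [a / n_subseq for a in probs]
```

Averaging the per-sub-sequence probabilities (rather than hard votes) is the fusion rule described in the text.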
TABLE I
COMPARISON OF DIFFERENT NETWORK ARCHITECTURES FOR ONLINE HANDWRITTEN CHINESE CHARACTER RECOGNITION.

Name Architecture Recurrent Type Memory Train Time Test Speed Train Acc. Test Acc.
NET1 6 → [500] → 200 → 3755 LSTM 11.00MB 95.16h 0.3792ms 98.07% 97.67%
NET2 6 → [500] → 200 → 3755 GRU 9.06MB 75.92h 0.2949ms 97.81% 97.71%
NET3 6 → [100, 500] → 200 → 3755 LSTM 12.76MB 125.43h 0.5063ms 98.21% 97.70%
NET4 6 → [100, 500] → 200 → 3755 GRU 10.38MB 99.77h 0.3774ms 97.87% 97.76%
NET5 6 → [100, 300, 500] → 200 → 3755 LSTM 19.48MB 216.75h 0.7974ms 98.67% 97.80%
NET6 6 → [100, 300, 500] → 200 → 3755 GRU 15.43MB 168.80h 0.6137ms 97.97% 97.77%
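The memory column of Table I can be roughly reproduced by counting parameters. The sketch below assumes the standard GRU parameterization (three gates, each with an input matrix, a recurrent matrix and a bias) and bidirectional layers whose two directions are concatenated; these counting conventions are ours and need not match the paper's memory accounting exactly:

```python
def gru_params(n_in, n_hid):
    # reset gate, update gate and candidate state: each has an
    # input-to-hidden matrix, a hidden-to-hidden matrix and a bias vector
    return 3 * (n_in * n_hid + n_hid * n_hid + n_hid)

def bi_stack_params(n_in, hiddens, n_full, n_classes):
    """Parameter count for a stacked bidirectional GRU classifier written
    in the notation of Table I, e.g. NET2 is 6 -> [500] -> 200 -> 3755."""
    total, fan_in = 0, n_in
    for h in hiddens:
        total += 2 * gru_params(fan_in, h)  # forward + backward direction
        fan_in = 2 * h                      # concatenated outputs feed the next layer
    total += fan_in * n_full + n_full       # fully-connected layer after mean pooling
    total += n_full * n_classes + n_classes # logistic-regression (softmax) layer
    return total

# NET2 comes out at about 2.48M parameters, i.e. on the order of 9-10MB
# at 4 bytes each, in line with the 9.06MB reported in Table I.
net2 = bi_stack_params(6, [500], 200, 3755)
```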
and Tcos = 0.99, where H is the height and W is the width of the character. After preprocessing, the average length of the sequences for each character is about 50. As shown in Fig. 3, to increase the generalization performance, dropout is used for the mean pooling layer and the full layer with probability 0.1. Moreover, dropout (with probability 0.3) is also used on the input sequence for data augmentation, as described in Section III-F. The initialization of the network is described in Section III-G. The optimization algorithm is Adam [40] with mini-batch size 1000. The initial learning rate is set to 0.001 and then decreased by a factor of 0.3 when the cost or accuracy on the training data stops improving. After each epoch, we shuffle the training data to form different mini-batches. All the models were implemented on the Theano [44], [45] platform using an NVIDIA Titan-X 12G GPU.

TABLE II
TEST ACCURACIES (%) OF ENSEMBLE-BASED DECISIONS FROM SUB-SEQUENCES GENERATED BY RANDOM DROPOUT.

              Ensemble of Sub-Sequences
Name   Full     1      5     10     15     20     30
NET1  97.67  96.53  97.74  97.82  97.84  97.84  97.86
NET2  97.71  96.56  97.77  97.84  97.86  97.85  97.89
NET3  97.70  96.56  97.71  97.82  97.84  97.85  97.86
NET4  97.76  96.54  97.78  97.86  97.87  97.88  97.89
NET5  97.80  96.79  97.82  97.91  97.93  97.94  97.96
NET6  97.77  96.64  97.79  97.87  97.88  97.89  97.91
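The learning-rate schedule described above (decrease by a factor of 0.3 when training stops improving) can be sketched as follows; the `patience` parameter is our assumption, since the text does not say how many stalled epochs trigger a decay:

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` whenever the monitored
    training accuracy stops improving."""
    def __init__(self, lr=0.001, factor=0.3, patience=1):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, train_acc):
        # call once per epoch with the current training accuracy
        if train_acc > self.best:
            self.best, self.bad_epochs = train_acc, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```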
TABLE III
RESULTS ON ICDAR-2013 COMPETITION DATABASE OF ONLINE HANDWRITTEN CHINESE CHARACTER RECOGNITION.

Methods                        Ref.  Memory    Accuracy
Human Performance              [19]  n/a       95.19%
Traditional Benchmark          [46]  120.0MB   95.31%
ICDAR-2011 Winner: VO-3        [42]  41.62MB   95.77%
ICDAR-2013 Winner: UWarwick    [19]  37.80MB   97.39%
ICDAR-2013 Runner-up: VO-3     [19]  87.60MB   96.87%
DropSample-DCNN                [14]  15.00MB   97.23%
DropSample-DCNN-Ensemble-9     [14]  135.0MB   97.51%
RNN: NET4                      ours  10.38MB   97.76%
RNN: NET4-subseq30             ours  10.38MB   97.89%
RNN: Ensemble-NET123456        ours  78.11MB   98.15%

are not significant and also vanish when more layers are stacked. This is because the recurrent units maintain activations for each time-step, which already makes the RNN model extremely deep; therefore, stacking more layers does not bring much additional discriminative ability to the model. Moreover, as shown in Table I, with more stacked recurrent layers, both the training and testing times increase dramatically. Therefore, we did not consider more than three stacked recurrent layers. From the perspectives of both accuracy and efficiency, in real applications, NET4 is preferred among the six different network architectures.

F. Ensemble-based Decision from Sub-sequences

In our experiments, the dropout [35] strategy is applied not only as a regularization (Fig. 3) but also as a data augmentation method (Section III-F). As shown in Fig. 4, in the testing process, we can still apply dropout on the input sequence to generate multiple sub-sequences, and then make ensemble-based decisions to further improve the accuracy. Specifically, the probabilities (outputs from one network) of each sub-sequence are averaged to make the final prediction.

Table II reports the results for ensemble-based decisions from sub-sequences. It is shown that with only one sub-sequence, the accuracy is inferior to that of the full sequence. This is easy to understand, since there is information loss in each sub-sequence. However, as more and more randomly sampled sub-sequences are involved in the ensemble, the classification accuracies gradually improve. Finally, with the ensemble of 30 sub-sequences, the accuracies of the different networks become consistently higher than the full-sequence based prediction. These results verify the effectiveness of using dropout for ensemble-based sequence classification.

G. Comparison with Other State-of-the-art Approaches

To compare our method with other approaches, Table III lists the state-of-the-art performance achieved by previous works on the ICDAR-2013 competition database [19]. It is shown that the deep learning based approaches outperform the traditional methods by large margins. In the ICDAR-2011 competition, the winner was Vision Objects Ltd. from France, using a multilayer perceptron (MLP) classifier. In ICDAR-2013, the winner was from the University of Warwick, UK, using the path signature feature map and a deep convolutional neural network [13]. Recently, the state-of-the-art performance has been achieved by [14] with domain-specific knowledge and an ensemble of nine convolutional neural networks.

As revealed in Table III and Table I, all of our models (from NET1 to NET6) easily outperform previous benchmarks. Taking NET4 as an example, compared with other approaches, it is better in terms of both memory consumption and classification accuracy. The previous best performance has usually been achieved with convolutional neural networks (CNNs), which have the particular requirement of transforming the online sequential handwriting data into some image-like representation [12], [13], [14]. On the contrary, our discriminative RNN model deals directly with the raw sequential data and therefore has the potential to exploit additional information which is discarded in the spatial representations. Moreover, our method is also fully end-to-end, depending only on generic priors about sequential data processing and not requiring any other domain-specific knowledge. These results suggest that, compared with CNNs, RNNs should be the first choice for online handwriting recognition, due to their powerful ability in sequence processing and the natural sequential property of online handwriting.

As shown in Table III, the ensemble of 30 sub-sequences with NET4 (NET4-subseq30), drawn from different random runs of sequential dropout (as discussed in Section III-F), can further improve the performance of NET4. An advantage of this ensemble is that only one trained model is required, which saves time and memory resources in comparison with usual ensembles where multiple models must be trained. However, a drawback is that the number of randomly sampled sub-sequences should be large enough to guarantee the ensemble performance, which is time-consuming at evaluation. Another commonly used type of ensemble is obtained by model averaging over separately trained models. The classification performance obtained by combining the six pre-trained networks (NET1 to NET6) is shown in the last row of Table III. Due to the differences in network depths and recurrent types, the six networks are complementary to each other. Finally, with this kind of ensemble, this paper reaches an accuracy of 98.15%, which is a new state-of-the-art and significantly outperforms all previously reported results for online handwritten Chinese character recognition.

V. GENERATIVE MODEL: AUTOMATIC DRAWING OF RECOGNIZABLE CHINESE CHARACTERS

Given an input sequence x and the corresponding character class y, the purpose of a discriminative model (as described in Section III) is to learn p(y|x). On the other hand, the purpose of a generative model is to learn p(x), or p(x|y) for a conditional generative model. In other words, by modeling the distribution of the sequences, the generative model can be used to draw (generate) new handwritten characters automatically.

Fig. 5. For the time-step t in the generative RNN model: (a) illustration of the training process, and (b) illustration of the drawing/generating process.
A. Representation for Generation

Compared with the representation used in Section III-A, the representation for generating characters is slightly different. Motivated by [4], each character can be represented as:

[[d1, s1], [d2, s2], ..., [dk, sk]],   (23)

where di = [Δxi, Δyi] ∈ R² is the pen moving direction, which can be viewed as a straight line. We can draw the character with multiple lines by concatenating [d1, d2, ..., dk], i.e., the ending position of the previous line is the starting position of the current line. Since one character usually contains multiple strokes, each line di may be either pen-down (should be drawn on the paper) or pen-up (should be ignored). Therefore, another variable si ∈ R³ is used to represent the status of the pen. As suggested by [27], three states should be considered:

si = [1, 0, 0] (pen-down), [0, 1, 0] (pen-up), [0, 0, 1] (end-of-char).   (24)

(Note that the symbol si used here has a different meaning from the si used in Eq. (1); the si used here can be easily deduced from Eq. (1).)

With the end-of-char value of si, the RNN can automatically decide when to finish the generating process. Using the representation in Eq. (23), the character can be drawn in vector format, which is more plausible and natural than a bitmap image.

B. Conditional Generative RNN Model

To model the distribution of the handwriting sequence, a generative RNN model is utilized. Considering that there are a large number of different Chinese characters, and in order to generate real and readable characters, the character embedding is trained jointly with the RNN model. The character embedding is a matrix E ∈ Rd×N, where N is the number of character classes and d is the embedded dimensionality. Each column in E is the embedded vector for a particular class. In the following descriptions, we use c ∈ Rd to represent the embedding vector for a general character class.

Our previous experiments show that GRUs and LSTMs have comparable performance, but the computation of the GRU is more efficient. Therefore, we build our generative RNN model on GRUs [17] rather than LSTMs. As shown in Fig. 5, at time-step t, the inputs for a GRU include:

• the previous hidden state ht−1 ∈ RD,
• the current pen-direction dt ∈ R² and pen-state st ∈ R³,
• the character embedding c ∈ Rd.

Following the GRU gating strategy, the updating of the hidden state and the computation of the output for time-step t are:

d′t = tanh(Wd dt + bd),   (25)
s′t = tanh(Ws st + bs),   (26)
rt = sigm(Wr ht−1 + Ur d′t + Vr s′t + Mr c + br),   (27)
zt = sigm(Wz ht−1 + Uz d′t + Vz s′t + Mz c + bz),   (28)
h̃t = tanh(W (rt ⊙ ht−1) + U d′t + V s′t + M c + b),   (29)
ht = zt ⊙ ht−1 + (1 − zt) ⊙ h̃t,   (30)
ot = tanh(Wo ht + Uo d′t + Vo s′t + Mo c + bo),   (31)

where the W∗, U∗, V∗, M∗ are weight matrices, the b∗ are bias vectors, and ⊙ denotes element-wise multiplication. Since both the pen-direction dt and the pen-state st are low-dimensional, we first transform them into higher-dimensional spaces by Eqs. (25) and (26). After that, the reset gate rt is computed in Eq. (27) and the update gate zt in Eq. (28). The candidate hidden state in Eq. (29) is controlled by the reset gate rt, which can automatically decide whether to forget the previous state. The new hidden state ht in Eq. (30) is then updated as a combination of the previous and candidate hidden states, controlled by the update gate zt. At last, an output vector ot is calculated in Eq. (31). To improve the generalization performance, the dropout strategy [35] is also applied on ot.

In all these computations, the character embedding c is provided to remind the RNN that this is the drawing of a particular character rather than random scrawling. The dynamic writing information for this character is encoded in the hidden state of the GRU.
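One step of the conditional GRU of Eqs. (25)-(31) can be sketched in numpy as follows; the parameter dictionary keys mirror the symbols in the text, while the function name and calling convention are ours:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def gen_gru_step(P, h_prev, d_t, s_t, c):
    """One step of the conditional GRU of Eqs. (25)-(31). `P` maps names to
    weight matrices / bias vectors, keyed as in the text (Wd, ..., Wr, Ur, ..., bo)."""
    d = np.tanh(P["Wd"] @ d_t + P["bd"])                      # Eq. (25)
    s = np.tanh(P["Ws"] @ s_t + P["bs"])                      # Eq. (26)
    def gate(k):
        return (P["W" + k] @ h_prev + P["U" + k] @ d + P["V" + k] @ s
                + P["M" + k] @ c + P["b" + k])
    r = sigm(gate("r"))                                       # Eq. (27), reset gate
    z = sigm(gate("z"))                                       # Eq. (28), update gate
    h_cand = np.tanh(P["W"] @ (r * h_prev) + P["U"] @ d       # Eq. (29), candidate
                     + P["V"] @ s + P["M"] @ c + P["b"])
    h = z * h_prev + (1.0 - z) * h_cand                       # Eq. (30)
    o = np.tanh(P["Wo"] @ h + P["Uo"] @ d + P["Vo"] @ s       # Eq. (31), output
                + P["Mo"] @ c + P["bo"])
    return h, o
```

With all-zero weights both gates sit at 0.5, so the hidden state is simply halved at each step; this is a convenient sanity check of the gating arithmetic.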
C. GMM Modeling of Pen-Direction: From ot To dt+1

As suggested by [4], a Gaussian mixture model (GMM) is used for the pen-direction. Suppose there are M components in the GMM; a 5×M-dimensional vector is then calculated from the output vector ot as:

{π̂^j, µ̂x^j, µ̂y^j, δ̂x^j, δ̂y^j} (j = 1, ..., M) ∈ R^5M = Wgmm × ot + bgmm,   (32)
π^j = exp(π̂^j) / Σj′ exp(π̂^j′)  ⇒  π^j ∈ (0, 1), Σj π^j = 1,   (33)
µx^j = µ̂x^j  ⇒  µx^j ∈ R,   (34)
µy^j = µ̂y^j  ⇒  µy^j ∈ R,   (35)
δx^j = exp(δ̂x^j)  ⇒  δx^j > 0,   (36)
δy^j = exp(δ̂y^j)  ⇒  δy^j > 0.   (37)

Note that the above five variables are for time-step t + 1; we omit the subscript t + 1 for simplification. For the j-th component of the GMM, π^j denotes the component weight, µx^j and µy^j denote the means, while δx^j and δy^j are the standard deviations. The probability density Pd(dt+1) for the next pen-direction dt+1 = [Δxt+1, Δyt+1] is defined as:

Pd(dt+1) = Σj=1..M π^j N(dt+1 | µx^j, µy^j, δx^j, δy^j)
         = Σj=1..M π^j N(Δxt+1 | µx^j, δx^j) N(Δyt+1 | µy^j, δy^j),   (38)

where

N(x | µ, δ) = (1 / (δ√(2π))) exp(−(x − µ)² / (2δ²)).   (39)

Differently from [4], here for each mixture component the x-axis and y-axis are assumed to be independent, which simplifies the model and still gives similar performance compared with the full bivariate Gaussian model. Using a GMM for modeling the pen-direction can capture the dynamic information of different handwriting styles, and hence allows the RNN to generate diverse handwritten characters.

D. SoftMax Modeling of Pen-State: From ot To st+1

To model the discrete pen-states (pen-down, pen-up, or end-of-char), the softmax activation is applied on a transformation of ot to give a probability for each state:

[p̂1t+1, p̂2t+1, p̂3t+1] ∈ R³ = Wsoftmax × ot + bsoftmax,   (40)
pit+1 = exp(p̂it+1) / Σj=1..3 exp(p̂jt+1) ∈ (0, 1),  with Σi=1..3 pit+1 = 1.   (41)

The probability density Ps(st+1) for the next pen-state st+1 = [s1t+1, s2t+1, s3t+1] is then defined as:

Ps(st+1) = Σi=1..3 sit+1 pit+1.   (42)

With this softmax modeling, the RNN can automatically decide the status of the pen and also the ending time of the generating process, according to the dynamic changes in the hidden state of the GRU during the drawing/writing process.

Fig. 6. Illustration of the generating/drawing of one particular character in different epochs of the training process.

E. Training of the Generative RNN Model

To train the generative RNN model, a loss function should be defined. Given a character represented by the sequence in Eq. (23) and its corresponding character embedding c ∈ Rd, passing them through the RNN model as shown in Fig. 5(a), the final loss can be defined as the summation of the losses at each time-step:

loss = − Σt {log(Pd(dt+1)) + log(Ps(st+1))}.   (43)

However, as shown by [27], directly minimizing this loss function leads to poor performance, because the three pen-states in Eq. (24) do not occur equally often in the training process. The occurrence of pen-down is so frequent that it always dominates the loss, especially compared with the "end-of-char" state, which occurs only once per character. To reduce the influence of this imbalance problem, a cost-sensitive approach [27] should be used to define a new loss:

loss = − Σt {log(Pd(dt+1)) + Σi=1..3 wi sit+1 log(pit+1)},   (44)

where [w1, w2, w3] = [1, 5, 100] are the weights for the losses of pen-down, pen-up, and end-of-char respectively. In this way, the RNN can be trained effectively to produce real characters. Other strategies such as initialization and optimization are the same as in Section III-G.
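The axis-independent mixture density of Eqs. (38)-(39) and the cost-sensitive per-step loss of Eq. (44) can be sketched as follows (function names are ours; `pd` and `pen_probs` stand for the quantities produced by Sections V-C and V-D):

```python
import math

def gmm_density(dx, dy, pis, mus_x, mus_y, sds_x, sds_y):
    """Pd of Eq. (38): mixture of axis-independent 1-D Gaussians (Eq. (39))."""
    def n(x, mu, sd):
        return math.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))
    return sum(p * n(dx, mx, sx) * n(dy, my, sy)
               for p, mx, my, sx, sy in zip(pis, mus_x, mus_y, sds_x, sds_y))

def step_loss(pd, pen_state, pen_probs, weights=(1.0, 5.0, 100.0)):
    """Cost-sensitive loss of Eq. (44) for one time-step: negative log-likelihood
    of the pen-direction plus a weighted cross-entropy over the three pen-states,
    with the [1, 5, 100] weights for pen-down, pen-up and end-of-char."""
    ce = sum(w * s * math.log(p) for w, s, p in zip(weights, pen_state, pen_probs))
    return -(math.log(pd) + ce)
```

The weight of 100 on end-of-char makes its single occurrence per character count against the many pen-down steps, which is the imbalance fix described above.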
VI. EXPERIMENTS ON DRAWING CHINESE CHARACTERS

In this section, we show the generated characters visually, and analyze the quality of the generated characters by feeding them into the discriminative RNN model to check whether they are recognizable or not. Moreover, we also discuss properties of the character embedding matrix.

A. Database

To train the generative RNN model, we still use the CASIA database [43], including OLHWDB1.0 and OLHWDB1.1. There are more than two million training samples, and all the characters are written cursively with the frequently-used handwriting habits of different individuals. This is significantly different from the experiment in [27], where only 11,000 regularly-written samples are used for training. Each character is now represented by multiple lines, as shown in Eq. (23). The two hyper-parameters used in Section II for removing redundant information are Tdist = 0.05 × max{H, W} and Tcos = 0.9, where H is the height and W is the width of the character. After preprocessing, the average length of each sequence (character) is about 27. Note that the sequences used here are shorter than the sequences used for classification in Section IV. The reason for doing so is to make each line in Eq. (23) more informative and thus alleviate the influence of noise strokes in the generating process.

B. Implementation Details

Our generative RNN model is capable of drawing 3,755 different characters. The dimension of the character embedding (as shown in Section V-B) is 500. In Eqs. (25) and (26), both the low-dimensional pen-direction and pen-state are transformed to a 300-dimensional space. The dimension of the hidden state of the GRU is 1000; therefore, the dimensions of the vectors in Eqs. (27)-(30) are all 1000. The dimension of the output vector in Eq. (31) is 300, and the dropout probability applied on this output vector is 0.3. The number of mixture components in the GMM of Section V-C is 30. With these configurations, the generative RNN model is trained using Adam [40] with mini-batch size 500 and initial learning rate 0.001. With Theano [44], [45] and an NVIDIA Titan-X 12G GPU, the training of our generative RNN model took about 50 hours to converge.

C. Illustration of the Training Process

To monitor the training process, Fig. 6 shows the generated characters (for the first character among the 3,755 classes) in each epoch. In the very beginning, the model seems to be confused by the many character classes: in the first three epochs, the generated characters look like random mixtures (combinations) of different characters, which are impossible to read. By the 10th epoch, some initial structures can be found for this particular character. After that, as training continues, the generated characters become clearer and clearer. By the 50th epoch, all the generated characters can be easily recognized by a human with high confidence. Moreover, all the generated characters are cursive, and different handwriting styles can be found among them. This verifies the effectiveness of the training process for the generative RNN model. Another finding in the experiments is that the Adam [40] optimization algorithm works much better for our generative RNN model than traditional stochastic gradient descent (SGD) with momentum. With the Adam algorithm, our model converged within about 60 epochs.
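The step-by-step drawing procedure of Sections V-C and V-D can be sketched as follows. The `step` callable wraps the trained RNN (Eqs. (25)-(42)); its interface, the initial pen values and all names are our assumptions, not the paper's API:

```python
import random

def draw_character(step, h0, embedding, max_steps=200, rng=random):
    """Generate one character step-by-step: at each step, sample the next
    pen-direction from the predicted GMM and the next pen-state from the
    softmax probabilities, stopping when "end-of-char" is drawn."""
    h, d, s = h0, (0.0, 0.0), (1, 0, 0)   # assumed initial pen-down state
    trajectory = []
    for _ in range(max_steps):
        h, gmm, pen_probs = step(h, d, s, embedding)
        # sample a mixture component, then the pen-direction from it (Sec. V-C)
        j = rng.choices(range(len(gmm["pi"])), weights=gmm["pi"])[0]
        d = (rng.gauss(gmm["mu_x"][j], gmm["sd_x"][j]),
             rng.gauss(gmm["mu_y"][j], gmm["sd_y"][j]))
        # sample the pen-state from the softmax probabilities (Sec. V-D)
        k = rng.choices([0, 1, 2], weights=pen_probs)[0]
        s = tuple(int(i == k) for i in range(3))
        trajectory.append((d, s))
        if k == 2:          # "end-of-char": the model decides to stop drawing
            break
    return trajectory
```

Because both the direction and the state are sampled rather than taken as the mode, repeated calls with the same embedding yield different writing styles of the same character, as described in the text.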
Fig. 8. Illustration of the automatically generated characters for different classes. Each row represents a particular character class. To give a better illustration,
each color (randomly selected) denotes one straight line (pen-down) as shown in Eq. (23).
As shown in Section V-B, the generative RNN model is jointly trained with the character embedding matrix E ∈ Rd×N, which allows the model to generate characters according to the class indexes. We show the character embedding matrix (500 × 3755) in Fig. 7. Each column in this matrix is the embedded vector for a particular class. The goal of the embedding is to indicate to the RNN the identity of the character to be generated. Therefore, characters with similar writing trajectories (or similar shapes) are supposed to be close to each other in the embedded space. To verify this, we calculate the nearest neighbors of a character category according to the Euclidean distance in the embedded space. As shown in Fig. 7, the nearest neighbors of one character usually have a similar shape or share similar sub-structures with this character. Note that in the training process we did not utilize any between-class information; the objective of the model is just to maximize the generating probability conditioned on the embedding. The character relationships are automatically learned from the handwriting similarity of the characters. These results verify the effectiveness of the joint training of the character embedding and the generative RNN model, which together form a model of the conditional distribution p(x|y), where x is the handwritten trajectory and y is the character category.

With the character embedding, our RNN model can draw 3,755 different characters, by first choosing a column from the embedding matrix and then feeding it into every step of the generating process as shown in Fig. 5. To verify the ability of drawing different characters, Fig. 8 shows the automatically generated characters for nine different classes. All the generated characters are new and different from the training data. The generating/drawing is implemented randomly step-by-step, i.e., by randomly sampling the pen-direction from the GMM model as described in Section V-C, and updating the hidden states of the GRU according to the previously sampled handwriting trajectory as shown in Section V-B. Moreover, all the characters are automatically ended with the “end-of-char” pen-state as discussed in Section V-D, which means the RNN can automatically decide when and how to finish the writing/drawing process.

All the automatically generated characters are human-readable, and it is hard to distinguish them from real handwritten characters produced by human beings. The memory size of our generative RNN model is only 33.79 MB, yet it can draw as many as 3,755 different characters. This means we successfully transformed a large handwriting database (with more than two million samples) into a small RNN generator, from which we can sample an unlimited number of different characters.
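The step-by-step drawing procedure described above can be sketched as the following sampling loop (a simplified illustration: `step_fn` stands in for the trained GRU together with its GMM and pen-state output layers, which are not reproduced here, and `toy_step` is a made-up stand-in used only to exercise the loop):

```python
import numpy as np

def sample_gmm(weights, means, covs, rng):
    """Draw one 2-D pen-direction from a Gaussian mixture (Section V-C)."""
    k = rng.choice(len(weights), p=weights)
    return rng.multivariate_normal(means[k], covs[k])

def generate_character(step_fn, max_steps=200, seed=0):
    """Sample a trajectory step by step, stopping at 'end-of-char'."""
    rng = np.random.default_rng(seed)
    trajectory, hidden, prev = [], None, np.zeros(2)
    for _ in range(max_steps):
        hidden, (weights, means, covs), pen_probs = step_fn(hidden, prev)
        pen_state = rng.choice(3, p=pen_probs)  # 0 down, 1 up, 2 end-of-char
        if pen_state == 2:       # the model decides when to stop writing
            break
        prev = sample_gmm(weights, means, covs, rng)
        trajectory.append((prev, pen_state))
    return trajectory

# Toy stand-in network: fixed two-component GMM and pen-state probabilities.
def toy_step(hidden, prev):
    gmm = (np.array([0.5, 0.5]),
           np.array([[1.0, 0.0], [0.0, 1.0]]),
           np.array([np.eye(2) * 0.01, np.eye(2) * 0.01]))
    return None, gmm, np.array([0.85, 0.10, 0.05])

traj = generate_character(toy_step)
```

In the actual model, the sampled pen-direction is fed back into the GRU at the next step (together with the character embedding), which is what makes the drawing both random and self-consistent.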
Fig. 9. (a): The classification accuracies of the automatically generated characters for the 3,755 classes. (b): The generated characters which have low
recognition rates. (c): The generated characters which have perfect (100%) recognition rates.
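The per-class accuracies plotted in Fig. 9(a) are simple bookkeeping over the classifier's outputs. A minimal sketch (with made-up labels, not the actual 375,500-sample evaluation):

```python
import numpy as np

def per_class_accuracy(labels, predictions, num_classes):
    """Fraction of samples of each class that are correctly recognized,
    as plotted in Fig. 9(a); classes with no samples get accuracy 0."""
    correct = np.zeros(num_classes)
    total = np.zeros(num_classes)
    for y, y_hat in zip(labels, predictions):
        total[y] += 1
        correct[y] += float(y == y_hat)
    return correct / np.maximum(total, 1)

# Tiny example: class 0 is recognized one time in three, class 1 always.
acc = per_class_accuracy([0, 0, 0, 1, 1], [0, 1, 1, 1, 1], num_classes=2)
low = np.nonzero(acc < 0.5)[0]   # classes below 50%, as flagged in the text
```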
With the GMM modeling of the pen-direction, different handwriting habits can be covered in the writing process. As shown in Fig. 8, in each row (character), there are multiple handwriting styles, e.g., regular, fluent, and cursive. These results verify not only the ability of the model to draw recognizable Chinese characters but also the diversity of the generative model in handling different handwriting styles.

Nevertheless, we still note that the generated characters are not 100% perfect. As shown by the last few rows in Fig. 8, there are some missing strokes in the generated characters which make them hard to read. Therefore, we must find some method to estimate the quality of the generated characters in a quantitative manner.

F. Quality Analysis: Recognizable or Not?

To further analyze the quality of the generated characters, the discriminative RNN model in Section III is used to check whether the generated characters are recognizable or not. The architecture of NET4 in Table I is utilized due to its good performance in the recognition task. Both the discriminative and generative RNN models are trained with real handwritten characters from [43]. After that, for each of the 3,755 classes, we randomly generate 100 characters with the generative RNN model, resulting in 375,500 test samples, which are then fed into the discriminative RNN model for evaluation. The classification accuracies of the generated characters with respect to the different classes are shown in Fig. 9(a).

It is shown that for most classes, the generated characters can be automatically recognized with very high accuracy. This verifies the ability of our generative model to correctly write thousands of different Chinese characters. In the previous work of [27], an LSTM-RNN is used to generate fake but regular-written Chinese characters. In contrast, in this paper, our generative model is conditioned on the character embedding, and a large real handwriting database containing different handwriting styles is used for training. Therefore, the automatically generated characters in this paper are not only cursive but also readable by both human and machine.

The average classification accuracy over all the generated characters is 93.98%, which means most of the characters are correctly classified. However, compared with the classification accuracies for real characters shown in Table I, the recognition accuracy for the generated characters is still lower. As shown in Fig. 9(a), there are some particular classes with significantly low accuracies, i.e., below 50%. To check what is going on there, we show some generated characters with low recognition rates in Fig. 9(b). It is shown that the wrongly classified characters usually come from confusable classes, i.e., character classes which have only subtle differences in shape from another character class. In such cases, the generative RNN model was not capable of capturing these small but important details for the accurate drawing of the particular character.

On the contrary, as shown in Fig. 9(c), for the character classes which do not have any confusion with other classes, the generated characters can be easily classified with 100% accuracy. Therefore, to further improve the quality of the generated characters, we should pay more attention to the similar/confusable character classes. One solution is to modify the loss function in order to emphasize the training on the confusing class pairs, as suggested by [47]. Another strategy is to integrate the attention mechanism [48], [49] and the memory mechanism [50], [51] with the generative RNN model, allowing the model to dynamically memorize and focus on the critical region of a particular character during the writing process. These are future directions for further improving the quality of the generative RNN model.

VII. CONCLUSION AND FUTURE WORK

This paper investigates two closely-coupled tasks: automatically reading and writing. Specifically, the recurrent neural network (RNN) is used as both a discriminative and a generative model for recognizing and drawing cursive handwritten Chinese characters.
[26] “Generating offline Chinese characters with DCGAN,” 2015. [Online]. Available: http://www.genekogan.com/works/a-book-from-the-sky.html
[27] “Generating online fake Chinese characters with LSTM-RNN,” 2015. [Online]. Available: http://blog.otoro.net/2015/12/28/recurrent-net-dreams-up-fake-chinese-characters-in-vector-format-with-tensorflow
[28] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[29] C.-L. Liu and K. Marukawa, “Pseudo two-dimensional shape normalization methods for handwritten Chinese character recognition,” Pattern Recognition, vol. 38, no. 12, pp. 2242–2255, 2005.
[30] I. Goodfellow, A. Courville, and Y. Bengio, “Deep learning,” Book in press, MIT Press, 2016.
[31] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” Proc. Int’l Conf. Machine Learning (ICML), 2006.
[32] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” Proc. Advances in Neural Information Processing Systems (NIPS), 2014.
[33] M. Schuster and K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[34] D. Rumelhart, G. Hinton, and R. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
[35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[36] X. Bouthillier, K. Konda, P. Vincent, and R. Memisevic, “Dropout as data augmentation,” arXiv:1506.08700, 2015.
[37] W. Yang, L. Jin, and M. Liu, “Character-level Chinese writer identification using path signature feature, dropstroke and deep CNN,” Proc. Int’l Conf. Document Analysis and Recognition (ICDAR), 2015.
[38] ——, “DeepWriterID: An end-to-end online text-independent writer identification system,” arXiv:1508.04945, 2015.
[39] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” Proc. Int’l Conf. Machine Learning (ICML), 2015.
[40] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Proc. Int’l Conf. Learning Representations (ICLR), 2015.
[41] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, “Chinese handwriting recognition contest 2010,” Proc. Chinese Conf. Pattern Recognition (CCPR), 2010.
[42] C.-L. Liu, F. Yin, Q.-F. Wang, and D.-H. Wang, “ICDAR 2011 Chinese handwriting recognition competition,” Proc. Int’l Conf. Document Analysis and Recognition (ICDAR), 2011.
[43] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, “CASIA online and offline Chinese handwriting databases,” Proc. Int’l Conf. Document Analysis and Recognition (ICDAR), 2011.
[44] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, “Theano: New features and speed improvements,” NIPS Deep Learning Workshop, 2012.
[45] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: A CPU and GPU math expression compiler,” Proc. Python for Scientific Computing Conf. (SciPy), 2010.
[46] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, “Online and offline handwritten Chinese character recognition: Benchmarking on new databases,” Pattern Recognition, vol. 46, no. 1, pp. 155–162, 2013.
[47] I.-J. Kim, C. Choi, and S.-H. Lee, “Improving discrimination ability of convolutional neural networks by hybrid learning,” Int’l Journal on Document Analysis and Recognition, pp. 1–9, 2015.
[48] K. Cho, A. Courville, and Y. Bengio, “Describing multimedia content using attention-based encoder-decoder networks,” IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1875–1886, 2015.
[49] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” Proc. Int’l Conf. Machine Learning (ICML), 2015.
[50] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing machines,” arXiv:1410.5401, 2014.
[51] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” Proc. Int’l Conf. Learning Representations (ICLR), 2015.
[52] X.-D. Zhou, D.-H. Wang, F. Tian, C.-L. Liu, and M. Nakagawa, “Handwritten Chinese/Japanese text recognition using semi-Markov conditional random fields,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 10, pp. 2484–2497, 2013.
[53] Q.-F. Wang, F. Yin, and C.-L. Liu, “Handwritten Chinese text recognition by integrating multiple contexts,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1469–1481, 2012.
[54] R. Messina and J. Louradour, “Segmentation-free handwritten Chinese text recognition with LSTM-RNN,” Proc. Int’l Conf. Document Analysis and Recognition (ICDAR), 2015.
[55] Y. Kato and M. Yasuhara, “Recovery of drawing order from single-stroke handwriting images,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 9, pp. 938–949, 2000.
[56] Y. Qiao, M. Nishiara, and M. Yasuhara, “A framework toward restoration of writing order from single-stroked handwriting image,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1724–1737, 2006.
[57] A. Lamb, V. Dumoulin, and A. Courville, “Discriminative regularization for generative models,” arXiv:1602.03220, 2016.
[58] D. Im, C. Kim, H. Jiang, and R. Memisevic, “Generating images with recurrent adversarial networks,” arXiv:1602.05110, 2016.