
Neural Comput & Applic (2019) 31:3155–3172

DOI 10.1007/s00521-017-3260-9

ORIGINAL ARTICLE

In-air handwritten English word recognition using attention recurrent translator

Ji Gan¹ · Weiqiang Wang¹

Received: 5 July 2017 / Accepted: 14 October 2017 / Published online: 3 November 2017
© The Natural Computing Applications Forum 2017

Abstract As a new human–computer interaction way, in-air handwriting allows users to write in the air in a natural, unconstrained way. Compared with conventional online handwriting based on touch devices, in-air handwriting is much more challenging due to its unique characteristics. The in-air handwriting is always finished in a single stroke and thus lacks pen-down and pen-up information. Moreover, the in-air handwriting suffers less friction and space restriction, so the users write more casually. In this paper, we present an in-air handwriting system for effectively recognizing handwritten English words. An attention-based model, called attention recurrent translator, is proposed for the in-air handwritten English word recognition, which is considerably different from connectionist temporal classification (CTC). We evaluate the proposed approach on a newly collected dataset containing a total of 150,480 recordings that cover 2280 English words. The proposed approach achieves a word recognition accuracy of 97.74%. The experimental results show that the proposed recognizer is comparable with CTC and is extremely effective for in-air handwritten English word recognition.

Keywords In-air handwriting · Human–computer interaction · English word recognition · Attention recurrent translator

✉ Weiqiang Wang
wqwang@ucas.ac.cn

Ji Gan
ganji15@mails.ucas.ac.cn

¹ School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China

1 Introduction

Different from conventional human–computer interaction ways such as moving the mouse, typing on the keyboard, and touching the screen, in-air handwriting allows the user to write in the air in a natural and unconstrained way, which provides a much more intuitive, convenient and comfortable interaction experience. As a new human–computer interaction way, we think in-air handwriting opens a new era for advanced and natural human–computer interaction and promises to play an important role in human–computer interaction applications in the future. Specifically, in-air handwriting is especially useful for users to input text for smart system control in many applications such as intelligent TV and virtual reality (VR). For example, in-air handwriting provides a more convenient way to operate the TV than pressing the buttons on the remote control, especially when a user tries to search for a particular television channel or program.

In-air handwriting is a new topic that differs from conventional online handwriting. Compared with conventional handwriting on a touch screen, in-air handwriting is much more challenging and remains quite unexplored due to its unique characteristics. For example, an in-air handwritten English word is always finished in a single stroke, with no breaks between adjacent characters or between the strokes of a character. The connected strokes between characters, as well as those within each character, are more likely to cause ambiguity, which makes the in-air handwritten word more challenging to recognize than the conventional handwritten word. Additionally, in-air handwriting with the fingertip suffers less friction and space restriction, so the structures of characters in in-air handwritten words vary considerably even within the same class.


Fig. 1 The framework of the in-air handwritten English word recognition. After the acquisition of the in-air handwritten word, the system normalizes the collected trajectory and extracts the corresponding feature sequence. In the recognition stage, the neural network model translates the input feature sequence into the final hypothesis

With current advances in technology, in-air handwriting can be achieved with many somatosensory devices such as Kinect¹ and Leap Motion,² which are capable of capturing finger positions and movements in real time. The high precision and low cost of these somatosensory devices make it possible for in-air handwriting to become ubiquitous and popular. By using different kinds of motion tracking devices, many in-air handwriting systems have been implemented in recent years [1–7]. However, the majority of the existing in-air handwriting systems mainly focus on character and digit recognition. Even though a few in-air handwriting systems [2–5, 7] have investigated word recognition or sentence recognition, they are limited by the small amount of training data and the lack of robust classification algorithms, which might make their systems far from practical.

Therefore, the primary objective of this study is to implement a practical in-air handwriting system which can recognize the in-air handwritten English word in a large vocabulary by using a more robust recognizer. The handwriting is actually defined on a 2D surface regardless of the actual writing motion; therefore, we can consider the in-air handwriting as representing the writing motion projected on a vertical plane. The framework of the presented in-air handwriting system is shown in Fig. 1, and we will detail each module in the following sections.

The main contributions of the paper are summarized as follows:

– We present an in-air handwriting system which can record and recognize the in-air handwritten English word in a large vocabulary, where the in-air handwriting is achieved via the Leap Motion controller.
– We construct a moderately large in-air handwritten English word dataset. The collected in-air handwritten English word dataset contains 2280 English words, and each English word has 66 different samples, which results in a total of 150,480 recordings.
– We propose a new approach for the in-air handwritten English word recognition, which is inspired by the attention-based models for neural machine translation [8–10], speech recognition [11, 12] and image captioning [13]. To the best of our knowledge, we are the first to extend the attention model to the in-air handwritten English word recognition.

The rest of the paper is organized as follows. Section 2 briefly reviews the relevant works on the in-air handwriting and word recognition methods. Section 3 introduces the in-air handwriting and the collected in-air handwritten English word dataset. Section 4 presents the preprocessing steps for the in-air handwriting, and Sect. 5 presents the proposed recognizer in detail. Section 6 conducts the experimental evaluation, and Sect. 7 concludes the paper and discusses the future works.

2 Related works

2.1 In-air handwriting

As a natural and comfortable human–computer interaction way, in-air handwriting is a quite attractive and exciting research topic. With current advances in technology, various kinds of in-air handwriting systems have been implemented with different somatosensory devices.

Amma et al. [1, 2] presented an air-writing system based on a wearable glove. They [1] used the hidden Markov model (HMM) to classify 26 English letters. They conducted the experiments on a dataset containing a total of 6500 characters, and their method achieved an average writer-independent character recognition rate of 81.9%. Furthermore, their approach achieved a single-writer word

¹ Kinect: https://developer.microsoft.com/en-us/windows/kinect.
² Leap Motion: https://www.leapmotion.com.


recognition rate of 97.5% based on a small vocabulary of 652 words. In their later work [2], they proposed a two-stage approach for spotting and recognition of continuous air-writing, in which the spotting stage used the support vector machine (SVM), and the recognition stage used the HMM classifier. They conducted the experiments on a dataset containing a total of 720 English sentences, and a word error rate of 11% was achieved for the person-independent setup.

Zhang et al. [6] presented a vision-based air-writing system focusing on Chinese character recognition based on Kinect. They used the compact modified quadratic discriminant function (MQDF) classifier to recognize 6763 frequently used Chinese characters, 26 English letters (both uppercase and lowercase), and 10 digits. The experiments were conducted on 375 videos with a total of 44,522 frames, and a recognition rate of 78.46% for Chinese characters was achieved. On the other hand, Xu et al. [14] presented an in-air handwriting system for Chinese character recognition based on Leap Motion. They recognized the projected 2D handwritten trajectories by combining directional features and the MQDF classifier. They conducted the experiments on their constructed dataset IAHCC-UCAS2014 containing 244,075 recordings of 3755 Chinese characters, and a character recognition rate of 69.67% was achieved. In later works, other recognition methods were evaluated on their expanded dataset IAHCC-UCAS2015, such as the sparse representation-based classifier [15] and the recurrent neural network (RNN) using the softmax classifier [16], and better recognition results were achieved.

Vikram et al. [7] achieved air-writing based on Leap Motion. They recognized the writing sequence by using a dynamic time warping (DTW) classifier on their constructed dataset, which contains 3000 recordings of 30 English words and 26,000 recordings of 52 English letters. On the other hand, Chen et al. [3] addressed overlapped air-writing by using the Leap Motion, in which the characters of a handwritten word are overlapped. The authors used the HMM for both the air-writing modeling and recognition. They conducted the experiments on a dataset which has 5720 motion characters, 4400 motion words covering a 40-word vocabulary from common television channels, and 1000 motion words covering a 1k-word vocabulary. The proposed system achieved a word error rate of 0.8% for word-based recognition and 1.9% for letter-based recognition. Recently, Kumar et al. [5] presented 3D text segmentation and recognition using Leap Motion. They segmented the text into words first, and then used the HMM for word recognition. They conducted the experiments on a dataset containing 840 recordings covering 21 different words. Their approach achieved an overall accuracy of 80.3% in word segmentation and 92.73% in word recognition. In their later work [4], the authors further used the bidirectional long short-term memory (BLSTM) [17] and connectionist temporal classification (CTC) [18] for the word recognition, which outperforms the HMM classifier.

However, the majority of the existing in-air handwriting systems mainly focus on character and digit recognition. Even though a few in-air handwriting systems have investigated word recognition or sentence recognition, they are limited by the small amount of training data as well as the lack of robust classification algorithms, which might make their systems far from practical. In this work, we present an in-air handwriting system based on Leap Motion for the in-air handwritten English word recognition in a large vocabulary. We extend the previous works in the following aspects: (1) The system is specifically designed for the English word recognition rather than the character or the digit recognition; (2) A much larger dataset of the in-air handwritten English word is constructed, which can provide sufficient samples for training the robust classifier; (3) A novel and effective neural network architecture based on the attention mechanism is proposed for the in-air handwritten English word recognition.

2.2 Word recognition methods

Generally, the approaches of handwritten word recognition can be typically classified into two categories: the analytic approach and the holistic approach. The former first individually recognizes each isolated character and then maps the output characters into the complete word. The latter treats the word as a single, indivisible entity and attempts to recognize it using features of the word as a whole.

Many approaches can be applied to the handwritten word recognition, such as DTW [7], MQDF [19] and HMM [20, 21]. However, many recent studies show that deep neural networks usually outperform those traditional approaches for recognition tasks. For example, Graves et al. [22] have used BLSTM with CTC for unconstrained handwritten text recognition, and the experimental results show that such a network architecture outperforms the HMM classifier. The RNN is extremely applicable for tasks such as unsegmented connected handwriting recognition or speech recognition, since the RNN can process sequence data with longer-range context information compared with the HMM. Specifically, Ahmed [23] has presented a thorough investigation of the performance of the RNN for cursive and non-cursive scripts. Inspired by the great success of deep learning, the deep bidirectional recurrent neural network (DBRNN) has also been investigated in handwriting recognition and outperforms the single-layer RNN [24, 25]. Furthermore, the


DBRNN architecture has achieved the state-of-the-art recognition accuracy for the online handwritten Chinese character recognition [26].

The holistic approach is suitable for applications that have a limited vocabulary; however, for the large-vocabulary handwritten word recognition, another feasible approach is to individually recognize each character and then map the characters onto the complete word by using a lexicon. Specifically, CTC is primarily designed for sequence labeling tasks: it can directly label unsegmented sequences and avoids complex pre-segmentation. CTC has been applied to many cursive script recognition systems such as [22, 23, 27]. Moreover, CTC has been shown to outperform both HMMs and RNN-HMM hybrids, and has achieved great success in many sequence labeling tasks such as handwriting recognition [22] and speech recognition [18]. However, CTC assumes that the output labels are conditionally independent of each other, while the characters of an English word are strongly associated with each other, since only a particular order of the characters forms a meaningful word.

Recently, the attention-based recurrent model has been successfully applied to various kinds of sequence labeling tasks, such as neural machine translation [8–10], image caption generation [13] and speech recognition [11, 12]. In neural machine translation, this attention-based recurrent model is generally based on the word-level [8, 9] or the character-level [10] encoder–decoder framework, which encodes the input sequence into a fixed-length vector and then decodes the vector to the target sequence with the attention mechanism. However, to the best of our knowledge, few works have explored the applicability of the attention-based recurrent model to the task of online handwriting recognition. In this paper, we make a first attempt to extend the attention-based model, called attention recurrent translator, to the in-air handwritten word recognition. Different from CTC, the proposed attention recurrent translator outputs each character depending on the previous output characters.

3 In-air handwriting

By using the Leap Motion controller, we build an in-air handwriting system which can record and recognize in-air handwritten English words. An illustration of the in-air handwriting system is shown in Fig. 2. With the presented in-air handwriting system, we construct an in-air handwritten English word dataset. Moreover, compared with conventional handwriting based on touch devices, the in-air handwriting is much more challenging due to its unique characteristics. In the following subsections, we detail the construction of the in-air handwritten English word dataset and the characteristics of in-air handwritten English words.

Fig. 2 The illustration of the presented in-air handwriting system. The user writes in the air by moving the finger over the Leap Motion controller, and the corresponding handwritten trajectory is drawn on the computer screen

3.1 Word recording

3.1.1 Finger tracking using Leap Motion

To achieve the in-air handwriting, the Leap Motion controller is integrated into our system. The device can track hands, fingers and finger-like tools; it operates in close proximity with high precision and tracking frame rate, and reports discrete positions.

Fig. 3 The coordinate system of the Leap Motion controller. The Leap Motion system employs a right-handed Cartesian coordinate system where the origin is centered at the top of the controller, the x- and z-axes lie in the horizontal plane, and the y-axis is vertical

As shown in Fig. 3, the


Leap Motion system employs a right-handed Cartesian coordinate system, where the origin is centered at the top of the device, the x- and z-axes lie in the horizontal plane, and the y-axis is vertical, with positive values increasing upwards. In our system setup (Intel Core i7 CPU 3.60 GHz, Leap SDK 2.2.6+29154 with USB 2.0 connection), the Leap Motion updates the tracking results at a rate of 100 Hz, which achieves precise and real-time tracking of the fingertip.

3.1.2 Corpus preparation

Since the vocabulary of all English words is extremely large, it would be impossible to collect all words in practice. For the recording purpose, we create a vocabulary of 2280 English words according to the frequencies of usage. The selected 2280 English words are frequently used in daily life, contain characters of variable length, and involve only lowercase letters. As shown in Fig. 4, the longest word contains 15 characters, the shortest word contains only 1 character, and most words contain between 3 and 10 characters. Furthermore, we measure the relative frequencies of English letters in the created vocabulary as shown in Fig. 5, which is consistent with the relative frequencies of letters in the English language reported by Wikipedia.³

Fig. 4 Statistics of lengths of words in the created vocabulary. The longest word contains 15 characters, the shortest word contains only 1 character, and most words contain between 3 and 10 characters

³ Relative frequencies of letters in the English language in Wikipedia: https://en.wikipedia.org/wiki/Letter_frequency.

3.1.3 Recording with the built system

With the presented in-air handwriting system, we construct an in-air handwritten English word dataset named IAHEW-UCAS2016. The data acquisition process involves 324 different participants between 20 and 27 years of age, and every participant is required to write around 300–500 English words. As a result, we collect a total of 150,480 recordings covering 2280 English words, and each English word contains 66 different samples.

3.2 Characteristics of in-air handwritten English words

Compared with the conventional handwritten words generated on touch devices, in-air handwritten words have their unique characteristics, which makes the in-air handwriting recognition more challenging than the conventional online handwriting recognition.

Different from conventional handwriting based on touch devices, an in-air handwritten English word is always finished in a single stroke with no breaks between adjacent characters, as shown in Fig. 6. In other words, an in-air handwritten word contains no pen-up/pen-down information, since the corresponding motion is tracked as a continuous stream of sensor data. However, the pen-up/pen-down information is useful for character segmentation as well as for the stroke delimitation of cursive handwriting. For conventional handwriting based on touch devices, the pen-up/pen-down information delimits the strokes of a handwritten word, and this delimitation information can further help to recognize the handwritten word. On the contrary, the in-air handwriting is always finished with a single stroke, and thus the successive letters of an in-air handwritten word are connected without explicit pen-up moves. As a result, the ligatures that join the characters of an in-air handwritten word are more likely to cause ambiguity, which makes in-air handwritten words more difficult to recognize than conventional handwritten words. For example, when the letter "c" and the letter "e" appear successively, it is impossible to identify whether "ce" or "cc" is written in the in-air handwriting if no context information is provided, because the corresponding parts of the handwritten words look similar when no pen-up/pen-down information is given. However, the delimitation information in conventional online handwriting can eliminate this kind of ambiguity. Some examples of the ambiguity of in-air handwritten words are shown in Fig. 7.

Furthermore, for conventional handwriting, writing on the touch screen provides tactile feedback, which prevents the person from writing too casually. However, the fingertip suffers less friction and space restriction in the in-air handwriting. Therefore, the writer can write more freely and casually than in conventional handwriting, and thus the structures of characters in in-air handwritten words vary considerably even within the same class. As a result, in-air


Fig. 5 The comparison of the statistics of relative frequencies of English letters
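The relative-frequency statistics compared in Fig. 5 are straightforward to recompute from any word list. A minimal sketch (the three-word vocabulary below is purely illustrative, not the paper's 2280-word list):

```python
from collections import Counter

def letter_frequencies(vocabulary):
    """Relative frequency of each letter over all letters in the vocabulary."""
    counts = Counter(ch for word in vocabulary for ch in word)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

# Toy vocabulary; the real IAHEW-UCAS2016 vocabulary has 2280 lowercase words.
freqs = letter_frequencies(["agriculture", "pretty", "cottage"])
```

Applied to the created vocabulary, such counts yield the distribution plotted in Fig. 5.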

Fig. 6 Some samples of in-air handwritten words acquired by Leap Motion. The word is always finished in a single stroke with no breaks between adjacent characters. a "agriculture", b "pretty", c "cottage", d "punctual"

Fig. 7 Some examples of the ambiguity of in-air handwritten words. The ligatures between adjacent characters make the words rather confusing. a "accept", b "improve", c "language", d "morning", e "invention", f "arm"

handwritten words contain much more diversity than conventional handwritten words, as shown in Fig. 8.

4 Preprocessing

It is necessary to normalize handwritten words to remove noise and reduce the differences between the writing styles of different writers. After the normalization, the feature extraction step is applied in preparation for the recognition phase. In the following subsections, we describe the preprocessing steps, including word normalization and feature extraction, in detail.

4.1 Word normalization

Since the writing styles of different writers are considerably different, it is important to normalize the handwriting trajectories. As shown in Fig. 9, we normalize each word trajectory according to the following steps. (1) Computing rotation: calculating a linear regression line from the local minimum points of the trajectory. If the residual variance is larger than a defined threshold, the point with the largest Euclidean distance from the regression line is removed, and the regression line is calculated again; (2) Normalizing rotation: rotating the trajectory according to the slant calculated in step (1); (3) Normalizing size: transforming


Fig. 8 Some examples of the diversity of in-air handwritten words. The user writes more freely and casually because of the reduced friction and space restriction, which makes the in-air handwritten words contain much more diversity. a "most", b "inquire", c "ability"

Fig. 9 The process of the word normalization. All the handwritten words are normalized to remove noise and reduce the differences between the writing styles of different writers
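The smoothing and resampling steps of the normalization pipeline (steps (4) and (5) in Sect. 4.1) translate directly into code. A minimal numpy sketch; the `spacing` parameter of the resampling step is an illustrative choice, as the paper does not specify its value:

```python
import numpy as np

def smooth(points):
    """Step (4): local smoothing, new point = 0.25*prev + 0.5*cur + 0.25*next."""
    p = np.asarray(points, dtype=float)
    out = p.copy()
    out[1:-1] = 0.25 * p[:-2] + 0.5 * p[1:-1] + 0.25 * p[2:]
    return out

def resample(points, spacing):
    """Step (5): replace captured points by points with equal spatial distance."""
    p = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(p, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative arc length
    targets = np.arange(0.0, arc[-1] + 1e-9, spacing)
    xs = np.interp(targets, arc, p[:, 0])
    ys = np.interp(targets, arc, p[:, 1])
    return np.stack([xs, ys], axis=1)
```

Resampling by arc length rather than by time removes the speed variation of the writer's hand, so that subsequent features depend only on the shape of the trajectory.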

every word to the same given height, while preserving the aspect ratio of the word; (4) Smoothing: locally smoothing the trajectory by computing a new position $(\hat{x}_i, \hat{y}_i)$ for each point $(x_i, y_i)$ from its neighbors $(x_{i-1}, y_{i-1})$, $(x_{i+1}, y_{i+1})$ and itself, i.e., $(\hat{x}_i, \hat{y}_i) = 0.25\,(x_{i-1}, y_{i-1}) + 0.5\,(x_i, y_i) + 0.25\,(x_{i+1}, y_{i+1})$; (5) Resampling: replacing the sequence of captured points with a new series of points having the same spatial distance.

4.2 Feature extraction

Feature extraction is essential for the in-air handwritten word recognition. As the in-air handwritten word is always finished in one single stroke, the recorded trajectory can be considered as a very smooth curve. For the in-air handwritten word trajectory, which is a sequence of (x, y)-coordinates representing the location of the fingertip in a vertical plane, we derive the following features as shown in Fig. 10: (1) the offset of the x-coordinates $\Delta x$ as well as the offset of the y-coordinates $\Delta y$; (2) the cosine and sine of the writing direction $\alpha$; (3) the cosine and sine of the curvature $\beta$. We think the extracted features can effectively represent the smooth curve of the in-air handwritten trajectory and maintain the temporal information. Different from the conventional handwritten word recognition, we cannot capture the pen-up/pen-down information, due to the uni-stroke writing style of the in-air handwritten word. As a result, we extract a six-dimensional feature for each point, and then obtain a sequence of features.

5 Word recognition

In this section, we detail the proposed approach for the in-air handwritten word recognition. Given an input feature sequence $x = (x_1, \ldots, x_i, \ldots, x_N)$, we need to find the label sequence $y = (y_1, \ldots, y_l, \ldots, y_L)$ which has the maximal conditional probability, where $y_l \in \{a, b, \ldots, z, \mathrm{SOS}, \mathrm{EOS}, \mathrm{PAD}\}$; "SOS" and "EOS" are the special tokens which denote the start-of-sequence and the end-of-sequence, respectively, and "PAD" denotes the blank token. Formally,

$y^{*} = \arg\max_{y} P(y|x).$  (1)

The proposed architecture directly models the conditional probability $P(y|x)$ of translating the input feature sequence $x$ into the target sequence $y$. The overview of the architecture, called attention recurrent translator, is shown in Fig. 11. The translator basically consists of two modules: (1) The encoder, which computes the representation of the input sequence $x$; (2) The decoder, which outputs each


Fig. 10 Handcrafted features of the in-air handwritten trajectory. A six-dimensional feature is extracted for each point, and then a sequence of features is obtained for each trajectory. a Offset-x $\Delta x$ and offset-y $\Delta y$, b Writing direction $\alpha$, c Curvature $\beta$
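In code, the six-dimensional feature of Sect. 4.2 can be computed per point from the resampled trajectory. The realization below is a sketch under stated assumptions: the paper gives no closed formulas, so here $\alpha$ is taken as the angle of the segment leaving each point, $\beta$ as the turning angle between consecutive segments, and border points reuse the nearest valid segment:

```python
import numpy as np

def extract_features(points):
    """Per-point features (Δx, Δy, cos α, sin α, cos β, sin β) -- assumed formulas."""
    p = np.asarray(points, dtype=float)
    d = np.diff(p, axis=0)              # segment vectors between neighboring points
    d = np.vstack([d, d[-1:]])          # pad: last point reuses the last segment
    norm = np.linalg.norm(d, axis=1, keepdims=True)
    norm[norm == 0] = 1.0
    u = d / norm                        # unit vectors of the writing direction α
    u_prev = np.vstack([u[:1], u[:-1]]) # previous direction (first point reuses itself)
    cos_b = np.sum(u * u_prev, axis=1)                       # cos of turning angle β
    sin_b = u_prev[:, 0] * u[:, 1] - u_prev[:, 1] * u[:, 0]  # sin via 2D cross product
    return np.stack([d[:, 0], d[:, 1], u[:, 0], u[:, 1], cos_b, sin_b], axis=1)
```

Encoding the angles by their cosine and sine, rather than by the raw angle, avoids the 2π discontinuity and keeps all feature values bounded.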

target character $y_l$ and models the conditional probability $P(y|x)$ according to the chain rule, i.e.,

$P(y|x) = \prod_{l=1}^{L} P(y_l \mid x, y_{<l}),$  (2)

where $P(y_l \mid x, y_{<l})$ denotes the probability of outputting the $l$-th character $y_l$ conditioned on the previous characters $y_{<l} = (y_1, y_2, \ldots, y_{l-1})$ as well as the input sequence $x$.

In the following subsections, we introduce each module of the architecture and detail the training and the prediction of the model.

5.1 Encoder

The encoder is actually a deep bidirectional long short-term memory (DBLSTM) [28] neural network, which computes the representation $h$ of the input sequence $x$ as

$h = \mathrm{Encoder}(x)$  (3)
$\phantom{h} = \mathrm{DBLSTM}(x).$  (4)

Specifically, in a typical DBLSTM stacked from $M$ BLSTM layers, the output at the $i$-th timestep of the $j$-th layer is computed as

Fig. 11 The overview of the architecture. The architecture contains two modules: the encoder and the decoder. The encoder first computes the representation of the input sequence, and then the decoder generates each character based on its previous output characters and the relevant information captured from the whole representation of the input sequence
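One decoding step of the architecture in Fig. 11 reduces to a few matrix operations once the recurrent cell of Eq. (9) is abstracted away. A numpy sketch of the attention computation and state combination of Eqs. (11)–(14); the dimensions and the matrix `W_c` are illustrative stand-ins, not the trained parameters:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_step(s_hat, h, W_c):
    """One decoder timestep, after the RNN cell of Eq. (9) has produced s_hat.

    s_hat : hidden state, shape (d,)
    h     : encoded feature sequence, shape (N, d)
    W_c   : combination weights, shape (d_out, 2d)
    """
    e = h @ s_hat                                  # content-based energies, Eq. (12)
    a = softmax(e)                                 # alignment vector, Eq. (13)
    c = a @ h                                      # context vector, Eq. (14)
    s = np.tanh(W_c @ np.concatenate([s_hat, c]))  # decoder state, Eq. (11)
    return s, a

# Tiny illustration: the second encoded frame matches s_hat most strongly,
# so the alignment concentrates on it.
h = np.array([[0.0, 1.0], [10.0, 0.0], [0.0, -1.0]])
s_hat = np.array([1.0, 0.0])
s, a = attention_step(s_hat, h, W_c=np.eye(2, 4))
```

In lexicon-free decoding (Sect. 5.4), this step is repeated greedily: the character with maximal probability under Eq. (7) is fed back as $y_{l-1}$ until "EOS" is produced.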


$h_i^j = \mathrm{BLSTM}\left(h_{i-1}^j, h_i^{j-1}\right).$  (5)

In the in-air handwriting recognition, the length of the input feature sequence can be hundreds of frames. However, the information in the higher levels of a deep RNN tends to be more spread out in longer sequences, and long-range interdependencies are generally harder to learn from longer sequences [29]. In our model, we subsample the outputs at consecutive timesteps before feeding them to the next (higher) layer, i.e.,

$h_i^j = \mathrm{BLSTM}\left(h_{i-1}^j, \max\left(h_{2i}^{j-1}, h_{2i+1}^{j-1}\right)\right).$  (6)

This modification is similar to [11]; it reduces the time resolution in the higher levels of the deep RNN and thus allows the decoder to extract the relevant information from a smaller number of timesteps.

5.2 Decoder

The decoder is an attention-based recurrent cell which predicts each target character $y_l$ based on the probability distribution over its current decoder state $s_l$. Formally,

$P(y_l \mid x, y_{<l}) = \mathrm{Softmax}(W_s s_l + b_s),$  (7)

where $s_l$ is the decoder state at timestep $l$ in the decoding phase, $W_s$ is the weight matrix, and $b_s$ is the bias vector.

The decoder in our architecture is similar to [30], which is a simpler variant of the conventional version [8, 11]. At each timestep $l$ in the decoding phase, the decoder computes the current decoder state $s_l$ from its previous decoder state $s_{l-1}$, the previously generated character $y_{l-1}$ and the global information captured from the encoded feature sequence $h$, i.e.,

$s_l = \mathrm{Decoder}(s_{l-1}, y_{l-1}, h).$  (8)

Specifically, the decoder first computes the hidden state $\hat{s}_l$ according to

$\hat{s}_l = \mathrm{RNN}(s_{l-1}, y_{l-1}),$  (9)

where RNN denotes a recurrent cell such as the long short-term memory (LSTM) [17] or the gated recurrent unit (GRU) [31]. Then the attention mechanism is applied to derive the context vector $c_l$, which captures the relevant information from the encoded feature sequence $h$ and helps to generate the current character $y_l$. Formally, the context vector $c_l$ is calculated as

$c_l = \mathrm{Attention}(\hat{s}_l, h).$  (10)

Finally, the decoder combines the information of the current hidden state $\hat{s}_l$ and the calculated context vector $c_l$ to produce the current decoder state $s_l$, which is computed as

$s_l = \tanh\left(W_c \left[\hat{s}_l; c_l\right]\right),$  (11)

where $[\cdot\,;\cdot]$ denotes the concatenation operation, and $W_c$ is the weight matrix.

Attention mechanism We now detail how the model computes the context vector $c_l$ at timestep $l$ in the decoding phase by applying the attention mechanism, which was first introduced by Bahdanau et al. [8]. Given the encoded feature sequence $h = (h_1, \ldots, h_i, \ldots, h_N)$ and the current hidden state $\hat{s}_l$ at timestep $l$ in the decoding phase, the model first computes the content-based energy $e_l(i)$ for each timestep $i$ of the encoded feature sequence $h$ as

$e_l(i) = \hat{s}_l^{\top} h_i,$  (12)

where $\top$ denotes the transpose operation. Then the alignment vector $a_l = (a_l(1), \ldots, a_l(i), \ldots, a_l(N))$ is computed from the content-based energy $e_l = (e_l(1), \ldots, e_l(i), \ldots, e_l(N))$ by using the softmax function:

$a_l = \mathrm{Softmax}(e_l).$  (13)

Finally, the context vector $c_l$ is computed as the weighted average over the encoded features $h$ given the alignment vector $a_l$, i.e.,

$c_l = \sum_{i=1}^{N} a_l(i)\, h_i.$  (14)

5.3 Teacher-forcing learning

Given the whole training data $D$, the loss $L$ of the model is defined as

$L = \sum_{(x,y) \in D} -\log P(y|x),$  (15)

where $(x, y)$ denotes each pair of the input feature sequence $x$ and the target sequence $y$. The model minimizes the loss $L$ according to

$\min_{\theta} \sum_{(x,y) \in D} -\log P(y \mid x; \theta)$  (16a)
$= \min_{\theta} \sum_{(x,y) \in D} \sum_{l=1}^{L} -\log P\left(y_l \mid x, y_{<l}^{*}; \theta\right),$  (16b)

where $y_{<l}^{*}$ denotes the ground truth of the previous $l-1$ characters, and $\theta$ denotes the model parameters.

5.4 Word decoding

The goal of the network is to find the label sequence $y^{*}$ with the maximal conditional probability, i.e., $y^{*} = \arg\max_y P(y|x)$. For in-air handwritten word recognition, mainly two types of transcriptions can be used to


determine the sequence y*: lexicon-free transcription and lexicon-based transcription.

Lexicon-free transcription In lexicon-free transcription, the generated character sequence y is not always a meaningful word. The sequence prediction starts with the "SOS" token, and the model samples from the probability distribution by taking the character with the maximal probability at each timestep, i.e., y_l ~ Softmax(W_s s_l + b_s), until the "EOS" token is produced, which indicates the end of the inference process.

Lexicon-based transcription In lexicon-based transcription, the label sequence y* is found by selecting the sequence y in the given lexicon D that has the maximal conditional probability, i.e., y* = arg max_{y∈D} P(y|x), where P(y|x) is calculated according to the chain rule, P(y|x) = ∏_{l=1}^{L} P(y_l | x, y_{<l}). However, it is time-consuming to decode over a huge lexicon, since the whole lexicon has to be traversed. As a result, an efficient search scheme is needed to prune the search space. Since many words in the dictionary share the same prefixes, and the next generated character y_l of each word depends on its previous output characters y_{<l}, a prefix tree [32] pre-constructed over the whole given lexicon D can benefit the lexicon-based transcription. As shown in Fig. 12, the prediction starts with the root node "SOS", and at each step the model chooses the child node y_l on the path y = (y_{<l}, y_l) that has the maximal probability, i.e., (y_{<l}, y_l) = arg max_{y_{<l}, y_l} P(y_l | y_{<l}, x) P(y_{<l} | x). The process repeats until a leaf node "EOS" is reached, and then a word is found on the path y from the root to that leaf. Moreover, the search efficiency could be further improved by incorporating beam search, though possibly at a minor loss of recognition performance.

It should be noted that the lexicon-based transcription fully considers the context information, since it chooses the word with the maximal likelihood over the given lexicon, which is similar to the holistic approach. Therefore, we think the lexicon-based transcription can reduce the ambiguity of in-air handwritten words by considering the context information and thus benefits the word recognition. On the other hand, different from the lexicon-free transcription, the lexicon-based transcription always outputs a word within the given lexicon. The experimental results reported in Sect. 6.7 will show that lexicon-based transcription outperforms lexicon-free transcription for in-air handwritten English word recognition.

6 Experimental evaluation

6.1 Dataset preparation

IAHEW-UCAS2016 With the in-air handwriting system, we construct an in-air handwritten English word dataset named IAHEW-UCAS2016, which involves 324 different participants. The built dataset contains 150,480 recordings covering 2280 English words, where every word has 66 different samples. Some samples of the handwritten English words are shown in Fig. 6. In the experiments, 44 samples per class are randomly chosen for training, and the remaining for testing. It should be noted that, for each category, the handwritten samples of each participant in the training set do not appear in the test set.

6.2 Vocabularies for word recognition
To evaluate the effect of vocabulary size on the scalability,
we prepare two vocabularies of different sizes for the word
recognition in our experiments: the small vocabulary of
2280 English words and the large vocabulary of 25k
English words. The vocabulary of 2280 words is created by
us according to the frequencies of usage as mentioned in
Sect. 3.1.2. On the other hand, the vocabulary of 25k
English words4 is chosen from the common words by
statistically analyzing a sample of 29 million words used in
English TV and movie scripts. It should be noted that the word recognition gets harder when the given lexicon grows larger, since the increase in word categories results in a larger search space. However, a considerably large vocabulary may cover many more practical cases for the word recognition.

Fig. 12 An example of the lexicon-based transcription using a prefix tree. The prefix tree is constructed with four words: "act", "art", "arm", and "a". The root is the "SOS" token, and the leaves are "EOS" tokens. Each child node denotes one character, and the edge between adjacent nodes gives the conditional probability of outputting the next character. The recognition starts at the root and repeats until a leaf node is reached. The orange nodes denote the path with the maximal probability over the prefix tree, and the corresponding predicted word is "art"

4 The details of the vocabulary of 25k English words can be found at: https://github.com/dolph/dictionary.
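The prefix-tree decoding described in Sect. 5.4 (and illustrated in Fig. 12) can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation; `step_probs` is a hypothetical stand-in for the decoder's per-step softmax distribution P(y_l | y_<l, x).

```python
import math

# Sketch of lexicon-based transcription with a prefix tree (Sect. 5.4).
SOS, EOS = "<sos>", "<eos>"

def build_trie(lexicon):
    """Map every prefix to the set of characters (or EOS) that may follow it."""
    children = {}
    for word in lexicon:
        prefix = ""
        for ch in list(word) + [EOS]:
            children.setdefault(prefix, set()).add(ch)
            if ch != EOS:
                prefix += ch
    return children

def decode(step_probs, trie):
    """Greedy prefix-tree decoding: at each step consider only children that
    stay inside the lexicon and follow the most probable one."""
    prefix, log_prob = "", 0.0
    while True:
        allowed = trie[prefix]
        probs = step_probs(prefix)              # P(y_l | y_<l, x)
        best = max(allowed, key=lambda ch: probs.get(ch, 0.0))
        if best == EOS:
            return prefix, log_prob
        log_prob += math.log(probs.get(best, 1e-12))
        prefix += best

# Toy run over the four words of Fig. 12, with made-up probabilities.
trie = build_trie(["act", "art", "arm", "a"])
table = {
    "":    {"a": 0.9, EOS: 0.1},
    "a":   {"r": 0.6, "c": 0.3, EOS: 0.1},
    "ar":  {"t": 0.7, "m": 0.3},
    "art": {EOS: 1.0},
}
word, _ = decode(table.__getitem__, trie)       # follows "a" -> "ar" -> "art"
```

Keeping a fixed number of best partial paths at each step, instead of the single greedy `max`, turns this into the beam search mentioned in Sect. 5.4.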
6.3 Performance metric

We use the character accuracy rate (CAR) and the word recognition rate (WRR) to measure the recognition performance of the classifier. Let Y = {y_1, ..., y_i, ..., y_N} denote all the words predicted by the recognizer, where y_i denotes the i-th predicted word. Similarly, let Z = {z_1, ..., z_i, ..., z_N} denote all the target words in the test set, where z_i denotes the i-th target word. Then the character accuracy rate (CAR) is calculated by

CAR(Y, Z) = 1 − (Σ_{i=1}^{N} E(y_i, z_i)) / (Σ_{i=1}^{N} L(z_i)),   (17)

where E(y_i, z_i) is the Levenshtein distance between the predicted word y_i and the target word z_i, and L(z_i) is the length of the target word z_i. On the other hand, the word recognition rate (WRR) is calculated by

WRR(Y, Z) = (1/N) Σ_{i=1}^{N} I(y_i = z_i),   (18)

where I(y_i = z_i) equals 1 if the predicted word y_i is the same as the target word z_i, and 0 otherwise.

Fig. 13 The detailed configuration of the architecture. The encoder part is the DBLSTM stacked by 3 BLSTM layers, where each BLSTM contains 256 hidden units. The decoder part uses the GRU as the recurrent cell, which contains 256 hidden units. Besides, the dropout technique is utilized in each recurrent layer to avoid the overfitting problem
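Equations (17) and (18) can be computed with a few lines of plain Python; the edit-distance routine below is a standard dynamic-programming Levenshtein implementation, not code from the paper.

```python
def levenshtein(a, b):
    """Edit distance E(y_i, z_i) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def car(predictions, targets):
    """Character accuracy rate, Eq. (17)."""
    errors = sum(levenshtein(y, z) for y, z in zip(predictions, targets))
    return 1.0 - errors / sum(len(z) for z in targets)

def wrr(predictions, targets):
    """Word recognition rate, Eq. (18)."""
    return sum(y == z for y, z in zip(predictions, targets)) / len(targets)
```

For instance, misreading "ornament" as "omament" (a case discussed in Sect. 6.11) contributes two character errors (one substitution and one deletion) but a whole word error.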
6.4 Implementation details

We implement the architecture in the PyTorch5 framework, which is a deep learning research platform that provides maximum flexibility and speed. The experiments are carried out on a Dell workstation with a 3.40 GHz Intel(R) Core(TM) i7-6700 CPU, 16 GB RAM, and an NVIDIA Quadro K2200 GPU.

The detailed configuration of the model is shown in Fig. 13. The model consists of two parts: the encoder and the decoder. The encoder is the DBLSTM stacked by 3 BLSTM layers, and each BLSTM contains 256 hidden units. The decoder uses a GRU recurrent cell containing 256 hidden units. The initial weights of each RNN cell are sampled from the uniform distribution U(−1/√h, 1/√h), where h is the number of hidden units.

The network is trained with ADAM [33], setting the parameter β1 (the exponential decay rate for the first-moment estimates) to 0.9, β2 (the exponential decay rate for the second-moment estimates) to 0.999, ε (the constant for numerical stability) to 10^-8, and the initial learning rate to 0.001. During training, each mini-batch contains 240 samples, and all mini-batches are randomly shuffled in every epoch. We train the network for 10 epochs (around 4180 iterations), by which point the model has converged. The training losses of the model are shown in Fig. 14.

Fig. 14 The training losses of the model. The model is trained with ADAM and reaches convergence after over 4000 iterations

6.5 Overall performance

The overall performance of the proposed model is listed in Table 1. It should be noted that when we calculate the CAR and the WRR without a lexicon, the word is predicted via the lexicon-free transcription; however, when we calculate the WRR with a specific lexicon, the word is predicted via the lexicon-based transcription. Besides, we set the width of the beam search to 8 in the experiments. Moreover, we compare the proposed model with the CTC classifier, which is chosen as the baseline method, and we find that the proposed approach achieves satisfactory results and is comparable with CTC. In addition, the further comparison of the WRRs between the proposed method

5 PyTorch: http://pytorch.org/.
Table 1 The overall performance of the proposed approach on dataset IAHEW-UCAS2016

Methods                                     | CARs (%) | WRRs 2280 (%) | WRRs 25k (%) | WRRs None (%)
DBLSTM + CTC [24] (baseline1)               | 96.01    | 96.64         | 95.09        | 84.12
DBLSTM(subsample) + CTC [25] (baseline2)    | 97.13    | 97.37         | 96.21        | 88.53
DBLSTM(subsample) + Translator (proposed)   | 96.86    | 97.74         | 96.61        | 88.63

Bold indicates the best result in the corresponding column
In the column WRRs, "2280" denotes the recognition result with the given lexicon of 2280 English words, "25k" corresponds to that with the lexicon of 25k English words, and "None" corresponds to that with no lexicon. In addition, "DBLSTM(subsample)" denotes the model that subsamples the feature sequence in the higher levels of the DBLSTM, as mentioned in Sect. 5.1
and other existing approaches will be made later in Sect. 6.13.

6.6 Attention visualization

In this subsection, we visualize how the model generates the output characters based on the encoded features with the content-based attention mechanism. The attention alignment between the encoded features and the output characters is visualized in Fig. 15. It should be noted that the model can learn an explicit monotonic alignment between the encoded features and the output characters by using the attention mechanism, without the need for any location-based priors. More specifically, as shown in Fig. 16, we observe that the model pays attention to different parts of the handwritten word, in which the attention mechanism shifts from one character to the next in the proper order.

6.7 Effects of lexicon

We also investigate how the given lexicon affects the WRR in the experiments. As shown in Table 1, the WRR via lexicon-free transcription is pretty low, which is caused by the following reason: the in-air handwritten English word is ambiguous due to its unique characteristics, as mentioned in Sect. 3.2, and thus it is difficult to recognize each character correctly. Therefore, even a failure on one character will fail the word recognition. However, the WRR via lexicon-based transcription increases significantly when using the lexicon of 2280 English words, since a correction mechanism is introduced and the given lexicon helps the word recognition. On the other hand, the WRR achieves lower but still satisfactory recognition results when using the large lexicon containing 25k English words, which indicates that the proposed recognizer is suitable for recognizing in-air handwritten English words over a considerably large vocabulary.

6.8 Effects of beam width

We investigate how the width of the beam search affects the recognition performance of the model. According to the results shown in Fig. 17, the CAR and the WRR via lexicon-free transcription consistently rise when increasing the beam width up to 4. Moreover, a significant performance improvement is observed by increasing the beam width

Fig. 15 The monotonic alignment between the encoded features and the output characters produced by the model for the recording "confession". The attention mechanism is visualized by recording the attention distribution on the encoded feature sequence at every timestep in the decoding phase. The attention is confused when generating some characters due to the ambiguity of the in-air handwritten word
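The content-based attention step of Eq. (10), which yields alignments like those in Figs. 15 and 16, can be sketched as follows. The dot-product scoring used here is an illustrative simplification; the exact parameterization in the paper may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, encoded):
    """Return the context vector c_l and the attention weights over h."""
    # Score each encoded frame h_t against the decoder query (dot product).
    scores = [sum(q * h for q, h in zip(query, h_t)) for h_t in encoded]
    weights = softmax(scores)                    # attention distribution
    dim = len(encoded[0])
    context = [sum(w * h_t[d] for w, h_t in zip(weights, encoded))
               for d in range(dim)]              # weighted sum of frames
    return context, weights

# Toy example: the query is closest to the second encoded frame, so the
# attention mass concentrates there.
h = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context, weights = attention([0.0, 2.0], h)
```

Recording `weights` at every decoding step, as done for Fig. 15, gives the full alignment matrix between the encoded features and the output characters.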
from 1 to 2. When the beam width is 8, the WRR via lexicon-free transcription is 88.63%, and the WRR via lexicon-based transcription is 97.74% by using the lexicon of 2280 words and 96.61% by using the lexicon of 25k words, respectively.

6.9 Effects of word length

To evaluate the effects of word length, we group the words of the same length together and compute the CAR and WRRs per group. As shown in Fig. 18, the model performs poorly in predicting long words via lexicon-free transcription, since each output character y_l heavily relies on its previously generated characters y_{<l} for each word, and thus longer words are more likely to be misclassified. However, when the model predicts the words via lexicon-based transcription, the WRRs of the words with different lengths remain stable, which indicates the robustness of the lexicon-based transcription.

6.10 Strict writer-independent evaluation

To further evaluate the performance of the proposed method, a strict writer-independent evaluation is conducted on the built dataset. To organize a writer-independent dataset, we remove some handwritten samples of a few participants from the original dataset IAHEW-UCAS2016 and split the dataset into three parts: Part-A, Part-B, and Part-C. Each part covers 2280 English words, and each word has 20 different samples. As a result, the handwritten samples of each participant in each part are guaranteed against appearing in the other two parts. In the experiments, two parts are used for training and the remaining part is used for testing. Figure 19 shows the cross-validation results, and Table 2 lists the average results. As shown in Fig. 19, the recognition performances of Part-A and Part-C vary considerably, which is caused by

Fig. 16 The attention visualization at each timestep in the decoding phase for the trajectory "confession". At each timestep, the attention mechanism is visualized by deepening the black color of the corresponding part of the trajectory. It can be observed that the attention mechanism shifts from one character to the next in the proper order

Fig. 17 The effects of the beam width on the recognition performance. The reported recognition accuracies are the CAR, the WRR via the lexicon-free transcription, and the WRRs via the lexicon-based transcription using the lexicons of both 2280 words and 25k words

Fig. 18 The effects of word length on the recognition performance. The reported recognition accuracies are the CAR, the WRR via the lexicon-free transcription, and the WRRs via the lexicon-based transcription using the lexicons of both 2280 words and 25k words. The figure shows that the model performs poorly in predicting long words via the lexicon-free transcription. In addition, it should be noted that the majority of the word lengths range from 3 to 10, as shown in Fig. 4
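The per-length evaluation of Sect. 6.9 amounts to grouping test pairs by the target word's length and computing the WRR of Eq. (18) within each group. A small sketch, with made-up predictions:

```python
from collections import defaultdict

def wrr_by_length(predictions, targets):
    """Word recognition rate per target-word length (cf. Fig. 18)."""
    correct, total = defaultdict(int), defaultdict(int)
    for y, z in zip(predictions, targets):
        total[len(z)] += 1
        correct[len(z)] += (y == z)
    return {n: correct[n] / total[n] for n in sorted(total)}

# Made-up example: one of the two three-letter words is recognized,
# and the single seven-letter word is recognized.
rates = wrr_by_length(["art", "tin", "inquiry"],
                      ["art", "arm", "inquiry"])
```

The CAR of Eq. (17) can be grouped per length in exactly the same way.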
the differences between the writing styles of different participants. If a writer in the test set has a very different writing style compared to the majority in the training set learned by the classifier, lower performance is expected.

Fig. 19 The experimental results of the writer-independent evaluation. The dataset is divided into three parts: Part-A, Part-B, and Part-C. The handwritten samples of each participant in each part are guaranteed against appearing in the other two parts

Table 2 The average recognition performance of the writer-independent evaluation

     | CAR    | WRR (2280) | WRR (25k) | WRR (None)
Mean | 96.20% | 97.24%     | 95.84%    | 86.30%
Std  | 0.34   | 0.24       | 0.31      | 1.02

6.11 Qualitative analysis

In this subsection, we present a qualitative analysis of the test results with a few examples, as listed in Table 3. Part I shows some samples which can be easily recognized by humans; those words can also be correctly recognized by the classifier, since the adjacent characters vary considerably and thus can be easily distinguished even when they are joined. However, Part II shows some samples which are barely discernible for both humans and the classifier. For example, a person cannot correctly recognize the word "ornament" at first sight, since the letters "r" and "n" are connected by a ligature, and thus the person might wrongly recognize the word "ornament" as "omament", as the classifier did via the lexicon-free transcription. However, the person would soon realize that "omament" is not a meaningful word according to his knowledge, and then he might guess that the handwritten word should be "ornament". Therefore, the uni-stroke writing style of in-air handwriting causes ambiguity in specific situations. Part III shows some samples which are hard for humans to recognize but can be recognized by the proposed classifier. For in-air handwriting, the fingertip suffers less friction and space restriction, and thus the user might write more freely and more casually than in conventional online handwriting. Therefore, those samples are similar to illegible scrawls, as shown in Part III. In addition, Part IV shows some low-quality handwritten samples which are hard to recognize correctly. However, it is interesting that the classifier recognizes the sample "anxiety" as "anxic" or "arrow", which makes sense to a certain extent.

Table 3 The qualitative analysis of the test results. The "Trajectories" column of the original table, which shows the handwritten samples, is not reproduced here

No.      | Ground truth | Prediction (lexicon-free) | Prediction (lexicon-based)
Part I   | "pattern"    | "pattern"                 | "pattern"
         | "confession" | "confession"              | "confession"
Part II  | "ornament"   | "omament"                 | "ornament"
         | "storm"      | "stonn"                   | "storm"
Part III | "inquiry"    | "inquiry"                 | "inquiry"
         | "mankind"    | "markid"                  | "mankind"
Part IV  | "anxiety"    | "anxic"                   | "arrow"
         | "motherhood" | "nelinemance"             | "religon"

Table 4 The analysis of failure cases. The "Trajectories" column of the original table, which shows the handwritten samples, is not reproduced here

No.      | Ground truth | Prediction (lexicon-free) | Prediction (lexicon-based)
Case I   | "till"       | "tin"                     | "tin"
         | "tail"       | "tall"                    | "tall"
Case II  | "scold"      | "scose"                   | "score"
         | "music"      | "ming"                    | "mind"
Case III | "bucket"     | "bunce"                   | "bunch"
         | "system"     | "spicen"                  | "spice"
Case IV  | "wrap"       | "warp"                    | "warp"
         | "connect"    | "cornet"                  | "comet"

6.12 Analysis of failure cases

Although the proposed approach has achieved considerably high performance, an analysis of failure cases is conducted to find the possible factors that may affect the accuracy of the proposed method. As listed in Table 4, we have summarized four major failure cases. Case I shows some ambiguous samples in which the ambiguity is caused by the ligature between adjacent characters. For example, in the sample "till" shown in Case I, the ligature connects both letters "l", which makes the corresponding part similar to the letter "n"; however, both "till" and "tin" are meaningful words in the given lexicon. As a result, the proposed method fails to recognize a few samples shown in Case I due to the ambiguity of in-air handwritten words. Instead, Case II gives some special samples which are written in an unusual temporal order. In the sample "music" shown in Case II, both the letter "s" and the letter "c" are written in reverse order, which is entirely different from the conventional writing style. As the writing styles of the samples shown in Case II are notably different from the majority of the training set learned by the classifier, the classifier fails to recognize those words. Moreover, Case III shows some low-quality handwritten samples which are written too casually, and those samples are still difficult for the classifier to recognize. Finally, in Case IV, during the word recording process, the participants carelessly wrote wrong words that are inconsistent with the ground truths, so the classifier cannot correctly recognize those false samples in theory or in practice.

6.13 Comparison with the existing approaches

To prove the effectiveness of the proposed approach, a comparison between the proposed method and other existing recognition methods is necessary. However, we only compare the proposed method with several typical recognition methods that can be applied to in-air handwritten word recognition, because some recognition systems are unavailable to us. The comparison results are summarized in Table 5. According to the experimental results, we find that the proposed model is very effective for in-air handwritten English word recognition.

Table 5 The comparison between the proposed method and the existing approaches

Methods                                     | WRRs (%)
Holistic HMM (32 states) [20]               | 48.74
Directional Features + MQDF [34]            | 89.64
DCNN + Softmax [35]                         | 92.15
DBLSTM + Softmax [26]                       | 96.34
DBLSTM + CTC [24]                           | 96.64
DBLSTM(subsample) + CTC [25]                | 97.37
DBLSTM(subsample) + Translator (proposed)   | 97.74

Bold indicates the best result in the corresponding column

6.14 Evaluation on public in-air handwriting dataset

The proposed method is also evaluated on a smaller but public in-air handwriting dataset released by Kumar et al. [5]. The dataset involves ten participants, and each participant is required to write 21 different English words four times, which results in a total of 840 recordings. Since the dataset is relatively small, the experiment is conducted with a tenfold cross-validation by equally dividing the dataset into ten parts according to the different participants. As listed in Table 6, the proposed approach has achieved 89.65% WRR via lexicon-free transcription

and 89.77% WRR via lexicon-based transcription on the public dataset. Since the architecture is a deep neural network while the evaluated dataset is relatively small, the performance of the proposed approach is poor. However, we find that pre-training our model with the large dataset IAHEW-UCAS2016 can significantly benefit the recognition performance on the evaluated dataset, in which case the model achieves a WRR of 95.72% via lexicon-free transcription and 98.22% via lexicon-based transcription.

Table 6 The evaluation of the proposed method on the public in-air handwriting dataset

Authors          | Methods                            | Lexicon-size | WRRs (%)
Kumar et al. [5] | HMM                                | 100          | 90.24
                 | HMM + Dynamic features             | 100          | 92.73
Ours             | DBLSTM + Translator                | None         | 89.65
                 |                                    | 21           | 89.77
                 | DBLSTM + Translator + Pre-training | None         | 95.72
                 |                                    | 21           | 98.22

Bold indicates the best result in the corresponding column

6.15 Evaluation on conventional handwriting dataset

In-air handwriting recognition is more challenging than conventional handwriting recognition due to the unique characteristics of in-air handwritten words, as mentioned in Sect. 3.2. However, except for the preprocessing steps, the proposed method mainly works on the word recognition phase. Therefore, the proposed approach should also be effective for conventional handwriting recognition if the corresponding features are extracted.

To demonstrate the model's generalization to conventional handwriting recognition, we evaluate the proposed method on the online handwriting dataset Pinyin, which is a subset of SCUT-COUCH2009 released by Jin et al. [36]. The samples in the dataset Pinyin were collected using personal digital assistants (PDAs) as well as smartphones with touch screens, and the dataset contains 130 writers' samples of 2010 categories of pinyins (130 × 2010 samples in total). In addition, the dataset Pinyin is challenging due to the existence of many similar pinyins such as "zui", "zuī", "zuí", "zuǐ" and "zuì". Similar to Jin et al. [36], for each category we randomly select 104 (or 80%) samples for training and the remaining for testing. Moreover, we also investigate whether the pen-up/pen-down information contributes to the word recognition, since one of the major differences between in-air handwriting and conventional handwriting is that the former lacks the pen-up/pen-down information. Therefore, we compare the WRRs of different approaches with and without the use of pen-up/pen-down information. From the experimental results summarized in Table 7, we observe that the WRR increases when the pen-up/pen-down information is utilized for the recognition. This indicates that the pen-up/pen-down information improves the recognition performance. Furthermore, the experimental results show that the proposed method also performs effectively for conventional online handwriting recognition.

Table 7 The evaluation of the proposed method on the conventional handwriting dataset Pinyin of SCUT-COUCH2009. In the column WRRs, "Pen" denotes that the pen-up/pen-down information is utilized for the recognition, and "None" corresponds to the opposite. In addition, "–" denotes that the corresponding approach does not exist

Methods                                     | WRRs None (%) | WRRs Pen (%)
Directional Features + MQDF (baseline) [34, 36] | 81.73     | –
DCNN + Softmax [35]                         | –             | 90.74
DBLSTM + Softmax [26]                       | 92.43         | 93.59
DBLSTM + CTC [24, 25]                       | 93.01         | 93.83
DBLSTM + Translator (proposed)              | 93.03         | 93.86

Bold indicates the best result in the corresponding column

7 Conclusion and future works

In-air handwriting is different from conventional handwriting based on touch devices because of its unique characteristics. In this paper, we present an in-air handwritten English word recognition system. Furthermore, we propose an effective approach, called the attention recurrent translator, for in-air handwritten English word recognition. We evaluate the proposed architecture on the collected in-air handwritten English word dataset, and the experimental results show that the proposed approach is very effective for in-air handwritten English word recognition.

However, although the collected in-air handwritten English word dataset covering 2280 English words may be sufficient for current requirements, we plan to expand our dataset with a larger vocabulary in future works for the sake of meeting more practical cases.

Moreover, it would be interesting to further investigate possible extensions of our method with respect to CNNs. With the rising popularity of deep learning, CNN-based deep learning has achieved many successes in feature representation and classification [37–39]. As a CNN can effectively learn features from images, the complexity of manual feature design can be avoided. We believe the proposed approach can be extended to other vision-based applications in cooperation with a CNN, such as vision-based continuous gesture recognition. For example, in vision-based continuous gesture recognition, we could first use a CNN for feature extraction, then feed the extracted features into the proposed recognizer, and finally the recognizer would predict the gesture sequence.

Acknowledgements The work was supported in part by the National Key Research and Development Program of China (No. 2017YFB1002203), by the National Natural Science Foundation of China (Nos. 61772495 and 61232013), and by the Beijing Advanced Innovation Center for Imaging Technology (No. BAICIT-2016009).

Compliance with ethical standards

Conflict of interest We declare that we have no financial and personal relationships with other people or organizations that could inappropriately influence our work; there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

References

1. Amma C, Gehrig D, Schultz T (2010) Airwriting recognition using wearable motion sensors. In: Augmented human international conference, p 10
2. Amma C, Georgi M, Schultz T (2012) Airwriting: hands-free mobile text input by spotting and continuous recognition of 3d-space handwriting with inertial sensors. In: International symposium on wearable computers, pp 52–59
3. Chen M, Alregib G, Juang BH (2016) Air-writing recognition - part I: modeling and recognition of characters, words, and connecting motions. IEEE Trans Hum Mach Syst 46(3):403–413
4. Kumar P, Saini R, Roy P, Dogra D (2016) Study of text segmentation and recognition using leap motion sensor. IEEE Sens J 17(5):1293–1301
5. Kumar P, Saini R, Roy PP, Dogra DP (2016) 3D text segmentation and recognition using leap motion. Multimed Tools Appl 76(15):16491–16510
6. Zhang X, Ye Z, Jin L, Feng Z, Xu S (2013) A new writing experience: finger writing in the air using a kinect sensor. IEEE Multimed 20(4):85–93
7. Vikram S, Li L, Russell S (2013) Handwriting and gestures in the air, recognizing on the fly. In: Proceedings of the CHI, pp 1179–1184
8. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473
9. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, et al (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144
10. Zhang H, Li J, Ji Y, Yue H (2017) Understanding subtitles by character-level sequence-to-sequence learning. IEEE Trans Ind Inform 13(2):616–624
11. Chan W, Jaitly N, Le Q, Vinyals O (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: IEEE international conference on acoustics, speech and signal processing, pp 4960–4964
12. Chorowski JK, Bahdanau D, Serdyuk D, Cho K, Bengio Y (2015) Attention-based models for speech recognition. In: Advances in neural information processing systems, pp 577–585
13. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057
14. Xu N, Wang W, Qu X (2015) Recognition of in-air handwritten Chinese character based on leap motion controller. In: International conference on image and graphics, pp 160–168
15. Qu X, Wang W, Lu K, et al (2016) High-order directional features and sparse representation based classification for in-air handwritten Chinese character recognition. In: International conference on multimedia and expo (ICME), pp 1–6
16. Ren H, Wang W, Lu K, Zhou J, Yuan Q (2017) An end-to-end recognizer for in-air handwritten Chinese characters based on new recurrent neural networks. In: International conference on multimedia and expo (ICME)
17. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
18. Graves A, Gomez F (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: International conference on machine learning, pp 369–376
19. Kimura F, Takashina K, Tsuruoka S, Miyake Y (1987) Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Trans Pattern Anal Mach Intell 9(1):149–153
20. Dehghan M, Faez K, Ahmadi M, Shridhar M (2001) Handwritten Farsi (Arabic) word recognition: a holistic approach using discrete HMM. Pattern Recognit 34(5):1057–1065
21. Liwicki M, Bunke H (2005) IAM-OnDB: an on-line English sentence database acquired from handwritten text on a whiteboard. In: Eighth international conference on document analysis and recognition, pp 956–961
22. Graves A, Liwicki M, Fernández S, Bertolami R, Bunke H, Schmidhuber J (2008) A novel connectionist system for unconstrained handwriting recognition. IEEE Trans Pattern Anal Mach Intell 31(5):855–868
23. Ahmed SB, Naz S, Razzak MI, Rashid SF, Afzal MZ, Breuel TM (2016) Evaluation of cursive and non-cursive scripts using recurrent neural networks. Neural Comput Appl 27(3):603–613
24. Frinken V, Uchida S (2015) Deep BLSTM neural networks for unconstrained continuous handwritten text recognition. In: International conference on document analysis and recognition, pp 911–915
25. Ray A, Rajeswar S, Chaudhury S (2015) Text recognition using deep BLSTM networks. In: Eighth international conference on advances in pattern recognition, pp 1–6
26. Zhang XY, Yin F, Zhang YM, Liu CL, Bengio Y (2017) Drawing and recognizing Chinese characters with recurrent neural network. IEEE Trans Pattern Anal Mach Intell 99:1–1
27. Naz S, Umar AI, Ahmad R, Ahmed SB, Shirazi SH, Razzak MI (2015) Urdu Nasta'liq text recognition system based on multi-dimensional recurrent neural network and statistical features. Neural Comput Appl 28(2):219–231
28. Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: International conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6645–6649
29. Graves A (2012) Supervised sequence labelling with recurrent neural networks. Springer, Berlin
30. Luong T, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. In: EMNLP
31. Chung J, Gulcehre C, Cho KH, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555
32. De La Briandais R (1959) File searching using variable length keys. In: Western joint computer conference. ACM, pp 295–298
33. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
34. Ding K, Jin L, Gao X (2009) A new rotation-free method for online unconstrained handwritten Chinese word recognition: a holistic approach. In: International conference on document analysis and recognition, pp 1131–1135
35. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
36. Jin L, Gao Y, Liu G, Ding K (2011) SCUT-COUCH2009: a comprehensive online unconstrained Chinese handwriting database and benchmark evaluation. Int J Doc Anal Recognit 14(1):53–64
37. Oyedotun OK, Khashman A (2017) Deep learning in vision-based static hand gesture recognition. Neural Comput Appl 28(12):3941–3951
38. Zhang H, Wang S, Cao X, Yue H, Wang K (2016) Learning to link human objects in videos and advertisements with clothes retrieval. In: International joint conference on neural networks (IJCNN). IEEE, pp 5006–5013
39. Zhang H, Cao X, Ho JK, Chow TW (2017) Object-level video advertising: an optimization framework. IEEE Trans Ind Inform 13(2):520–531