DOI 10.1007/s00521-017-3260-9
ORIGINAL ARTICLE
Received: 5 July 2017 / Accepted: 14 October 2017 / Published online: 3 November 2017
© The Natural Computing Applications Forum 2017
3156 Neural Comput & Applic (2019) 31:3155–3172
Fig. 1 The framework of the in-air handwritten English word recognition. After the acquisition of the in-air handwritten word, the system normalizes the collected trajectory and extracts the corresponding feature sequence. In the recognition stage, the neural network model translates the input feature sequence into the final hypothesis
With current advances in technology, in-air handwriting can be achieved with many somatosensory devices such as Kinect (https://developer.microsoft.com/en-us/windows/kinect) and Leap Motion (https://www.leapmotion.com), which are capable of capturing finger positions and movements in real time. The high precision and low cost of these somatosensory devices make it possible for in-air handwriting to become ubiquitous and popular. By using different kinds of motion tracking devices, many in-air handwriting systems have been implemented in recent years [1–7]. However, the majority of the existing in-air handwriting systems mainly focus on character and digit recognition. Even though a few in-air handwriting systems [2–5, 7] have investigated word or sentence recognition, they are limited by the small amount of training data and the lack of robust classification algorithms, which might make them far from practical.

Therefore, the primary objective of this study is to implement a practical in-air handwriting system which can recognize in-air handwritten English words in a large vocabulary by using a more robust recognizer. Handwriting is actually defined on a 2D surface regardless of the actual writing motion, so we can consider in-air handwriting as the writing motion projected onto a vertical plane. The framework of the presented in-air handwriting system is shown in Fig. 1, and we detail each module in the following sections.

The main contributions of the paper are summarized as follows:

– We present an in-air handwriting system which can record and recognize in-air handwritten English words in a large vocabulary, where the in-air handwriting is achieved via the Leap Motion controller.
– We construct a moderately large in-air handwritten English word dataset. The collected dataset contains 2280 English words, and each word has 66 different samples, which results in a total of 150,480 recordings.
– We propose a new approach for in-air handwritten English word recognition, inspired by the attention-based models for neural machine translation [8–10], speech recognition [11, 12], and image captioning [13]. To the best of our knowledge, this is the first attempt to extend the attention model to in-air handwritten English word recognition.

The rest of the paper is organized as follows. Section 2 briefly reviews the relevant works on in-air handwriting and word recognition methods. Section 3 introduces in-air handwriting and the collected in-air handwritten English word dataset. Section 4 presents the preprocessing steps for the in-air handwriting, and Sect. 5 presents the proposed recognizer in detail. Section 6 conducts the experimental evaluation, and Sect. 7 concludes the paper and discusses future works.

2 Related works

2.1 In-air handwriting

As a natural and comfortable mode of human–computer interaction, in-air handwriting is a quite attractive and exciting research topic. With current advances in technology, various kinds of in-air handwriting systems have been implemented with different somatosensory devices.

Amma et al. [1, 2] presented an air-writing system based on a wearable glove. They [1] used the hidden Markov model (HMM) to classify 26 English letters. They conducted experiments on a dataset containing a total of 6500 characters, and their method achieved an average writer-independent character recognition rate of 81.9%. Furthermore, their approach achieved a single-writer word
recognition rate of 97.5% based on a small vocabulary of 652 words. In their later work [2], they proposed a two-stage approach for spotting and recognition of continuous air-writing, in which the spotting stage used a support vector machine (SVM) and the recognition stage used the HMM classifier. They conducted experiments on a dataset containing a total of 720 English sentences, and a word error rate of 11% was achieved for the person-independent setup.

Zhang et al. [6] presented a vision-based air-writing system focusing on Chinese character recognition based on Kinect. They used the compact modified quadratic discriminant function (MQDF) classifier to recognize 6763 frequently used Chinese characters, 26 English letters (both uppercase and lowercase), and 10 digits. The experiments were conducted on 375 videos with a total of 44,522 frames, and a recognition rate of 78.46% for Chinese characters was achieved. On the other hand, Xu et al. [14] presented an in-air handwriting system for Chinese character recognition based on Leap Motion. They recognized the projected 2D handwritten trajectories by combining directional features and the MQDF classifier. They conducted experiments on their constructed dataset IAHCC-UCAS2014 containing 244,075 recordings of 3755 Chinese characters, and a character recognition rate of 69.67% was achieved. In later works, other recognition methods were evaluated on their expanded dataset IAHCC-UCAS2015, such as the sparse representation-based classifier [15] and the recurrent neural network (RNN) with a softmax classifier [16], and better recognition results were achieved.

Vikram et al. [7] achieved air-writing based on Leap Motion. They recognized the writing sequence by using a dynamic time warping (DTW) classifier on their constructed dataset, which contains 3000 recordings of 30 English words and 26,000 recordings of 52 English letters. On the other hand, Chen et al. [3] addressed overlapped air-writing using the Leap Motion, in which the characters of a handwritten word are overlapped. The authors used the HMM for both air-writing modeling and recognition. They conducted experiments on a dataset which has 5720 motion characters, 4400 motion words covering a 40-word vocabulary from common television channels, and 1000 motion words covering a 1k-word vocabulary. The proposed system achieved a word error rate of 0.8% for word-based recognition and 1.9% for letter-based recognition. Recently, Kumar et al. [5] presented 3D text segmentation and recognition using Leap Motion. They segmented the text into words first and then used the HMM for word recognition. They conducted experiments on a dataset containing 840 recordings covering 21 different words. Their approach achieved an overall accuracy of 80.3% in word segmentation and 92.73% in word recognition. In their later work [4], the authors further used the bidirectional long short-term memory (BLSTM) [17] and connectionist temporal classification (CTC) [18] for word recognition, which outperforms the HMM classifier.

However, the majority of the existing in-air handwriting systems mainly focus on character and digit recognition. Even though a few in-air handwriting systems have investigated word or sentence recognition, they are limited by the small amount of training data as well as the lack of robust classification algorithms, which might make them far from practical. In this work, we present an in-air handwriting system based on Leap Motion for in-air handwritten English word recognition in a large vocabulary. We extend the previous works in the following aspects: (1) the system is specifically designed for English word recognition rather than character or digit recognition; (2) a much larger dataset of in-air handwritten English words is constructed, which can provide sufficient samples for training a robust classifier; (3) a novel and effective neural network architecture based on the attention mechanism is proposed for in-air handwritten English word recognition.

2.2 Word recognition methods

Generally, the approaches to handwritten word recognition can be classified into two categories: the analytic approach and the holistic approach. The former first recognizes each isolated character individually and then maps the output characters into the complete word. The latter treats the word as a single, indivisible entity and attempts to recognize it using features of the word as a whole.

Many approaches can be applied to handwritten word recognition, such as DTW [7], MQDF [19], and HMM [20, 21]. However, many recent studies show that deep neural networks usually outperform those traditional approaches for recognition tasks. For example, Graves et al. [22] have used BLSTM with CTC for unconstrained handwritten text recognition, and the experimental results show that such a network architecture outperforms the HMM classifier. RNNs are highly applicable to tasks such as unsegmented connected handwriting recognition and speech recognition, since an RNN can process sequence data with longer-range context information than an HMM. Specifically, Ahmed et al. [23] presented a thorough investigation of the performance of RNNs for cursive and non-cursive scripts. Inspired by the great success of deep learning, deep bidirectional recurrent neural networks (DBRNNs) have also been investigated for handwriting recognition and outperform single-layer RNNs [24, 25]. Furthermore, the
3 In-air handwriting
Fig. 7 Some examples of the ambiguity of in-air handwritten words. The ligatures between the adjacent characters make the words rather confusing. a "accept", b "improve", c "language", d "morning", e "invention", f "arm"
handwritten words contain much more diversity than conventional handwritten words, as shown in Fig. 8.

4 Preprocessing

It is necessary to normalize handwritten words in order to remove noise and reduce the differences between the writing styles of different writers. After normalization, the feature extraction step is applied to prepare for the recognition phase. In the following subsections, we describe the preprocessing steps, including word normalization and feature extraction, in detail.

4.1 Word normalization

Since the writing styles of different writers are considerably different, it is important to normalize the handwriting trajectories. As shown in Fig. 9, we normalize each word trajectory according to the following steps. (1) Computing rotation: calculating a linear regression line from the local minimum points of the trajectory; if the residual variance is larger than a defined threshold, the point with the largest Euclidean distance from the regression line is removed, and the regression line is calculated again. (2) Normalizing rotation: rotating the trajectory according to the slant calculated in step (1). (3) Normalizing size: transforming every word to the same given height while preserving the aspect ratio of the word. (4) Smoothing: smoothing locally by computing the new position $(\hat{x}_i, \hat{y}_i)$ for each point $(x_i, y_i)$ based on its neighbors $(x_{i-1}, y_{i-1})$, $(x_{i+1}, y_{i+1})$ and itself, i.e., $(\hat{x}_i, \hat{y}_i) = 0.25\,(x_{i-1}, y_{i-1}) + 0.5\,(x_i, y_i) + 0.25\,(x_{i+1}, y_{i+1})$. (5) Resampling: replacing the sequence of captured points with a new series of points having the same spatial distance.

Fig. 9 The process of word normalization. All the handwritten words are normalized to remove noise and reduce the differences between the writing styles of different writers

4.2 Feature extraction

Feature extraction is essential for in-air handwritten word recognition. As the in-air handwritten word is always finished in one single stroke, the recorded trajectory can be considered a very smooth curve. For the in-air handwritten word trajectory, which is a sequence of (x, y)-coordinates representing the location of the fingertip in a vertical plane, we derive the following features, as shown in Fig. 10: (1) the offset of the x-coordinates $\Delta x$ as well as the offset of the y-coordinates $\Delta y$; (2) the cosine and sine of the writing direction $\alpha$; (3) the cosine and sine of the curvature $\beta$. We think the extracted features can effectively represent the smooth curve of the in-air handwritten trajectory while maintaining the temporal information. Different from conventional handwritten word recognition, we cannot capture the pen-up/pen-down information due to the uni-stroke writing style of the in-air handwritten word. As a result, we extract a six-dimensional feature for each point and then obtain a sequence of features.

5 Word recognition

In this section, we detail the proposed approach for in-air handwritten word recognition. Given an input feature sequence $x = (x_1, \ldots, x_i, \ldots, x_N)$, we need to find the label sequence $y = (y_1, \ldots, y_l, \ldots, y_L)$ which has the maximal conditional probability, where $y_l \in \{a, b, \ldots, z, \mathrm{SOS}, \mathrm{EOS}, \mathrm{PAD}\}$; "SOS" and "EOS" are the special tokens which denote the start-of-sequence and the end-of-sequence, respectively, and "PAD" denotes the blank token. Formally,

$$y^* = \arg\max_y P(y \mid x). \qquad (1)$$

The proposed architecture directly models the conditional probability $P(y \mid x)$ of translating the input feature sequence $x$ into the target sequence $y$. The overview of the architecture, called the attention recurrent translator, is shown in Fig. 11. The translator basically consists of two modules: (1) the encoder, which computes the representation of the input sequence $x$; (2) the decoder, which outputs each
Fig. 10 Handcrafted features of the in-air handwritten trajectory. A six-dimensional feature is extracted for each point, and then a sequence of features is obtained for each trajectory. a Offset-x $\Delta x$ and offset-y $\Delta y$, b writing direction $\alpha$, c curvature $\beta$
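As a rough sketch (not the authors' code), the smoothing, resampling, and feature-extraction steps of Sect. 4 can be written in NumPy as follows; the function names are ours, and the curvature is taken here as the turning angle between consecutive direction vectors, which is one plausible reading of Fig. 10c:

```python
import numpy as np

def smooth(traj):
    """Local smoothing (Sect. 4.1, step 4):
    p_hat_i = 0.25*p_{i-1} + 0.5*p_i + 0.25*p_{i+1}."""
    out = traj.copy()
    out[1:-1] = 0.25 * traj[:-2] + 0.5 * traj[1:-1] + 0.25 * traj[2:]
    return out

def resample(traj, step=1.0):
    """Resampling (step 5): replace captured points with points
    equally spaced along the arc length of the curve."""
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # arc length at each point
    grid = np.arange(0.0, s[-1], step)
    x = np.interp(grid, s, traj[:, 0])
    y = np.interp(grid, s, traj[:, 1])
    return np.stack([x, y], axis=1)

def features(traj):
    """Six features per point (Sect. 4.2): offsets dx, dy,
    cos/sin of the writing direction, cos/sin of the curvature."""
    d = np.diff(traj, axis=0)                          # offsets (dx, dy)
    norm = np.linalg.norm(d, axis=1, keepdims=True) + 1e-8
    dirs = d / norm                                    # (cos a, sin a)
    # curvature: angle between consecutive unit direction vectors
    cos_b = (dirs[:-1] * dirs[1:]).sum(axis=1)
    sin_b = dirs[:-1, 0] * dirs[1:, 1] - dirs[:-1, 1] * dirs[1:, 0]
    n = len(d) - 1
    return np.concatenate(
        [d[:n], dirs[:n], np.stack([cos_b, sin_b], axis=1)], axis=1)
```

A trajectory `raw_xy` of shape (N, 2) would then yield a feature sequence via `features(resample(smooth(raw_xy)))`, one six-dimensional vector per resampled point.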
$$h_i^j = \mathrm{BLSTM}\!\left(h_{i-1}^j, h_i^{j-1}\right). \qquad (5)$$

$$s_l = \mathrm{Decoder}\!\left(s_{l-1}, y_{l-1}, h\right). \qquad (8)$$

Specifically, the decoder first computes the hidden state $\hat{s}_l$ according to

$$\hat{s}_l = \mathrm{RNN}\!\left(s_{l-1}, y_{l-1}\right), \qquad (9)$$

where RNN denotes a recurrent cell such as the long short-term memory (LSTM) [17] or the gated recurrent unit (GRU) [31]. Then the attention mechanism is applied to derive the context vector $c_l$, which captures the relevant information from the encoded feature sequence $h$ and helps to generate the current character $y_l$. Formally, the context vector $c_l$ is calculated as

$$c_l = \mathrm{Attention}\!\left(\hat{s}_l, h\right). \qquad (10)$$

Finally, the decoder combines the information of the current hidden state $\hat{s}_l$ and the calculated context vector $c_l$ to produce the current decoder state $s_l$, which is computed as

$$s_l = \tanh\!\left(W_c \left[\hat{s}_l; c_l\right]\right), \qquad (11)$$

where $(x, y)$ denotes each pair of the input feature sequence $x$ and the target sequence $y$. The model minimizes the loss $\mathcal{L}$ according to

$$\min_\theta \; -\!\!\sum_{(x,y)\in D} \log P(y \mid x; \theta) \qquad (16a)$$

$$= \min_\theta \; -\!\!\sum_{(x,y)\in D} \sum_{l=1}^{L} \log P(y_l \mid x, y_{<l}; \theta), \qquad (16b)$$

where $y_{<l}$ denotes the ground truth of the previous $l-1$ characters, and $\theta$ denotes the model parameters.

5.4 Word decoding

The goal of the network is to find the label sequence $y^*$ with the maximal conditional probability, i.e., $y^* = \arg\max_y P(y \mid x)$. For in-air handwritten word recognition, mainly two types of transcription can be used to
determine the sequence $y^*$: lexicon-free transcription and lexicon-based transcription.

Lexicon-free transcription. In lexicon-free transcription, the generated character sequence $y^*$ is not always a meaningful word. The sequence prediction starts with the "SOS" token, and the model samples from the probability distribution by taking the character with the maximal probability at each timestep, i.e., $y_l \sim \mathrm{Softmax}(W_s s_l + b_s)$, until the "EOS" token is produced, which indicates the end of the inference process.

Lexicon-based transcription. In lexicon-based transcription, the label sequence $y^*$ is found by selecting the sequence $y$ in the given lexicon $D$ that has the maximal conditional probability, i.e., $y^* = \arg\max_{y \in D} P(y \mid x)$, where $P(y \mid x)$ is calculated according to the chain rule, $P(y \mid x) = \prod_{l=1}^{L} P(y_l \mid x, y_{<l})$. However, it is time-consuming to decode over a huge lexicon, since the whole lexicon must be traversed. As a result, an efficient search scheme is needed to prune the search space. Since many words in the dictionary share the same prefixes and the next generated character $y_l$ of each word depends on its previous output characters $y_{<l}$, a pre-constructed prefix tree [32] over the whole given lexicon $D$ can benefit the lexicon-based transcription. As shown in Fig. 12, the prediction starts with the root node "SOS", and at each step the model chooses the child node $y_l$ on the path $y = (y_{<l}, y_l)$ which has the maximal probability, i.e., $(y_{<l}, y_l)^* = \arg\max_{y_{<l}, y_l} P(y_l \mid y_{<l}, x)\,P(y_{<l} \mid x)$. The process repeats until a leaf node "EOS" is reached, and then a word is found on the path $y$ from the root to that leaf. Moreover, the search efficiency could be further improved by incorporating beam search, though this might result in a minor loss of recognition performance.

Fig. 12 An example of lexicon-based transcription using a prefix tree. The prefix tree is constructed with four words: "act", "art", "arm", and "a". The root is the "SOS" token, and the leaves are "EOS" tokens. Each child node denotes one character, and the edge between adjacent nodes gives the conditional probability of outputting the next character. The recognition starts at the root and repeats until one leaf node is reached. The orange nodes denote the path having the maximal probability over the prefix tree, and the corresponding predicted word is "art"

It should be noted that lexicon-based transcription fully considers the context information, since it chooses the word having the maximal likelihood over the given lexicon, which is similar to the holistic approach. Therefore, we think lexicon-based transcription can reduce the ambiguity of in-air handwritten words by considering the context information and thus benefits the word recognition. On the other hand, different from lexicon-free transcription, lexicon-based transcription always outputs a word within the given lexicon. The experimental results reported in Sect. 6.7 will show that lexicon-based transcription outperforms lexicon-free transcription for in-air handwritten English word recognition.

6 Experimental evaluation

6.1 Dataset preparation

IAHEW-UCAS2016. With the in-air handwriting system, we construct an in-air handwritten English word dataset named IAHEW-UCAS2016, which involves 324 different participants. The built dataset contains 150,480 recordings covering 2280 English words, where every word has 66 different samples. Some samples of the handwritten English words are shown in Fig. 6. In the experiments, 44 samples per class are randomly chosen for training and the remaining for testing. It should be noted that for each category, the handwritten samples of each participant in the training set do not appear in the test set.

6.2 Vocabularies for word recognition

To evaluate the effect of vocabulary size on scalability, we prepare two vocabularies of different sizes for word recognition in our experiments: a small vocabulary of 2280 English words and a large vocabulary of 25k English words. The vocabulary of 2280 words is created by us according to frequencies of usage, as mentioned in Sect. 3.1.2. On the other hand, the vocabulary of 25k English words (see https://github.com/dolph/dictionary) is chosen from common words by statistically analyzing a sample of 29 million words used in English TV and movie scripts. It should be noted that word recognition gets harder as the given lexicon grows larger, since the increase in the number of word categories results in a larger search space. However, a considerably larger vocabulary covers many more practical cases for word recognition.
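The prefix-tree transcription of Sect. 5.4 can be sketched as follows. Here `char_probs` is a stand-in for the decoder's per-step distribution $P(y_l \mid x, y_{<l})$ (in the paper it comes from the softmax over decoder states), so the class and function names are illustrative rather than the authors' code:

```python
import math

class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False  # an "EOS" leaf can hang off this node

def build_trie(lexicon):
    """Build a prefix tree over the given lexicon D."""
    root = TrieNode()
    for word in lexicon:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def decode(root, char_probs):
    """Greedy lexicon-based transcription: at each step follow the child
    with the highest P(y_l | x, y_<l); emit "EOS" only where a word ends."""
    node, prefix, log_p = root, "", 0.0
    while True:
        probs = char_probs(prefix)  # dict: char or "EOS" -> probability
        cands = {c: probs.get(c, 0.0) for c in node.children}
        if node.is_word:
            cands["EOS"] = probs.get("EOS", 0.0)
        best = max(cands, key=cands.get)
        log_p += math.log(max(cands[best], 1e-12))
        if best == "EOS":
            return prefix, log_p
        node = node.children[best]
        prefix += best
```

Replacing the single greedy choice with the $k$ best children at each step gives the beam-search variant whose width is studied in Sect. 6.8.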
and other existing approaches will be made later in Sect. 6.13.

6.6 Attention visualization

In this subsection, we visualize how the model generates the output characters based on the encoded features with the content-based attention mechanism. The attention alignment between the encoded features and the output characters is visualized in Fig. 15. It should be noted that the model can learn an explicit monotonic alignment between the encoded features and the output characters by using the attention mechanism, without the need for any location-based priors. More specifically, as shown in Fig. 16, we observe that the model pays attention to different parts of the handwritten word, with the attention mechanism shifting from one character to the next in the proper order.

6.7 Effects of lexicon

We also investigate how the given lexicon affects the WRR in the experiments. As shown in Table 1, the WRR via lexicon-free transcription is quite low, for the following reason: the in-air handwritten English word is ambiguous due to its unique characteristics, as mentioned in Sect. 3.2, and thus it is difficult to recognize every character correctly; even the failure of a single character will fail the word recognition. However, the WRR via lexicon-based transcription increases significantly when using the lexicon of 2280 English words, since a correction mechanism is introduced and the given lexicon helps the word recognition. On the other hand, the WRR is lower but still satisfactory when using the large lexicon containing 25k English words, which indicates that the proposed recognizer is suitable for recognizing in-air handwritten English words in a considerably large vocabulary.

6.8 Effects of beam width

We investigate how the width of the beam search affects the recognition performance of the model. According to the results shown in Fig. 17, the CAR and WRR via lexicon-free transcription consistently rise when increasing the beam width up to 4. Moreover, a significant performance improvement is observed by increasing the beam width
Fig. 16 The attention visualization at each timestep in the decoding phase for the trajectory "confession". At each timestep, the attention mechanism is visualized by deepening the black color of the corresponding part of the trajectory. It can be observed that the attention mechanism shifts from one character to the next in the proper order
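One decoding step of the content-based attention in Eqs. (9)–(11) can be sketched roughly as follows. The paper does not spell out the scoring function inside $\mathrm{Attention}(\hat{s}_l, h)$, so the dot-product score and all array names here are our simplification, not the trained model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(s_hat, h, W_c):
    """One decoder step: score every encoded frame h_i against the hidden
    state s_hat, form the context vector c_l (Eq. 10), and fuse them into
    the decoder state s_l = tanh(W_c [s_hat; c_l]) (Eq. 11)."""
    scores = h @ s_hat        # content-based score per input frame (assumed dot product)
    alpha = softmax(scores)   # attention weights, the quantity visualized in Fig. 16
    c = alpha @ h             # context vector c_l
    s = np.tanh(W_c @ np.concatenate([s_hat, c]))
    return s, alpha
```

Feeding `s` into an output layer, $y_l \sim \mathrm{Softmax}(W_s s_l + b_s)$, and taking the argmax at each step would then give the greedy lexicon-free transcription of Sect. 5.4.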
and 89.77% WRR via lexicon-based transcription on the public dataset. Since the architecture is a deep neural network while the evaluated dataset is relatively small, the performance of the proposed approach is poor. However, we find that pre-training our model on the large dataset IAHEW-UCAS2016 can significantly benefit the recognition performance on the evaluated dataset, in which case the model achieves a WRR of 95.72% via lexicon-free transcription and 98.22% via lexicon-based transcription.

6.15 Evaluation on conventional handwriting dataset

In-air handwriting recognition is more challenging than conventional handwriting recognition due to the unique characteristics of in-air handwritten words, as mentioned in Sect. 3.2. However, except for the preprocessing steps, the proposed method mainly works on the word recognition phase. Therefore, the proposed approach should also be effective for conventional handwriting recognition if the corresponding features are extracted.

To demonstrate the model's generalization to conventional handwriting recognition, we evaluate the proposed method on the online handwriting dataset Pinyin, which is a subset of SCUT-COUCH2009 released by Jin et al. [36]. The samples in the dataset Pinyin were collected using personal digital assistants (PDAs) as well as smartphones with touch screens, and the dataset contains 130 writers' samples of 2010 categories of pinyins (130 × 2010 samples in total). In addition, the dataset Pinyin is challenging due to the existence of many similar pinyins, such as "zui" and its tonal variants "zuī", "zuí", "zuǐ" and "zuì". Similar to Jin et al. [36], for each category we randomly select 104 (i.e., 80%) samples for training and the remaining for testing.

Moreover, we also investigate whether the pen-up/pen-down information contributes to word recognition, since one of the major differences between in-air handwriting and conventional handwriting is that the former lacks the pen-up/pen-down information. Therefore, we compare the WRRs of different approaches with and without the use of pen-up/pen-down information. From the experimental results summarized in Table 7, we observe that the WRR increases when the pen-up/pen-down information is utilized for the recognition, which indicates that the pen-up/pen-down information improves the recognition performance. Furthermore, the experimental results show that the proposed method also performs effectively for conventional online handwriting recognition.

Table 7 The evaluation of the proposed method on the conventional handwriting dataset Pinyin of SCUT-COUCH2009. In the WRRs columns, "Pen" denotes that the pen-up/pen-down information is utilized for the recognition, and "None" corresponds to the opposite. "–" denotes that the corresponding approach does not exist. Bold indicates the best result in the corresponding column

Methods                                            None (%)   Pen (%)
Directional Features + MQDF (baseline) [34, 36]    81.73      –
DCNN + Softmax [35]                                –          90.74
DBLSTM + Softmax [26]                              92.43      93.59
DBLSTM + CTC [24, 25]                              93.01      93.83
DBLSTM + Translator (proposed)                     93.03      93.86

7 Conclusion and future works

In-air handwriting is different from conventional handwriting based on touch devices because of its unique characteristics. In this paper, we present an in-air handwritten English word recognition system. Furthermore, we propose an effective approach, called the attention recurrent translator, for in-air handwritten English word recognition. We evaluate the proposed architecture on the collected in-air handwritten English word dataset, and the experimental results show that the proposed approach is very effective for in-air handwritten English word recognition.

However, although the collected in-air handwritten English word dataset covering 2280 English words may be sufficient for the current requirements of the task, we plan to expand our dataset with a larger vocabulary in future works to cover more practical cases.
Moreover, it would be interesting to further investigate possible extensions of our method with respect to CNNs. With the rising popularity of deep learning, CNN-based deep learning has achieved many successes in feature representation and classification [37–39]. As a CNN can effectively learn the features of images, the complexity of manual feature design can be avoided. We believe the proposed approach can be extended to other vision-based applications in cooperation with a CNN, such as vision-based continuous gesture recognition. For example, in vision-based continuous gesture recognition, we could first use a CNN for feature extraction, then feed the extracted features into the proposed recognizer, and finally the recognizer would predict the gesture sequence.

Acknowledgements The work was supported in part by the National Key Research and Development Program of China (No. 2017YFB1002203), by the National Natural Science Foundation of China (Nos. 61772495 and 61232013), and by the Beijing Advanced Innovation Center for Imaging Technology (No. BAICIT-2016009).

Compliance with ethical standards

Conflict of interest We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work; there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

References

1. Amma C, Gehrig D, Schultz T (2010) Airwriting recognition using wearable motion sensors. In: Augmented human international conference, p 10
2. Amma C, Georgi M, Schultz T (2012) Airwriting: hands-free mobile text input by spotting and continuous recognition of 3d-space handwriting with inertial sensors. In: International symposium on wearable computers, pp 52–59
3. Chen M, Alregib G, Juang BH (2016) Air-writing recognition - part I: modeling and recognition of characters, words, and connecting motions. IEEE Trans Hum Mach Syst 46(3):403–413
4. Kumar P, Saini R, Roy P, Dogra D (2016) Study of text segmentation and recognition using leap motion sensor. IEEE Sens J 17(5):1293–1301
5. Kumar P, Saini R, Roy PP, Dogra DP (2016) 3D text segmentation and recognition using leap motion. Multimed Tools Appl 76(15):16491–16510
6. Zhang X, Ye Z, Jin L, Feng Z, Xu S (2013) A new writing experience: finger writing in the air using a kinect sensor. IEEE Multimed 20(4):85–93
7. Vikram S, Li L, Russell S (2013) Handwriting and gestures in the air, recognizing on the fly. In: Proceedings of the CHI, pp 1179–1184
8. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473
9. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, et al (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144
10. Zhang H, Li J, Ji Y, Yue H (2017) Understanding subtitles by character-level sequence-to-sequence learning. IEEE Trans Ind Inform 13(2):616–624
11. Chan W, Jaitly N, Le Q, Vinyals O (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: IEEE international conference on acoustics, speech and signal processing, pp 4960–4964
12. Chorowski JK, Bahdanau D, Serdyuk D, Cho K, Bengio Y (2015) Attention-based models for speech recognition. In: Advances in neural information processing systems, pp 577–585
13. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057
14. Xu N, Wang W, Qu X (2015) Recognition of in-air handwritten Chinese character based on leap motion controller. In: International conference on image and graphics, pp 160–168
15. Qu X, Wang W, Lu K, et al (2016) High-order directional features and sparse representation based classification for in-air handwritten Chinese character recognition. In: International conference on multimedia and expo (ICME), pp 1–6
16. Ren H, Wang W, Lu K, Zhou J, Yuan Q (2017) An end-to-end recognizer for in-air handwritten Chinese characters based on a new recurrent neural networks. In: International conference on multimedia and expo (ICME)
17. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
18. Graves A, Gomez F (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: International conference on machine learning, pp 369–376
19. Kimura F, Takashina K, Tsuruoka S, Miyake Y (1987) Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Trans Pattern Anal Mach Intell 9(1):149–153
20. Dehghan M, Faez K, Ahmadi M, Shridhar M (2001) Handwritten Farsi (Arabic) word recognition: a holistic approach using discrete HMM. Pattern Recognit 34(5):1057–1065
21. Liwicki M, Bunke H (2005) IAM-OnDB: an on-line English sentence database acquired from handwritten text on a whiteboard. In: Eighth international conference on document analysis and recognition, pp 956–961
22. Graves A, Liwicki M, Fernández S, Bertolami R, Bunke H, Schmidhuber J (2008) A novel connectionist system for unconstrained handwriting recognition. IEEE Trans Pattern Anal Mach Intell 31(5):855–868
23. Ahmed SB, Naz S, Razzak MI, Rashid SF, Afzal MZ, Breuel TM (2016) Evaluation of cursive and non-cursive scripts using recurrent neural networks. Neural Comput Appl 27(3):603–613
24. Frinken V, Uchida S (2015) Deep BLSTM neural networks for unconstrained continuous handwritten text recognition. In: International conference on document analysis and recognition, pp 911–915
25. Ray A, Rajeswar S, Chaudhury S (2015) Text recognition using deep BLSTM networks. In: Eighth international conference on advances in pattern recognition, pp 1–6
26. Zhang XY, Yin F, Zhang YM, Liu CL, Bengio Y (2017) Drawing and recognizing Chinese characters with recurrent neural network. IEEE Trans Pattern Anal Mach Intell 99:1–1
27. Naz S, Umar AI, Ahmad R, Ahmed SB, Shirazi SH, Razzak MI (2015) Urdu Nasta'liq text recognition system based on multi-dimensional recurrent neural network and statistical features. Neural Comput Appl 28(2):219–231
28. Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: International conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6645–6649
29. Graves A (2012) Supervised sequence labelling with recurrent neural networks. Springer, Berlin
30. Luong T, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. In: EMNLP
31. Chung J, Gulcehre C, Cho KH, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555
32. De La Briandais R (1959) File searching using variable length keys. In: Western joint computer conference. ACM, pp 295–298
33. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
34. Ding K, Jin L, Gao X (2009) A new rotation-free method for online unconstrained handwritten Chinese word recognition: a holistic approach. In: International conference on document analysis and recognition, pp 1131–1135
35. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
36. Jin L, Gao Y, Liu G, Ding K (2011) SCUT-COUCH2009: a comprehensive online unconstrained Chinese handwriting database and benchmark evaluation. Int J Doc Anal Recognit 14(1):53–64
37. Oyedotun OK, Khashman A (2017) Deep learning in vision-based static hand gesture recognition. Neural Comput Appl 28(12):3941–3951
38. Zhang H, Wang S, Cao X, Yue H, Wang K (2016) Learning to link human objects in videos and advertisements with clothes retrieval. In: International joint conference on neural networks (IJCNN). IEEE, pp 5006–5013
39. Zhang H, Cao X, Ho JK, Chow TW (2017) Object-level video advertising: an optimization framework. IEEE Trans Ind Inform 13(2):520–531