Abstract—This work develops a continuous sign language (SL) recognition framework with deep neural networks, which directly transcribes videos of SL sentences to sequences of ordered gloss labels. Previous methods dealing with continuous SL recognition usually employ hidden Markov models, whose capacity to capture the temporal information is limited. In contrast, our proposed architecture adopts deep convolutional neural networks with stacked temporal fusion layers as the feature extraction module, and bidirectional recurrent neural networks as the sequence learning module. We propose an iterative optimization process for our architecture to fully exploit the representation capability of deep neural networks with limited data. We first train the end-to-end recognition model to generate an alignment proposal, and then use the alignment proposal as strong supervisory information to directly tune the feature extraction module. This training process can run iteratively to achieve improvements in recognition performance. We further contribute by exploring the multimodal fusion of RGB images and optical flow in sign language. Our method is evaluated on two challenging SL recognition benchmarks, and outperforms the state of the art by a relative improvement of more than 15% on both databases.

Index Terms—Continuous sign language recognition, sequence learning, iterative training, multimodal fusion.

I. INTRODUCTION

SIGN language (SL) is commonly known as the primary language of deaf people, and is usually collected or broadcast in the form of video. SL is often considered the most grammatically structured form of gestural communication [1]. This nature makes SL recognition an ideal research field for developing methods that address problems such as human motion analysis, human-computer interaction (HCI), and user interface design, and it has therefore received great attention in multimedia and computer vision [2]–[4].

Typical SL learning problems involve isolated gesture classification [3], [5]–[7], sign spotting [8]–[10], and continuous SL recognition [11]–[13]. Generally speaking, gesture classification assigns isolated gestures to their correct categories, while sign spotting detects predefined signs in continuous video streams, with the precise temporal boundaries of the gestures provided for training the detectors. Different from these problems, continuous SL recognition transcribes videos of SL sentences to ordered sequences of glosses (here we use "gloss" to represent a gesture with its closest meaning in natural languages [1]), and the continuous video streams are provided without prior segmentation. Continuous SL recognition is concerned with learning unsegmented gestures in long video streams, which makes it better suited to processing continuous gestural videos in real-world systems. Its training also does not require expensive annotation of the temporal boundary of each gesture. Recognizing SL entails the simultaneous analysis and integration of gestural movements and appearance features, as well as of disparate body parts [1], and therefore favors a multimodal approach. In this paper, we focus on the problem of continuous SL recognition on videos, where learning the spatiotemporal representations as well as their temporal matching to the labels is crucial.

Many studies [11], [14]–[16] have put their effort into representing SL with hand-crafted features. For example, hand and joint locations are used in [11], [17], local binary patterns (LBP) are used in [16], the histogram of oriented gradients (HOG) is utilized in [15], and its extension HOG-3D is applied in [11]. Recently, deep convolutional neural networks have had a tremendous impact on related video tasks, e.g., human action recognition [18]–[20], gesture recognition [6], and sign spotting [9], [10], and recurrent neural networks (RNNs) have shown significant performance in learning the temporal dependencies for sign spotting [4], [21]. Several recent approaches taking advantage of neural networks have also been proposed for continuous SL recognition [12], [13], [22]. In these works, the neural networks are restricted to learning frame-wise representations, and hidden Markov models (HMMs) are utilized for sequence learning. However, the frame-wise labelling adopted in [12], [13], [22] is noisy for training the deep neural networks, and HMMs can be hard-pressed to learn the complex dynamic variations, considering their limited representation capability.

This paper therefore develops a recurrent convolutional neural network for continuous SL recognition. Our proposed neural model consists of two modules for spatiotemporal feature extraction and sequence learning, respectively. Due to the limited
II. RELATED WORK
HMM-based recognition system. Koller et al. [12], [13], [25] adopt CNNs for feature extraction from cropped hand regions and also use HMMs to model the temporal relationships. As the amount of training data is not sufficient, training deep neural networks is inclined to end in overfitting. To alleviate this problem, Koller et al. [12] embed a CNN within a weakly supervised learning framework. A weakly labelled sequence of hand shape annotations is brought in as an initialization, to iteratively train the CNN and re-estimate the hand shape labels within an Expectation Maximization (EM) [35] framework. Similarly, annotations of finger and palm orientations are also imported as weakly supervised information to train a CNN [25]. In their later works [13], [22], they use the frame-state alignment provided by a baseline HMM recognition system as the frame labelling to train the embedded neural networks. In contrast with these works [12], [13], [22], [25], our sequence learning module of recurrent neural networks with end-to-end training shows much greater learning capacity and better performance on the dynamic dependencies. Besides, instead of using noisy frame-wise labelling as the training target of the neural networks, we adopt the gloss-level alignment proposal to train our feature extraction module, which takes more of the local temporal dynamics into consideration. Moreover, no extra supervisory information such as hand shape annotations is imported in our approach. Notice that the development of such a lexicon requires laborious annotation with expert knowledge, while our method is free from this limitation.

Different from our previous work [36], we propose a distinctive segment-gloss alignment method to learn from the outputs of our sequence learning module, and we provide an explicit illustration of our iterative training scheme, by proving that training the feature extraction module maximizes a lower bound of the objective function, instead of relying on an intuitive approach. We also contribute by investigating further the multimodal integration of appearance and motion cues in this work.
III. METHOD

In this work, our proposed architecture adopts a feature extraction module composed of a deep CNN followed by temporal fusion layers, and a sequence learning module using RNNs with a bidirectional long short-term memory (Bi-LSTM) architecture.

We propose a novel iterative optimization scheme to effectively train our deep architecture. We use the end-to-end recognition system to generate an alignment proposal between video segments and gestural labels. Given the large number of gestural segments with supervisory labels, we train the feature extraction module and then fine-tune the whole system iteratively. An overview of our approach is presented in Fig. 1. In the remainder of this section, we will first present our model formulation and then introduce its iterative training strategy.

A. Model Formulation

We use a CNN followed by temporal convolutional and pooling layers to learn spatiotemporal representations from input video streams. Letting $\{x_t\}_{t=1}^{T}$ be the input video stream of length $T$, the employed CNN $f_{\mathrm{CNN}}$ transforms the input sequence into a spatial representation sequence $\{r_t\}_{t=1}^{T} = f_{\mathrm{CNN}}(\{x_t\}_{t=1}^{T})$ with $r_t \in \mathbb{R}^C$, where $C$ is the feature dimensionality. The feature sequence $\{r_t\}_{t=1}^{T}$ is then processed by stacked temporal convolution and pooling operations $f_{\mathrm{Temp}}: \mathbb{R}^{\ell \times C} \to \mathbb{R}^D$, with temporal stride $\delta$, receptive field $\ell$ and output dimensionality $D$, to get:

$$\{s_k\}_{k=1}^{K} = f_{\mathrm{Temp}}(\{r_t\}_{t=1}^{T}) = (f_{\mathrm{Temp}} \circ f_{\mathrm{CNN}})(\{x_t\}_{t=1}^{T}), \tag{1}$$

where $K = T/\delta$ represents the length of the extracted spatiotemporal representation sequence $\{s_k\}_{k=1}^{K}$, and we use $f_{\mathrm{Temp}} \circ f_{\mathrm{CNN}}$ to denote the processing of the proposed feature extraction module. The feature extraction module transforms the $k$-th video segment of length $\ell$ to the representation $s_k$. In general, we set the receptive field $\ell$ to approximately the length of an isolated sign. Therefore, we consider the video segments to be approximately "gloss-level".

One shortcoming of unidirectional RNNs is that the hidden states are computed only from previous time steps. In SL, however, the gestural performance and meaning are closely related to the preceding as well as the succeeding context. Therefore, Bi-LSTMs [37] are employed to learn the complex dynamics by mapping sequences of spatiotemporal representations to sequences of ordered labels. A Bi-LSTM computes the forward and backward hidden sequences by iterating the LSTM computation [38] from $k = 1$ to $K$ and from $k = K$ to $1$ respectively:

$$h_k^f, c_k^f = f_{\mathrm{LSTM\text{-}fwd}}(s_k, h_{k-1}^f, c_{k-1}^f), \tag{2}$$

$$h_k^b, c_k^b = f_{\mathrm{LSTM\text{-}bwd}}(s_k, h_{k+1}^b, c_{k+1}^b), \tag{3}$$

where $h_k^f, c_k^f$ denote the hidden state and memory cell of the forward LSTM module $f_{\mathrm{LSTM\text{-}fwd}}$ at the $k$-th time step, and $h_k^b, c_k^b$ denote those of the backward one $f_{\mathrm{LSTM\text{-}bwd}}$. This scheme helps the recurrent neural networks exploit future context as well as previous context at the same time. Finally, the output categorical probabilities over the $M$ gloss labels at time $k$ are computed through a softmax classifier, which takes the concatenation of the hidden states of the Bi-LSTM as its input:

$$z_k = \mathrm{softmax}(W[h_k^f; h_k^b] + b), \tag{4}$$

where $W$ and $b$ are the weight matrix and bias vector of the softmax classifier, and we use $[\cdot\,;\cdot]$ to represent the concatenation operation.

We let $\theta = [\theta_f; \theta_s]$ denote the vector of all parameters employed in the end-to-end recognition system, where $\theta_f$ and $\theta_s$ denote the parameters of the feature extraction and sequence learning modules respectively. We will introduce our approach to learning these parameters in the remainder of this section.
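To make the formulation concrete, the following is a minimal PyTorch sketch of (1)–(4); our implementation is in Theano (Section IV), so this block is illustrative rather than a reproduction. The class name `SLRModel`, the abstract `cnn` backbone, and the exact kernel and pooling sizes are assumptions, chosen so that the temporal stack realizes the $\ell = 16$, $\delta = 4$ setting reported in Section IV, with the 1024-dimensional frame features and $M = 1296$ vocabulary of Fig. 2.

```python
import torch
import torch.nn as nn

class SLRModel(nn.Module):
    """Illustrative sketch of (1)-(4): frame-wise CNN f_CNN, stacked temporal
    convolution/pooling f_Temp, Bi-LSTM, and softmax classifier."""

    def __init__(self, cnn, feat_dim=1024, hidden=512, num_classes=1296):
        super().__init__()
        self.cnn = cnn  # f_CNN: maps each frame x_t to a feature r_t in R^C
        # f_Temp: conv5-pool2-conv5-pool2 yields a receptive field of l = 16
        # input frames and a temporal stride of delta = 4 per output step s_k.
        self.temp = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.blstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)  # softmax(W[h_f; h_b] + b)

    def forward(self, frames):                              # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        r = self.cnn(frames.flatten(0, 1)).view(b, t, -1)   # {r_t}: (B, T, C)
        s = self.temp(r.transpose(1, 2)).transpose(1, 2)    # {s_k}: (B, K, D)
        h, _ = self.blstm(s)                                # [h_f; h_b]: (B, K, 2*hidden)
        return self.classifier(h).log_softmax(dim=-1)       # log z_k: (B, K, M)
```

Note that the standard `nn.LSTM` lacks the peephole connections used in our implementation; the sketch only conveys the data flow.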
B. Training With CTC Objective Function

Since the video streams are unsegmented in continuous SL recognition during training, we introduce connectionist temporal classification (CTC) [31] to solve this transcription problem. CTC is an objective function originally designed for speech recognition, requiring no prior alignments between input and output sequences.
Our recognition system, with $\theta = [\theta_f; \theta_s]$ as the stacked vector of all its filters, takes the video stream $x = \{x_t\}_{t=1}^{T}$ as the input sequence to predict the sequence of gloss labels $y$. We add an extra token "blank" to the gloss vocabulary to model the gestural transitions explicitly. As the categorical probabilities $z_k$ over each gloss (including blank) for segment $k$ have been normalized by the softmax classifier, we let $\Pr(m, k \mid x; \theta) = z_k^m$, which indicates the emission probability of label $m$ at time step $k$, using the $m$-th element of $z_k$. In the formulation of CTC, the probability of an alignment $\pi = \{\pi_k\}_{k=1}^{K}$ of length $K$ is defined as the product over time steps:

$$\Pr(\pi \mid x; \theta) = \prod_{k=1}^{K} \Pr(\pi_k, k \mid x; \theta). \tag{5}$$

The alignments typically include blanks and repeated tokens. Alignments are mapped to the given target sequence $y$ by the many-to-one mapping $\mathcal{B}$, which removes repeated labels and blanks. For example, both the alignment $(-, a, a, -, b)$ and $(-, a, b, -, -)$ correspond to the target sequence $(a, b)$, where $a, b$ are gestural labels, and "$-$" denotes the blank label for gestural transitions. Letting $\mathcal{V}(y) = \{\pi \mid \mathcal{B}(\pi) = y\}$ represent all the alignments corresponding to the target sequence $y$, the probability of the target transcription can be computed by summing the probabilities of all the alignments corresponding to it:

$$\Pr(y \mid x; \theta) = \sum_{\pi \in \mathcal{V}(y)} \Pr(\pi \mid x; \theta). \tag{6}$$

Using this computation for $\Pr(y \mid x; \theta)$, the CTC objective function is defined as:

$$L_{\mathrm{CTC}}(\theta) = -\log \Pr(y \mid x; \theta). \tag{7}$$

The deep neural network for end-to-end SL recognition can then be trained to minimize the CTC objective function.
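In practice, (6) and (7) are evaluated with an off-the-shelf CTC implementation rather than by explicit enumeration. As a hedged illustration (not our Theano code), PyTorch's `nn.CTCLoss` computes (7) directly; the toy shapes and the choice of blank index 0 below are assumptions for the example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
K, B, M = 20, 1, 6                       # time steps, batch size, vocabulary (with blank)
logits = torch.randn(K, B, M, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)   # log z_k, shape (K, B, M)

targets = torch.tensor([[2, 3, 5]])      # a toy gloss sequence y
input_lengths = torch.tensor([K])
target_lengths = torch.tensor([3])

# L_CTC = -log Pr(y|x), marginalized over all alignments in V(y) as in (6)-(7)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                          # gradients for end-to-end training
```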
The calculation of the CTC objective function is handled with a dynamic programming algorithm, described below [31]. A modified sequence $y'$ is defined by inserting blanks before and after every gestural label, to allow for the blank label in the alignments. Letting $U = |y|$ be the number of labels contained in $y$, we have $U' = |y'| = 2U + 1$ as the length of $y'$. We define the forward variable $\alpha(k, u)$ as the summed probability of all paths up to time step $k$ whose prefix covers the first $u$ symbols of $y'$. In the formulation of CTC, $\alpha(k, u)$ can be calculated recursively as:

$$\alpha(k, u) = \Pr(y'_u, k \mid x; \theta) \sum_{i = g(u)}^{u} \alpha(k - 1, i), \tag{8}$$

where, following the standard CTC transition rule [31], $g(u) = u - 1$ if $y'_u$ is blank or $y'_{u-2} = y'_u$, and $g(u) = u - 2$ otherwise. The probability of the target sequence is then read off the final step as:

$$\Pr(y \mid x; \theta) = \alpha(K, U') + \alpha(K, U' - 1). \tag{9}$$
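Written out for a single sequence, the recursion (8) and the sum in (9) take the following form. This NumPy sketch works with raw probabilities for readability; a practical implementation would operate in log space to avoid underflow:

```python
import numpy as np

def ctc_forward(emissions, y, blank=0):
    """Forward variables alpha(k, u) of (8) and Pr(y|x) of (9) for one sequence.
    emissions: (K, M) array of emission probabilities Pr(m, k | x)."""
    K = emissions.shape[0]
    y_ext = [blank]                          # modified sequence y': blanks around labels
    for gloss in y:
        y_ext += [gloss, blank]
    U = len(y_ext)                           # U' = 2|y| + 1
    alpha = np.zeros((K, U))
    alpha[0, 0] = emissions[0, y_ext[0]]     # paths may start with a blank...
    alpha[0, 1] = emissions[0, y_ext[1]]     # ...or with the first gloss
    for k in range(1, K):
        for u in range(U):
            total = alpha[k - 1, u]                      # i = u: stay on the same symbol
            if u >= 1:
                total += alpha[k - 1, u - 1]             # i = u - 1: advance one symbol
            if u >= 2 and y_ext[u] != blank and y_ext[u] != y_ext[u - 2]:
                total += alpha[k - 1, u - 2]             # i = u - 2: skip a blank (g(u) = u - 2)
            alpha[k, u] = total * emissions[k, y_ext[u]]
    return alpha, alpha[K - 1, U - 1] + alpha[K - 1, U - 2]
```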
C. Learning from Alignments

At this stage, we utilize the alignment proposal given by the end-to-end tuned system to train the feature extraction module. To fully exploit the representation capability of the deep convolutional network $\theta_f$, here we consider using the feature extraction module to maximize the objective function:

$$L(\theta_f) = \log \Pr(y \mid x; \theta_f). \tag{10}$$

As the feature extractor cannot model the long dynamic dependencies, instead of training it directly with the CTC objective, we have:

$$L(\theta_f) = \log \Pr(y \mid x; \theta_f) \tag{11}$$

$$= \log \sum_{\pi \in \mathcal{A}} \Pr(y, \pi \mid x; \theta_f) \tag{12}$$

$$\geq \sum_{\pi \in \mathcal{A}} \Pr(\pi \mid x, y; \theta^*) \log \frac{\Pr(y, \pi \mid x; \theta_f)}{\Pr(\pi \mid x, y; \theta^*)} \tag{13}$$

$$= \sum_{\pi \in \mathcal{A}} \Pr(\pi \mid x, y; \theta^*) \log \Pr(y, \pi \mid x; \theta_f) + \mathrm{const} \tag{14}$$

$$= Q(\theta_f) + \mathrm{const}, \tag{15}$$

where $\mathcal{A}$ represents all the possible alignments of length $K$, $\theta^*$ denotes the parameters of the trained end-to-end model, and the constant is the negative entropy of the distribution $\Pr(\pi \mid x, y; \theta^*)$ and therefore independent of $\theta_f$. Step (13) uses Jensen's inequality for the concave function $\log(\cdot)$.

Note that given a particular alignment $\pi$, the output sequence is uniquely determined as $\mathcal{B}(\pi)$. It is easy to find that $\Pr(y \mid \pi, x) = \Pr(y \mid \pi) = \mathbb{1}(\pi \in \mathcal{V}(y))$, which represents the constraint on the possible alignments for a given target sequence, where $\mathbb{1}(\cdot)$ denotes the indicator function. Then we can write:

$$\Pr(\pi \mid x, y; \theta^*) = \frac{\Pr(\pi \mid x; \theta^*)}{\Pr(y \mid x; \theta^*)} \cdot \mathbb{1}(\pi \in \mathcal{V}(y)), \tag{16}$$

and

$$\Pr(y, \pi \mid x; \theta_f) = \Pr(\pi \mid x; \theta_f) \cdot \mathbb{1}(\pi \in \mathcal{V}(y)). \tag{17}$$

Based on this constraint on the alignments, the objective is formulated as:

$$L(\theta_f) = \log \sum_{\pi \in \mathcal{V}(y)} \Pr(\pi \mid x; \theta_f), \tag{18}$$

and the lower bound is further transformed into:

$$Q(\theta_f) = \sum_{\pi \in \mathcal{V}(y)} \frac{\Pr(\pi \mid x; \theta^*)}{\Pr(y \mid x; \theta^*)} \log \Pr(\pi \mid x; \theta_f). \tag{19}$$
Since $\Pr(\pi \mid x; \theta^*)$ is typically dominated by the most probable of the possible alignments in $\mathcal{V}(y)$, we can see that:

$$Q(\theta_f) \approx \frac{\Pr(\hat{\pi} \mid x; \theta^*)}{\Pr(y \mid x; \theta^*)} \log \Pr(\hat{\pi} \mid x; \theta_f), \tag{20}$$

where we make an approximation to $Q(\theta_f)$ by choosing the alignment with the highest probability to represent the integration over the possible alignments. The proposed alignment is given by:

$$\hat{\pi} = \arg\max_{\pi \in \mathcal{V}(y)} \Pr(\pi \mid x; \theta^*). \tag{21}$$

We define $\hat{\alpha}(k, u)$ as the maximum path probability up to time step $k$ over the prefix of $y'$ of length $u$, with the initial conditions:

$$\hat{\alpha}(k, 0) = 0, \quad 1 \leq k \leq K, \tag{22}$$

$$\hat{\alpha}(1, u) = \begin{cases} \Pr(y'_u, 1 \mid x; \theta^*) & u = 1, 2 \\ 0 & 2 < u \leq U' \end{cases}. \tag{23}$$

Similar to (8), we can calculate $\hat{\alpha}(k, u)$ recursively as:

$$\hat{\alpha}(k, u) = \Pr(y'_u, k \mid x; \theta^*) \max_{g(u) \leq i \leq u} \hat{\alpha}(k - 1, i). \tag{24}$$

With the formulation of CTC and all the notation used before, we develop the following algorithm to find the proposed alignment $\hat{\pi} = \{\hat{\pi}_k\}_{k=1}^{K}$.
Input: $\hat{\alpha}(k, u)$, $1 \leq k \leq K$, $1 \leq u \leq U'$
Output: $\hat{\pi}$
1: $\gamma \leftarrow \arg\max_{i \in \{U' - 1,\, U'\}} \hat{\alpha}(K, i)$.
2: $\hat{\pi}_K \leftarrow y'_{\gamma}$.
3: for $k = K - 1$ to $1$ do
4:  $\gamma \leftarrow \arg\max_{i \in \{i \mid g(\gamma) \leq i \leq \gamma\}} \hat{\alpha}(k, i)$.
5:  $\hat{\pi}_k \leftarrow y'_{\gamma}$.
6: end for
7: return $\hat{\pi}$
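A compact rendering of the whole proposal step (the max-product recursion (22)–(24) followed by the backtracking listing above) might look as follows. This is an illustrative NumPy sketch, not our implementation, and it assumes emission probabilities from the trained model $\theta^*$ as input:

```python
import numpy as np

def best_alignment(emissions, y, blank=0):
    """Return the alignment proposal pi_hat = argmax Pr(pi | x) over V(y), (21).
    emissions: (K, M) per-step probabilities from the trained model theta*."""
    K = emissions.shape[0]
    y_ext = [blank]
    for gloss in y:
        y_ext += [gloss, blank]
    U = len(y_ext)
    alpha = np.zeros((K, U))                 # alpha_hat(k, u): max path probability
    alpha[0, 0] = emissions[0, y_ext[0]]     # initial conditions (22)-(23)
    alpha[0, 1] = emissions[0, y_ext[1]]
    for k in range(1, K):
        for u in range(U):
            prev = [alpha[k - 1, u]]
            if u >= 1:
                prev.append(alpha[k - 1, u - 1])
            if u >= 2 and y_ext[u] != blank and y_ext[u] != y_ext[u - 2]:
                prev.append(alpha[k - 1, u - 2])        # g(u) = u - 2 case
            alpha[k, u] = max(prev) * emissions[k, y_ext[u]]
    # Backtracking, mirroring steps 1-7 of the listing.
    gamma = U - 1 if alpha[K - 1, U - 1] >= alpha[K - 1, U - 2] else U - 2
    pi_hat = [y_ext[gamma]]
    for k in range(K - 2, -1, -1):
        lo = gamma - 2 if (gamma >= 2 and y_ext[gamma] != blank
                           and y_ext[gamma] != y_ext[gamma - 2]) else max(gamma - 1, 0)
        gamma = lo + int(np.argmax(alpha[k, lo:gamma + 1]))
        pi_hat.append(y_ext[gamma])
    return pi_hat[::-1]                      # pi_hat_1, ..., pi_hat_K
```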
We note that the optimization of (20) contributes to the maximization of the original objective $L(\theta_f)$ in (18), by maximizing the weighted likelihood of the alignment proposal with the highest estimated probability.

As the bidirectional recurrence for sequence learning takes the full context into consideration, we assume that the alignment proposal $\hat{\pi}$ typically gives a reliable estimate of the temporal localization of most signs, and that the spatiotemporal representation of each segment has a strong correspondence to the segment label given by the alignment proposal.

Based on these assumptions, letting $\mathcal{S}$ be the training set, i.e., the collection of video streams with their annotations $(x, y)$, the objective function for training the feature extraction module can be transformed from (20) to:

$$L_{\mathrm{align}}(\theta_f) = \sum_{(x, y) \in \mathcal{S}} \rho(x, y) \sum_{k=1}^{K} \log \Pr(\hat{\pi}_k, k \mid x; \theta_f), \tag{25}$$

where $\rho(x, y) = \Pr(\hat{\pi} \mid x; \theta^*) / \Pr(y \mid x; \theta^*)$, and we express the probability of the alignment proposal as the product of the emission probabilities at each segment. To learn the feature extraction module from the alignment proposal, we partition the video sequences into segments, each with one supervisory gestural label, according to the dominant alignment proposal.

The objective function also puts more weight $\rho(x, y)$ on training examples whose estimated alignments are more confident. In practice, we extend the feature extractor with a softmax layer and tune the parameters by maximizing $L_{\mathrm{align}}$.

After tuning the feature extractor parameters $\theta_f$ on the segment samples provided by the alignment, we continue to train the full deep neural architecture with the improved feature extraction module. The CTC objective function is employed to fine-tune the recognition system, and the fine-tuned system can give further improved alignments. This training procedure can run iteratively until no improvement is observed in the performance of the system; the overall loop is sketched below.
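Schematically, the overall optimization alternates between the two stages as follows. The control flow matches the description above, while `train_ctc`, `train_feature_extractor`, `best_alignment`, `rho`, `evaluate_wer`, and the `data` container are hypothetical stand-ins for the corresponding components:

```python
def iterative_training(model, data, train_ctc, train_feature_extractor,
                       best_alignment, rho, evaluate_wer, max_iters=4):
    """Control flow of the iterative scheme; every callable passed in is a
    hypothetical stand-in for the corresponding component described above."""
    train_ctc(model, data.train)                   # stage 1: end-to-end, minimizes (7)
    best_wer = evaluate_wer(model, data.dev)
    for _ in range(max_iters):
        # Decode the alignment proposal pi_hat and confidence weight rho for
        # each training sample, using the current end-to-end model as theta*.
        proposals = [(x, best_alignment(model, x, y), rho(model, x, y))
                     for (x, y) in data.train]
        train_feature_extractor(model, proposals)  # stage 2: maximize (25) w.r.t. theta_f
        train_ctc(model, data.train)               # fine-tune the full system end to end
        wer = evaluate_wer(model, data.dev)
        if wer >= best_wer:                        # stop once dev WER stops improving
            break
        best_wer = wer
    return model
```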
IV. MODEL IMPLEMENTATION

In this section, we provide more implementation details of our approach.

A. Model Design

The proposed deep neural architecture consists of a deep CNN followed by temporal operations for representation learning, and Bi-LSTMs for sequence learning.

For experiments with modalities from the dominant hands as the inputs, we build the deep convolutional network based on the VGG-S model [39] (from layer conv1 to fc6), which is memory-efficient and shows competitive classification performance on the ILSVRC-2012 dataset [40]. The input images, which are the regions of the dominant hands cropped from the original frames, are resized to 101 × 101, and they are then transformed into 1024-dimensional feature vectors through the fully connected layer fc6.

The stacked temporal convolution and pooling layers are utilized to generate the spatiotemporal representation of each segment. Note that it is hard to learn extremely long dynamic dependencies with no temporal pooling, while a coarse temporal stride will lead to a loss of temporal detail. We select the temporal stride $\delta$ to ensure sufficient overlap between neighboring segments, as well as to pool the representation sequence to a moderate length. For videos in the RWTH-PHOENIX-Weather database, we set $\ell = 16$ frames and $\delta = 4$ frames, and we set $\ell = 25$ frames and $\delta = 9$ frames in the experiments on the SIGNUM corpus; the arithmetic behind these settings can be checked with the helper sketched below. In the feature extraction module, rectifiers and max-pooling are adopted for all the non-linearity and pooling operations.
probability of alignment proposal as the product of emission limitations on GPU memory to fit in the whole system, we fix
probabilities at each segment. To learn the feature extraction the parameters of CNN at the end-to-end stage and only tune
Fig. 2. Deep neural architecture for the RGB and optical flow modalities of the dominant hands. We take the model developed on the RWTH-PHOENIX-Weather 2014 database as an example, where T is the length of the input video sequence, K = T/4 is the sequence length after temporal pooling, and M = 1296 is the vocabulary size (including blank). Each displayed feature map is annotated with its dimensionality. Parameters for the different modalities are not shared in this architecture.
TABLE I
STATISTICS OF RWTH-PHOENIX-WEATHER 2014 AND SIGNUM SIGNER-DEPENDENT DATABASES

deep networks, we also adopt the $\ell_2$-penalty on the weights of the network layers, with a hyperparameter of $5 \times 10^{-4}$ to balance the objective and the regularization when tuning the feature extractor and the full system. We stop the iterative training procedure once no performance improvement is observed on the validation sets. In our experiments, the training process usually lasts for 3 to 4 iterations.

Our neural architecture is implemented in Theano [47], and the experiments are run on NVIDIA Titan X GPUs.
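In frameworks with built-in weight decay, the $\ell_2$-penalty above reduces to a single optimizer argument. A minimal illustration follows; the choice of Adam [46] and the learning rate here are assumptions, not settings reported in this paper:

```python
import torch.nn as nn
import torch.optim as optim

module = nn.Linear(1024, 512)   # stand-in for any network layer being tuned
# weight_decay realizes the l2-penalty; 5e-4 is the hyperparameter stated above,
# while the optimizer and learning rate are assumed for the example.
optimizer = optim.Adam(module.parameters(), lr=1e-4, weight_decay=5e-4)
```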
V. EXPERIMENTS
This section reports experiments performed on two bench-
marks for continuous SL recognition to validate our approach.
In Section V-A, we introduce the databases and the experimen-
tal protocol that we follow in the experiments. In the remainder
of this section, we present and analyze our experimental results
on public SL recognition benchmarks, and we compare the SL
recognition performance of our framework with the state of the art in Section V-E.
TABLE II
COMPARISON OF DEV WERS FOR DIFFERENT TEMPORAL CONVOLUTION AND POOLING PARAMETERS ON RWTH-PHOENIX-WEATHER 2014 DATABASE
TABLE III
COMPARISON OF WERS FOR DIFFERENT TEMPORAL CONVOLUTION AND POOLING PARAMETERS ON SIGNUM DATABASE
TABLE IV
RECOGNITION DEV/TEST WERS WITH MULTIPLE MODALITIES ON RWTH-PHOENIX-WEATHER 2014 DATABASE
TABLE V
RECOGNITION WERS WITH MULTIPLE MODALITIES ON SIGNUM DATABASE
Fig. 6. Recognition results with multiple modalities of full frames on test set. Our architecture with modal fusion and iterative training gives a better prediction
of the gestures. Recognition errors of substitutions and insertions are annotated in red.
TABLE VI
RECOGNITION DEV/TEST WERS ON SIGNER INDEPENDENT EXPERIMENT
TABLE VII
PERFORMANCE COMPARISON OF CONTINUOUS SL RECOGNITION APPROACHES ON RWTH-PHOENIX-WEATHER 2014 AND SIGNUM DATABASES
introducing multimodal fusion to our framework. The best result on the RWTH-PHOENIX-Weather 2014 dataset sees an improvement in WER from 26.8% to 22.86% on the test set, and on the SIGNUM corpus, our system reduces the WER of the state of the art from 4.8% to 2.80%.

VI. CONCLUSION

In this paper, we develop a continuous SL recognition system with recurrent convolutional neural networks on multimodal data of RGB frames and optical flow images. In contrast to previous state-of-the-art methods, our framework employs recurrent neural networks as the sequence learning module, which shows a superior capability of learning temporal dependencies compared to HMMs. The scale of the training data is the bottleneck in fully training a deep neural network of high complexity on this task. To alleviate this problem, we propose a novel training scheme that fully exploits our feature extraction module to learn the relevant gestural labels on video segments, and keeps benefitting from the iteratively refined alignment proposals. We develop a multimodal fusion approach to integrate appearance and motion cues from SL videos, which yields better spatiotemporal representations for gestures. We evaluate our model on two publicly available SL recognition benchmarks, and the experimental results demonstrate the effectiveness of our method, where both the iterative training strategy and the multimodal fusion contribute to better representations and to the performance improvements.

There are several directions for future work. First, since gestures consist of simultaneous, related channels of information, the approach to integrating multiple modalities needs more exploration. It would also be interesting to exploit prior knowledge on sign language, such as subunits, to help the model learn from the limited data. Another promising research path is to develop other sequence learning approaches, such as attention-based methods, to make better use of the temporal dependencies.

REFERENCES

[1] S. C. Ong and S. Ranganath, "Automatic sign language analysis: A survey and the future beyond lexical meaning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 873–891, Jun. 2005.
[2] Z. Ren, J. Yuan, J. Meng, and Z. Zhang, "Robust part-based hand gesture recognition using Kinect sensor," IEEE Trans. Multimedia, vol. 15, no. 5, pp. 1110–1120, Aug. 2013.
[3] C. Wang, Z. Liu, and S.-C. Chan, "Superpixel-based hand gesture recognition with Kinect depth camera," IEEE Trans. Multimedia, vol. 17, no. 1, pp. 29–39, Jan. 2015.
[4] P. Molchanov et al., "Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4207–4215.
[5] H. Cooper, E. J. Ong, N. Pugeault, and R. Bowden, "Sign language recognition using sub-units," J. Mach. Learn. Res., vol. 13, pp. 2205–2231, 2012.
[6] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, "Hand gesture recognition with 3D convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2015, pp. 1–7.
[7] G. Zen, L. Porzi, E. Sangineto, E. Ricci, and N. Sebe, "Learning personalized models for facial expression analysis and gesture recognition," IEEE Trans. Multimedia, vol. 18, no. 4, pp. 775–788, Apr. 2016.
[8] G. D. Evangelidis, G. Singh, and R. Horaud, "Continuous gesture recognition from articulated poses," in Proc. Eur. Conf. Comput. Vis. Workshops, 2014, pp. 595–607.
[9] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, "Multi-scale deep learning for gesture detection and localization," in Proc. Eur. Conf. Comput. Vis. Workshops, 2014, pp. 474–490.
[10] D. Wu et al., "Deep dynamic neural networks for multimodal gesture segmentation and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583–1597, Aug. 2016.
[11] O. Koller, J. Forster, and H. Ney, "Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers," Comput. Vis. Image Understand., vol. 141, pp. 108–125, 2015.
[12] O. Koller, H. Ney, and R. Bowden, "Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3793–3802.
[13] O. Koller, S. Zargaran, H. Ney, and R. Bowden, "Deep sign: Hybrid CNN-HMM for continuous sign language recognition," in Proc. Brit. Mach. Vis. Conf., 2016, pp. 136.1–136.12.
[14] U. Von Agris, M. Knorr, and K.-F. Kraiss, "The significance of facial features for automatic sign language recognition," in Proc. 8th IEEE Int. Conf. Autom. Face Gesture Recognit., 2008, pp. 1–6.
[15] P. Buehler, A. Zisserman, and M. Everingham, "Learning sign language by watching TV (using weakly aligned subtitles)," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 2961–2968.
[16] L.-C. Wang, R. Wang, D.-H. Kong, and B.-C. Yin, "Similarity assessment model for Chinese sign language videos," IEEE Trans. Multimedia, vol. 16, no. 3, pp. 751–761, Apr. 2014.
[17] C. Monnier, S. German, and A. Ost, "A multi-scale boosted detector for efficient and robust gesture recognition," in Proc. Eur. Conf. Comput. Vis. Workshops, 2014, pp. 491–502.
[18] J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2625–2634.
[19] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, "Towards good practices for very deep two-stream ConvNets," 2015, arXiv:1507.02159.
[20] Y. Shi, Y. Tian, Y. Wang, and T. Huang, "Sequential deep trajectory descriptor for action recognition with three-stream CNN," IEEE Trans. Multimedia, vol. 19, no. 7, pp. 1510–1520, Jul. 2017.
[21] L. Pigou, A. v. d. Oord, S. Dieleman, M. M. Van Herreweghe, and J. Dambre, "Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video," Int. J. Comput. Vis., vol. 126, no. 2–4, pp. 430–439, Apr. 2018.
[22] O. Koller, S. Zargaran, and H. Ney, "Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3416–3424.
[23] T. Pfister, J. Charles, and A. Zisserman, "Domain-adaptive discriminative one-shot learning of gestures," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 814–829.
[24] D. Wu and L. Shao, "Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 724–731.
[25] O. Koller, H. Ney, and R. Bowden, "Automatic alignment of HamNoSys subunits for continuous sign language recognition," in Proc. Int. Conf. Lang. Resour. Eval. Workshops, 2016, pp. 121–128.
[26] L. Pigou, S. Dieleman, P.-J. Kindermans, and B. Schrauwen, "Sign language recognition using convolutional neural networks," in Proc. Eur. Conf. Comput. Vis. Workshops, 2014, pp. 572–578.
[27] T. Pfister, J. Charles, and A. Zisserman, "Large-scale learning of sign language by watching TV (using co-occurrences)," in Proc. Brit. Mach. Vis. Conf., 2013, pp. 20.1–20.11.
[28] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. Int. Conf. Mach. Learn., 2014, pp. 1764–1772.
[29] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, arXiv:1409.0473.
[30] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3104–3112.
[31] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. Int. Conf. Mach. Learn., 2006, pp. 369–376.
[32] S. Andrews, I. Tsochantaridis, and T. Hofmann, "Support vector machines for multiple-instance learning," in Proc. Adv. Neural Inf. Process. Syst., 2002, pp. 577–584.
[33] J. S. Chung and A. Zisserman, "Signs in time: Encoding human motion as a temporal image," 2016, arXiv:1608.02059.
[34] Y. L. Gweth, C. Plahl, and H. Ney, "Enhanced continuous sign language recognition using PCA and neural network features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2012, pp. 55–60.
[35] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statistical Soc., Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[36] R. Cui, H. Liu, and C. Zhang, "Recurrent convolutional neural networks for continuous sign language recognition by staged optimization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1610–1618.
[37] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Netw., vol. 18, no. 5, pp. 602–610, 2005.
[38] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[39] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," in Proc. Brit. Mach. Vis. Conf., 2014, pp. 54.1–54.12.
[40] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vision, vol. 115, no. 3, pp. 211–252, 2015.
[41] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[42] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1933–1941.
[43] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, "DeepFlow: Large displacement optical flow with deep matching," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1385–1392.
[44] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[45] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[46] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Representations, 2015.
[47] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," 2016, arXiv:1605.02688.
[48] J. Forster, C. Schmidt, O. Koller, M. Bellgardt, and H. Ney, "Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-Weather," in Proc. Int. Conf. Lang. Resour. Eval., 2014, pp. 1911–1916.

Runpeng Cui received the B.S. degree from Tsinghua University, Beijing, China, in 2013. He is currently working toward the Ph.D. degree at the State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University. His research interests include deep learning and computer vision.

Hu Liu received the B.S. degree from Tsinghua University, Beijing, China, in 2015. He is currently working toward the master's degree at the State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University. His research interests include deep learning and computer vision.

Changshui Zhang (M'02–SM'15–F'18) received the B.S. degree from Peking University, Beijing, China, in 1986, and the Ph.D. degree from Tsinghua University, Beijing, China, in 1992. He is currently a Professor with the Department of Automation, Tsinghua University. His research interests include pattern recognition and machine learning.