
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 21, NO. 7, JULY 2019

A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training

Runpeng Cui, Hu Liu, and Changshui Zhang, Fellow, IEEE

Abstract—This work develops a continuous sign language (SL) recognition framework with deep neural networks, which directly transcribes videos of SL sentences into sequences of ordered gloss labels. Previous methods dealing with continuous SL recognition usually employ hidden Markov models with limited capacity to capture the temporal information. In contrast, our proposed architecture adopts deep convolutional neural networks with stacked temporal fusion layers as the feature extraction module, and bidirectional recurrent neural networks as the sequence learning module. We propose an iterative optimization process for our architecture to fully exploit the representation capability of deep neural networks with limited data. We first train the end-to-end recognition model for alignment proposal, and then use the alignment proposal as strong supervisory information to directly tune the feature extraction module. This training process can run iteratively to achieve improvements in recognition performance. We further contribute by exploring the multimodal fusion of RGB images and optical flow in sign language. Our method is evaluated on two challenging SL recognition benchmarks, and outperforms the state of the art by a relative improvement of more than 15% on both databases.

Index Terms—Continuous sign language recognition, sequence learning, iterative training, multimodal fusion.

Manuscript received August 14, 2017; revised May 31, 2018; accepted December 10, 2018. Date of publication January 1, 2019; date of current version June 21, 2019. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61876095 and Grant 61751308, and in part by the Beijing Natural Science Foundation under Grant L172037. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Guillaume Gravier. (Corresponding author: Runpeng Cui.)
The authors are with the Institute for Artificial Intelligence, Tsinghua University (THUAI), State Key Laboratory of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology (BNRist), Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: crp16@mails.tsinghua.edu.cn; liuhu15@mails.tsinghua.edu.cn; zcs@mail.tsinghua.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2018.2889563

I. INTRODUCTION

Sign language (SL) is commonly known as the primary language of deaf people, and is usually collected or broadcast in the form of video. SL is often considered the most grammatically structured form of gestural communication [1]. This nature makes SL recognition an ideal research field for developing methods to address problems such as human motion analysis, human-computer interaction (HCI) and user interface design, and it has therefore received great attention in multimedia and computer vision [2]–[4].

Typical SL learning problems involve isolated gesture classification [3], [5]–[7], sign spotting [8]–[10], and continuous SL recognition [11]–[13]. Generally speaking, gesture classification is to classify isolated gestures into the correct categories, while sign spotting is to detect predefined signs in continuous video streams, with precise temporal boundaries of the gestures provided for training the detectors. Different from these problems, continuous SL recognition is to transcribe videos of SL sentences into ordered sequences of glosses (here we use "gloss" to represent a gesture with its closest meaning in natural languages [1]), and the continuous video streams are provided without prior segmentation. Continuous SL recognition is more concerned with learning unsegmented gestures from long-term video streams, and is more suitable for processing continuous gestural videos in real-world systems. Its training also does not require expensive annotation of the temporal boundary of each gesture. Recognizing SL requires simultaneous analysis and integration of gestural movements and appearance features, as well as of disparate body parts [1], and therefore benefits from a multimodal approach. In this paper, we focus on the problem of continuous SL recognition on videos, where learning the spatiotemporal representations as well as their temporal matching to the labels is crucial.

Many studies [11], [14]–[16] have made efforts to represent SL with hand-crafted features. For example, hand and joint locations are used in [11], [17], local binary patterns (LBP) are used in [16], histograms of oriented gradients (HOG) are utilized in [15], and the HOG-3D extension is applied in [11]. Recently, deep convolutional neural networks have had a tremendous impact on related video tasks, e.g., human action recognition [18]–[20], gesture recognition [6] and sign spotting [9], [10], and recurrent neural networks (RNNs) have shown significant performance in learning the temporal dependencies in sign spotting [4], [21]. Several recent approaches taking advantage of neural networks have also been proposed for continuous SL recognition [12], [13], [22]. In these works, neural networks are restricted to learning frame-wise representations, and hidden Markov models (HMMs) are utilized for sequence learning. However, the frame-wise labelling adopted in [12], [13], [22] is noisy for training the deep neural networks, and HMMs may struggle to model the complex dynamic variations, given their limited representation capability.


Fig. 1. Iterative training process in our approach. The end-to-end trained full architecture provides alignment proposals, and the feature extractor is further tuned to learn the matching of video segments and gestural labels.

This paper therefore develops a recurrent convolutional neural network for continuous SL recognition. Our proposed neural model consists of two modules, for spatiotemporal feature extraction and sequence learning respectively. Due to the limited scale of the datasets, we find that end-to-end training cannot fully exploit a deep neural network of high complexity. To address this problem, we investigate an iterative optimization process (Fig. 1) to train our recurrent deep neural architecture effectively. We use gloss-level gestural supervision, given by forced alignment from the end-to-end system, to directly guide the training of the feature extractor. Afterwards, we fine-tune the recurrent neural system with the improved feature extractor, and the system can provide further refined alignments for the feature extraction module. Through this iterative training strategy, our deep neural network keeps learning and benefiting from the refined gestural alignments. The main contributions of our work can be summarized as follows:
1) We develop our architecture with recurrent convolutional neural networks of greater learning capacity to achieve state-of-the-art performance on continuous SL recognition, without importing extra supervisory information;
2) We design an iterative optimization process for training our deep neural network architecture, and our approach, with the neural networks better exploited, proves notably effective on the limited training set in contrast to the end-to-end trained system;
3) We propose a multimodal version of our framework with RGB frames and optical flow images, and experiments show that our multimodal fusion scheme provides better representations for the gestures and further improves the performance of the system.
The remainder of this paper is organized as follows. Section II reviews related work on SL recognition. Section III introduces the formulation of our deep neural network for SL recognition and its iterative optimization scheme. Section IV provides implementation details of our model. Section V presents the experimental results of the proposed method, and Section VI concludes the paper.

II. RELATED WORK

SL recognition systems on videos usually consist of a feature extraction module, which extracts sequential representations to characterize gesture sequences, and a temporal model mapping sequential representations to labels.

Many hand-crafted features have been introduced for gesture and SL recognition. These features characterize handshape, appearance and motion cues, by using image pixel intensity [16], gradients [11], [15], [23] and motion trajectories or velocities [8], [11], [17]. In recent years, there has been a growing trend to learn feature representations with deep neural networks. Wu et al. [24] employ a deep belief network to extract high-level skeletal joint features for gesture recognition. Convolutional neural networks (CNNs) [25], [26] and 3D convolutional neural networks (3D-CNNs) [4], [9], [10] have also been employed to capture visual cues for hand regions. For instance, Molchanov et al. [4] apply 3D-CNNs for spatiotemporal feature extraction from video streams on color, depth and optical flow data. Neverova et al. [9] present a multi-scale deep architecture on color, depth data and hand-crafted pose descriptors.

A temporal model learns the correspondences between sequential representations and gloss labels. HMMs are the most widely used temporal models in SL recognition [10], [11], [13]. Besides, dynamic time warping (DTW) [16] and SVMs [27] are also used for measuring similarity between gestures. Recently, RNNs have been successfully applied to sequential problems such as speech recognition [28] and machine translation [29], [30], and some progress has also been made in exploring the application of RNNs to SL recognition. Pigou et al. [21] propose an end-to-end neural model with temporal convolutions and bidirectional recurrence for sign spotting, which is treated as frame-wise classification in their framework. However, with only weak supervision at the sentence level, recurrent neural networks find it hard to match the over-length input sequence frame by frame with the ordered labels. Different from their model, we use temporal pooling layers to integrate the temporal dynamics before the bidirectional recurrence. Molchanov et al. [4] employ a recurrent 3D-CNN with connectionist temporal classification (CTC) [31] as the cost function for gesture recognition, while in our experiments we find that our architecture shows much superior performance compared to the 3D-CNN model on the SL recognition benchmarks.

Due to the lack of temporal boundaries for the sign glosses in the image sequences, continuous SL recognition is also a typical weakly supervised learning problem. There have been some attempts focusing on the problem of mining gestures of interest from large amounts of SL videos, where signs and annotations are usually coarsely aligned with considerable noise. Different from our problem, they usually focus more on local temporal dynamics rather than long-term dependencies. Buehler et al. [15] propose a scoring function based on multiple instance learning (MIL) and search for signs of interest by maximizing the score. Pfister et al. [27] use subtitle text, lip and hand motion cues to select candidate temporal windows, and these candidates are further refined using MI-SVM [32]. Chung and Zisserman [33] use a ConvNet learned on an image encoding of human keypoint motion for recognition, and they locate temporal positions of signs via saliency maps obtained by back-propagation.


There have been a few works exploring the problem of continuous SL recognition. Gweth et al. [34] employ a one-hidden-layer perceptron to estimate posteriors from appearance-based features, and use the probabilities as inputs to train an HMM-based recognition system. Koller et al. [12], [13], [25] adopt CNNs for feature extraction from cropped hand regions and also use HMMs to model the temporal relationships. As the amount of training data is not sufficient, training of deep neural networks is inclined to end in overfitting. To alleviate this problem, Koller et al. [12] embed a CNN within a weakly supervised learning framework. A weakly labelled sequence of hand shape annotations is brought in as an initialization, to iteratively train the CNN and re-estimate hand shape labels within an Expectation Maximization (EM) [35] framework. Similarly, annotations of finger and palm orientations are also imported as weakly supervised information to train the CNN [25]. In their later works [13], [22], they use the frame-state alignment, provided by a baseline HMM recognition system, as frame labelling to train the embedded neural networks. In contrast with these works [12], [13], [22], [25], our sequence learning module of recurrent neural networks with end-to-end training shows much more learning capacity and better performance for the dynamic dependencies. Besides, instead of using noisy frame-wise labelling as training targets for the neural networks, we adopt the gloss-level alignment proposal to train our feature extraction module, which takes more local temporal dynamics into consideration. Moreover, no extra supervisory information such as hand shape annotations is imported in our approach. Notice that the development of such a lexicon requires laborious annotation with expert knowledge, while our method is free from this limitation.

Different from our previous work [36], we propose a distinctive segment-gloss alignment method to learn from the outputs of our sequence learning module, and we provide an explicit justification for our iterative training scheme, by proving that the training of the feature extraction module maximizes a lower bound of the objective function, instead of using an intuitive approach. We also contribute by investigating more on the multimodal integration of appearance and motion cues in this work.
III. METHOD

In this work, our proposed architecture adopts a feature extraction module composed of a deep CNN followed by temporal fusion layers, and a sequence learning module using RNNs with the bidirectional long short-term memory (Bi-LSTM) architecture. We propose a novel iterative optimization scheme to effectively train our deep architecture. We use the end-to-end recognition system to generate alignment proposals between video segments and gestural labels. Given the large amount of gestural segments with supervisory labels, we train the feature extraction module and then fine-tune the whole system iteratively. An overview of our approach is presented in Fig. 1. In the remainder of this section, we will first present our model formulation and then introduce its iterative training strategy.

A. Model Formulation

We use a CNN followed by temporal convolutional and pooling layers to learn spatiotemporal representations from input video streams. Letting $\{x_t\}_{t=1}^{T}$ be the input video stream of length $T$, the employed CNN $f_{\text{CNN}}$ transforms the input sequence into a spatial representation sequence $\{r_t\}_{t=1}^{T} = f_{\text{CNN}}(\{x_t\}_{t=1}^{T})$ with $r_t \in \mathbb{R}^{C}$, where $C$ is the feature dimensionality. The feature sequence $\{r_t\}_{t=1}^{T}$ is then processed by stacked temporal convolution and pooling operations $f_{\text{Temp}}: \mathbb{R}^{\ell \times C} \rightarrow \mathbb{R}^{D}$, with temporal stride $\delta$, receptive field $\ell$ and output dimensionality $D$, to get:

$$\{s_k\}_{k=1}^{K} = f_{\text{Temp}}(\{r_t\}_{t=1}^{T}) = (f_{\text{Temp}} \circ f_{\text{CNN}})(\{x_t\}_{t=1}^{T}), \quad (1)$$

where $K = T/\delta$ is the length of the extracted spatiotemporal representation sequence $\{s_k\}_{k=1}^{K}$, and we use $f_{\text{Temp}} \circ f_{\text{CNN}}$ to denote the processing of the proposed feature extraction module. The feature extraction module transforms the $k$-th video segment of length $\ell$ into the representation $s_k$. In general, we set the receptive field close to the length of an isolated sign. Therefore, we consider the video segments as approximately "gloss-level".

One shortcoming of unidirectional RNNs is that the hidden states are computed only from previous time steps. However, in SL the gestural performance and meaning are closely related to both the preceding and the succeeding context. Therefore, Bi-LSTMs [37] are employed to learn the complex dynamics by mapping sequences of spatiotemporal representations to sequences of ordered labels. The Bi-LSTM computes the forward and backward hidden sequences by iterating the LSTM computation [38] from $k = 1$ to $K$ and from $k = K$ to $1$ respectively:

$$h_k^f, c_k^f = f_{\text{LSTM-frw}}(s_k, h_{k-1}^f, c_{k-1}^f), \quad (2)$$

$$h_k^b, c_k^b = f_{\text{LSTM-bck}}(s_k, h_{k+1}^b, c_{k+1}^b), \quad (3)$$

where $h_k^f, c_k^f$ denote the hidden state and memory cell of the forward LSTM module $f_{\text{LSTM-frw}}$ at the $k$-th time step, and $h_k^b, c_k^b$ denote those of the backward one $f_{\text{LSTM-bck}}$. This scheme helps the recurrent neural networks exploit future context as well as previous context at the same time. Finally, the output categorical probabilities of the $M$ gloss labels at time $k$ are computed through a softmax classifier, which takes the concatenation of the hidden states of the Bi-LSTM as input:

$$z_k = \mathrm{softmax}(W[h_k^f; h_k^b] + b), \quad (4)$$

where $W$ and $b$ are the weight matrix and bias vector of the softmax classifier, and we use $[\cdot\,;\cdot]$ to represent the concatenation operation.

We let $\theta = [\theta_f; \theta_s]$ denote the vector of all parameters employed in the end-to-end recognition system, where $\theta_f$ and $\theta_s$ denote the parameters of the feature extraction and sequence learning modules respectively. We will introduce our approach to learning these parameters in the remainder of this section.
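For concreteness, the sketch below instantiates the feature extraction and sequence learning modules of (1)–(4). PyTorch is assumed purely for illustration (the authors' implementation is in Theano), the class name is our own, and the layer sizes (1024-dimensional frame features, a C5-P2-C5-P2 temporal stack, 2 x 512 Bi-LSTM states) follow the settings reported in Sections IV and V.

```python
# Illustrative sketch of the feature extraction + sequence learning modules.
# Not the authors' code; dimensions follow the paper's description.
import torch
import torch.nn as nn

class SignModel(nn.Module):
    def __init__(self, feat_dim=1024, hidden=512, vocab=1296):
        super().__init__()
        # f_Temp: stacked temporal convolution and max-pooling (here C5-P2-C5-P2)
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, 1024, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(1024, 1024, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Bi-LSTM sequence learning module, eqs (2)-(3)
        self.rnn = nn.LSTM(1024, hidden, batch_first=True, bidirectional=True)
        # classifier over M glosses plus blank, eq (4)
        self.fc = nn.Linear(2 * hidden, vocab)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim), i.e. the per-frame outputs of f_CNN
        s = self.temporal(frame_feats.transpose(1, 2)).transpose(1, 2)  # (B, K, 1024)
        h, _ = self.rnn(s)                        # concatenated forward/backward states
        return self.fc(h).log_softmax(dim=-1)     # (B, K, vocab): log of z_k per segment
```

The per-segment outputs play the role of $z_k$ in (4) and feed directly into the CTC objective introduced next.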
B. Training With CTC Objective Function

Since the video streams are unsegmented in continuous SL recognition during training, we introduce connectionist temporal classification (CTC) [31] to solve this transcription problem. CTC is an objective function originally designed for speech recognition, requiring no prior alignments between input and output sequences.

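Off-the-shelf CTC implementations can be used to compute this objective over the per-segment gloss probabilities. The following is a minimal usage sketch, assuming PyTorch's nn.CTCLoss rather than the Theano-based implementation used in the paper; the shapes, blank index and target lengths are illustrative assumptions.

```python
# Minimal sketch: CTC loss over per-segment gloss log-probabilities.
import torch
import torch.nn as nn

K, batch, M = 30, 2, 1296                          # segments, batch size, vocabulary incl. blank
logits = torch.randn(K, batch, M, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)             # stand-in for Bi-LSTM + softmax outputs
targets = torch.randint(1, M, (batch, 8))          # gloss label sequences (index 0 = blank)
input_lengths = torch.full((batch,), K, dtype=torch.long)
target_lengths = torch.full((batch,), 8, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)  # -log Pr(y | x)
loss.backward()
```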

Our recognition system, with $\theta = [\theta_f; \theta_s]$ as the stacked vector of all its filters, takes the video stream $x = \{x_t\}_{t=1}^{T}$ as the input sequence to predict the sequence of gloss labels $y$. We add an extra token "blank" to the gloss vocabulary to model the gestural transitions explicitly. As the categorical probabilities $z_k$ of each gloss (including blank) for segment $k$ have been normalized by the softmax classifier, we let $\Pr(m, k \mid x; \theta) = z_k^m$, which denotes the emission probability of label $m$ at time step $k$, given by the $m$-th element of $z_k$. In the formulation of CTC, the probability of an alignment $\pi = \{\pi_k\}_{k=1}^{K}$ of length $K$ is defined as the product over time steps:

$$\Pr(\pi \mid x; \theta) = \prod_{k=1}^{K} \Pr(\pi_k, k \mid x; \theta). \quad (5)$$

The alignments typically include blanks and repeated tokens. Alignments are mapped into the given target sequence $y$ by the many-to-one mapping $\mathcal{B}$, which removes repeated labels and blanks. For example, both the alignment $(-, a, a, -, b)$ and $(-, a, b, -, -)$ correspond to the target sequence $(a, b)$, where $a, b$ are gestural labels and "$-$" denotes the blank label for gestural transitions. Letting $\mathcal{V}(y) = \{\pi \mid \mathcal{B}(\pi) = y\}$ represent all the alignments corresponding to the target sequence $y$, the probability of the target transcription can be computed by summing the probabilities of all the alignments corresponding to it:

$$\Pr(y \mid x; \theta) = \sum_{\pi \in \mathcal{V}(y)} \Pr(\pi \mid x; \theta). \quad (6)$$

Using this computation for $\Pr(y \mid x; \theta)$, the CTC objective function is defined as:

$$L_{\mathrm{CTC}}(\theta) = -\log \Pr(y \mid x; \theta). \quad (7)$$

The deep neural network for end-to-end SL recognition can then be trained to minimize the CTC objective function.

The CTC objective function is computed with a dynamic programming algorithm described below [31]. A modified sequence $y'$ is defined by inserting blanks before and after every gestural label, to allow for the blank label in the alignments. Letting $U = |y|$ be the number of labels contained in $y$, we have $U' = |y'| = 2U + 1$ as the length of $y'$. We define the forward variable $\alpha(k, u)$ as the summed probability of all paths up to time step $k$ whose prefix corresponds to the first $u$ tokens of $y'$. In the formulation of CTC, $\alpha(k, u)$ can be calculated recursively as:

$$\alpha(k, u) = \Pr(y'_u, k \mid x; \theta) \sum_{i = g(u)}^{u} \alpha(k-1, i), \quad (8)$$

where

$$g(u) = \begin{cases} u - 1 & \text{if } y'_u \text{ is blank or } y'_{u-2} = y'_u \\ u - 2 & \text{otherwise,} \end{cases} \quad (9)$$

with the boundary conditions given in [31], and we have $\Pr(y \mid x; \theta) = \alpha(K, U' - 1) + \alpha(K, U')$.

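As a reference for (8)–(9), the following NumPy sketch builds the blank-augmented sequence $y'$, implements the collapse mapping $\mathcal{B}$, and computes the forward variables and the resulting $\Pr(y \mid x)$. It works in probability space for readability (a practical implementation would use log probabilities) and is not the authors' code.

```python
# NumPy sketch of the CTC forward pass in (8)-(9). Illustrative only.
import numpy as np

def collapse(path, blank=0):
    """The many-to-one mapping B: drop repeated tokens, then drop blanks."""
    out = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return [p for p in out if p != blank]

def ctc_forward(probs, labels, blank=0):
    """probs: (K, M) per-segment label probabilities; labels: target gloss sequence y."""
    y_prime = [blank] + [t for lab in labels for t in (lab, blank)]  # insert blanks, length 2U+1
    U, K = len(y_prime), probs.shape[0]
    alpha = np.zeros((K, U))
    alpha[0, 0] = probs[0, blank]          # boundary conditions
    alpha[0, 1] = probs[0, y_prime[1]]
    for k in range(1, K):
        for u in range(U):
            s = alpha[k - 1, u]
            if u >= 1:
                s += alpha[k - 1, u - 1]
            # the u-2 predecessor is allowed only when y'_u is a label differing from y'_{u-2}
            if u >= 2 and y_prime[u] != blank and y_prime[u] != y_prime[u - 2]:
                s += alpha[k - 1, u - 2]
            alpha[k, u] = probs[k, y_prime[u]] * s
    return alpha[K - 1, U - 1] + alpha[K - 1, U - 2]   # Pr(y | x); L_CTC = -log of this value
```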

C. Learning From Alignments

At this stage, we utilize the alignment proposal given by the end-to-end tuned system to train the feature extraction module. To fully exploit the representation capability of the deep convolutional network $\theta_f$, we consider using the feature extraction module to maximize the objective function:

$$L(\theta_f) = \log \Pr(y \mid x; \theta_f). \quad (10)$$

As the feature extractor cannot model the long dynamic dependencies, instead of training it directly with the CTC objective, we have:

$$L(\theta_f) = \log \Pr(y \mid x; \theta_f) \quad (11)$$

$$= \log \sum_{\pi \in \mathcal{A}} \Pr(y, \pi \mid x; \theta_f) \quad (12)$$

$$\geq \sum_{\pi \in \mathcal{A}} \Pr(\pi \mid x, y; \theta^{*}) \log \frac{\Pr(y, \pi \mid x; \theta_f)}{\Pr(\pi \mid x, y; \theta^{*})} \quad (13)$$

$$= \sum_{\pi \in \mathcal{A}} \Pr(\pi \mid x, y; \theta^{*}) \log \Pr(y, \pi \mid x; \theta_f) + \mathrm{const} \quad (14)$$

$$= Q(\theta_f) + \mathrm{const}, \quad (15)$$

where $\mathcal{A}$ represents all the possible alignments of length $K$, $\theta^{*}$ denotes the parameters of the trained end-to-end model, and the constant is the negative entropy of the distribution $\Pr(\pi \mid x, y; \theta^{*})$ and therefore independent of $\theta_f$. Step (13) uses Jensen's inequality for the concave function $\log(\cdot)$.

Note that given a particular alignment $\pi$, the output sequence is uniquely determined as $\mathcal{B}(\pi)$. It is easy to see that $\Pr(y \mid \pi, x) = \Pr(y \mid \pi) = \mathbb{1}(\pi \in \mathcal{V}(y))$ expresses the constraint on possible alignments for a given target sequence, where $\mathbb{1}(\cdot)$ is the indicator function. Then we can write:

$$\Pr(\pi \mid x, y; \theta^{*}) = \frac{\Pr(\pi \mid x; \theta^{*})}{\Pr(y \mid x; \theta^{*})} \cdot \mathbb{1}(\pi \in \mathcal{V}(y)), \quad (16)$$

and

$$\Pr(y, \pi \mid x; \theta_f) = \Pr(\pi \mid x; \theta_f) \cdot \mathbb{1}(\pi \in \mathcal{V}(y)). \quad (17)$$

Based on this constraint on the alignments, the objective is formulated as:

$$L(\theta_f) = \log \sum_{\pi \in \mathcal{V}(y)} \Pr(\pi \mid x; \theta_f), \quad (18)$$

and the lower bound is further transformed into:

$$Q(\theta_f) = \sum_{\pi \in \mathcal{V}(y)} \frac{\Pr(\pi \mid x; \theta^{*})}{\Pr(y \mid x; \theta^{*})} \log \Pr(\pi \mid x; \theta_f). \quad (19)$$

In general, we will not be able to optimize over all possible alignments corresponding to $y$. As the converged end-to-end system typically outputs a single alignment with dominant probability on training examples, we select the alignment in $\mathcal{V}(y)$ with the highest probability estimate as the proposal, and use the feature extraction module to learn from it instead. When an alignment $\hat{\pi}$ takes a dominant probability among all the possible alignments in $\mathcal{V}(y)$, we can see that:

$$Q(\theta_f) \approx \frac{\Pr(\hat{\pi} \mid x; \theta^{*})}{\Pr(y \mid x; \theta^{*})} \log \Pr(\hat{\pi} \mid x; \theta_f), \quad (20)$$

where we approximate $Q(\theta_f)$ by choosing the alignment with the highest probability to represent the summation over possible alignments. The proposed alignment is given by:

$$\hat{\pi} = \arg\max_{\pi \in \mathcal{V}(y)} \Pr(\pi \mid x; \theta^{*}). \quad (21)$$

We define $\hat{\alpha}(k, u)$ as the maximum path probability up to time step $k$ over paths whose prefix corresponds to the first $u$ tokens of $y'$, and we have the initial conditions:

$$\hat{\alpha}(k, 0) = 0, \quad 1 \leq k \leq K, \quad (22)$$

$$\hat{\alpha}(1, u) = \begin{cases} \Pr(y'_u, 1 \mid x; \theta^{*}) & u = 1, 2 \\ 0 & 2 < u \leq U'. \end{cases} \quad (23)$$

Similar to (8), we can calculate $\hat{\alpha}(k, u)$ recursively as:

$$\hat{\alpha}(k, u) = \Pr(y'_u, k \mid x; \theta^{*}) \max_{i \in \{i \mid g(u) \leq i \leq u\}} \hat{\alpha}(k-1, i). \quad (24)$$

With the formulation of CTC and all the notations used before, we develop the following algorithm to find the proposed alignment $\hat{\pi} = \{\hat{\pi}_k\}_{k=1}^{K}$.

Input: $\hat{\alpha}(k, u)$, $1 \leq k \leq K$, $1 \leq u \leq U'$
Output: $\hat{\pi}$
1: $\gamma \leftarrow \arg\max_{i \in \{U'-1, U'\}} \hat{\alpha}(K, i)$.
2: $\hat{\pi}_K \leftarrow y'_{\gamma}$.
3: for $k = K-1$ to $1$ do
4:   $\gamma \leftarrow \arg\max_{i \in \{i \mid g(\gamma) \leq i \leq \gamma\}} \hat{\alpha}(k, i)$.
5:   $\hat{\pi}_k \leftarrow y'_{\gamma}$.
6: end for
7: return $\hat{\pi}$

We note that the optimization of (20) contributes to the maximization of the original objective $L(\theta_f)$ in (18), by maximizing the weighted likelihood of the alignment proposal with the highest estimated probability.

As the bidirectional recurrence for sequence learning takes the full context into consideration, we assume that the alignment proposal $\hat{\pi}$ typically gives a reliable estimate of the temporal localization of most signs, and that the spatiotemporal representation of each segment should have a strong correspondence to the segment label given by the alignment proposal.

Based on these assumptions, letting $S$ be the training set as the collection of video streams with their annotations $(x, y)$, the objective function for feature extraction module training can be transformed from (20) into:

$$L_{\mathrm{align}}(\theta_f) = \sum_{(x, y) \in S} \rho(x, y) \sum_{k=1}^{K} \log \Pr(\hat{\pi}_k, k \mid x; \theta_f), \quad (25)$$

where $\rho(x, y) = \Pr(\hat{\pi} \mid x; \theta^{*}) / \Pr(y \mid x; \theta^{*})$, and we express the probability of the alignment proposal as the product of the emission probabilities at each segment. To learn the feature extraction module from the alignment proposal, we partition the video sequences into segments with one supervisory gestural label each, according to the dominant alignment proposal.

The objective function also puts more weight $\rho(x, y)$ on training examples whose estimated alignments have higher confidence. In practice, we extend the feature extractor with a softmax layer and tune the parameters by maximizing $L_{\mathrm{align}}$.

After tuning the feature extractor parameters $\theta_f$ on the segment samples provided by the alignment, we continue to train the full deep neural architecture with the improved feature extraction module. The CTC objective function is employed to fine-tune the recognition system, and the fine-tuned system can give further improved alignments. This training procedure can run iteratively until no improvement is observed in the performance of the system.
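The algorithm above can be implemented directly from (22)–(24). The sketch below is an illustrative NumPy version (not the authors' code) that computes $\hat{\alpha}$ in log space and recovers $\hat{\pi}$; backpointers are stored during the forward pass, which is equivalent to re-scanning $\hat{\alpha}$ during backtracking as in the listing.

```python
# NumPy sketch of the alignment-proposal search: max-product variant of the
# CTC forward pass (24) followed by backtracking. Illustrative only.
import numpy as np

def best_alignment(log_probs, labels, blank=0):
    """log_probs: (K, M) per-segment log-probabilities; returns pi_hat of length K."""
    y_prime = [blank] + [t for lab in labels for t in (lab, blank)]
    U, K = len(y_prime), log_probs.shape[0]

    def g(u):  # earliest admissible predecessor index, as in (9)
        if u >= 2 and y_prime[u] != blank and y_prime[u] != y_prime[u - 2]:
            return u - 2
        return u - 1

    alpha = np.full((K, U), -np.inf)
    alpha[0, 0] = log_probs[0, blank]          # initial conditions (22)-(23)
    alpha[0, 1] = log_probs[0, y_prime[1]]
    back = np.zeros((K, U), dtype=int)
    for k in range(1, K):
        for u in range(U):
            lo = max(g(u), 0)
            prev = int(np.argmax(alpha[k - 1, lo:u + 1])) + lo
            alpha[k, u] = log_probs[k, y_prime[u]] + alpha[k - 1, prev]
            back[k, u] = prev

    # start from the better of the two admissible end states, then walk back
    u = int(np.argmax(alpha[K - 1, U - 2:U])) + U - 2
    pi_hat = [y_prime[u]]
    for k in range(K - 1, 0, -1):
        u = back[k, u]
        pi_hat.append(y_prime[u])
    return pi_hat[::-1]
```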

IV. MODEL IMPLEMENTATION

In this section, we provide more implementation details of our approach.

A. Model Design

The proposed deep neural architecture consists of a deep CNN followed by temporal operations for representation learning, and Bi-LSTMs for sequence learning.

For experiments with modalities of the dominant hand as inputs, we build the deep convolutional network based on the VGG-S model [39] (from layer conv1 to fc6), which is memory-efficient and shows competitive classification performance on the ILSVRC-2012 dataset [40]. The input images, which are the regions of the dominant hand cropped from the original frames, are resized to 101 x 101, and they are then transformed into 1024-dimensional feature vectors through the fully connected layer fc6.

The stacked temporal convolution and pooling layers are utilized to generate a spatiotemporal representation for each segment. Note that it is hard to learn extremely long dynamic dependencies with no temporal pooling, while a coarse temporal stride leads to a loss of temporal detail. We select the temporal stride δ to ensure sufficient overlap between neighboring segments, as well as to pool the representation sequence to a moderate length. For videos in the RWTH-PHOENIX-Weather database, we set ℓ = 16 frames and δ = 4 frames, and we set ℓ = 25 frames and δ = 9 frames in experiments on the SIGNUM corpus. In the feature extraction module, rectifiers and max-pooling are adopted for all the non-linearity and pooling operations.

We use Bi-LSTMs with 2 x 512 dimensional hidden states and peephole connections to learn the temporal dependencies. The hidden states are then fed into the softmax classifier, whose dimension equals the vocabulary size.

We also investigate the performance of our training framework with full video frames as the inputs. We use GoogLeNet [41] and also the VGG-S net as the deep convolutional network in our feature extractor, and we adopt two stacked Bi-LSTMs to build the sequence learning module. Due to GPU memory limitations in fitting the whole system, we fix the parameters of the CNN at the end-to-end stage and only tune the sequence learning module. The video frames are resized to 224 x 224 as the inputs of the CNN, transformed into feature vectors after the average pooling layer, and then fed into the temporal fusion layers. The employed GoogLeNet is initialized with the weights pretrained on the ILSVRC-2014 dataset [40], and we initialize the feature extractor by fitting it to the alignment proposal generated by the model end-to-end trained on dominant hand images.

Fig. 2. Deep neural architecture for RGB and optical flow modalities of dominant hands. We take the model developed on the RWTH-PHOENIX-Weather 2014 database as an example, where T is the length of the input video sequence, K = T/4 is the sequence length after temporal pooling, and M = 1296 is the vocabulary size (including blank). Each displayed feature map is annotated with its dimensionality. Parameters for different modalities are not shared in this architecture.

B. Multimodal Fusion

To incorporate the appearance and motion information, we also take color images and optical flow of the dominant hand regions as the inputs of our deep neural architecture. We adopt a sum fusion approach at the conv5 layer for fusing the two stream networks. It computes the element-wise sum of the two feature maps at the same spatial location and channel. Our intention here is to put appearance and motion cues at the same spatial position in correspondence, without introducing extra filters to join the feature maps together. The sum fusion approach also shows decent performance on the task of action recognition in video [42] compared to other spatial fusion methods. Our end-to-end architecture for SL recognition from dominant hands is depicted in Fig. 2. Note that parameters for different modalities are not shared before the sum fusion.

In experiments on multiple modalities of full frames, we adopt fusion of color and optical flow at two layers (after inception_3b and inception_4c in GoogLeNet) similar to [42]. Fig. 3 shows the fusion structures we build for experiments on recognition from multiple modalities of full frames. We also adopt auxiliary classifiers as in GoogLeNet, added to the temporal fusion layers after inception_4a and inception_4d during the phase of feature extractor fine-tuning.

Fig. 3. Fusion structure for RGB and optical flow modalities of full frames. Sum fusion is used after layers inception_3b and inception_4c. Parameters for different modalities are not shared in this architecture.
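As a concrete illustration of the sum fusion just described, the skeleton below (a PyTorch sketch under our own naming, not the released model) adds the conv5 feature maps of the RGB and optical-flow streams element-wise and passes the result to a shared single-stream head; the two trunks keep separate weights, matching the no-sharing note above.

```python
# Sketch of element-wise sum fusion for a two-stream (RGB + optical flow) network.
import torch.nn as nn

class SumFusion(nn.Module):
    def __init__(self, rgb_trunk: nn.Module, flow_trunk: nn.Module, head: nn.Module):
        super().__init__()
        self.rgb_trunk = rgb_trunk    # layers up to conv5 for the RGB stream
        self.flow_trunk = flow_trunk  # layers up to conv5 for the flow stream (separate weights)
        self.head = head              # layers after the fusion point (single shared stream)

    def forward(self, rgb, flow):
        # element-wise sum at matching spatial positions and channels, no extra filters
        fused = self.rgb_trunk(rgb) + self.flow_trunk(flow)
        return self.head(fused)
```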

C. Implementation Details

In our experiments, we use the DeepFlow method [43] for optical flow computation. For the dominant hand locations, which are not provided in the original databases, we adopt the Faster R-CNN framework [44] to detect the dominant hand in each video frame. To increase the variability of the training examples, we add random temporal scaling of up to ±20% to the video streams, intensity noise [45] with a standard deviation of 0.2 to the RGB modality, and randomly jitter the height and width of each input image by ±20%. RGB images and optical flows are fed into the neural network with the mean image subtracted. We train the full architecture for 100 epochs using the Adam approach [46] with a learning rate of 5 x 10^-5 and a mini-batch size of 2.

For training the feature extraction module from alignment proposals, we split the pairs of segments and gestural labels into training and validation sets with a ratio of 10 : 1. We adopt Adam [46] as the stochastic optimization approach with a fixed learning rate of 5 x 10^-5 and a mini-batch size of 20. Training of the feature extractor is stopped when L_align starts to plateau on the validation sets, which usually takes no more than 10 epochs per iteration. To improve the generalization of the deep networks, we also apply an ℓ2-penalty to the weights of the network layers with a hyperparameter of 5 x 10^-4 to balance the objective and the regularization when tuning the feature extractor and the full system. We stop the iterative training procedure when no performance improvement is observed on the validation sets. In our experiments, the training process usually lasts for 3 to 4 iterations.

Our neural architecture is implemented in Theano [47], and experiments are run on NVIDIA Titan X GPUs.

V. EXPERIMENTS

This section reports experiments performed on two benchmarks for continuous SL recognition to validate our approach. In Section V-A, we introduce the databases and the experimental protocol that we follow in the experiments. In the remainder of this section, we present and analyze our experimental results on public SL recognition benchmarks, and we compare the SL recognition performance of our framework with the state of the art in Section V-E.

A. Datasets and Experimental Protocol

In this work the experiments are carried out on two publicly available databases, the RWTH-PHOENIX-Weather multi-signer 2014 database [48] and the SIGNUM signer-dependent set [14].

The SIGNUM database is created under laboratory conditions, with the recording environment (e.g., lighting, background, signer's position and clothes) carefully controlled. The signer-dependent subset of the SIGNUM corpus contains 603 German SL sentences for training and 177 for testing; each sentence is performed by a native signer three times. The training corpus contains 11,874 glosses and 416,620 frames in total, with 455 different gestural categories.

In contrast to the SIGNUM corpus, the RWTH-PHOENIX-Weather 2014 database is collected from TV broadcasts of weather forecasts. It contains 5,672 video sequences for training, with 65,227 gestural instances and 799,006 frames in total, performed by 9 signers. Each video sequence corresponds to one German SL sentence and is performed by a single person. The length of the video sequences ranges from 27 to 300 frames, all at a frame rate of 25 frames per second (fps). The vocabulary of all gestural labels (excluding "blank") contains up to 1,295 entries. The RWTH-PHOENIX-Weather 2014 database is a challenging benchmark partially due to its multi-signer setting. Besides, the fast hand motion and blurring in signing also add difficulties to accurate recognition.

We follow the experimental protocol adopted in [11] to split the database into training and testing subsets. The statistics for these two corpora in our experiments are shown in Table I. In Fig. 4, we present some example frames from these two databases.

TABLE I
STATISTICS OF RWTH-PHOENIX-WEATHER 2014 AND SIGNUM SIGNER-DEPENDENT DATABASES

Fig. 4. Example frames from the two sign language recognition benchmarks. (a) RWTH-PHOENIX-Weather 2014 database, (b) SIGNUM database.

To evaluate the experimental performance quantitatively, in this work we adopt the word error rate (WER) as the criterion, which quantifies the dissimilarity between the predicted sequence of labels and the ground truth transcription. More precisely, WER measures the least number of substitution, deletion and insertion operations, at the word level, required to transform the predicted sequence into the ground truth. WER is defined as:

$$\mathrm{WER} = \frac{\#\mathrm{sub} + \#\mathrm{del} + \#\mathrm{ins}}{\#\mathrm{GT}}, \quad (26)$$

where #sub, #del and #ins denote the number of substitutions, deletions and insertions respectively, and #GT is the number of labels in the ground truth sequence.
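The WER in (26) is the word-level edit distance between the predicted and ground-truth gloss sequences, normalized by the ground-truth length. A compact reference implementation (illustrative, and not tied to the official evaluation scripts of the benchmarks) is:

```python
# Word error rate via edit distance, as in (26).
def word_error_rate(hypothesis, reference):
    """Both arguments are lists of gloss labels; the result is normalized by #GT."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # deletions
    for j in range(1, m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / max(n, 1)

# Example: one missing gloss out of three ground-truth labels gives WER = 1/3.
# word_error_rate(["MORGEN", "REGEN"], ["MORGEN", "STARK", "REGEN"]) -> 0.333...
```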

TABLE II
COMPARISON OF DEV WERS FOR DIFFERENT TEMPORAL CONVOLUTION AND POOLING PARAMETERS ON THE RWTH-PHOENIX-WEATHER 2014 DATABASE

TABLE III
COMPARISON OF WERS FOR DIFFERENT TEMPORAL CONVOLUTION AND POOLING PARAMETERS ON THE SIGNUM DATABASE

TABLE IV
RECOGNITION DEV/TEST WERS WITH MULTIPLE MODALITIES ON THE RWTH-PHOENIX-WEATHER 2014 DATABASE

B. Design Choices of Temporal Fusion

Notice that the temporal receptive field ℓ and stride δ are important to the temporal fusion of spatial features across video frames. We compare the performance of different choices of ℓ and δ on the RWTH-PHOENIX-Weather 2014 and SIGNUM datasets (see Table II and Table III). We see that changes in visual inputs and modalities have little influence on which temporal fusion structure performs best. One possible explanation is that temporal fusion layers generally focus on capturing temporal integration over different time steps, while visual inputs and modalities mostly reflect various spatial properties of gestures with closely related dynamic dependencies. Therefore, a suitable temporal fusion structure can also fit other visual inputs with similar temporal structure. We also observe that overlong temporal sequences with too small a δ lead to failure of the CTC optimization, and that temporal fusion with a larger δ also results in performance deterioration, which can be explained by the loss of temporal details.

Let Ck denote a temporal convolutional layer with 1024-dimensional filters of size k, and let Pk denote a temporal max-pooling layer with stride k and kernel size k. Due to their consistently superior performance, we set the temporal fusion layers as C5-P2-C5-P2 for the RWTH-PHOENIX-Weather 2014 database, and C5-P3-C5-P3 for the SIGNUM database in our experiments. For these two benchmarks, the selected strides make the feature sequence before the Bi-LSTMs about 4 times as long as the label sequence on average. As for the receptive field ℓ, we suggest that it should cover the approximate length of a single gesture in the corpus, so that the feature extractor can learn representations for gestures with complete "gloss-level" information.

We also implement a recurrent 3D-CNN architecture [4] for the continuous SL recognition task. The recurrent 3D-CNN model reaches a WER of 39.64% on the development set and 39.50% on the test set. In contrast, our end-to-end training architecture achieves a WER of 32.21% on the development set and 32.70% on the test set. Fig. 5 shows a typical example of the training process. Our approach with temporal fusion converges more quickly over training epochs, and also achieves a much better performance on the validation set than the 3D-CNN architecture. This result suggests that our proposed model is a more suitable structure than 3D-CNN models on this corpus.

Fig. 5. Training (left) and validation process (right) on the RWTH-PHOENIX-Weather 2014 database with our model (temporal fusion) and the 3D-CNN model. Red curves show the CTC objective value and validation performance for the 3D-CNN model over training epochs, and blue curves are for our approach.
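Under the Ck/Pk notation above, the overall receptive field ℓ and stride δ of a stacked temporal design can be computed layer by layer. The small helper below is our own illustration (not from the paper); it reproduces the values quoted in Section IV-A: C5-P2-C5-P2 gives ℓ = 16 and δ = 4, and C5-P3-C5-P3 gives ℓ = 25 and δ = 9.

```python
# Compute the overall temporal receptive field (l) and stride (delta) of a
# stack of temporal convolution (Ck) and pooling (Pk) layers.
def temporal_footprint(layers):
    """layers: list of (kernel_size, stride) tuples applied in order."""
    receptive_field, jump = 1, 1
    for kernel, stride in layers:
        receptive_field += (kernel - 1) * jump
        jump *= stride
    return receptive_field, jump  # (l, delta)

C = lambda k: (k, 1)   # temporal convolution, stride 1
P = lambda k: (k, k)   # temporal max-pooling, stride = kernel size

print(temporal_footprint([C(5), P(2), C(5), P(2)]))  # (16, 4)  RWTH-PHOENIX-Weather 2014
print(temporal_footprint([C(5), P(3), C(5), P(3)]))  # (25, 9)  SIGNUM
```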

C. Recognition With Multimodal Cues

In this section, we present our experimental results with multimodal setups on both databases. On the RWTH-PHOENIX-Weather 2014 database (see Table IV), the multimodal fusion scheme brings complementary information to the appearance inputs, and shows a consistently superior performance compared with learning from a single modality for all network configurations. Moreover, the performance of our neural networks, for all different input modalities and network configurations, is consistently improved by the iterative training scheme, although the last training iteration sometimes sees a slight decrease in performance. Among all the experimental configurations on the RWTH-PHOENIX-Weather 2014 database, our system with the multimodal fusion scheme and the GoogLeNet architecture in the feature extraction module presents the best performance, achieving a WER of 23.10% on the development set and 22.86% on the test set, which benefits from both multimodal fusion and iterative training, with relative improvements of 6.87% and 6.08% on the test set respectively.

In Table V we observe similar system performance on the SIGNUM database. Our deep neural networks consistently benefit from the proposed iterative training scheme as well as from multimodal fusion. The system with the VGG-S net architecture and modal fusion shows the best performance, with a WER of 2.80%, where the multimodal integration approach and the iterative training bring improvements of 1.13% and 0.37% in WER respectively.

TABLE V
RECOGNITION WERS WITH MULTIPLE MODALITIES ON THE SIGNUM DATABASE

To demonstrate the effectiveness of our proposed method qualitatively, some recognition results with different settings are shown in Fig. 6. We can see that both our multimodal fusion design and the iterative training strategy improve the recognition performance and help to provide more precise predictions for the input videos.

Fig. 6. Recognition results with multiple modalities of full frames on the test set. Our architecture with modal fusion and iterative training gives a better prediction of the gestures. Recognition errors of substitutions and insertions are annotated in red.

D. Gesture Classification With Feature Extractor

At the stage of learning from alignments, we use the feature extraction module with a softmax classifier to learn the aligned gestural labels from video segments. To demonstrate the interpretability of this training process, in this section we directly evaluate our trained gesture classifiers on the task of isolated gesture recognition without fine-tuning.

We evaluate our classifier on the signer-dependent subset of isolated gestures in the SIGNUM database, which contains 1,350 utterances of 450 classes. The isolated gestures are all performed by the same signer as the SL sentences. We use the classifier with temporal convolution and pooling layers to process the image stream, and simply take the mean pooling of the predictions for testing.

It is interesting to find that our classifier with the color image modality shows a 63.70% top-1 accuracy and an 86.37% top-5 accuracy, and the classifier with fusion of color and optical flow achieves 75.70% top-1 accuracy and 91.93% top-5 accuracy on the 450 gestural classes.
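For the isolated-gesture evaluation just described, the per-segment class probabilities produced by the feature extractor are simply averaged over the clip before ranking. A minimal sketch of this mean-pooling evaluation (illustrative only; names and shapes are our own assumptions):

```python
# Mean-pooled clip classification and top-k accuracy for isolated gestures.
import numpy as np

def classify_clip(segment_probs, k=5):
    """segment_probs: (num_segments, num_classes) softmax outputs for one clip."""
    pooled = segment_probs.mean(axis=0)      # mean pooling over the clip's segments
    return np.argsort(pooled)[::-1][:k]      # indices of the top-k classes

def topk_accuracy(clips, labels, k=5):
    hits = sum(label in classify_clip(p, k) for p, label in zip(clips, labels))
    return hits / len(labels)
```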

This result illustrates that our sequence learning module with bidirectional recurrence can provide reliable alignments between frames and gestural labels, and thus we can obtain a decent gesture classifier by training the feature extraction module on this supervision. Notice also that our classifier with multimodal integration presents a higher classification accuracy. In Fig. 7, we can see that the classifier learning from complementary modalities, rather than from color images only, makes more accurate inferences. Fig. 8 further shows an example where our model using the multimodal fusion scheme, which has better inference performance on gestural labels, provides a more accurate alignment proposal.

Fig. 7. Comparison of gesture classifiers trained from alignments with different modalities. (a) Classification results on some unseen gestural sequences. We list the top-5 predictions of classifiers trained with the color modality and with multimodal fusion respectively, and annotate the ground truth label for each sequence in red. We see that both classifiers can give correct labels on "PAUSE" and "TIERE". However, the classifier learning from the color modality only fails to discriminate gestures with similar hand shape but different types of motion, e.g., "BANK-geld" and "TIERE". (b), (c) Visualization of the flow fields. Orientation and magnitude of the flow vectors are represented by hue and saturation respectively.

Fig. 8. Alignments for training example #1582 from the RWTH-PHOENIX-Weather 2014 database, using the color modality (top) and the multimodal fusion scheme (bottom). The x-axis represents time steps of the feature sequence after temporal pooling with a stride of 4 frames. The y-axis represents gloss labels, with token #1295 representing "blank". The blue lines are the most probable labels that our networks predict at each time step, the red lines denote the alignment proposal given by our approach, and the green lines show the ground truth alignment manually annotated. Our modal fusion scheme exploits complementary visual cues, and provides a more precise alignment. In contrast, the model with the color modality fails to make correct inferences around time steps 15 and 33, resulting in a worse alignment proposal.

E. Signer Independent Recognition

To evaluate the performance of our approach in dealing with inter-signer variations, we present a signer independent experiment for continuous SL recognition in this section. Using the same experimental configuration as the SI5 corpus in [22], we remove the video sequences of signer 5 from the training set, which takes 22.85% off the whole set, and we evaluate our trained recognition system only on sequences of signer 5 in the development and test sets. We use the alignments from our end-to-end architecture trained on color images of right hands in the SI5 corpus as the initial alignment proposal. The recognition results after each iteration of optimization are shown in Table VI.

TABLE VI
RECOGNITION DEV/TEST WERS IN THE SIGNER INDEPENDENT EXPERIMENT

We observe that the iterative training notably improves the performance of the recognition system, with more than a 25% relative decrease in WER on both the development and test sets. Our architecture achieves a WER of 39.82% on the development set and 39.63% on the test set, with no notable decrease in performance in further training iterations. Compared to the results of 45.1% on the development set and 44.1% on the test set reported in [22], our results reduce the WER of the state of the art by a margin of around 5%, which is a relative improvement of more than 10%.

Note that in the experiments with the multi-signer configuration, our approach achieves an overall WER of 23.10% on the development set and 22.86% on the test set, and on the sequences of signer 5, the WERs on the development and test sets are 21.28% and 20.48% respectively. We can observe that signer independent recognition is much more challenging than recognition in the multi-signer setting. Besides, the reduction in training examples can also partially explain the decreased performance of the deep neural architecture.

F. Performance Comparison

Table VII compares the performance of our approach with the state of the art. We can observe that our approach achieves the best performance on both benchmarks. For systems only using RGB images as inputs, our approach outperforms the state-of-the-art method on the RWTH-PHOENIX-Weather 2014 database by 2.4% in WER (24.43% vs 26.8%), which is a relative improvement of 9.0%. On the SIGNUM benchmark, our system learning from the color modality, with a WER of 3.58%, also sees an improvement of 1.2% compared with the state of the art at a WER of 4.8%.

TABLE VII
PERFORMANCE COMPARISON OF CONTINUOUS SL RECOGNITION APPROACHES ON THE RWTH-PHOENIX-WEATHER 2014 AND SIGNUM DATABASES

Performance improvements are further gained by introducing multimodal fusion into our framework. The best result on the RWTH-PHOENIX-Weather 2014 dataset sees an improvement in WER from 26.8% to 22.86% on the test set, and on the SIGNUM corpus, our system reduces the WER of the state of the art from 4.8% to 2.80%.

VI. CONCLUSION

In this paper, we develop a continuous SL recognition system with recurrent convolutional neural networks on multimodal data of RGB frames and optical flow images. In contrast to previous state-of-the-art methods, our framework employs recurrent neural networks as the sequence learning module, which show a superior capability of learning temporal dependencies compared to HMMs. The scale of the training data is the bottleneck in fully training a deep neural network of high complexity on this task. To alleviate this problem, we propose a novel training scheme that allows our feature extraction module to be fully exploited to learn the relevant gestural labels on video segments, and to keep benefiting from the iteratively refined alignment proposals. We develop a multimodal fusion approach to integrate appearance and motion cues from SL videos, which yields better spatiotemporal representations of gestures. We evaluate our model on two publicly available SL recognition benchmarks, and the experimental results demonstrate the effectiveness of our method, where both the iterative training strategy and the multimodal fusion contribute to better representations and to the performance improvements.

There are several directions for future work. First, since gestures consist of simultaneous, related channels of information, the approach to integrating multiple modalities needs more exploration. It would also be interesting to exploit prior knowledge about sign language, such as subunits, to help the model learn from the limited data. Another promising research path is to develop other sequence learning approaches, such as attention-based methods, to make better use of the temporal dependencies.

REFERENCES

[1] S. C. Ong and S. Ranganath, "Automatic sign language analysis: A survey and the future beyond lexical meaning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 873–891, Jun. 2005.
[2] Z. Ren, J. Yuan, J. Meng, and Z. Zhang, "Robust part-based hand gesture recognition using Kinect sensor," IEEE Trans. Multimedia, vol. 15, no. 5, pp. 1110–1120, Aug. 2013.
[3] C. Wang, Z. Liu, and S.-C. Chan, "Superpixel-based hand gesture recognition with Kinect depth camera," IEEE Trans. Multimedia, vol. 17, no. 1, pp. 29–39, Jan. 2015.
[4] P. Molchanov et al., "Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4207–4215.
[5] H. Cooper, E. J. Ong, N. Pugeault, and R. Bowden, "Sign language recognition using sub-units," J. Mach. Learn. Res., vol. 13, pp. 2205–2231, 2012.
[6] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, "Hand gesture recognition with 3D convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2015, pp. 1–7.
[7] G. Zen, L. Porzi, E. Sangineto, E. Ricci, and N. Sebe, "Learning personalized models for facial expression analysis and gesture recognition," IEEE Trans. Multimedia, vol. 18, no. 4, pp. 775–788, Apr. 2016.
[8] G. D. Evangelidis, G. Singh, and R. Horaud, "Continuous gesture recognition from articulated poses," in Proc. Eur. Conf. Comput. Vis. Workshops, 2014, pp. 595–607.
[9] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, "Multi-scale deep learning for gesture detection and localization," in Proc. Eur. Conf. Comput. Vis. Workshops, 2014, pp. 474–490.
[10] D. Wu et al., "Deep dynamic neural networks for multimodal gesture segmentation and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583–1597, Aug. 2016.
[11] O. Koller, J. Forster, and H. Ney, "Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers," Comput. Vis. Image Understand., vol. 141, pp. 108–125, 2015.
[12] O. Koller, H. Ney, and R. Bowden, "Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3793–3802.
[13] O. Koller, S. Zargaran, H. Ney, and R. Bowden, "Deep sign: Hybrid CNN-HMM for continuous sign language recognition," in Proc. Brit. Mach. Vis. Conf., 2016, pp. 136.1–136.12.
[14] U. von Agris, M. Knorr, and K.-F. Kraiss, "The significance of facial features for automatic sign language recognition," in Proc. 8th IEEE Int. Conf. Autom. Face Gesture Recognit., 2008, pp. 1–6.
[15] P. Buehler, A. Zisserman, and M. Everingham, "Learning sign language by watching TV (using weakly aligned subtitles)," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 2961–2968.
[16] L.-C. Wang, R. Wang, D.-H. Kong, and B.-C. Yin, "Similarity assessment model for Chinese sign language videos," IEEE Trans. Multimedia, vol. 16, no. 3, pp. 751–761, Apr. 2014.
[17] C. Monnier, S. German, and A. Ost, "A multi-scale boosted detector for efficient and robust gesture recognition," in Proc. Eur. Conf. Comput. Vis. Workshops, 2014, pp. 491–502.
[18] J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2625–2634.
[19] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, "Towards good practices for very deep two-stream ConvNets," 2015, arXiv:1507.02159.
[20] Y. Shi, Y. Tian, Y. Wang, and T. Huang, "Sequential deep trajectory descriptor for action recognition with three-stream CNN," IEEE Trans. Multimedia, vol. 19, no. 7, pp. 1510–1520, Jul. 2017.
[21] L. Pigou, A. van den Oord, S. Dieleman, M. Van Herreweghe, and J. Dambre, "Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video," Int. J. Comput. Vis., vol. 126, no. 2–4, pp. 430–439, Apr. 2018.

[22] O. Koller, S. Zargaran, and H. Ney, "Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3416–3424.
[23] T. Pfister, J. Charles, and A. Zisserman, "Domain-adaptive discriminative one-shot learning of gestures," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 814–829.
[24] D. Wu and L. Shao, "Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 724–731.
[25] O. Koller, H. Ney, and R. Bowden, "Automatic alignment of HamNoSys subunits for continuous sign language recognition," in Proc. Int. Conf. Lang. Resour. Eval. Workshops, 2016, pp. 121–128.
[26] L. Pigou, S. Dieleman, P.-J. Kindermans, and B. Schrauwen, "Sign language recognition using convolutional neural networks," in Proc. Eur. Conf. Comput. Vis. Workshops, 2014, pp. 572–578.
[27] T. Pfister, J. Charles, and A. Zisserman, "Large-scale learning of sign language by watching TV (using co-occurrences)," in Proc. Brit. Mach. Vis. Conf., 2013, pp. 20.1–20.11.
[28] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. Int. Conf. Mach. Learn., 2014, pp. 1764–1772.
[29] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, arXiv:1409.0473.
[30] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3104–3112.
[31] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. Int. Conf. Mach. Learn., 2006, pp. 369–376.
[32] S. Andrews, I. Tsochantaridis, and T. Hofmann, "Support vector machines for multiple-instance learning," in Proc. Adv. Neural Inf. Process. Syst., 2002, pp. 577–584.
[33] J. S. Chung and A. Zisserman, "Signs in time: Encoding human motion as a temporal image," 2016, arXiv:1608.02059.
[34] Y. L. Gweth, C. Plahl, and H. Ney, "Enhanced continuous sign language recognition using PCA and neural network features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2012, pp. 55–60.
[35] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. Series B (Methodological), pp. 1–38, 1977.
[36] R. Cui, H. Liu, and C. Zhang, "Recurrent convolutional neural networks for continuous sign language recognition by staged optimization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1610–1618.
[37] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Netw., vol. 18, no. 5, pp. 602–610, 2005.
[38] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[39] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," in Proc. Brit. Mach. Vis. Conf., 2014, pp. 54.1–54.12.
[40] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[41] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[42] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1933–1941.
[43] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, "DeepFlow: Large displacement optical flow with deep matching," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1385–1392.
[44] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[45] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[46] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Representations, 2015.
[47] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," 2016, arXiv:1605.02688.
[48] J. Forster, C. Schmidt, O. Koller, M. Bellgardt, and H. Ney, "Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-Weather," in Proc. Int. Conf. Lang. Resour. Eval., 2014, pp. 1911–1916.

Runpeng Cui received the B.S. degree from Tsinghua University, Beijing, China, in 2013. He is currently working toward the Ph.D. degree at the State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University. His research interests include deep learning and computer vision.

Hu Liu received the B.S. degree from Tsinghua University, Beijing, China, in 2015. He is currently working toward the master's degree at the State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University. His research interests include deep learning and computer vision.

Changshui Zhang (M'02–SM'15–F'18) received the B.S. degree from Peking University, Beijing, China, in 1986, and the Ph.D. degree from Tsinghua University, Beijing, China, in 1992. He is currently a Professor with the Department of Automation, Tsinghua University. His research interests include pattern recognition and machine learning.
