Jie Huang,1 Wengang Zhou,2 Qilin Zhang,3 Houqiang Li,4 Weiping Li5
1,2,4,5 Department of Electronic Engineering and Information Science, University of Science and Technology of China
3 HERE Technologies, Chicago, Illinois, USA
1 hagjie@mail.ustc.edu.cn, {2 zhwg, 4 lihq, 5 wpli}@ustc.edu.cn, 3 samqzhang@gmail.com
(Zhang, Zhou, and Li 2014) proposed a threshold-matrix-based coarse temporal segmentation step followed by a Dynamic Time Warping (DTW) algorithm and a bi-grammar model. In (Koller, Zargaran, and Ney 2017), a new HMM-based language model is incorporated. Recently, transitional movements have attracted considerable attention (Dawod, Nordin, and Abdullah 2016; Li, Zhou, and Lee 2016; Yang, Tao, and Ye 2016; Zhang et al. 2016) because they can serve as the basis for temporal segmentation.

Despite its popularity, temporal segmentation is intrinsically difficult: even the transitional movements between hand gestures can be subtle and ambiguous. Inaccurate segmentation can incur a significant performance penalty on subsequent steps (Zhang, Zhou, and Li 2014; Koller, Forster, and Ney 2015; Fang, Gao, and Zhao 2007). Worse still, the isolated SLR step typically requires per-video-frame labels, which are highly time-consuming to obtain.

Video Description Generation

Video description generation (Pan et al. 2015; Venugopalan et al. 2015; Yao et al. 2015) is a relevant research area, which generates a brief sentence describing the scenes/objects/motions of a given video sequence. One popular method is the sequence-to-sequence video-to-text model (Venugopalan et al. 2015), a two-layer LSTM on top of a CNN. An attention mechanism can be incorporated into an LSTM (Yao et al. 2015), which automatically selects the most relevant video frames. There are also extensions of LSTM-style algorithms, such as the bidirectional LSTM (Bin et al. 2016), the hierarchical LSTM (Li, Luong, and Jurafsky 2015), and the hierarchical attention GRU (Yang et al. 2016).

Despite many similarities in targets and involved techniques, video description generation and continuous SLR are two distinct tasks. Video description generation provides a brief summary of the appearance of a video sequence, such as "A female person raises a hand and stretches her fingers," while continuous SLR provides a semantic translation of sign language sentences, such as "A signer says 'I love you' in American Sign Language."

Latent Space based Learning

The latent space model is a popular tool to bridge the semantic gap between two modalities (Zhang et al. 2015b; 2015a; Zhang and Hua 2015), such as between short textual descriptions and imagery (Pan et al. 2014). For example, a combination of an LSTM and a latent space model is proposed in (Pan et al. 2015) for video caption generation, in which an embedding model and an LSTM are jointly learned. However, this embedding model is purely based on the Euclidean distance between video frames and textual descriptions, ignoring all temporal structure. It is suitable for generating video appearance summaries, but falls short of semantic translation.

Signing Video Feature Representation

Signing video is characterized primarily by upper-body motions, especially hand gestures. The major challenge of hand gesture detection and tracking is the tremendous appearance variation of hand shapes and orientations, in addition to occlusions. Previous research (Tang et al. 2015; Kurakin, Zhang, and Liu 2012; Sun et al. 2013) on posture/gesture models relies heavily on depth maps collected from Kinect-style RGB-D sensors, which provide a convenient basis for 3D modeling. However, the vast majority of existing annotated signing videos are recorded with conventional RGB camcorders. An effective RGB-video-based feature representation is essential to take advantage of these valuable historical labeled data.

Inspired by the recent success of deep-learning-based object detection, we propose a two-stream 3-D CNN for the generation of video feature representations, with accurate gesture detection and tracking. This two-stream 3-D CNN takes both the entire video frames and the cropped/tracked gesture image patches as separate inputs, one per stream, with a late fusion mechanism achieved through shared final fully connected layers. Therefore, the proposed CNN encodes both global and local information.

Gesture Detection and Tracking

Faster R-CNN (Girshick 2015) is a popular object detection method, which is adopted in our proposed method for gesture detection and tracking. We pre-train a Faster R-CNN on the VOC2007¹ person-layout dataset² with two output units representing hands versus background. Subsequently, we randomly select 400 frames from our proposed CSL dataset and manually annotate the hand locations for fine-tuning. After that, each video is processed frame-by-frame for gesture detection.

The Faster R-CNN based detection can fail when the hand shape varies hugely or the hand is occluded by clothes. To localize gestures in these frames, compressive tracking (Zhang, Zhang, and Yang 2012) is utilized. The compressive tracking model represents target regions with multi-scale compressive feature vectors and scores proposal regions with a Bayes classifier. Because the Bayes classifier parameters are updated based on the detected targets in each frame, the compressive tracking model is robust to large appearance variations. Specifically, a compressive tracking model is initialized whenever the Faster R-CNN detection fails, using the successful detections from the immediately preceding video frame. Conventional tracking algorithms often suffer from the drift problem, especially on long video sequences. Our proposed method is largely immune to this problem due to the limited length of sign videos.

Two-stream 3D CNN

Due to the nature of signing videos, a robust video feature representation requires the incorporation of both global hand locations/motions and local hand gestures. As shown in Fig. 1, we design a two-stream 3-D CNN based on C3D (Tran et al. 2015), which extracts spatio-temporal features from a video clip (16 frames, as suggested in (Tran et al. 2015)).

¹ http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html
² It contains detailed body part labels (e.g., head, hands and feet).
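The per-frame localization logic described in the Gesture Detection and Tracking section can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `detect_hands` and `make_tracker` are hypothetical stand-ins for the fine-tuned Faster R-CNN detector and for compressive tracking (Zhang, Zhang, and Yang 2012).

```python
# Sketch of the detector-with-tracker-fallback loop described above.
# `detect_hands` and `make_tracker` are hypothetical placeholders for a
# fine-tuned Faster R-CNN and a compressive tracker, not real library APIs.

def localize_gestures(frames, detect_hands, make_tracker):
    """Return one hand bounding box per frame, falling back to tracking
    whenever the detector fails on a frame."""
    boxes = []
    tracker = None
    for frame in frames:
        box = detect_hands(frame)          # None when detection fails
        if box is not None:
            tracker = None                 # detector recovered: drop tracker
        else:
            if tracker is None and boxes:
                # Initialize a tracker from the last successful detection.
                tracker = make_tracker(boxes[-1])
            box = tracker.track(frame) if tracker else None
        boxes.append(box)
    return boxes
```

Dropping the tracker as soon as detection recovers keeps any drift short-lived, consistent with the observation that the limited length of sign videos bounds accumulated tracking error.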
[Figure: two-stream 3-D CNN. A 16-frame input clip feeds a global stream (full frames) and a local stream (hand patches) through convolution and pooling layers into shared fully connected layers that produce the output.]

[Figure 3 plots, panels (a) and (b): DTW warping-path grids over the matrix elements D[i, j], with the clip index on the x-axis.]

[Figure 4 diagram: bidirectional LSTM clip encoders with word attention feeding an LSTM decoder that emits the sentence "the cup is blue" word-by-word from one-hot word vectors.]
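The late-fusion design of the two-stream network can be illustrated at the shape level with the sketch below. It is not the authors' network: the mean-pooling `stream_features` is a trivial placeholder for a C3D-style convolutional stream, and the input sizes and the 128-dimensional shared layer are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

def stream_features(clip):
    """Placeholder for one 3-D CNN stream: collapse a
    (frames, height, width, channels) clip to a feature vector.
    A real stream would apply C3D-style 3-D convolutions and pooling."""
    return clip.mean(axis=(0, 1, 2))  # -> (channels,)

def two_stream_forward(global_clip, local_clip, W, b):
    """Late fusion: concatenate the per-stream features, then apply a
    shared fully connected layer."""
    fused = np.concatenate([stream_features(global_clip),
                            stream_features(local_clip)])
    return W @ fused + b

global_clip = rng.normal(size=(16, 112, 112, 3))   # full video frames
local_clip = rng.normal(size=(16, 64, 64, 3))      # cropped hand patches
W = rng.normal(size=(128, 6))                      # shared FC weights
b = np.zeros(128)
out = two_stream_forward(global_clip, local_clip, W, b)
```

The point of the sketch is the topology: each stream reduces its own input to a feature vector, and only the concatenated features pass through the shared fully connected layers, so the global and local information are fused late.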
Figure 3: Associated warping paths generated by DTW. The x-axis represents the frame index; the y-axis indicates the word sequence index. Grids represent the matrix elements D[i, j]. (a) shows three possible alignment paths of raw DTW. (b) shows the alignment paths of Window-DTW, which are bounded by windows.

Figure 4: HAN encodes videos hierarchically and weights the input sequence via an attention layer. It decodes the hidden vector representation into a sentence word-by-word.

Tv ∈ R^(Ds×Dc) and Ts ∈ R^(Ds×Dw) are transformation matrices that project the video content and the semantic sentence into a common space. Ds is the dimension of the latent space.

To measure the relevance between fv(V) and fs(S), the Dynamic Time Warping (DTW) algorithm is used to find the minimal accumulated distance of the two sequences and the temporal warping path,

D[i, j] = min(D[i − 1, j], D[i − 1, j − 1]) + d(i, j) ,   (3)

d(i, j) = ||Tv vi − Ts sj||_2 ,   (4)

where D[i, j] denotes the distance between (v1, ..., vi) and (s1, ..., sj), and d(i, j) denotes the distance between vi and sj. Thus we define the loss of a video-sentence pair with a single instance as,

Er(V, S; θr) = D(n, m) .   (5)

This DTW algorithm assumes that words appearing earlier in a sentence should also appear early in the video clips, i.e., a monotonically increasing alignment. In our evaluation dataset, most signers vastly prefer simple sentences to compound ones, so simple sentences with a single clause make up the majority of the dataset. Therefore, an approximately monotonically increasing alignment can be assumed.

The associated warping path is recovered using backtracking. This warping path is normally interpreted as the alignment between frames and words. For better alignment, Windowing-DTW (Biba and Xhafa 2011) is used, as shown in Fig. 3, and D[i, j] is only computed within the windows. In the following experiments, the window length is n/2, except for the first and last windows at the boundaries 0 and n. All adjacent windows have an n/4 overlap.

Recognition with HAN

Inspired by recent sequence-to-sequence models (Venugopalan et al. 2015), the recognition problem is formulated as the estimation of the log-conditional probability of sentences given videos. By minimizing this loss, the contextual relationship among the words in a sentence can be kept,

Ec(V, S; θr, θc) = log p(s1, s2, ..., sm | v1, v2, ..., vn; θc)
               = Σ_{t=1}^{m} log p(st | v1, ..., vn, s1, ..., st−1; θc) .   (6)

First, input frame sequences are encoded into a latent vector representation on a per-frame basis, followed by decoding from each representation to a sentence, one word at a time. We extend the encoder in HAN to reflect the hierarchical structure of its inputs (clips form words and words form sentences) and incorporate the attention mechanism. The modified model is an unnamed variant of HAN.

As shown in Fig. 4, the inputs (in blue) are clip sequences and word sequences represented in the latent space. This model contains two encoders and a decoder. Each encoder is a bidirectional LSTM with an attention layer, while the decoder is a single LSTM. The clip encoder encodes the video clips aligned to a word. As shown in Fig. 5, we empirically try three alignment strategies:

1. Split the clips into two subsequences;
2. Split into subsequences every two clips;
3. Split evenly into 7 subsequences, since each sentence contains 7 words on average in the training set.
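At training time, the clip groups are aligned to words by the DTW of Eqs. (3)-(5). A minimal sketch of that two-predecessor recursion and its backtracking follows; it assumes the clip-word distances d(i, j) of Eq. (4) are precomputed into a matrix, and it omits the windowing constraint of Window-DTW for brevity.

```python
import numpy as np

def dtw_align(dist):
    """DTW with the two-predecessor recursion of Eq. (3):
    D[i, j] = min(D[i-1, j], D[i-1, j-1]) + d(i, j).
    `dist` is an (n, m) matrix of clip-word distances d(i, j) with n >= m;
    returns the accumulated cost D[n-1, m-1] and the warping path
    recovered by backtracking."""
    n, m = dist.shape
    D = np.full((n, m), np.inf)
    D[0, 0] = dist[0, 0]
    for i in range(1, n):
        # Only columns j <= i are reachable under this recursion.
        for j in range(min(i + 1, m)):
            prev = D[i - 1, j]
            if j > 0:
                prev = min(prev, D[i - 1, j - 1])
            D[i, j] = prev + dist[i, j]
    # Backtrack from (n-1, m-1) to recover the clip-word alignment.
    path = [(n - 1, m - 1)]
    i, j = n - 1, m - 1
    while i > 0:
        if j > 0 and D[i - 1, j - 1] <= D[i - 1, j]:
            j -= 1
        i -= 1
        path.append((i, j))
    return D[n - 1, m - 1], path[::-1]
```

Unlike textbook DTW, the recursion of Eq. (3) only allows moves from D[i−1, j] and D[i−1, j−1], so every clip row is visited exactly once and the word index never decreases, which is exactly the monotonically increasing alignment assumed in the text.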
The outputs of the bidirectional LSTM pass through the attention layer to form a word-level vector, where the attention layer acts as information-selecting weights on the inputs. Subsequently, the sequences of word-level vectors are projected into a latent vector representation, from which decoding is carried out against the annotation sentence. During decoding, #Start is used as the start symbol to mark the beginning of a sentence and #End as the end symbol that indicates the end of a sentence. At each timestamp, words encoded as "one-hot vectors" are mapped into the latent space and fed to the LSTM (denoted in green) together with the hidden state from the previous timestamp. The LSTM cell output ht is used to emit the word yt. A softmax function is applied to obtain the probability distribution over the words ŷ in the vocabulary voc,

p(yt | ht) = exp(Wyt ht) / Σ_{ŷ∈voc} exp(Wŷ ht) ,   (7)

where Wyt and Wŷ are parameter vectors in the softmax layer. All subsequent words are obtained based on the probability in Eq. (7) until the sentence end symbol is emitted.

Figure 5: Alignment construction. (a) Split the clips into two subsequences and encode as in HAN; (b) split into subsequences every two adjacent clips; (c) split evenly into 7 subsequences, where 7 is the average sentence length in the training set.

Inspired by (Yang et al. 2016), the coherence loss in Eq. (6) can be further simplified as,

Ec(V, S; θr, θc) = Σ_{t=1}^{m} log p(st | ht−1, st−1)
               = Σ_{t=1}^{m} log [ exp(Wst ht) / Σ_{ŝ∈voc} exp(Wŝ ht) ] .   (8)

Learning and Recognition of LS-HAN Model

With the previous formulations of the latent space and HAN, Eq. (1) is equivalent to,

min_{Tv, Ts, θ} (1/N) Σ_{i=1}^{N} [ λ1 Er(V(i), S(i); Tv, Ts) + (1 − λ1) Ec(V(i), S(i); Tv, Ts, θ) ] + λ2 (||Tv||² + ||Ts||² + ||θ||²) ,   (9)

where Tv and Ts are the latent space parameters and θ is the HAN parameter. The optimization of LS-HAN requires the partial derivatives of Eq. (9) with respect to the parameters Tv, Ts, and θ, and the solutions of the resulting equations. Thanks to the linearity, the two loss terms and the regularization term can be optimized separately. For simplicity, we only elaborate the computation of the partial derivatives of Er and Ec in the simplest case (with only one sample), ignoring unrelated coefficients and regularization terms.

Consider the partial derivative of Er with respect to the matrices Tv and Ts. According to Eq. (5),

∂Er(V, S; Tv, Ts)/∂Tv = ∂D(n, m)/∂Tv .   (10)

Substitution of Eq. (3) into Eq. (10) leads to

∂D(n, m)/∂Tv = min( ∂D(n − 1, m)/∂Tv , ∂D(n − 1, m − 1)/∂Tv ) + ∂d(n, m)/∂Tv .   (11)

Eq. (11) is recursive and allows for gradient back-propagation through time. As a result, ∂D(n, m)/∂Tv can be represented by ∂d(i, j)/∂Tv for i < n, j < m,

∂D(n, m)/∂Tv = f( ∂d(i, j)/∂Tv | i < n, j < m ) ,   (12)

where f(·) denotes the resulting recursive composition. Since the term ∂d(i, j)/∂Tv can be obtained according to Eq. (4), ∂D(n, m)/∂Tv can be computed by the chain rule. Likewise, ∂Er(V, S; Tv, Ts)/∂Ts can be computed in the same way.

Consider next the partial derivative of the HAN error Ec with respect to Tv, Ts, and the network parameter θ. Regular stochastic gradient descent is utilized, with gradients calculated by Back-propagation Through Time (BPTT) (Werbos 1990), which recursively back-propagates gradients from the current to the previous timestamps. After the gradients are propagated back to the inputs, ∂Ec(V, S; Tv, Ts, θ)/∂V and ∂Ec(V, S; Tv, Ts, θ)/∂S are obtained. With Eq. (2), a further simplification can be obtained,

∂Ec(V, S; Tv, Ts, θ)/∂Tv = ∂Ec(V, S; Tv, Ts, θ)/∂V · V ,   (13)

∂Ec(V, S; Tv, Ts, θ)/∂Ts = ∂Ec(V, S; Tv, Ts, θ)/∂S · S .   (14)

Therefore, all gradients with respect to all parameters can be obtained, and the objective function in Eq. (9) can be minimized accordingly.

During the testing phase, the proposed LS-HAN is used to translate signing videos sentence-by-sentence. Each video is divided into clips with a sliding window algorithm, and the alignment information needs to be reconstructed. Empirical experiments verify that strategy 3 outperforms the others; therefore it is employed in the testing phase. After encoding, the start symbol "#Start" is fed to the HAN, indicating the beginning of sentence prediction. At each decoding timestamp, the word with the highest probability after the softmax is chosen as the predicted word, and its representation in the latent space is fed to the HAN for the next timestamp, until the emission of the end symbol "#End".

Experiments

In this section, the datasets are introduced, followed by an evaluation comparison. Additionally, a sensitivity analysis on a trade-off parameter is included.

Datasets

Two open-source continuous SLR datasets are used in the following experiments, one for CSL and the other is the
German sign language dataset RWTH-PHOENIX-Weather (Koller, Forster, and Ney 2015). The CSL dataset in Tab. 1 is collected by us and released on our project web page4. A Microsoft Kinect camera is used for all recordings, providing RGB, depth, and body-joint modalities for all videos. The additional modalities should provide helpful additional information.

Table 1: Statistics on the Proposed CSL Video Dataset

  RGB resolution           1920×1080     # of signers    50
  Depth resolution         512×424       Vocab. size     178
  Video duration (sec.)    10∼14         Body joints     25
  Average words/instance   7             FPS             30
  Total instances          25,000        Total hours     100+

Table 2: Continuous SLR Results. Methods in bold text are the original and modified versions of the proposed method.

  Methods                                        Accuracy
  LSTM (Venugopalan et al. 2014)                 0.736
  S2VT (Venugopalan et al. 2015)                 0.745
  LSTM-A (Yao et al. 2015)                       0.757
  LSTM-E (Pan et al. 2015)                       0.768
  HAN (Yang et al. 2016)                         0.793
  CRF (Lafferty, McCallum, and Pereira)          0.686
  LDCRF (Morency, Quattoni, and Darrell 2007)    0.704
  DTW-HMM (Zhang, Zhou, and Li 2014)             0.716
  LS-HAN (a)                                     0.792
  LS-HAN (b)                                     0.805
  LS-HAN (c)                                     0.827
Table 3: Continuous SLR on RWTH-PHOENIX-Weather.

  Methods                                       Accuracy
  (Koller, Forster, and Ney 2015)               0.444
  Deep Hand (Koller, Ney, and Bowden 2016)      0.549
  Recurrent CNN (Cui, Liu, and Zhang 2017)      0.613
  LS-HAN (with hand sequence)                   0.616
Girshick, R. 2015. Fast R-CNN. In Proceedings of ICCV, 1440–1448.

Guo, D.; Zhou, W.; Wang, M.; and Li, H. 2016. Sign language recognition based on adaptive HMMs with data augmentation. In IEEE International Conference on Image Processing, 2876–2880.

Huang, J.; Zhou, W.; Li, H.; and Li, W. 2015. Sign language recognition using 3D convolutional neural networks. In IEEE International Conference on Multimedia and Expo, 1–6.

Koller, O.; Forster, J.; and Ney, H. 2015. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding 141:108–125.

Koller, O.; Ney, H.; and Bowden, R. 2016. Deep Hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3793–3802.

Koller, O.; Zargaran, S.; and Ney, H. 2017. Re-Sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Kurakin, A.; Zhang, Z.; and Liu, Z. 2012. A real time system for dynamic hand gesture recognition with a depth sensor. In Signal Processing Conference, 1975–1979. IEEE.

Lafferty, J.; McCallum, A.; and Pereira, F. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of International Conference on Machine Learning, volume 1, 282–289.

Li, J.; Luong, M.-T.; and Jurafsky, D. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.

Li, K.; Zhou, Z.; and Lee, C.-H. 2016. Sign transition modeling and a scalable solution to continuous sign language recognition for real-world applications. ACM Transactions on Accessible Computing 8(2):7.

Liu, T.; Zhou, W.; and Li, H. 2016. Sign language recognition with long short-term memory. In IEEE International Conference on Image Processing, 2871–2875.

Morency, L.-P.; Quattoni, A.; and Darrell, T. 2007. Latent-dynamic discriminative models for continuous gesture recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 1–8.

Pan, Y.; Yao, T.; Tian, X.; Li, H.; and Ngo, C.-W. 2014. Click-through-based subspace learning for image search. In Proceedings of ACM International Conference on Multimedia, 233–236.

Pan, Y.; Mei, T.; Yao, T.; Li, H.; and Rui, Y. 2015. Jointly modeling embedding and translation to bridge video and language. arXiv preprint arXiv:1505.01861.

Pu, J.; Zhou, W.; and Li, H. 2016. Sign language recognition with multi-modal features. In Pacific Rim Conference on Multimedia, 252–261. Springer.

Ran, L.; Zhang, Y.; Wei, W.; and Zhang, Q. 2017a. A hyperspectral image classification framework with spatial pixel pair features. Sensors 17(10).

Ran, L.; Zhang, Y.; Zhang, Q.; and Yang, T. 2017b. Convolutional neural network-based robot navigation using uncalibrated spherical images. Sensors 17(6).

Starner, T.; Weaver, J.; and Pentland, A. 1998. Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(12):1371–1375.

Sun, C.; Zhang, T.; Bao, B.-K.; Xu, C.; and Mei, T. 2013. Discriminative exemplar coding for sign language recognition with Kinect. IEEE Transactions on Cybernetics 43(5):1418–1428.

Tang, A.; Lu, K.; Wang, Y.; Huang, J.; and Li, H. 2015. A real-time hand posture recognition system using deep neural networks. ACM Transactions on Intelligent Systems and Technology 6(2):21.

Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision, 4489–4497.

Venugopalan, S.; Xu, H.; Donahue, J.; Rohrbach, M.; Mooney, R.; and Saenko, K. 2014. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729.

Venugopalan, S.; Rohrbach, M.; Donahue, J.; Mooney, R.; Darrell, T.; and Saenko, K. 2015. Sequence to sequence – video to text. In Proceedings of the IEEE International Conference on Computer Vision, 4534–4542.

Werbos, P. J. 1990. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE 78(10):1550–1560.

Yang, W.; Tao, J.; and Ye, Z. 2016. Continuous sign language recognition using level building based on fast hidden Markov model. Pattern Recognition Letters 78:28–35.

Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A. J.; and Hovy, E. H. 2016. Hierarchical attention networks for document classification. In Conference of the North American Chapter of the Association for Computational Linguistics, 1480–1489.

Yao, L.; Torabi, A.; Cho, K.; Ballas, N.; Pal, C.; Larochelle, H.; and Courville, A. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision, 4507–4515.

Zhang, Q., and Hua, G. 2015. Multi-view visual recognition of imperfect testing data. In Proceedings of the 23rd Annual ACM Conference on Multimedia, 561–570. ACM.

Zhang, Q.; Abeida, H.; Xue, M.; Rowe, W.; and Li, J. 2011. Fast implementation of sparse iterative covariance-based estimation for array processing. In Conference Record of the Forty-Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2031–2035. IEEE.

Zhang, Q.; Abeida, H.; Xue, M.; Rowe, W.; and Li, J. 2012. Fast implementation of sparse iterative covariance-based estimation for source localization. The Journal of the Acoustical Society of America 131(2):1249–1259.

Zhang, Q.; Hua, G.; Liu, W.; Liu, Z.; and Zhang, Z. 2015a. Auxiliary training information assisted visual recognition. IPSJ Transactions on Computer Vision and Applications 7:138–150.

Zhang, Q.; Hua, G.; Liu, W.; Liu, Z.; and Zhang, Z. 2015b. Can visual recognition benefit from auxiliary information in training? In Computer Vision – ACCV 2014, volume 9003 of Lecture Notes in Computer Science, 65–80. Springer International Publishing.

Zhang, J.; Zhou, W.; Xie, C.; Pu, J.; and Li, H. 2016. Chinese sign language recognition with adaptive HMM. In IEEE International Conference on Multimedia and Expo, 1–6.

Zhang, J.; Zhou, W.; and Li, H. 2014. A threshold-based HMM-DTW approach for continuous sign language recognition. In Proceedings of ACM International Conference on Internet Multimedia Computing and Service, 237.

Zhang, K.; Zhang, L.; and Yang, M.-H. 2012. Real-time compressive tracking. In European Conference on Computer Vision, 864–877. Springer.