[Figure: system framework: acoustic feature extraction and head motion parameter extraction from an audio/visual speech corpus, neural network training, and animation driven by the trained NN model.]
[Figure: CCA and RMSE (degree) for networks with 1, 2 and 3 hidden layers.]
For the BLSTM-RNN, the two-hidden-layer configuration achieves consistently better results, and the best performance is obtained with 512 hidden nodes. The general trend in Fig. 5 is that performance improves as the number of hidden nodes in the BLSTM-RNN increases.

Figure 5: Performance comparison: FFNN vs. BLSTM-RNN. [Two panels: CCA and RMSE (degree) versus the number of hidden nodes (64, 128, 256, 512, 1024, 2048), with curves for FFNN with 1-5 hidden layers and BLSTM with 1-3 hidden layers.]
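For reference, the two objective scores used in these comparisons are the canonical correlation analysis (CCA) coefficient [28] and the root mean squared error (RMSE) in degrees between predicted and ground-truth head motion trajectories. A minimal Python sketch of one plausible way to compute both, assuming the trajectories are stored as (frames x angles) NumPy arrays (the function names are illustrative, not the authors' code):

```python
# Minimal sketch (not the authors' code): CCA and RMSE between a predicted and
# a ground-truth head motion trajectory, each a (num_frames, num_angles) array
# of Euler angles in degrees.
import numpy as np
from sklearn.cross_decomposition import CCA


def head_motion_cca(pred, gt):
    """First canonical correlation between predicted and reference motion."""
    cca = CCA(n_components=1)
    cca.fit(pred, gt)
    pred_c, gt_c = cca.transform(pred, gt)
    return float(np.corrcoef(pred_c[:, 0], gt_c[:, 0])[0, 1])


def head_motion_rmse(pred, gt):
    """Root mean squared error in degrees over all frames and angles."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(500, 3))                 # stand-in ground-truth angles
    pred = gt + 0.3 * rng.normal(size=gt.shape)    # noisy stand-in prediction
    print("CCA :", head_motion_cca(pred, gt))
    print("RMSE:", head_motion_rmse(pred, gt))
```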
5.3. Post-Filtering with Dynamic Features

As discussed in Section 3, the network output can be composed of both static and dynamic head motion parameters. With predicted dynamics, MLPG [27] is used to generate smooth trajectories. We therefore also tested whether this post-filtering step helps the neural networks. The training configuration is the same as that in Section 5.1. We can see from Table 1 that the FFNN with post-filtering obtains a slight performance gain, but this is not the case for BLSTM-RNN, where we observe a small drop in CCA. We believe that BLSTM-RNN has strong long-context modeling ability and already produces smooth head motion trajectories, which makes this post-filtering step largely unnecessary.

Table 1: Results for post-filtering with dynamic features.

System          CCA    RMSE (degree)
FFNN            0.432  0.881
FFNN-DYN        0.503  0.872
BLSTM-RNN       0.688  0.787
BLSTM-RNN-DYN   0.685  0.783
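With unit variances and the common delta window (-0.5, 0, 0.5), the MLPG generation step [27] reduces to a linear least-squares problem over the stacked static and delta predictions. The sketch below is a simplified, one-dimensional illustration under exactly these assumptions, not the authors' implementation:

```python
# Simplified MLPG-style smoothing sketch (illustrative, not the authors' code).
# Assumes one motion dimension, unit variances, and delta window (-0.5, 0, 0.5):
# find the trajectory c that best explains the predicted static and delta values.
import numpy as np


def mlpg_smooth(static, delta):
    """Least-squares trajectory from predicted static and delta features."""
    T = len(static)
    identity = np.eye(T)                      # maps trajectory -> static stream
    delta_op = np.zeros((T, T))               # maps trajectory -> delta stream
    for t in range(1, T - 1):
        delta_op[t, t - 1], delta_op[t, t + 1] = -0.5, 0.5
    W = np.vstack([identity, delta_op])       # stacked window matrix
    o = np.concatenate([static, delta])       # stacked network outputs
    c, *_ = np.linalg.lstsq(W, o, rcond=None)
    return c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 4 * np.pi, 200)
    clean = 5.0 * np.sin(t)                   # stand-in for one head-angle track
    noisy_static = clean + rng.normal(scale=0.8, size=t.shape)
    noisy_delta = np.gradient(clean) + rng.normal(scale=0.2, size=t.shape)
    smooth = mlpg_smooth(noisy_static, noisy_delta)
    print("RMSE before:", np.sqrt(np.mean((noisy_static - clean) ** 2)))
    print("RMSE after :", np.sqrt(np.mean((smooth - clean) ** 2)))
```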
5.4. Stacking Feed-forward and BLSTM Layers

Since different types of neural network layers can easily be combined to form new architectures, we also tested what happens when feed-forward and BLSTM layers are stacked. We exhaustively tested different network topologies (FB, BF, FFB, FBF, BFF, FBB, BFB, BBF) with different numbers of hidden nodes (64, 128, 256, 512), where F and B denote a feed-forward layer and a bidirectional LSTM layer, respectively. The number of nodes was kept the same for all hidden layers in every tested architecture. The training configurations are the same as in the previous experiments. We find that networks with 512 hidden nodes per layer consistently perform better, and we therefore list their results in Table 2. Interestingly, the highest CCA (0.711) and the lowest RMSE (0.750) are achieved by the network that inserts a feed-forward layer between two BLSTM layers (BFB). This superior performance indicates that stacking different types of layers may lead to better performance and deserves particular consideration in regression tasks. We also note that a BFB network has shown the best performance in a speech synthesis task [17].

Table 2: Performances for network topologies (TP) with different hidden-layer stackings (512 hidden nodes per layer).

TP    CCA    RMSE     TP    CCA    RMSE
FB    0.686  0.802    BF    0.670  0.784
FFB   0.631  0.823    BFF   0.695  0.777
FBF   0.683  0.806    BFB   0.711  0.750
FBB   0.633  0.827    BBF   0.666  0.810
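Purely as an illustration of the best topology above, the following PyTorch sketch stacks a BLSTM layer, a frame-wise feed-forward layer, and a second BLSTM layer before a linear output layer (the acoustic feature dimension, per-direction layer sizes, and output dimension are assumptions, not values taken from the paper):

```python
# Illustrative PyTorch sketch of the BFB topology (BLSTM -> feed-forward -> BLSTM).
# All dimensions below are assumptions for demonstration purposes only.
import torch
import torch.nn as nn


class BFBRegressor(nn.Module):
    def __init__(self, acoustic_dim=75, hidden=512, motion_dim=3):
        super().__init__()
        # B: bidirectional LSTM over the acoustic feature sequence.
        self.blstm1 = nn.LSTM(acoustic_dim, hidden, batch_first=True,
                              bidirectional=True)
        # F: frame-wise feed-forward layer between the two BLSTM layers.
        self.ff = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh())
        # B: second bidirectional LSTM.
        self.blstm2 = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)
        # Linear output layer predicting head rotation (e.g. 3 Euler angles).
        self.out = nn.Linear(2 * hidden, motion_dim)

    def forward(self, acoustic):               # acoustic: (batch, frames, acoustic_dim)
        h, _ = self.blstm1(acoustic)
        h = self.ff(h)
        h, _ = self.blstm2(h)
        return self.out(h)                     # (batch, frames, motion_dim)


if __name__ == "__main__":
    model = BFBRegressor()
    dummy = torch.randn(2, 100, 75)            # 2 utterances, 100 frames each
    print(model(dummy).shape)                  # torch.Size([2, 100, 3])
```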
5.5. Subjective Evaluation

We conducted A/B preference tests on the naturalness of the synthesized head motion animation. We randomly selected 15 sentences from the test set and synthesized the head motions of a talking avatar for four systems:
• FFNN-DYN: the FFNN-DYN system in Table 1;
• BLSTM-RNN: the BLSTM-RNN system in Table 1;
• Stacked BFB: the BFB system in Table 2;
• Ground Truth: the ground-truth head motion applied to the avatar.
We carried out three sessions of comparative evaluations: FFNN-DYN vs. BLSTM-RNN, BLSTM-RNN vs. Stacked BFB, and Stacked BFB vs. Ground Truth. Note that lip-sync with synchronized speech playback [29] is also realized on the avatar to make it more lifelike. A group of 20 subjects was asked to choose, for each pair, which animation had the more natural head motion. The preference percentages are shown in Fig. 6. BLSTM-RNN is preferred significantly more often than FFNN-DYN, while the stacked BFB is slightly preferred over BLSTM-RNN. The preference percentage of the stacked BFB is quite close to that of the ground-truth head motion, which suggests that the proposed approach can produce plausible head motions synchronized with speech.

Figure 6: The percentage preference of the A/B tests.
BLSTM-RNN (62%)    Neutral (18%)    FFNN-DYN (20%)
BLSTM-RNN (34%)    Neutral (24%)    Stacked BFB (42%)
Stacked BFB (36%)  Neutral (22%)    Ground Truth (42%)
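The paper reports the preference differences as significant but does not state which test was used; one simple possibility, sketched below with purely hypothetical counts (assuming each of the 20 subjects judged all 15 sentence pairs in a session), is a two-sided binomial test on the non-neutral votes:

```python
# Illustrative sketch (not from the paper): a two-sided binomial test on the
# non-neutral votes of one A/B session. The counts are hypothetical and simply
# assume 20 subjects x 15 sentence pairs = 300 judgments per session.
from scipy.stats import binomtest

prefer_a = 186   # hypothetical: 62% of 300 judgments for BLSTM-RNN
prefer_b = 60    # hypothetical: 20% of 300 judgments for FFNN-DYN

result = binomtest(prefer_a, prefer_a + prefer_b, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.2e}")  # small p-value -> preference unlikely by chance
```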
6. Conclusion and Future Work

In this paper, we have studied the feasibility of learning speech-to-head-motion regression models with feed-forward neural networks (FFNN) and bidirectional long short-term memory recurrent neural networks (BLSTM-RNN). From extensive experiments, we conclude that: (1) an appropriately long context of acoustic feature frames is essential for the performance of an FFNN; (2) BLSTM-RNN significantly outperforms FFNN in head motion prediction; (3) a hybrid network (BLSTM-FF-BLSTM) shows superior performance in both objective and subjective evaluations. In the future, we plan to investigate head motion synthesis using neural networks for multiple speakers.

7. Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61175018) and the Seed Foundation of Innovation and Creation for Graduate Students in Northwestern Polytechnical University.
8. References

[1] K. G. Munhall, J. A. Jones, D. E. Callan, T. Kuratate, and E. Vatikiotis-Bateson, "Visual prosody and speech intelligibility: head movement improves auditory speech perception," Psychological Science, vol. 15, no. 2, pp. 133–137, 2004.
[2] E. Cosatto, J. Ostermann, H. P. Graf, and J. Schroeter, "Lifelike talking faces for interactive services," Proceedings of the IEEE, vol. 91, no. 9, pp. 1406–1429, 2003.
[3] S. Zhang, Z. Wu, H. Meng, and L. Cai, "Head movement synthesis based on semantic and prosodic features for a Chinese expressive avatar," in Proc. ICASSP, vol. 4, 2007, p. IV-837.
[4] K. Mu, J. Tao, J. Che, and M. Yang, "Mood avatar: Automatic text-driven head motion synthesis," in Proc. ICML. ACM, 2010, p. 37.
[5] C. Busso, Z. Deng, U. Neumann, and S. Narayanan, "Natural head motion synthesis driven by acoustic prosodic features," Computer Animation and Virtual Worlds, vol. 16, no. 3–4, pp. 283–290, 2005.
[6] C. Busso, Z. Deng, M. Grimm, U. Neumann, and S. Narayanan, "Rigid head motion in expressive speech animation: Analysis and synthesis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1075–1086, 2007.
[7] G. Hofer and H. Shimodaira, "Automatic head motion prediction from speech data," in Proc. INTERSPEECH, Antwerp, Belgium, vol. 4, 2007.
[8] G. Hofer, H. Shimodaira, and J. Yamagishi, "Speech driven head motion synthesis based on a trajectory model," in Proc. SIGGRAPH, San Diego, California, USA, vol. 1, 2007.
[9] A. Ben Youssef, H. Shimodaira, and D. A. Braude, "Articulatory features for speech-driven head motion synthesis," in Proc. INTERSPEECH, Lyon, France, vol. 5, 2013, pp. 2758–2762.
[10] M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp, "Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 8, pp. 1330–1345, 2008.
[11] A. Ben-Youssef, H. Shimodaira, and D. A. Braude, "Speech driven talking head from estimated articulatory features," in Proc. ICASSP, Florence, Italy, vol. 5, 2014, pp. 4606–4610.
[12] K. Richmond, P. Hoole, and S. King, "Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus," in Proc. INTERSPEECH, Florence, Italy, vol. 5, 2011, pp. 1505–1508.
[13] T. Toda, A. W. Black, and K. Tokuda, "Acoustic-to-articulatory inversion mapping with Gaussian mixture model," in Proc. INTERSPEECH, 2004.
[14] H. C. Yehia, T. Kuratate, and E. Vatikiotis-Bateson, "Linking facial animation, head motion and speech acoustics," Journal of Phonetics, vol. 30, no. 3, pp. 555–568, 2002.
[15] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[16] H. Ze, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, Vancouver, BC, Canada, vol. 5, 2013, pp. 7962–7966.
[17] Y. Fan, Y. Qian, F. Xie, and F. K. Soong, "TTS synthesis with bidirectional LSTM based recurrent neural networks," in Proc. INTERSPEECH, Singapore, 2014.
[18] S. Kang, X. Qian, and H. Meng, "Multi-distribution deep belief network for speech synthesis," in Proc. ICASSP, 2013, pp. 8012–8016.
[19] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 12, pp. 1859–1872, 2014.
[20] B. Uria, S. Renals, and K. Richmond, "A deep neural network for acoustic-articulatory speech inversion," in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Sierra Nevada, Spain, December 2011.
[21] C. Ding, P. Zhu, L. Xie, D. Jiang, and Z. Fu, "Speech-driven head motion synthesis using neural networks," in Proc. INTERSPEECH, Singapore, vol. 5, 2014.
[22] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. INTERSPEECH, Singapore, vol. 5, 2014.
[23] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[24] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. ICASSP, 2013, pp. 6645–6649.
[25] B. Fan, L. Wang, F. K. Soong, and L. Xie, "Photo-real talking head with deep bidirectional LSTM," in Proc. ICASSP, 2015.
[26] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA recurrent neural network toolkit," Journal of Machine Learning Research, vol. 15, 2014.
[27] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. ICASSP, vol. 3, 2000, pp. 1315–1318.
[28] D. Hardoon, S. Szedmak, and J. Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
[29] J. Park and H. Ko, "Real-time continuous phoneme recognition system using class-dependent tied-mixture HMM with HBT structure for speech-driven lip-sync," IEEE Transactions on Multimedia, vol. 10, no. 7, pp. 1299–1306, 2008.