[Figure: system framework: acoustic feature extraction and head motion parameter extraction from an audio/visual speech corpus, neural network training, and animation driven by the trained NN model.]
[Figure: CCA and RMSE (degree) for networks with 1, 2 and 3 hidden layers.]
For the BLSTM-RNN, the two-hidden-layer configuration achieves consistently better results, and the best performance is obtained with 512 hidden nodes. The general trend in Fig. 5 is that performance improves as the number of hidden nodes in the BLSTM-RNN increases.

Figure 5: Performance comparison: FFNN vs. BLSTM-RNN. [Two panels: CCA and RMSE (degree) versus the number of hidden nodes (64, 128, 256, 512, 1024, 2048), with curves for FFNN with 1-5 hidden layers and BLSTM with 1-3 hidden layers.]
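For reference, the two objective scores used in these comparisons are the canonical correlation analysis (CCA) coefficient [28] and the root mean squared error (RMSE) in degrees between predicted and ground-truth head motion trajectories. A minimal Python sketch of one plausible way to compute both, assuming the trajectories are stored as (frames x angles) NumPy arrays (the function names are illustrative, not the authors' code):

```python
# Minimal sketch (not the authors' code): CCA and RMSE between a predicted and
# a ground-truth head motion trajectory, each a (num_frames, num_angles) array
# of Euler angles in degrees.
import numpy as np
from sklearn.cross_decomposition import CCA


def head_motion_cca(pred, gt):
    """First canonical correlation between predicted and reference motion."""
    cca = CCA(n_components=1)
    cca.fit(pred, gt)
    pred_c, gt_c = cca.transform(pred, gt)
    return float(np.corrcoef(pred_c[:, 0], gt_c[:, 0])[0, 1])


def head_motion_rmse(pred, gt):
    """Root mean squared error in degrees over all frames and angles."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(500, 3))                 # stand-in ground-truth angles
    pred = gt + 0.3 * rng.normal(size=gt.shape)    # noisy stand-in prediction
    print("CCA :", head_motion_cca(pred, gt))
    print("RMSE:", head_motion_rmse(pred, gt))
```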
5.3. Post-Filtering with Dynamic Features

As discussed in Section 3, the network output can be composed of both static and dynamic head motion parameters. With predicted dynamics, MLPG [27] is used to generate smooth trajectories. We therefore also tested whether this post-filtering step helps the neural networks. The training configuration is the same as that in Section 5.1. We can see from Table 1 that the FFNN with post-filtering obtains a slight performance gain, but this is not the case for BLSTM-RNN, where we observe a small drop in CCA. We believe that BLSTM-RNN has strong long-context modeling ability and already produces smooth head motion trajectories, which makes this post-filtering step largely unnecessary.

Table 1: Results for post-filtering with dynamic features.

System          CCA    RMSE (degree)
FFNN            0.432  0.881
FFNN-DYN        0.503  0.872
BLSTM-RNN       0.688  0.787
BLSTM-RNN-DYN   0.685  0.783
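With unit variances and the common delta window (-0.5, 0, 0.5), the MLPG generation step [27] reduces to a linear least-squares problem over the stacked static and delta predictions. The sketch below is a simplified, one-dimensional illustration under exactly these assumptions, not the authors' implementation:

```python
# Simplified MLPG-style smoothing sketch (illustrative, not the authors' code).
# Assumes one motion dimension, unit variances, and delta window (-0.5, 0, 0.5):
# find the trajectory c that best explains the predicted static and delta values.
import numpy as np


def mlpg_smooth(static, delta):
    """Least-squares trajectory from predicted static and delta features."""
    T = len(static)
    identity = np.eye(T)                      # maps trajectory -> static stream
    delta_op = np.zeros((T, T))               # maps trajectory -> delta stream
    for t in range(1, T - 1):
        delta_op[t, t - 1], delta_op[t, t + 1] = -0.5, 0.5
    W = np.vstack([identity, delta_op])       # stacked window matrix
    o = np.concatenate([static, delta])       # stacked network outputs
    c, *_ = np.linalg.lstsq(W, o, rcond=None)
    return c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 4 * np.pi, 200)
    clean = 5.0 * np.sin(t)                   # stand-in for one head-angle track
    noisy_static = clean + rng.normal(scale=0.8, size=t.shape)
    noisy_delta = np.gradient(clean) + rng.normal(scale=0.2, size=t.shape)
    smooth = mlpg_smooth(noisy_static, noisy_delta)
    print("RMSE before:", np.sqrt(np.mean((noisy_static - clean) ** 2)))
    print("RMSE after :", np.sqrt(np.mean((smooth - clean) ** 2)))
```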
5.4. Stacking Feed-forward and BLSTM Layers

Since different types of neural network layers can easily be combined to form new architectures, we also tested what happens when feed-forward and BLSTM layers are stacked. We exhaustively tested different network topologies (FB, BF, FFB, FBF, BFF, FBB, BFB, BBF) with different numbers of hidden nodes (64, 128, 256, 512), where F and B denote a feed-forward layer and a bidirectional LSTM layer, respectively. The number of nodes was kept the same for all hidden layers in every tested architecture. The training configurations are the same as in the previous experiments. We find that networks with 512 hidden nodes per layer consistently perform better, and we therefore list their results in Table 2. Interestingly, the highest CCA (0.711) and the lowest RMSE (0.750) are achieved by the network that inserts a feed-forward layer between two BLSTM layers (BFB). This superior performance indicates that stacking different types of layers may lead to better performance and deserves particular consideration in regression tasks. We also note that a BFB network has shown the best performance in a speech synthesis task [17].

Table 2: Performances for network topologies (TP) with different hidden-layer stackings (512 hidden nodes per layer).

TP    CCA    RMSE     TP    CCA    RMSE
FB    0.686  0.802    BF    0.670  0.784
FFB   0.631  0.823    BFF   0.695  0.777
FBF   0.683  0.806    BFB   0.711  0.750
FBB   0.633  0.827    BBF   0.666  0.810
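Purely as an illustration of the best topology above, the following PyTorch sketch stacks a BLSTM layer, a frame-wise feed-forward layer, and a second BLSTM layer before a linear output layer (the acoustic feature dimension, per-direction layer sizes, and output dimension are assumptions, not values taken from the paper):

```python
# Illustrative PyTorch sketch of the BFB topology (BLSTM -> feed-forward -> BLSTM).
# All dimensions below are assumptions for demonstration purposes only.
import torch
import torch.nn as nn


class BFBRegressor(nn.Module):
    def __init__(self, acoustic_dim=75, hidden=512, motion_dim=3):
        super().__init__()
        # B: bidirectional LSTM over the acoustic feature sequence.
        self.blstm1 = nn.LSTM(acoustic_dim, hidden, batch_first=True,
                              bidirectional=True)
        # F: frame-wise feed-forward layer between the two BLSTM layers.
        self.ff = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh())
        # B: second bidirectional LSTM.
        self.blstm2 = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)
        # Linear output layer predicting head rotation (e.g. 3 Euler angles).
        self.out = nn.Linear(2 * hidden, motion_dim)

    def forward(self, acoustic):               # acoustic: (batch, frames, acoustic_dim)
        h, _ = self.blstm1(acoustic)
        h = self.ff(h)
        h, _ = self.blstm2(h)
        return self.out(h)                     # (batch, frames, motion_dim)


if __name__ == "__main__":
    model = BFBRegressor()
    dummy = torch.randn(2, 100, 75)            # 2 utterances, 100 frames each
    print(model(dummy).shape)                  # torch.Size([2, 100, 3])
```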
5.5. Subjective Evaluation

We conducted A/B preference tests on the naturalness of the synthesized head motion animation. We randomly selected 15 sentences from the test set and synthesized the head motions of a talking avatar for four systems:
• FFNN-DYN: the FFNN-DYN system in Table 1;
• BLSTM-RNN: the BLSTM-RNN system in Table 1;
• Stacked BFB: the BFB system in Table 2;
• Ground Truth: the ground-truth head motion applied to the avatar.
We carried out three sessions of comparative evaluations: FFNN-DYN vs. BLSTM-RNN, BLSTM-RNN vs. Stacked BFB, and Stacked BFB vs. Ground Truth. Note that lip-sync with synchronized speech playback [29] is also realized on the avatar to make it more lifelike. A group of 20 subjects was asked to choose, for each pair, which animation had the more natural head motion. The preference percentages are shown in Fig. 6. BLSTM-RNN is preferred significantly more often than FFNN-DYN, while the stacked BFB is slightly preferred over BLSTM-RNN. The preference percentage of the stacked BFB is quite close to that of the ground-truth head motion, which suggests that the proposed approach can produce plausible head motions synchronized with speech.

Figure 6: The percentage preference of the A/B tests.
BLSTM-RNN (62%)    Neutral (18%)    FFNN-DYN (20%)
BLSTM-RNN (34%)    Neutral (24%)    Stacked BFB (42%)
Stacked BFB (36%)  Neutral (22%)    Ground Truth (42%)
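The paper reports the preference differences as significant but does not state which test was used; one simple possibility, sketched below with purely hypothetical counts (assuming each of the 20 subjects judged all 15 sentence pairs in a session), is a two-sided binomial test on the non-neutral votes:

```python
# Illustrative sketch (not from the paper): a two-sided binomial test on the
# non-neutral votes of one A/B session. The counts are hypothetical and simply
# assume 20 subjects x 15 sentence pairs = 300 judgments per session.
from scipy.stats import binomtest

prefer_a = 186   # hypothetical: 62% of 300 judgments for BLSTM-RNN
prefer_b = 60    # hypothetical: 20% of 300 judgments for FFNN-DYN

result = binomtest(prefer_a, prefer_a + prefer_b, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.2e}")  # small p-value -> preference unlikely by chance
```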
6. Conclusion and Future Work

In this paper, we have studied the feasibility of learning speech-to-head-motion regression models with feed-forward neural networks (FFNN) and bidirectional long short-term memory recurrent neural networks (BLSTM-RNN). From extensive experiments, we conclude that: (1) an appropriately long context of acoustic feature frames is essential for the performance of an FFNN; (2) BLSTM-RNN significantly outperforms FFNN in head motion prediction; (3) a hybrid network (BLSTM-FF-BLSTM) shows superior performance in both objective and subjective evaluations. In the future, we plan to investigate head motion synthesis using neural networks for multiple speakers.

7. Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61175018) and the Seed Foundation of Innovation and Creation for Graduate Students in Northwestern Polytechnical University.
8. References

[1] K. G. Munhall, J. A. Jones, D. E. Callan, T. Kuratate, and E. Vatikiotis-Bateson, "Visual prosody and speech intelligibility: head movement improves auditory speech perception," Psychological Science, vol. 15, no. 2, pp. 133–137, 2004.
[2] E. Cosatto, J. Ostermann, H. P. Graf, and J. Schroeter, "Lifelike talking faces for interactive services," Proceedings of the IEEE, vol. 91, no. 9, pp. 1406–1429, 2003.
[3] S. Zhang, Z. Wu, H. Meng, and L. Cai, "Head movement synthesis based on semantic and prosodic features for a Chinese expressive avatar," in Proc. ICASSP, vol. 4, 2007, p. IV-837.
[4] K. Mu, J. Tao, J. Che, and M. Yang, "Mood avatar: Automatic text-driven head motion synthesis," in Proc. ICML. ACM, 2010, p. 37.
[5] C. Busso, Z. Deng, U. Neumann, and S. Narayanan, "Natural head motion synthesis driven by acoustic prosodic features," Computer Animation and Virtual Worlds, vol. 16, no. 3–4, pp. 283–290, 2005.
[6] C. Busso, Z. Deng, M. Grimm, U. Neumann, and S. Narayanan, "Rigid head motion in expressive speech animation: Analysis and synthesis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1075–1086, 2007.
[7] G. Hofer and H. Shimodaira, "Automatic head motion prediction from speech data," in Proc. INTERSPEECH, Antwerp, Belgium, vol. 4, 2007.
[8] G. Hofer, H. Shimodaira, and J. Yamagishi, "Speech driven head motion synthesis based on a trajectory model," in Proc. SIGGRAPH, San Diego, California, USA, vol. 1, 2007.
[9] A. Ben Youssef, H. Shimodaira, and D. A. Braude, "Articulatory features for speech-driven head motion synthesis," in Proc. INTERSPEECH, Lyon, France, vol. 5, 2013, pp. 2758–2762.
[10] M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp, "Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 8, pp. 1330–1345, 2008.
[11] A. Ben-Youssef, H. Shimodaira, and D. A. Braude, "Speech driven talking head from estimated articulatory features," in Proc. ICASSP, Florence, Italy, vol. 5, 2014, pp. 4606–4610.
[12] K. Richmond, P. Hoole, and S. King, "Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus," in Proc. INTERSPEECH, Florence, Italy, vol. 5, 2011, pp. 1505–1508.
[13] T. Toda, A. W. Black, and K. Tokuda, "Acoustic-to-articulatory inversion mapping with Gaussian mixture model," in Proc. INTERSPEECH, 2004.
[14] H. C. Yehia, T. Kuratate, and E. Vatikiotis-Bateson, "Linking facial animation, head motion and speech acoustics," Journal of Phonetics, vol. 30, no. 3, pp. 555–568, 2002.
[15] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[16] H. Ze, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, Vancouver, BC, Canada, vol. 5, 2013, pp. 7962–7966.
[17] Y. Fan, Y. Qian, F. Xie, and F. K. Soong, "TTS synthesis with bidirectional LSTM based recurrent neural networks," in Proc. INTERSPEECH, Singapore, 2014.
[18] S. Kang, X. Qian, and H. Meng, "Multi-distribution deep belief network for speech synthesis," in Proc. ICASSP, 2013, pp. 8012–8016.
[19] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 12, pp. 1859–1872, 2014.
[20] B. Uria, S. Renals, and K. Richmond, "A deep neural network for acoustic-articulatory speech inversion," in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Sierra Nevada, Spain, December 2011.
[21] C. Ding, P. Zhu, L. Xie, D. Jiang, and Z. Fu, "Speech-driven head motion synthesis using neural networks," in Proc. INTERSPEECH, Singapore, vol. 5, 2014.
[22] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. INTERSPEECH, Singapore, vol. 5, 2014.
[23] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[24] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. ICASSP, 2013, pp. 6645–6649.
[25] B. Fan, L. Wang, F. K. Soong, and L. Xie, "Photo-real talking head with deep bidirectional LSTM," in Proc. ICASSP, 2015.
[26] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA recurrent neural network toolkit," Journal of Machine Learning Research, vol. 15, 2014.
[27] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. ICASSP, vol. 3, 2000, pp. 1315–1318.
[28] D. Hardoon, S. Szedmak, and J. Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
[29] J. Park and H. Ko, "Real-time continuous phoneme recognition system using class-dependent tied-mixture HMM with HBT structure for speech-driven lip-sync," IEEE Transactions on Multimedia, vol. 10, no. 7, pp. 1299–1306, 2008.