Deep Learning in Speech Synthesis
Heiga Zen
Google
August 31st, 2013
Outline
Background
Deep Learning
Conclusion
Text-to-speech as sequence-to-sequence mapping
[Figure: text (concept) is converted to speech; information is transferred by speech characteristics such as the magnitude spectrum and the fundamental frequency. Sound source: voiced sounds use a pulse train at the fundamental frequency, unvoiced sounds use noise, both driven by air flow.]
[Figure: TTS pipeline. Text analysis (sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation, prosody prediction) forms the NLP frontend, a discrete-to-discrete mapping; speech synthesis (waveform generation) forms the speech backend, a discrete-to-continuous mapping producing the synthesized speech.]
Advantages
Flexibility to change voice characteristics
Small footprint
Robustness
Drawback
Quality
[Figure: feed-forward neural network mapping an input vector x through hidden units to an output vector y.]

z_j = Σ_i x_i w_ij
h_j = f(z_j)

where x_i are the input unit activations, w_ij the weights, and f the activation function.
Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 8 of 50
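The per-layer computation above can be sketched in numpy; the tanh activation is an assumption here, since the slide leaves f unspecified:

```python
import numpy as np

def forward_layer(x, W, f=np.tanh):
    """One feed-forward layer: z_j = sum_i x_i w_ij, then h_j = f(z_j).

    x: input activations, shape (n_in,); W: weights, shape (n_in, n_out).
    The tanh activation is an assumption -- the slide leaves f unspecified.
    """
    z = x @ W      # weighted sums z_j
    return f(z)    # hidden activations h_j

# Toy usage: 3 input units feeding 2 hidden units
x = np.array([1.0, 0.5, -0.5])
W = np.array([[0.2, -0.1],
              [0.4,  0.3],
              [0.0,  0.5]])
h = forward_layer(x, W)
```

Stacking such layers, each taking the previous layer's h as its x, gives the deep network discussed in the rest of the talk.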
Difficulties in training DNNs

[Figure: restricted Boltzmann machine (RBM) with binary visible units v_i ∈ {0,1}, binary hidden units h_j ∈ {0,1}, and weights W.]

p(v, h | W) = (1 / Z(W)) exp{−E(v, h; W)}
E(v, h; W) = −Σ_i b_i v_i − Σ_j c_j h_j − Σ_{i,j} v_i w_ij h_j

where w_ij are the weights and b_i, c_j are the biases.
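The energy and joint probability above translate directly into numpy. A minimal sketch: the partition function Z(W) sums over all binary (v, h) configurations, so it is only computed here by brute force for tiny RBMs, to make p(v, h | W) concrete:

```python
import numpy as np
from itertools import product

def rbm_energy(v, h, W, b, c):
    """E(v, h; W) = -sum_i b_i v_i - sum_j c_j h_j - sum_{i,j} v_i w_ij h_j"""
    return -(b @ v) - (c @ h) - (v @ W @ h)

def partition(W, b, c):
    """Z(W): sum of exp{-E} over all binary (v, h) configurations.
    Feasible only for tiny RBMs; real training avoids computing Z."""
    nv, nh = W.shape
    return sum(np.exp(-rbm_energy(np.array(v, float), np.array(h, float), W, b, c))
               for v in product([0, 1], repeat=nv)
               for h in product([0, 1], repeat=nh))

def p_vh(v, h, W, b, c):
    """p(v, h | W) = exp{-E(v, h; W)} / Z(W)"""
    return np.exp(-rbm_energy(v, h, W, b, c)) / partition(W, b, c)
```

The intractability of Z(W) for realistic sizes is exactly why RBMs are trained with approximations such as contrastive divergence rather than exact maximum likelihood.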
[Figure: a deep belief network (DBN) built by stacking RBMs: RBM1 is trained on the input, its hidden activations become the input to RBM2, and so on; the stacked weights are copied into a DNN, which is then fine-tuned with supervised learning.]
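The greedy layer-wise stacking can be sketched as follows; `train_rbm` is a hypothetical trainer (e.g. contrastive divergence) and is not implemented here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_probs(V, W, c):
    """p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i w_ij), row-wise over V."""
    return sigmoid(V @ W + c)

def stack_rbms(V, layer_sizes, train_rbm):
    """Greedy layer-wise pretraining: train RBM1 on the data, copy its
    weights, feed its hidden activations to RBM2 as 'data', and repeat.
    `train_rbm(X, n_hidden) -> (W, b, c)` is an assumed interface."""
    weights = []
    X = V
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(X, n_hidden)
        weights.append((W, c))       # these initialize the DNN layers
        X = hidden_probs(X, W, c)    # input for the next RBM up the stack
    return weights
```

The returned (W, c) pairs initialize a feed-forward DNN, which supervised backpropagation then fine-tunes.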
Products
Personalized photo search [14, 15]
Voice search [16, 17].
Conventional HMM-GMM [1]
[Figure: a decision tree of yes/no questions fragments the acoustic space into clusters.]
Distributed representation
Can be exponentially more efficient than fragmented representation
Better representation ability with fewer parameters
[Figure: three architectures mapping linguistic features x to acoustic features y: (a) a decision tree of yes/no questions selecting a cluster-dependent DBN (DBN i, DBN j) that outputs acoustic features y; (b) a single DBN over visible units v with hidden layers h1, h2, h3 modeling linguistic features x and acoustic features y; (c) a DNN with hidden layers h1, h2, h3 whose last hidden layer feeds Gaussian Process regression to predict acoustic features y.]
Uses last hidden layer output as input for Gaussian Process (GP) regression
Replaces last layer of DNN by GP regression
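A minimal numpy sketch of this idea: the DNN's last-hidden-layer activations H serve as GP inputs. The squared-exponential kernel, length scale, and noise level are illustrative assumptions, not the talk's exact setup:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and B.
    Kernel choice and length scale are assumptions for illustration."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(H_train, y_train, H_test, noise=1e-6):
    """GP regression posterior mean, using last-hidden-layer activations
    H in place of a conventional linear output layer."""
    K = rbf_kernel(H_train, H_train) + noise * np.eye(len(H_train))
    k_star = rbf_kernel(H_test, H_train)
    return k_star @ np.linalg.solve(K, y_train)
```

With near-zero noise the posterior mean interpolates the training targets, which is the behavior the last DNN layer is being swapped out for.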
Comparison
[Figure: DNN-based framework. Input features include binary & numeric linguistic features at frames 1..T, a duration feature, and a frame position feature; output features include excitation features, numeric features, and a V/UV feature; parameter generation and waveform synthesis then produce the speech.]
Is this new? ... no
NN [28]
RNN [29]
Removing silences
A decision tree splits silence & speech at the top of the tree
A single neural network handles both of them
The neural network tries to reduce error for silence frames
Better to remove silence frames as preprocessing
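The preprocessing step above is a simple mask over frames; the boolean `is_silence` mask per frame (e.g. from forced alignment) is an assumed input:

```python
import numpy as np

def remove_silence_frames(acoustic, linguistic, is_silence):
    """Drop frames labeled as silence before DNN training, so the network
    does not spend modeling power reducing error on silence frames."""
    keep = ~np.asarray(is_silence)
    return acoustic[keep], linguistic[keep]
```

The same mask must be applied to the acoustic targets and the linguistic inputs so frames stay aligned.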
Example of speech parameter trajectories
[Figure: speech parameter trajectories over frames 0-500 for natural speech, HMM, and DNN (4 x 512) systems.]
Objective measures
Aperiodicity distortion (dB)
Voiced/Unvoiced error rates (%)
Mel-cepstral distortion (dB)
RMSE in log F0
[Figure: four plots of aperiodicity distortion (dB), voiced/unvoiced error rate (%), mel-cepstral distortion (dB), and RMSE in log F0 against the total number of parameters (1e+05 to 1e+07), comparing HMM systems with DNNs of 256, 512, 1024, and 2048 units per layer.]
HMM        DNN (#layers x #units)   Neutral   p value   z value
15.8 (16)  38.5 (4 x 256)           45.7      < 10^-6   -9.9
16.1 (4)   27.2 (4 x 512)           56.8      < 10^-6   -5.1
12.7 (1)   36.6 (4 x 1024)          50.7      < 10^-6   -11.5
[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[11] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In D. Rumelhart and J. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 6, pages 194-281. MIT Press, 1986.
[14] C. Rosenberg. Improving photo search: a step across the semantic gap. http://googleresearch.blogspot.co.uk/2013/06/improving-photo-search-step-across.html
[15] K. Yu. https://plus.sandbox.google.com/103688557111379853702/posts/fdw7EQX87Eq
[16] V. Vanhoucke. Speech recognition and deep learning. http://googleresearch.blogspot.co.uk/2012/08/speech-recognition-and-deep-learning.html
[17] Bing makes voice recognition on Windows Phone more accurate and twice as fast. http://www.bing.com/blogs/site_blogs/b/search/archive/2013/06/17/dnn.aspx