Problems due to lack of robustness
• Training instability caused by excessive degrees of freedom in alignment
• Quality degradation caused by exposure bias
• Discrepancy between training and inference

[Figure: HSMM trellis (state vs. frame). A state sequence 𝑺 = (1, 1, 2, 2, 2, 3, …) corresponds to state durations 𝑑1 = 2, 𝑑2 = 3, …, expanding the state outputs 𝒗1, 𝒗2, 𝒗3, … into the frame-level sequence 𝒗1, 𝒗1, 𝒗2, 𝒗2, 𝒗2, 𝒗3, ….]
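The state–duration correspondence in the trellis above can be sketched in a few lines; this helper (the name is illustrative, not from the paper) collapses a frame-level state sequence into per-state durations:

```python
from itertools import groupby

def durations_from_states(state_seq):
    """Collapse a frame-level state sequence into per-state durations,
    e.g. S = (1, 1, 2, 2, 2, 3) -> states (1, 2, 3) with d = (2, 3, 1)."""
    states, durs = [], []
    for state, run in groupby(state_seq):
        states.append(state)
        durs.append(len(list(run)))
    return states, durs

states, durs = durations_from_states([1, 1, 2, 2, 2, 3])
print(states, durs)  # [1, 2, 3] [2, 3, 1]
```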
Hidden semi-Markov model (HSMM)-based structured attention [Nankaku et al.; ’21]
• Latent alignment based on the VAE framework
• Acoustic modeling with explicit duration models
• Appropriate handling of duration in a statistical manner
• Constraints to yield monotonic alignment

Training
• Maximize the ELBO over the approximate posterior distributions 𝑄(𝑺) and 𝑄(𝒁)
• Estimate the state given the entire sequence using the forward-backward algorithm
• Approximate posterior distribution relative to frame 𝑡
• Simultaneously infer 𝑄(𝑺) and 𝑄(𝒁) to capture their strong dependence

[Figure: a text encoder and duration predictor map linguistic features 𝒀 to state outputs; a speech encoder maps speech 𝑿 to latents 𝒁; forward (𝛼) and backward (𝛽) variables yield the per-frame state posteriors 𝜸𝑡 used to form 𝑄(𝑺) and 𝑄(𝒁).]
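The forward-backward estimation of the per-frame state posteriors 𝜸𝑡 can be sketched for a plain discrete HMM (the poster's model is an HSMM with explicit durations and neural distributions, so this is only the skeleton; all probabilities below are made up for illustration):

```python
def forward_backward(pi, A, B, obs):
    """Forward-backward algorithm for a discrete HMM: returns per-frame
    state posteriors gamma[t][i] = P(s_t = i | entire observation sequence).
    pi: initial probs, A: transition matrix, B[i][o]: emission probs."""
    N, T = len(pi), len(obs)
    # forward pass: alpha[t][i] = P(o_1..o_t, s_t = i)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t-1][i] * A[i][j] for i in range(N))
    # backward pass: beta[t][i] = P(o_{t+1}..o_T | s_t = i)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
    # combine alpha and beta, then normalize per frame to get gamma
    gamma = []
    for t in range(T):
        w = [alpha[t][i] * beta[t][i] for i in range(N)]
        z = sum(w)
        gamma.append([x / z for x in w])
    return gamma

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
gamma = forward_backward(pi, A, B, [0, 0, 1])
```

Because each 𝜸𝑡 is conditioned on the entire sequence through both passes, the state estimate at frame 𝑡 uses future as well as past observations.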
Integration of HSMM-based structured attention and AR VAE
• Consistent AR structure between training and inference
• Reduce exposure bias
• Forward algorithm with FUNet conditioned on past samples 𝒛<𝑡

[Figure: proposed architecture. A text encoder maps linguistic features 𝒀 to state outputs 𝑽 with 𝑃(𝑽|𝒀); a speech encoder maps speech 𝑿 through FUNet and its inverse to latents 𝒁; forward (𝛼) and backward (𝛽) variables give the state posteriors 𝜸𝑡, 𝜸𝑡+1 and 𝑄(𝒁).]
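The exposure bias the integration aims to reduce can be made concrete with a toy scalar AR model (all coefficients are illustrative, not from the paper): a slightly mis-estimated predictor stays accurate under teacher forcing but compounds its error once it runs free on its own outputs.

```python
# Ground truth: x_t = 0.9 * x_{t-1}; the model learned 0.95 instead.
TRUE_COEF, MODEL_COEF, T = 0.9, 0.95, 20

x = [1.0]
for _ in range(T):
    x.append(TRUE_COEF * x[-1])          # ground-truth sequence

# Teacher forcing: every prediction starts from the *true* previous
# frame, so the small per-step error never accumulates.
tf_err = abs(MODEL_COEF * x[T - 1] - x[T])

# Free running: every prediction starts from the model's *own* previous
# output, so the coefficient error compounds across the T steps.
x_hat = 1.0
for _ in range(T):
    x_hat = MODEL_COEF * x_hat
fr_err = abs(x_hat - x[T])

print(tf_err, fr_err)  # free-running error is far larger
```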
Conventional AR model
• Achieves remarkable performance
• Training: teacher forcing (target input, no autoregressive behavior)
• Inference: free running
⇒ Discrepancy between training and inference

[Figure: decoder unrolled under teacher forcing vs. free running over acoustic features 𝑿.]

AR VAE
• AR modeling on the latent variable space 𝒁:
  𝑃(𝒁|𝑿) ∝ ∏𝑡 𝑃(𝒙𝑡|𝒁) 𝑃(𝒛𝑡|𝒛𝑡−1)
• Training: approximate posterior 𝑄(𝒁) ∝ ∏𝑡 𝑙(𝒛𝑡; 𝑿) 𝑃(𝒛𝑡|𝒛𝑡−1), a product of Gaussian PDFs
• Inference: AR prior
• Consistent autoregressive behavior between training and inference

[Figure: encoder/decoder with the approximate posterior at training and the AR prior at inference.]

Proposed models
• AR-VAE: AR VAE applied to Tacotron 2
• ARHSMM-VAE: integration of HSMM-based structured attention and AR VAE

4. Experiments

Experimental conditions
• Ximera dataset: Japanese, single female speaker, sampled at 48 kHz
• Waveforms were synthesized using a pre-trained WaveGrad [Chen et al.; ’21] vocoder

Baselines
• Analysis by synthesis (AS)
• FastSpeech 2 [Ren et al.; ’20]: non-AR model trained using external duration, pitch, and energy
• Tacotron 2 [Shen et al.; ’18]: AR model with location-sensitive attention
• HSMM-ATTN [Nankaku et al.; ’21]: AR model with HSMM-based attention

Evaluation
• Subjective evaluation metric: mean opinion score (MOS) on a 5-point scale
• Objective evaluation metrics using DTW: mel-cepstral distortion (MCD) and F0 root mean square error (F0RMSE)

Results
• Conventional AR vs. AR VAE: AR VAE > non-AR > conventional AR
• Attention mechanism: HSMM-based attention > location-sensitive attention
• Effective on both large and small datasets

Ablations
• Objective evaluation metrics: duration mean absolute error (DMAE), MCD, and F0RMSE
• Removing the backward path makes F0RMSE worse
• Sharing FUNet parameters is critical
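The training-time posterior above, 𝑄(𝒛𝑡) ∝ 𝑙(𝒛𝑡; 𝑿) 𝑃(𝒛𝑡|𝒛𝑡−1), multiplies two Gaussian PDFs, which (up to normalization) is again Gaussian. A minimal scalar sketch of that closed form (the example numbers are illustrative):

```python
def gaussian_product(m1, v1, m2, v2):
    """Product of two Gaussian PDFs N(m1, v1) * N(m2, v2) is, up to a
    normalizing constant, a Gaussian whose precision is the sum of the
    two precisions and whose mean is the precision-weighted average."""
    p1, p2 = 1.0 / v1, 1.0 / v2
    v = 1.0 / (p1 + p2)
    m = v * (p1 * m1 + p2 * m2)
    return m, v

# e.g. combining a likelihood term l(z_t; X) = N(2.0, 1.0) with an AR
# prior P(z_t | z_{t-1}) = N(0.0, 1.0):
m, v = gaussian_product(2.0, 1.0, 0.0, 1.0)
print(m, v)  # 1.0 0.5
```

The combined variance is always smaller than either factor's, so the posterior is sharper than both the likelihood and the AR prior.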