
ICASSP 2022, Paper #8917

Autoregressive variational autoencoder with a hidden semi-Markov model-based structured attention for speech synthesis

Takato Fujimoto, Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda (Nagoya Institute of Technology, Japan)

1. Overview

Autoregressive (AR) acoustic models with an attention mechanism
• Predict acoustic features recursively from linguistic features
• Estimate the alignment between acoustic and linguistic features
• Achieve remarkable performance
  • e.g., Tacotron 2 [Shen et al.; ’18] and Transformer TTS [Li et al.; ’19]

Problems due to lack of robustness
• Training instability caused by excessive degrees of freedom in the alignment
• Quality degradation caused by exposure bias, i.e., the discrepancy between training and inference

Hidden semi-Markov model (HSMM)-based structured attention [Nankaku et al.; ’21]
• Latent alignment based on the VAE framework
• Acoustic modeling with explicit duration models
  • Appropriate handling of duration in a statistical manner
  • Constraints that yield monotonic alignment

AR VAE
• AR generation of latent variables instead of observations
• Consistent AR structure between training and inference
  • Reduce exposure bias

Integration of HSMM-based structured attention and AR VAE
• Speech synthesis integrating robust HMMs and flexible DNNs
• VAE-based acoustic model containing an autoregressive HSMM (ARHSMM)
• Improve speech quality and robustness of model training

2. Autoregressive variational autoencoder

Conventional AR model
• Achieve remarkable performance
• Training: teacher forcing
  • Target frames are fed back as input, so there is no autoregressive behavior during training
• Inference: free running
  • The model's own predictions are fed back as input
⇒ Discrepancy between training and inference (exposure bias); see the sketch below

[Figure: conventional AR decoder under teacher forcing (training) and free running (inference)]
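
As a minimal illustration of the two decoding modes, the sketch below contrasts teacher forcing and free running in a generic AR decoder loop. The decoder interface (a per-step function and a feature dimension), the per-frame conditioning, and the zero initial frame are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def ar_decode(decoder, context, targets=None, num_frames=None):
        # context: per-frame conditioning (e.g., attended linguistic features).
        # Teacher forcing: ground-truth targets are given and fed back.
        # Free running: targets is None and the model's own outputs are fed back.
        steps = len(targets) if targets is not None else num_frames
        outputs, prev = [], np.zeros(decoder.feat_dim)       # assumed zero initial frame
        for t in range(steps):
            pred = decoder.step(prev, context[t])            # predict frame t from the previous frame
            outputs.append(pred)
            prev = targets[t] if targets is not None else pred   # the only difference between the two modes
        return np.stack(outputs)

At inference time only free running is available, so the mismatch between the two feedback paths is the exposure bias discussed above.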

AR VAE
• AR modeling on the latent variable space instead of the observation space
• Training: latent variables are drawn from the approximate posterior
• Inference: latent variables are drawn from the AR prior
• Consistent autoregressive behavior between training and inference

[Figure: AR VAE; during training the decoder reconstructs X from latents drawn from the approximate posterior (encoder), while during inference the latents are generated by the AR prior]

  Posterior (decoder term × AR prior):             P(Z | X) ∝ ∏_t P(x_t | Z) P(z_t | z_{t-1})
  Approximate posterior (encoder term × AR prior): Q(Z) ∝ ∏_t l(z_t; X) P(z_t | z_{t-1})

• At each frame, the approximate posterior is a product of Gaussian PDFs (the encoder term and the AR prior), as worked out below
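
Since both factors in Q(Z) are Gaussian, their product at each frame is again Gaussian up to normalization. As a hedged illustration, assuming scalar (or diagonal, applied element-wise) Gaussians with illustrative parameters μ_e, σ_e² for the encoder term and μ_p, σ_p² for the AR prior (the authors' parameterization may differ):

\[
l(z_t;\, X) = \mathcal{N}(z_t;\, \mu_e, \sigma_e^2), \qquad
P(z_t \mid z_{t-1}) = \mathcal{N}(z_t;\, \mu_p, \sigma_p^2)
\]
\[
Q(z_t \mid z_{t-1}) \propto \mathcal{N}(z_t;\, \mu_e, \sigma_e^2)\, \mathcal{N}(z_t;\, \mu_p, \sigma_p^2)
\;\propto\; \mathcal{N}\!\left(z_t;\; \frac{\sigma_p^2 \mu_e + \sigma_e^2 \mu_p}{\sigma_e^2 + \sigma_p^2},\; \frac{\sigma_e^2 \sigma_p^2}{\sigma_e^2 + \sigma_p^2}\right)
\]

The per-frame posterior mean is therefore a precision-weighted average of the encoder estimate and the AR prediction.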

3. Proposed model (ARHSMM-VAE)

Inference (see the sketch below)
1. Encode the linguistic features Y into latent linguistic features V with the text encoder
2. Upsample the latent linguistic features based on the state durations given by the duration predictor
   • State sequence S = (1, 1, 2, 2, 2, 3, ...), i.e., state durations d_1 = 2, d_2 = 3, ...
   • Upsampled sequence v_1, v_1, v_2, v_2, v_2, v_3, ...
3. Predict the latent acoustic features Z in an AR manner
4. Decode the latent acoustic features into acoustic features X with the speech decoder

[Figure: ARHSMM-VAE architecture for inference and training; the latent linguistic features V are aligned to frames on an HSMM trellis (states × frames) and decoded into acoustic features]
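
A minimal sketch of the four inference steps under simplifying assumptions: the text encoder, duration predictor, FUNet, and speech decoder are treated as generic callables, the predicted durations are integers, and the latent linguistic and acoustic features are assumed to share a dimensionality. This illustrates the data flow only, not the authors' code.

    import numpy as np

    def arhsmm_vae_infer(text_encoder, duration_predictor, funet, speech_decoder, Y):
        V = text_encoder(Y)                        # 1. latent linguistic features, one vector per state
        d = duration_predictor(V)                  #    integer state durations d_1, d_2, ...
        V_up = np.repeat(V, d, axis=0)             # 2. upsample: v_1 repeated d_1 times, v_2 repeated d_2 times, ...
        z_prev = np.zeros(V.shape[1])              #    assumed zero initial latent frame
        Z = []
        for v in V_up:                             # 3. autoregressive prediction of latent acoustic features
            z_prev = funet(z_prev, v)              #    non-linear prediction from the previous time step
            Z.append(z_prev)
        return speech_decoder(np.stack(Z))         # 4. decode latent acoustic features into acoustic features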

Training
• Maximize the ELBO over the approximate posterior distributions
• Loss (KL divergence): compare the priors with the approximate posteriors of the three latent variables, P(Z | S, V) vs. Q(Z), P(S | V) vs. Q(S), and P(V | Y) vs. Q(V); a sketch of such a bound is given below
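
The loss terms above suggest an ELBO with one reconstruction term and one KL term per latent variable. The following is a hedged sketch of such a bound; the exact conditioning and expectation structure used by the authors may differ:

\[
\mathcal{L} =
\mathbb{E}_{Q(Z)}\big[\log P(X \mid Z)\big]
- \mathbb{E}_{Q(S)\,Q(V)}\big[\mathrm{KL}\big(Q(Z)\,\big\|\,P(Z \mid S, V)\big)\big]
- \mathbb{E}_{Q(V)}\big[\mathrm{KL}\big(Q(S)\,\big\|\,P(S \mid V)\big)\big]
- \mathrm{KL}\big(Q(V)\,\big\|\,P(V \mid Y)\big)
\]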

Approximate posterior distributions Q(S) and Q(Z)
• Estimate the state occupancies given the entire sequence using the forward-backward algorithm on the HSMM trellis (forward probabilities α, backward probabilities β, state posteriors γ_t)
• Approximate posterior distributions are defined relative to frame t and state k
• Simultaneously infer Q(S) and Q(Z) because of their strong dependence
• Generalized forward-backward algorithm: the forward algorithm runs with FUNet, the backward algorithm runs with BUNet, and the output probabilities are calculated using FUNet (see the sketch below)
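
As a rough sketch of how quantities on the HSMM trellis could be accumulated, the code below runs a standard explicit-duration forward pass in the log domain for a left-to-right, no-skip topology. The observation log-probabilities and the duration table are supplied as plain arrays (standing in for the FUNet-based output probabilities and the duration predictor); the backward pass with BUNet and the coupling to Q(Z) in the generalized algorithm are omitted, so this is only a schematic, not the authors' algorithm.

    import numpy as np
    from scipy.special import logsumexp

    def hsmm_forward(log_obs, log_dur, max_dur):
        # log_obs[t, k]: log output probability of frame t in state k.
        # log_dur[k, d-1]: log probability that state k lasts d frames (d = 1..max_dur).
        T, K = log_obs.shape
        alpha = np.full((T, K), -np.inf)      # alpha[t, k] = log prob that state k ends exactly at frame t
        cum = np.concatenate([np.zeros((1, K)), np.cumsum(log_obs, axis=0)])   # prefix sums of log_obs
        for t in range(T):
            for k in range(K):
                scores = []
                for d in range(1, min(max_dur, t + 1) + 1):
                    seg = cum[t + 1, k] - cum[t + 1 - d, k]          # log prob of frames t-d+1..t in state k
                    if k == 0:
                        if t - d + 1 == 0:                            # the first state must start at frame 0
                            scores.append(log_dur[k, d - 1] + seg)
                    elif t - d >= 0:                                  # previous state ends at frame t-d
                        scores.append(alpha[t - d, k - 1] + log_dur[k, d - 1] + seg)
                if scores:
                    alpha[t, k] = logsumexp(scores)
        return alpha   # alpha[T-1, K-1] is the log-likelihood when the last state ends at the last frame

The backward probabilities and the state posteriors γ_t would be computed analogously, which is where the BUNet approximation enters in the proposed model.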

FUNet (see the sketch below)
• Non-linear prediction of the current latent acoustic frame from the previous time step and the latent linguistic feature of state k:
  P(z_t | z̃_{t-1}, ṽ_k) = N(z_t; FUNet(z̃_{t-1}, ṽ_k))
• The inverse predictions P(z̃_t | z_{t-1}, ṽ_k) and P(z̃_t | z̃_{t-1}, v_k) are intractable
  • Approximated by the inverse modules BUNet and DNet
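
A hedged sketch of this AR prediction step as a Gaussian whose mean is produced by a small non-linear network. The two-layer architecture, tanh activation, fixed unit variance, and the assumption that z and v share a dimensionality are illustrative choices, not the authors' FUNet.

    import numpy as np

    class FUNetSketch:
        # Predict z_t from (z_prev, v_k) as the mean of a unit-variance Gaussian.
        def __init__(self, latent_dim, hidden_dim, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(scale=0.1, size=(hidden_dim, 2 * latent_dim))
            self.W2 = rng.normal(scale=0.1, size=(latent_dim, hidden_dim))

        def mean(self, z_prev, v_k):
            h = np.tanh(self.W1 @ np.concatenate([z_prev, v_k]))   # non-linear prediction from the previous step
            return self.W2 @ h

        def log_prob(self, z_t, z_prev, v_k):
            diff = z_t - self.mean(z_prev, v_k)                    # unit-variance Gaussian log-density
            return -0.5 * diff @ diff - 0.5 * diff.size * np.log(2.0 * np.pi)

Because inverting such a non-linear mapping in closed form is not possible, the inverse direction is approximated by separate modules (BUNet and DNet), as stated above.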

4. Experiments

Experimental conditions
• Ximera dataset
  • Japanese, single female speaker, sampled at 48 kHz
• Waveforms were synthesized using a pre-trained WaveGrad [Chen et al.; ’21] vocoder

Baselines
• Analysis by synthesis (AS)
• FastSpeech 2 [Ren et al.; ’20]
  • Non-AR model trained using external duration, pitch, and energy
• Tacotron 2 [Shen et al.; ’18]
  • AR model with location-sensitive attention
• HSMM-ATTN [Nankaku et al.; ’21]
  • AR model with HSMM-based attention

Proposed models
• AR-VAE
  • AR VAE applied to Tacotron 2
• ARHSMM-VAE

Evaluation
• Subjective evaluation metric
  • Mean opinion score (MOS) on a 5-point scale
• Objective evaluation metrics computed over DTW-aligned frames (see the sketch below)
  • Mel-cepstral distortion (MCD)
  • F0 root mean square error (F0RMSE)
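
A hedged sketch of how mel-cepstral distortion is commonly computed over time-aligned frames, using the usual convention of excluding the 0th (energy) coefficient and the 10/ln 10 scaling; the exact analysis order and alignment settings used in this evaluation are not specified here.

    import numpy as np

    def mel_cepstral_distortion(mc_ref, mc_syn):
        # mc_ref, mc_syn: DTW-aligned mel-cepstra of shape (frames, order + 1); returns MCD in dB.
        diff = mc_ref[:, 1:] - mc_syn[:, 1:]                       # drop the 0th coefficient
        per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
        return float(np.mean(per_frame))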

Results
• Conventional AR vs. AR VAE: AR VAE > Non-AR > Conventional AR
• Attention mechanism: HSMM-based attention > location-sensitive attention
  • Effective on both large and small datasets

Ablations
• Objective evaluation metrics: duration mean absolute error (DMAE), mel-cepstral distortion (MCD), and F0 root mean square error (F0RMSE)
• Removing the backward path makes F0RMSE worse
• Sharing FUNet parameters is critical
