GRAD-STYLESPEECH: ANY-SPEAKER ADAPTIVE TEXT-TO-SPEECH SYNTHESIS WITH DIFFUSION MODELS
AITRICS1 , KAIST2
zzxc1133@aitrics.com, {alsehdcks95, sjhwang82}@kaist.ac.kr
*: Equal Contribution

ABSTRACT

There has been significant progress in Text-To-Speech (TTS) synthesis technology in recent years, thanks to advances in neural generative modeling. However, existing methods for any-speaker adaptive TTS have achieved unsatisfactory performance due to their suboptimal accuracy in mimicking the target speakers' styles. In this work, we present Grad-StyleSpeech, an any-speaker adaptive TTS framework based on a diffusion model that can generate highly natural speech with extremely high similarity to the target speaker's voice, given a few seconds of reference speech. Grad-StyleSpeech significantly outperforms recent speaker-adaptive TTS baselines on English benchmarks. Audio samples are available at https://nardien.github.io/grad-stylespeech-demo.

Index Terms— speech synthesis, text-to-speech, any-speaker TTS, voice cloning

1. INTRODUCTION

Recently, deep neural network-based Text-To-Speech (TTS) synthesis models have shown remarkable performance in both quality and speed, thanks to progress in generative modeling [1, 2], non-autoregressive acoustic models [3], and powerful neural vocoders [4]. Remarkably, diffusion models [5, 6] have recently been shown to synthesize high-quality images in image generation tasks and high-quality speech in TTS synthesis tasks [7, 8, 9]. Beyond TTS synthesis for a single speaker, recent works [10, 11] have shown decent quality in synthesizing the speech of multiple speakers. Furthermore, a variety of works [12, 13, 14, 15, 16, 17] focus on any-speaker adaptive TTS, where the system can synthesize the speech of any speaker given that speaker's reference speech. Due to its many possible real-world applications, research on any-speaker adaptive TTS (sometimes termed Voice Cloning) has grown and received much attention.

Most works on any-speaker adaptive TTS aim to synthesize speech that is highly natural and similar to the target speaker's voice, given a few speech samples from the target speaker. Some previous works [13, 16] need a few transcribed (supervised) samples for fine-tuning the TTS model. They have a clear drawback: they require supervised samples from the target speaker and heavy computational costs to update the model parameters. On the other hand, recent works [12, 14, 15] have focused on a zero-shot approach, in which transcribed samples and an additional fine-tuning stage are not required to adapt to the unseen speaker, thanks to a neural encoder that encodes any speech into a latent vector. However, due to the limited capacity of their generative modeling [18], they frequently show lower similarity on unseen speakers and are vulnerable when generating speech in a unique style, such as emotional speech.

In this work, we propose a zero-shot any-speaker adaptive TTS model, Grad-StyleSpeech, that generates highly natural and similar speech given only a few seconds of reference speech from any target speaker, using a score-based diffusion model [6]. In contrast to previous diffusion-based approaches [7, 8, 9], we construct the model using a style-based generative model [15] to take the target speaker's style into account. Specifically, we propose a hierarchical transformer encoder that generates a representative prior noise distribution from which the reverse diffusion process can generate speech more similar to the target speaker, by reflecting the target speaker's style while embedding the input phonemes. In experiments, we empirically show that our method outperforms recent baselines.

Our main contributions can be summarized as follows:

• We propose a TTS model, Grad-StyleSpeech, which synthesizes production-quality speech in any speaker's voice, even with a zero-shot approach.

• We introduce a hierarchical transformer encoder to build a representative noise prior distribution for any-speaker adaptive settings with score-based diffusion models.

• Our model outperforms recent any-speaker adaptive TTS baselines on both the LibriTTS and VCTK datasets.

2. METHOD

The speaker-adaptive Text-To-Speech (TTS) task aims to generate speech given a text transcription and reference speech of the target speaker. In this work, we focus on synthesizing mel-spectrograms (an audio feature) instead of the raw waveform, as in previous works [15, 16]. Formally, given the text x = [x_1, ..., x_n] consisting of phonemes and the reference speech Y = [y_1, ..., y_m] ∈ R^{m×80}, the objective is to generate the ground-truth speech Ỹ. In the training stage, Ỹ is identical to the reference speech Y, while this is not the case in the inference stage.
[Fig. 1: Framework overview. The blue box indicates the hierarchical transformer encoder, which outputs µ. The figure shows the text encoder and aligner operating on the text (phonemes, e.g., "Hello, World!"), the mel-style encoder producing the style vector from the reference speech (mel-spectrogram), the style-adaptive encoder producing the latent µ, and the score-based diffusion model with forward and reverse SDEs between Y_0 and Y_T.]

Our model consists of three parts: a mel-style encoder that embeds the reference speech into the style vector [15], a hierarchical transformer encoder that generates the intermediate representations conditioned on the text and the style vector, and a diffusion model that maps these representations into mel-spectrograms through denoising steps [8]. We illustrate our overall framework in Figure 1.
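To make this data flow concrete, the following minimal sketch wires the three components together; the placeholder modules, tensor shapes, and names (mel_style_encoder, hierarchical_encoder, reverse_diffusion) are hypothetical stand-ins for illustration, not the paper's implementation.

```python
import torch

# Hypothetical placeholder modules standing in for the three components above.
# Shapes follow the setup: phoneme sequence x of length n, mel-spectrogram Y in R^{m x 80}.
mel_style_encoder = lambda ref_mel: ref_mel.mean(dim=1)            # -> style vector s (here d = 80 for brevity)
hierarchical_encoder = lambda phonemes, s: torch.randn(1, 120, 80)  # -> prior mean mu, frame-aligned to the target mel
reverse_diffusion = lambda mu, s: mu + 0.01 * torch.randn_like(mu)  # -> denoised mel-spectrogram Y_0

ref_mel = torch.randn(1, 250, 80)          # a few seconds of reference speech (mel-spectrogram)
phonemes = torch.randint(0, 70, (1, 40))   # input text as phoneme IDs

s = mel_style_encoder(ref_mel)             # style vector from the reference speech
mu = hierarchical_encoder(phonemes, s)     # text- and style-conditioned representation
mel = reverse_diffusion(mu, s)             # sample Y_0 by denoising from N(mu, I)
```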
2.1. Mel-Style Encoder

As a core component for zero-shot any-speaker adaptive TTS, we use the mel-style encoder [15] to embed the reference speech into a latent style vector. Formally, s = h_ψ(Y), where s ∈ R^d is the style vector and h_ψ is the mel-style encoder parameterized by ψ. Specifically, the mel-style encoder consists of a spectral and a temporal processor [12], a transformer layer with multi-head self-attention [1], and temporal average pooling at the end.
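To make this structure concrete, here is a rough PyTorch sketch of such an encoder; the layer sizes, activation choices, and the class name MelStyleEncoderSketch are assumptions for illustration rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

class MelStyleEncoderSketch(nn.Module):
    """Minimal sketch of a mel-style encoder h_psi: mel (B, m, 80) -> style s (B, d)."""
    def __init__(self, n_mels: int = 80, d: int = 128, n_heads: int = 2):
        super().__init__()
        # Spectral processor: per-frame transformation over the mel bins.
        self.spectral = nn.Sequential(nn.Linear(n_mels, d), nn.Mish(), nn.Linear(d, d))
        # Temporal processor: 1D convolutions over the time axis.
        self.temporal = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=5, padding=2), nn.Mish(),
            nn.Conv1d(d, d, kernel_size=5, padding=2), nn.Mish(),
        )
        # Transformer layer with multi-head self-attention.
        self.attn = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        h = self.spectral(mel)                                  # (B, m, d)
        h = self.temporal(h.transpose(1, 2)).transpose(1, 2)    # (B, m, d)
        h = self.attn(h)                                        # (B, m, d)
        return h.mean(dim=1)                                    # temporal average pooling -> (B, d)

style = MelStyleEncoderSketch()(torch.randn(2, 250, 80))        # style vectors of shape (2, 128)
```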
2.2. Score-based Diffusion Models

Score-based diffusion models [6] generate samples by reversing a gradual noising process, formulated in continuous time as the limit of an infinite number of Markov chains [6]. In this subsection, we briefly recap the essential parts of the score-based diffusion model to introduce our method in detail.

2.2.1. Forward Diffusion Process

The forward diffusion process progressively adds noise drawn from the noise distribution N(0, I) to samples drawn from the sample distribution Y_0 ∼ p_0. This forward diffusion process can be modeled with the forward SDE as follows:

dY_t = -\frac{1}{2}\beta(t)\, Y_t\, dt + \sqrt{\beta(t)}\, dW_t,

where t ∈ [0, T] is the continuous time step, β(t) is the noise scheduling function, and W_t is the standard Wiener process [6]. Instead, Grad-TTS [8] proposes to gradually denoise noisy samples drawn from the data-driven prior noise distribution N(µ, I), where µ is the text- and style-conditioned representation from the neural network:

dY_t = -\frac{1}{2}\beta(t)\,(Y_t - \mu)\, dt + \sqrt{\beta(t)}\, dW_t. \quad (1)

It is tractable to compute the transition kernel p_{0t}(Y_t | Y_0), which is also a Gaussian distribution [6], as follows:

p_{0t}(Y_t \mid Y_0) = \mathcal{N}(Y_t;\, \gamma_t,\, \sigma_t^2), \qquad \sigma_t^2 = I - e^{-\int_0^t \beta(s)\, ds}, \quad (2)

\gamma_t = \big(I - e^{-\frac{1}{2}\int_0^t \beta(s)\, ds}\big)\,\mu + e^{-\frac{1}{2}\int_0^t \beta(s)\, ds}\, Y_0.
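In practice, Eq. (2) allows drawing a noisy sample Y_t directly from Y_0 and µ without simulating the forward SDE, which is how score networks are typically trained. Below is a minimal sketch of this sampling step, assuming T = 1 and a linear schedule β(t) = β_0 + (β_1 − β_0)t; both are common choices but are assumptions here, not necessarily the paper's settings.

```python
import torch

def beta_integral(t, beta0: float = 0.05, beta1: float = 20.0):
    """Closed-form of the integral of beta(s) ds from 0 to t for an assumed linear schedule."""
    return beta0 * t + 0.5 * (beta1 - beta0) * t ** 2

def sample_yt(y0: torch.Tensor, mu: torch.Tensor, t: torch.Tensor):
    """Draw Y_t ~ p_0t(Y_t | Y_0) = N(gamma_t, sigma_t^2 I) as in Eq. (2)."""
    cum = beta_integral(t).view(-1, 1, 1)                                   # integral of beta, broadcastable
    mean = (1.0 - torch.exp(-0.5 * cum)) * mu + torch.exp(-0.5 * cum) * y0  # gamma_t
    var = 1.0 - torch.exp(-cum)                                             # sigma_t^2
    noise = torch.randn_like(y0)
    return mean + var.sqrt() * noise, noise

y0 = torch.randn(4, 120, 80)      # ground-truth mel-spectrograms
mu = torch.randn(4, 120, 80)      # text- and style-conditioned prior mean
t = torch.rand(4)                 # continuous time steps in (0, 1)
yt, eps = sample_yt(y0, mu, t)    # noisy mels used when fitting the score network
```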
2.2.2. Reverse Diffusion Process

On the other hand, the reverse diffusion process gradually inverts the noise from p_T into the data samples from p_0. Following the results from [19] and [6], the reverse diffusion process of Equation 1 is given as the reverse-time SDE as follows [8]:

dY_t = \Big[-\frac{1}{2}\beta(t)\,(Y_t - \mu) - \beta(t)\,\nabla_{Y_t}\log p_t(Y_t)\Big]\, dt + \sqrt{\beta(t)}\, d\tilde{W}_t,

where W̃_t is a reverse Wiener process and ∇_{Y_t} log p_t(Y_t) is the score function of the data distribution p_t(Y_t). We can apply any numerical SDE solver to solve the reverse SDE and generate samples Y_0 from the noise Y_T [6]. Since it is intractable to obtain the exact score during the reverse diffusion process, we estimate the score using the neural network ε_θ(Y_t, t, µ, s).
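As an illustration of how such a solver could be used with the estimated score, below is a minimal Euler-Maruyama-style sampler for the reverse-time SDE, starting from Y_T ~ N(µ, I). The score network is an untrained stand-in with an assumed signature, and the linear β(t) schedule and step count are illustrative assumptions rather than the paper's settings.

```python
import torch

def beta(t, beta0: float = 0.05, beta1: float = 20.0):
    """Assumed linear noise scheduling function beta(t)."""
    return beta0 + (beta1 - beta0) * t

@torch.no_grad()
def reverse_sample(score_net, mu, s, n_steps: int = 50):
    """Integrate the reverse-time SDE from t = 1 down to t = 0, starting at Y_T ~ N(mu, I)."""
    y = mu + torch.randn_like(mu)                       # Y_T drawn from the prior N(mu, I)
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((y.shape[0],), 1.0 - i * h)
        b = beta(t).view(-1, 1, 1)
        score = score_net(y, t, mu, s)                  # estimate of the score of p_t(Y_t)
        # One backward Euler-Maruyama step of the reverse-time SDE.
        y = y + h * b * (0.5 * (y - mu) + score) + (b * h).sqrt() * torch.randn_like(y)
    return y

# Untrained stand-in for the score network with an assumed signature (Y_t, t, mu, s).
score_net = lambda y, t, mu, s: (mu - y)
mu = torch.randn(1, 120, 80)                            # output of the hierarchical encoder
s = torch.randn(1, 128)                                 # style vector from the mel-style encoder
mel = reverse_sample(score_net, mu, s)                  # generated mel-spectrogram Y_0
```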
Table 1: Objective evaluation for zero-shot adaptation. Experimental results on LibriTTS and VCTK where the models are not fine-tuned. We report the training dataset for a fair comparison, and † indicates the model used VCTK in training.

                             CER (%)
Grad-TTS (any-speaker)       2.47 ± 0.20    2.03 ± 0.20    2.05 ± 0.18    1.69 ± 0.13
Grad-StyleSpeech (clean)     4.18 ± 0.18    3.83 ± 0.25    4.13 ± 0.14    3.95 ± 0.16
[12] Sercan Ömer Arik, Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou, "Neural voice cloning with a few samples," in NeurIPS 2018, 2018, pp. 10040–10050.

[24] Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li, "Emotional voice conversion: Theory, databases and ESD," Speech Commun., vol. 137, pp. 1–18, 2022.