Professional Documents
Culture Documents
Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro
NVIDIA
tem which can generate speech with the proper accent while multi-language training. However, these solutions don’t sup-
retaining the characteristics of an individual voice. This is port code-switching situations where words from multiple
challenging to do because it is expensive to obtain bilingual languages appear in mixed order in the synthesis prompt.
training data in multiple languages, and the lack of such data Recently, there has been an interest in factorizing out fine-
results in strong correlations that entangle speaker, language, grained speech attributes[11, 12, 13] like F0 and energy. We
and accent, resulting in poor transfer capabilities. To overcome extend this fine-grained control by additionally factorizing out
this, we present a multilingual, multiaccented, multispeaker accent and speaker with an ability to predict frame-level F0
speech synthesis model1 based on RADTTS with explicit con- and energy for a desired combination of accent, speaker and
trol over accent, language, speaker and fine-grained F0 and language. We analyze the effects of such explicit conditioning
energy features. Our proposed model does not rely on bilingual on fine-grained speech features on the synthesized speech
training data. We demonstrate an ability to control synthesized when transferring a voice to other languages.
accent for any speaker in an open-source dataset comprising of Our goal is to synthesize speech for a target speaker in
7 accents. Human subjective evaluation demonstrates that our any language with a specified accent. Related methods in-
model can better retain a speaker’s voice and accent quality clude YourTTS[14], with a focus on zero-shot multilingual
than controlled baselines while synthesizing fluent speech in voice conversion. Although promising results are presented
all target languages and accents in our dataset. for a few language combinations, it shows limited success on
transferring from languages with limited speakers. Moreover,
1. INTRODUCTION it uses a curriculum learning approach to extend the model
Recent progress in Text-To-Speech (TTS) has achieved human- to new languages, making the training process cumbersome.
like quality in synthesized mel-spectrograms [1, 2, 3, 4] and Closest to our work is [9], which describes a multilingual and
waveforms[5, 6]. Most models support speaker selection dur- multispeaker TTS model without requiring individual speakers
ing inference by learning a speaker embedding table[1, 2, 3] with multiple language samples.
during training, while some support zero-shot speaker synthe- In this work, we (1) demonstrate effective scaling of single
sis by generating a speaker conditioning vector from a short language TTS to multiple languages using a shared alphabet
audio sample[7]. However, most models support only a single set and alignment learning framework[4, 15]; (2) introduce
language. This work focuses on factorizing out speaker and explicit accent conditioning to control the synthesized accent;
accent as controllable attributes, in order to synthesize speech (3) propose and analyze several strategies to disentangle at-
for any desired combination of speaker, language and accent tributes (speaker, accent, language and text) without relying on
present in the training dataset. parallel training data (multilingual speakers); and (4) explore
It is very expensive to obtain bilingual datasets because fine-grained control of speech attributes such as F0 and energy
most speakers are unilingual. Hence, speaker, language, and and its effects on speaker timbre retention and accent quality.
accent attributes are highly correlated in most TTS datasets.
Training models with such entangled data can result in poor 2. METHODOLOGY
language, accent and speaker transferability. Notably, every
language has its own alphabet and most TTS systems use We build upon RADTTS[4, 13] as deterministic decoders tend
different symbol sets for each language, sometimes even sep- to produce oversmoothed mels that require vocoder fine-tuning.
arate encoders[8], severely limiting representational sharing Our model synthesizes mels(X ∈ RCmel ×F ) using encoded
across languages. This aggravates entanglement of speaker, text(Φ ∈ RCtxt ×T ), accent(A ∈ RDaccent ) and speaker(S ∈
language and text, especially in datasets with very few speak- RDspeaker ) as conditioning variables, with optional condition-
ers per language. Approaches like [9] introduce an adver- ing on fundamental frequency(F0 ∈ R1 ×F ) and energy(ξ ∈
sarial loss to curb this dependence of text representations R1 ×F ), where F is the number of mel frames, T is the text
1 Samples can be found at this link
length, and energy is the per-frame mel energy average. We
propose the following novel modifications:
2.1. Shared text token set To promote disentanglement between accent and speaker em-
Our goal is to train a single model with the ability to synthesize beddings, we aim to decorrelate the following variables: (1)
a target language with desired accent for any speaker in the random variables in accent embeddings; (2) random variables
dataset. We represent phonemes with the International Pho- in speaker embeddings; (3) random variables in speaker and
netic Alphabet (IPA) to enforce a shared textual representation. accent embeddings from each other. While truly decorrelat-
A shared alphabet across languages reduces the dependence ing the information is difficult, we can promote something
of text on speaker identity, especially in low-resource settings close by using the constraints from VICReg[16]. We denote
(e.g. 1 speaker per language) and supports code-switching. E A ∈ RDa ×Na , E S ∈ RDs ×Ns as the accent and speaker
embedding tables respectively. Column vector ej ∈ E denotes
2.2. Scalable Alignment Learning
the j’th embedding in either table. Let µE and Cov(E) be the
We utilize the alignment learning framework in[4, 15], to learn
means and covariance matrices. By using VICReg, we con-
speech-text alignments Λ ∈ RT ×F without external dependen-
strain standard deviations to be at least γ and suppress the off-
cies. A shared alphabet set simplifies this since alignments
diagonal elements of the covariance matrix (γ = 1, = 1e−4):
are learnt on a single token set instead of distinct sets. How-
1 X
q
ever, when the speech has a strong accent, the same token Lvar = max 0, γ − Cov(E)i,j + (2)
can be spoken in different ways from speakers with different D i=j
accents and hence alignments can become brittle. To curb this X
Lcovar = Cov(E)2i,j (3)
multi-modality, we learn alignments between (text, accent) and
i6=j
mel-spectograms using accent A as a conditioning variable.
Next, we attempt to decorrelate accent and speaker vari-
2.3. Disentangling Factors ables from each other by minimizing the cross-correlation
We focus on non-parallel data with a speaker speaking 1 lan- matrix from batch statistics. Let E˜A and Ẽ S be the sampled
guage which typically has text Φ, accent A and speaker S en- column matrices of accent and speaker embedding vectors
tangled. We evaluate strategies to disentangle these attributes: sampled within a batch of size B. We compute the batch cross
Speaker-adversarial loss In TTS datasets, speakers typically -correlation matrix RAS as follows (µE A and µE S computed
read different text and have different prosody. Hence, there from embedding table):
can be entanglement between speaker S, text Φ and prosody. 1
Following[9], we employ domain adversarial training to dis- RAS = (Ẽ A − µE A )(Ẽ S − µE S )T (4)
B−1
entangle S and Φ by using a gradient reversal layer. We use a 1 X AS 2
speaker classification loss, and backpropagate classifier’s nega- Lxcorr = (Ri,j ) (5)
Da Ds i,j
tive gradients through the text encoder and token embeddings.
N
X 2.4. Accent conditioned speech synthesis
Ladv = P (si |φi ; θspkclassif ier ) (1)
i=1
We introduce an extra conditioning variable for accent A to
Data Augmentation Disentangling accent and speaker is chal- RADTTS [4] to allow for accent-controllable speech synthesis.
lenging, as a speaker typically has a specific way of pronounc- We call this model RADTTS-ML, a multilingual version of
ing words and phonemes, causing a strong association between RADTTS. The following equation describes the model:
speaker and accent. Straightforward approaches to learning Pradtts (X, Λ) = Pmel (X|Φ, Λ, A, S)Pdur (Λ|Φ, A, S) (6)
from non-parallel data learn entangled representations because
We refer to our conditioning as accent instead of language,
a speaker’s language and accent can be trivially learned from
because we consider language to be implicit in the phoneme
the dataset. Since our goal is to synthesize speech for a speaker
sequence. The information captured by the accent embed-
in a target language with desired accent, disentangling speaker
ding should explain the fine-grained differences between how
S and accent A is essential, otherwise either speaker identity
phonemes are pronounced in different languages.
is not preserved in the target language or the generated speech
retains the speaker’s accent from the source language. To over- 2.5. Fine-grained frame-level control of speech attributes
come this problem, we use data augmentations like formant, Fine-grained control of speech attributes like F0 and energy
F0 , and duration scaling to promote disentanglement between E can provide high-quality controllable speech synthesis[13].
speaker and accent. For a given speech sample xi with speaker We believe conditioning on such attributes can help improve
identity si and accent ai , we apply a fixed transformation accent and language transfer. During training, we condition
t ∈ {1, 2, ...τ } to construct a transformed speech sample xti our mel decoder on ground truth frame-level F0 and energy.
and assign speaker identity as si + t · Nspeakers and accent as Following [13], we train deterministic attribute predictors to
original accent ai , where τ is the number of augmentations. predict phoneme durations Λ, F0 , and energy E conditioned
This creates speech samples with variations in speaker identity on speaker S, encoded text Φ, and accent A. We standardize
and accent in order to orthogonalize these attributes. F0 using the speaker’s F0 mean and standard deviation to
Embedding Regularization Ideally, the information captured remove speaker-dependent information. This allows us to
by the speaker and accent embeddings should be uncorrelated. predict speech attributes for any speaker, accent, and language
and control mel synthesis with such features. We refer to this or slower(×[1.0 − 1.1]). We augment the dataset with trans-
model as RADMMM, which is described as: formed audio defining a new speaker identifier, but retaining
the original accent. In RT, this leads to a significant boost in
Pradmmm (X, Λ) = Pmel (X|Φ, Λ, A, S, F0 , E) speaker retention. We believe that creating more speakers per
PF0 (F0 |Φ, A, S)PE (E|Φ, A, S)Pdur (Λ|Φ, A, S) (7) accent enhances disentanglement of accent and speaker. How-
3. EXPERIMENTS ever, in RM, where F0 is predicted and the model is explicitly
conditioned on augmented F0 , we observe a significant drop
We conduct our experiments on an open source dataset2 with in both speaker retention as well as CER with augmentations,
a sampling rate of 16kHz. It contains 7 different languages likely due to conditioning on noisy augmented features.
(American English, Spanish, German, French, Hindi, Brazilian Embedding Regularization We conduct three ablations with
Portuguese, and South American Spanish). This dataset em- regularization: one that adds variance (Lvar ) and covariance
ulates low-resource scenarios with only 1 speaker per accent (Lcovar ) constraints to the baseline, and two more involving
with strong correlation between speaker, accent, and language. all three constraints (Lvar , Lcovar , Lxcorr ) with small (0.1)
We use HiFiGAN vocoders trained individually on selected and large weights (10.0). We observe an improvement in
speakers in the evaluation set. We focus on the task of trans- speaker similarity with the best speaker retention with all three
ferring the voice of 7 speakers in the dataset to the 6 other constraints in both RT and RM. Moreover, we observe similar
language and accent settings in the dataset. Herein we refer to CER to the baselines suggesting similar pronunciation quality.
RADTTS-ML as RT and RADMMM as RM for brevity. Our final models include regularization constraints, but we
don’t use augmentation and Ladv due to worse pronunciation
3.1. Ablation of Disentanglement Strategies quality and limited success on speaker timbre retention.
We evaluate the effects of disentanglement strategies on the
transfer task by measuring speaker timbre retention using 3.2. Comparing proposed models with existing methods
the cosine similarity (Cosine Sim) of synthesized samples We compare our final RT and RM with the Tacotron 2-based
to source speaker’s reference speaker embeddings obtained model described in [9], call it T2, on the transfer task. We
from the speaker recognition model Titanet[17]. We mea- reproduced the model to the best of our ability, noting that
sure character error rate (CER) with transcripts obtained from training on our data was unstable, possibly due to data qual-
Conformer[18] models trained for each language. Table 1 and ity, and that results may not be representative of the original
Figure 1 demonstrate overall and accent grouped effects of implementation. We tune denoising parameters[20] to reduce
various disentanglement strategies. The RT baseline uses the audio artifacts from T2 over-smooth generated mels[3, 21].
shared text token set, accent-conditioned alignment learning, We attempted to implement YourTTS[14] but ran into issues
and no additional constraints to disentangle speaker, text, and reproducing the results on our dataset and hence we don’t
accent. The RM baseline uses this setup with F0 and energy make a direct comparison to it.
conditioning. Speaker Adversarial Loss (Ladv ) We observe Speaker timbre retention Table 2 shows the speaker cosine
Table 1: Ablation results comparing disentanglement strate- similarity of our proposed models and T2. We observe that
gies using Cosine Sim and CER defined in 3.1 both RT and RM perform similarly in terms of speaker re-
RADTTS-ML RADMMM tention and achieve better speaker timbre retention than T2.
Disentanglement Strategy Cosine Sim CER Cosine Sim CER
Baseline (B) 0.3062 ± 0.0176 17.7 0.3438 ± 0.0138 5.1
However, our subjective human evaluation below shows that
(B) + normalized F0 pred N/A N/A 0.3946 ± 0.0143 5.3 RM samples are overall better than RT in both timbre preser-
(B) + Ladv 0.3027 ± 0.0174 39.9 N/A N/A
(B) + augmentation 0.3855 ± 0.0145 44.3 0.2174 ± 0.0131 41.7 vation and pronunciation.
(B) + Lvar and Lcovar 0.4029 ± 0.0144 13.7 N/A N/A
(B) + low weight on Lvar , Lcovar and Lxcorr 0.3784 ± 0.0112 17.0 0.3232 ± 0.0154 9.6 Table 2: Speaker timbre retention using Cosine Sim (Sec 3.1)
(B) + Lvar , Lcovar and Lxcorr 0.4217 ± 0.0156 12.2 0.4188 ± 0.0148 5.5
(B) + Lvar , Lcovar , Lxcorr and Ladv N/A N/A 0.4163 ± 0.0157 7.2 Model Cosine Similarity
that the addition of Ladv loss to RT and RM does not affect RADTTS-ML (RT) 0.4186 ± 0.0154
speaker retention when synthesizing the speaker for a target RADMMM (RM) 0.4197 ± 0.0149
language. However, we observe a drop in character error rate. Tacotron2 (T2) 0.145 ± 0.0119
We believe the gradients from the speaker classifier tend to
3.3. Subjective human evaluation
remove speaker and accent information from encoded text Φ,
which affects the encoded text representation leading to worse We conducted an internal study with native speakers to evaluate
pronunciation. accent quality and speaker timbre retention. Raters were pre-
Data Augmentation We use Pratt[19] to apply six augmenta- screened with a hearing test based on sinusoid counting. Since
tions: formant scaling down (×[0.875 − 1.0]) and up (×[1.0 − MOS is not suited for finer differences, we use comparative
1.25]), scaling F0 down (×[0.9 − 1.0]) and up (×[1.0 − 1.1]), mean opinion scores (CMOS) with a 5 point scale (-2 to 2) as
and scaling durations to make samples faster(×[0.9 − 1.0])) the evaluation metric. Given a reference sample and pairs of
synthesized samples from different models, the raters use the
2 Dataset source, metadata and filelists will be released with source code. 5 point scale to indicate which sample, if any, they believe is
RADTTS baseline RADTTS augmentation RADMMM L_var, L_covar, L_xcorr RADMMM with normalized log F0
RADTTS low weight L_var, L_covar, L_xcorr RADTTS L_var, L_covar RADMMM L_var, L_covar, L_xcorr, L_adv RADMMM augmentation L_var, L_covar
RADTTS L_adv RADTTS L_var, L_covar, L_xcorr RADMMM baseline RADMMM low weight L_var, L_covar, L_xcorr
0.6
0.5
cosine similarity
cosine similarity
0.4 0.4
0.3
0.2 0.2
0.1
0.0 Brazilian American German Spanish South American French Hindi 0.0 German Hindi French American Spanish South AmericanBrazilian
Portuguese English Spanish English Spanish Portuguese
102 102
log (CER)
log (CER)
101
101
Brazilian American German Spanish South American French Hindi German Hindi French American Spanish South American Brazilian
Portuguese English Spanish English Spanish Portuguese
(a) RADTTS-ML (RT) (b) RADMMM (RM)
Fig. 1: Comparing speaker cosine similarity and CER of considered disentanglement strategies for every accent.
American German French Hindi Brazilian doesn’t change speaker timbre in RM, thus showcasing the
English Portuguese
disentangled nature of accent and speaker. Finally, RM final
pairwise preference rating