
MULTILINGUAL MULTIACCENTED MULTISPEAKER TTS WITH RADTTS

Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro

NVIDIA

ABSTRACT

We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model¹ based on RADTTS with explicit control over accent, language, speaker and fine-grained F0 and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.

1. INTRODUCTION

Recent progress in Text-To-Speech (TTS) has achieved human-like quality in synthesized mel-spectrograms [1, 2, 3, 4] and waveforms [5, 6]. Most models support speaker selection during inference by learning a speaker embedding table [1, 2, 3] during training, while some support zero-shot speaker synthesis by generating a speaker conditioning vector from a short audio sample [7]. However, most models support only a single language. This work focuses on factorizing out speaker and accent as controllable attributes, in order to synthesize speech for any desired combination of speaker, language and accent present in the training dataset.

It is very expensive to obtain bilingual datasets because most speakers are unilingual. Hence, speaker, language, and accent attributes are highly correlated in most TTS datasets. Training models with such entangled data can result in poor language, accent and speaker transferability. Notably, every language has its own alphabet and most TTS systems use different symbol sets for each language, sometimes even separate encoders [8], severely limiting representational sharing across languages. This aggravates entanglement of speaker, language and text, especially in datasets with very few speakers per language. Approaches like [9] introduce an adversarial loss to curb this dependence of text representations on speaker. Other approaches use a union of the linguistic feature sets of all languages [10] to simplify text processing for multi-language training. However, these solutions don't support code-switching situations where words from multiple languages appear in mixed order in the synthesis prompt. Recently, there has been interest in factorizing out fine-grained speech attributes [11, 12, 13] like F0 and energy. We extend this fine-grained control by additionally factorizing out accent and speaker, with an ability to predict frame-level F0 and energy for a desired combination of accent, speaker and language. We analyze the effects of such explicit conditioning on fine-grained speech features on the synthesized speech when transferring a voice to other languages.

Our goal is to synthesize speech for a target speaker in any language with a specified accent. Related methods include YourTTS [14], which focuses on zero-shot multilingual voice conversion. Although it presents promising results for a few language combinations, it shows limited success when transferring from languages with few speakers. Moreover, it uses a curriculum learning approach to extend the model to new languages, making the training process cumbersome. Closest to our work is [9], which describes a multilingual and multispeaker TTS model that does not require individual speakers to have samples in multiple languages.

In this work, we (1) demonstrate effective scaling of single-language TTS to multiple languages using a shared alphabet set and an alignment learning framework [4, 15]; (2) introduce explicit accent conditioning to control the synthesized accent; (3) propose and analyze several strategies to disentangle attributes (speaker, accent, language and text) without relying on parallel training data (multilingual speakers); and (4) explore fine-grained control of speech attributes such as F0 and energy and its effects on speaker timbre retention and accent quality.

¹ Samples can be found at this link

2. METHODOLOGY

We build upon RADTTS [4, 13], as deterministic decoders tend to produce oversmoothed mels that require vocoder fine-tuning. Our model synthesizes mels $X \in \mathbb{R}^{C_{mel} \times F}$ using encoded text $\Phi \in \mathbb{R}^{C_{txt} \times T}$, accent $A \in \mathbb{R}^{D_{accent}}$ and speaker $S \in \mathbb{R}^{D_{speaker}}$ as conditioning variables, with optional conditioning on fundamental frequency $F_0 \in \mathbb{R}^{1 \times F}$ and energy $\xi \in \mathbb{R}^{1 \times F}$, where F is the number of mel frames, T is the text length, and energy is the per-frame mel energy average. We propose the following novel modifications:
2.1. Shared text token set

Our goal is to train a single model with the ability to synthesize a target language with desired accent for any speaker in the dataset. We represent phonemes with the International Phonetic Alphabet (IPA) to enforce a shared textual representation. A shared alphabet across languages reduces the dependence of text on speaker identity, especially in low-resource settings (e.g. 1 speaker per language), and supports code-switching.
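To make the shared representation concrete, here is a minimal sketch (not the exact implementation used in this work) of a single IPA symbol table built over utterances from all languages, so that a code-switched prompt maps into one shared token and embedding space. The phonemization step that produces space-separated IPA strings (e.g. an eSpeak-based phonemizer) is assumed to happen upstream and is not shown.

```python
# Minimal sketch: one shared IPA symbol table for all languages.
# `ipa_corpus` is assumed to contain space-separated IPA strings produced
# upstream; this is illustrative, not the tokenizer used in this work.
from typing import Dict, List

class SharedIPATokenizer:
    def __init__(self) -> None:
        self.symbol_to_id: Dict[str, int] = {"<pad>": 0, "<unk>": 1}

    def fit(self, ipa_corpus: List[str]) -> None:
        # One vocabulary over IPA symbols from *all* languages, instead of a
        # separate symbol set (and encoder) per language.
        for utterance in ipa_corpus:
            for symbol in utterance.split():
                self.symbol_to_id.setdefault(symbol, len(self.symbol_to_id))

    def encode(self, ipa_utterance: str) -> List[int]:
        # A code-switched prompt is just another IPA sequence here.
        return [self.symbol_to_id.get(s, 1) for s in ipa_utterance.split()]

tok = SharedIPATokenizer()
tok.fit(["h ə l oʊ", "b ɔ̃ ʒ u ʁ"])          # English + French IPA examples
print(tok.encode("h ə l oʊ b ɔ̃ ʒ u ʁ"))      # a code-switched prompt
```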
2.2. Scalable Alignment Learning

We utilize the alignment learning framework in [4, 15] to learn speech-text alignments $\Lambda \in \mathbb{R}^{T \times F}$ without external dependencies. A shared alphabet set simplifies this, since alignments are learnt on a single token set instead of distinct sets. However, when the speech has a strong accent, the same token can be spoken in different ways by speakers with different accents, and hence alignments can become brittle. To curb this multi-modality, we learn alignments between (text, accent) and mel-spectrograms using accent A as a conditioning variable.
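The alignment objective itself is the one from [4, 15]; the sketch below only illustrates the accent-conditioning idea, under the assumption that soft alignments come from pairwise affinities between encoded text tokens and mel frames. Tensor names and shapes are illustrative, not the model's actual interface.

```python
import torch
import torch.nn.functional as F

def accent_conditioned_soft_alignment(text_enc, mel_enc, accent_emb):
    """Simplified illustration: text_enc (B, T, C), mel_enc (B, Fr, C),
    accent_emb (B, C). Returns a (B, T, Fr) soft alignment whose columns
    sum to 1 over text tokens."""
    # Add the accent embedding to every text token so differently accented
    # realizations of the same token need not share one alignment mode.
    keys = text_enc + accent_emb.unsqueeze(1)
    # Affinity = negative squared L2 distance for each (token, frame) pair.
    affinity = -torch.cdist(keys, mel_enc) ** 2        # (B, T, Fr)
    return F.softmax(affinity, dim=1)                  # normalize over tokens

attn = accent_conditioned_soft_alignment(
    torch.randn(2, 12, 192), torch.randn(2, 80, 192), torch.randn(2, 192))
print(attn.shape, attn.sum(dim=1)[0, :3])              # (2, 12, 80), ~1.0 per frame
```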
Next, we attempt to decorrelate accent and speaker vari-
2.3. Disentangling Factors ables from each other by minimizing the cross-correlation
We focus on non-parallel data with a speaker speaking 1 lan- matrix from batch statistics. Let E˜A and Ẽ S be the sampled
guage which typically has text Φ, accent A and speaker S en- column matrices of accent and speaker embedding vectors
tangled. We evaluate strategies to disentangle these attributes: sampled within a batch of size B. We compute the batch cross
Speaker-adversarial loss In TTS datasets, speakers typically -correlation matrix RAS as follows (µE A and µE S computed
read different text and have different prosody. Hence, there from embedding table):
can be entanglement between speaker S, text Φ and prosody. 1
Following[9], we employ domain adversarial training to dis- RAS = (Ẽ A − µE A )(Ẽ S − µE S )T (4)
B−1
entangle S and Φ by using a gradient reversal layer. We use a 1 X AS 2
speaker classification loss, and backpropagate classifier’s nega- Lxcorr = (Ri,j ) (5)
Da Ds i,j
tive gradients through the text encoder and token embeddings.
N
X 2.4. Accent conditioned speech synthesis
Ladv = P (si |φi ; θspkclassif ier ) (1)
i=1
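A minimal PyTorch sketch of the gradient reversal mechanism behind this loss: the layer is the identity in the forward pass and negates (and scales) gradients in the backward pass, so the speaker classifier trains normally while the text encoder receives reversed gradients. The pooling and classifier head are illustrative assumptions, not the exact architecture used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity forward; gradients scaled by -lam on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negated gradients flow into the text encoder / token embeddings.
        return -ctx.lam * grad_output, None

class SpeakerAdversary(nn.Module):
    """Predicts the speaker id from encoded text Phi (illustrative head)."""
    def __init__(self, txt_channels: int, n_speakers: int, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Linear(txt_channels, n_speakers)

    def forward(self, text_enc):                          # text_enc: (B, T, C)
        reversed_feats = GradReverse.apply(text_enc, self.lam)
        return self.classifier(reversed_feats.mean(dim=1))  # pool over T

adv = SpeakerAdversary(txt_channels=192, n_speakers=7)
logits = adv(torch.randn(4, 12, 192))
l_adv = F.cross_entropy(logits, torch.randint(0, 7, (4,)))  # speaker classification loss
```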
Data Augmentation: Disentangling accent and speaker is challenging, as a speaker typically has a specific way of pronouncing words and phonemes, causing a strong association between speaker and accent. Straightforward approaches to learning from non-parallel data learn entangled representations because a speaker's language and accent can be trivially learned from the dataset. Since our goal is to synthesize speech for a speaker in a target language with the desired accent, disentangling speaker S and accent A is essential; otherwise, either speaker identity is not preserved in the target language or the generated speech retains the speaker's accent from the source language. To overcome this problem, we use data augmentations like formant, F0, and duration scaling to promote disentanglement between speaker and accent. For a given speech sample $x_i$ with speaker identity $s_i$ and accent $a_i$, we apply a fixed transformation $t \in \{1, 2, \ldots, \tau\}$ to construct a transformed speech sample $x_i^t$ and assign it speaker identity $s_i + t \cdot N_{speakers}$ and the original accent $a_i$, where $\tau$ is the number of augmentations. This creates speech samples with variations in speaker identity and accent in order to orthogonalize these attributes.
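The metadata bookkeeping this implies is simple, as sketched below: every fixed transform t spawns a new speaker id $s_i + t \cdot N_{speakers}$ while the accent label stays untouched. The signal transformations themselves (formant, F0 and duration scaling, applied with an external tool) are assumed to exist and are not shown; the file naming is hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class Utterance:
    audio_path: str
    speaker_id: int    # in [0, n_speakers)
    accent_id: int     # unchanged by augmentation

def augment_metadata(items: List[Utterance], n_speakers: int,
                     n_transforms: int) -> List[Utterance]:
    """Transform t of speaker s is relabeled as speaker s + t * n_speakers,
    while the accent id is kept as the original accent."""
    out = list(items)
    for t in range(1, n_transforms + 1):
        for utt in items:
            out.append(replace(
                utt,
                audio_path=f"{utt.audio_path}.aug{t}",    # hypothetical naming
                speaker_id=utt.speaker_id + t * n_speakers,
            ))
    return out

# 7 original speakers and 6 transforms -> 7 * (1 + 6) = 49 speaker ids, 7 accents.
data = [Utterance(f"spk{s}.wav", speaker_id=s, accent_id=s) for s in range(7)]
print(len(augment_metadata(data, n_speakers=7, n_transforms=6)))
```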
Embedding Regularization: Ideally, the information captured by the speaker and accent embeddings should be uncorrelated. To promote disentanglement between accent and speaker embeddings, we aim to decorrelate the following variables: (1) random variables in accent embeddings; (2) random variables in speaker embeddings; (3) random variables in speaker and accent embeddings from each other. While truly decorrelating the information is difficult, we can promote something close by using the constraints from VICReg [16]. We denote $E^A \in \mathbb{R}^{D_a \times N_a}$ and $E^S \in \mathbb{R}^{D_s \times N_s}$ as the accent and speaker embedding tables respectively. Column vector $e_j \in E$ denotes the j'th embedding in either table. Let $\mu_E$ and $\mathrm{Cov}(E)$ be the corresponding means and covariance matrices. Using VICReg, we constrain standard deviations to be at least $\gamma$ and suppress the off-diagonal elements of the covariance matrix ($\gamma = 1$, $\epsilon = 10^{-4}$):

$$ L_{var} = \frac{1}{D} \sum_{i=j} \max\left(0,\ \gamma - \sqrt{\mathrm{Cov}(E)_{i,j} + \epsilon}\right) \quad (2) $$

$$ L_{covar} = \sum_{i \neq j} \mathrm{Cov}(E)_{i,j}^2 \quad (3) $$

Next, we attempt to decorrelate accent and speaker variables from each other by minimizing the cross-correlation matrix computed from batch statistics. Let $\tilde{E}^A$ and $\tilde{E}^S$ be the column matrices of accent and speaker embedding vectors sampled within a batch of size B. We compute the batch cross-correlation matrix $R^{AS}$ as follows ($\mu_{E^A}$ and $\mu_{E^S}$ are computed from the embedding tables):

$$ R^{AS} = \frac{1}{B-1} (\tilde{E}^A - \mu_{E^A})(\tilde{E}^S - \mu_{E^S})^T \quad (4) $$

$$ L_{xcorr} = \frac{1}{D_a D_s} \sum_{i,j} \big(R^{AS}_{i,j}\big)^2 \quad (5) $$
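The three constraints translate almost line-by-line into code; the following PyTorch sketch is written directly from Eqs. (2)-(5), though how embeddings are sampled per batch in the actual training loop may differ.

```python
import torch

def variance_covariance_loss(E, gamma=1.0, eps=1e-4):
    """Eqs. (2)-(3). E: (D, N) embedding table (accent or speaker);
    columns are embeddings."""
    D, N = E.shape
    Ec = E - E.mean(dim=1, keepdim=True)
    cov = (Ec @ Ec.T) / (N - 1)                              # (D, D)
    std = torch.sqrt(torch.diagonal(cov) + eps)
    l_var = torch.clamp(gamma - std, min=0.0).mean()         # Eq. (2)
    off_diag = cov - torch.diag(torch.diagonal(cov))
    l_covar = (off_diag ** 2).sum()                          # Eq. (3)
    return l_var, l_covar

def cross_correlation_loss(EA_batch, ES_batch, muA, muS):
    """Eqs. (4)-(5). EA_batch: (Da, B), ES_batch: (Ds, B) embeddings used in
    the batch; muA, muS: (Da, 1), (Ds, 1) means from the embedding tables."""
    B = EA_batch.shape[1]
    R = (EA_batch - muA) @ (ES_batch - muS).T / (B - 1)      # Eq. (4)
    return (R ** 2).mean()                                   # Eq. (5): sum / (Da * Ds)

EA, ES = torch.randn(16, 7), torch.randn(16, 7)              # toy accent/speaker tables
l_var, l_covar = variance_covariance_loss(EA)
l_xcorr = cross_correlation_loss(EA[:, :4], ES[:, :4],
                                 EA.mean(1, keepdim=True), ES.mean(1, keepdim=True))
```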
2.4. Accent conditioned speech synthesis

We introduce an extra conditioning variable for accent A to RADTTS [4] to allow for accent-controllable speech synthesis. We call this model RADTTS-ML, a multilingual version of RADTTS. The following equation describes the model:

$$ P_{radtts}(X, \Lambda) = P_{mel}(X \mid \Phi, \Lambda, A, S)\; P_{dur}(\Lambda \mid \Phi, A, S) \quad (6) $$

We refer to our conditioning as accent instead of language, because we consider language to be implicit in the phoneme sequence. The information captured by the accent embedding should explain the fine-grained differences between how phonemes are pronounced in different languages.
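As an illustrative sketch of what the extra accent conditioning variable looks like in practice (not the exact RADTTS-ML interface), the snippet below looks up separate accent and speaker embeddings and broadcasts them over the encoded text; both factors in Eq. (6) then condition on this combined context, while language enters only through the phoneme sequence.

```python
import torch
import torch.nn as nn

class AccentSpeakerContext(nn.Module):
    """Illustrative context builder: encoded text + accent + speaker embeddings."""
    def __init__(self, n_accents, n_speakers, emb_dim):
        super().__init__()
        self.accent_emb = nn.Embedding(n_accents, emb_dim)    # A
        self.speaker_emb = nn.Embedding(n_speakers, emb_dim)  # S

    def forward(self, text_enc, accent_id, speaker_id):
        # text_enc: (B, T, C); accent_id, speaker_id: (B,)
        B, T, _ = text_enc.shape
        a = self.accent_emb(accent_id).unsqueeze(1).expand(B, T, -1)
        s = self.speaker_emb(speaker_id).unsqueeze(1).expand(B, T, -1)
        return torch.cat([text_enc, a, s], dim=-1)            # (B, T, C + 2*emb_dim)

ctx = AccentSpeakerContext(n_accents=7, n_speakers=7, emb_dim=16)(
    torch.randn(2, 12, 192), torch.tensor([0, 3]), torch.tensor([1, 5]))
print(ctx.shape)   # torch.Size([2, 12, 224])
```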
2.5. Fine-grained frame-level control of speech attributes

Fine-grained control of speech attributes like F0 and energy E can provide high-quality controllable speech synthesis [13]. We believe conditioning on such attributes can help improve accent and language transfer. During training, we condition our mel decoder on ground truth frame-level F0 and energy. Following [13], we train deterministic attribute predictors to predict phoneme durations Λ, F0, and energy E conditioned on speaker S, encoded text Φ, and accent A. We standardize F0 using the speaker's F0 mean and standard deviation to remove speaker-dependent information. This allows us to predict speech attributes for any speaker, accent, and language and control mel synthesis with such features. We refer to this model as RADMMM, which is described as:

$$ P_{radmmm}(X, \Lambda) = P_{mel}(X \mid \Phi, \Lambda, A, S, F_0, E)\; P_{F_0}(F_0 \mid \Phi, A, S)\; P_{E}(E \mid \Phi, A, S)\; P_{dur}(\Lambda \mid \Phi, A, S) \quad (7) $$
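A small sketch of the per-speaker F0 standardization mentioned above (variable names are illustrative): statistics are computed over voiced frames per speaker, contours are normalized before prediction, and a predicted contour can then be rendered in any target speaker's F0 range.

```python
import numpy as np

def speaker_f0_stats(f0_contours):
    """Mean/std of F0 over voiced frames (F0 > 0) across a speaker's utterances."""
    voiced = np.concatenate([c[c > 0] for c in f0_contours])
    return float(voiced.mean()), float(voiced.std())

def standardize_f0(f0, mean, std, eps=1e-8):
    """Remove the speaker-dependent register; unvoiced frames (0) stay 0."""
    out = np.zeros_like(f0)
    voiced = f0 > 0
    out[voiced] = (f0[voiced] - mean) / (std + eps)
    return out

def destandardize_f0(z, voiced, mean, std):
    """Map a normalized contour back to a target speaker's F0 range."""
    out = np.zeros_like(z)
    out[voiced] = z[voiced] * std + mean
    return out

mean, std = speaker_f0_stats([np.array([0.0, 210.0, 220.0, 0.0, 190.0])])
f0 = np.array([0.0, 205.0, 230.0])
z = standardize_f0(f0, mean, std)
print(destandardize_f0(z, f0 > 0, mean=120.0, std=20.0))   # same contour, lower register
```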
3. EXPERIMENTS

We conduct our experiments on an open-source dataset² with a sampling rate of 16kHz. It contains 7 different languages (American English, Spanish, German, French, Hindi, Brazilian Portuguese, and South American Spanish). This dataset emulates low-resource scenarios, with only 1 speaker per accent and a strong correlation between speaker, accent, and language. We use HiFiGAN vocoders trained individually on selected speakers in the evaluation set. We focus on the task of transferring the voice of the 7 speakers in the dataset to the 6 other language and accent settings in the dataset. Herein we refer to RADTTS-ML as RT and RADMMM as RM for brevity.

² Dataset source, metadata and filelists will be released with the source code.
3.1. Ablation of Disentanglement Strategies

We evaluate the effects of disentanglement strategies on the transfer task by measuring speaker timbre retention using the cosine similarity (Cosine Sim) of synthesized samples to the source speaker's reference speaker embeddings obtained from the speaker recognition model Titanet [17]. We measure character error rate (CER) with transcripts obtained from Conformer [18] models trained for each language. Table 1 and Figure 1 demonstrate overall and accent-grouped effects of various disentanglement strategies. The RT baseline uses the shared text token set, accent-conditioned alignment learning, and no additional constraints to disentangle speaker, text, and accent. The RM baseline uses this setup with F0 and energy conditioning.
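For reference, a sketch of the arithmetic behind these two metrics; the speaker embeddings and ASR transcripts themselves are assumed to come from external models (Titanet and Conformer respectively), so only placeholder inputs are shown.

```python
import numpy as np

def cosine_sim(emb_a, emb_b):
    """Cosine similarity between a synthesized-sample speaker embedding and a
    reference speaker embedding (embeddings assumed to come from Titanet)."""
    a, b = np.asarray(emb_a, dtype=float), np.asarray(emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length * 100."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return 100.0 * d[len(ref), len(hyp)] / max(len(ref), 1)

print(cosine_sim([0.1, 0.9, 0.3], [0.2, 0.8, 0.4]))
print(cer("bonjour", "bonshour"))    # one substitution + one insertion -> ~28.6
```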
Table 1: Ablation results comparing disentanglement strategies using Cosine Sim and CER defined in 3.1

Disentanglement Strategy | RADTTS-ML Cosine Sim | RADTTS-ML CER | RADMMM Cosine Sim | RADMMM CER
Baseline (B) | 0.3062 ± 0.0176 | 17.7 | 0.3438 ± 0.0138 | 5.1
(B) + normalized F0 pred | N/A | N/A | 0.3946 ± 0.0143 | 5.3
(B) + L_adv | 0.3027 ± 0.0174 | 39.9 | N/A | N/A
(B) + augmentation | 0.3855 ± 0.0145 | 44.3 | 0.2174 ± 0.0131 | 41.7
(B) + L_var and L_covar | 0.4029 ± 0.0144 | 13.7 | N/A | N/A
(B) + low weight on L_var, L_covar and L_xcorr | 0.3784 ± 0.0112 | 17.0 | 0.3232 ± 0.0154 | 9.6
(B) + L_var, L_covar and L_xcorr | 0.4217 ± 0.0156 | 12.2 | 0.4188 ± 0.0148 | 5.5
(B) + L_var, L_covar, L_xcorr and L_adv | N/A | N/A | 0.4163 ± 0.0157 | 7.2

Fig. 1: Comparing speaker cosine similarity and CER of considered disentanglement strategies for every accent. [Figure: per-accent bar charts of speaker cosine similarity and log(CER) for (a) RADTTS-ML (RT) and (b) RADMMM (RM), over Brazilian Portuguese, American English, German, Spanish, South American Spanish, French, and Hindi.]

Speaker Adversarial Loss (L_adv): We observe that the addition of the L_adv loss to RT and RM does not affect speaker retention when synthesizing the speaker for a target language. However, we observe a degradation (increase) in character error rate. We believe the gradients from the speaker classifier tend to remove speaker and accent information from the encoded text Φ, which affects the encoded text representation, leading to worse pronunciation.

Data Augmentation: We use Praat [19] to apply six augmentations: formant scaling down (×[0.875-1.0]) and up (×[1.0-1.25]), F0 scaling down (×[0.9-1.0]) and up (×[1.0-1.1]), and duration scaling to make samples faster (×[0.9-1.0]) or slower (×[1.0-1.1]). We augment the dataset with the transformed audio, defining a new speaker identifier but retaining the original accent. In RT, this leads to a significant boost in speaker retention. We believe that creating more speakers per accent enhances disentanglement of accent and speaker. However, in RM, where F0 is predicted and the model is explicitly conditioned on augmented F0, we observe a significant drop in speaker retention as well as a degradation in CER with augmentations, likely due to conditioning on noisy augmented features.

Embedding Regularization: We conduct three ablations with regularization: one that adds the variance (L_var) and covariance (L_covar) constraints to the baseline, and two more involving all three constraints (L_var, L_covar, L_xcorr) with small (0.1) and large (10.0) weights. We observe an improvement in speaker similarity, with the best speaker retention achieved with all three constraints in both RT and RM. Moreover, we observe CER similar to the baselines, suggesting similar pronunciation quality. Our final models include the regularization constraints, but we don't use augmentation and L_adv due to worse pronunciation quality and limited success on speaker timbre retention.
3.2. Comparing proposed models with existing methods

We compare our final RT and RM with the Tacotron 2-based model described in [9], which we call T2, on the transfer task. We reproduced the model to the best of our ability, noting that training on our data was unstable, possibly due to data quality, and that results may not be representative of the original implementation. We tune denoising parameters [20] to reduce audio artifacts caused by the over-smooth mels generated by T2 [3, 21]. We attempted to implement YourTTS [14] but ran into issues reproducing its results on our dataset, and hence we don't make a direct comparison to it.

Speaker timbre retention: Table 2 shows the speaker cosine similarity of our proposed models and T2. We observe that RT and RM perform similarly in terms of speaker retention, and both achieve better speaker timbre retention than T2. However, our subjective human evaluation below shows that RM samples are overall better than RT samples in both timbre preservation and pronunciation.

Table 2: Speaker timbre retention using Cosine Sim (Sec 3.1)

Model | Cosine Similarity
RADTTS-ML (RT) | 0.4186 ± 0.0154
RADMMM (RM) | 0.4197 ± 0.0149
Tacotron2 (T2) | 0.145 ± 0.0119
3.3. Subjective human evaluation

We conducted an internal study with native speakers to evaluate accent quality and speaker timbre retention. Raters were pre-screened with a hearing test based on sinusoid counting. Since MOS is not suited for finer differences, we use comparative mean opinion scores (CMOS) with a 5-point scale (-2 to 2) as the evaluation metric. Given a reference sample and pairs of synthesized samples from different models, the raters use the 5-point scale to indicate which sample, if any, they believe is more similar, in terms of accent or speaker timbre, to the target language pronunciation or speaker timbre in the reference audio.
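As an illustration of how such pairwise ratings are aggregated (standard bookkeeping, not code from our evaluation pipeline), the sketch below averages per-rating scores in [-2, 2] and reports a normal-approximation 95% confidence interval.

```python
import math
from typing import List, Tuple

def cmos(scores: List[int]) -> Tuple[float, float]:
    """Mean comparative score and 95% CI half-width (normal approximation).
    scores: ratings in {-2, -1, 0, 1, 2}; positive favors the first model of the pair."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, 1.96 * math.sqrt(var / n)

mean, ci = cmos([1, 0, 2, 1, 0, 1, -1, 2, 1, 0])
print(f"CMOS = {mean:.3f} +/- {ci:.3f}")
```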
Fig. 2: CMOS per accent for model pairs under consideration. [Figure: per-accent pairwise preference ratings with 95% confidence intervals for American English, German, French, Hindi, and Brazilian Portuguese, over the pairs RT Final vs RT Base, RM Final vs RM Base, RM Final vs T2, RM Final vs RM Accented, and RM Final vs RT Final.]

Accent evaluation: We conduct accent evaluation with native speakers of every language. Fig. 2 shows the preference scores of native speakers, with 95% confidence intervals, in each language for the model pairs under consideration. Positive mean scores imply that the top model was preferred over the bottom model within the pair. Given limited access to native speakers, we show results for only 5 languages. We observe that there is no strong preference between RT final and its baseline in terms of accent quality. We find similar results for RM final and its baseline, suggesting that accent and pronunciation are not compromised by our suggested disentanglement strategies. To evaluate controllable accent, we synthesize samples from our best model (RM final) for every speaker in languages other than the speaker's native language. Samples using the non-native language and accent are referred to as RM final, and samples with the new language but native accent are referred to as RM accented. Raters preferred samples using the target accent (RM final) over the source speaker's accent (RM accented), indicating the effectiveness of accent transfer. Finally, RM final is preferred over T2 in terms of accent pronunciation.

Speaker timbre evaluation: Table 3 shows CMOS scores with 95% confidence intervals. First, we observe that in both RT and RM, the final models with disentanglement strategies applied are preferred over the baseline models in terms of speaker timbre retention. RM accented synthesis (RM accented) is rated as having similar speaker timbre to native accent synthesis with RM (RM final), indicating that changing the accent doesn't change speaker timbre in RM, thus showcasing the disentangled nature of accent and speaker. Finally, RM final is preferred over T2 in terms of speaker timbre retention when transferring a speaker's voice to a target language.

Effects of control with F0 and E: Comparing RM final with RT final, we see that RM is preferred for most languages except German, indicating that explicit conditioning on F0 and energy results in better pronunciation and accent. Moreover, as illustrated in Table 1, RM final achieves a better CER than RT final. Table 3 demonstrates that explicit conditioning on F0 and energy in RM results in much better speaker timbre retention compared to RT. RM results in the best speaker retention, accent quality and pronunciation among our models.

Table 3: CMOS for speaker timbre similarity.

Model Pair | CMOS
RT Final vs RT Base | 0.300 ± 0.200
RM Final vs RM Base | 0.750 ± 0.189
RM Final vs RT Final | 0.733 ± 0.184
RM Final vs RM Accented | 0.025 ± 0.199
RM Final vs T2 | 1.283 ± 0.144

4. CONCLUSION

We present a multilingual, multiaccented and multispeaker TTS model based on RADTTS with novel modifications. We propose and explore several disentanglement strategies, resulting in a model that improves speaker, accent and text disentanglement, allowing for synthesis of a speaker with closer-to-native fluency in a desired language without requiring multilingual speakers. Internal ablation studies indicate that explicitly conditioning on fine-grained features (F0 and E) results in better speaker retention and pronunciation according to human evaluators. Our model provides the ability to predict such fine-grained features for any desired combination of speaker, accent and language, and user studies show that under limited data constraints, it improves pronunciation in novel languages. Scaling the model to large-resource conditions with more speakers per accent remains the subject of future work.
5. REFERENCES

[1] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous, "Tacotron: A fully end-to-end text-to-speech synthesis model," CoRR, vol. abs/1703.10135, 2017.

[2] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," CoRR, vol. abs/1712.05884, 2017.

[3] Rafael Valle, Kevin J. Shih, Ryan Prenger, and Bryan Catanzaro, "Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis," in International Conference on Learning Representations, 2020.

[4] Kevin J. Shih, Rafael Valle, Rohan Badlani, Adrian Łańcucki, Wei Ping, and Bryan Catanzaro, "RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis," in ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.

[5] Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, and Tie-Yan Liu, "NaturalSpeech: End-to-end text to speech synthesis with human-level quality," 2022.

[6] Jaehyeon Kim, Jungil Kong, and Juhee Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in International Conference on Machine Learning. PMLR, 2021.

[7] Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez-Moreno, and Yonghui Wu, "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," CoRR, vol. abs/1806.04558, 2018.

[8] Eliya Nachmani and Lior Wolf, "Unsupervised polyglot text-to-speech," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

[9] Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, R. J. Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran, "Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning," CoRR, vol. abs/1907.04448, 2019.

[10] Bo Li and Heiga Zen, "Multi-language multi-speaker acoustic modeling for LSTM-RNN based statistical parametric speech synthesis," 2016.

[11] Adrian Łańcucki, "FastPitch: Parallel text-to-speech with pitch prediction," 2020.

[12] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," arXiv preprint arXiv:2006.04558, 2020.

[13] Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, and Bryan Catanzaro, "Generative modeling for low dimensional speech attributes with neural spline flows," 2022.

[14] Edresson Casanova, Julian Weber, Christopher Shulby, Arnaldo Cândido Júnior, Eren Gölge, and Moacir Antonelli Ponti, "YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone," CoRR, vol. abs/2112.02418, 2021.

[15] Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, and Bryan Catanzaro, "One TTS alignment to rule them all," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

[16] Adrien Bardes, Jean Ponce, and Yann LeCun, "VICReg: Variance-invariance-covariance regularization for self-supervised learning," in International Conference on Learning Representations, 2022.

[17] Nithin Rao Koluguri, Taejin Park, and Boris Ginsburg, "TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context," 2021.

[18] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, "Conformer: Convolution-augmented transformer for speech recognition," 2020.

[19] Paul Boersma and David Weenink, "Praat: doing phonetics by computer," 2003.

[20] Ryan Prenger, Rafael Valle, and Bryan Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[21] Yi Ren, Xu Tan, Tao Qin, Zhou Zhao, and Tie-Yan Liu, "Revisiting over-smoothness in text to speech," arXiv preprint arXiv:2202.13066, 2022.
