
Volume 8, Issue 7, July – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Deep Learning-Based Analysis of a Real-Time Voice Cloning System

Ndjoli Isaaka Luc Emmanuel
School of Computer Science and Applications
Reva University
Bangalore, India

Kusha K. R
Prof., School of Computer Science and Applications
Reva University
Bangalore, India

Abstract:- Voice technology has emerged as a hotspot for deep learning research due to rapid advances in computer technology. The goal of human-computer interaction should be to give computers the ability to feel, see, hear, and speak. Voice is the most promising channel for future interaction between humans and computers because it offers more benefits than any other method. One example of voice technology capable of imitating a particular person's voice is voice cloning. Real-time voice cloning from only a few samples is proposed as a solution to the traditional requirement of supplying a large number of samples and waiting a long time for a cloned voice; this strategy deviates from the conventional model. Different databases and models are used for independent training, while only three models are used for joint modeling. The vocoder makes use of a novel variant of LPCNet that works well with limited samples and on low-performance devices.

Keywords:- Real-Time, Samples, Voice Cloning.

I. INTRODUCTION

In recent years, deep learning models have proved to be a dominant force in a variety of applied machine learning domains. One such domain is text-to-speech (TTS), which involves generating artificial speech from textual inputs. Achieving natural-sounding voice synthesis, with accurate pronunciation, lively intonation, and minimal background noise, necessitates training data of high quality. However, the amount of reference speech required to clone a voice varies significantly between methods, ranging from half an hour per speaker to just a few seconds.

To contribute to this field, we have reproduced the framework proposed by Jia et al. and made our own implementation. We have also integrated a real-time model based on the work of Kalchbrenner et al. into the framework. In this paper, we provide a comprehensive overview of AI-based TTS techniques, with a specific focus on the development and current state of TTS systems, emphasizing speech naturalness as the primary evaluation metric. We then present the findings of Jia et al. (2018) and discuss our own implementation. Finally, we conclude by presenting a toolbox that we have developed to interface with the framework.

A. Research Background
Our method for cloning voices in real time relies primarily on (Jia et al., 2018), referred to in this paper as SV2TTS. It develops a framework for zero-shot voice cloning that demands only five seconds of reference speech. This paper is one of many that Google has published in the Tacotron series. Interestingly, the SV2TTS paper introduces little new material of its own; it mainly combines three earlier Google works: the Tacotron synthesizer, the GE2E loss (Wan et al., 2017), and the WaveNet vocoder (van den Oord et al., 2016). The entire architecture is a three-stage pipeline whose stages correspond to the order in which these models were listed. Many of Google's current TTS tools and features, such as Google Assistant and Google Cloud services, employ these same models. As of May 2019, we do not know of any open-source reimplementation of these models that includes the SV2TTS framework.

B. System Overview
This study breaks the system down into three parts: encoder, synthesizer, and vocoder, as depicted in Figure 1.

Fig. 1. Model overview. The SV2TTS framework uses a Tacotron design that has been altered to support voice conditioning during inference, as indicated by the blue blocks. Figure adapted from (Jia et al., 2018).
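For illustration only, the three-stage pipeline of Figure 1 can be sketched as follows. The object names and method signatures are hypothetical stand-ins, not the actual interface of our toolbox.

```python
def clone_voice(reference_wav, text, encoder, synthesizer, vocoder):
    """Three-stage SV2TTS pipeline of Figure 1 (illustrative pseudocode).
    The encoder, synthesizer, and vocoder objects and their method names
    are hypothetical stand-ins, not the actual toolbox interface."""
    # 1. Speaker encoder: a few seconds of reference speech -> fixed-size embedding
    speaker_embedding = encoder.embed_utterance(reference_wav)
    # 2. Synthesizer: text conditioned on the embedding -> mel spectrogram
    mel = synthesizer.synthesize(text, speaker_embedding)
    # 3. Vocoder: mel spectrogram -> time-domain waveform
    return vocoder.infer(mel)
```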

C. Network Structure

 Speaker Encoder
High-quality audio input is not required for this study, so a vast corpus of many distinct speakers is used to train the encoder. Strong noise robustness and ease of capturing human acoustic characteristics are two advantages of this choice. Additionally, the encoder is trained with the GE2E loss, which was designed for speaker verification tasks; that task specifies how the output embedding is produced.

 Model architecture
The model is a three-layer LSTM with 768 hidden nodes and a projection layer of 256 units. Although no paper describes this projection layer, our understanding is that it is merely a fully connected layer with 256 outputs that is applied repeatedly to each LSTM output. For simplicity, rapid prototyping, and a lower training load, we used 256-unit LSTM layers directly when we first implemented the speaker encoder. This last point matters because the authors' model was trained for 50 million steps on a larger dataset, which is technically difficult for us to replicate; since the smaller model performed exceptionally well, we have not taken the time to train the larger one. The model receives 40-channel log-mel spectrograms computed with a 10 ms step and a 25 ms window. The output is a 256-element vector corresponding to the L2-normalized hidden state of the last layer. Prior to normalization, our approach additionally includes a ReLU layer, with the intention of making embeddings sparse and hence simpler to interpret.

 Generalized End-to-End loss
The speaker encoder is trained on a speaker verification task. Speaker verification is a common biometric application in which a person's voice is used to verify their identity. A template is created for an individual by deriving their speaker embedding from a few utterances; this procedure is called enrollment. At runtime, a user identifies himself with a brief utterance, and the system compares that utterance's embedding to the embeddings of the enrolled speakers. The user is identified when the similarity exceeds a given threshold. The GE2E loss simulates this process in order to train the model. During training, the model computes the embeddings e_ij (1 <= i <= N, 1 <= j <= M) of M fixed-duration utterances from each of N speakers. A speaker embedding c_i is derived for each speaker:

c_i = (1/M) * sum_{j=1}^{M} e_ij

The similarity matrix S_ij,k is the result of comparing every embedding e_ij against every speaker embedding c_k (1 <= k <= N) in the batch, using the scaled cosine similarity:

S_ij,k = w * cos(e_ij, c_k) + b = w * (e_ij . c_k / ||c_k||_2) + b    (1)

where w and b are learnable parameters. Figure 2 depicts the full procedure. Since the cosine similarity of two L2-normalized vectors is simply their dot product, the rightmost form of Equation 1 follows when the embeddings are L2-normalized. An ideal model should produce high similarity values when an utterance matches the speaker (i = k) and lower similarity values elsewhere (i != k). The loss is the sum of row-wise softmax losses over the similarity matrix.
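The simplified encoder described under "Model architecture" above can be sketched in a few lines of PyTorch. This is a minimal illustration of our 256-unit variant; the class name and layer details are ours rather than the reference implementation.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Simplified speaker encoder: 3-layer LSTM over 40-channel log-mel
    frames, followed by a linear layer, ReLU, and L2 normalization."""

    def __init__(self, mel_channels=40, hidden_size=256, embed_size=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=mel_channels, hidden_size=hidden_size,
                            num_layers=3, batch_first=True)
        self.linear = nn.Linear(hidden_size, embed_size)
        self.relu = nn.ReLU()

    def forward(self, mels):
        # mels: (batch, frames, 40); 1.6 s at a 10 ms step gives 160 frames
        _, (hidden, _) = self.lstm(mels)
        # Final hidden state of the last LSTM layer -> raw embedding
        raw = self.relu(self.linear(hidden[-1]))
        # L2-normalize so that cosine similarity reduces to a dot product
        return raw / (torch.norm(raw, dim=1, keepdim=True) + 1e-6)
```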

Fig. 2. The process of creating the similarity matrix during training. Figure adapted from (Wan et al., 2017).
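A minimal sketch of the similarity matrix of Equation 1 and Figure 2 is given below. For brevity it omits the exclusive centroids that the full GE2E loss uses when i = k, and the function name is ours.

```python
import torch

def similarity_matrix(embeds, w, b):
    """Scaled cosine similarity matrix of Equation 1 (simplified sketch).

    embeds: (N, M, D) tensor of L2-normalized partial-utterance embeddings
    for N speakers and M utterances each; w, b: learnable scalars.
    Note: the full GE2E loss excludes an utterance from its own speaker's
    centroid when i == k; that refinement is omitted here."""
    centroids = embeds.mean(dim=1)                                   # c_k, shape (N, D)
    centroids = centroids / (centroids.norm(dim=1, keepdim=True) + 1e-6)
    # S[i, j, k] = e_ij . c_k / ||c_k|| (embeddings are already unit-norm)
    sim = torch.einsum("nmd,kd->nmk", embeds, centroids)
    return w * sim + b
```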

In a training batch, the utterances have a fixed duration of 1.6 seconds. These are partial utterances sampled from the dataset's longer complete utterances. Although the model architecture can handle inputs of varying lengths, it is reasonable to expect it to perform best on utterances of the same length as those seen during training. At inference time, an utterance is therefore divided into 1.6-second segments with 50 percent overlap, and the encoder forwards each segment separately. The resulting outputs are averaged and normalized to produce the utterance embedding, as delineated in Figure 3. Curiously, the SV2TTS authors recommend windows of 1.6 seconds for training but 800 milliseconds at inference time. As in GE2E, we prefer to keep 1.6 seconds for both.
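The windowing-and-averaging procedure can be sketched as follows, assuming a 10 ms mel frame step (so 1.6 s corresponds to 160 frames) and an encoder that maps a batch of mel windows to one embedding per window. The helper name and padding strategy are illustrative.

```python
import numpy as np

def embed_utterance(mel, encoder, window=160, overlap=0.5):
    """Embed a full utterance from its mel frames (frames x channels).
    1.6 s at a 10 ms step = 160 frames; windows overlap by 50%.
    `encoder` is assumed to return a NumPy array with one embedding
    per window."""
    step = int(window * (1 - overlap))
    starts = range(0, max(len(mel) - window, 0) + 1, step)
    windows = []
    for s in starts:
        chunk = mel[s:s + window]
        if len(chunk) < window:  # pad the tail by repeating the last frame
            chunk = np.pad(chunk, ((0, window - len(chunk)), (0, 0)), mode="edge")
        windows.append(chunk)
    partials = encoder(np.stack(windows))        # (num_windows, 256)
    raw = partials.mean(axis=0)                  # average the partial embeddings
    return raw / (np.linalg.norm(raw) + 1e-6)    # re-normalize to unit length
```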

Fig. 3. Calculating the embedding of an entire utterance. The d-vectors are essentially the model's unnormalized outputs. This illustration is based on (Wan et al., 2017).

 Experiments
We use the webrtcvad Python package for Voice Activity Detection (VAD) to avoid sampling predominantly silent segments when drawing partial utterances from complete utterances. The VAD produces a binary flag over the audio indicating whether or not each segment is spoken. Short spikes in the detection are smoothed out by applying a moving average to this binary flag, which is then binarized again. Finally, we dilate the flag with a kernel of size s + 1, where s is the longest duration of silence that is tolerated, and the unvoiced parts are cut out of the audio. We chose s = 0.2 s because it preserves the natural prosody of speech. Figure 4 demonstrates this procedure. The final pre-processing step is to normalize the audio waveforms to compensate for the varying volume of the dataset's speakers.

Fig. 4. The steps of silence removal using VAD, from top to bottom. The orange line is the binary voiced/unvoiced flag, with the high value indicating that the segment is voiced and the low value that it is unvoiced.
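A rough sketch of this silence-removal procedure is given below, assuming 16 kHz mono audio. The webrtcvad aggressiveness mode, the smoothing width, and the helper name are our own illustrative choices rather than fixed parts of the method.

```python
import numpy as np
import webrtcvad

def trim_silences(wav, sample_rate=16000, frame_ms=30, max_silence_s=0.2):
    """VAD-based silence removal sketch. `wav` is a float mono signal in
    [-1, 1]; mode, smoothing width, and helper name are illustrative."""
    vad = webrtcvad.Vad(3)                      # most aggressive mode
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(wav) // frame_len
    pcm = (wav[:n_frames * frame_len] * 32767).astype(np.int16)

    # 1. Binary voiced/unvoiced flag, one value per frame
    flags = np.array([vad.is_speech(pcm[i * frame_len:(i + 1) * frame_len].tobytes(),
                                    sample_rate)
                      for i in range(n_frames)], dtype=float)

    # 2. Moving average to smooth short spikes, then binarize again
    width = 8
    flags = np.convolve(flags, np.ones(width) / width, mode="same") > 0.5

    # 3. Dilate with a kernel of size s + 1 so short pauses are kept
    s = int(max_silence_s * 1000 / frame_ms)
    flags = np.convolve(flags.astype(float), np.ones(s + 1), mode="same") > 0

    # 4. Keep only the voiced samples
    mask = np.repeat(flags, frame_len)
    return wav[:len(mask)][mask]
```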

 Synthesizer
The synthesizer is based on an improved Tacotron 2, in which a modified network replaces the original WaveNet. Each character in the text sequence is embedded as a vector and then passed through convolutional layers, so that a single encoder frame covers a longer context. The corresponding phoneme sequence can also be given as input to speed up convergence and improve pronunciation. These frames are then passed through a bidirectional LSTM to produce the encoder output frames. An attention mechanism attends over the encoder output frames to generate each decoder input frame. The model is autoregressive because every decoder input frame is linked to the output of the preceding decoder frame.

The resulting vector passes through two unidirectional LSTM layers before being projected onto a single mel spectrogram frame. Frame generation is halted when a second projection network, applied to the same vector, emits a stop value above a predetermined threshold. Before becoming the final mel spectrogram, the entire sequence of frames goes through a residual network. Figure 1 depicts this architecture. The synthesizer's target mel spectrograms carry more acoustic detail than those used by the speaker encoder: they have 80 channels and are computed with a 12.5-millisecond step from a 50-millisecond window. This module makes use of the THCHS-30 data set, a 30-hour corpus created by Tsinghua University. It was chosen because it provides pinyin transcriptions, which makes processing easier. Each transcript must go through an alignment step before it can be converted into LibriSpeech format, which takes a lot of time.

 Vocoder
To transform the mel spectrograms produced by the synthesis network into time-domain waveforms, we employ the sample-by-sample autoregressive WaveNet as a vocoder. The architecture, consisting of 30 dilated convolution layers, follows the original WaveNet design. The speaker encoder's output has no direct influence on this network: the mel spectrogram predicted by the synthesizer already captures all of the information needed for high-quality voice synthesis, which makes it possible to build a multispeaker vocoder simply by training on data from multiple speakers.
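Both the synthesizer targets and the vocoder inputs are mel spectrograms with the parameters quoted above (80 channels, 12.5 ms step, 50 ms window). A hedged sketch of such a feature extractor follows; the sampling rate and FFT size are assumed values, not figures from the paper.

```python
import librosa
import numpy as np

def mel_for_synthesizer(wav, sr=16000, n_mels=80):
    """80-channel log-mel spectrogram with a 12.5 ms step and a 50 ms
    window, as quoted above; sr and n_fft are assumed values."""
    hop = int(sr * 0.0125)      # 12.5 ms -> 200 samples at 16 kHz
    win = int(sr * 0.050)       # 50 ms   -> 800 samples at 16 kHz
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=2048,
                                         hop_length=hop, win_length=win,
                                         n_mels=n_mels)
    return np.log(np.maximum(mel, 1e-5)).T     # (frames, n_mels), log-compressed
```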

During inference, the model is conditioned on reference speech audio that has not been transcribed and does not need to match the text to be synthesized. It can even be conditioned on audio from speakers outside the training set, because the characteristics of the speakers used in synthesis are inferred from the audio. In practice, we find that zero-shot adaptation to novel speakers can be achieved by synthesizing new speech with the speaker characteristics extracted from a single, brief audio clip. We assess how well this procedure generalizes to previously unseen speakers in Section II. Figure 5, which depicts spectrograms generated from a variety of 5-second speaker reference utterances, illustrates the inference process. The spectrogram of the synthesized male speaker exhibits a substantially lower fundamental frequency, visible in the denser harmonic spacing at low frequencies, and different formants, visible in the mid-frequency peaks present during vowel sounds such as the "i" at 0.3 seconds: the top male speaker's F2 is around mel channel 35, whereas the middle speaker's F2 appears closer to channel 40. Sibilant sounds show similar variations; for instance, in the male voice, the "s" at 0.4 seconds contains more energy in the lower frequencies than in the female voices. The speaker embedding also partially captures the characteristic speaking rate, as indicated by the longer signal duration in the lower row compared to the upper two rows. Similar observations can be made on the reference utterance spectrograms.

Fig. 5. Example of synthesizing a sentence in several voices with the proposed approach. Mel spectrograms are shown for the speaker-embedding reference utterances (left) and the synthesizer outputs (right). The text aligned to each spectrogram is highlighted in red. The three speakers are held out of the training sets: one male (top) and two female (center and bottom).
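A comparison in the spirit of Figure 5 can be reproduced informally with a snippet such as the following; the plotting details are illustrative and not part of the evaluation protocol.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_mel_comparison(reference_wav, synthesized_wav, sr=16000):
    """Plot reference vs. synthesized mel spectrograms side by side."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
    for ax, (wav, title) in zip(axes, [(reference_wav, "Reference"),
                                       (synthesized_wav, "Synthesized")]):
        mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
        librosa.display.specshow(librosa.power_to_db(mel, ref=np.max), sr=sr,
                                 x_axis="time", y_axis="mel", ax=ax)
        ax.set_title(title)
    plt.tight_layout()
    plt.show()
```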

II. EXPERIMENTAL RESULTS AND ANALYSIS

Testing the method and evaluating how well it works is an important part of voice cloning research. Devising a reliable and effective evaluation strategy is critical for improving the performance of a voice cloning system. At present, primarily objective and subjective methods are used to test and evaluate the performance of voice cloning.

A. Objective evaluation and analysis
The cloned speech generated in the tests was compared with the original voice in terms of MFCC and spectrum. As an example for the male voice, the content of STCMD00044A is: "the man asked me if I would like to".

Fig. 6. Male original speech and cloned voice MFCC
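A minimal sketch of such an objective comparison is shown below. The paper does not specify its exact distance measure, so the frame-averaged Euclidean distance used here is only an assumption for illustration.

```python
import librosa
import numpy as np

def mfcc_distance(path_original, path_cloned, sr=16000, n_mfcc=13):
    """Illustrative objective comparison: extract MFCCs from the original
    and cloned utterances and report a frame-averaged Euclidean distance.
    The paper does not specify its exact metric; this is only a sketch."""
    orig, _ = librosa.load(path_original, sr=sr)
    clone, _ = librosa.load(path_cloned, sr=sr)
    m1 = librosa.feature.mfcc(y=orig, sr=sr, n_mfcc=n_mfcc)
    m2 = librosa.feature.mfcc(y=clone, sr=sr, n_mfcc=n_mfcc)
    # Truncate to the shorter utterance; a real comparison would align with DTW
    frames = min(m1.shape[1], m2.shape[1])
    return float(np.mean(np.linalg.norm(m1[:, :frames] - m2[:, :frames], axis=0)))
```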

Fig. 7. Male speech spectrum alignment and comparison

As an example for the female voice, the content of STCMD00052I is: "prepare the stock gap in advance".

Fig. 8. Female original speech and cloned voice MFCC

Fig. 9. Female speech spectrum alignment and comparison

The cloned speech and the original speech are nearly identical in the middle and final sections, as shown in the figures above; however, the beginning is less accurate and can be improved in future work. Because less male voice data was available for training, the female voice clones performed better than the male ones, showing high spectral similarity and good speech-to-text alignment.

B. Subjective evaluation and analysis
In the subjective evaluation, listeners' subjective impressions are used to assess the synthesized speech. The cloning effect is typically evaluated with the subjective Mean Opinion Score (MOS) method, which takes into account both the quality of the speech and the similarity of the speaker's characteristics. In a MOS test, reviewers are asked to score their subjective impression of each test utterance on a five-grade scale, which serves as a subjective assessment of speech quality and speaker similarity. The MOS score is the average over all test utterances and reviewers. The scale has five levels, with 1 point representing the least intelligible speech and 5 points the most natural. We recruited ten Internet users to provide subjective evaluations.

TABLE 1 SCORES FROM THE FEMALE VOICE MOS TEST
No.    1  2  3  4  5  6  7  8  9  10
Score  3  3  5  4  3  5  4  4  5  3

TABLE 2 SCORES FROM THE MALE VOICE MOS TEST
No.    1  2  3  4  5  6  7  8  9  10
Score  4  3  3  3  4  3  4  5  4  3

The data above show that the cloning of male and female voices still differs in quality. This is due to a number of factors: 1. The female voice is typically sharper and more penetrating, making it easier for the model to extract its characteristics. 2. Male voice cloning does not achieve the same quality as female voice cloning because there is insufficient male voice data in the study's training database.
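For reference, averaging the scores listed in Tables 1 and 2 (the means are not reported explicitly above) gives a MOS of about 3.9 for the female voice and 3.6 for the male voice, consistent with the analysis above:

```python
# Average MOS computed from the scores listed in Tables 1 and 2
# (the paper does not report these means explicitly).
female_scores = [3, 3, 5, 4, 3, 5, 4, 4, 5, 3]
male_scores = [4, 3, 3, 3, 4, 3, 4, 5, 4, 3]

female_mos = sum(female_scores) / len(female_scores)   # 3.9
male_mos = sum(male_scores) / len(male_scores)          # 3.6
print(f"Female MOS: {female_mos:.1f}, Male MOS: {male_mos:.1f}")
```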

III. CONCLUSION

In conclusion, we have developed a real-time voice cloning system that, at present, has not been made publicly accessible. While we are satisfied with the overall outcomes of our framework, we acknowledge the presence of some artificial prosody. Comparatively, approaches utilizing more reference speech have demonstrated superior voice cloning capabilities. Moving forward, we intend to enhance our framework by incorporating recent advancements in the field that extend beyond the scope of this research.

Looking ahead, we anticipate the emergence of even more powerful voice cloning techniques in the near future. Our ongoing efforts will focus on refining our framework, fixing the limitations identified, and exploring potential avenues for improvement. By remaining up to date on the newest advances, we aim to contribute to the continuous progress and availability of advanced voice cloning technologies.

ACKNOWLEDGMENT

I would like to express my sincere gratitude to my guide, Prof. Kusha K. R, for his constant encouragement, insightful suggestions, and critical evaluation of my work. I am also

grateful to my lecturer, Ass. Prof. Shreetha Bhat, for her valuable guidance, unwavering support, and constructive feedback.

Additionally, I would like to extend my gratitude to my mentor, Dr. Hemanth K. S, for his invaluable support, encouragement, and expert advice. His mentorship, insights, and guidance were crucial in shaping my research and inspiring me to push the boundaries of my knowledge.

Finally, I would be remiss if I did not mention my family, particularly my parents and siblings. Their confidence in me has kept my motivation and spirits strong throughout the process.
REFERENCES

[1]. Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in Neural Information Processing Systems, 31, 2018.
[2]. K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi,
and T. Kitamura, “Speech parameter generation
algorithms for HMM-based speech synthesis,” in Proc.
Int. Conf. Acoust., Speech, Signal Process., 2000, pp.
1315–1318.
[3]. Aaron van den Oord, Sander Dieleman, Heiga Zen, et al. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[4]. Corentin Jemine. Automatic Multispeaker Voice Cloning. Master's thesis, 2019, pp. 10–29.
[5]. Sercan O Arik, Jitong Chen, Kainan Peng, Wei Ping,
and Yanqi Zhou. Neural voice cloning with a few
samples. arXiv preprint arXiv:1802.06006, 2018.
[6]. Joon Son Chung, Arsha Nagrani, and Andrew
Zisserman. VoxCeleb2: Deep speaker recognition. In
Interspeech, pages 1086–1090, 2018.
[7]. Yutian Chen, Yannis Assael, Brendan Shillingford,
David Budden, Scott Reed, Heiga Zen, Quan Wang,
Luis C Cobo, Andrew Trask, Ben Laurie, et al. Sample
efficient adaptive text-to-speech. arXiv preprint
arXiv:1809.10460, 2018.
[8]. Rama Doddipatla, Norbert Braunschweiler, and Ranniery Maia. Speaker adaptation in DNN-based speech synthesis using d-vectors. In Proc. Interspeech, pages 3404–3408, 2017.
[9]. Arsha Nagrani, Joon Son Chung, and Andrew
Zisserman. VoxCeleb: A large-scale speaker
identification dataset. arXiv preprint arXiv:1706.08612,
2017.
[10]. Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: an ASR corpus based on public domain audiobooks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE, 2015.
[11]. Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
[12]. Artificial Intelligence at Google – Our Principles. https://ai.google/principles/, 2018.
[13]. Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit, 2017.
[14]. Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. In Proc. International Conference on Learning Representations (ICLR), 2017.
