Abstract:- Voice technology has emerged as a hotspot for deep learning research due to rapid advancements in computer technology. The goal of human-computer interaction should be to give computers the ability to feel, see, hear, and speak. Voice is the most favorable approach for future interaction between humans and computers because it offers more benefits than any other method. One example of voice technology capable of imitating a particular person's voice is voice cloning. Real-time voice cloning from only a few samples is proposed as a solution to the problem that, in the past, voice cloning required a large number of samples and a long processing time. This strategy deviates from the conventional model.

For independent training, different databases and models are used, but for joint modeling, only three models are used. The vocoder makes use of a novel variant of LPCNet that performs well on limited samples and on low-performance devices.

Keywords:- Real-Time, Samples, Voice Cloning.

I. INTRODUCTION

In recent years, deep learning models have proved to be a dominant force across a variety of applied machine learning domains. One such domain is text-to-speech (TTS), which involves generating artificial speech from textual inputs. Achieving natural-sounding voice synthesis, with accurate pronunciation, lively intonation, and minimal background noise, necessitates training data of high quality. However, the amount of reference speech required to clone a voice varies significantly between methods, ranging from half an hour per speaker to just a few seconds.

To contribute to this field, we have reproduced the framework proposed by Jia et al. and made our own implementation. We have also integrated a real-time model based on the work of Kalchbrenner et al. into the framework. In this paper, we provide a comprehensive overview of AI-based TTS techniques, with a specific focus on the development and current state of TTS systems, emphasizing speech naturalness as the primary evaluation metric. We then present the findings of Jia et al. (2018) and discuss our own implementation. Finally, we conclude by presenting a toolbox that we have developed to interface with the framework.

A. Research Background
Our method for cloning voices in real time relies primarily on (Jia et al., 2018), referred to in this paper as SV2TTS. It develops a framework for zero-shot voice cloning that demands only five seconds of reference speech. The paper is one of many that Google has published in the Tacotron series. Interestingly, SV2TTS introduces little new material of its own; it builds on three earlier major Google works: the GE2E loss (Wan et al., 2017), Tacotron (Wang et al., 2017), and WaveNet (van den Oord et al., 2016). The entire architecture is a three-stage pipeline whose steps correspond to the order in which these models were listed. Many of Google's current TTS tools and features, such as Google Assistant and Google Cloud services, employ these same models. As of May 2019, we do not know of any open-source reimplementation of these models that includes the SV2TTS framework.

B. System Overview
This study breaks down the system into three parts: encoder, synthesizer, and vocoder, as depicted in Figure 1.
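To make the three-stage pipeline concrete, the sketch below chains the parts at inference time. It is a minimal illustration, not the exact interface of our toolbox; the names SpeakerEncoder-style encoder, synthesizer, and vocoder objects and their methods are placeholders.

    # Minimal sketch of the encoder -> synthesizer -> vocoder pipeline (illustrative names).
    import numpy as np

    def clone_voice(reference_wav: np.ndarray, text: str,
                    encoder, synthesizer, vocoder) -> np.ndarray:
        """Generate speech saying `text` in the voice of `reference_wav`."""
        # 1. Encoder: a fixed-size speaker embedding from a few seconds of reference audio.
        embedding = encoder.embed_utterance(reference_wav)

        # 2. Synthesizer: a Tacotron-style model conditioned on the embedding
        #    produces a mel spectrogram from the input text.
        mel = synthesizer.synthesize(text, speaker_embedding=embedding)

        # 3. Vocoder: a neural vocoder turns the mel spectrogram into a waveform.
        return vocoder.infer(mel)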
Fig. 1. Model overview during inference. The SV2TTS framework employs a Tacotron-based design altered to support voice conditioning, as indicated by the blue blocks. Figure adapted from (Jia et al., 2018).
Fig. 2. The process of constructing the similarity matrix during training. Figure adapted from (Wan et al., 2017).
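Figure 2 refers to the similarity matrix used by the GE2E loss (Wan et al., 2017). As a rough illustration of that construction, and not code from the cited paper, the sketch below scores each utterance embedding against every speaker centroid by scaled cosine similarity, using an exclusive centroid (the utterance removed) for the utterance's own speaker.

    import numpy as np

    def ge2e_similarity_matrix(embeds: np.ndarray, w: float = 10.0, b: float = -5.0) -> np.ndarray:
        """embeds: (speakers, utterances_per_speaker, dim), rows L2-normalized.
        Returns a (speakers, utterances, speakers) matrix of scaled cosine similarities."""
        S, U, D = embeds.shape
        centroids = embeds.mean(axis=1)                                  # inclusive centroids (S, D)
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

        # Exclusive centroids: a speaker's centroid recomputed without utterance i,
        # used when an utterance is compared against its own speaker.
        excl = (embeds.sum(axis=1, keepdims=True) - embeds) / (U - 1)    # (S, U, D)
        excl /= np.linalg.norm(excl, axis=2, keepdims=True)

        sim = np.einsum('sud,kd->suk', embeds, centroids)                # cosine vs every centroid
        own = np.einsum('sud,sud->su', embeds, excl)                     # cosine vs own exclusive centroid
        for j in range(S):
            sim[j, :, j] = own[j]
        return w * sim + b                                               # learnable scale and bias in GE2E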
In a training batch, the utterances have a fixed duration of 1.6 seconds. These are partial utterances sampled from the longer complete utterances in the dataset. Although the model architecture can handle inputs of varying lengths, it is reasonable to expect it to perform best on utterances of the same length as those observed during training. As a result, at inference time an utterance is divided into 1.6-second segments with 50 percent overlap, and the encoder forwards each segment separately.

The utterance embedding is created by averaging and normalizing the resulting outputs. This is delineated in Figure 3. Curiously, the SV2TTS authors recommend windows of 1.6 seconds for training but 800 milliseconds at inference time. As in GE2E, we prefer to keep 1.6 seconds for both.
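The segmentation and averaging described above can be sketched as follows. The 1.6-second window and 50 percent overlap come from the text; the 16 kHz sampling rate and the embed_frames function are assumptions made for illustration.

    import numpy as np

    def embed_utterance(wav: np.ndarray, embed_frames, sampling_rate: int = 16000,
                        window_s: float = 1.6, overlap: float = 0.5) -> np.ndarray:
        """Split a waveform into overlapping windows, embed each, then average and normalize."""
        window = int(window_s * sampling_rate)
        hop = int(window * (1 - overlap))                     # 50% overlap -> 0.8 s hop

        # Slice the utterance into 1.6-second partial utterances (pad the tail if too short).
        if len(wav) < window:
            wav = np.pad(wav, (0, window - len(wav)))
        segments = [wav[start:start + window]
                    for start in range(0, len(wav) - window + 1, hop)]

        # Embed each partial utterance separately, then average and L2-normalize.
        partial_embeds = np.stack([embed_frames(seg) for seg in segments])
        embed = partial_embeds.mean(axis=0)
        return embed / np.linalg.norm(embed)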
Fig. 4. The steps of silence removal using VAD, from top to bottom. The orange line is the binary voice flag, with the upper value indicating that the segment is voiced and the lower value indicating that it is unvoiced.
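As a hedged illustration of how such a binary voice flag could be computed, the sketch below uses the webrtcvad package on 30 ms frames of 16-bit PCM audio; the frame length, aggressiveness setting, and 16 kHz sample rate are assumptions for illustration rather than the exact settings of our implementation.

    import numpy as np
    import webrtcvad

    def voice_flag(wav: np.ndarray, sampling_rate: int = 16000,
                   frame_ms: int = 30, aggressiveness: int = 3) -> np.ndarray:
        """Return one boolean per frame: True if the frame is voiced."""
        vad = webrtcvad.Vad(aggressiveness)
        frame_len = sampling_rate * frame_ms // 1000

        # webrtcvad expects 16-bit mono PCM bytes at 8/16/32/48 kHz.
        pcm = (np.clip(wav, -1.0, 1.0) * 32767).astype(np.int16)
        flags = []
        for start in range(0, len(pcm) - frame_len + 1, frame_len):
            frame = pcm[start:start + frame_len].tobytes()
            flags.append(vad.is_speech(frame, sampling_rate))
        return np.array(flags, dtype=bool)

    # Keeping only the voiced frames removes the silences, as in Figure 4.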
Fig. 5. Example of synthesizing a sentence in several voices with the proposed approach. Mel spectrograms of the speaker-embedding reference utterances (left) and the synthesizer outputs (right) are displayed. The alignment of the text with the spectrogram is highlighted in red. Three speakers are held out of the training sets: one male (top) and two female (middle and bottom).
III. CONCLUSION