CLONE-THE-TONE: FASTSPEECH 2 AND
ECAPA-TDNN BASED VOICE CLONING SYSTEM
IN NEPALI
Submitted by:
SANDIP SHARMA [KAN078BCT076]
PRAJESH BILASH PANTA [KAN078BCT056]
SUJAN DEULA [KAN078BCT089]
RAJ KARAN SAH [KAN078BCT060]
Submitted to:
Department of Computer and Electronics Engineering
Kantipur Engineering College
Dhapakhel, Lalitpur
July, 2025
ABSTRACT
This work focuses on the task of few-shot multi-speaker, multi-style voice cloning, where the goal is to generate speech that closely mimics both the voice and speaking style of a target speaker using only a few reference samples. This is particularly challenging because the model must generalize well from very limited data. To tackle this, the study explores different ways to represent speaker identity, known as speaker embeddings, and proposes a method that combines two types: pre-trained embeddings (learned from another task, such as voice conversion) and learnable embeddings (learned directly during model training). Among the embedding types tested, those pre-trained using voice conversion techniques proved the most effective at capturing speaker characteristics. These embeddings are then integrated into FastSpeech 2, a fast and high-quality text-to-speech model. The combination of pre-trained and learnable embeddings significantly improves the model's ability to generalize to new, unseen speakers, even with only one or a few examples. As a result, this approach achieved second place in the one-shot track of the ICASSP 2021 M2VoC Challenge, demonstrating its strong performance in few-shot voice cloning.
TABLE OF CONTENTS
Abstract
List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Objectives
  1.4 Application and Scope of project
    1.4.1 Application
    1.4.2 Scope
  1.5 Project Features
  1.6 Feasibility Study
    1.6.1 Economic Feasibility
    1.6.2 Technical Feasibility
    1.6.3 Operational Feasibility
    1.6.4 Schedule Feasibility
  1.7 System Requirement
2 Literature Review
  2.1 Related Research
3 Theoretical Foundation
  3.1 Text-to-Speech
    3.1.1 FastSpeech 2
    3.1.2 Variance Adaptor
    3.1.3 Duration/pitch/energy Predictor
    3.1.4 ECAPA-TDNN
    3.1.5 Speaker Embedding
    3.1.6 Vocoder
4 Methodology
  4.1 Block Diagram
  4.2 Data Collection
  4.3 Data Preprocessing
    4.3.1 Audio Preprocessing
    4.3.2 Text Preprocessing
  4.4 Overview of the System
  4.5 Software Development Life Cycle
5 Epilogue
  5.1 Work Completed and Work Remaining
    5.1.1 Work Completed
    5.1.2 Work Remaining
References
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
LN Layer Normalization
TTS Text-to-Speech
CHAPTER 1
INTRODUCTION
1.1 Background
Neural network-based text-to-speech (TTS) has made rapid progress and attracted a lot of attention in the machine learning and speech community in recent years. TTS has been proven capable of generating high-quality and human-like speech [1]. Non-autoregressive TTS models such as FastSpeech (Ren et al., 2019) can synthesize speech significantly faster than previous autoregressive models with comparable quality. This project applies the concepts of FastSpeech and speaker embeddings to build a voice cloning system. Voice cloning is a rapidly emerging field in artificial intelligence that focuses on the replication of a human voice using computational models. It enables the creation of synthetic speech that closely mimics the tone, pitch, accent, speaking style, and other vocal characteristics of a specific individual. By analyzing a relatively small amount of recorded speech data, voice cloning systems can generate new speech that sounds as if it were spoken by the original speaker, even if the words or sentences were never actually uttered by them. This has opened the door to a wide range of innovative applications, including personalized virtual assistants, dubbing for film and media, accessibility tools for individuals with speech impairments, and conversational AI agents.
The voice cloning process can be broadly classified into two types: speaker-adaptive
voice cloning, which fine-tunes a base model using data from the target speaker, and
speaker-agnostic or zero-shot voice cloning, which uses pretrained models and speaker
embeddings to mimic voices without additional training. The latter is particularly at-
tractive for scalable applications due to its flexibility and efficiency.
However, while voice cloning holds enormous potential, it also brings ethical chal-
lenges, such as the risk of misuse for impersonation or generating deepfakes. Therefore,
any research or deployment of voice cloning technology must also consider responsible
AI practices, including consent, transparency, and security.
1.3 Objectives
1.4 Application and Scope of project
1.4.1 Application
1. Customer Services: Voice cloning can produce virtual agents that sound more natural and human-like. Instead of robotic or generic voices, businesses can clone a real individual's voice to make customer support systems more friendly and interactive. This helps customers feel more comfortable and connected while interacting.
2. Healthcare: Voice cloning aids patients who lose their voice. They can communicate through speech devices using a copy of their own voice, which is more reassuring and natural. Doctors can also leave patients clear, warm voice messages, allowing for effective and comforting communication.
3. Education and Training: Voice cloning can be used to create virtual teachers or assistants with voices similar to real educators, making online learning more engaging and convenient. It can also assist students in learning correct pronunciation during language studies using clear and natural-sounding voices. Cloned voices can likewise be used in training videos or courses to teach topics in a friendly and consistent format.
1.4.2 Scope
1.5 Project Features
1. Few-Shot Voice Cloning: Speakers or users can clone their voices by providing just a few minutes of samples.
2. Multi-Speaker Support: The model can handle multi-speaker voice cloning, wherein multiple users can add their voices by each giving a few minutes of samples. The model can then generate speech in any of the given voices without being trained separately for each user.
3. Custom Voice Cloning Interface: The system has a voice cloning interface personalized to an individual's voice, where users can record or upload a short voice sample of themselves. They can then type any Nepali text, and the system will read it out in their own voice clone. The interface is clean and user-friendly, and it also supports real-time voice personalization.
1.6 Feasibility Study
1.6.1 Economic Feasibility
The economic feasibility of this project is favorable, as it utilizes pre-trained models such as ECAPA-TDNN for extracting discriminative speaker embeddings from raw speech and a vocoder model for converting mel-spectrograms to audio waveforms, both of which are publicly available and do not require licensing fees. The overall cost-effectiveness depends on the scale of fine-tuning required, the availability of existing computational resources, and the efficiency of the training process.
1.6.2 Technical Feasibility
The project is technically highly feasible, as it can be developed on both Windows and macOS using Python, a widely supported and versatile programming language. Python offers a rich ecosystem of libraries like TensorFlow and PyTorch, which streamline the development, training, and evaluation of deep learning models. The availability of pre-trained models, such as ECAPA-TDNN for extracting discriminative speaker embeddings from raw speech and a vocoder model for converting mel-spectrograms to audio waveforms, allows for efficient integration, leveraging their learned features while training the Transformer-based TTS model from scratch. Additionally, cloud-based platforms and GPU acceleration tools, such as Google Colab (with limited free usage) and CUDA for NVIDIA GPUs, significantly enhance computational performance. Given these readily available technologies, the project can be implemented effectively without major compatibility or infrastructure issues.
1.6.3 Operational Feasibility
The operational feasibility of this project ensures that it can be effectively integrated and used within the current environment. The system is designed with a user-friendly interface, making it accessible even to users with minimal technical skills. Implementation and maintenance will consider both hardware and software requirements to ensure smooth operation without excessive resource consumption.
1.6.4 Schedule Feasibility
The Gantt chart presented in Fig. 1.1 represents our project's timeline, with the various tasks plotted against dates. The x-axis shows the timeline from June 2025 to February 2026, while the y-axis lists the tasks involved in the project. Documentation and reporting spans the longest period, running throughout almost the entire timeline and ensuring that project outcomes are properly recorded. Planning and familiarization was a shorter task occurring in the earlier phase. Design started after that, followed by development. Coding happened after data processing, while testing and debugging occurred toward the later stages.
1.7 System Requirement
CHAPTER 2
LITERATURE REVIEW
With the development of deep learning, speech synthesis has made significant progress
in recent years. While recently proposed end-to-end speech synthesis systems, e.g.,
Tacotron, DurIAN and FastSpeech, are able to generate high-quality and natural sound-
ing speech, these models usually rely on a large amount of training data from a single
speaker. The speech quality, speaker similarity, expressiveness and robustness of syn-
thetic speech are still not systematically examined for different speakers and various
speaking styles, especially in real-world low-resourced conditions, e.g., each speaker
only has a few samples at hand for cloning. However, this so-called multi-speaker
multi-style voice cloning task has found significant applications on customized TTS
[2]. Imitating speaking style is one of the desired abilities of a TTS system. Several
strategies have been recently proposed to model stylistic or expressive speech for end-
to-end TTS. Speaking style comes with different patterns in prosody, such as rhythm,
pause, intonation, and stress. Hence, directly modeling the prosodic aspects of speech is beneficial for stylization.
In [2], the Variational Autoencoder (VAE) and Global Style Tokens (GST) are described as two typical models built upon sequence-to-sequence architectures for style modeling. GST models style in an unsupervised way, using multi-head attention to learn a similarity measure between the reference embedding and each token in a bank of randomly initialized embeddings, while the Text-Predicted Global Style Token (TP-GST) learns to predict stylistic renderings from text alone, requiring neither explicit labels during training nor auxiliary inputs for inference. Note that these studies on modeling speaker styles are mostly based on large amounts of data.
In [3], it is noted that previous neural TTS models first generate mel-spectrograms autoregressively from text and then synthesize speech from the generated mel-spectrograms using a separately trained vocoder. These models usually suffer from slow inference speed and robustness issues (word skipping and repeating). Non-autoregressive TTS models are designed to address these issues: they generate mel-spectrograms extremely fast and avoid robustness problems, while achieving voice quality comparable to previous autoregressive models. Among these non-autoregressive TTS methods, FastSpeech is one of the most successful. FastSpeech alleviates the one-to-many mapping problem in two ways: it reduces data variance on the target side by using mel-spectrograms generated by an autoregressive teacher model as the training target (i.e., knowledge distillation), and it introduces duration information (extracted from the attention map of the teacher model) to expand the text sequence to match the length of the mel-spectrogram sequence. These designs ease the learning of the one-to-many mapping problem in TTS.
To control the style of synthesized speech, the global style token (GST) is widely used to enable utterance-level style transfer. Some works also propose an auxiliary style classification task to disentangle style information from phonetic information in the utterances. Since speaker and style information is usually entangled in the training data, it is also possible to learn a latent representation that jointly models speaker and style information. In [1], pretrained and jointly-optimized speaker representations are applied to multi-speaker TTS models. Two different TTS frameworks, Tacotron 2 and FastSpeech 2, are studied. It is shown that with the jointly-optimized speaker representations alone, the TTS models do not generalize well to few-shot speakers, and that using different pretraining tasks results in significant performance differences. By combining both the pretrained and the learnable speaker representations, the experiments show that the audio quality and the speaker similarity of the synthesized speech improve significantly. The synthesized samples are available online, and the results with the FastSpeech 2 TTS framework achieved second place in the one-shot track of the ICASSP 2021 M2VoC challenge.
Neural text-to-speech models are capable of synthesizing natural human voices after being trained on several hours of high-quality single-speaker or multi-speaker recordings. However, to adapt to new speaker voices, these TTS models are fine-tuned using a large amount of speech data, which makes scaling TTS models to a large number of speakers very expensive. Fine-tuning TTS models to new speakers can be challenging for a number of reasons. First, the original TTS model should be pre-trained on a large multi-speaker corpus so that it generalizes well to new voices and different recording conditions. Second, fine-tuning the whole TTS model is very parameter-inefficient, since a new set of weights is needed for every newly adapted speaker. Currently, there are two approaches to make adaptation of TTS more efficient: the first is to modify only the parameters directly related to speaker identity, and the second is to attach a light voice-conversion post-processing module to the baseline TTS model. A third challenge is to reduce the amount of speech required to add a new speaker to an existing TTS model. The paper [4] proposes a new parameter-efficient method for tuning an existing multi-speaker TTS model for new speakers. First, a base multi-speaker TTS model is pre-trained on a large and diverse TTS dataset. To extend the model to new speakers, a few adapters (small modules) are added to the base model, using vanilla adapters, unified adapters, or BitFit. The pre-trained model is then frozen and only the adapters are fine-tuned on the new speaker's data.
In recent years, x-vectors and their subsequent improvements have consistently provided state-of-the-art results on the task of speaker verification [5]. Improving upon the original Time Delay Neural Network (TDNN) architecture is an active area of research. The rising popularity of the x-vector system has resulted in significant architectural improvements and optimized training procedures over the original approach. The topology of the system was improved by incorporating elements of the popular ResNet architecture. Adding residual connections between the frame-level layers has been shown to enhance the embeddings. Additionally, residual connections enable the back-propagation algorithm to converge faster and help avoid the vanishing gradient problem.
CHAPTER 3
THEORETICAL FOUNDATION
3.1 Text-to-Speech
Text-to-Speech (TTS) is a technology that converts written text into spoken words using computer systems. The concept of TTS dates back to the late 1950s and early 1960s, with one of the earliest practical systems developed by Bell Labs in 1961. In fact, Bell Labs researchers John Larry Kelly Jr. and Louis Gerstman created a system that made a computer "sing" Daisy Bell, which later even inspired Arthur C. Clarke's 2001: A Space Odyssey. In this project, TTS is the core process that converts written Nepali text into spoken audio, mimicking the voice of a specific speaker. The goal is not just to generate speech, but to make it sound like it was spoken by a real person: the target speaker. To do this, the TTS system first takes the input Nepali text and converts it into a more suitable form, such as phonemes or characters. Then, using a deep learning model like FastSpeech 2, the system generates a sequence of mel-spectrograms, which are visual representations of how the audio should sound over time. These spectrograms capture information like the pitch, tone, and duration of each sound. What makes FastSpeech 2 powerful is that it is a non-autoregressive model, meaning it generates the entire speech sequence at once, making it faster and more stable than older models. Additionally, to achieve voice cloning, a speaker embedding, a numerical representation of a person's voice style, is added to the model. Finally, the generated spectrogram is passed through a vocoder (such as HiFi-GAN or WaveGlow), which converts it into a natural-sounding waveform. The result is high-quality Nepali speech that sounds like it was spoken by the target person, even though it was generated entirely from text.
3.1.1 FastSpeech 2
3.1.2 Variance Adaptor
3.1.3 Duration/pitch/energy Predictor
It provides a detailed view of the internal architecture of the predictor component used within or alongside the Variance Adaptor. This structure is a deep neural network with a multi-layered design optimized for feature extraction and prediction. The process begins with a Linear Layer, which transforms the input data into a suitable format for subsequent layers. This is followed by a combination of Layer Normalization (LN) and Dropout, where LN stabilizes the training process by normalizing the inputs across the feature dimension, and Dropout randomly deactivates a subset of neurons during training to prevent overfitting. The data then passes through a 1D Convolutional (Conv1D) layer paired with a Rectified Linear Unit (ReLU) activation function, which applies convolutional filters to extract local patterns and introduces non-linearity to model complex relationships. This LN-Dropout-Conv1D-ReLU sequence is repeated, allowing the network to iteratively refine the features. The architecture concludes with another Linear Layer, which produces the final predictions for duration, pitch, and energy. This layered approach leverages the strengths of convolutional operations for local feature detection, normalization for training stability, and dropout for regularization, resulting in a robust and generalized model capable of accurately predicting the target audio parameters.
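As an illustration, below is a minimal PyTorch sketch of a predictor with this Linear, (LN, Dropout, Conv1D, ReLU) x2, Linear layout; the hidden size, kernel size, and dropout rate are assumptions for illustration, not the exact hyperparameters of our model.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of the duration/pitch/energy predictor described above.
    Hidden size, kernel size, and dropout rate are illustrative assumptions."""

    def __init__(self, hidden_dim=256, kernel_size=3, dropout=0.5):
        super().__init__()
        self.input_linear = nn.Linear(hidden_dim, hidden_dim)
        # Two repeated blocks of LayerNorm -> Dropout -> Conv1D -> ReLU
        self.norms = nn.ModuleList([nn.LayerNorm(hidden_dim) for _ in range(2)])
        self.drops = nn.ModuleList([nn.Dropout(dropout) for _ in range(2)])
        self.convs = nn.ModuleList([
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
            for _ in range(2)
        ])
        self.output_linear = nn.Linear(hidden_dim, 1)  # one scalar per phoneme/frame

    def forward(self, x):
        # x: (batch, seq_len, hidden_dim) encoder output
        x = self.input_linear(x)
        for norm, drop, conv in zip(self.norms, self.drops, self.convs):
            x = drop(norm(x))
            # Conv1d expects (batch, channels, seq_len)
            x = torch.relu(conv(x.transpose(1, 2))).transpose(1, 2)
        return self.output_linear(x).squeeze(-1)  # (batch, seq_len)

# Example: predict one value (e.g., log-duration) per phoneme
pred = VariancePredictor()
print(pred(torch.randn(2, 17, 256)).shape)  # torch.Size([2, 17])
```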
3.1.4 ECAPA-TDNN
The squeeze-excitation block further refines the feature maps by modeling interdepen-
dencies between channels, boosting the model’s discriminative power.
5. Statistics Pooling: After processing the audio through the TDNN layers, ECAPA-
TDNN applies statistics pooling (e.g., mean and standard deviation) over the tempo-
ral dimension to create a fixed-length embedding. This embedding encapsulates the
speaker’s voice characteristics, making it suitable for comparison or synthesis tasks.
Feature Extraction: The TDNN layers with time delays extract temporal features,
while the ECA mechanism highlights relevant channels.
Output: The embedding can be used for speaker verification (comparing with other
embeddings) or as a feature input for downstream tasks like voice cloning.
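As a small illustration of the statistics-pooling step, the sketch below (a simplification; ECAPA-TDNN actually uses attentive statistics pooling) concatenates the temporal mean and standard deviation to obtain a fixed-length vector regardless of utterance length:

```python
import torch

def statistics_pooling(frame_features: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Pool variable-length frame-level features (batch, channels, time)
    into a fixed-length utterance vector by concatenating the temporal
    mean and standard deviation of every channel."""
    mean = frame_features.mean(dim=-1)
    std = torch.sqrt(frame_features.var(dim=-1, unbiased=False) + eps)
    return torch.cat([mean, std], dim=-1)  # (batch, 2 * channels)

# Two utterances of different lengths still map to the same embedding size
print(statistics_pooling(torch.randn(1, 1536, 220)).shape)  # torch.Size([1, 3072])
print(statistics_pooling(torch.randn(1, 1536, 500)).shape)  # torch.Size([1, 3072])
```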
3.1.5 Speaker Embedding
A speaker embedding is a fixed-length vector that summarizes the characteristics of a speaker's voice, typically learned from datasets containing thousands of speakers. During training, the model learns to focus on speaker-dependent features and to ignore variable factors like the language or background noise. The well-known model used for generating these embeddings in our system is ECAPA-TDNN, which has progressively improved in terms of accuracy and robustness.
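For reference, one possible way to obtain such embeddings from a pretrained ECAPA-TDNN is through the SpeechBrain toolkit, roughly as sketched below; the model source shown and the sampling-rate handling are assumptions and may differ from our final pipeline, and the reference file name is hypothetical.

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load a publicly available ECAPA-TDNN speaker encoder (trained on VoxCeleb).
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

# Load a reference recording and extract a fixed-length speaker embedding.
signal, sample_rate = torchaudio.load("reference_nepali_speaker.wav")  # hypothetical file
embedding = encoder.encode_batch(signal)        # shape: (1, 1, 192)
speaker_vector = embedding.squeeze()            # fixed-length vector for conditioning TTS
print(speaker_vector.shape)
```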
3.1.6 Vocoder
A vocoder is a critical module in the speech synthesis pipeline that converts intermediate acoustic representations, specifically mel-spectrograms, into time-domain audio waveforms. In this work, our project uses HiFi-GAN, a state-of-the-art neural vocoder built on generative adversarial networks (GANs).
HiFi-GAN learns to generate realistic speech waveforms by modeling the complex re-
lationship between Mel-spectrograms and raw audio signals. Compared to traditional
vocoders such as Griffin-Lim, HiFi-GAN achieves significantly higher audio quality
and faster inference speeds, making it suitable for real-time applications.
The vocoder is trained to minimize distortion and preserve naturalness, resulting in out-
put speech that is clear, intelligible, and natural-sounding. Using HiFi-GAN allows the
overall TTS system to efficiently produce high-fidelity audio, completing the transfor-
mation from text input to natural speech output.
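At inference time only the HiFi-GAN generator is needed, and its interface is simple: it maps a mel-spectrogram to a waveform. The sketch below assumes a hypothetical `load_hifigan_generator` helper that returns the generator as a PyTorch module; it is meant to show the data flow, not a specific checkpoint or API.

```python
import torch

def mel_to_waveform(generator: torch.nn.Module, log_mel: torch.Tensor) -> torch.Tensor:
    """Run a GAN vocoder generator on a log-mel spectrogram.

    log_mel: (n_mels, frames) tensor produced by the acoustic model.
    Returns a 1-D waveform tensor of length roughly frames * hop_length.
    """
    generator.eval()
    with torch.no_grad():
        mel_batch = log_mel.unsqueeze(0)          # (1, n_mels, frames)
        waveform = generator(mel_batch)           # (1, 1, samples) for HiFi-GAN-style models
    return waveform.squeeze()

# Hypothetical usage (helper and checkpoint name are assumptions):
# generator = load_hifigan_generator("hifigan_checkpoint.pt")
# audio = mel_to_waveform(generator, predicted_log_mel)
# torchaudio.save("output.wav", audio.unsqueeze(0), 16000)
```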
CHAPTER 4
METHODOLOGY
4.1 Block Diagram
The block diagram represents the text-to-speech (TTS) synthesis system, a neural network architecture combining acoustic feature generation and vocoding to produce human-like speech from text input. Its main components are described below, followed by a short sketch of how they connect.
1. Input Stages
• Phoneme: The process begins with the input text being converted into a
sequence of phonemes, which are the smallest units of sound in a language.
This step is crucial for representing the linguistic content that will be syn-
thesized into speech.
• Log-Mel Spectrogram: On the acoustic side, the system starts with a log-
Mel spectrogram, which is a visual representation of the spectrum of fre-
quencies in a sound signal as it varies with time, transformed into the Mel
scale to mimic human auditory perception.
2. Feature Extraction and Embedding
• Phoneme Embedding: The phoneme sequence is passed through a phoneme
embedding layer, which converts the discrete phonemes into dense vector
representations that can be processed by neural networks. This helps cap-
ture the semantic and contextual relationships between phonemes.
• ECAPA-TDNN: The log-Mel spectrogram of the reference audio is processed by the ECAPA-TDNN (Emphasized Channel Attention, Propagation, and Aggregation - Time Delay Neural Network), a model typically used for speaker verification. Here, it extracts speaker-specific features from the reference recording for further processing.
3. Speaker Embedding
• Speaker Embedding: This component generates a fixed-length vector that
encodes the characteristics of a specific speaker’s voice. This embedding
is crucial for multi-speaker TTS systems, allowing the model to synthesize
speech in different voices based on the input speaker identity.
4. Encoding and Transformation
• Transformer Encoder: The phoneme embeddings are fed into a Trans-
former Encoder, a type of neural network architecture known for its effec-
tiveness in sequence modeling tasks. It processes the phoneme sequence to
create a rich contextual representation.
• Linear Layer (after Speaker Embedding): The speaker embedding is
passed through a linear layer to transform it into a format compatible with
the subsequent stages.
• Variance Adapter: This component adjusts the variance (e.g., duration,
pitch, or energy) of the encoded phoneme sequence. It helps control the
prosody (rhythm and intonation) of the synthesized speech, making it more
natural.
5. Decoder and Reconstruction
• Transformer Decoder: The Transformer Decoder takes the encoded phoneme
representations and the variance-adapted features, along with the speaker
embedding, to generate a sequence of log-Mel spectrograms. This step
aligns the linguistic content with the acoustic features.
• Linear Layer (after Transformer Decoder): Another linear layer refines
the output of the Transformer Decoder, ensuring the generated log-Mel
spectrogram is in the correct format.
6. Spectrogram to Audio
• Log-Mel Spectrogram (Generated): The output of the Transformer De-
coder is a predicted log-Mel spectrogram, which represents the target acous-
tic features.
• Vocoder: The vocoder converts the log-Mel spectrogram back into a time-
domain audio waveform. This step involves sophisticated signal processing
to reconstruct the raw audio signal, leveraging techniques like WaveNet or
HiFi-GAN to produce high-quality speech.
7. Final Output
• Audio: The final output is the synthesized audio waveform, which is a
human-like speech signal corresponding to the input text, rendered in the
voice defined by the speaker embedding.
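The following minimal PyTorch sketch shows how these blocks are wired together at a high level. The `encoder`, `variance_adaptor`, and `decoder` arguments are placeholders standing in for the actual Transformer encoder, variance adaptor, and Transformer decoder modules, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CloneTheToneTTS(nn.Module):
    """High-level wiring of the block diagram (illustrative sketch only)."""

    def __init__(self, encoder, variance_adaptor, decoder,
                 n_phonemes=80, hidden=256, n_mels=80, spk_dim=192):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, hidden)
        self.speaker_projection = nn.Linear(spk_dim, hidden)   # linear layer after the speaker embedding
        self.encoder = encoder
        self.variance_adaptor = variance_adaptor
        self.decoder = decoder
        self.mel_linear = nn.Linear(hidden, n_mels)            # linear layer after the decoder

    def forward(self, phoneme_ids, speaker_embedding):
        # 1. Embed the phoneme sequence and encode its context.
        x = self.encoder(self.phoneme_embedding(phoneme_ids))
        # 2. Add the projected speaker identity to every encoder frame.
        x = x + self.speaker_projection(speaker_embedding).unsqueeze(1)
        # 3. Adjust duration, pitch, and energy of the sequence.
        x = self.variance_adaptor(x)
        # 4. Decode to a log-mel spectrogram; a vocoder then turns it into audio.
        return self.mel_linear(self.decoder(x))

# Smoke test with identity placeholders for the sub-modules:
model = CloneTheToneTTS(nn.Identity(), nn.Identity(), nn.Identity())
mel = model(torch.randint(0, 80, (2, 15)), torch.randn(2, 192))
print(mel.shape)  # torch.Size([2, 15, 80])
```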
4.2 Data Collection
The TTS model is built from scratch for Nepali. A good amount of Nepali speech data is collected from openslr.org, specifically the SLR143 and SLR54 datasets.
4.3 Data Preprocessing
The collected data consists of a '.tsv' file and audio files. The .tsv file contains information about the data separated by tab characters: mainly the speaker ID, the transcript of the audio file, and the corresponding audio file name or path.
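A minimal way to read such a metadata file is sketched below; the column order (speaker ID, audio file name, transcript) and the file path are assumptions and may differ between SLR143 and SLR54.

```python
import csv

def load_metadata(tsv_path: str):
    """Read the tab-separated metadata file into a list of records."""
    records = []
    with open(tsv_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 3:
                continue  # skip malformed lines
            speaker_id, audio_name, text = row[0], row[1], row[2]
            records.append({"speaker": speaker_id, "audio": audio_name, "text": text})
    return records

# Example (path is hypothetical):
# metadata = load_metadata("data/slr143/utt_spk_text.tsv")
# print(metadata[0])
```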
4.3.1 Audio Preprocessing
For audio preprocessing, the volume of the audio is normalized (neither too high nor too low) and the background noise and silence are trimmed. The audio is resampled to 16 kHz, as this sampling rate is more than enough to capture the natural timbre, prosody, and clarity of the human voice, and the channel format is set to mono. Finally, the audio is converted to a mel-spectrogram.
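A sketch of this loading step, assuming librosa is used for resampling and downmixing (torchaudio would work equally well); the file path is hypothetical.

```python
import librosa
import soundfile as sf

def load_audio_16k_mono(path: str):
    """Load an audio file, downmix to mono, and resample to 16 kHz."""
    audio, sr = librosa.load(path, sr=16000, mono=True)  # float32 samples in [-1, 1]
    return audio, sr

# Example (path is hypothetical):
# audio, sr = load_audio_16k_mono("data/slr54/speaker_01/utt_0001.flac")
# sf.write("preprocessed.wav", audio, sr)
```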
Volume Normalization
The volume of the audio is normalized using the RMS formula. The process is
described step by step as follows:
• Root Mean Square: First, calculate the RMS of the audio signal:
$$\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2}$$
where:
– $x_i$ is the $i$-th sample of the audio data,
– $N$ is the total number of samples in the audio.
• Current Decibel: The current loudness in decibels (dB) is calculated from the RMS value:
$$\mathrm{current_{dB}} = 20 \cdot \log_{10}(\mathrm{RMS} + \varepsilon)$$
• Gain Calculation: The gain needed to reach the target loudness $\mathrm{target_{dB}}$ is
$$\mathrm{gain_{dB}} = \mathrm{target_{dB}} - \mathrm{current_{dB}}, \qquad \mathrm{gain} = 10^{\mathrm{gain_{dB}}/20}$$
• Apply Gain: Finally, the audio data is scaled by the gain factor to normalize its volume:
$$x_i' = x_i \cdot \mathrm{gain}$$
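A NumPy sketch of this normalization; the target loudness of −20 dB is an illustrative, tunable choice.

```python
import numpy as np

def normalize_volume(audio: np.ndarray, target_db: float = -20.0, eps: float = 1e-10) -> np.ndarray:
    """Scale the waveform so its RMS loudness matches target_db."""
    rms = np.sqrt(np.mean(audio ** 2))
    current_db = 20.0 * np.log10(rms + eps)
    gain = 10.0 ** ((target_db - current_db) / 20.0)
    return audio * gain

# Example with a synthetic signal:
sine = 0.05 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
normalized = normalize_volume(sine)
print(20 * np.log10(np.sqrt(np.mean(normalized ** 2))))  # ~ -20.0
```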
Silence Trimming:
The audio is split into short frames and the loudness of each frame is measured in decibels:
$$\mathrm{dbFrames}_i = 20 \cdot \log_{10}\!\left(\sqrt{\frac{1}{N}\sum_{n=1}^{N} x_{i,n}^2} + \varepsilon\right)$$
where $x_{i,n}$ is the $n$-th sample of the $i$-th frame and $N$ is the frame length. Only the frames whose loudness exceeds a silence threshold are kept:
$$\mathrm{keepFrames} = \{\, i : \mathrm{dbFrames}_i > \mathrm{threshold} \,\}$$
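A sketch of frame-level silence trimming consistent with these formulas; the frame length, hop, and −40 dB threshold are illustrative assumptions rather than the exact values used in our preprocessing.

```python
import numpy as np

def trim_silence(audio: np.ndarray, frame_len: int = 400, hop: int = 160,
                 threshold_db: float = -40.0, eps: float = 1e-10) -> np.ndarray:
    """Keep only the frames whose RMS loudness exceeds threshold_db."""
    kept = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len]
        db = 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)) + eps)
        if db > threshold_db:
            kept.append(frame[:hop])  # keep the non-overlapping part of loud frames
    return np.concatenate(kept) if kept else audio

# Example: half a second of silence followed by a tone; the silent part is removed.
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(8000) / 16000)
audio = np.concatenate([np.zeros(8000), tone])
print(len(audio), len(trim_silence(audio)))
```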
Each frame of the audio signal is multiplied by a Hamming window to reduce spectral leakage. The Hamming window is defined as:
$$w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \qquad n = 0, 1, \ldots, N-1$$
where $N$ is the frame length in samples. This window is applied to each frame of the audio signal to smooth the edges and reduce leakage.
The audio signal is split into overlapping frames. For each frame, the windowed signal is calculated. The $n$-th sample of the $i$-th frame is computed as:
$$\mathrm{frame}_i[n] = x[(i-1)\cdot \mathrm{hopLen} + n] \cdot w[n]$$
where:
• $x[(i-1)\cdot \mathrm{hopLen} + n]$ is the $n$-th sample of the $i$-th frame in the original signal,
• $w[n]$ is the Hamming window applied to the frame,
• $\mathrm{hopLen}$ is the hop length (or step size) between frames.
This process divides the signal into smaller, overlapping segments (frames), which are then processed individually.
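A NumPy sketch of framing and windowing under these definitions; the frame and hop lengths are illustrative assumptions.

```python
import numpy as np

def frame_and_window(audio: np.ndarray, frame_len: int = 400, hop_len: int = 160) -> np.ndarray:
    """Split the signal into overlapping frames and apply a Hamming window.

    Returns an array of shape (num_frames, frame_len).
    """
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    num_frames = 1 + (len(audio) - frame_len) // hop_len
    frames = np.stack([
        audio[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames

frames = frame_and_window(np.random.randn(16000))
print(frames.shape)  # (98, 400) for 1 s of 16 kHz audio
```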
For each windowed frame, the FFT (Fast Fourier Transform) is computed. The Discrete Fourier Transform (DFT) formula is as follows:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j \frac{2\pi k n}{N}}, \qquad k = 0, 1, \ldots, N-1$$
where:
• $X(k)$ is the FFT value of the $k$-th frequency bin of the frame,
• $x(n)$ is the $n$-th sample of the windowed frame,
• $N$ is the number of samples in the frame (fftLen),
• $j$ is the imaginary unit.
This operation converts the time-domain signal (frame) into the frequency domain. The resulting FFT is a complex number representing both magnitude and phase for each frequency bin.
The power spectrum is the square of the magnitude of the FFT for each frame:
$$P_i(\omega) = |X_i(\omega)|^2$$
where:
• $P_i(\omega)$ is the power spectrum of the $i$-th frame at frequency $\omega$,
• $X_i(\omega)$ is the complex FFT result of the $i$-th frame.
The power spectrum reflects how much power (energy) is present at each frequency bin for each frame.
Since the FFT result for real-valued signals is symmetric, we keep only the positive frequencies, up to the Nyquist frequency. The final power spectrum is given by:
$$P_{\mathrm{positive}} = P_i(\omega), \qquad \omega = 0, 1, \ldots, \left\lfloor \tfrac{N}{2} \right\rfloor$$
where:
• $P_{\mathrm{positive}}$ is the final power spectrum after keeping only the positive frequencies.
This step reduces redundancy and focuses on the relevant part of the spectrum.
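Continuing the sketch, the power spectrum of each windowed frame can be computed with NumPy's real FFT, which already returns only the positive-frequency bins; the FFT length of 512 is an assumption.

```python
import numpy as np

def power_spectrum(frames: np.ndarray, fft_len: int = 512) -> np.ndarray:
    """Compute |FFT|^2 of each windowed frame, keeping only positive frequencies.

    frames: (num_frames, frame_len); returns (num_frames, fft_len // 2 + 1).
    """
    spectrum = np.fft.rfft(frames, n=fft_len, axis=-1)  # complex, positive bins only
    return np.abs(spectrum) ** 2

power = power_spectrum(np.random.randn(98, 400))
print(power.shape)  # (98, 257)
```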
The Mel filter bank is commonly used in speech and audio processing to convert fre-
quency bins from a linear scale to the Mel scale. The Mel scale approximates the human
ear’s perception of pitch, which is more sensitive to lower frequencies and less sensitive
to higher frequencies.
The Mel scale is a perceptual scale of pitches that approximates human hearing. The formula to convert a frequency $f$ in Hz to the Mel scale is:
$$\mathrm{Mel}(f) = 2595 \cdot \log_{10}\!\left(1 + \frac{f}{700}\right)$$
where $f$ is the frequency in Hz.
For the given signal, we start by converting the minimum and maximum frequencies into the Mel scale:
$$f_{\min} = 0, \qquad f_{\max} = \frac{f_s}{2}$$
where $f_s$ is the sampling rate. The corresponding Mel values for the minimum and maximum frequencies are:
$$\mathrm{Mel}_{\min} = 2595 \cdot \log_{10}\!\left(1 + \frac{f_{\min}}{700}\right), \qquad \mathrm{Mel}_{\max} = 2595 \cdot \log_{10}\!\left(1 + \frac{f_{\max}}{700}\right)$$
We then generate $\mathrm{numMel} + 2$ equally spaced points on the Mel scale between $\mathrm{Mel}_{\min}$ and $\mathrm{Mel}_{\max}$.
3. Convert Mel Points Back to Hz
Next, we convert these Mel points back to the linear frequency scale (Hz) using the inverse Mel formula:
$$f = 700 \cdot \left(10^{\frac{\mathrm{Mel}(f)}{2595}} - 1\right)$$
The resulting hzPoints correspond to the frequency locations of the Mel bins.
We now convert the Mel frequency points into corresponding FFT bin indices. The FFT bin index for a frequency $f$ is given by:
$$\mathrm{bin} = \left\lfloor \frac{(\mathrm{fftLen} + 1) \cdot f}{f_s} \right\rfloor$$
where $\mathrm{fftLen}$ is the FFT length and $f_s$ is the sampling rate. This formula converts each Mel frequency point into a corresponding FFT bin index.
For each Mel bin, we create a triangular filter. Each filter rises linearly from the previous bin's index to the center of the Mel bin, and then falls linearly to the next bin's index. For the $m$-th Mel bin, the filter is defined as:
$$\mathrm{filter}_m[k] = \begin{cases} \dfrac{k - f_{m-1}}{f_m - f_{m-1}} & \text{for } f_{m-1} \le k < f_m \\[1ex] \dfrac{f_{m+1} - k}{f_{m+1} - f_m} & \text{for } f_m \le k < f_{m+1} \end{cases}$$
where $f_{m-1}$, $f_m$, and $f_{m+1}$ are the FFT bin indices of the previous, current, and next Mel points.
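A NumPy sketch of building such a filter bank from these formulas; the sampling rate, FFT length, and numMel = 80 are assumptions matching a typical 80-bin mel-spectrogram setup.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sample_rate: int = 16000, fft_len: int = 512, num_mel: int = 80) -> np.ndarray:
    """Return a (num_mel, fft_len // 2 + 1) matrix of triangular mel filters."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_mel + 2)
    bins = np.floor((fft_len + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    filters = np.zeros((num_mel, fft_len // 2 + 1))
    for m in range(1, num_mel + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising slope
            filters[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            filters[m - 1, k] = (right - k) / max(right - center, 1)
    return filters

print(mel_filter_bank().shape)  # (80, 257)
```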
The Log Mel Spectrogram is calculated in three steps: computing the Mel spectrogram, applying a logarithmic transformation, and normalizing the result.
1. Mel Spectrogram
The Mel spectrogram is computed by applying the Mel filter bank to the magnitude spectrogram of the signal. The Mel spectrogram at time frame $t$ and Mel bin $m$ is given by:
$$\mathrm{MelSpec}(t, m) = \sum_{f=0}^{N-1} |\mathrm{FFT}(x(t))|_f \cdot H_m(f)$$
where:
• $\mathrm{MelSpec}(t, m)$ is the Mel spectrogram value at time frame $t$ and Mel bin $m$,
• $|\mathrm{FFT}(x(t))|_f$ is the magnitude of the FFT of the signal $x(t)$ at frequency bin $f$,
• $H_m(f)$ is the Mel filter bank value at frequency bin $f$ and Mel bin $m$,
• $N$ is the number of FFT bins.
2. Logarithmic Transformation
$$\mathrm{logMel}(t, m) = \log\big(\mathrm{MelSpec}(t, m) + \varepsilon\big)$$
where:
• $\mathrm{logMel}(t, m)$ is the log Mel spectrogram value at time frame $t$ and Mel bin $m$,
• $\mathrm{MelSpec}(t, m)$ is the Mel spectrogram value at time frame $t$ and Mel bin $m$,
• $\varepsilon = 1 \times 10^{-10}$ is a small constant added to avoid computing $\log(0)$, which is undefined.
3. Normalization
Finally, the log Mel spectrogram is normalized to have zero mean and unit variance across the entire utterance:
$$\mathrm{normLogMel}(t, m) = \frac{\mathrm{logMel}(t, m) - \mu}{\sigma}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the log Mel spectrogram computed over the whole utterance.
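Putting the last three steps together, a NumPy sketch of the normalized log-mel computation is shown below; the frame count, FFT length, and 80-bin filter bank in the example are illustrative assumptions.

```python
import numpy as np

def log_mel_spectrogram(frames: np.ndarray, mel_filters: np.ndarray,
                        fft_len: int = 512, eps: float = 1e-10) -> np.ndarray:
    """Compute the normalized log-mel spectrogram of windowed frames.

    frames:      (num_frames, frame_len) windowed audio frames
    mel_filters: (num_mel, fft_len // 2 + 1) triangular filter bank
    Returns:     (num_frames, num_mel) zero-mean, unit-variance log-mel features
    """
    magnitude = np.abs(np.fft.rfft(frames, n=fft_len, axis=-1))   # |FFT(x(t))|
    mel_spec = magnitude @ mel_filters.T                          # apply the filter bank
    log_mel = np.log(mel_spec + eps)                              # avoid log(0)
    return (log_mel - log_mel.mean()) / log_mel.std()             # utterance-level normalization

# Example with random frames and an 80-bin filter bank of matching width:
frames = np.random.randn(98, 400)
filters = np.random.rand(80, 257)
features = log_mel_spectrogram(frames, filters)
print(features.shape)  # (98, 80), mean ~ 0, std ~ 1
```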
4.3.2 Text Preprocessing
1. Text Normalization: This step standardizes the input text by converting all characters to lowercase, removing punctuation marks, and eliminating special characters to ensure consistency for phonetic processing.
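A sketch of such a normalization step, assuming Devanagari input and treating the Nepali danda (।) and common punctuation as characters to strip; the exact set of characters to keep is a design choice and may differ in our final pipeline.

```python
import re
import unicodedata

def normalize_nepali_text(text: str) -> str:
    """Lowercase, strip punctuation/special characters, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    # Keep Devanagari characters (U+0900-U+097F), basic Latin letters/digits, and spaces.
    text = re.sub(r"[^\u0900-\u097Fa-z0-9\s]", " ", text)
    text = text.replace("\u0964", " ")          # drop the danda '।' sentence marker
    return re.sub(r"\s+", " ", text).strip()

print(normalize_nepali_text("नमस्ते, संसार! यो परीक्षण हो।"))
```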
4.4 Overview of the System
Our system allows users to clone a voice in the Nepali language. To do this, the user
must upload a reference Nepali audio sample either by recording it using a microphone
or by uploading a pre-recorded audio file. After that, the user provides a text in Nepali,
and the system will generate speech in the cloned Nepali voice for the given text.
4.5 Software Development Life Cycle
This project is being developed using an incremental methodology, since it offers a functioning prototype at an early stage of development. In the incremental model, the system can be updated in future iterations after receiving user feedback. This is very beneficial for our system, because it relies on feedback from users, and additional operations and features can be added in later increments. The system could also later be extended with further capabilities, such as character recognition.
CHAPTER 5
EPILOGUE
5.1 Work Completed and Work Remaining
5.1.1 Work Completed
Audio pre-processing: The conversion of the audio signal to a log-Mel spectrogram has been achieved, and the vocoder model is ready.
5.1.2 Work Remaining
The phoneme conversion for text is not written yet. The data pipeline for processing au-
dio and text inputs needs to be implemented. Additionally, models for speaker embed-
ding and log-Mel spectrogram generation must be developed. Once model training and
evaluation are complete, a proper user interface should be designed and implemented.
REFERENCES
[1] C.-M. Chien, J.-H. Lin, C.-y. Huang, P.-c. Hsu, and H.-y. Lee, “Investigating on
incorporating pretrained and learnable speaker representations for multi-speaker
multi-style text-to-speech,” in ICASSP 2021-2021 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 8588–
8592.
[2] Q. Xie, X. Tian, G. Liu, K. Song, L. Xie, Z. Wu, H. Li, S. Shi, H. Li, F. Hong et al.,
“The multi-speaker multi-style voice cloning challenge 2021,” in ICASSP 2021-
2021 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2021, pp. 8613–8617.
[3] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech 2:
Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558,
2020.
[6] E. Zhang, Y. Wu, and Z. Tang, “Sc-ecapatdnn: Ecapa-tdnn with separable convolu-
tional for speaker recognition,” in International Conference on Intelligence Science.
Springer, 2024, pp. 286–297.