CLONE-THE-TONE: FASTSPEECH 2 AND
ECAPA-TDNN BASED VOICE CLONING SYSTEM
IN NEPALI
Submitted by:
SANDIP SHARMA [KAN078BCT076]
PRAJESH BILASH PANTA [KAN078BCT056]
SUJAN DEULA [KAN078BCT089]
RAJ KARAN SAH [KAN078BCT060]
Submitted to:
Department of Computer and Electronics Engineering
Kantipur Engineering College
Dhapakhel, Lalitpur
July, 2025
ABSTRACT
This work focuses on the task of few-shot multi-speaker, multi-style voice cloning, where the goal is to generate speech that closely mimics both the voice and speaking style of a target speaker using only a few reference samples. This is particularly challenging because the model must generalize well from very limited data. To tackle this, the study explores different ways to represent speaker identity, known as speaker embeddings, and proposes a method that combines two types: pre-trained embeddings (learned from another task, such as voice conversion) and learnable embeddings (learned directly during model training). Among the embedding types tested, those pre-trained using voice conversion techniques proved the most effective at capturing speaker characteristics. These embeddings are then integrated into FastSpeech 2, a fast and high-quality text-to-speech model. The combination of pre-trained and learnable embeddings significantly improves the model's ability to generalize to new, unseen speakers, even with only one or a few examples. As a result, this approach achieved second place in the one-shot track of the ICASSP 2021 M2VoC Challenge, demonstrating its strong performance in few-shot voice cloning.
TABLE OF CONTENTS
Abstract
List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Objectives
  1.4 Application and Scope of project
    1.4.1 Application
    1.4.2 Scope
  1.5 Project Features
  1.6 Feasibility Study
    1.6.1 Economic Feasibility
    1.6.2 Technical Feasibility
    1.6.3 Operational Feasibility
    1.6.4 Schedule Feasibility
  1.7 System Requirement
2 Literature Review
  2.1 Related Research
3 Theoretical Foundation
  3.1 Text-to-Speech
    3.1.1 FastSpeech 2
    3.1.2 Variance Adaptor
    3.1.3 Duration/pitch/energy Predictor
    3.1.4 ECAPA-TDNN
    3.1.5 Speaker Embedding
    3.1.6 Vocoder
4 Methodology
  4.1 Block Diagram
  4.2 Data Collection
  4.3 Data Preprocessing
    4.3.1 Audio Preprocessing
    4.3.2 Text Preprocessing
  4.4 Overview of the System
  4.5 Software Development Life Cycle
5 Epilogue
  5.1 Work Completed and Work Remaining
    5.1.1 Work Completed
    5.1.2 Work Remaining
References
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
LN Layer Normalization
TTS Text-to-Speech
CHAPTER 1
INTRODUCTION
1.1 Background
Neural network-based text-to-speech (TTS) has made rapid progress and attracted a lot of attention in the machine learning and speech community in recent years. TTS has been proven capable of generating high-quality and human-like speech [1]. Non-autoregressive TTS models such as FastSpeech (Ren et al., 2019) can synthesize speech significantly faster than previous autoregressive models with comparable quality. This project applies the concepts of FastSpeech and speaker embeddings to build a voice cloning system. Voice cloning is a rapidly emerging field in artificial intelligence that focuses on the replication of a human voice using computational models. It enables the creation of synthetic speech that closely mimics the tone, pitch, accent, speaking style, and other vocal characteristics of a specific individual. By analyzing a relatively small amount of recorded speech data, voice cloning systems can generate new speech that sounds as if it were spoken by the original speaker, even if the words or sentences were never actually uttered by them. This has opened the door to a wide range of innovative applications, including personalized virtual assistants, dubbing for film and media, accessibility tools for individuals with speech impairments, and conversational AI agents.
The voice cloning process can be broadly classified into two types: speaker-adaptive
voice cloning, which fine-tunes a base model using data from the target speaker, and
speaker-agnostic or zero-shot voice cloning, which uses pretrained models and speaker
embeddings to mimic voices without additional training. The latter is particularly at-
tractive for scalable applications due to its flexibility and efficiency.
However, while voice cloning holds enormous potential, it also brings ethical chal-
lenges, such as the risk of misuse for impersonation or generating deepfakes. Therefore,
any research or deployment of voice cloning technology must also consider responsible
AI practices, including consent, transparency, and security.
1.3 Objectives
1.4 Application and Scope of project
1.4.1 Application
1. Customer Services: Voice cloning can produce virtual agents that sound more natural and human-like. Instead of robotic or generic voices, businesses can clone a real individual's voice to make customer support systems more friendly and interactive. This helps customers feel more comfortable and connected while interacting.
2. Healthcare: Voice cloning aids patients who lose their voice. They can communicate through speech devices using a copy of their own voice, which is more reassuring and natural. Doctors can also leave patients clear, warm voice messages, allowing for effective and comforting communication.
3. Education and Training: Voice cloning can be used to create virtual teachers or assistants with voices similar to real educators, making online learning more engaging and convenient. It can also assist students in learning correct pronunciation during language studies using clear and natural-sounding voices. Cloned voices can likewise be used in training videos or courses to teach topics in a friendly and consistent format.
1.4.2 Scope
1.5 Project Features
1. Few-Shot Voice Cloning: Speakers or users can clone their voices by providing just a few minutes of samples.
2. Multi-Speaker Support: The model can handle multi-speaker voice cloning, wherein multiple users can add their voices by each giving a few minutes of samples. The model can then generate speech in any of the given voices without being trained separately for each user.
3. Custom Voice Cloning Interface: The system has a voice cloning interface personalized to an individual's voice, where users can record or upload a short voice sample of themselves. They can then type any Nepali text, and the system will read it out in their own voice clone. The interface is clean and user-friendly, and it also supports real-time voice personalization.
1.6 Feasibility Study
1.6.1 Economic Feasibility
The economic feasibility of this project is favorable, as it utilizes pre-trained models such as ECAPA-TDNN for extracting discriminative speaker embeddings from raw speech and a vocoder model for converting mel-spectrograms to audio waveforms, both of which are publicly available and do not require licensing fees. The overall cost-effectiveness depends on the scale of fine-tuning required, the availability of existing computational resources, and the efficiency of the training process.
1.6.2 Technical Feasibility
The project is technically highly feasible, as it can be developed on both Windows and macOS using Python, a widely supported and versatile programming language. Python offers a rich ecosystem of libraries like TensorFlow and PyTorch, which streamline the development, training, and evaluation of deep learning models. The availability of pre-trained models, such as ECAPA-TDNN for extracting discriminative speaker embeddings from raw speech and a vocoder model for converting mel-spectrograms to audio waveforms, allows for efficient integration, leveraging their learned features while training the Transformer-based TTS model from scratch. Additionally, cloud-based platforms and GPU acceleration tools, such as Google Colab (with limited free usage) and CUDA for NVIDIA GPUs, significantly enhance computational performance. Given these readily available technologies, the project can be implemented effectively without major compatibility or infrastructure issues.
1.6.3 Operational Feasibility
The operational feasibility of this project ensures that it can be effectively integrated and used within the current environment. The system is designed with a user-friendly interface, making it accessible even to users with minimal technical skills. Implementation and maintenance will consider both hardware and software requirements to ensure smooth operation without excessive resource consumption.
1.6.4 Schedule Feasibility
The Gantt chart presented in Fig. 1.1 represents our project's timeline, with the various tasks plotted against dates. The x-axis shows the timeline from June 2025 to February 2026, while the y-axis lists the tasks involved in the project. Documentation and reporting spans the longest period, running throughout almost the entire timeline and ensuring that project outcomes are properly recorded. Planning and familiarization was a shorter task occurring in the earlier phase. Design started after that, followed by development. Coding happened after data processing, while testing and debugging occurred toward the later stages.
1.7 System Requirement
CHAPTER 2
LITERATURE REVIEW
With the development of deep learning, speech synthesis has made significant progress
in recent years. While recently proposed end-to-end speech synthesis systems, e.g.,
Tacotron, DurIAN and FastSpeech, are able to generate high-quality and natural sound-
ing speech, these models usually rely on a large amount of training data from a single
speaker. The speech quality, speaker similarity, expressiveness and robustness of syn-
thetic speech are still not systematically examined for different speakers and various
speaking styles, especially in real-world low-resourced conditions, e.g., each speaker
only has a few samples at hand for cloning. However, this so-called multi-speaker
multi-style voice cloning task has found significant applications on customized TTS
[2]. Imitating speaking style is one of the desired abilities of a TTS system. Several
strategies have been recently proposed to model stylistic or expressive speech for end-
to-end TTS. Speaking style comes with different patterns in prosody, such as rhythm,
pause, intonation, and stress. Hence, directly modeling the prosodic aspects of speech is beneficial for stylization.
In [2], the Variational Autoencoder (VAE) and Global Style Tokens (GST) are described as two typical models built upon sequence-to-sequence architectures for style modeling. GST models style in an unsupervised way, using multi-head attention to learn a similarity measure between the reference embedding and each token in a bank of randomly initialized embeddings, while the Text-Predicted Global Style Token (TP-GST) learns to predict stylistic renderings from text alone, requiring neither explicit labels during training nor auxiliary inputs for inference. Note that these studies on modeling speaker styles are mostly based on large amounts of data.
In [3], it is noted that previous neural TTS models first generate mel-spectrograms autoregressively from text and then synthesize speech from the generated mel-spectrograms using a separately trained vocoder. These models usually suffer from slow inference speed and robustness issues (word skipping and repeating). Non-autoregressive TTS models are designed to address these issues: they generate mel-spectrograms extremely fast and avoid robustness problems, while achieving voice quality comparable to previous autoregressive models. Among these non-autoregressive TTS methods, FastSpeech is one of the most successful. FastSpeech alleviates the one-to-many mapping problem in two ways: it reduces data variance on the target side by using mel-spectrograms generated by an autoregressive teacher model as the training target (i.e., knowledge distillation), and it introduces duration information (extracted from the attention map of the teacher model) to expand the text sequence to match the length of the mel-spectrogram sequence. These designs ease the learning of the one-to-many mapping problem in TTS.
To control the style of synthesized speech, the global style token (GST) is widely used to enable utterance-level style transfer. Some works also propose an auxiliary style classification task to disentangle style information from phonetic information in the utterances. Since speaker and style information is usually entangled in the training data, it is also possible to learn a latent representation that jointly models speaker and style information. In [1], pretrained and jointly-optimized speaker representations are applied to multi-speaker TTS models. Two different TTS frameworks, Tacotron 2 and FastSpeech 2, are studied. It is shown that with the jointly-optimized speaker representations alone, the TTS models do not generalize well to few-shot speakers, and that using different pretraining tasks results in significant performance differences. By combining both the pretrained and the learnable speaker representations, the experiments show that the audio quality and the speaker similarity of the synthesized speech improve significantly. The synthesized samples are available online, and the results with the FastSpeech 2 TTS framework achieved second place in the one-shot track of the ICASSP 2021 M2VoC challenge.
Neural text-to-speech models are capable of synthesizing natural human voices after being trained on several hours of high-quality single-speaker or multi-speaker recordings. However, to adapt to new speaker voices, these TTS models are fine-tuned using a large amount of speech data, which makes scaling TTS models to a large number of speakers very expensive. Fine-tuning TTS models to new speakers can be challenging for a number of reasons. First, the original TTS model should be pre-trained on a large multi-speaker corpus so that it generalizes well to new voices and different recording conditions. Second, fine-tuning the whole TTS model is very parameter-inefficient, since a new set of weights is needed for every newly adapted speaker. Currently, there are two approaches to make adaptation of TTS more efficient: the first is to modify only the parameters directly related to speaker identity, and the second is to attach a light voice-conversion post-processing module to the baseline TTS model. A third challenge is to reduce the amount of speech required to add a new speaker to an existing TTS model. The paper [4] proposes a new parameter-efficient method for tuning an existing multi-speaker TTS model for new speakers. First, a base multi-speaker TTS model is pre-trained on a large and diverse TTS dataset. To extend the model to new speakers, a few adapters (small modules) are added to the base model, using vanilla adapters, unified adapters, or BitFit. The pre-trained model is then frozen and only the adapters are fine-tuned on the new speaker's data.
In recent years, x-vectors and their subsequent improvements have consistently provided state-of-the-art results on the task of speaker verification [5]. Improving upon the original Time Delay Neural Network (TDNN) architecture is an active area of research. The rising popularity of the x-vector system has resulted in significant architectural improvements and optimized training procedures over the original approach. The topology of the system was improved by incorporating elements of the popular ResNet architecture. Adding residual connections between the frame-level layers has been shown to enhance the embeddings. Additionally, residual connections enable the back-propagation algorithm to converge faster and help avoid the vanishing gradient problem.
CHAPTER 3
THEORETICAL FOUNDATION
3.1 Text-to-Speech
Text-to-Speech (TTS) is a technology that converts written text into spoken words using computer systems. The concept of TTS dates back to the late 1950s and early 1960s, with one of the earliest practical systems developed by Bell Labs in 1961. In fact, Bell Labs researchers John Larry Kelly Jr. and Louis Gerstman created a system that made a computer "sing" Daisy Bell, which later even inspired Arthur C. Clarke's 2001: A Space Odyssey. In this project, TTS is the core process that converts written Nepali text into spoken audio, mimicking the voice of a specific speaker. The goal is not just to generate speech, but to make it sound like it was spoken by a real person: the target speaker. To do this, the TTS system first takes the input Nepali text and converts it into a more suitable form, such as phonemes or characters. Then, using a deep learning model like FastSpeech 2, the system generates a sequence of mel-spectrograms, which are visual representations of how the audio should sound over time. These spectrograms capture information like the pitch, tone, and duration of each sound. What makes FastSpeech 2 powerful is that it is a non-autoregressive model, meaning it generates the entire speech sequence at once, making it faster and more stable than older models. Additionally, to achieve voice cloning, a speaker embedding, a numerical representation of a person's voice style, is added to the model. Finally, the generated spectrogram is passed through a vocoder (such as HiFi-GAN or WaveGlow), which converts it into a natural-sounding waveform. The result is high-quality Nepali speech that sounds like it was spoken by the target person, even though it was generated entirely from text.
3.1.1 FastSpeech 2
3.1.2 Variance Adaptor
3.1.3 Duration/pitch/energy Predictor
It provides a detailed view of the internal architecture of the predictor component used within or alongside the Variance Adaptor. This structure is a deep neural network with a multi-layered design optimized for feature extraction and prediction. The process begins with a Linear Layer, which transforms the input data into a suitable format for subsequent layers. This is followed by a combination of Layer Normalization (LN) and Dropout, where LN stabilizes the training process by normalizing the inputs across the feature dimension, and Dropout randomly deactivates a subset of neurons during training to prevent overfitting. The data then passes through a 1D Convolutional (Conv1D) layer paired with a Rectified Linear Unit (ReLU) activation function, which applies convolutional filters to extract local patterns and introduces non-linearity to model complex relationships. This LN-Dropout-Conv1D-ReLU sequence is repeated, allowing the network to iteratively refine the features. The architecture concludes with another Linear Layer, which produces the final predictions for duration, pitch, and energy. This layered approach leverages the strengths of convolutional operations for local feature detection, normalization for training stability, and dropout for regularization, resulting in a robust and generalized model capable of accurately predicting the target audio parameters.
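As an illustration, below is a minimal PyTorch sketch of a predictor with this Linear, (LN, Dropout, Conv1D, ReLU) x2, Linear layout; the hidden size, kernel size, and dropout rate are assumptions for illustration, not the exact hyperparameters of our model.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of the duration/pitch/energy predictor described above.
    Hidden size, kernel size, and dropout rate are illustrative assumptions."""

    def __init__(self, hidden_dim=256, kernel_size=3, dropout=0.5):
        super().__init__()
        self.input_linear = nn.Linear(hidden_dim, hidden_dim)
        # Two repeated blocks of LayerNorm -> Dropout -> Conv1D -> ReLU
        self.norms = nn.ModuleList([nn.LayerNorm(hidden_dim) for _ in range(2)])
        self.drops = nn.ModuleList([nn.Dropout(dropout) for _ in range(2)])
        self.convs = nn.ModuleList([
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
            for _ in range(2)
        ])
        self.output_linear = nn.Linear(hidden_dim, 1)  # one scalar per phoneme/frame

    def forward(self, x):
        # x: (batch, seq_len, hidden_dim) encoder output
        x = self.input_linear(x)
        for norm, drop, conv in zip(self.norms, self.drops, self.convs):
            x = drop(norm(x))
            # Conv1d expects (batch, channels, seq_len)
            x = torch.relu(conv(x.transpose(1, 2))).transpose(1, 2)
        return self.output_linear(x).squeeze(-1)  # (batch, seq_len)

# Example: predict one value (e.g., log-duration) per phoneme
pred = VariancePredictor()
print(pred(torch.randn(2, 17, 256)).shape)  # torch.Size([2, 17])
```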
3.1.4 ECAPA-TDNN
The squeeze-excitation block further refines the feature maps by modeling interdepen-
dencies between channels, boosting the model’s discriminative power.
5. Statistics Pooling: After processing the audio through the TDNN layers, ECAPA-
TDNN applies statistics pooling (e.g., mean and standard deviation) over the tempo-
ral dimension to create a fixed-length embedding. This embedding encapsulates the
speaker’s voice characteristics, making it suitable for comparison or synthesis tasks.
Feature Extraction: The TDNN layers with time delays extract temporal features,
while the ECA mechanism highlights relevant channels.
Output: The embedding can be used for speaker verification (comparing with other
embeddings) or as a feature input for downstream tasks like voice cloning.
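As a small illustration of the statistics-pooling step, the sketch below (a simplification; ECAPA-TDNN actually uses attentive statistics pooling) concatenates the temporal mean and standard deviation to obtain a fixed-length vector regardless of utterance length:

```python
import torch

def statistics_pooling(frame_features: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Pool variable-length frame-level features (batch, channels, time)
    into a fixed-length utterance vector by concatenating the temporal
    mean and standard deviation of every channel."""
    mean = frame_features.mean(dim=-1)
    std = torch.sqrt(frame_features.var(dim=-1, unbiased=False) + eps)
    return torch.cat([mean, std], dim=-1)  # (batch, 2 * channels)

# Two utterances of different lengths still map to the same embedding size
print(statistics_pooling(torch.randn(1, 1536, 220)).shape)  # torch.Size([1, 3072])
print(statistics_pooling(torch.randn(1, 1536, 500)).shape)  # torch.Size([1, 3072])
```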
3.1.5 Speaker Embedding
A speaker embedding is a fixed-length vector that summarizes the characteristics of a speaker's voice, typically learned from datasets containing thousands of speakers. During training, the model learns to focus on speaker-dependent features and to ignore variable factors like the language or background noise. The well-known model used for generating these embeddings in our system is ECAPA-TDNN, which has progressively improved in terms of accuracy and robustness.
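For reference, one possible way to obtain such embeddings from a pretrained ECAPA-TDNN is through the SpeechBrain toolkit, roughly as sketched below; the model source shown and the sampling-rate handling are assumptions and may differ from our final pipeline, and the reference file name is hypothetical.

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load a publicly available ECAPA-TDNN speaker encoder (trained on VoxCeleb).
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

# Load a reference recording and extract a fixed-length speaker embedding.
signal, sample_rate = torchaudio.load("reference_nepali_speaker.wav")  # hypothetical file
embedding = encoder.encode_batch(signal)        # shape: (1, 1, 192)
speaker_vector = embedding.squeeze()            # fixed-length vector for conditioning TTS
print(speaker_vector.shape)
```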
3.1.6 Vocoder
A vocoder is a critical module in the speech synthesis pipeline that converts intermediate acoustic representations, specifically mel-spectrograms, into time-domain audio waveforms. In this work, our project uses HiFi-GAN, a state-of-the-art neural vocoder built on generative adversarial networks (GANs).
HiFi-GAN learns to generate realistic speech waveforms by modeling the complex re-
lationship between Mel-spectrograms and raw audio signals. Compared to traditional
vocoders such as Griffin-Lim, HiFi-GAN achieves significantly higher audio quality
and faster inference speeds, making it suitable for real-time applications.
The vocoder is trained to minimize distortion and preserve naturalness, resulting in out-
put speech that is clear, intelligible, and natural-sounding. Using HiFi-GAN allows the
overall TTS system to efficiently produce high-fidelity audio, completing the transfor-
mation from text input to natural speech output.
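At inference time only the HiFi-GAN generator is needed, and its interface is simple: it maps a mel-spectrogram to a waveform. The sketch below assumes a hypothetical `load_hifigan_generator` helper that returns the generator as a PyTorch module; it is meant to show the data flow, not a specific checkpoint or API.

```python
import torch

def mel_to_waveform(generator: torch.nn.Module, log_mel: torch.Tensor) -> torch.Tensor:
    """Run a GAN vocoder generator on a log-mel spectrogram.

    log_mel: (n_mels, frames) tensor produced by the acoustic model.
    Returns a 1-D waveform tensor of length roughly frames * hop_length.
    """
    generator.eval()
    with torch.no_grad():
        mel_batch = log_mel.unsqueeze(0)          # (1, n_mels, frames)
        waveform = generator(mel_batch)           # (1, 1, samples) for HiFi-GAN-style models
    return waveform.squeeze()

# Hypothetical usage (helper and checkpoint name are assumptions):
# generator = load_hifigan_generator("hifigan_checkpoint.pt")
# audio = mel_to_waveform(generator, predicted_log_mel)
# torchaudio.save("output.wav", audio.unsqueeze(0), 16000)
```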
CHAPTER 4
METHODOLOGY
4.1 Block Diagram
The block diagram represents the text-to-speech (TTS) synthesis system, a neural network architecture combining acoustic feature generation and vocoding to produce human-like speech from text input. Its main components are described below, followed by a short sketch of how they connect.
1. Input Stages
• Phoneme: The process begins with the input text being converted into a
sequence of phonemes, which are the smallest units of sound in a language.
This step is crucial for representing the linguistic content that will be syn-
thesized into speech.
• Log-Mel Spectrogram: On the acoustic side, the system starts with a log-
Mel spectrogram, which is a visual representation of the spectrum of fre-
quencies in a sound signal as it varies with time, transformed into the Mel
scale to mimic human auditory perception.
2. Feature Extraction and Embedding
• Phoneme Embedding: The phoneme sequence is passed through a phoneme
embedding layer, which converts the discrete phonemes into dense vector
representations that can be processed by neural networks. This helps cap-
ture the semantic and contextual relationships between phonemes.
• ECAPA-TDNN: The log-Mel spectrogram of the reference audio is processed by the ECAPA-TDNN (Emphasized Channel Attention, Propagation, and Aggregation - Time Delay Neural Network), a model typically used for speaker verification. Here, it extracts speaker-specific features from the reference recording for further processing.
3. Speaker Embedding
• Speaker Embedding: This component generates a fixed-length vector that
encodes the characteristics of a specific speaker’s voice. This embedding
is crucial for multi-speaker TTS systems, allowing the model to synthesize
speech in different voices based on the input speaker identity.
4. Encoding and Transformation
• Transformer Encoder: The phoneme embeddings are fed into a Trans-
former Encoder, a type of neural network architecture known for its effec-
tiveness in sequence modeling tasks. It processes the phoneme sequence to
create a rich contextual representation.
• Linear Layer (after Speaker Embedding): The speaker embedding is
passed through a linear layer to transform it into a format compatible with
the subsequent stages.
• Variance Adapter: This component adjusts the variance (e.g., duration,
pitch, or energy) of the encoded phoneme sequence. It helps control the
prosody (rhythm and intonation) of the synthesized speech, making it more
natural.
5. Decoder and Reconstruction
• Transformer Decoder: The Transformer Decoder takes the encoded phoneme
representations and the variance-adapted features, along with the speaker
embedding, to generate a sequence of log-Mel spectrograms. This step
aligns the linguistic content with the acoustic features.
• Linear Layer (after Transformer Decoder): Another linear layer refines
the output of the Transformer Decoder, ensuring the generated log-Mel
spectrogram is in the correct format.
6. Spectrogram to Audio
• Log-Mel Spectrogram (Generated): The output of the Transformer De-
coder is a predicted log-Mel spectrogram, which represents the target acous-
tic features.
• Vocoder: The vocoder converts the log-Mel spectrogram back into a time-
domain audio waveform. This step involves sophisticated signal processing
to reconstruct the raw audio signal, leveraging techniques like WaveNet or
HiFi-GAN to produce high-quality speech.
7. Final Output
• Audio: The final output is the synthesized audio waveform, which is a
human-like speech signal corresponding to the input text, rendered in the
voice defined by the speaker embedding.
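The following minimal PyTorch sketch shows how these blocks are wired together at a high level. The `encoder`, `variance_adaptor`, and `decoder` arguments are placeholders standing in for the actual Transformer encoder, variance adaptor, and Transformer decoder modules, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CloneTheToneTTS(nn.Module):
    """High-level wiring of the block diagram (illustrative sketch only)."""

    def __init__(self, encoder, variance_adaptor, decoder,
                 n_phonemes=80, hidden=256, n_mels=80, spk_dim=192):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, hidden)
        self.speaker_projection = nn.Linear(spk_dim, hidden)   # linear layer after the speaker embedding
        self.encoder = encoder
        self.variance_adaptor = variance_adaptor
        self.decoder = decoder
        self.mel_linear = nn.Linear(hidden, n_mels)            # linear layer after the decoder

    def forward(self, phoneme_ids, speaker_embedding):
        # 1. Embed the phoneme sequence and encode its context.
        x = self.encoder(self.phoneme_embedding(phoneme_ids))
        # 2. Add the projected speaker identity to every encoder frame.
        x = x + self.speaker_projection(speaker_embedding).unsqueeze(1)
        # 3. Adjust duration, pitch, and energy of the sequence.
        x = self.variance_adaptor(x)
        # 4. Decode to a log-mel spectrogram; a vocoder then turns it into audio.
        return self.mel_linear(self.decoder(x))

# Smoke test with identity placeholders for the sub-modules:
model = CloneTheToneTTS(nn.Identity(), nn.Identity(), nn.Identity())
mel = model(torch.randint(0, 80, (2, 15)), torch.randn(2, 192))
print(mel.shape)  # torch.Size([2, 15, 80])
```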
4.2 Data Collection
The TTS model is built from scratch for Nepali. A good amount of Nepali speech data is collected from openslr.org, specifically the SLR143 and SLR54 datasets.
4.3 Data Preprocessing
The collected data consists of a '.tsv' file and audio files. The .tsv file contains information about the data separated by tab characters: mainly the speaker ID, the transcript of the audio file, and the corresponding audio file name or path.
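A minimal way to read such a metadata file is sketched below; the column order (speaker ID, audio file name, transcript) and the file path are assumptions and may differ between SLR143 and SLR54.

```python
import csv

def load_metadata(tsv_path: str):
    """Read the tab-separated metadata file into a list of records."""
    records = []
    with open(tsv_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 3:
                continue  # skip malformed lines
            speaker_id, audio_name, text = row[0], row[1], row[2]
            records.append({"speaker": speaker_id, "audio": audio_name, "text": text})
    return records

# Example (path is hypothetical):
# metadata = load_metadata("data/slr143/utt_spk_text.tsv")
# print(metadata[0])
```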
4.3.1 Audio Preprocessing
For audio preprocessing, the volume of the audio is normalized (neither too high nor too low) and the background noise and silence are trimmed. The audio is resampled to 16 kHz, as this sampling rate is more than enough to capture the natural timbre, prosody, and clarity of the human voice, and the channel format is set to mono. Finally, the audio is converted to a mel-spectrogram.
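A sketch of this loading step, assuming librosa is used for resampling and downmixing (torchaudio would work equally well); the file path is hypothetical.

```python
import librosa
import soundfile as sf

def load_audio_16k_mono(path: str):
    """Load an audio file, downmix to mono, and resample to 16 kHz."""
    audio, sr = librosa.load(path, sr=16000, mono=True)  # float32 samples in [-1, 1]
    return audio, sr

# Example (path is hypothetical):
# audio, sr = load_audio_16k_mono("data/slr54/speaker_01/utt_0001.flac")
# sf.write("preprocessed.wav", audio, sr)
```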
Volume Normalization
The volume of the audio is normalized using the RMS formula. The process is
described step by step as follows:
• Root Mean Square: First, calculate the RMS of the audio signal:
$$\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2}$$
where:
– $x_i$ is the $i$-th sample of the audio data,
– $N$ is the total number of samples in the audio.
• Current Decibel: The current loudness in decibels (dB) is calculated from the RMS value:
$$\mathrm{current_{dB}} = 20 \cdot \log_{10}(\mathrm{RMS} + \varepsilon)$$
• Gain Calculation: The gain needed to reach the target loudness $\mathrm{target_{dB}}$ is
$$\mathrm{gain_{dB}} = \mathrm{target_{dB}} - \mathrm{current_{dB}}, \qquad \mathrm{gain} = 10^{\mathrm{gain_{dB}}/20}$$
• Apply Gain: Finally, the audio data is scaled by the gain factor to normalize its volume:
$$x_i' = x_i \cdot \mathrm{gain}$$
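A NumPy sketch of this normalization; the target loudness of −20 dB is an illustrative, tunable choice.

```python
import numpy as np

def normalize_volume(audio: np.ndarray, target_db: float = -20.0, eps: float = 1e-10) -> np.ndarray:
    """Scale the waveform so its RMS loudness matches target_db."""
    rms = np.sqrt(np.mean(audio ** 2))
    current_db = 20.0 * np.log10(rms + eps)
    gain = 10.0 ** ((target_db - current_db) / 20.0)
    return audio * gain

# Example with a synthetic signal:
sine = 0.05 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
normalized = normalize_volume(sine)
print(20 * np.log10(np.sqrt(np.mean(normalized ** 2))))  # ~ -20.0
```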
Silence Trimming:
The audio is split into short frames and the loudness of each frame is measured in decibels:
$$\mathrm{dbFrames}_i = 20 \cdot \log_{10}\!\left(\sqrt{\frac{1}{N}\sum_{n=1}^{N} x_{i,n}^2} + \varepsilon\right)$$
where $x_{i,n}$ is the $n$-th sample of the $i$-th frame and $N$ is the frame length. Only the frames whose loudness exceeds a silence threshold are kept:
$$\mathrm{keepFrames} = \{\, i : \mathrm{dbFrames}_i > \mathrm{threshold} \,\}$$
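A sketch of frame-level silence trimming consistent with these formulas; the frame length, hop, and −40 dB threshold are illustrative assumptions rather than the exact values used in our preprocessing.

```python
import numpy as np

def trim_silence(audio: np.ndarray, frame_len: int = 400, hop: int = 160,
                 threshold_db: float = -40.0, eps: float = 1e-10) -> np.ndarray:
    """Keep only the frames whose RMS loudness exceeds threshold_db."""
    kept = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len]
        db = 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)) + eps)
        if db > threshold_db:
            kept.append(frame[:hop])  # keep the non-overlapping part of loud frames
    return np.concatenate(kept) if kept else audio

# Example: half a second of silence followed by a tone; the silent part is removed.
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(8000) / 16000)
audio = np.concatenate([np.zeros(8000), tone])
print(len(audio), len(trim_silence(audio)))
```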
Each frame of the audio signal is multiplied by a Hamming window to reduce spectral leakage. The Hamming window is defined as:
$$w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \qquad n = 0, 1, \ldots, N-1$$
where $N$ is the frame length in samples. This window is applied to each frame of the audio signal to smooth the edges and reduce leakage.
The audio signal is split into overlapping frames. For each frame, the windowed signal is calculated. The $n$-th sample of the $i$-th frame is computed as:
$$\mathrm{frame}_i[n] = x[(i-1)\cdot \mathrm{hopLen} + n] \cdot w[n]$$
where:
• $x[(i-1)\cdot \mathrm{hopLen} + n]$ is the $n$-th sample of the $i$-th frame in the original signal,
• $w[n]$ is the Hamming window applied to the frame,
• $\mathrm{hopLen}$ is the hop length (or step size) between frames.
This process divides the signal into smaller, overlapping segments (frames), which are then processed individually.
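A NumPy sketch of framing and windowing under these definitions; the frame and hop lengths are illustrative assumptions.

```python
import numpy as np

def frame_and_window(audio: np.ndarray, frame_len: int = 400, hop_len: int = 160) -> np.ndarray:
    """Split the signal into overlapping frames and apply a Hamming window.

    Returns an array of shape (num_frames, frame_len).
    """
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    num_frames = 1 + (len(audio) - frame_len) // hop_len
    frames = np.stack([
        audio[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames

frames = frame_and_window(np.random.randn(16000))
print(frames.shape)  # (98, 400) for 1 s of 16 kHz audio
```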
For each windowed frame, the FFT (Fast Fourier Transform) is computed. The Discrete Fourier Transform (DFT) formula is as follows:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j \frac{2\pi k n}{N}}, \qquad k = 0, 1, \ldots, N-1$$
where:
• $X(k)$ is the FFT value of the $k$-th frequency bin of the frame,
• $x(n)$ is the $n$-th sample of the windowed frame,
• $N$ is the number of samples in the frame (fftLen),
• $j$ is the imaginary unit.
This operation converts the time-domain signal (frame) into the frequency domain. The resulting FFT is a complex number representing both magnitude and phase for each frequency bin.
The power spectrum is the square of the magnitude of the FFT for each frame:
$$P_i(\omega) = |X_i(\omega)|^2$$
where:
• $P_i(\omega)$ is the power spectrum of the $i$-th frame at frequency $\omega$,
• $X_i(\omega)$ is the complex FFT result of the $i$-th frame.
The power spectrum reflects how much power (energy) is present at each frequency bin for each frame.
Since the FFT result for real-valued signals is symmetric, we keep only the positive frequencies, up to the Nyquist frequency. The final power spectrum is given by:
$$P_{\mathrm{positive}} = P_i(\omega), \qquad \omega = 0, 1, \ldots, \left\lfloor \tfrac{N}{2} \right\rfloor$$
where:
• $P_{\mathrm{positive}}$ is the final power spectrum after keeping only the positive frequencies.
This step reduces redundancy and focuses on the relevant part of the spectrum.
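Continuing the sketch, the power spectrum of each windowed frame can be computed with NumPy's real FFT, which already returns only the positive-frequency bins; the FFT length of 512 is an assumption.

```python
import numpy as np

def power_spectrum(frames: np.ndarray, fft_len: int = 512) -> np.ndarray:
    """Compute |FFT|^2 of each windowed frame, keeping only positive frequencies.

    frames: (num_frames, frame_len); returns (num_frames, fft_len // 2 + 1).
    """
    spectrum = np.fft.rfft(frames, n=fft_len, axis=-1)  # complex, positive bins only
    return np.abs(spectrum) ** 2

power = power_spectrum(np.random.randn(98, 400))
print(power.shape)  # (98, 257)
```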
The Mel filter bank is commonly used in speech and audio processing to convert fre-
quency bins from a linear scale to the Mel scale. The Mel scale approximates the human
ear’s perception of pitch, which is more sensitive to lower frequencies and less sensitive
to higher frequencies.
The Mel scale is a perceptual scale of pitches that approximates human hearing. The formula to convert a frequency $f$ in Hz to the Mel scale is:
$$\mathrm{Mel}(f) = 2595 \cdot \log_{10}\!\left(1 + \frac{f}{700}\right)$$
where $f$ is the frequency in Hz.
For the given signal, we start by converting the minimum and maximum frequencies into the Mel scale:
$$f_{\min} = 0, \qquad f_{\max} = \frac{f_s}{2}$$
where $f_s$ is the sampling rate. The corresponding Mel values for the minimum and maximum frequencies are:
$$\mathrm{Mel}_{\min} = 2595 \cdot \log_{10}\!\left(1 + \frac{f_{\min}}{700}\right), \qquad \mathrm{Mel}_{\max} = 2595 \cdot \log_{10}\!\left(1 + \frac{f_{\max}}{700}\right)$$
We then generate $\mathrm{numMel} + 2$ equally spaced points on the Mel scale between $\mathrm{Mel}_{\min}$ and $\mathrm{Mel}_{\max}$.
3. Convert Mel Points Back to Hz
Next, we convert these Mel points back to the linear frequency scale (Hz) using the inverse Mel formula:
$$f = 700 \cdot \left(10^{\frac{\mathrm{Mel}(f)}{2595}} - 1\right)$$
The resulting hzPoints correspond to the frequency locations of the Mel bins.
We now convert the Mel frequency points into corresponding FFT bin indices. The FFT bin index for a frequency $f$ is given by:
$$\mathrm{bin} = \left\lfloor \frac{(\mathrm{fftLen} + 1) \cdot f}{f_s} \right\rfloor$$
where $\mathrm{fftLen}$ is the FFT length and $f_s$ is the sampling rate. This formula converts each Mel frequency point into a corresponding FFT bin index.
For each Mel bin, we create a triangular filter. Each filter rises linearly from the previous bin's index to the center of the Mel bin, and then falls linearly to the next bin's index. For the $m$-th Mel bin, the filter is defined as:
$$\mathrm{filter}_m[k] = \begin{cases} \dfrac{k - f_{m-1}}{f_m - f_{m-1}} & \text{for } f_{m-1} \le k < f_m \\[1ex] \dfrac{f_{m+1} - k}{f_{m+1} - f_m} & \text{for } f_m \le k < f_{m+1} \end{cases}$$
where $f_{m-1}$, $f_m$, and $f_{m+1}$ are the FFT bin indices of the previous, current, and next Mel points.
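A NumPy sketch of building such a filter bank from these formulas; the sampling rate, FFT length, and numMel = 80 are assumptions matching a typical 80-bin mel-spectrogram setup.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sample_rate: int = 16000, fft_len: int = 512, num_mel: int = 80) -> np.ndarray:
    """Return a (num_mel, fft_len // 2 + 1) matrix of triangular mel filters."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_mel + 2)
    bins = np.floor((fft_len + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    filters = np.zeros((num_mel, fft_len // 2 + 1))
    for m in range(1, num_mel + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising slope
            filters[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            filters[m - 1, k] = (right - k) / max(right - center, 1)
    return filters

print(mel_filter_bank().shape)  # (80, 257)
```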
The Log Mel Spectrogram is calculated in three steps: computing the Mel spectrogram, applying a logarithmic transformation, and normalizing the result.
1. Mel Spectrogram
The Mel spectrogram is computed by applying the Mel filter bank to the magnitude spectrogram of the signal. The Mel spectrogram at time frame $t$ and Mel bin $m$ is given by:
$$\mathrm{MelSpec}(t, m) = \sum_{f=0}^{N-1} |\mathrm{FFT}(x(t))|_f \cdot H_m(f)$$
where:
• $\mathrm{MelSpec}(t, m)$ is the Mel spectrogram value at time frame $t$ and Mel bin $m$,
• $|\mathrm{FFT}(x(t))|_f$ is the magnitude of the FFT of the signal $x(t)$ at frequency bin $f$,
• $H_m(f)$ is the Mel filter bank value at frequency bin $f$ and Mel bin $m$,
• $N$ is the number of FFT bins.
2. Logarithmic Transformation
$$\mathrm{logMel}(t, m) = \log\big(\mathrm{MelSpec}(t, m) + \varepsilon\big)$$
where:
• $\mathrm{logMel}(t, m)$ is the log Mel spectrogram value at time frame $t$ and Mel bin $m$,
• $\mathrm{MelSpec}(t, m)$ is the Mel spectrogram value at time frame $t$ and Mel bin $m$,
• $\varepsilon = 1 \times 10^{-10}$ is a small constant added to avoid computing $\log(0)$, which is undefined.
3. Normalization
Finally, the log Mel spectrogram is normalized to have zero mean and unit variance across the entire utterance:
$$\mathrm{normLogMel}(t, m) = \frac{\mathrm{logMel}(t, m) - \mu}{\sigma}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the log Mel spectrogram computed over the whole utterance.
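Putting the last three steps together, a NumPy sketch of the normalized log-mel computation is shown below; the frame count, FFT length, and 80-bin filter bank in the example are illustrative assumptions.

```python
import numpy as np

def log_mel_spectrogram(frames: np.ndarray, mel_filters: np.ndarray,
                        fft_len: int = 512, eps: float = 1e-10) -> np.ndarray:
    """Compute the normalized log-mel spectrogram of windowed frames.

    frames:      (num_frames, frame_len) windowed audio frames
    mel_filters: (num_mel, fft_len // 2 + 1) triangular filter bank
    Returns:     (num_frames, num_mel) zero-mean, unit-variance log-mel features
    """
    magnitude = np.abs(np.fft.rfft(frames, n=fft_len, axis=-1))   # |FFT(x(t))|
    mel_spec = magnitude @ mel_filters.T                          # apply the filter bank
    log_mel = np.log(mel_spec + eps)                              # avoid log(0)
    return (log_mel - log_mel.mean()) / log_mel.std()             # utterance-level normalization

# Example with random frames and an 80-bin filter bank of matching width:
frames = np.random.randn(98, 400)
filters = np.random.rand(80, 257)
features = log_mel_spectrogram(frames, filters)
print(features.shape)  # (98, 80), mean ~ 0, std ~ 1
```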
4.3.2 Text Preprocessing
1. Text Normalization: This step standardizes the input text by converting all characters to lowercase, removing punctuation marks, and eliminating special characters to ensure consistency for phonetic processing.
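A sketch of such a normalization step, assuming Devanagari input and treating the Nepali danda (।) and common punctuation as characters to strip; the exact set of characters to keep is a design choice and may differ in our final pipeline.

```python
import re
import unicodedata

def normalize_nepali_text(text: str) -> str:
    """Lowercase, strip punctuation/special characters, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    # Keep Devanagari characters (U+0900-U+097F), basic Latin letters/digits, and spaces.
    text = re.sub(r"[^\u0900-\u097Fa-z0-9\s]", " ", text)
    text = text.replace("\u0964", " ")          # drop the danda '।' sentence marker
    return re.sub(r"\s+", " ", text).strip()

print(normalize_nepali_text("नमस्ते, संसार! यो परीक्षण हो।"))
```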
4.4 Overview of the System
Our system allows users to clone a voice in the Nepali language. To do this, the user
must upload a reference Nepali audio sample either by recording it using a microphone
or by uploading a pre-recorded audio file. After that, the user provides a text in Nepali,
and the system will generate speech in the cloned Nepali voice for the given text.
4.5 Software Development Life Cycle
This project is being developed using an incremental methodology, since it offers a functioning prototype at an early stage of development. In the incremental model, the system can be updated in future iterations after receiving user feedback. This is very beneficial for our system, because it relies on feedback from users, and additional operations and features can be added in later increments. The system could also later be extended with further capabilities, such as character recognition.
CHAPTER 5
EPILOGUE
5.1 Work Completed and Work Remaining
5.1.1 Work Completed
Audio pre-processing: The conversion of the audio signal to a log-Mel spectrogram has been achieved, and the vocoder model is ready.
5.1.2 Work Remaining
The phoneme conversion for text is not written yet. The data pipeline for processing au-
dio and text inputs needs to be implemented. Additionally, models for speaker embed-
ding and log-Mel spectrogram generation must be developed. Once model training and
evaluation are complete, a proper user interface should be designed and implemented.
REFERENCES
[1] C.-M. Chien, J.-H. Lin, C.-y. Huang, P.-c. Hsu, and H.-y. Lee, “Investigating on
incorporating pretrained and learnable speaker representations for multi-speaker
multi-style text-to-speech,” in ICASSP 2021-2021 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 8588–
8592.
[2] Q. Xie, X. Tian, G. Liu, K. Song, L. Xie, Z. Wu, H. Li, S. Shi, H. Li, F. Hong et al.,
“The multi-speaker multi-style voice cloning challenge 2021,” in ICASSP 2021-
2021 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2021, pp. 8613–8617.
[3] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech 2:
Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558,
2020.
[6] E. Zhang, Y. Wu, and Z. Tang, “Sc-ecapatdnn: Ecapa-tdnn with separable convolu-
tional for speaker recognition,” in International Conference on Intelligence Science.
Springer, 2024, pp. 286–297.