Tribhuvan University
Institute of Science and Technology
A PROJECT REPORT
ON
NEPALI AUTOMATIC SPEECH RECOGNITION
Submitted to
Institute of Science and Technology
Tribhuvan University
April 2022
SUPERVISOR’S RECOMMENDATION
I hereby recommend that this report, entitled "Nepali Automatic Speech Recognition"
and prepared under my supervision by Prajin Khadka [15150/074], Ankit Raya
[15131/074], and Krishna Rijal [15144/074] in partial fulfillment of the requirements
for the degree of B.Sc. in Computer Science and Information Technology (B.Sc. CSIT),
be processed for evaluation.
…………………………
Asst. Prof. Surya Bam
Supervisor
Department of Computer Science and Information Technology
Bhaktapur Multiple Campus
Dudhpati, Bhaktapur
LETTER OF APPROVAL
This is to certify that the project entitled "Nepali Automatic Speech Recognition",
prepared by Prajin Khadka [15150/074], Ankit Raya [15131/074], and Krishna Rijal
[15144/074] in partial fulfillment of the requirements for the degree of Bachelor of
Science in Computer Science and Information Technology (B.Sc. CSIT), has been well
studied. In our opinion, it is satisfactory in scope and quality for the required degree.
…………………………………… ……………..……………………..
Asst. Prof. Surya Bam Mr. Sushant Poudel
Bhaktapur Multiple Campus Bhaktapur Multiple Campus
……………..…………………….. ……………..………………….
Arjun Singh Saud
Bhaktapur Multiple Campus IOST, Tribhuvan University
ACKNOWLEDGEMENT
The successful completion of this project would not have been possible without the
support and assistance of many individuals and organizations. We feel immensely
blessed to have received such support during our studies, and we take this opportunity
to express our earnest gratitude to every one of them.
First and foremost, we would like to thank our mentor and project supervisor,
Asst. Prof. Surya Bam, whose guidance exposed us to different perspectives on
development and research. His direction has been a cornerstone in developing the ASR
module, and his helpful suggestions enabled us to accomplish this project.
We want to express our special thanks to the Department of Computer Science and
Information Technology for providing us with an environment to explore the latest
technologies, investigate them, and research those areas as our major project. We are
grateful to Mr. Sushant Poudel, coordinator of the Department of Computer Science and
Information Technology, for his valuable suggestions regarding the queries we put
forward about the project.
Last but not least, we appreciate all our teachers, seniors, and friends for their
support in every situation and their constant inspiration, which helped us pursue our
goals and bring the project to completion.
ABSTRACT
Nepali Automatic Speech Recognition is a system whose main objective is to transcribe
Nepali audio into text. Additionally, a web interface can record audio in real time and
transcribe it into text. The ASR system combines a transformer-based neural network,
a Convolutional Neural Network, and Natural Language Processing. We use
self-supervised pre-training and then fine-tune the pre-trained network with
supervised learning.
Table of Contents
ACKNOWLEDGEMENT……..…………………………………………………….. i
ABSTRACT………………………………………………………………………..... ii
LIST OF ABBREVIATIONS………………………………………………………... v
LIST OF FIGURES…………………………………………………………………. vi
LIST OF TABLES………………………………………………………………….. vii
CHAPTER 1: INTRODUCTION ............................................................................... 1
1.4.2. Limitation........................................................................................................... 2
3.3. Analysis................................................................................................................... 10
3.3.1. Object modeling using Class and Object Diagrams ........................................ 10
4.2.2. Transformer...................................................................................................... 18
5.2. Testing..................................................................................................................... 29
References ………………………………………………………………………….. 36
Appendices ………………………………………………………………………….. 37
Logs …………………………………………………………………………………. 40
List of Abbreviations
List of Figures
Figure 1.5: Agile Methodology ……………………………………………………... 3
Figure 3.1.1.3: Use Case Diagram of ASR Module ……………………………….... 8
Figure 3.2.4: Schedule of project organization ………………...…………………... 10
Figure 3.3.1: Class diagram of ASR Module ……………..……...………………… 10
Figure 3.3.2: Sequence Diagram of ASR Module ……………………..………........ 11
Figure 3.3.3: Activity Diagram of ASR System ………………...………………..... 12
Figure 4.1.1: ASR System Architecture ……………………………………………. 14
Figure 4.1.2: Deployment Diagram of ASR Model.……………………………....... 15
Figure 4.1.3: Component Diagram of ASR Model ………………………………… 15
Figure 4.2.1: Architecture of a CNN ……………………………………………….. 16
Figure 4.2.1.3: Convolution Operation …………………………………………….. 17
Figure 4.2.1.4: Max pooling ………………………………………………………... 18
Figure 4.2.2.1: Architecture of Transformer ………………………………..……… 19
Figure 4.2.2.2. Scaled dot-product attention ……………………………………...... 20
Figure 4.2.2.3: Multi-head attention ………………………………………………... 21
Figure 4.2.3: Feed Forward Neural Network ……...…..…………………………... 22
Figure 5.2.1: Self Supervised pre-training using unlabeled speech data ………........ 24
Figure 5.1.2.2: Feed-Forward Neural Network ……...……………………………... 26
Figure 5.1.2.3.1.3: Feed-forward neural network for fine-tuning ….....................…. 28
List of Tables
Table 5.2.1: Test cases for unit testing ………………………………………….. 30-31
Table 5.2.2: Test cases for system testing ………………………………………….. 32
Table 5.3: Obtained result analysis ……………………………………………... 33-34
Chapter 1: Introduction
1.1. Introduction
Transcribing spoken Nepali into textual form is relatively difficult. The literature is
vast and complex, and because of this complexity little data is generated and collected,
so there is not enough to work with. The scarcity of readily available data makes the
problem harder. Another problem we encounter is variation in speech: people speak with
different tones and at different frequencies, which makes recognition difficult.
1.3. Project Objectives
The objective of the project is to translate Nepali audio speech to text. The audio is
given as input in various formats, such as .wav and .mp3.
1.4.1. Scope
The project's primary focus is to build a robust Nepali ASR system capable of transcribing
a given audio segment with low resource consumption and fast inference speed in mobile
and web apps.
The system will be helpful for any company or product that deals with Nepali speech,
and it will be especially helpful for people with disabilities.
With the addition of a language model for decoding, the prediction can be more
grammatically accurate.
1.4.2. Limitation
Despite proper training and testing, the N-ASR has the following limitations:
• The result of the system is data-dependent, and it works only with clear,
everyday speech.
• The model is not able to generate transcriptions for numbers and punctuation.
• Since users provide the input audio, the system's accuracy depends on the quality
of the input audio.
For the algorithm, we decided to build a neural network that learns contextualized
speech representations from unlabeled speech data by randomly masking feature vectors
obtained from a Convolutional Neural Network before passing them to a transformer
network during self-supervised pre-training. Only unlabeled speech data is used here,
pre-processed by removing noise and silence and sampled at 16 kHz.
In the second step, a linear layer is added on top of the pre-trained network to
train the model on labeled audio data for automatic speech recognition. The linear
layer solves a multi-class classification problem over 62 classes, i.e., 62 character
tokens. This approach is mainly helpful for low-resource languages such as Nepali,
which have no large labeled audio datasets; by leveraging the power of self-supervised
pre-training, the project aims to build a robust Nepali ASR system.
Due to the iterative and incremental nature of the system, the agile approach best describes
the development methodology. The model is passed through multiple designing, training,
and testing iterations.
This report is organized into chapters as follows:
The first chapter gives an overall introduction to the project, covering the general
introduction, objectives, project scope and limitations, and development methodology.
Chapter 2 consists of the literature review. Here we summarize the study of other
similar systems and describe how our system differs from, and improves on, the
existing systems.
Results of the system analysis are elaborated in Chapter 3. In this chapter we discuss
the requirements for the system, the system's feasibility, and the system model
corresponding to the approach we have used.
The fourth chapter contains all the diagrams, including the system architecture and
component design, and details of the algorithms used in the project.
Implementation and testing details are summarized in the fifth chapter, including the
tools used for the project, implementation details, and tests of the modules. Result
analysis is an essential part of this chapter and covers the overall progress of the
system.
The final chapter includes the conclusion and further improvements that could be made.
Chapter 2: Background Study and Literature Review
2.1. Background Study:
ASR has numerous applications in the healthcare system, banking, marketing, home
automation, etc. Researchers have obtained impressive improvements in performance,
especially in the English language. They have created cutting-edge models that outperform
their forerunners.
The journey from simple frequency analysis for speaker recognition to intricate
end-to-end online speech recognition may seem fantastic, but English has always
received the most attention. Less commonly spoken languages, such as Nepali, receive
little attention, partly because they are not widely spoken and, more importantly,
because there are insufficient resources for those working in the field.
There have been a few attempts to give the Nepali language an effective speech
recognition system, but none have yielded satisfactory results. This technology should
be available to speakers of all languages, including Nepali, and the service is in high
demand among the general public, large corporations, and students, so it is only a
matter of time before we master it. We do not set the unreasonable goal of completely
overhauling Nepal's current ASR landscape; instead, we demonstrate our methods by
merging cutting-edge technologies to create a prototype application. We believe that
our methodologies will help this field progress and perform better.
Automatic speech recognition for the English language has progressed thanks to an
abundance of data and ongoing research. The accuracy of automatic speech recognition
(ASR) has been significantly boosted since deep neural network (DNN) based hybrid
modeling was adopted a decade ago. Both supervised and self-supervised approaches have
made massive progress in this field.
Modern approaches use deep learning techniques such as CNNs, RNNs, and different
variants of RNNs. The Hidden Markov Model and Gaussian Mixture Model with hand-crafted
feature engineering are the traditional approaches, but the more sophisticated deep
learning methods have surpassed them in performance by a significant margin.
End-to-End (E2E) systems have outperformed traditional hybrid models in academia and
industry. The most popular E2E models are the Attention-based Encoder-Decoder
(AED) [1] and the Recurrent Neural Network Transducer (RNN-T) [2].
In the domain of Nepali Literature, HMM-based isolated word Nepali Speech Recognition
[3] implements HMM (Hidden Markov Model) based speaker-independent isolated word
Automatic Speech Recognition (ASR) system for the Nepali Language.
Nepali Speech Recognition using RNN-CTC Model [4] presents a neural network-based
Nepali speech recognition model. An RNN (Recurrent Neural Network) is used to process
the sequential audio data, and CTC (Connectionist Temporal Classification) is used to
train the RNN over the audio data; CTC is a probabilistic approach to maximizing the
probability of the desired labels occurring in the RNN output. After processing
through the RNN and CTC layers, Nepali text is obtained as output. This paper also
defines a character set of 67 Nepali characters required for transcription of Nepali
speech to text.
These E2E models are more data-hungry than hybrid models: if the dataset is small,
performance drops significantly, and data augmentation techniques help but are not
enough to prevent overfitting. Recent advances have therefore shown the value of
self-supervised models, where it is helpful to first pre-train the E2E model on
unlabeled data and then fine-tune it on low-resource labeled data [5], [6]. This
pre-training technique has not been applied to the Nepali language, which is what we
plan to implement in this project.
Chapter 3: System Analysis
3.1. Requirement Analysis
A functional requirement specifies how the system should react in the situations it is
put in and what output it should produce for a given input. The following are the
functional requirements of the Nepali ASR system:
1. Record the audio.
2. Upload the pre-recorded audio file.
3. Trim the audio file.
4. Clear the audio file (recorded/uploaded)
5. Start processing and show the Devanagari transcript as output when the audio file
is submitted.
6. Save the audio file and transcript in a folder on the server when a flag is raised.
3.1.1.1. Use case Diagram
It shows the interaction between the system and the user in a particular environment. The
use case model contains actors and the use cases. The actors are the external entities, and
the use cases are the system's functions.
Figure 3.1.1.3: Use Case Diagram of ASR Module
3.1.2.1. Accuracy
The model should have high accuracy for clear audio segments without noise. The
accuracy metrics we will use are word error rate and character error rate. The system's
accuracy depends on how accurately it can transcribe the audio; during inference, it
also depends on the audio provided by the user.
3.1.2.2. Efficiency
The end system is accessed through an API and a web-based system, which should have
low latency, i.e., inference time.
3.1.2.3. Availability
The end system should run with as little downtime as possible. The web-based system
should work on all major web browsers.
3.1.2.4. Reliability
The model should make few errors and no drastically significant ones. The user will be
able to flag cases where transcriptions are incorrect; the system saves the flagged
audio and transcription in a file system, which is reviewed later and, after proper
processing, used as a training dataset.
Here, we have studied all the feasibility aspects of the project to check whether it is
feasible given the decided requirements and the available information, technologies,
and budget.
3.2.1. Technical
The first step will be data preprocessing and cleaning. The project uses a deep
learning architecture, a combination of CNN and transformer, for self-supervised
pre-training on unlabeled data, and a feed-forward neural network to fine-tune the
pre-trained model for ASR. For training, we will leverage the GPUs in our personal
laptops and cloud GPUs when available.
3.2.2. Operational
Operational feasibility assesses how well the proposed system solves problems and
supports new workflows. It takes the ideas and opportunities developed during the
initial phase, together with the insights from requirement gathering, to build the new
system. The proposed system can be used in many applications where voice-to-text is
applicable.
3.2.3. Economic
The project only requires ordinary laptop specifications and pay-as-you-go cloud GPUs,
making it economically feasible. Heavy computing resources are not needed after the
system is trained; inference can be carried out on smaller computing devices too.
3.2.4. Schedule
3.3. Analysis
3.3.1. Object modeling using Class and Object Diagrams
In the object-oriented approach, a class diagram defines and provides an overview of
the structure of a system in terms of classes, objects, attributes, and methods and
their relationships. The class diagram can also be termed a structure diagram that
provides a conceptual model and the architecture of the system being developed.
The sequence diagram shows the interaction between objects in sequential order, i.e.,
how objects operate with one another and in what order. The following sequence diagram
depicts the flow of information in Nepali Automatic Speech Recognition.
The sequence diagram above describes the sequential interaction of the system from
input to the generated output. First, raw audio is given as input by the user to the
ASR system, which processes it in sequential order. The system passes the audio
waveform vector to the CNN model, which acts as a feature encoder and generates the
audio's latent speech representation. The latent speech representation is passed into
the transformer model, which generates the context speech representation, and the
feed-forward model generates the final output: a probability distribution over
vocabulary characters, from which greedy decoding selects the most probable character.
3.3.3. Process modeling using Activity diagram
An activity diagram is essentially an advanced form of a flow chart that generally describes
the model's flow. The activity diagram follows a behavioral approach which shows the flow
from one activity to another from start to end.
The activity diagram elaborates the flow of the whole system from the starting state to
the ending state. The activity starts with the input that the user provides to the
system, which takes raw audio as input. This input is then fed to the model of the
Nepali ASR system, where different types of neural networks serve different purposes.
The input is first processed by a 1D CNN, which provides the latent representation of
the audio. It is then passed to the transformer, which generates the context
representation of the audio from the quantized latent speech representation. After
that, a softmax layer in a simple feed-forward linear neural network is used to
calculate the probability of each character, and the character with the highest
probability is generated as the output of the whole model. If the system predicts the
output without any errors (prediction errors aside), the Devanagari transcript of the
corresponding audio is generated, leading to the end state of the system.
Chapter 4: System Design
4.1. Design
The object-oriented approach is being used for system design. We have developed an
architecture for the system with a class diagram, sequence diagram, and activity diagram
to demonstrate how different models in the system interact to provide collective
functionalities.
The system takes raw spoken audio as input and processes it to generate output in the
Nepali Devanagari script. The raw audio is consumed by the CNN, which acts as a feature
encoder extracting a feature vector, the latent speech representation. The transformer
model is responsible for generating context vectors that incorporate the sequence
information. The final feed-forward layer gives the probability distribution over the
Nepali character tokens, and the final output is selected using a greedy method that
predicts the most probable character.
This shows the deployment architecture of the ASR system. The user communicates with
the system through a web browser, where the user can upload an audio file or record
audio directly. The ASR system is containerized with Docker and deployed on Azure. An
API is also provided for external application programs. The recorded audio and the
predicted transcript are saved in the file system for later analysis.
The raw audio fed into the system is converted into a vector for the ASR model. The ASR
module passes the vector through a series of operations, which finally yields the
transcript as the result.
4.2. Algorithms Details
The input layer represents the input sequence to the CNN. For an image, the input would
be three-dimensional (e.g., an RGB image); for audio data, it is one-dimensional.
The convolutional layers are the foundation of a CNN, as they contain the learned
kernels (weights), which extract features that distinguish different inputs from one
another. A unique kernel is used in each convolution operation to produce the current
convolutional neuron's output, or activation map.
The convolutional neuron performs an elementwise dot product between a unique kernel
and the previous layer's corresponding neuron outputs. This yields as many intermediate
results as there are individual kernels; the convolutional neuron's output is the sum
of all these intermediate results plus the learned bias.
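The elementwise multiply-and-sum just described can be sketched for the one-dimensional case (the one relevant to audio). This is an illustrative helper, not our actual implementation:

```python
import numpy as np

def conv1d(x, kernel, bias=0.0, stride=1):
    """Valid 1-D convolution: slide the kernel over x, summing
    elementwise products and adding the learned bias."""
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel) + bias
                     for i in range(out_len)])

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])     # a simple difference-detecting kernel
print(conv1d(signal, kernel))           # -> [-2. -2. -2.]
```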
Since convolution features sparse interactions, the parameter matrix does not have to
describe an interaction between every input and output unit. This is achieved by making
the kernel smaller than the input, and it allows features to be extracted with far less
computation. Moreover, the parameter-sharing feature of CNNs is beneficial.
There are multiple hyperparameters in the convolution layer. They are:
1. Padding is often necessary when the kernel extends beyond the activation map.
Padding conserves data at the borders of activation maps, which leads to better
performance, and it can help preserve the input's spatial size, which allows an
architecture designer to build deeper higher-performing networks.
2. Kernel size, often also referred to as filter size, refers to the dimensions of the
sliding window over the input.
3. The stride indicates how many pixels the kernel should be shifted over at a time.
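Together these hyperparameters determine the length of the activation map via the standard formula out = (n + 2·padding − kernel) / stride + 1 (rounded down). A small sketch:

```python
def conv_output_length(n, kernel, stride=1, padding=0):
    """Length of a 1-D convolution's output for input length n."""
    return (n + 2 * padding - kernel) // stride + 1

# e.g. one second of 16 kHz audio through a kernel-10, stride-5 layer:
print(conv_output_length(16000, kernel=10, stride=5))  # -> 3199
```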
The pooling layer replaces the network's output at specific locations by deriving a summary
statistic of the nearby outputs. This helps to reduce the spatial size of the representation,
which decreases the required amount of computation and weights. The pooling operation
is processed on every slice of the representation individually.
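A one-dimensional max-pooling operation can be sketched as follows (illustrative, not our implementation):

```python
import numpy as np

def max_pool1d(x, size, stride):
    """Replace each window with its maximum, shrinking the representation."""
    out_len = (len(x) - size) // stride + 1
    return np.array([x[i * stride:i * stride + size].max()
                     for i in range(out_len)])

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
print(max_pool1d(x, size=2, stride=2))  # -> [3. 5. 4.]
```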
Figure 4.2.1.4: Max pooling
This layer converts a three-dimensional layer in the network into a one-dimensional vector
to fit the input of a fully-connected layer for classification.
4.2.2. Transformer
Figure 4.2.2.1: Architecture of Transformer
The Encoder is on the left and the Decoder on the right. Both are composed of modules
that can be stacked on top of each other multiple times, as indicated by Nx in the
figure. The modules consist mainly of Multi-Head Attention and Feed Forward layers. The
inputs and outputs (target sequences) are first embedded into an n-dimensional space.
A critical part of the model is the positional encoding. Since there are no recurrent
networks that can remember how sequences are fed into the model, we need to somehow
give every part of the sequence a relative position, since a sequence depends on the
order of its elements. These positions are added to each word's embedded representation
(n-dimensional vector).
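The original Transformer uses fixed sinusoidal positional encodings; a minimal sketch of that scheme follows (note that wav2vec 2.0, which we build on, instead uses a convolutional positional embedding):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings from 'Attention Is All You Need':
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=8)
print(pe.shape)  # (50, 8) -- added elementwise to the embeddings
```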
Q is a matrix that contains the query (vector representation of one part in the sequence), K
is all the keys (vector representations of all the parts in the sequence), and V is the values,
which are again the vector representations of all the parts in the sequence.
In the encoder's and the decoder's own multi-head attention modules, V consists of the
same sequence as Q. However, in the attention module that connects the encoder and
decoder sequences, V is a different sequence from the one represented by Q.
The values in V are multiplied and summed with attention weights a. In scaled
dot-product attention, the weights are defined by:
a = softmax(QK^T / sqrt(d_k))
The weights a express how each word of the sequence (represented by Q) is influenced by
all the other words in the sequence (represented by K).
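A minimal sketch of scaled dot-product attention, with small random matrices standing in for real projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- each output row is a weighted
    sum of the rows of V, weighted by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

Q = np.random.randn(4, 8)   # 4 sequence positions, d_k = 8
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out, a = scaled_dot_product_attention(Q, K, V)
print(out.shape)            # (4, 8); each row of `a` sums to 1
```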
The figure above shows how this attention mechanism can be parallelized into multiple
mechanisms. The attention computation is repeated several times with different linear
projections of Q, K, and V, which allows the system to learn from different
representations of Q, K, and V, benefiting the model. These linear projections are
obtained by multiplying Q, K, and V by weight matrices W that are learned during
training.
The matrices Q, K, and V are different for each position of the attention modules in
the structure, depending on whether they are in the encoder, the decoder, or between
encoder and decoder, since we want to attend to either the whole encoder input sequence
or a part of the decoder input sequence. The multi-head attention module that connects
the encoder and decoder ensures that the encoder input sequence is taken into account
together with the decoder input sequence up to a given position.
A pointwise feed-forward layer follows the multi-head attention in both the encoder and
decoder. This feed-forward network has identical parameters for each position, so it
can be described as a separate, identical linear transformation applied to each element
of the sequence.
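The position-wise feed-forward layer applies the same two linear transformations, with a ReLU between them, to every position independently; a sketch with hypothetical dimensions:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied with the same weights
    at every position (time step) of the sequence."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 16, 64, 10
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
x = rng.standard_normal((seq_len, d_model))
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 16)
```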
4.2.3. Feedforward Neural Network
A Feed Forward Neural Network is an artificial neural network in which the connections
between nodes do not form a cycle.
4.2.3.1. Input layer:
This layer comprises the neurons that receive the input and transfer it to the other
layers of the network. The number of neurons in the input layer must equal the number
of features or attributes in the dataset.
4.2.3.2. Output layer:
This layer produces the forecasted feature, which depends on the type of model being
built.
4.2.3.3. Hidden layers:
The hidden layers are positioned between the input and output layers; their number
depends on the type of model. Hidden layers contain several neurons that apply
transformations to the input before passing it on, and the weights in the network are
continually updated to improve its predictions.
4.2.3.4. Neuron weights:
The strength or magnitude of the connection between two neurons is called a weight. The
input weights can be compared to the coefficients in linear regression. The weight
values are usually small and typically fall within the range of 0 to 1.
4.2.3.5. Neurons:
The feedforward network has artificial neurons, which are modeled on biological neurons
and are the building blocks of the network. The neurons work in two steps: first, they
compute the sum of the weighted inputs, and second, they apply an activation function
to normalize the sum.
The activation function can be either linear or nonlinear. Weights are associated with
each input of the neuron, and the network learns these weights during the training
phase. The activation function is the decision-making center at the neuron's output:
the neurons make linear or nonlinear decisions based on it, and it prevents neuron
outputs from growing without bound through the cascading effect of passing through many
layers. The three most important activation functions are sigmoid, tanh, and the
Rectified Linear Unit (ReLU).
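The three functions can be written directly:

```python
import numpy as np

def sigmoid(x):
    """Squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Squashes inputs into (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Zeroes out negative inputs, passes positives through."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```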
CTC loss is used to align the input and output sequences when the input is continuous,
the output is discrete, and there are no clear element boundaries that can be used to
map the input to the elements of the output sequence.
• CTC loss (during training): given a ground-truth target transcript, it trains the
network to maximize the probability of outputting that correct transcript.
• CTC decoding (during inference): here, we have no target transcript to refer to and
must predict the most likely sequence of characters.
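Greedy CTC decoding reduces the per-frame predictions by collapsing consecutive repeats and removing blanks; a sketch with hypothetical token ids (in our model the [PAD] token plays the role of the blank):

```python
def ctc_greedy_decode(frame_ids, blank_id):
    """Collapse consecutive repeated tokens, then drop the blank token."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# hypothetical per-frame argmax output; 0 is the blank token here
frames = [0, 7, 7, 0, 7, 3, 3, 0]
print(ctc_greedy_decode(frames, blank_id=0))  # -> [7, 7, 3]
```

Note that the blank between the two 7s preserves the genuinely repeated character, which plain repeat-collapsing would lose.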
Chapter 5: Implementation and Testing
5.1. Implementation
The multi-layer convolutional feature encoder f: X → Z takes raw audio X as input and
outputs latent speech representations z_1, z_2, …, z_T. These are then fed to a
Transformer g: Z → C to build c_1, c_2, …, c_T, capturing information from the entire
sequence.
The transformer network is pre-trained with a contrastive loss. Following wav2vec 2.0
[7], the loss for a masked time step t is:
L_m = -log [ exp(sim(c_t, q_t) / k) / Σ_{q' ∈ Q_t} exp(sim(c_t, q') / k) ]
where sim denotes cosine similarity, q_t is the true quantized vector for step t, Q_t
consists of q_t together with K distractors (negatives), and k is a temperature. The
contrastive loss encourages high similarity with the positive vector and penalizes high
similarity scores with the negative vectors.
OpenSpeech has collected 402 hours of unlabeled data [8] extracted from YouTube, which
we used to pre-train Facebook's open-source wav2vec2 model trained on multilingual
data. We also collected our own data samples.
Storing audio as raw waveforms is computationally expensive and requires a large amount
of storage. So, we converted the audio dataset into array format and stored it in
Apache Arrow, which takes less storage and is efficient for data access.
• The audio arrays are fed to the feature encoder, i.e., the CNN network, in the
training step. The feature encoder contains seven blocks; the temporal convolutions
in each block have 512 channels, with strides (5, 2, 2, 2, 2, 2, 2) and kernel widths
(10, 3, 3, 3, 3, 2, 2). The encoder network outputs a 1024-dimensional quantized
speech representation.
• About 50% of the quantized speech representation is masked and then fed to a
transformer network whose output dimension is also 1024. The task of the transformer
network is to predict the masked quantized vectors, using the contrastive loss
function. The transformer network contains 12 transformer blocks, a model dimension
of 1024, and 16 attention heads. The parameters and hyperparameters for these models
are set to the defaults in wav2vec 2.0: A Framework for Self-Supervised Learning of
Speech Representations [7].
• With a batch size of 16, the model is trained for two days on a V100 GPU provided by
HuggingFace. The Adam optimizer is used to train the network, with the learning-rate
warmup method and a learning rate of 5 * 10^-4.
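Given the feature-encoder geometry above (strides 5, 2, 2, 2, 2, 2, 2 and kernel widths 10, 3, 3, 3, 3, 2, 2), the number of latent frames the encoder emits for a given input length can be computed directly; this sketch assumes valid convolutions with no padding:

```python
strides = (5, 2, 2, 2, 2, 2, 2)
kernels = (10, 3, 3, 3, 3, 2, 2)

def encoder_frames(n_samples):
    """Number of latent frames the feature encoder emits for n raw samples."""
    n = n_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n

# total stride is 5 * 2**6 = 320 samples, i.e. roughly one frame per 20 ms
print(encoder_frames(16000))  # frames for one second of 16 kHz audio -> 49
```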
5.1.2.3. Fine-tuning
After pre-training the CNN + transformer network with masking, a feedforward neural
network is added and trained in a supervised fashion on labeled audio data and its
transcriptions. The objective of the feedforward network is a multi-class
classification problem over 62 classes, i.e., the Nepali character tokens.
The output of the transformer network, the contextualized speech representations
c_1, c_2, …, c_T, acts as the input to the feedforward network.
The loss function used with the feedforward network is the CTC loss.
5.1.2.3.1.1 Dataset
Here, we combine multiple open-source datasets: OpenSLR's High-quality TTS data for
Nepali [9] and the Large Nepali ASR training data set [6], both labeled for ASR. We
also use the crowdsourced data collected by OpenSpeech [8]. The total amount of data is
around 400 hours.
5.1.2.3.1.2 Data pre-processing
Storing raw audio data takes a vast amount of storage, and converting it to an array
every time incurs memory overhead. We therefore convert all the raw audio waveforms to
arrays and store them in Apache Arrow format. We also remove special characters such
as '[\,\?\.\!\-\;\:\ "\ "\%\ '\" \�\']' by replacing them
with blank space.
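The cleaning step can be sketched as follows; the character set here is a simplified stand-in for the full list used in our pipeline:

```python
import re

# Simplified subset of the special characters stripped from transcripts.
CHARS_TO_IGNORE = r'[,?.!\-;:"\'%]'

def clean_transcript(text):
    """Replace punctuation and special symbols with blank space."""
    return re.sub(CHARS_TO_IGNORE, ' ', text)

print(clean_transcript('नमस्ते, संसार!'))  # punctuation replaced by spaces
```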
• The vocabulary characters are already defined in JSON format in the vocab JSON file;
these are the 62 Nepali character tokens:
{"ड": 0, "इ": 1, "ढ": 2, "फ": 3, "ठ": 4, "ृ": 5, "ृ ": 6, "औ": 8, "द": 9, "ञ": 10, "ृ": 11,
"ऋ": 12, "घ": 13, "अ": 14, "ई": 15, "ट": 16, "ग": 17, "ृ": 18, "झ": 19, "ृ": 20, "िृ":
21, "ह": 22, "ृ": 23, "छ": 24, "ष": 25, "ङ": 26, "प": 27, "ऐ": 28, "र": 29, "ृ": 30, "ऊ":
31, "ब": 32, "थ": 33, "व": 34, "उ": 35, "भ": 36, "ृ ": 37, "ज": 38, "ए": 39, "ृ ": 40, "त":
41, "आ": 42, "ख": 43, "ल": 44, "ृ ": 45, "ृ": 46, "क": 47, "स": 48, "ओ": 49, "ध": 50,
"ण": 51, "म": 52, "श": 53, "न": 54, "ृ": 55, "ृ ": 56, "च": 57, "य": 58, "ॠ": 59, "|": 7,
"[UNK]": 60, "[PAD]": 61}
The [UNK] token stands for unknown characters, and [PAD] is the padding token used by the CTC loss while decoding.
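Encoding a transcription with this vocabulary can be sketched as follows. Only a small subset of the mapping is reproduced here, and treating '|' as the word delimiter (replacing spaces) is an assumption based on its presence in the vocabulary.

```python
# Illustrative subset of the vocabulary mapping shown above
vocab = {"न": 54, "म": 52, "स": 48, "|": 7, "[UNK]": 60, "[PAD]": 61}

def encode(text):
    """Map each character to its id; spaces become the '|' word delimiter,
    and unseen characters fall back to [UNK]."""
    return [vocab.get(ch if ch != ' ' else '|', vocab["[UNK]"]) for ch in text]
```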
Figure 5.1.2.3.1.3: Feed-forward neural network for fine-tuning
• 10% of the total data is used as a validation set. With a batch size of 16, the model is fine-tuned for two days on a V100 GPU, using the Adam optimizer. The model is trained for just one epoch because of a lack of computing resources and time. The Connectionist Temporal Classification (CTC) loss is used while fine-tuning the model. The softmax activation function is applied to obtain a vector of probabilities over the 62 vocabulary characters, and the prediction is made greedily, i.e., the most probable character is selected at each time step.
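The greedy prediction step can be sketched as standard CTC greedy decoding: take the per-frame argmax, collapse consecutive repeats, and drop blanks. This is a minimal illustration, assuming [PAD] at index 61 acts as the blank.

```python
def ctc_greedy_decode(frame_ids, blank_id=61):
    """Collapse consecutive repeated ids, then drop blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out
```

Note that a blank between two identical ids keeps both, which is how CTC represents doubled characters.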
5.2. Testing
Testing is the process of determining whether the system works as intended. The efficiency and effectiveness of the system are examined in the testing phase. Through testing, bugs, errors, and project progress can be identified and checked for quality assurance, validation, and verification.
5.2.1. Test Cases for Unit Testing
Unit testing deals with the functional correctness of the system. It splits the system into individual modules and tests each in isolation, which helps identify the system's flaws and errors early.
The following table contains the test cases for unit testing:
Test case | Steps to be executed | Expected results | Obtained results | Pass/Fail
(previous test case, continued) | Generate context speech representation. | | |
FFNN processing | Take in the context speech representation vector; generate a probability for each character; output the highest-probable character. | The highest-probable character should be generated. | The highest-probable character is generated. | Pass
Edit the input (audio) | Trim the input audio. | Using the slider, the user should be able to trim the audio. | The user is able to trim the audio. | Pass
5.2.2. Test Cases for System Testing
System testing is performed through the web interface, where we recorded multiple audio samples and examined the predicted output. The significant findings are:
1. The transcribed text is not entirely grammatically correct.
2. The model does not transcribe numbers in Devanagari script.
3. The inference time is around 10–15 seconds for a single sentence.
िनलिबबत गभनार अिधक र क पक्षम क णसभ , िनलिबबत गवनार अिधक र क पक्षम क ण सभ सरक र
सरक रिवरुद्ध क ल ब्य नर िवरुद्ध क ल प्य नर
The word error rate of the model on the OpenSLR 43 corpus is 27%, while the character error rate is 8.3%. In addition, 10,000 samples held out from the training data as a test set give a 34% word error rate.
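The word error rate above is the word-level Levenshtein (edit) distance divided by the reference length; the character error rate is the same computation over characters instead of words. A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming (single rolling row)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j]
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```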
The following hyperparameters were used during fine-tuning:
• learning_rate: 6e-05
• train_batch_size: 16
• eval_batch_size: 8
• seed: 42
• gradient_accumulation_steps: 2
• total_train_batch_size: 32
• optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
• lr_scheduler_type: linear
• lr_scheduler_warmup_steps: 1000
• num_epochs: 1
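These hyperparameters map directly onto HuggingFace's Trainer configuration. The following is a minimal sketch, not the project's actual training script; the output directory name is illustrative.

```python
from transformers import TrainingArguments

# The report's fine-tuning hyperparameters expressed as TrainingArguments;
# total_train_batch_size (32) falls out of 16 * gradient_accumulation_steps
training_args = TrainingArguments(
    output_dir="wav2vec2-nepali-asr",   # illustrative name
    learning_rate=6e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=2,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=1,
)
```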
The following table summarizes the results obtained:

Epoch | Training Loss | Step | Validation Loss | WER
0.71 | 0.2639 | 7200 | 0.2601 | 0.3480
Chapter 6: Conclusion and Future Recommendations
6.1. Conclusion
The primary intention of the project is to build a working Nepali Automatic Speech Recognition system. We successfully used a self-supervised pre-training methodology with unlabeled data, which helped to build a robust ASR system for low-resource languages like Nepali.
Since the start of the project, we researched several methodologies for building ASR systems. The supervised learning methods were data-hungry, required a lot of labeled data, and were not very accurate, so we chose a self-supervised learning methodology. As planned, we developed a system that operates as expected.
However, our system is not 100% accurate. It is affected by noise and unusual speech frequencies, it cannot predict numbers or punctuation, and its transcriptions are not always grammatically correct. Another significant issue is that, because of a lack of computing resources, we could not train the system for as long as we wanted, tune hyperparameters, or leverage all the available datasets, which has undoubtedly affected the system's performance.
The system can continually be improved. We can add a language model for decoding, which should transcribe text with fewer grammatical errors. With sufficient compute resources, we can perform hyperparameter tuning, which should increase the system's performance.
In addition, since we use self-supervised learning, collecting more diverse unlabeled audio data and using it for training is always an option.
Furthermore, to use the model in a real-world application, the inference time and model size must be reduced, which can be done with techniques such as quantization and pruning without hurting the system's performance.
References
Appendices
Screenshots:
Convolution Layer
Transformer
Feed-Forward
Training Script
Interface
Output
Logs of visits to the Supervisor

Date | Topics discussed | Remarks
17th December | Proposal writing discussion | Demonstration for the project proposal and methodology discussion.
21st January | Proposal discussion and correction of required changes | Changes made in the project proposal as per the supervisor's advice.
22nd January | Project proposal review | Necessary changes needed in the project proposal.
2nd February | Discussion of the methodologies to be followed; supervised vs. self-supervised learning; data collection strategy; performance improvement strategies | Self-supervised learning could be the best approach. Collect open-source data and record our own voice samples. Adding a language model could help improve performance.
8th February | Progress report observation | Based on the project progress report, an object-oriented approach should be used.
10th February | Mid-defense presentation and progress report | Necessary requirement amendments for the mid-term defense and demo presentation.
18th April | Final defense project review | Changes in implementation methodology; adding the appendix.