
Bhaktapur Multiple Campus

Tribhuvan University
Institute of Science and Technology

A PROJECT REPORT
ON

Nepali Automatic Speech Recognition using Self-Supervised
Learning
Under the supervision of
Asst. Prof. Surya Bam

Submitted to
Institute of Science and Technology
Tribhuvan University

In Partial Fulfillment of the Requirement for the Bachelor’s Degree in Computer
Science and Information Technology

Prajin Khadka (15150/074)


Ankit Raya (15131/074)
Krishna Rijal (15144/074)

April 2022
SUPERVISOR’S RECOMMENDATION

I hereby recommend that this report, prepared under my supervision by Prajin
Khadka [15150/074], Ankit Raya [15131/074], and Krishna Rijal [15144/074], entitled
“Nepali Automatic Speech Recognition”, in partial fulfillment of the requirement for
the degree of B.Sc. in Computer Science and Information Technology (B.Sc. CSIT), be
processed for evaluation.

…………………………
Asst. Prof Surya Bam
Supervisor
Department of Computer Science and Information Technology
Bhaktapur Multiple Campus
Dudhpati, Bhaktapur
LETTER OF APPROVAL

This is to certify that this project, submitted by Prajin Khadka [15150/074], Ankit Raya
[15131/074], and Krishna Rijal [15144/074], entitled “Nepali Automatic Speech
Recognition”, in partial fulfillment of the requirement for the degree of Bachelor of
Science in Computer Science and Information Technology (B.Sc. CSIT), has been well
studied. In our opinion, it is satisfactory in scope and quality for the required degree.

Signature of Supervisor Signature of HOD

…………………………………… ……………..……………………..
Asst. Prof Surya Bam Mr. Sushant Poudel
Bhaktapur Multiple Campus Bhaktapur Multiple Campus

Signature of Internal Examiner Signature of External Examiner

……………..…………………….. ……………..………………….
Arjun Singh Saud
Bhaktapur Multiple Campus IOST, Tribhuvan University
ACKNOWLEDGEMENT
The successful completion of this project work would not have been possible without
the support and assistance of many individuals and organizations. We feel immensely
fortunate to have received such support during our study, and we take this opportunity
to offer our earnest appreciation to every one of them.

First and foremost, we would like to acknowledge our mentor and project supervisor,
Asst. Prof. Surya Bam, whose guidance gave us different perspectives on development
and research in the project. His guidance has been a cornerstone in developing the ASR
module in our project, and his directions and suggestions made its completion possible.

We want to express our special thanks to the Department of Computer Science and
Information Technology for providing us with an environment to explore the latest
technology, investigate it, and research those areas as our major project. We are grateful
to Mr. Sushant Poudel, coordinator of the Department of Computer Science and
Information Technology, for providing valuable suggestions regarding the queries we
put forward about the project.

Last but not least, we would like to thank all our teachers, seniors, and friends for their
support, irrespective of the situation, and their constant inspiration, which helped us
sustain our efforts and bring the project to completion.

i
ABSTRACT
Nepali Automatic Speech Recognition is a system whose main objective is to transcribe
Nepali audio into text form. Additionally, a web interface allows audio to be recorded
in real time and transcribed into text. The ASR system combines a transformer-based
neural network, a Convolutional Neural Network, and Natural Language Processing.
We use self-supervised pre-training and then fine-tune the pre-trained network with
supervised learning.

Keywords: Recognition, Speech, transformer, CNN, NLP

ii
Table of Contents
ACKNOWLEDGEMENT……..…………………………………………………….. i
ABSTRACT………………………………………………………………………..... ii
LIST OF ABBREVIATIONS………………………………………………………... v
LIST OF FIGURES…………………………………………………………………. vi
LIST OF TABLES………………………………………………………………….. vii
CHAPTER 1: INTRODUCTION ............................................................................... 1

1.1. Introduction ............................................................................................................... 1

1.2. Problem Statements .................................................................................................. 1

1.3. Project Objectives ..................................................................................................... 2

1.4. Scope and Limitations............................................................................................... 2

1.4.1. Scope .................................................................................................................. 2

1.4.2. Limitation........................................................................................................... 2

1.5. Development Methodology ...................................................................................... 2

1.6. Report Organization .................................................................................................. 3

CHAPTER 2: BACKGROUND STUDY AND LITERATURE REVIEW ............ 5

2.1. Background Study:.................................................................................................... 5

2.2. Literature Review...................................................................................................... 5

CHAPTER 3: SYSTEM ANALYSIS ......................................................................... 7

3.1. Requirement Analysis ............................................................................................... 7

3.1.1. Functional Requirement ..................................................................................... 7

3.1.2. Non-Functional Requirement............................................................................. 8

3.2. Feasibility Analysis ................................................................................................... 9

3.2.1. Technical ............................................................................................................ 9

3.2.2. Operational ......................................................................................................... 9

3.2.3. Economic ........................................................................................................... 9

3.2.4. Schedule ........................................................................................................... 10

3.3. Analysis................................................................................................................... 10

iii
3.3.1. Object modeling using Class and Object Diagrams ........................................ 10

3.3.2. Dynamic Modeling using sequence diagram ................................................... 11

3.3.3. Process modeling using Activity diagram ....................................................... 12

CHAPTER 4: SYSTEM DESIGN ............................................................................ 14

4.1. Design ..................................................................................................................... 14

4.1.1. System Architecture and overview .................................................................. 14

4.1.2. Deployment Diagram ....................................................................................... 14

4.1.3. Component Diagram ........................................................................................ 15

4.2. Algorithms Details .................................................................................................. 16

4.2.1. Convolutional Neural Network ........................................................................ 16

4.2.2. Transformer...................................................................................................... 18

4.2.3. Feedforward Neural network ........................................................................... 22

4.2.4. CTC Loss ......................................................................................................... 23

CHAPTER 5: IMPLEMENTATION AND TESTING .......................................... 24

5.1. Implementation ....................................................................................................... 24

5.1.1. Tools Used ....................................................................................................... 24

5.1.2. Implementation Details of Modules................................................................. 24

5.2. Testing..................................................................................................................... 29

5.2.1. Test Cases for Unit Testing.............................................................................. 29

5.2.2. Test Cases for System Testing ......................................................................... 32

5.2.3. Result Analysis ................................................................................................ 32

CHAPTER 6: CONCLUSION AND FUTURE RECOMMENDATIONS .......... 35

6.1. Conclusion .............................................................................................................. 35

6.2. Future Recommendations ....................................................................................... 35

References ………………………………………………………………………….. 36
Appendices ………………………………………………………………………….. 37
Logs …………………………………………………………………………………. 40

iv
List of Abbreviations

API: Application Program Interface


ASR: Automatic Speech Recognition
CER: Character Error Rate
CNN: Convolutional Neural Network
CTC: Connectionist Temporal Classification
E2E: End to End
GPU: Graphical Processing Unit
IDE: Integrated Development Environment
NASR: Nepali Automatic Speech Recognition
NLP: Natural Language Processing
WER: Word Error Rate

v
List of Figures
Figure 1.5: Agile Methodology ……………………………………………………... 3
Figure 3.1.1.3: Use Case Diagram of ASR Module ……………………………….... 8
Figure 3.2.4: Schedule of project organization ………………...…………………... 10
Figure 3.3.1: Class diagram of ASR Module ……………..……...………………… 10
Figure 3.3.2: Sequence Diagram of ASR Module ……………………..………........ 11
Figure 3.3.3: Activity Diagram of ASR System ………………...………………..... 12
Figure 4.1.1: ASR System Architecture ……………………………………………. 14
Figure 4.1.2: Deployment Diagram of ASR Model.……………………………....... 15
Figure 4.1.3: Component Diagram of ASR Model ………………………………… 15
Figure 4.2.1: Architecture of a CNN ……………………………………………….. 16
Figure 4.2.1.3: Convolution Operation …………………………………………….. 17
Figure 4.2.1.4: Max pooling ………………………………………………………... 18
Figure 4.2.2.1: Architecture of Transformer ………………………………..……… 19
Figure 4.2.2.2. Scaled dot-product attention ……………………………………...... 20
Figure 4.2.2.3: Multi-head attention ………………………………………………... 21
Figure 4.2.3: Feed Forward Neural Network ……...…..…………………………... 22
Figure 5.1.2.1: Self-supervised pre-training using unlabeled speech data ………........ 24
Figure 5.1.2.2: Feed-Forward Neural Network ……...……………………………... 26
Figure 5.1.2.3.1.3: Feed-forward neural network for fine-tuning ….....................…. 28

vi
List of Tables
Table 5.2.1: Test cases for unit testing ………………………………………….. 30-31
Table 5.2.2: Test cases for system testing ………………………………………….. 32
Table 5.2.3: Obtained result analysis ……………………………………………... 33-34

vii
Chapter 1: Introduction
1.1. Introduction

Automatic speech recognition, or ASR, refers to the problem of getting a program to
transcribe spoken language into text automatically (speech-to-text). In simpler terms,
ASR is the technology that converts the words a person speaks into written text.
ASR technologies have been advancing rapidly, and the most advanced ASR systems
developed today revolve around the extensive use of Natural Language Processing
(NLP). NLP enables conversation between machines and human beings that comes
closest to natural dialogue.
Nepali Automatic Speech Recognition (N-ASR) is a platform that mainly works on
transcribing spoken Nepali into the corresponding text. N-ASR allows users to input
audio in .wav or .mp3 format, and upon processing, the transcribed text is generated as
output. The main focus is on reducing the WER (Word Error Rate): after repeated
training and testing, the WER should be as low as possible.
This project aims to generate output in textual form, where the input is an audio file in
any supported format. The model combines a CNN and a Transformer, i.e., an attention-
based neural network, for self-supervised pre-training, with a feed-forward neural network
added on top of the pre-trained self-supervised model for supervised training on audio
segments and their respective transcriptions. The framework takes raw audio as input and
generates the output from the model [1].

1.2. Problem Statements

Transcribing spoken Nepali into textual form is relatively difficult. The language is vast
and complex, and because of this complexity there is not enough data to work with, as
little data is being generated and collected. The scarcity of readily available data makes
the problem more difficult. Another problem we encounter is variation in speech: people
speak with various tones and at various frequencies, which makes recognition difficult.

Training a fully connected network for speech recognition is also computationally
challenging.

Generally, speech contains noise and other external factors that affect its quality, and
degraded speech quality can cause problems in generating transcriptions.

1
1.3. Project Objectives

The objective of the project is to translate Nepali audio speech into text. The audio is given
as input in various formats, i.e., .wav, .mp3, etc.

The milestones of the project are as follows:

• To provide an easy-to-use API for integration into other applications.
• To provide a platform to input audio and get output that is as grammatically
correct as possible.
• To improve on the current Nepali ASR systems.

1.4. Scope and Limitations

1.4.1. Scope

The project's primary focus is to build a robust Nepali ASR system capable of transcribing
a given audio segment with low resource consumption and fast inference speed in mobile
and web apps.

The system will be helpful for any company or product that deals with Nepali speech, and
it will be especially helpful for people with disabilities.

With the addition of a language model for decoding, the prediction can be more
grammatically accurate.

1.4.2. Limitation

Despite proper training and testing, the N-ASR has the following limitations:

• The results of the system are data-dependent, and the system works well only with
clear, everyday speech.
• The model is not able to generate transcriptions for numbers and punctuation.
• Since users provide the input audio, the system's accuracy depends on the quality
of the input audio.

1.5. Development Methodology

For the algorithm, we decided to build a neural network that learns contextualized
speech representations from unlabeled speech data by randomly masking feature vectors
obtained from a Convolutional Neural Network before passing them to a transformer
neural network during self-supervised pre-training. Here, only unlabeled speech data is
used, pre-processed by removing noise and silence and sampled at 16 kHz.
In the second step, a linear layer is added on top of the pre-trained network to train the
model on labeled audio data for automatic speech recognition. The task of the linear
layer is a multi-class classification problem over 62 classes, i.e., 62 character tokens. This
approach is mainly helpful for low-resource languages such as Nepali, where no large
labeled audio datasets exist; by leveraging the power of self-supervised pre-training, the
project aims to build a robust Nepali ASR system. A minimal sketch of this two-stage idea
is shown below.
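The following sketch is a conceptual PyTorch illustration of the two stages described above; the layer sizes, depths, and masking ratio are assumptions chosen for the example, not the project's exact configuration.

import torch
import torch.nn as nn

class TinyWav2VecSketch(nn.Module):
    """Sketch: CNN feature encoder -> masking -> transformer -> linear character head."""
    def __init__(self, n_tokens=62, dim=256):
        super().__init__()
        # 1D CNN turns raw 16 kHz audio into a sequence of latent feature vectors.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_embedding = nn.Parameter(torch.randn(dim))
        self.head = nn.Linear(dim, n_tokens)   # added only for supervised fine-tuning

    def forward(self, wav, mask_ratio=0.5, pretraining=True):
        z = self.encoder(wav.unsqueeze(1)).transpose(1, 2)   # (batch, frames, dim)
        if pretraining:
            # Randomly replace a fraction of latent frames with a learned mask vector
            # before the transformer, as in self-supervised pre-training.
            mask = torch.rand(z.shape[:2], device=z.device) < mask_ratio
            z = torch.where(mask.unsqueeze(-1), self.mask_embedding, z)
        c = self.context(z)                                   # contextualized representations
        return c if pretraining else self.head(c)             # logits over character tokens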

Figure 1.5: Agile Methodology

Due to the iterative and incremental nature of the system, the agile approach best describes
the development methodology. The model is passed through multiple designing, training,
and testing iterations.

1.6. Report Organization

This report is organized into chapters as follows:

3
The first chapter includes overall details and an introduction to the project: a general
introduction, the objectives, the project scope and limitations, and the development
methodology.

Chapter 2 of this report consists of the literature review. Here we have summarized the
study of other similar systems. We have also mentioned how our system differs from
others and how it improves on the existing systems.

Results of the system analysis are elaborated in chapter 3. In this chapter, we discuss
the requirements for the system, how the system is feasible, and the system model
corresponding to the approach we have used.

The fourth chapter contains all diagrams, including the system architecture and component
design, and details of the algorithms used in the project.

Implementation and testing details are summarized in the fifth chapter. The details include
the tools used for the project, the implementation details, and the tests of the modules. The
result analysis is the essential part of this chapter and covers the overall progress of the system.

The final chapter includes the conclusion and further improvements that could be achieved.

4
Chapter 2: Background Study and Literature Review
2.1. Background Study:

Automatic Speech Recognition (ASR) is a technology under Natural Language Processing
(NLP) in computer science that transcribes speech into its equivalent written text.
ASR has captured the attention of the artificial intelligence community over the last few
decades.

ASR has numerous applications in the healthcare system, banking, marketing, home
automation, etc. Researchers have obtained impressive improvements in performance,
especially in the English language. They have created cutting-edge models that outperform
their forerunners.

The journey from simple frequency analysis for speaker recognition to intricate end-to-end
online speech recognition may seem remarkable, but English has always been the language
receiving the most attention. Less commonly spoken languages, such as Nepali, receive little
attention, partly because they are not widely spoken and, more importantly, because there
are insufficient resources for those working in the field.

There have been a few attempts to break the Nepali language's pattern of ineffective
speech recognition systems, but none have yielded satisfactory results. This technology
should be available to benefit speakers of all languages, including Nepali. The service is in
high demand among the general public, large corporations, and students, so it will only be a
matter of time before we master it. We do not set an unreasonable goal of completely
overhauling Nepal's current ASR systems; instead, we demonstrate our methods by merging
cutting-edge technologies to create a prototype application. We believe that by using our
methodologies, we will be able to help this field progress and perform better.

2.2. Literature Review

Automatic speech recognition for the English language has progressed with an abundance
of data and ongoing research. The accuracy of automatic speech recognition (ASR) has
been significantly boosted since deep neural network (DNN) based hybrid modeling was
adopted a decade ago. Both supervised and self-supervised approaches have made massive
progress in this field.

5
These approaches use deep learning techniques like CNNs, RNNs, and different variants of
RNNs. The Hidden Markov Model and Gaussian Mixture Model with hands-on feature
engineering are the traditional approaches, but more sophisticated deep learning methods
have surpassed them in performance by a significant margin.

End-to-End (E2E) systems have outperformed traditional hybrid models in academia and
industry. The most popular E2E models are the Attention-based Encoder-Decoder (AED) [1]
and the Recurrent Neural Network Transducer (RNN-T) [2].

In the domain of Nepali literature, HMM-based Isolated Word Nepali Speech Recognition
[3] implements an HMM (Hidden Markov Model) based speaker-independent isolated-word
Automatic Speech Recognition (ASR) system for the Nepali language.

Nepali Speech Recognition using RNN-CTC Model [4] presents a neural-network-based
Nepali speech recognition model. An RNN (Recurrent Neural Network) is used for
processing the sequential audio data, and CTC (Connectionist Temporal Classification) is
used with the RNN to train over the audio data; CTC is a probabilistic approach to
maximizing the occurrence probability of the desired labels from the RNN output. After
processing through the RNN and CTC layers, Nepali text is obtained as output. This paper
also defines a character set of 67 Nepali characters required for transcribing Nepali speech
to text.

These E2E models are data-hungry. If the data size is small, performance drops
significantly, and data augmentation techniques help but are not sufficient to prevent
overfitting. Recent advances have therefore shown the usefulness of self-supervised models,
where the E2E model is first pre-trained with unlabeled data and then fine-tuned on
low-resource labeled data [5], [6]. This pre-training technique has not been applied to the
Nepali language, which is what we plan to implement in this project.

The self-supervised learning of speech representations masks latent representations of the
raw waveform and solves a contrastive task over quantized speech representations [7].

6
Chapter 3: System Analysis
3.1. Requirement Analysis

3.1.1. Functional Requirement

The functional requirements of a system specify how the system should react in the
situations it is put in and how it produces output for a given input. The following are
the functional requirements for the Nepali ASR system:
1. Record the audio.
2. Upload the pre-recorded audio file.
3. Trim the audio file.
4. Clear the audio file (recorded/uploaded)
5. Start processing and show the Devanagari transcript as output when the audio file
is submitted.
6. Save the audio file and transcript inside a folder on the server when a flag is raised.
3.1.1.1. Use case Diagram

It shows the interaction between the system and the user in a particular environment. The
use case model contains actors and the use cases. The actors are the external entities, and
the use cases are the system's functions.

7
Figure 3.1.1.3: Use Case Diagram of ASR Module

3.1.2. Non-Functional Requirement

3.1.2.1. Accuracy

The model should have high accuracy for clear audio segments without noise. The accuracy
metrics we will be using are the word error rate and the character error rate. The system's
accuracy depends on how accurately it can transcribe the audio. During inference, the
accuracy of the model also depends on the audio provided by the user.

3.1.2.2. Efficiency

The end system is accessed through an API and a web-based system, which should have
low latency, i.e., low inference time.

8
3.1.2.3. Availability

The end system should run with as little downtime as possible. The web-based
system should work on all major web browsers.

3.1.2.4. Reliability

The model should have few errors and no drastically significant errors. Also, users are able
to flag cases where the transcription is incorrect. The system saves the flagged audio and
transcription in the file system; these are reviewed later and, with proper processing, used
as additional training data.

3.2. Feasibility Analysis

Here, we have studied all the feasibility aspects of the project under consideration to check
whether the project is feasible with the decided requirements and the available information,
technologies, and budget.

3.2.1. Technical

The first step will be data preprocessing and cleaning. The project will use a deep
learning architecture, the combination of a CNN and a transformer, for self-supervised
pre-training using unlabeled data. A feed-forward neural network is used to fine-tune the
pre-trained model for ASR. For training the system, we will be leveraging the GPU in a
personal laptop and cloud GPUs if available.

3.2.2. Operational

Operational feasibility refers to solving problems and building new systems with the help
of the proposed system. It takes the ideas and opportunities developed during the initial
phase and the insights from requirement gathering to build a new system. The proposed
system can be used in many different applications where voice-to-text is applicable.

3.2.3. Economic

The project will only use a usual laptop specification for building the system and cloud
GPUs, which are pay-as-you-go, making it economically feasible. Heavy computing
resources will not be needed after the system is trained, and inference can be carried out
on smaller computing devices too.

9
3.2.4. Schedule

Figure 3.2.4: Schedule of project organization


3.3. Analysis

3.3.1. Object modeling using Class and Object Diagrams

In the object-oriented approach, a class diagram defines and provides the overview and
structure of a system in terms of classes, objects, attributes, and methods and their
relationship. The class diagram can also be termed as a type of structure diagram which
provides a conceptual model and architecture of the system being developed.

Figure 3.3.1: Class Diagram of ASR Module


10
The class diagram shows that we have classes such as User, ASR Model, CNN Model,
Transformer Model, and Feedforward NN Model. Each of the classes has its individual
properties and functionalities. An individual user feeds raw audio as input, and the
model generates the respective Devanagari script of the audio feed. The audio passes
through three different models.
Individual models have their respective responsibilities and outputs, each of which is
taken as input by the subsequent model. First, the CNN Model takes the raw audio waveform
as input and generates a latent speech representation, which is fed to the Transformer Model,
which in turn generates context speech representation vectors. The obtained vectors are fed
to the Feedforward NN Model, which generates the most probable characters as output.

3.3.2. Dynamic Modeling using sequence diagram

The sequence diagram shows the interaction between objects in sequential order. It shows
how objects operate with one another and in what order. The following sequence
diagram depicts the flow of information in Nepali Automatic Speech Recognition.

Figure 3.3.2: Sequence Diagram of ASR Module

The sequence diagram above describes the sequential interaction of the system from input
to generated output. First, raw audio is given as input by a user to the ASR system. The
system processes the input in sequential order: it passes the audio waveform vector to the
CNN model, which generates the audio's latent speech representation; the CNN model acts
as a feature encoder. The generated latent speech representation is passed into the
Transformer model, which generates the context speech representation, and the feed-
forward model generates the final output. The feedforward model produces a probability
distribution over the vocabulary characters, where greedy decoding is used to select the
most probable character.
3.3.3. Process modeling using Activity diagram

An activity diagram is essentially an advanced form of a flowchart that describes the
model's flow. The activity diagram follows a behavioral approach, showing the flow
from one activity to another from start to end.

Figure 3.3.3: Activity Diagram of ASR System

12
The activity diagram elaborates the flow of the whole system from the starting state to the
ending state. The activity starts with the input that the user provides to the system, which
takes raw audio as input. This input is then fed to the model of the Nepali ASR system.
Different types of neural networks are used for different purposes within this model. The
input is first processed by a 1D-CNN, which provides the latent representation of the audio.
It is then passed to the transformer, which generates the context representation of the audio
from the quantized latent speech representation. After the generation of the context speech
representation, a softmax layer in a simple feed-forward linear neural network is used to
calculate the probability of each character. The character with the highest probability is
generated as the output of the whole model. If the system predicts the output without any
errors (except prediction error), the Devanagari transcript of the corresponding audio is
generated. This leads to the end state of the system.

13
Chapter 4: System Design
4.1. Design

An object-oriented approach is used for the system design. We have developed an
architecture for the system with a class diagram, sequence diagram, and activity diagram
to demonstrate how the different models in the system interact to provide the collective
functionality.

4.1.1. System Architecture and overview

The system takes raw spoken audio as input and processes it to generate output in the
Nepali Devanagari script. The raw audio is processed by the CNN, which acts as a
feature encoder to extract a feature vector, the latent speech representation. The
transformer model is responsible for generating context vectors capable of incorporating
the sequence information. The final feed-forward layer gives the probability distribution
over the Nepali character tokens. The final output is selected using a greedy method that
predicts the most probable character.

Figure 4.1.1: ASR System Architecture

4.1.2. Deployment Diagram

This shows the deployment architecture of the ASR system. The user communicates with
the system through a web browser, where the user can upload an audio file or record audio
directly from the browser. The ASR system is containerized with Docker and deployed on
Azure. An API is also provided for communication from external application programs.
The recorded audio and the predicted transcript are saved in the file system for later
analysis.

Figure 4.1.2: Deployment Diagram of ASR Model

4.1.3. Component Diagram

The raw audio is fed into the system, where it is converted into a vector to feed into the
ASR model. The ASR module passes the vector through a series of operations, which
finally yields the transcript as the result.

Figure 4.1.3: Component Diagram of ASR Model

15
4.2. Algorithms Details

4.2.1. Convolutional Neural Network

A convolutional neural network is a feed-forward neural network that is generally used to
analyze visual images by processing data with a grid-like topology. It can also be used for
one-dimensional data, such as speech or time-series data, to extract features from
sequences of observations.

Figure 4.2.1: Architecture of a CNN

CNN is composed of multiple layers. They are:

4.2.1.1. Input Layer

The input layer represents the input sequence fed into the CNN. In the case of an image,
the input is three-dimensional (e.g., an RGB image); in the case of audio data, it is
one-dimensional.

4.2.1.2. Convolutional Layers

The convolutional layers are the foundation of a CNN, as they contain the learned kernels
(weights), which extract features that distinguish different inputs from one another. A unique
kernel is used for each convolution operation to produce the corresponding convolutional
neuron's output, or activation map.

The convolutional neuron performs an elementwise dot product between a unique kernel and
the previous layer's corresponding neuron outputs. This yields as many intermediate results
as there are individual kernels; the convolutional neuron's output is the sum of all these
intermediate results together with the learned bias.

16
Since convolution features sparse interactions, the parameters do not have to describe an
interaction between every input unit and every output unit. This is obtained by making the
kernel smaller than the input, which allows features to be extracted with far less
computation. Moreover, the parameter-sharing property of CNNs is beneficial.

The size of the kernels is a hyperparameter specified by the designers of the network
architecture. To produce the output of a convolutional neuron (activation map), we perform
an elementwise dot product between the output of the previous layer and the unique kernel
learned by the network.

Figure 4.2.1.3: Convolution Operation

There are multiple hyperparameters in the convolution layer. They are:

1. Padding is often necessary when the kernel extends beyond the activation map.
Padding conserves data at the borders of activation maps, which leads to better
performance, and it can help preserve the input's spatial size, which allows an
architecture designer to build deeper higher-performing networks.

2. Kernel size, often also referred to as filter size, refers to the dimensions of the
sliding window over the input.

3. The stride indicates how many pixels the kernel should be shifted over at a time.

4.2.1.3. Pooling Layer

The pooling layer replaces the network's output at specific locations by deriving a summary
statistic of the nearby outputs. This helps to reduce the spatial size of the representation,
which decreases the required amount of computation and the number of weights. The
pooling operation is applied to every slice of the representation individually.

17
Figure 4.2.1.4: Max pooling

4.2.1.4. Flatten Layer

This layer converts a three-dimensional layer in the network into a one-dimensional vector
to fit the input of a fully-connected layer for classification.
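A small example tying these layers together on one-dimensional audio-like input; the channel counts and kernel settings are illustrative, not the project's configuration.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 16000)                       # (batch, channels, samples): 1 second of 16 kHz audio
conv = nn.Conv1d(in_channels=1, out_channels=8,    # 8 learned kernels
                 kernel_size=10, stride=5, padding=0)
pool = nn.MaxPool1d(kernel_size=2)                 # summary statistic over neighbouring outputs
feat = pool(torch.relu(conv(x)))
print(conv(x).shape)   # torch.Size([1, 8, 3199]): (16000 - 10) // 5 + 1 output positions
print(feat.shape)      # torch.Size([1, 8, 1599]) after max pooling halves the length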

4.2.2. Transformer

The transformer is an encoder-decoder-based neural network that uses an attention
mechanism to focus on specific parts of the input sequence while the decoder predicts the
output sequence. It differs from existing sequence-to-sequence models because it does not
employ any recurrent networks (GRU, LSTM).

18
Figure 4.2.2.1: Architecture of Transformer

The Encoder is on the left, and the Decoder is on the right. Both the Encoder and the Decoder
are composed of modules that can be stacked on top of each other multiple times, as indicated
by Nx in the figure. The modules consist mainly of Multi-Head Attention and Feed-Forward
layers. The inputs and outputs (target sequences) are first embedded into an n-dimensional
space.
A critical part of the model is the positional encoding. Since there are no recurrent
networks that can remember how sequences are fed into the model, every part of the
sequence needs to be given a relative position, because a sequence depends on the order of
its elements. These positions are added to each word's embedded representation
(n-dimensional vector).

The multi-head attention is defined as:

Figure 4.2.2.2. Scaled dot-product attention

It describes the attention-mechanism by the following equation:
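For reference, the scaled dot-product attention of the original Transformer formulation, which this section paraphrases, is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where d_k is the dimension of the key vectors.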

Q is a matrix that contains the query (vector representation of one part in the sequence), K
is all the keys (vector representations of all the parts in the sequence), and V is the values,
which are again the vector representations of all the parts in the sequence.

For the encoder and decoder multi-head attention modules, V consists of the same
sequence as Q. However, for the attention module that connects the encoder and the
decoder sequences, V is different from the sequence represented by Q.

The values in V are multiplied and summed with some attention-weights a, where the
weights are defined by:
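a = softmax(QK^T / √d_k)

i.e., the softmax-normalized scaled dot products between the queries and the keys.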

20
The weights a are defined by how each word of the sequence (represented by Q) is
influenced by all the other words in the sequence (represented by K).

Figure 4.2.2.3: Multi-head attention

The figure above shows how this attention mechanism can be parallelized into multiple
mechanisms. The attention mechanism is repeated multiple times with linear projections of
Q, K, and V. This allows the system to learn from different representations of Q, K, and V,
which is beneficial to the model. These linear projections are obtained by multiplying Q, K,
and V by weight matrices W that are learned during training.

The matrices Q, K, and V are different for each position of the attention modules in the
structure, depending on whether they are in the encoder, the decoder, or between the encoder
and decoder, since we want to attend to either the whole encoder input sequence or a part of
the decoder input sequence. The multi-head attention module that connects the encoder and
decoder ensures that the encoder input sequence is taken into account together with the
decoder input sequence up to a given position.

A pointwise feed-forward layer follows the multi-head attention in both the encoder and
decoder. The feed-forward network has identical parameters for each position, so it can be
described as a separate, identical linear transformation of each element of the given
sequence.
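A compact sketch of scaled dot-product attention and its multi-head parallelization in PyTorch; the dimensions and head count are illustrative assumptions, not the project's exact transformer code.

import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # similarity of every query with every key
    weights = scores.softmax(dim=-1)                            # the attention weights "a"
    return weights @ v                                          # weighted sum of the values

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # learned projection for queries
        self.w_k = nn.Linear(d_model, d_model)   # learned projection for keys
        self.w_v = nn.Linear(d_model, d_model)   # learned projection for values
        self.w_o = nn.Linear(d_model, d_model)   # output projection after concatenating heads

    def forward(self, q, k, v):
        b = q.size(0)
        split = lambda x: x.view(b, -1, self.h, self.d_k).transpose(1, 2)  # (b, heads, seq, d_k)
        out = scaled_dot_product_attention(split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)))
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)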

21
4.2.3. Feedforward Neural network

A Feed Forward Neural Network is an artificial neural network in which the connections
between nodes do not form a cycle.

Figure 4.2.3: Feed forward neural network

The feedforward neural networks comprise the following components:


4.2.3.1. Input layer:

This layer comprises neurons that receive the input and transfer it to the other layers of the
network. The number of neurons in the input layer must equal the number of features or
attributes in the dataset.

4.2.3.2. Output layer

This layer is the forecasted feature that depends on the type of model being built.

4.2.3.3. Hidden layer

The hidden layers are positioned between the input and output layers. The number of
hidden layers depends on the type of model. Hidden layers have several neurons that
apply transformations to the input before passing it on. The weights in the network are
constantly updated to improve its predictions.
22
4.2.3.4. Neuron weights:

The strength or magnitude of the connection between two neurons is called a weight.
Input weights can be compared to the coefficients in linear regression. Weight values are
usually small and typically fall within the range of 0 to 1.

4.2.3.5. Neurons:

The feedforward network has artificial neurons, which are adapted from biological neurons.
Artificial neurons are the building blocks of the neural network. The neurons work in two
steps: first, they compute the sum of the weighted inputs, and second, they apply an
activation function to normalize the sum.

The activation function can be either linear or nonlinear. Weights are associated with each
input of the neuron, and the network learns these weights during the training phase.

4.2.3.6. Activation Function:

This is the decision-making center at the neuron's output. The neuron makes a linear or
nonlinear decision based on the activation function, which also prevents neuron outputs
from growing unboundedly through the cascading effect of passing through many layers.
The three most important activation functions are sigmoid, tanh, and the Rectified Linear
Unit (ReLU).

• Sigmoid: It maps the input values within the range of 0 to 1.


• Tanh: It maps the input values between -1 and 1.
• Rectified Linear Unit: This function allows only positive values to flow
through; negative values are mapped to 0.

4.2.4. CTC Loss

CTC loss is used to align the input and output sequences when the input is continuous,
the output is discrete, and there are no clear element boundaries that can be used to map the
input to the elements of the output sequence.

CTC works in two modes:

• CTC Loss (during training): There is a ground-truth target transcript, and the
network is trained to maximize the probability of outputting that correct transcript.

• CTC Decoding (during inference): Here, we don't have a target transcript to refer
to and have to predict the most likely sequence of characters (see the example after
this list).
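A minimal example of the training-mode behaviour using PyTorch's built-in CTC loss; the shapes, blank index, and target lengths are illustrative assumptions.

import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=61)                  # index of the blank/padding class is an assumption here
T, B, C = 50, 2, 62                         # input frames, batch size, number of character classes
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=-1)  # per-frame log-probabilities
targets = torch.randint(low=0, high=60, size=(B, 10))                      # ground-truth character indices
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # training mode; at inference a decoder (e.g., greedy) is used instead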

23
Chapter 5: Implementation and Testing
5.1. Implementation

5.1.1. Tools Used

Various tools used in the project are given as follows:


Frontend: HTML, CSS
Backend: Python, PyTorch, Gradio, Docker, Azure App Service, HuggingFace
IDEs and other tools: JupyterLab, Visual Studio Code, FFmpeg
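Since the web interface is built with Gradio (listed above), a minimal sketch of such an app is shown here; the transcribe function body is a placeholder, not the project's actual inference code.

import gradio as gr

def transcribe(audio_path):
    # Placeholder: load the fine-tuned model and run greedy decoding on the
    # recorded or uploaded audio file, returning the Devanagari transcript.
    return "transcribed Devanagari text"

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),   # lets the user record or upload audio in the browser
    outputs="text",
    allow_flagging="manual",            # users can flag wrong transcriptions for later review
)
demo.launch()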

5.1.2. Implementation Details of Modules

5.1.2.1. Self-Supervised Pre-training


The pre-training process uses a contrastive task to train on unlabeled speech data. A mask
is first randomly applied in the latent space representation, i.e., the output of the 1D-CNN,
where ~50% of the projected latent feature vectors are masked. The masked latent speech
representation is quantized and passed into a transformer network whose objective is to
reconstruct the masked vectors as closely as possible.

Figure 5.1.2.1: Self supervised pre-training using unlabeled speech data

The multi-layer convolutional feature encoder f: X → Z takes raw audio X as input and
outputs latent speech representations z_1, z_2, …, z_T. These are then fed to a Transformer
g: Z → C to build contextual representations c_1, c_2, …, c_T capturing information from
the entire sequence.

24
The transformer neural network uses a contrastive loss function. The loss is defined as:
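(The formula below reproduces the contrastive loss of wav2vec 2.0 [5], which this pre-training follows; the symbols are taken from that paper.)

L_m = -log [ exp(sim(c_t, q_t) / κ) / Σ_{q̃ ∈ Q_t} exp(sim(c_t, q̃) / κ) ]

Here c_t is the context vector at a masked time step t, q_t is the true quantized latent at that step, Q_t is a set of candidates consisting of q_t and K distractors sampled from other masked steps, and κ is a temperature constant.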

where similarity is cosine similarity:
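sim(a, b) = a^T b / (||a|| ||b||)

i.e., the cosine of the angle between the context vector and the quantized candidate.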

The contrastive loss encourages high similarity with the positive (true quantized) vectors
and penalizes high similarity scores with the negative (distractor) vectors.

5.1.2.2. Training Phase

5.1.2.2.1. Data Selection

OpenSpeech has collected 402 hours of unlabeled data [8] extracted from YouTube, which
we used to further pre-train Facebook's open-source wav2vec2 model, originally trained on
multilingual data. We also collected our own data samples.

5.1.2.2.2. Data pre-processing:

Storing audio as raw waveforms is computationally expensive and takes a large amount of
storage. So, we converted the audio dataset into array format and stored it in Apache Arrow
format, which takes less storage and is efficient for accessing the data.

5.1.2.2.3. Training step:

• In the training step, the audio arrays are fed to the feature encoder, i.e., the CNN
network. The feature encoder contains seven blocks, and the temporal convolutions
in each block have 512 channels with strides (5, 2, 2, 2, 2, 2, 2) and kernel widths
(10, 3, 3, 3, 3, 2, 2); a small sanity check of what these settings imply is shown after
this list. The encoder network gives a 1024-dimensional quantized speech
representation.

• 50% of the quantized speech representations are masked and then fed to a transformer
network whose output dimension is also 1024. The task of the transformer network is
to predict the masked quantized vectors; here the contrastive loss function is used. The
transformer network contains 12 transformer blocks, a model dimension of 1024, and
16 attention heads. The parameters and hyperparameters for these models are set to
the defaults in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech
Representations [7].

25
• With a batch size of 16, the model was trained for two days on a V100 GPU provided
by HuggingFace. The Adam optimizer was used to train the network, along with a
learning-rate warmup where the learning rate is 5 * 10^-4.
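The following worked computation checks what the quoted strides and kernel widths imply for the frame rate, assuming 16 kHz input.

# Feature-encoder configuration quoted above (7 temporal convolution blocks).
strides = [5, 2, 2, 2, 2, 2, 2]
kernels = [10, 3, 3, 3, 3, 2, 2]
sample_rate = 16000

total_stride, receptive_field, jump = 1, 1, 1
for k, s in zip(kernels, strides):
    receptive_field += (k - 1) * jump   # raw-audio samples seen by one output frame
    jump *= s
    total_stride = jump

print(total_stride)                # 320 samples -> one latent vector every 20 ms
print(sample_rate / total_stride)  # ~50 latent vectors per second of audio
print(receptive_field)             # 400 samples = 25 ms of audio per vector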

5.1.2.3. Fine-tuning

After pre-training the CNN + transformer network with masking, a feedforward neural
network is added and trained in a supervised fashion on labeled data, i.e., audio and its
transcription. The objective of the feedforward neural network is a multi-class
classification problem with 62 classes, i.e., the Nepali character tokens.

Figure 5.1.2.2: Feed-Forward Neural Network

The output of the transformer network, c_1, c_2, …, c_T, which is the contextualized speech
representation, acts as the input to the feed-forward network.

For the final prediction, which is a multi-class classification, a softmax activation function
is used that outputs a predicted probability distribution over the character tokens, and the
token with the highest probability is selected.

The loss function used for the feedforward network is the CTC loss.

5.1.2.3.1. Training Phase

5.1.2.3.1.1 Dataset
Here, we combine multiple open-source datasets: the OpenSLR High-quality TTS data for
Nepali [9] and the Large Nepali ASR training data set [6], both labeled for ASR. We also
use the crowdsourced data collected by OpenSpeech [8]. The total amount of data is around
400 hours.

26
5.1.2.3.1.2 Data pre-processing
Storing the raw audio data takes vast amounts of storage, and converting it to an array every
time incurs memory overhead.

We convert all the raw audio waveforms to arrays and store them in Apache Arrow format.
Also, we ignore special characters such as '[\,\?\.\!\-\;\:\ "\ "\%\ '\" \']' by replacing them
with blank space.
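A minimal sketch of this kind of preparation; the exact regular expression and dataset layout used in the project differ, and librosa is an assumption for loading and resampling.

import re
import numpy as np
import librosa

CHARS_TO_IGNORE = r"[,?.!\-;:'\"%]"   # illustrative subset of punctuation to strip

def prepare_example(audio_path, transcript):
    # Load the audio and resample to 16 kHz, the rate expected by the model.
    waveform, sr = librosa.load(audio_path, sr=16000)
    # Replace punctuation-like characters in the transcript with blank space.
    clean_text = re.sub(CHARS_TO_IGNORE, " ", transcript)
    return {"audio": np.asarray(waveform, dtype=np.float32), "text": clean_text}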

5.1.2.3.1.3 Training step:


• After pre-training, we fine-tune the learned representations on labeled data by adding
a randomly initialized output layer on top of the transformer to predict characters.
The input dimension of the linear layer is 1024, which is the output dimension of the
transformer network. The hidden layer's size is 4096, and the output dimension is
the number of Nepali vocabulary characters.

• The vocabulary character is already defined in JSON format in the vocab JSON file;
these are the 62 Nepali character tokens. They are :
{"ड": 0, "इ": 1, "ढ": 2, "फ": 3, "ठ": 4, "ृ": 5, "ृ ": 6, "औ": 8, "द": 9, "ञ": 10, "ृ": 11,
"ऋ": 12, "घ": 13, "अ": 14, "ई": 15, "ट": 16, "ग": 17, "ृ": 18, "झ": 19, "ृ": 20, "िृ":
21, "ह": 22, "ृ": 23, "छ": 24, "ष": 25, "ङ": 26, "प": 27, "ऐ": 28, "र": 29, "ृ": 30, "ऊ":
31, "ब": 32, "थ": 33, "व": 34, "उ": 35, "भ": 36, "ृ ": 37, "ज": 38, "ए": 39, "ृ ": 40, "त":
41, "आ": 42, "ख": 43, "ल": 44, "ृ ": 45, "ृ": 46, "क": 47, "स": 48, "ओ": 49, "ध": 50,
"ण": 51, "म": 52, "श": 53, "न": 54, "ृ": 55, "ृ ": 56, "च": 57, "य": 58, "ॠ": 59, "|": 7,
"[UNK]": 60, "[PAD]": 61}
[UNK] is the unknown token, and [PAD] is the padding token used by the CTC loss
while decoding.

27
Figure 5.1.2.3.1.3: Feed-forward neural network for fine-tuning

28
• 10% of the total data is used as a validation set. With a batch size of 16, the model
is fine-tuned for two days on a V100 GPU. The Adam optimizer is used as the
optimization algorithm. The model is trained for just one epoch because of a lack of
computing resources and time. The Connectionist Temporal Classification (CTC) loss
is used while fine-tuning the model. The softmax activation function is applied to
get a vector of probabilities over the 62 vocabulary characters. The prediction is
made greedily, i.e., the most probable character is selected.

5.1.2.4. Prediction Phase


• Input: The input from the user is a raw audio segment in any supported
format.
• Pre-processing: To feed the input to the model, the audio is resampled to 16 kHz
and converted to an audio array. The array is then fed to the trained model, and
a greedy decoding method is used to select the most probable character.
• Prediction: The prediction from the model is a 62-dimensional probability vector
per frame, e.g., [0.0001, 0.7, 0.1, …] (1×62). With greedy decoding, the most
probable token is selected, and reverse mapping is done to get the Nepali vocabulary
character from the vocab JSON file (a sketch of this decoding step is shown after
this list).
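The sketch below shows what greedy decoding plus reverse vocabulary mapping can look like; the vocabulary file name and the treatment of [PAD] follow the description above but are otherwise illustrative assumptions.

import json
import torch

def greedy_decode(logits, vocab_path="vocab.json", pad_token="[PAD]"):
    # logits: (frames, 62) output of the fine-tuned model for one utterance.
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)                          # {"क": 47, ...}
    id_to_char = {i: c for c, i in vocab.items()}     # reverse mapping from index to character
    ids = torch.argmax(logits, dim=-1).tolist()       # most probable token per frame
    chars, prev = [], None
    for i in ids:
        if i != prev and id_to_char[i] != pad_token:  # collapse repeats, drop padding/blank frames
            chars.append(id_to_char[i])
        prev = i
    return "".join(chars)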

5.2. Testing

Testing is the methodology for identifying and determining whether the system works as
intended. The efficiency and effectiveness of the system are evaluated in the testing phase.
Through testing, bugs, errors, and project progress can be identified and checked for quality
assurance, validation, and verification.

5.2.1. Test Cases for Unit Testing

Unit testing deals with the functional correctness of the system. It splits the system into
individual modules and tests each of them, which helps with early identification of the
system's flaws and errors.

The following table contains the test cases for unit testing:

29
Test case: Take audio file as input
Steps to be executed: Record raw audio; upload raw audio; submit the audio.
Expected results: Audio should be recorded if the user chooses to record, and an audio file should be uploaded if the user chooses to upload. On submitting, the model should start processing.
Obtained results: Audio is recorded when the “record” button is pressed; the audio file is successfully uploaded; processing starts on pressing the submit button.
Pass/Fail: Pass

Test case: CNN processing
Steps to be executed: Take in audio as input; generate the latent speech representation.
Expected results: The latent speech representation should be generated automatically.
Obtained results: The latent speech representation is generated automatically.
Pass/Fail: Pass

Test case: Transformer processing
Steps to be executed: Take in the quantized latent speech representation; generate the context speech representation.
Expected results: The context speech representation should be generated automatically.
Obtained results: The context speech representation is generated automatically.
Pass/Fail: Pass

Test case: FFNN processing
Steps to be executed: Take in the context speech representation vector; generate a probability for each character; output the most probable character.
Expected results: The most probable character should be generated.
Obtained results: The most probable character is generated.
Pass/Fail: Pass

Test case: Output the transcript
Steps to be executed: Show the output corresponding to the audio input.
Expected results: The Devanagari script of the audio should be generated.
Obtained results: The Devanagari script of the audio is generated.
Pass/Fail: Pass

Test case: Flag the output
Steps to be executed: Flag a wrong output.
Expected results: The flagged output should be saved in a folder on the server with its corresponding input.
Obtained results: The flagged output is saved in a folder on the server with its corresponding input.
Pass/Fail: Pass

Test case: Clear the input
Steps to be executed: Clear the previously provided input.
Expected results: On pressing the “Clear” or “x” button, the application should clear the previously recorded or uploaded audio.
Obtained results: The previously recorded or uploaded audio is cleared.
Pass/Fail: Pass

Test case: Edit the input
Steps to be executed: Trim the input (audio).
Expected results: Using the slider, the user should be able to trim the audio.
Obtained results: The user is able to trim the audio.
Pass/Fail: Pass

Table 5.2.1: Test cases for unit testing

31
5.2.2. Test Cases for System Testing

System testing is performed through the web interface, where we recorded multiple audio
samples and examined the predicted output. The significant findings are:
1. The transcribed text is not entirely grammatically correct.
2. The model does not transcribe numbers in Devanagari script.
3. The inference speed is around 10-15 seconds for a single sentence.

Audio Input Output


न गढङग मलख सडक िवस्त रक ठक्क ल ग्य -, क म न गढङग मलख सडक िवस्त रक ठक्क ल ग्य क म
गद ा ज म बढन गद ा ज म बढन
िवस्फ ृटक पद था िनय ातम भ रतक र क, नप लम िवस्फ टक पद था िनय ातम भ रतक र ग नप लम िवक स
िवक स पररय जन प्रभ िवत हुन पररय जन प्रभ िवत हुन

िनलिबबत गभनार अिधक र क पक्षम क णसभ , िनलिबबत गवनार अिधक र क पक्षम क ण सभ सरक र
सरक रिवरुद्ध क ल ब्य नर िवरुद्ध क ल प्य नर

प खर क मयरम एकीकत सम जव द ब ट १३ जन प खर क महरम एकीकत सम जव द ब ट तह्रजन


िसफ ररस िसफ ररस

Table 5.2.2: Test cases for system testing


5.2.3. Result Analysis

The word error rate of the model on the OpenSLR 43 corpus is 27%, while the character
error rate is 8.3%. Also, during training, 10,000 samples were held out as a test set, on which
the word error rate is 34%.
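For reference, word error rate and character error rate can be computed as the edit distance between the reference and the hypothesis, normalized by the reference length; the small self-contained sketch below is not the project's evaluation script.

def edit_distance(ref, hyp):
    # Standard dynamic-programming Levenshtein distance over a sequence of tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def wer(reference, hypothesis):
    # Word error rate: word-level edit distance over the number of reference words.
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())

def cer(reference, hypothesis):
    # Character error rate: character-level edit distance over the reference length.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)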
The following hyperparameters were used:
learning_rate: 6e-05
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 32
optimizer: Adam with betas= (0.9,0.999) and epsilon= 1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 1000
num_epochs: 1

32
The following table summarizes the results obtained:

Epoch Training Loss Step Validation Loss WER

0.04 0.3468 400 0.2624 0.3479

0.08 0.2792 800 0.2696 0.3490

0.12 0.245 1200 0.2751 0.3502

0.16 0.2453 1600 0.2754 0.3523

0.2 0.2332 2000 0.2779 0.3517

0.24 0.2321 2400 0.2775 0.3528

0.28 0.2708 2800 0.2764 0.3533

0.32 0.2709 3200 0.2723 0.3544

0.36 0.2715 3600 0.2739 0.3545

0.4 0.2732 4000 0.2707 0.3498

0.44 0.2643 4400 0.2696 0.3499

0.47 0.2682 4800 0.2672 0.3492

0.51 0.2687 5200 0.2644 0.3474

0.55 0.269 5600 0.2619 0.3502

0.59 0.2675 6000 0.2606 0.3477

0.63 0.2656 6400 0.2597 0.3463

0.67 0.2667 6800 0.2607 0.3458

33
0.71 0.2639 7200 0.2601 0.3480

0.75 0.2631 7600 0.2582 0.3447

0.79 0.2589 8000 0.2577 0.3438

0.83 0.2554 8400 0.2557 0.3439

0.87 0.2687 8800 0.2546 0.3438

0.91 0.2574 9200 0.2537 0.3434

0.95 0.2623 9600 0.2530 0.3433

0.99 0.2675 10000 0.2530 0.3426

Table 5.2.3: Obtained result analysis

34
Chapter 6: Conclusion and Future Recommendations
6.1. Conclusion

The primary intention of the project was to build a working Nepali Automatic Speech
Recognition system. We successfully used a self-supervised pre-training methodology
with unlabeled data, which helped build a robust ASR system for a low-resource language
like Nepali.

Since the start of the project, we researched several methodologies for building ASR
systems; the purely supervised learning methods were data-hungry, needed large labeled
datasets, and their performance was not very accurate. So, we planned to go with a
self-supervised learning methodology. As per our plan, we developed a system that operates
as expected.

However, our system is not 100% accurate. It is affected by noise and by speech at very
different frequencies, it cannot predict numbers or punctuation, and the transcriptions are
not always grammatically correct. Another significant issue is that, because of a lack of
computing resources, we could not train the system for as long as we wanted, tune
hyperparameters, or leverage all the available datasets, which has undoubtedly affected
the system's performance.

6.2. Future Recommendations

The system can continually be improved. We can build a language model for decoding,
which should be able to transcribe the text with fewer grammatical errors. With sufficient
compute resources, we can perform hyperparameter tuning, which should increase the
system's performance.

In addition, since we are using self-supervised learning, collecting more diverse unlabeled
audio datasets and using them to train the model is always an option.

Furthermore, to use the model in a real-world application, the inference time and model
size should be reduced, which can be done with techniques like quantization and pruning
without hurting the system's performance.
35
References

[1] M. Zeineldeen, "Cornell University," 12 April 2021. [Online]. Available:


https://arxiv.org/abs/2104.05544v2.
[2] T. Makino, "Cornell University," 8 November 2019. [Online]. Available:
https://arxiv.org/abs/1911.04890.
[3] M. K. Sharma, A. Gajurel, A. Pokhrel and B. Joshi, "HMM Based Isolated Word
Nepali Speech Recognition".
[4] P. Regmi, A. Dahal and B. Joshi, "Nepali Speech Recognition using RNN-
CTC Model," International Journal of Computer Application, 2019.
[5] M. Auli, "Cornell University," 20 June 2020. [Online]. Available:
https://arxiv.org/abs/2006.11477.
[6] K. Sodimana, K. Pipatsrisawat, L. Ha, M. Jansche, O. Kjartansson, P. D. Silva
and S. Sarin, "Open Speech and Language Resources," Aug 2018. [Online].
Available: http://dx.doi.org/10.21437/SLTU.2018-14.
[7] A. Conneau, A. Baevski, R. Collobert, A.-r. Mohamed and M. Auli,
"Unsupervised Cross-lingual Representation Learning for Speech
Recognition," ArXiv, vol. abs/2006.13979, p. 23, 2021.
[8] A. Dhuriya, soujyo and R. Gaur, "Open-Speech-EkStep," 2021. [Online].
Available: https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-
corpus.
[9] O. Kjartansson, S. Sarin, K. Pipatsrisawat, M. Jansche and L. Ha, "Crowd-
Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and
Bangladeshi Bengali," in Proc. The 6th Intl. Workshop on Spoken Language
Technologies for Under-Resourced Languages (SLTU), Gurugram, 2018, pp.
52-55.

36
Appendices
Screenshots:
Convolution Layer

Transformer

37
Feed-Forward

Training Script

38
Interface

Output

39
Logs of visit to Supervisor

Date: 17th December
Agenda: Proposal writing discussion
Conclusion: Demonstration for the project proposal and methodology discussion.

Date: 21st January
Agenda: Proposal discussion and correction of changes needed
Conclusion: Changes in the project proposal as per the supervisor's advice.

Date: 22nd January
Agenda: Project proposal review
Conclusion: Necessary changes needed in the project proposal; changes needed in the problem statement and methodologies.

Date: 2nd February
Agenda: Discussion regarding the methodologies to be followed; supervised learning or self-supervised learning; data collection strategy; performance improvement strategies.
Conclusion: Self-supervised learning could be the best approach; collect open-source data and record our own voice samples; addition of a language model could help improve performance.

Date: 8th February
Agenda: Progress report observation
Conclusion: Based on the project progress report, an object-oriented approach should be used.

Date: 10th February
Agenda: Mid-defense presentation and progress report
Conclusion: Necessary requirement amendments for the mid-term defense and demo presentation.

Date: 18th April
Agenda: Final defense project review
Conclusion: Changes in implementation methodology; adding the appendix.

40
