
MAJOR PROJECT

Name: Shashank Holla S


Roll No: 181EC243
Guide: Prof. A.V. Narasimhadhan
Automatic Speech Recognition for Samskruta using Self-supervised Learning

Abstract:

In this report I present my work on building a speaker-independent, continuous speech recognition system for Samskruta using self-supervised learning. The pretrained model from the Vakyansh team (CLSRIL-23) is finetuned on a dataset containing ~78 hours of Samskruta audio along with transcriptions (Vāksañcayaḥ - Sanskrit Speech Corpus by IIT Bombay). Acoustic representations are learnt in an end-to-end deep learning approach using the wav2vec 2.0 architecture from Fairseq. On top of this acoustic model, a language model is used to improve the performance. Word error rate (WER) is used as the metric; I obtained 5.1 on the test data and 2.9 on the train data. A graphical user interface in the form of a webpage is built using Flask.

Introduction:
Automatic speech recognition (ASR) is widely used for English, but for low-resource Indian languages it is still a domain to be explored. Samskruta is arguably the oldest language in the world and probably the most structured one. Many Indian and foreign languages have taken inspiration from Samskruta. Building ASR for Samskruta will help in reviving this great language: the tool can be used for digitizing old manuscripts, and it can also serve as a teaching aid.

Since Samskruta is a low-resource language, self-supervised learning is the best approach, as it reduces the amount of labelled data required by a huge margin. I have used a model pretrained on thousands of hours of Indian-language speech and finetuned it on a Samskruta dataset. The ASR system has two parts, an acoustic model and a language model; the language model is used on top of the acoustic model to improve the performance.

I have used the pretrained model from the Vakyansh team [2]. The dataset used for finetuning is taken from the IIT Bombay Vāksañcayaḥ corpus [1]. The rest of this report is organized into five sections: related work, methodology, results, conclusion and future work, and references.

Related Work:

A few papers [1][3][4] discuss ASR for Samskruta, but none of them take a complete end-to-end deep learning approach. These papers use conventional methods involving GMM-HMM, and [3] uses the CTC criterion on mel-spectrum coefficients. Hence, to the best of my knowledge, this is the first work on Samskruta ASR with end-to-end deep learning and self-supervised learning. In terms of WER, my model outperforms all the other available approaches.

Methodology:

There are four steps involved, namely data preparation, finetuning, building the language model and inferencing, and building a Flask application. These are explained in the following subsections.

Data Preparation

The pretrained model I am using is built upon the wav2vec 2.0 architecture from Fairseq [5]. Hence the data must be in the format Fairseq requires: audio with a 16,000 Hz sampling rate, a single (mono) channel, and 16-bit PCM encoding. The length of each audio file is kept between 1 second and 1 minute. The transcript for every audio file is stored under the corresponding wav file name. Python and bash scripts are written to perform these operations. Once the data is ready in the required format, the pretrained model is downloaded and finetuning is performed.
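
As an illustration of this step, the conversion can be sketched in Python as below. This is a minimal sketch, not the exact scripts used in the project: the folder names are hypothetical, and librosa/soundfile are one possible choice of libraries.

# Minimal sketch of the audio normalisation step: converts every clip to
# 16 kHz, mono, 16-bit PCM WAV as Fairseq expects. Paths are illustrative.
import os
import librosa
import soundfile as sf

SRC_DIR = "raw_audio"       # hypothetical input folder
DST_DIR = "prepared_audio"  # hypothetical output folder
os.makedirs(DST_DIR, exist_ok=True)

for name in os.listdir(SRC_DIR):
    if not name.lower().endswith(".wav"):
        continue
    # librosa resamples to 16 kHz and downmixes to mono in one call
    audio, sr = librosa.load(os.path.join(SRC_DIR, name), sr=16000, mono=True)
    # soundfile writes 16-bit PCM when subtype="PCM_16"
    sf.write(os.path.join(DST_DIR, name), audio, sr, subtype="PCM_16")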

Finetuning

The flow of the pretrained model is shown in Fig 1, and Fig 2 shows the wav2vec 2.0 architecture.

[Fig 1: Flow of the pretrained model]

[Fig 2: wav2vec 2.0 architecture]

I have taken the pretrained model from Vakyansh [2] and finetuned it using CTC (Connectionist Temporal Classification) [6]. Along with the CTC criterion, the Adam optimizer is used. Two GPUs of 12 GB each were used for finetuning. Of the available ~78 hours of data, ~70 hours are used for training and the rest for testing.
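
To illustrate the criterion itself (not the Fairseq training loop), a minimal CTC loss computation in PyTorch is sketched below; the tensor shapes and vocabulary size are made-up values for the example.

# Minimal sketch of the CTC criterion in PyTorch; shapes and the
# vocabulary size are illustrative, not the project's actual values.
import torch
import torch.nn as nn

T, N, C = 50, 4, 32  # time steps, batch size, vocabulary size (incl. the blank)
# Stand-in for the model's frame-wise log-probabilities over the vocabulary
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 12))                 # label indices; 0 is the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # during finetuning, gradients flow back into wav2vec 2.0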

Language model and Inferencing

I have built an n-gram language model using the KenLM library [7]. Since Samskruta is a very structured language, the language model significantly improves the performance. Only the transcript data in the training set is used to build the language model.

Inferencing is done in two ways, without the language model and with the language model. Results for both cases are presented in the Results section.
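
As a sketch of this step (file names are hypothetical, and the KenLM command-line tools lmplz and build_binary are assumed to be on the PATH), the language model can be built from the training transcripts and queried through the kenlm Python bindings:

# Minimal sketch of building and querying a KenLM n-gram model;
# file names here are hypothetical.
import subprocess
import kenlm

# Build a 5-gram ARPA model from the training transcripts
with open("train_transcripts.txt") as fin, open("lm.arpa", "w") as fout:
    subprocess.run(["lmplz", "-o", "5"], stdin=fin, stdout=fout, check=True)

# Convert to KenLM's faster binary format and load it for scoring
subprocess.run(["build_binary", "lm.arpa", "lm.bin"], check=True)
model = kenlm.Model("lm.bin")
print(model.score("some samskruta sentence"))  # log10 probability

During inference, such LM scores are typically combined with the acoustic model's CTC outputs in a beam-search decoder.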

Flask Application

A web application is built where the user can record their voice and a transcript is generated in real time. The frontend is developed using JavaScript and HTML, and the backend is handled by Flask. The MediaRecorder API is used to record and save the audio, and XMLHttpRequest is used to send the data from the web page to the Flask application.
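
A minimal sketch of the backend side is shown below; the route name, the form field, and the transcribe() helper are hypothetical stand-ins, not the app's actual code.

# Minimal sketch of the Flask backend that receives the recorded audio.
from flask import Flask, request, jsonify

app = Flask(__name__)

def transcribe(wav_path):
    # Hypothetical helper: run the finetuned wav2vec 2.0 model + LM decoder here
    return ""

@app.route("/transcribe", methods=["POST"])
def handle_audio():
    # The web page posts the recorded blob via XMLHttpRequest
    audio = request.files["audio"]
    audio.save("recording.wav")
    return jsonify({"transcript": transcribe("recording.wav")})

if __name__ == "__main__":
    app.run(debug=True)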

Results:

Graphical representations of the WER, the loss, and the GPU parameters during training are shown in the accompanying figures. Without the language model, the WER on the validation set is 12.5. With the language model, the WER is 2.4 on the train set and 5.1 on the test set.
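
For reference, WER is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A small pure-Python implementation is sketched below (written for illustration, not the evaluation script actually used):

# Reference implementation of word error rate (WER) via edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution over three reference words gives WER = 0.333...
print(wer("rama vanam gacchati", "rama vana gacchati"))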
Conclusion and Future work:

The previous sections discussed the entire methodology of building the ASR model and the results. It can be seen from the results that my model outperforms the existing solutions. Still, there is a lot of scope to improve my approach; the planned improvements are listed below:
1. Deploying the Flask app on Azure
2. Adding Reverberation effects
3. Improving the language model
4. Adding Shlokas and Bhajans to the dataset
5. The final goal is to achieve speech-to-speech translation from Samskruta to other Indian languages

References:

1. Devaraj Adiga, Rishabh Kumar, Amrith Krishna, Preethi Jyothi, Ganesh Ramakrishnan, and Pawan Goyal. "Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights". Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL Findings), 2021.
2. Anirudh Gupta, Harveen Singh Chadha, Priyanshi Shah, Neeraj Chimmwal, Ankur Dhuriya, Rishabh Gaur, and Vivek Raghavan. "CLSRIL-23: Cross Lingual Speech Representations for Indic Languages".
3. A. C. S. and A. G. Ramakrishnan. "CTC-Based End-To-End ASR for the Low Resource Sanskrit Language with Spectrogram Augmentation". 2021 National Conference on Communications (NCC), 2021, pp. 1-6, doi: 10.1109/NCC52529.2021.9530162.
4. "Sanskrit Speech Recognition using Hidden Markov Model Toolkit". https://www.ijert.org/research/sanskrit-speech-recognition-using-hidden-markov-model-toolkit-IJERTV3IS100141.pdf
5. Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations".
6. Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks". https://www.cs.toronto.edu/~graves/icml_2006.pdf
7. Kenneth Heafield. "KenLM: Faster and Smaller Language Model Queries". https://kheafield.com/papers/avenue/kenlm.pdf
