Abstract:
Introduction:
Automatic Speech Recognition (ASR) is widely used for English, but for
low-resource Indian languages it is still a domain to be explored.
Samskruta is arguably among the oldest languages in the world and
probably among the most structured. Many Indian and foreign languages
have taken inspiration from Samskruta. Building ASR for Samskruta can
help in reviving this great language. Such a tool can be used to digitize
old manuscripts and can also serve as a teaching aid.
Since Samskruta is a low-resource language, self-supervised learning
is a well-suited approach, as it greatly reduces the amount of labelled
data required. I used a model pretrained on thousands of hours of speech
in Indian languages and finetuned it on a Samskruta dataset. The ASR
model has two parts: an acoustic model and a language model. The language
model is applied on top of the acoustic model to improve performance.
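To make the idea of a language model sitting on top of an acoustic model concrete, here is a minimal sketch of n-best rescoring. It is an illustration under assumptions, not the method used in this report: the function name `rescore`, the weight `alpha`, the toy scores, and the example strings are all hypothetical, and the report does not state how the two models are actually combined.

```python
def rescore(hypotheses, lm_score, alpha=0.5):
    """Pick the best transcript from (text, acoustic_log_prob) pairs.

    lm_score is any callable returning a log-probability for a text.
    alpha is an assumed weight for the language model score.
    """
    def combined(hypothesis):
        text, acoustic_log_prob = hypothesis
        # Weighted sum of acoustic and language model log-probabilities.
        return acoustic_log_prob + alpha * lm_score(text)

    best_text, _ = max(hypotheses, key=combined)
    return best_text


# Toy example: the acoustic model slightly prefers a misspelled
# candidate, and a (hypothetical) language model corrects the ranking.
toy_lm = {"rama gacchati": -2.0, "rama gacchathi": -9.0}
candidates = [("rama gacchati", -4.0), ("rama gacchathi", -3.8)]
print(rescore(candidates, toy_lm.get, alpha=0.5))
```

With `alpha=0` the language model is ignored and the acoustic model's top candidate is returned unchanged, which is one way to see what the language model contributes.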
I used the pretrained model from the vakyansh team [2]. The dataset
used for finetuning comes from the IITB team's vaksanchaya [1].
This report has five sections: related work, methodology, results,
conclusion and future work, and finally references.
Related Work:
Methodology:
Data Preparation
Finetuning
Fig 1
Fig 2
Fig 2 shows the wav2vec 2.0 architecture. I took the pretrained model
from vakyansh [2] and finetuned it using the CTC (Connectionist
Temporal Classification) criterion [6] with the Adam optimizer. Two GPUs
with 12 GB of memory each were used for finetuning. Of the 78 hours of
available data, about 70 hours were used for training and the rest for
testing.
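A model trained with the CTC criterion emits one label per audio frame, including a special blank label. The sketch below illustrates how such frame-level output is commonly turned into text by greedy decoding: take the best label per frame, merge consecutive repeats, then drop blanks. It shows the decoding side only, not the training loss, and the vocabulary, the blank index, and the function name `ctc_greedy_decode` are assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, vocab, blank_index=0):
    """Greedy CTC decoding over per-frame score lists.

    frame_scores: one list of scores per frame, aligned with vocab.
    vocab: list of output symbols; vocab[blank_index] is the CTC blank.
    """
    # 1. Pick the highest-scoring label index for each frame.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    # 2. Merge runs of identical consecutive labels.
    collapsed = [index for index, _ in groupby(best)]
    # 3. Remove blanks and map the remaining indices to symbols.
    return "".join(vocab[i] for i in collapsed if i != blank_index)


# Toy example with a hypothetical four-symbol vocabulary.
vocab = ["_", "a", "m", "r"]
frames = [
    [0.9, 0.0, 0.0, 0.1],  # blank
    [0.1, 0.0, 0.0, 0.9],  # r
    [0.1, 0.0, 0.0, 0.9],  # r (repeat, merged)
    [0.1, 0.9, 0.0, 0.0],  # a
    [0.9, 0.1, 0.0, 0.0],  # blank
    [0.0, 0.1, 0.9, 0.0],  # m
    [0.0, 0.9, 0.1, 0.0],  # a
]
print(ctc_greedy_decode(frames, vocab))
```

The blank label is what allows a genuinely doubled character to survive: two identical labels separated by a blank are not merged in step 2.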
Flask Application
A web application was built in which the user can record their voice
and a transcript is generated in real time. The frontend is developed in
JavaScript and HTML, and the backend is handled by Flask. The
MediaRecorder API is used to record and save the audio file, and
XMLHttpRequest is used to send the data from the web page to the Flask
application.
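The backend side of this flow can be sketched as a single Flask route that accepts the uploaded recording and returns a transcript as JSON. This is a minimal sketch under assumptions, not the application's actual code: the route name `/transcribe`, the form field name `audio`, and the `transcribe` function are hypothetical, and `transcribe` is only a placeholder where the finetuned model would be called.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def transcribe(audio_bytes: bytes) -> str:
    """Hypothetical placeholder for running the finetuned ASR model."""
    return f"<transcript for {len(audio_bytes)} bytes of audio>"


@app.route("/transcribe", methods=["POST"])
def transcribe_route():
    # The frontend is assumed to send the recording as a multipart
    # form upload under the field name "audio".
    upload = request.files.get("audio")
    if upload is None:
        return jsonify(error="no audio file in request"), 400

    audio_bytes = upload.read()
    if not audio_bytes:
        return jsonify(error="empty audio file"), 400

    return jsonify(transcript=transcribe(audio_bytes))


if __name__ == "__main__":
    # Development server only.
    app.run(debug=True)
```

Returning JSON keeps the frontend simple: the XMLHttpRequest callback can parse the response and write the transcript into the page.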
Results:
References: