
SIP PROJECT REPORT

Rishi Gupta, M.Tech AI, 19729

I. ABSTRACT

In this project we performed speech-to-text conversion using the deep learning model Wav2Vec2. Since the output of Wav2Vec2 alone is not very accurate, we implemented a language model over Wav2Vec2 using KenLM to improve accuracy. We used a medical domain dataset to train the language model.

II. TECHNICAL DETAILS

We used the deep learning model Wav2Vec2, which converts speech into text. An n-gram language model is also implemented over Wav2Vec2 with KenLM. The main advantage of Wav2Vec2 is that it is pretrained on unlabeled data, so only a few hours of labeled data are required for fine-tuning. A minimal transcription sketch follows below.
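As a concrete illustration of the pipeline, the following sketch loads a pretrained Wav2Vec2 checkpoint with transformers, reads an audio file with librosa, and decodes greedily, i.e. without the language model yet. The checkpoint name "facebook/wav2vec2-base-960h" and the path "sample.wav" are placeholders, not the report's fine-tuned model or recordings.

    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # Placeholder checkpoint; the report fine-tunes its own model.
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # Wav2Vec2 expects 16 kHz mono audio.
    speech, _ = librosa.load("sample.wav", sr=16_000)
    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (batch, frames, vocab)

    # Greedy CTC decoding: most likely token per frame, then collapse repeats.
    pred_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(pred_ids)[0])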
A. Training dataset

For the training data, we collected medical domain conversations between a doctor and a patient and recorded them with five of our friends; this data was used to fine-tune the pretrained Wav2Vec2 model.

B. Test Data
For testing, we used audio sentences recorded by the same set of friends.

C. Implementation stages
There are three stages of implementation:
1. Raw audio input is given to the Wav2Vec2 model, which converts it into a latent speech representation and then performs quantization.
2. Some of the quantized speech vectors are hidden before the sequence is given to the context (Transformer) network; this step is called masking, illustrated in the sketch after this list.
3. The context network predicts the vectors that were masked in the previous step and produces the output text; how accurately it does so determines the word error rate (WER).
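The masking in stage 2 can be pictured with a small, self-contained PyTorch sketch. The shapes, the one-in-two masking ratio, and the randomly initialized mask embedding are illustrative assumptions; the real model masks contiguous spans and learns the mask embedding during pretraining.

    import torch

    # Toy masking step: hide some latent speech frames before the
    # sequence is passed to the context (Transformer) network.
    batch, frames, dim = 1, 10, 8
    latents = torch.randn(batch, frames, dim)  # output of the feature encoder
    mask_embedding = torch.randn(dim)          # a learned vector in the real model

    mask = torch.rand(batch, frames) < 0.5     # frames to hide (toy ratio)
    masked_latents = latents.clone()
    masked_latents[mask] = mask_embedding      # hidden frames all share one vector

    # Pretraining asks the context network to pick out the true quantized
    # vectors at exactly these masked positions (a contrastive task).
    print(int(mask.sum()), "of", frames, "frames masked")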

D. Language model
We integrated a 5-gram language model over Wav2Vec2. To train the language model we used the Hugging Face medical dialog dataset with KenLM. This medical dialog data consists of millions of dialogues related to question answering between doctors and patients. A sketch of building the 5-gram model and attaching it to the decoder follows below.
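A minimal sketch of this step, under the assumption that the corpus has been exported to plain text: KenLM's lmplz tool estimates the ARPA file offline, and pyctcdecode pairs it with the acoustic model's vocabulary for beam-search decoding. The paths medical_dialog.txt and 5gram.arpa are hypothetical placeholders.

    # The 5-gram itself is estimated offline with KenLM's lmplz tool, e.g.:
    #   lmplz -o 5 < medical_dialog.txt > 5gram.arpa
    from pyctcdecode import build_ctcdecoder
    from transformers import Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

    # pyctcdecode expects the labels in vocabulary-id order.
    vocab = processor.tokenizer.get_vocab()
    labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

    # Beam-search decoder that rescores CTC hypotheses with the 5-gram LM.
    decoder = build_ctcdecoder(labels=labels, kenlm_model_path="5gram.arpa")

    # logits: a (frames, vocab) numpy array from Wav2Vec2ForCTC, e.g.
    #   logits = model(inputs.input_values).logits[0].cpu().numpy()
    #   print(decoder.decode(logits))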

III. RESULTS

When we fed our test data to the Wav2Vec2 model, some words were mismatched with the original data. For example:
1. "Doctor I felt weakness in my body from several days."
When this sentence is passed through Wav2Vec2 without the language model, the output is:
"Doctor i affect to be ness in my body from several days."
The model alone can predict normal text but is not able to predict keywords of a specific domain.
When the same test data is passed through Wav2Vec2 with the language model, it decodes the sentence as:
"Doctor I felt weakness in my body from several days."
A small WER computation for this sentence pair is sketched after the tools list.

IV. TOOLS USED

1. Libraries such as pyctcdecode, transformers, and datasets.
2. From transformers we import Wav2Vec2Tokenizer and Wav2Vec2ForCTC.
3. Librosa for reading audio files.
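Stage 3 of the implementation names WER as the accuracy measure; the snippet below computes it for the sentence pair above. The jiwer library is not among the tools listed in this report; it is assumed here only as a convenient way to compute WER.

    import jiwer

    reference  = "doctor i felt weakness in my body from several days"
    without_lm = "doctor i affect to be ness in my body from several days"
    with_lm    = "doctor i felt weakness in my body from several days"

    # WER = (substitutions + deletions + insertions) / words in the reference
    print("WER without LM:", jiwer.wer(reference, without_lm))
    print("WER with LM:   ", jiwer.wer(reference, with_lm))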