
SIP PROJECT REPORT

Rishi Gupta, M.Tech AI, 19729

I. ABSTRACT

In this project we perform speech-to-text conversion using the deep learning model Wav2vec2. Since the output of Wav2vec2 alone is not very accurate, we have implemented a language model over Wav2vec2 using KenLM to improve accuracy. We used a medical-domain dataset to train the language model.

II. TECHNICAL DETAILS

We have used the deep learning model wav2vec2, which converts speech into text. An n-gram language model is also implemented over wav2vec2 with KenLM. The main advantage of wav2vec2 is that it is pretrained on unlabeled data, so only a few hours of labeled data are required for fine-tuning.

A. Training dataset

For the training data we collected medical-domain conversations between a doctor and a patient and recorded them with five of our friends; this data was used to fine-tune the pretrained wav2vec2 model.
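As a minimal sketch of this pipeline (assuming the public facebook/wav2vec2-base-960h checkpoint and a hypothetical audio file name, not our fine-tuned model), transcription without the language model looks like this:

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder checkpoint; in the project this would be the model
# fine-tuned on our recorded medical-domain conversations.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# wav2vec2 expects 16 kHz mono audio; librosa resamples while loading.
speech, _ = librosa.load("doctor_patient.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # shape: (batch, time, vocab)

# Greedy (argmax) CTC decoding, i.e. without any language model.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])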

B. Test Data
For testing we used audio sentences recorded by the same set of friends who recorded the training data.

C. Implementation stages
There are three stages of implementation, which are as follows:
• We give the raw audio input to our wav2vec2 model, which converts the raw audio into latent speech representations and then performs quantization.
• Some vectors of the quantized speech representation are removed before the input is passed on; this step is called masking. The masked input is given to the Transformer (context network).
• The Transformer predicts the vectors that were masked in the previous step and gives the output text; how accurately it does so determines the word error rate (WER), as illustrated in the example after this list.
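To make the WER step concrete, here is a small check on the two sentences from the Results section using the jiwer library (jiwer is not named in this report; it is one common choice for computing WER):

from jiwer import wer

reference = "doctor i felt weakness in my body from several days"
hypothesis = "doctor i affect to be ness in my body from several days"

# WER = (substitutions + insertions + deletions) / number of reference words
print(wer(reference, hypothesis))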

D. Language model
• We have integrated a 5-gram language model over wav2vec2, as sketched after this list. To train our language model we used the Hugging Face medical dialog dataset with KenLM.
• This medical dialog dataset consists of millions of question-and-answer dialogues between doctors and patients.
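A minimal sketch of hooking the 5-gram KenLM model into CTC decoding with pyctcdecode is given below. The file name 5gram_medical.arpa is a placeholder (such a file can be produced with KenLM's lmplz tool, e.g. lmplz -o 5), and logits is the (batch, time, vocab) tensor from the transcription sketch earlier:

from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# Tokens must be ordered by id so they line up with the logits columns.
vocab_dict = processor.tokenizer.get_vocab()
sorted_tokens = [tok for tok, _ in sorted(vocab_dict.items(), key=lambda kv: kv[1])]
# wav2vec2 marks word boundaries with "|"; pyctcdecode expects a real space.
labels = [" " if tok == "|" else tok for tok in sorted_tokens]

decoder = build_ctcdecoder(labels, kenlm_model_path="5gram_medical.arpa")

# Beam-search decoding rescored by the 5-gram language model.
text = decoder.decode(logits[0].numpy())
print(text)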

III. RESULTS

When we fed our test data to the wav2vec2 model, some words were mismatched with the original data. For example, the original sentence is:
• Doctor I felt weakness in my body from several days.
When this sentence is passed through wav2vec2 without the language model, the output is:
• Doctor i affect to be ness in my body from several days.
The model alone can predict normal text, but it is not able to predict keywords of a specific domain. When the same test data is passed through wav2vec2 with the language model, it is decoded as:
• Doctor I felt weakness in my body from several days.

IV. TOOLS USED

• Libraries like pyctcdecode, transformers, and datasets.
• From transformers we import Wav2Vec2Tokenizer and Wav2Vec2ForCTC.
• Librosa for reading audio files.