
Instructor: Dr. Anil Kumar Vuppala
IIIT Hyderabad

Report

Sridhar Vanga
13-12-2022
Contents

1 Data Acquisition Pipeline (DAP)
  1.1 Introduction
  1.2 Current DAP
  1.3 Improvements in DAP

2 ASR interface
  2.1 Current web interface setup
  2.2 Scope
  2.3 Improvements
  2.4 Specifications
    2.4.1 Profanity Module

3 Captioning Pipeline
  3.1 Pipeline blocks

1. Data Acquisition Pipeline (DAP)

1.1 Introduction
India is a multilingual society with 22 officially recognized languages and hundreds of unofficial ones. With so many languages spoken across the nation, it would be difficult to manage affairs between states without translators. Artificial intelligence can lower this language barrier and act as a bridge between states in official matters, as well as between people of different linguistic backgrounds. Data plays a crucial role in building the language technologies that bridge this gap. As part of the Government of India's aim to overcome language barriers, Telugu data collection is taking place in collaboration with IIIT Hyderabad.

1.2 Current DAP


As part of this pilot project, around 2000 hours of Telugu speech data is being created and released under a CC license. Of these 2000 hours, around 1000 hours are collected from YouTube and other sources. Data from existing Telugu YouTube channels and from streams such as the newsonair regional archives [1] has been scraped using tools like Selenium, Scrapy, and BeautifulSoup. The collected audio is segmented into chunks based on voice activity using webrtcvad [2]. These chunks are then filtered by signal-to-noise ratio (SNR), discarding any chunk with an SNR below 15. The filtered chunks are passed through an ASR system to obtain rough transcriptions, which are then verified manually. The verified transcripts and audio chunks are converted into the ULCA format to publish the dataset. The published dataset can be found at [3].
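As a rough illustration of this stage, the sketch below chunks 16-bit mono PCM audio with py-webrtcvad and then applies the SNR gate. The frame size, the aggressiveness setting, and the estimate_snr helper are assumptions for this sketch, not the pipeline's actual parameters.

```python
import webrtcvad

def voiced_chunks(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=2):
    """Split 16-bit mono PCM audio into runs of consecutive voiced frames."""
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    current, chunks = [], []
    for start in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        frame = pcm_bytes[start:start + frame_bytes]
        if vad.is_speech(frame, sample_rate):
            current.append(frame)
        elif current:                         # voiced run ended: close the chunk
            chunks.append(b"".join(current))
            current = []
    if current:
        chunks.append(b"".join(current))
    return chunks

def filter_by_snr(chunks, estimate_snr, threshold=15.0):
    """Keep only chunks whose estimated SNR meets the threshold.

    estimate_snr is a placeholder for whatever SNR estimator the pipeline uses.
    """
    return [c for c in chunks if estimate_snr(c) >= threshold]
```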

1.3 Improvements in DAP


1. New gender classification and speaker clustering modules have been added to retain as much
information from the audio chunks as possible.

2. To achieve this, we use speaker embeddings from resemblyzer [4], which provide a high-level
representation of a voice.

3. On top of these embeddings, we use a neural model to classify the speaker's gender in each audio chunk.

4. To handle multiple speakers in unseen audio, we use the DBSCAN algorithm to obtain speaker
clusters in an unsupervised manner, as sketched below.
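A minimal sketch of the embedding and clustering step, assuming the chunks are available as WAV file paths; the DBSCAN parameters (eps, min_samples) are illustrative guesses that would need tuning on real data.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
from sklearn.cluster import DBSCAN

encoder = VoiceEncoder()  # pretrained speaker encoder from resemblyzer

def cluster_speakers(wav_paths, eps=0.35, min_samples=2):
    """Embed each chunk and group chunks by speaker with DBSCAN."""
    embeddings = np.stack([encoder.embed_utterance(preprocess_wav(path))
                           for path in wav_paths])
    # Cosine distance suits the L2-normalised speaker embeddings;
    # DBSCAN labels outlier chunks as -1.
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="cosine").fit_predict(embeddings)
```

The same embeddings feed the gender classifier, which is a separate neural model not shown here.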

[1] https://newsonair.gov.in/RNU-NSD-Audio-Archive-Search.aspx
[2] https://github.com/wiseman/py-webrtcvad
[3] https://asr.iiit.ac.in/cstd/
[4] https://github.com/resemble-ai/Resemblyzer

2. ASR interface

2.1 Current web interface setup

As part of my Independent Study in Spring 2022, I developed a web application that performs speech recognition in real time on input from the microphone as well as from media files.

2.2 Scope
The scope of this objective is to develop an interface for interacting with automatic speech recognition models from different frameworks, including Kaldi and end-to-end models. The interface enables real-time, low-latency interaction between the end user and the model, and it can serve multiple architectures and multiple languages concurrently to generate transcripts. The web interface is built with FastAPI (an asynchronous framework), the WebSocket protocol for real-time communication between the server and the client, and the hark library to detect relevant sound events.
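A minimal sketch of the server side of this setup, assuming a hypothetical transcribe_chunk function that wraps the underlying ASR model; the actual interface also handles model and language selection and streaming state.

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

def transcribe_chunk(audio: bytes) -> str:
    """Placeholder for the actual ASR inference call (Kaldi or end-to-end)."""
    raise NotImplementedError

@app.websocket("/asr")
async def asr_endpoint(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            audio = await ws.receive_bytes()          # raw audio from the browser
            text = transcribe_chunk(audio)            # run inference on this chunk
            await ws.send_json({"transcript": text})  # stream the result back
    except WebSocketDisconnect:
        pass  # client closed the connection
```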

2.3 Improvements
1. I removed the dependency on hark on the client side, which reduces the amount of JavaScript loaded in the browser.
2. I added a profanity module at the end of inference to filter out Hindi profane words in the transcription.

2.4 Specifications

2.4.1 Profanity Module


• I collected Hindi profane words from various sources and composed a profane-word set from them.
• Profane-word filtering is performed with the profanity-filter [1] package together with this profane-word set, as in the sketch below.
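As a simplified stand-in for what the profanity-filter package does, the core masking step can be illustrated with a plain set lookup; the word-list file name and the asterisk masking are assumptions for this sketch.

```python
import re

# Hypothetical path to the composed Hindi profane-word set (one word per line).
with open("hindi_profane_words.txt", encoding="utf-8") as f:
    PROFANE = {line.strip() for line in f if line.strip()}

def mask_profanity(text: str) -> str:
    """Replace each profane token with asterisks of the same length."""
    def mask(match):
        word = match.group(0)
        return "*" * len(word) if word in PROFANE else word
    return re.sub(r"\S+", mask, text)
```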

[1] https://github.com/rominf/profanity-filter

3. Captioning Pipeline

With the growth of audio data, structuring it with subtitles, speaker information, and environment information has become essential for processing otherwise unstructured audio. For this purpose, we created a captioning pipeline that provides speaker and background information at the chunk level, along with transcripts from the ASR.

3.1 Pipeline blocks


For this system, we use a VAD module to chunk the given audio based on voiced and unvoiced regions in the audio file. A speaker diarization [1] module then assigns speaker-level labels to the chunks obtained from the audio. The chunks are passed to the ASR model to obtain chunk-level transcriptions, which then go through the profanity module discussed in the interface above so that profane words are masked in the captions. The chunks are also passed through a speech event detector to identify events such as laughing, crying, and music. Finally, the complete captions are assembled by combining the transcripts with the detected events, as sketched below.
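A high-level sketch of how the blocks compose, with every stage passed in as a placeholder callable (vad_chunks, diarize, asr, mask, detect_events); only the data flow mirrors the pipeline described above, and none of these names are the actual module APIs.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Caption:
    start: float        # chunk start time in seconds
    end: float          # chunk end time in seconds
    speaker: str        # label from the diarization module
    text: str           # profanity-masked ASR transcript
    events: List[str]   # detected events, e.g. ["laughing", "music"]

def caption_audio(audio_path: str, vad_chunks: Callable, diarize: Callable,
                  asr: Callable, mask: Callable, detect_events: Callable):
    """Compose the pipeline blocks into chunk-level captions."""
    captions = []
    for chunk in vad_chunks(audio_path):        # 1. VAD-based segmentation
        captions.append(Caption(
            start=chunk.start,                  # timing carried by the chunk
            end=chunk.end,
            speaker=diarize(chunk),             # 2. speaker label per chunk
            text=mask(asr(chunk)),              # 3. ASR, then profanity mask
            events=detect_events(chunk),        # 4. speech events in the chunk
        ))
    return captions
```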

[1] https://github.com/juanmc2005/StreamingSpeakerDiarization

Figure 3.1: captioning system
