Report
Sridhar Vanga
13-12-2022
Contents
2 ASR interface
2.1 Current web interface setup
2.2 Scope
2.3 Improvements
2.4 Specifications
2.4.1 Profanity Module
3 Captioning Pipeline
3.1 Pipeline blocks
1. Data Acquisition Pipeline (DAP)
1.1 Introduction
India is a multilingual society with 22 official and hundreds of unofficial languages. With so many languages
spoken across the nation, it would be difficult to manage affairs between states without the help of translators.
Artificial intelligence can lower this language barrier and act as a bridge between states in official matters, as
well as between people of different linguistic backgrounds. Data plays a crucial role in building the language
technologies that bridge this gap. As part of the Government of India's aim to overcome language barriers,
Telugu data collection is taking place in collaboration with IIIT-Hyderabad.
2. To achieve this, we have used speaker embeddings from Resemblyzer4, which provide a high-level
representation of a voice.
3. On top of these embeddings, we have used a neural model to classify the gender of the speaker in each
audio chunk.
4. To handle multiple speakers present in unseen audio, we have used the DBSCAN algorithm to obtain
speaker clusters in an unsupervised manner.
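Steps 2 and 4 above can be sketched as follows. The embeddings here are synthetic stand-ins for Resemblyzer d-vectors, and the eps/min_samples values are illustrative assumptions, not the tuned values used in the pipeline:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Stand-in for Resemblyzer d-vectors: two synthetic "speakers", each a
# tight cloud of 256-dim embeddings (hypothetical data, not real audio).
spk_a = rng.normal(scale=0.01, size=(20, 256)) + np.eye(256)[0]
spk_b = rng.normal(scale=0.01, size=(20, 256)) + np.eye(256)[1]
embeddings = np.vstack([spk_a, spk_b])

# Cosine distance is a common choice for voice embeddings; eps would be
# tuned per corpus in practice.
labels = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit_predict(embeddings)

# DBSCAN labels noise points as -1; the remaining labels are the
# discovered speaker clusters.
n_speakers = len(set(labels) - {-1})
```

Because DBSCAN does not require the number of clusters in advance, the same call handles audio with an unknown number of speakers.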
1 https://newsonair.gov.in/RNU-NSD-Audio-Archive-Search.aspx
2 https://github.com/wiseman/py-webrtcvad
3 https://asr.iiit.ac.in/cstd/
4 https://github.com/resemble-ai/Resemblyzer
2. ASR interface
2.2 Scope
The scope of this objective is to develop an interface for interacting with automatic speech recognition
models from different frameworks, including Kaldi and end-to-end models. The interface enables real-time,
low-latency interaction between the end user and the model. The setup can support multiple architectures and
multiple languages, generating transcripts from the models concurrently. The web interface is built with
FastAPI, an asynchronous framework; the WebSocket protocol, for real-time communication between the
server and the client; and the Hark library, to detect relevant sound events.
2.3 Improvements
1. I have removed the dependency on Hark on the client side, which reduces the amount of JavaScript
loaded in the browser.
2. I have added a profanity module at the end of the inference step to filter out Hindi profane words present
in the transcription.
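The profanity step can be sketched as a simple wordlist-based filter. The words below are placeholders, not the actual Hindi wordlist, and the real module may use a library such as the profanity-filter package referenced in the footnote rather than this hand-rolled approach:

```python
import re

# Placeholder wordlist; the deployed module would load actual Hindi
# profane terms instead of these dummy tokens.
PROFANE_WORDS = {"badword", "slur"}

_pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, sorted(PROFANE_WORDS))) + r")\b",
    flags=re.IGNORECASE,
)

def censor(transcript: str) -> str:
    """Replace each profane token with asterisks of the same length."""
    return _pattern.sub(lambda m: "*" * len(m.group(0)), transcript)
```

Running the filter after inference, e.g. `censor("a badword here")`, returns `"a ******* here"`, so downstream consumers never see the raw term.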
2.4 Specifications
1 https://github.com/rominf/profanity-filter
3. Captioning Pipeline
With the increase in audio data, structuring that data with subtitles and tagging it with speaker and
environment information has become essential for processing unstructured audio. For this purpose, we have
created a captioning pipeline that provides speaker and background information at the chunk level, along
with the transcripts from the ASR.
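The chunk-level output described above can be sketched as a small record type plus an SRT renderer. The field names, tag values, and SRT layout are illustrative assumptions about the pipeline's output, not its actual schema:

```python
from dataclasses import dataclass

@dataclass
class Caption:
    start: float      # chunk start time in seconds
    end: float        # chunk end time in seconds
    speaker: str      # diarization cluster label
    background: str   # environment tag, e.g. "clean" or "music"
    text: str         # ASR transcript for the chunk

def to_srt(captions):
    """Render captions as SRT, prefixing each cue with its speaker and
    background tags."""
    def ts(t):
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, c in enumerate(captions, 1):
        blocks.append(
            f"{i}\n{ts(c.start)} --> {ts(c.end)}\n"
            f"[{c.speaker}|{c.background}] {c.text}"
        )
    return "\n\n".join(blocks)

captions = [
    Caption(0.0, 3.2, "spk0", "clean", "namaste and welcome"),
    Caption(3.2, 7.5, "spk1", "music", "aaj ki taaza khabar"),
]
```

Keeping speaker and environment tags inline with each cue means a single file carries everything the downstream data-structuring step needs.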
1 https://github.com/juanmc2005/StreamingSpeakerDiarization
Figure 3.1: Captioning system