
Artificial Intelligence Review

https://doi.org/10.1007/s10462-019-09775-8

ASRoIL: a comprehensive survey for automatic speech recognition of Indian languages

Amitoj Singh1 · Virender Kadyan2 · Munish Kumar1 · Nancy Bassan3

© Springer Nature B.V. 2019

Abstract
India is a land of language diversity, with 22 major languages having more than 720 dialects, written in 13 different scripts. Of these 22, Hindi, Bengali and Punjabi are ranked the 3rd, 7th and 10th most spoken languages around the globe. Except for Hindi, where some significant research is going on, the other two major languages and the remaining Indian languages do not have fully developed Automatic Speech Recognition systems. The main aim of this paper is to provide a systematic survey of the existing literature on automatic speech recognition (i.e. speech to text) for Indian languages. The survey analyses the possible opportunities, challenges, techniques and methods, and locates, appraises and synthesizes the evidence from studies to provide empirical answers to the scientific questions. The survey was conducted on the relevant research articles published from 2000 to 2018. The purpose of this systematic survey is to sum up the best available research on automatic speech recognition of Indian languages by synthesizing the results of several studies.

Keywords Automatic speech recognition · Indian languages · Feature extraction techniques · Classification techniques · Speech corpus

* Munish Kumar
munishcse@gmail.com
Amitoj Singh
amitoj.ptu@gmail.com
Virender Kadyan
ervirenderkadyan@gmail.com
Nancy Bassan
nanchitkara@gmail.com
1 Department of Computational Sciences, Maharaja Ranjit Singh Punjab Technical University, Bathinda, Punjab, India
2 Department of Computer Science and Engineering, Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India
3 Department of Mechanical Engineering, Baba Farid College of Engineering and Technology, Bathinda, Punjab, India


1 Introduction

Automatic Speech Recognition (ASR) is an important research area in the field of pattern recognition. It is a combination of several techniques that help in converting an acoustic signal to text, wherein the output shows the text corresponding to the recognized speech signal. The goal of speech recognition is to enable computers to recognize the speech signal and convert the words in the speech to textual form. Technology giants such as Amazon, Apple, Google, IBM, and Microsoft have developed very sophisticated speech recognition software for the English language. Gartner predicts that, by 2023, 25% of employee interactions with applications will be via voice, up from under 3% in 2019 (https://www.gartner.com/en/newsroom/press-releases/2019-01-09-gartner-predicts-25-percent-of-digital-workers-will-u).

Google Assistant is conversant in over 30 languages, including six Indian languages: Hindi, Kannada, Malayalam, Marathi, Tamil and Telugu. Although Gartner has predicted that speech recognition technology will touch its peak during this decade (Besacier et al. 2014), most languages still lie in the under-resourced category. This is also the case with Indian languages: with 22 official languages and 1652 mother tongues, languages like Assamese, Dogri, Kashmiri, Punjabi and many more are still not much researched. In this paper, the authors highlight the research and development of ASR systems in Indian languages and give a short description of the related and state-of-the-art work in the field of automatic speech recognition of Indian languages. A comprehensive literature survey is conducted, and papers are analysed from different aspects of speech recognition, such as feature extraction, toolkits used, the speech corpora majorly used to develop automatic speech recognition systems, year-wise progress and so on. All research designs, experimental and non-experimental, are included. Finally, future directions are also presented in this survey article.

1.1 Motivations

There has been considerable advancement in European speech recognition engines (Jayanna 2009), but research on ASR system development in most Indian languages is still at its initial stage. This is because of the unavailability of standard speech corpora for Indian languages and their dialectal variations. Some of the languages still lack the minimum speech corpus required for development of a system. Conversely, if a suitable classification technique is not provided to the ASR system, a large-scale speech corpus can increase computational complexity and drastically degrade performance in the testing phase. Building enough data for training and testing of the system is one major obstacle. There are 22 official languages in India. Hindi is the major spoken language of India and is ranked the 3rd most spoken language in the world (https://www.babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world/). Other major languages of India, namely Bengali, Punjabi, Telugu, Marathi and Tamil, are ranked 7th, 10th, 15th, 16th and 17th, respectively (Ethnologue, https://www.ethnologue.com/statistics/size). Even though six of the top 20 most spoken languages belong to the Indian subcontinent, the state-of-the-art work on Indian languages other than Hindi is not well traversed. This motivated the authors to provide a systematic survey of automatic speech recognition methodologies. This paper provides a literature review to find out the present status of the languages spoken in India. Some of the popular applications of automatic speech recognition are:


• Generating subtitles for live television
• Dictation tools in professional areas
• Converting speech into text
• Translating voice into foreign languages
• Querying databases with spoken queries, e.g. e-farming
• Robotics
• Speech interfaces in mobile applications

This paper is organized into six sections. An introduction to the present work is given in Sect. 1. Section 2 presents the background of automatic speech recognition and its techniques. Various issues for automatic speech recognition are presented in Sect. 3. Section 4 presents the research methods, covering the features and classification techniques used for speech recognition. A synthetic analysis based on the survey is presented in Sect. 5, and finally Sect. 6 presents the concluding notes.

2 Background

Under the umbrella of "automatic speech recognition in Indian languages" literature, there are few available reviews of speech recognition systems. A review of speech recognition by machine (Anusuya and Katti 2010), feature and corpus classification schemes (Saad and Ashour 2010), machine translation approaches (Antony 2013), speech recognition technology (Hemakumar and Punitha 2014) and acoustic-phonetic analysis for speech recognition (Sarma and Prasanna 2018) are some of the survey studies conducted for Indian languages. A comparison of the advancement of speech recognition technologies over the past six decades was given by Anusuya and Katti (2009). Survey results of Sarma et al. (2010) revealed that research on language variation is the modern trend in the field of speech recognition. Furthermore, the acceptance of neural network techniques by industry over the last 10-15 years is seen as an alternative to the HMM technology that has dominated for many years; academia and industry have entrusted Artificial Neural Network (ANN) based methodologies for developing speech recognition applications. A review of Hindi regional accents was presented by Thakur et al. (2011). They reviewed work on both noisy and noise-free data and elaborated that the accuracy rate on noisy test data is lower than on a noise-free dataset. Ghai and Singh (2012) and Kurian (2014) presented surveys on speech recognition over the years 2000-2013 for various Indian languages, i.e. Gujarati, Oriya, Punjabi, etc. In agreement, Thakur et al. (2013) claim that little research has been reported in regional languages except Kannada, Tamil and Telugu, which have received the major chunk of speech recognition research initiatives. Antony (2013) presented a literature survey putting forward that there have been many efforts in machine translation; the survey gives a brief description of various approaches and machine translation development in India. Saini and Kaur (2013) presented a review of automatic speech recognition from 2000 to 2010 and concluded that Variational Bayesian (VB) estimation-based speech recognition was researched to a great extent. Hemakumar and Punitha (2014) presented a review based on the study of well-known methods and toolkits used for automatic speech recognition. They observed that, to acquire more empirical results, research ought to evolve and adopt better language models that can reduce space complexity and computational time. Gulzar et al. (2014) also conducted an overview of past work comparing modern speech recognition systems. Samudravijaya (2014) deliberated on various techniques of ASR and the progression of speech recognition technology over the last 10 years. Swain et al. (2018) presented a review of speech emotion recognition systems based on three parameters: the database design, the feature selection and the classifiers used for building the system. Their review focused on insights regarding the features, databases and classifiers used from 2000 to 2017. The authors comment that, to improve the performance of the system and to identify the correct emotions, classifier selection remains a challenging task. Although a variety of classifiers have been chosen by researchers for speech emotion recognition systems, it is very difficult to conclude which performs better; there is no clear winner.

3 Issues for automatic speech recognition

The major issue for an ASR engine is adjusting to the variability of the speech signal. The issue arises due to linguistic, speaker and channel variability, which covers various other attributes such as phonetics, adverse environmental conditions (clean, noisy or real), varying speaker parameters (such as age, gender, accent, speed of utterance, and dialect), the length of the training dataset, and the voice recording device. Tackling such degraded speech signals is challenging. An efficient ASR system must be able to identify all such factors to produce the text corresponding to an input signal. The other parameter with major significance for the success of an ASR system is its corpus. Indian languages face the challenge of lacking standard speech and text corpora (Aggarwal and Dave 2013). So, there arises a need to traverse these languages and obtain productive output for them in the sphere of speech recognition. For the successful development of an efficient ASR engine, its corpus plays a crucial role; thus, collecting a speech corpus requires special focus. The extraction of relevant information and the classification of features in the modelling phase with lower computational complexity are other key parameters that need to be taken care of. These speech factors have a wide-ranging impact on the sustainability of application-specific or general ASR systems.

4 Research methods

This paper attempts to review all the published literature on automatic speech recognition of Indian languages from 2000 to 2018. Papers that refer to speech recognition in Indian languages, or allied research on a variety of Indian ASR datasets, experimental and non-experimental, have been included in this survey. Papers focused on opinions, or describing different aspects of automatic speech recognition without evaluation, have been excluded from the research design of this paper. The aim of the systematic survey is to determine the status of research on automatic speech recognition.

4.1 Corpus development and selection

A prosody team at DA-IICT collected speech corpora of the Marathi and Gujarati languages from remote villages of the Maharashtra and Gujarat states; read, spontaneous and lecture-mode speech corpora were collected by Malde et al. (2013). IIT Kharagpur, in collaboration with Media Lab Asia, developed a read-speech corpus named 'Shruti', spoken by 34 speakers of different age groups from a region of West Bengal; this corpus contains 7383 unique sentences (Shruti 2015). Upadhyay and Riyal (2010) collected 11,188 isolated words from spontaneous noisy speech data of the Garhwali language. This corpus also includes phonetically rich sentences collected from various newspapers, books and magazines; the speech was collected from 100 speakers from various regions of the Uttarakhand state of India. Samudravijaya and Gogate (2006) also collected a speech corpus of spontaneous and phonetically rich sentences in 2006 for the development of an automatic speech recognition system for the Marathi language. Later, Gaikwad et al. (2013) created a speech corpus in the Marathi language consisting of 17,470 sentences and 28,240 words; the data was collected as read speech from different resources (Gaikwad et al. 2013). Sarma et al. (2013) collected an Assamese speech database from 25 native Assamese speakers for the development of an Assamese phonetic engine. A large-vocabulary speech corpus for continuous speech recognition in the Tamil language was collected from almost 100 speakers by Anna University in two phases: in phase I, read speech of literary stories was recorded from around 70 speakers, and in phase II, a 1-h speech corpus from newspapers was collected from around 30 speakers. The Linguistic Data Consortium for Indian Languages (LDC-IL) collected, in 2015, speech corpora of 16 different Indian regional languages as well as segmented data of 22 Indian languages. IITKGP-MLILSC collected a speech corpus of 27 different Indian regional languages, gathered from news, talk shows, interviews and All India Radio; every language has at least 1 h of speech data recorded from 10 different speakers (Maity et al. 2012). A large speech corpus (approximately 50 h) of ten different Indian languages, i.e. Assamese, Bengali, Hindi, Kannada, Malayalam, Manipuri, Marathi, Punjabi, Tamil, and Telugu, is available at http://www.lidc.gov.in, but it is not available for public use. Singh and Singh (2011) tried to analyse the vowel phonemes of Punjabi along with their formant analysis. Native speakers of the Punjab state, mainly male, recorded the vowels of the Punjabi language, and the analysis showed that English and Punjabi have different formant frequencies. Lata and Arora (2013) collected an isolated speech corpus containing phonemes of the Malwai dialect carrying tonal effects, from native Punjabi speakers. Speech corpora for different purposes were designed and evaluated: for Marathi by TIFR and IIT Bombay (Godambe and Samudravijaya 2011), a Hindi travel-domain dataset by C-DAC Noida (Arora et al. 2010), a Telugu dataset for a Mandi information system by IIIT Hyderabad, and English, Hindi and Telugu datasets for travel and emergency services by IIT Hyderabad (Mantena et al. 2011). Other general-purpose corpora of Telugu, Hindi, Tamil, and Kannada were prepared by IIT Kharagpur (Rao 2011). Annotated speech corpora in three East Indian languages, namely Assamese, Bangla, and Manipuri, have been developed by CDAC, Kolkata (CDAC Corpus 2015). This project was sponsored by TDIL, DeitY. Data was recorded in a clean environment at a sampling rate of 22,050 Hz, 16 bits/sample, in PCM wave format; the corpus is around 8.5 GB. The EMILLE-CIIL Corpus (Enabling Minority Language Engineering) consists of three components: monolingual, parallel and annotated corpora. It has 14 monolingual corpora, including both written and spoken data; the spoken data covers 14 South Asian languages. These monolingual corpora consist of a total of 96 million words, including more than 2.6 million words of spoken corpora in Bengali, Urdu, Gujarati, Hindi and Punjabi. It is a collaborative venture between Lancaster University, UK, and the Central Institute of Indian Languages (CIIL), Mysore, India (EMILLE Corpora 2015). The spoken language group of TIFR developed a large multilingual spoken corpus for Indian languages; the speech database was developed for four different languages, namely Hindi, Marathi, Malayalam and Indian English, and was collected over telephone channels (Samudravijaya 2006). Mohamed and Lajish (2016) developed a speech corpus consisting of 1000 samples of five Malayalam vowels collected from twenty speakers; 500 samples each were used for training and recognition purposes. Besides all these benchmark datasets, many researchers have developed their own corpora, as many other ASR systems lack a freely available speech corpus that can be used. A summary of such speech corpora is listed in Table 1.
Dua et al. (2018a) developed a speech corpus of the Hindi language containing 1000 sentences spoken by 100 speakers; 38 of the 100 have Hindi as their mother tongue, and the rest speak Hindi fluently. Of the 10 sentences uttered by every speaker, two sentences are common to all speakers. In 2018, a Low Resource ASR challenge for Indian languages was proposed for the Interspeech 2018 conference (Interspeech 2018). The data released for the challenge was provided by SpeechOcean.com and Microsoft. It consisted of phrasal (recorded as read-out phrases) and conversational speech in Tamil, Telugu and Gujarati; 40 h of training data and around 5 h of test data for each language were released. Pandey and Nathwani (2018) collected a 19-h speech corpus (news and speech) from YouTube, comprising 6840 audio clips of 10 s each; each clip contains an average of 50 syllables.
DIT, Govt. of India, developed speech corpora of four Indian languages, namely Kannada, Telugu, Bengali and Odia, as part of a consortium project titled "Prosodically guided phonetic engine" for searching speech databases in Indian languages. The speech corpora contain 16-bit, 16 kHz speech wave files along with their IPA transcriptions. Patel et al. (2018) developed an ASR and Keyword Search (KWS) system for Manipuri, a low-resource Indian language. Read speech data of more than 90 h from 300 speakers was collected for the ASR task. For training, the ~ 90 h dataset was split into training sets of 30, 40, 50 and 70 h. The end system built with the 70 h training set includes ~ 65,000 words and ~ 36,000 sentences.
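
The corpora above are distributed with differing audio specifications (e.g. 16 kHz/16-bit for the prosodically guided phonetic engine corpora, 22,050 Hz/16-bit PCM for the CDAC corpora), and mismatched sample rates are a common source of error when corpora are pooled. A quick sanity check with Python's standard wave module is sketched below; the file name is hypothetical.

```python
import wave

# Hypothetical file; any PCM WAV recording from the corpora above would do.
with wave.open("utterance.wav", "rb") as w:
    print("channels:   ", w.getnchannels())
    print("sample rate:", w.getframerate(), "Hz")
    print("bits/sample:", 8 * w.getsampwidth())
    print("duration:   ", round(w.getnframes() / w.getframerate(), 2), "s")
```
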

4.2 Feature extraction techniques

Most of the research studies on automatic speech recognition of Indian languages have focused on Linear Predictive Coding (LPC), Zero Crossing with Peak Amplitude (ZCPA), Mel-Frequency Cepstral Coefficient (MFCC), Dynamic Time Warping (DTW), and Relative Spectra Processing (RASTA) features. MFCC confirms better performance in feature extraction in comparison to PLP (Dua et al. 2012a, b). Thasleema et al. (2007) compared LPC with the wavelet packet decomposition method on a Malayalam speech corpus; overall recognition accuracy using the wavelet packet decomposition method (74%) was found to be much better than with LPC (34%). Kandali et al. (2009) compared feature extraction techniques with respect to accuracy for speech recognition. They considered WPCC2 (Wavelet Packet Cepstral Coefficients), MFCC, tfWPCC2 (Teager-energy-operated-in-transform-domain WPCC2) and tfMFCC for feature extraction, and observed that WPCC2 and tfWPCC2 perform better than MFCC and tfMFCC for automatic speech recognition. Farooq et al. (2010) noticed that the performance of WP features was better than features extracted with MFCC and GFCC, except for the aspirated voiced phoneme class.

Table 1 Summary of speech corpora and their properties
Additionally, it was observed that under multi-training conditions (e.g. jet, babble and factory noise), the performance of MFCC dropped significantly compared to other feature extraction techniques because of inadequacies in multi-condition training using MFCC (Farooq et al. 2010). Sinha et al. (2011) considered MFCC and PLP for feature extraction and applied HLDA for feature reduction on a Hindi language corpus. It is noteworthy that the third-order derivative of speech features increases the recognition rate by 3-4%. An Assamese speech recognition model was proposed by Dutta and Sarma (2012). They used MFCC and LPC techniques for extracting features from the speech signal, which helped them obtain a 10% gain in recognition. Kaur and Singh (2016a) used Power Normalized Cepstral Coefficients (PNCC) on connected words to build a Punjabi ASR system using HMM; the WERs obtained in noise-free and noisy environments were 16.28% and 32.08%, respectively. Kaur and Singh (2016b) used MFCC, PLP and PNCC to build an ASR system for Punjabi connected words using HMM; the WERs obtained with MFCC, PLP and PNCC were 13.19%, 16.28% and 16.28%, respectively, in a noise-free environment. Arora et al. (2019) studied pitch, intensity and fundamental frequency and their effect on Punjabi dialects. Kadyan et al. (2018) proposed a method in which different combinations of extracted feature vectors are formed before classification. The extracted features are then processed through LDA, SAT, fMLLR and MLLT methods using triphone and monophone models; fMLLR and MLLT give the best performance when combined with DNN, with a WER of 17.53%.
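
Nearly every study above reports its results as word error rate (WER) or word accuracy (WA = 100% - WER). As a point of reference, a minimal WER computation over word-level edit distance is sketched below; the example strings are invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# Invented example: one substitution in five words -> WER 0.20 (WA 80%).
print(wer("main ghar ja raha hoon", "main ghar ja rahi hoon"))
```
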
Venkateswarlu et al. (2012) presented a comparative study of the application of a Time-Lagged Recurrent Neural Network (TLRN) and a Multilayer Perceptron (MLP) using the LPCC and MFCC feature extraction techniques. Their aim was to identify an individual's utterance using a biometric system. It was found that the conventional system is outperformed by the proposed system: the vowel recognition rates achieved by the proposed system were 96.0% and 97.56% for LPCC and MFCC features, respectively, whereas the conventional system gave 92.47% and 94.0%, respectively. An ASR system for Hindi speech was also developed with two feature extraction techniques, MFCC and PLP, at the front end. The experiment was conducted on different vocabulary sizes, from 50 to 150 words, and more satisfactory results were achieved with MFCC than with the PLP feature extraction technique; the accuracy rate was 3-4% higher for MFCC than for PLP. Kumar et al. (2014a) compared all three system types (isolated, connected and continuous) for the Hindi language with different vocabulary sizes using both MFCC and PLP feature extraction techniques. The result of the suggested ASR system with GMM at the back end for a vocabulary size of 50 words was 95.04% for MFCC and 90.24% for PLP. Furthermore, the analysis showed that performance using HMM for 50 words (monophones) was 90.12% for MFCC and 70.72% for PLP; for 50 triphones, MFCC and PLP provide accuracies of 92.0% and 73.36%, respectively.
Sriranjani et al. (2014) presented a comprehensive study of different feature extraction techniques, i.e. MFCC, PLP, PNCC and RASTA-PLP. The results show that PNCC performed well for a clean corpus, whereas MFCC impressed in the case of multi-condition speech data. The comparison among the performances of the feature extraction methods is observed as: MFCC > PNCC > RASTA-PLP > PLP. Biswas et al. (2015) carried out a baseline recognition test using conventional 36-dimensional MFCC and GFCC features; they considered a frame size of 24 ms (with a 10 ms skip rate) to extract features for both techniques. The authors observed that MFCC has been the feature most used by researchers in the field of automatic speech recognition. Some studies have compared various feature extraction techniques such as wavelet, RASTA, etc., and after the comparative study observed that MFCC provides promising recognition results for automatic speech recognition.
Bharali and Kalita (2018) used the delta-delta MFCC feature for building a speech recognition system for the Assamese language. They developed word models using three different techniques: HMM, VQ and I-vector. In a clean environment, with a 39-dimensional feature vector, the I-vector approach gives a word accuracy of 81%. Dua et al. (2018a, b, c, d) presented a Differential Evolution (DE) technique for optimizing the filters in MFCC, GFCC and BFCC; the performance of the proposed technique was evaluated in both noise-free and noisy environments. Vegesna et al. (2018) worked on the IIIT-H Telugu speech corpus of 64,464 utterances to extract MFCC and prosody features, and a hybrid GMM-HMM classifier was employed. Bhowmik et al. (2018) worked on the detection and classification of phonological features from Bengali continuous speech. They calculated the derivative and double derivative of the 13 MFCC features to yield a 39-dimensional input feature vector. The proposed model achieved 86.19% average feature detection accuracy on the Bengali CDAC corpus.
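
The 13-static-plus-derivatives recipe that recurs throughout these studies (13 MFCCs, their deltas and delta-deltas, giving 39 dimensions per frame) can be sketched in a few lines. The snippet below uses the librosa library rather than the HTK/Kaldi front ends of the cited works, and the file name is hypothetical.

```python
import numpy as np
import librosa

# Hypothetical file name; any mono speech recording would do.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 static MFCCs per frame, the base feature in most surveyed front ends.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# First (delta) and second (delta-delta) time derivatives.
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)

# Stack into the 39-dimensional per-frame vector (13 + 13 + 13).
features = np.vstack([mfcc, d1, d2])
print(features.shape)  # (39, n_frames)
```
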
Mohamed and Lajish (2016) compared MFCC with nonlinear features for Malayalam vowels, considering a joint feature vector of nonlinear and MFCC features for building the speech recognition system. The recognition experiment was conducted by simulating the algorithms in MATLAB, where Phase Space Anti-diagonal Point Distribution (PSAPD) combined with MFCC gives a recognition accuracy of 80.74%. Dua et al. (2018b) used the noise-robust Gammatone Frequency Cepstral Coefficients (GFCC) method for feature extraction. They also applied a Differential Evolution (DE) technique to refine the GFCC features, and discriminative techniques to enhance the performance of the acoustic model. The results reveal that DE-optimized GFCC with HMM-Gaussian Mixture Model (GMM) acoustic modeling performs better than the MFCC, PLP and MF-PLP feature extraction methods. Chellapriyadharshini et al. (2018) used a dataset of the Indian language Tamil released by Microsoft for the Interspeech 2018 challenge. They used a DNN-HMM framework, trained using Kaldi, to build an ASR system on 5.6 h of the released data; 40-dimensional features (MFCC + LDA + MLLT + fMLLR) were used. The proposed semi-supervised learning offers WER reductions of as much as 50% (approximately 15%) of the best WER reduction realizable from the seed model's WER. Manjunath and Rao (2018) developed a multilingual phone recognition system (MPRS) for four Indian languages: Kannada, Telugu, Bengali, and Odia. The performance of the MPRS is improved using Articulatory Features (AFs). The MPRS was also developed using oracle AFs, and their performance was compared with that of predicted AFs. It was deduced that oracle AFs fused with MFCCs offer a remarkably low phone error rate (PER) of 10.4%, a 24.7% absolute reduction compared to the baseline MPRS with MFCCs alone. Some of the prominent techniques are listed in Table 2.

4.3 Classification techniques

Kurian and Balakrishnan (2009) illustrated a speech recognition system for Malayalam words using an HMM classifier. Paul et al. (2009) used ANN structures designed with an MLP over LPC features to build a Bangla speech recognition system. Sarma et al. (2015) worked on an ANN-based cooperative architecture for the recognition of Assamese numerals; the ANN models were designed using both MLP and SOM to handle voice signals with gender-based differences in different emotional conditions. Sukumar et al. (2010) considered ANN and Discrete Wavelet Transform (DWT) techniques to build a system for the recognition of isolated question words of the Malayalam language from speech queries. Aggarwal and Dave (2011) summarized the research work using the HMM classifier. They emphasized standard back-end statistical techniques in the context of automatic speech recognition, and explained various refinements and advancements over a standard HMM method. Bhuvanagirir and Kopparapu (2012) used a multilayer feedforward ANN with the average energy information of zero-crossings and their intervals for Malayalam vowel phoneme recognition.

Table 2 Summary of feature extraction techniques

Author MFCC LPCC PLP RASTA-PLP PNCC Wavelet fMLLR and MLLT FFT and DCT PSAPD DE AFs

Prasanna and Pradhan (2011) ✓
Dua et al. (2012a, b) ✓
Aggarwal and Dave (2012) ✓
Kumari et al. (2014) ✓
Kumar et al. (2014b) ✓ ✓
Sriranjani et al. (2014) ✓ ✓ ✓ ✓
Ramamohan and Dandapat (2006) ✓
Dua et al. (2015) ✓ ✓
Vydana et al. (2015) ✓
Biswas et al. (2015) ✓ ✓
Patel and Patil (2016) ✓
Biswas et al. (2014) ✓ ✓
Mohamed and Nair (2012) ✓
Koolagudi et al. (2012) ✓
Sreenu et al. (2004) ✓
Gunasekaran and Revathy (2008) ✓
Malhotra and Khosla (2008) ✓
Singhvi et al. (2008) ✓
Jothilakshmi et al. (2012) ✓
Kandali et al. (2008) ✓
Thasleema et al. (2007) ✓
Rojathai and Venkatesulu (2014) ✓
Patil and Pardeshi (2014a, b) ✓
Rahul et al. (2013) ✓
Kandali et al. (2009) ✓ ✓
Kamble et al. (2014) ✓
Hemakumar and Punitha (2014) ✓
Undha et al. (2014) ✓
Venkateswarlu et al. (2012) ✓ ✓
Mohan et al. (2012) ✓
Kotwal et al. (2012) ✓
Dutta and Sarma (2012) ✓ ✓
Sadanandam et al. (2012) ✓
Dileep and Sekhar (2013) ✓
Muralikrishna and Ananthakrishna (2013) ✓
Pandey et al. (2013) ✓
Asfak-Ur-Rahman et al. (2012) ✓
Ahamed et al. (2013) ✓
Malhotra and Khosla (2013) ✓
Mittal et al. (2013) ✓
Ganesh and Ravichandran (2013) ✓ ✓
Manjunath and Rao (2014) ✓
Shah (2009) ✓
Sinha et al. (2013) ✓ ✓
Kurian and Balakrishnan (2009) ✓
Mandal et al. (2010) ✓
Ranjan et al. (2010) ✓
Aggarwal and Dave (2010) ✓
Sukumar et al. (2010) ✓ ✓
Sarma et al. (2010) ✓
Mehta and Anand (2010) ✓ ✓
Koolagudi et al. (2012) ✓
Bharali and Kalita (2018) ✓
Mohamed and Lajish (2016) ✓ ✓
Manjunath and Rao (2018) ✓ ✓
Kaur and Singh (2016a, b) ✓ ✓ ✓
Kadyan et al. (2018) ✓
Venkateswarlu et al. (2012) applied a multilayer perceptron classifier and a Time-Lagged Recurrent Neural Network (TLRN) to recognize speech.
Dutta and Sarma (2012) built their system using an RNN; LPC and MFCC features were fed to separate decision blocks, and a 10% gain in recognition was recorded by using multiple feature extraction. Thasleema and Narayanan (2012) carried out consonant classification in noisy and clean environments. Rani and Girija (2012) presented a Telugu speech recognition system; to maximize its accuracy, they sorted out many confusions, describing almost all the errors they found and the chances of other inaccurate outputs. Work on a Bengali speech recognition system was first initiated by Das et al. (2011), who used phone- and triphone-based speech corpora and HMM for building ASR systems using HTK and SPHINX. Sarma et al. (2013) used an HMM-based classifier for recognizing Tamil and Telugu words using Doordarshan corpora. Kumar et al. (2013a, b) revealed in their research report that the distribution pattern can be utilized for the classification and recognition of phonemes. Sarma et al. (2013) presented an ANN model to recognize initial phones of the Assamese language. Initial phonemes were segmented from their word counterparts using a SOM-based algorithm. With the help of three ANN structures (RNN, SOM, and Probabilistic Neural Network (PNN)), its superiority over Discrete Wavelet Transform (DWT)-based phoneme segmentation was detailed. Bhattacharjee (2013a, b) presented a comparative analysis of LPCC and MFCC features for the phones of the Assamese language; the performance of the two techniques was evaluated using MLP-based baseline phoneme recognizers.
Pravin and Jethva (2013) proposed an MFCC and MLP based Gujarati speech recognition system. Kumar et al. (2014b) compared and developed a new ASR system for isolated, connected and continuous speech with different vocabulary sizes; they used the Hidden Markov Model to develop the system for the Hindi language. GMM and HMM were used at the back end of the system; with increasing vocabulary size the results improved, but using MFCC at the front end and a two-state GMM model at the back end gave the best results. Rojathai and Venkatesulu (2014) proposed phasing auto-correlation spectrum features that increased the recognition rate for noisy speech signals. Patil and Pardeshi (2014a, b) presented a Devanagari (Indo-Aryan language) ASR system; their system contained 35 phonemes, uttered by a single user 20 times, and the HMM classifier gave a recognition rate of 60.57%. Patil and Pardeshi (2014a, b) also proposed an ASR system for Marathi connected words using the MFCC technique and a continuous-density Hidden Markov bigram model. Hemakumar and Punitha (2014) designed a speaker-independent system for continuous Kannada speech using HMM. The voiced part of the speech is detected by a dynamic threshold (i.e. short-time energy and magnitude of the signal), and then LPC coefficients are extracted from the signals and converted into Real Cepstrum Coefficients (RCC). The RCC coefficients were pushed through a k-means clustering algorithm using a three-state HMM model; they reported an accuracy of 87.0%. Patil and Rao (2016) proposed a recognition system that identifies non-native accents of the Hindi language. The phonetic features performed well in two-way classification, where native Hindi utterances were tested by statistical models (i.e. HMM models) trained on Marathi speech. The acoustic-phonetic features completely separated the native utterances from the non-native utterances in experiments classifying native and non-native Hindi speech by Tamil speakers. A discriminative approach to train the HMM for continuous speech systems in Hindi was proposed by Dua et al. (2017); the feature extraction technique ensembled MFCC and PLP features. For acoustic training of the model, MMIE and MPE discriminative techniques were adopted, and the proposed ensemble features with MPE gave better results than the other feature extraction and discriminative techniques. Pal et al. (2018) developed an ASR system in the Bengali language for handling queries regarding agricultural commodities; the experiments were conducted with the KALDI toolkit using a speech corpus of local people. Darekar and Dhande (2018) implemented a novel technique to recognize emotions in Marathi speech using a hybrid PSO-FF algorithm. Cepstral, NMF and MFCC feature extraction techniques were used to extract features from the Marathi and benchmark databases; the whole experiment was conducted in MATLAB. Kannadaguli and Bhat (2018) evaluated the performance of Bayesian and HMM-based techniques to recognize emotions in Kannada speech. Bhowmik et al. (2018) developed a DNN-based model of the Bengali language in which place and manner of articulation are classified in the output layer; the proposed model achieved 86.19% average feature detection accuracy on the Bengali CDAC corpus. Kadyan et al. (2018) worked on Punjabi speech recognition and tried to reduce the acoustic mismatch between training and testing conditions with the help of DNN-HMM, also handling the overfitting issue of the training data; the system gives the best result using 6 hidden layers. Kadyan et al. (2017) evaluated the performance of an ASR system for Punjabi using combinations of HMM with DE and with GA. The maximum WA in a noisy environment was obtained for the Malwai dialect: 84.96%, 83.26% and 78.44% for DE-HMM, GA-HMM and HMM, respectively.
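
Many of the isolated-word systems above follow the same pattern: train one HMM with GMM emission densities per word class, then label a test utterance with the model that assigns it the highest likelihood. A compact sketch of that pattern with the hmmlearn library is given below; the cited systems were built with HTK, Sphinx or Kaldi rather than this library, and the data layout is an assumption.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_word_models(train_data, n_states=3, n_mix=2):
    """train_data: assumed dict mapping a word label to a list of
    (n_frames, 39) MFCC feature matrices for that word."""
    models = {}
    for word, utterances in train_data.items():
        X = np.vstack(utterances)               # all frames, stacked
        lengths = [len(u) for u in utterances]  # frames per utterance
        m = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20)
        m.fit(X, lengths)                       # Baum-Welch re-estimation
        models[word] = m
    return models

def recognize(models, utterance):
    """Label a (n_frames, 39) test utterance with the most likely word model."""
    return max(models, key=lambda w: models[w].score(utterance))
```
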
Pulugundla et al. (2018) investigated multilingual time-delay neural network (TDNN) architectures and compared them to bi-directional residual memory networks (BRMN) and bi-directional LSTMs. The authors report word error rates of 13.92%, 14.71% and 14.06% for Tamil, Telugu and Gujarati, respectively, with systems developed with TDNN and BRMN, using a Kneser-Ney 3-gram language model. A low-rank TDNN with skip connections gave an improvement of 0.6-1.1% over the baseline TDNN. Dua et al. (2018c) investigated results using different discriminative training techniques (MMI, MPE). The system was developed using the speech corpus of TIFR, Mumbai (Samudravijaya et al. 2002). An HMM with 256 Gaussian mixtures per state was discriminatively trained in this experiment with standard 39-dimensional MFCC features, and the performance of n-gram language modeling was compared with RNNLM. A WER of 20.9% is reported for the RNNLM technique with MMI discriminative training; the results show that the MPE technique leads to significant improvement over the MMI technique with an interpolated LM. The experiments observe that the MMI and MPE discriminative training methods outperform the traditional MLE training technique for Hindi speech recognition. Dua et al. (2018d) showed that discriminative training using MPE with an MF-GFCC integrated feature vector and PSO-HMM parameter refinement gives significantly better results than the other implemented techniques. Dua et al. (2018d) also used trigram language modeling and HMM-Gaussian Mixture Model (GMM) based acoustic modeling to build a continuous Hindi-language ASR system, applying a Differential Evolution (DE) technique to refine the GFCC features and discriminative techniques to enhance the performance of the acoustic model. The results reveal that DE-optimized GFCC with HMM-GMM acoustic modeling gives better results than the baseline systems. The experimental results show that Minimum Phone Error (MPE) outperforms Maximum Mutual Information (MMI) and Maximum Likelihood Estimation (MLE), and trigram-based language modeling gives more accurate results than unigram and bigram language modeling.
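
Several of these systems rely on smoothed n-gram language models such as the Kneser-Ney 3-gram model mentioned above. A toy sketch of estimating one with NLTK is shown below; the cited systems would typically use SRILM or Kaldi's LM tooling instead, and the two training sentences are invented.

```python
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Invented toy corpus: tokenized Hindi sentences (romanized for readability).
sentences = [
    ["main", "ghar", "ja", "raha", "hoon"],
    ["vah", "ghar", "ja", "rahi", "hai"],
]

# Build padded 1/2/3-gram training data and the vocabulary stream.
train_ngrams, vocab = padded_everygram_pipeline(3, sentences)

lm = KneserNeyInterpolated(order=3)
lm.fit(train_ngrams, vocab)

# P(word | two-word history); the context is given oldest-first.
print(lm.score("ja", ["main", "ghar"]))
```
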
Fathima et al. (2018) explored phonetic properties of the languages that are essential for improved ASR performance. They proposed a multilingual Time Delay Neural Network (TDNN) system for building an ASR system based on phonetic information. The speech corpus released by Microsoft was used to build the system for Gujarati; the WER decreases gradually from GMM (16.95%) and DNN (14.38%) to TDNN (12.7%) systems.
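
A TDNN of the kind used by Fathima et al. and Pulugundla et al. is, in essence, a stack of 1-D convolutions over time whose dilation widens the temporal context layer by layer. The PyTorch sketch below illustrates the structure only; the layer sizes, contexts and senone count are invented, not those of the cited systems.

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """One TDNN layer: a 1-D convolution over time with a fixed context window."""
    def __init__(self, in_dim, out_dim, context=3, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context, dilation=dilation)
        self.act = nn.Sequential(nn.ReLU(), nn.BatchNorm1d(out_dim))

    def forward(self, x):  # x: (batch, feat_dim, time)
        return self.act(self.conv(x))

# A small stack with widening temporal context via dilation (illustrative sizes).
net = nn.Sequential(
    TDNNLayer(39, 256, context=5, dilation=1),
    TDNNLayer(256, 256, context=3, dilation=2),
    TDNNLayer(256, 256, context=3, dilation=3),
    nn.Conv1d(256, 2000, kernel_size=1),  # frame-level senone scores
)
frames = torch.randn(8, 39, 200)   # batch of 8 utterances, 200 frames each
print(net(frames).shape)           # torch.Size([8, 2000, 186])
```
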
Pandey and Nathwani (2018) built a DNN-based keyword spotting framework that utilizes both the spectral and the prosodic information present in the speech signal. An improved methodology for Key Word Spotting (KWS) in the Hindi language was proposed by fusing spectral and prosodic information using a deep network architecture. The effectiveness of the proposed framework is evaluated in terms of accuracy, True Positive Rate (TPR) and False Positive Rate (FPR); its performance on syllable recognition and keyword spotting indicates 5.9% and 6.3% improvements, respectively, over the corresponding DNN-HMM baseline system. Pal et al. (2018) presented a voice-based mobile application for the dissemination of agricultural commodity procurement and consumer prices. A dynamic language model was designed that automatically rebuilds itself for daily reported agricultural commodities. A comparative recognition performance analysis was performed on field-collected live test audio of around 5 h using the Sphinx and Kaldi toolkits to finalize a robust back-end ASR system; the best-performing Kaldi ASR system had 7.9% WER using SGMM with LDA, MLLT and SAT training on extracted MFCC, delta and double-delta features. Patel et al. (2018) developed ASR systems for the Manipuri language using Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) and Deep Neural Network-Hidden Markov Model (DNN-HMM) based architectures as baselines; the DNN-HMM system produces 13.57% WER and 7.64% EER for KWS. The KALDI speech recognition toolkit was used for developing the systems, and the Manipuri ASR system along with KWS is integrated as a visual interface for demonstration purposes. A summary of modeling techniques with recognition rates is given in Table 3.
A brief summary of classification techniques is given in Table 4.

5 Synthetic analysis and future directions of the compiled work

• Many of the ASR databases lack a large speech corpus. Such corpora can be enriched by including more dialectal, prosodic and tonal information (if present) for more analytical information processing.
• Some Indian languages, like Bodo, Dogri, and Punjabi, are tonal in nature. An analysis needs to be performed using pitch and vocal tract information for these languages and their subsequent dialects.
• Another major issue with these languages is variation in dialectal information. A few studies were carried out on extracting the linguistic information of Indian languages; this needs to be combined with speech technologies to reduce WER.
• A number of works have adopted bottleneck features (Grézl et al. 2011). Most of the speech corpora developed in Indian languages are based on a noise-free environment. Further work can be carried out by developing noisy or mixed datasets (a minimal noise-mixing sketch is given after this list) and applying different noise-robust approaches to pitch characteristics to improve recognition performance.
• An attempt can be made to refine the acoustic features using an optimization algorithm on model parameters. Only some of the studies have worked on feature optimization/refinement; research in many languages focuses on already established feature extraction techniques like MFCC, and only a few studies have used hybrid feature extraction techniques for feature refinement.

Table 3 Overview of modeling techniques for Indian languages

Language type Modeling technique Recognition rate

Manipuri (Patel et al. 2018) DNN + HMM 13.57% WER
Punjabi (Kadyan et al. 2018) DNN + HMM 17.53% WER
Punjabi (Kadyan et al. 2017) HMM + DE 84.96% WA; GA + HMM 83.26% WA
Hindi (Pandey and Nathwani 2018) DDML + Attn-LSTM + DDA 87.2% syllable recognition (SR) and 94.8% keyword spotting (KWS)
Bengali (Pal et al. 2018) SGMM + LDA, MLLT + SAT 7.9% WER
Hindi (Manjunath and Rao 2018) MFCC + articulatory features + DNN 10.4% PER
Hindi (Dua et al. 2018a, b, c, d) GFCC + DE, HMM + GMM 77.9% WA
Hindi (Dua et al. 2018a, b, c, d) PSO-HMM 87.37% WA
Hindi (Dua et al. 2018a, b, c, d) MMI + HMM (1) and MPE + HMM (2) (1) 24.1% WER; (2) 22.3% WER
Gujarati (Fathima et al. 2018) Time delay neural network 12.7% WER
Tamil, Telugu and Gujarati (Pulugundla et al. 2018) TDNN (low rank, transfer learning), BRMN 13.92%, 14.71% and 14.06% WER for Tamil, Telugu and Gujarati
Assamese (Bharali and Kalita 2018) HMM, I-vector 80% WA with 39-dimensional MFCC feature vector
Bengali (Bhowmik et al. 2018) DNN 86.19% feature detection accuracy (FDA)
Assamese (Dutta and Sarma 2012) RNN 90.47% WA
Assamese (Bharali and Kalita 2015) HMM 80% WA for MFCC and 95% WA for LPCEPSTRA
Bangla (Hasnat et al. 2007) HMM, SVM Isolated, speaker-independent: 70%; continuous, speaker-independent: 60%
Hindi (Kumar et al. 2004) HMM 85.46% WA
Hindi, Sanskrit, Punjabi and Telugu (Ranjan 2010) ANN 83.29% with back propagation algorithm and 92.78% using clustering algorithm
Hindi (Dey et al. 2014) DNN 59.90% WA
Hindi (Kumar et al. 2014a) GMM-HMM 97.04% WA
Hindi (Kumar et al. 2014a) GMM-HMM 95.40% WA
Hindi (Mandal et al. 2015) DNN 82.00% WA
Hindi (Mittal and Sharma 2016) SVM, binary PSO-SVM, binary PSO and Hooke-Jeeves-SVM 83.70%, 89.20%, 91.10%, 90.00% WA
Hindi (Pandey et al. 2017) DNN 10.63% WER
Hindi (Dua et al. 2017) GMM-HMM, MMIE, MPE 31.14%, 27.40%, 25.90% WER
Hindi (Dua et al. 2018a, b, c, d) HMM + GMM, HMM + MMI, HMM + MPE 86.9% WA (clean), 86.20% WA (noisy)
Kannada (Hegde et al. 2012) SVM 79.00% WA
Kannada (Thalengala and Shama 2016) HMM 74.35% WA for triphone
Mizo (Dey et al. 2018) SGMM, DNN 10.30% PER, 23.20% PER
Punjabi (Kumar and Singh 2017) HMM 96.87% WA, 98.71% sentence accuracy
Tamil (Radha 2012) HMM 88.00% WA
Tamil and Telugu (Renjith and Manju 2017) KNN/ANN 76.41% WA
Table 4 Summary of classification techniques

Author HMM ANN/DNN VQ GMM KNN TLRN SVM TDNN GA

Samudravijaya et al. (1998) ✓
Rajput et al. (2000) ✓
Yegnanarayana and Gangashetty (2011) ✓
Sekhar and Yegnanarayana (2002) ✓
Sreenu et al. (2004) ✓
Kumar et al. (2004) ✓
Udhyakumar et al. (2004) ✓
Yegnanarayana et al. (2005) ✓ ✓ ✓ ✓
Thasleema et al. (2007) ✓
Kandali et al. (2008) ✓
Thangarajan et al. (2008) ✓
Malhotra and Khosla (2008) ✓
Lakshmi and Murthy (2008) ✓
Shah (2009) ✓ ✓ ✓ ✓
Lakshmi et al. (2009) ✓
Kalyani and Sunitha (2010) ✓
Kurian and Balakrishnan (2009) ✓
Kandali et al. (2009) ✓
Paul et al. (2009) ✓
Sarma et al. (2015) ✓
Mandal et al. (2010) ✓
Ranjan (2010) ✓
Sukumar et al. (2010)
Aggarwal and Dave (2010) ✓
Sarma et al. (2010) ✓
Mehta and Anand (2010) ✓
Nair and Sreenivas (2010) ✓
Bhattacharjee (2013a, b) ✓
Koolagudi and Krothapalli (2011) ✓
Prasanna and Pradhan (2011) ✓
Das et al. (2011) ✓
Jothilakshmi et al. (2012) ✓
Mohamed and Nair (2012) ✓ ✓
Koolagudi et al. (2012) ✓
Dua et al. (2012a, b) ✓
Kumar et al. (2012) ✓
Asfak-Ur-Rahman et al. (2012) ✓
Venkateswarlu et al. (2012) ✓ ✓
Mohan et al. (2012) ✓ ✓
Kotwal et al. (2012) ✓
Sunil and Lajish (2012) ✓
Aggarwal and Dave (2012) ✓
Rani and Girija (2012) ✓
Sadanandam et al. (2012) ✓ ✓
Dutta and Sarma (2012) ✓
Dileep and Sekhar (2013) ✓
Mohan and Rose (2013) ✓
Malde et al. (2013)
Muralikrishna and Ananthakrishna (2013) ✓
Pandey et al. (2013) ✓
Sinha et al. (2013) ✓ ✓
Ahamed et al. (2013) ✓
Malhotra and Khosla (2013) ✓
Mittal et al. (2013) ✓
Ganesh and Ravichandran (2013) ✓ ✓ ✓
Anukriti et al. (2013) ✓
Rahul et al. (2013) ✓ ✓
Bhattacharjee (2013a, b) ✓
Manjunath and Rao (2014) ✓ ✓
Kumari et al. (2014) ✓
Sriranjani et al. (2014) ✓
Rojathai and Venkatesulu (2014) ✓
Patil and Pardeshi (2014a, b) ✓
Patil and Rao (2016) ✓
Kamble et al. (2014) ✓
Hemakumar and Punitha (2014) ✓
Dua et al. (2017) ✓ ✓
Kumar et al. (2015) ✓
Sarma et al. (2013) ✓
Dua et al. (2015) ✓
Patel et al. (2018) ✓ ✓
Kadyan et al. (2018) ✓ ✓
Kadyan et al. (2017) ✓ ✓
Pulugundla et al. (2018) ✓
Dey et al. (2018) ✓ ✓
Renjith and Manju (2017) ✓ ✓ ✓
Dua et al. (2018a, b, c, d) ✓ ✓ ✓
• Studies on ASR systems are less focused on the hybridization of back-end approaches. This can help in boosting the performance of the ASR system; integration or refinement of the decoding stage can be used to optimize the output parameters for better knowledge generation.
• Recently, the DNN approach has been explored with the HMM classifier technique. For lack of large speech corpora, many refinement techniques like CNN, RNN, etc. cannot be efficiently applied in the Indian context.
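
As flagged in the noise-robustness point above, noisy training sets are often simulated by additively mixing a noise recording into clean corpus audio at a controlled signal-to-noise ratio. A minimal sketch of such mixing is given below; the array contents and the chosen SNR are illustrative.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix a noise signal into clean speech at a target SNR in dB."""
    # Tile or trim the noise to match the clean signal's length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    # Scale noise so that 10 * log10(p_clean / p_noise_scaled) equals snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Illustrative arrays (stand-ins for corpus audio and a noise recording).
speech = np.random.randn(16000)   # 1 s of placeholder "speech" at 16 kHz
babble = np.random.randn(8000)    # placeholder noise, tiled to length inside
noisy = mix_at_snr(speech, babble, snr_db=10.0)
```
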


6 Inferences

There are many toolkits available as open source, like HTK, SPHINX, KALDI, JULIUS and many more, for developing speech recognition models. However, from the survey it has been observed that HTK is the most widely used toolkit in the Indian scenario; more recently, researchers have prominently used Kaldi for system building. Almost every kind of recognition has also been carried out in MATLAB, like phoneme recognition, continuous speech recognition, speech enhancement, formant analysis, spontaneous speech recognition, etc. The detailed literature survey brings out that not many ASR systems have been experimented with using different deep learning techniques; the reason is the scarcity of large speech corpora for the different languages. Moreover, experimental work on feature extraction is also limited to some of the prominent languages only. Most of the studies use the HMM-GMM classifier for classifying speech signals.

References
Aggarwal RK, Dave M (2010) Fitness evaluation of Gaussian mixtures in Hindi speech recognition system. In: Proceedings of first international conference on integrated intelligent computing (ICIIC), pp 177–183
Aggarwal RK, Dave M (2011) Application of genetically optimized neural networks for Hindi speech recognition system. In: Proceedings of the world congress on information and communication technologies (WICT), pp 512–517
Aggarwal RK, Dave M (2012) Filter bank optimization for robust ASR using GA and PSO. Int J Speech Technol 15(2):191–201
Aggarwal RK, Dave M (2013) Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system. Telecommun Syst 52(3):1457–1466
Ahamed B, Israt F, Chowdhury SMR, Huda MN (2013) Effect of speaker variation on the performance of Bangla ASR. In: Proceedings of the 2nd international conference on informatics, electronics and vision, pp 1–5
Ali H, Ahmad N, Yahya KM, Farooq O (2012) A medium vocabulary Urdu isolated words balanced corpus for automatic speech recognition. In: Proceedings of the international conference on electronics computer technology, pp 473–476
Ali H, Ahmad N, Zhou X, Ali M, Manjotho AA (2013) Linear discriminant analysis based approach for automatic speech recognition of Urdu isolated words. In: Proceedings of the international multi topic conference, pp 24–34
Antony PJ (2013) Machine translation approaches and survey for Indian languages. Int J Comput Linguist Chin Lang Process 18(1):47–78
Anukriti, Tiwari S, Chatterjee T, Bhattacharya M (2013) Speaker independent speech recognition implementation with adaptive language models. In: Proceedings of international symposium on computational and business intelligence, pp 7–10
Anumanchipalli G, Chitturi R, Joshi S, Kumar R, Singh SP, Sitaram RNV, Kishore SP (2005) Development of Indian language speech databases for large vocabulary speech recognition systems. In: Proceedings of international conference on speech and computer (SPECOM), pp 1–4
Anusuya MA, Katti SK (2009) Comments on the paper by Perlovsky entitled integrating language and cognition. IEEE Comput Intell Mag 4:48–49
Anusuya MA, Katti SK (2010) Speech recognition by machine: a review. Int J Comput Sci Inf Secur 6(3):181–205
Arora A, Kadyan V, Singh A (2019) Effect of tonal features on various dialectal variations of Punjabi language. In: Proceedings of the conference on advances in signal processing and communication, pp 467–475
Asfak-Ur-Rahman M, Kotwal MRA, Hassan F, Ahmmed S, Huda MN (2012) Gender effect cannonicalization for Bangla ASR. In: Proceedings of the 15th international conference on computer and information technology (ICCIT), pp 179–184

Bansal S, Dev A (2013) Emotional Hindi speech database. In: Proceedings of the International Confer-
ence on Oriental COCOSDA held jointly with Asian spoken language research and evaluation
(O-COCOSDA/CASLRE), pp 1–4
Bansal S, Sharan S, Agrawal SS (2015) Corpus design and development of an annotated speech database
for Punjabi. In: Proceedings of the international conference on oriental COCOSDA held jointly with
Asian spoken language research and evaluation (O-COCOSDA/CASLRE), pp 32–37
Besacier L, Barnard E, Karpov A, Schultz T (2014) Automatic speech recognition for under-resourced lan-
guages: a survey. Speech Commun 56:85–100
Bharali SS, Kalita SK (2015) A comparative study of different features for isolated spoken word recognition
using HMM with reference to Assamese language. Int J Speech Technol 18(4):673–684
Bharali SS, Kalita SK (2018) Speech recognition with reference to Assamese language using novel fusion
technique. Int J Speech Technol 21(2):251–263
Bhattacharjee U (2013a) Recognition of the tonal words of Bodo language. Int J Recent Technol Eng
1(6):114–118
Bhattacharjee U (2013b) A comparative study Of LPCC and MFCC features for the recognition of Assa-
mese phonemes. Int J Eng Res Technol 2(1):1–6
Bhowmik T, Chowdhury A, Mandal SKD (2018) Deep neural network based place and manner of articula-
tion detection and classification for Bengali continuous speech. Procedia Comput Sci 125:895–901
Bhuvanagirir K, Kopparapu SK (2012) Mixed language speech recognition without explicit identification of
language. Am J Signal Process 2(5):92–97
Biswas A, Sahu PK, Chandra M (2014) Admissible wavelet packet features based on human inner ear fre-
quency response for Hindi consonant recognition. Comput Electr Eng 40(4):1111–1122
Biswas A, Sahu PK, Bhowmick A, Chandra M (2015) Hindi phoneme classification using Wiener filtered
wavelet packet decomposed periodic and aperiodic acoustic feature. Comput Electr Eng 42:12–22
Chaudhuri BB, Bhattacharya U (2000) Efficient training and improved performance of multilayer percep-
tron in pattern classification. Neurocomputing 34(1–4):11–27
Chellapriyadharshini M, Toffy A, Srinivasa RKM, Ramasubramanian V (2018) Semi-supervised and active-
learning scenarios: efficient acoustic model refinement for a low resource Indian language. In: Com-
puter and languages, pp 1041–1045
Chhayani NH, Patil HA (2013) Development of corpora for person recognition using humming, singing
and speech. In: Proceedings of the international conference on oriental COCOSDA held jointly with
Asian spoken language research and evaluation, pp 1–6
CDAC Corpus (2015) http://cdac.in/index.aspx?=mc-i/fspeech-corpora. Accessed 22 Feb 2018
Darekar RV, Dhande AP (2018) Emotion recognition from Marathi speech database using adaptive artificial
neural network. Biol Inspired Cognit Archit 23:35–42
Das B, Mandal S, Mitra P (2011) Bengali speech corpus for continuous automatic speech recognition sys-
tem. In: Proceedings of the international conference on speech database and assessments, pp 51–55
Dey A, Zhang W, Fung P (2014) Acoustic modeling for Hindi speech recognition in low-resource settings.
In: International conference on audio, language and image processing (ICALIP), pp 891–894
Dey A, Lalhminghlui W, Sarmah P, Samudravijaya K, Prasanna SM, Sinha R, Nirmala SR (2018) Mizo
phone recognition system. In: Interspeech, pp 1–5
Dileep AD, Sekhar CC (2013) HMM based pyramid match kernel for classification of sequential patterns
of speech using support vector machines. In: Proceedings of the IEEE international conference on
acoustics, speech and signal processing, pp 3562–3566
Dua M, Aggarwal RK, Kadyan V, Dua S (2012a) Punjabi automatic speech recognition using HTK. Int J
Comput Sci Issues (IJCSI) 9(4):359
Dua M, Aggarwal RK, Kadyan V, Dua S (2012b) Punjabi speech to text system for connected words. In:
Proceedings of the fourth international conference on advances in recent technologies in communica-
tion and computing (ARTCom2012), pp 206–209
Dua M, Kumar A, Chaudhary T (2015) Implementation and performance evaluation of speaker adaptive
continuous Hindi ASR using tri-phone based acoustic modelling. In: Proceedings of 2015 interna-
tional conference on future computational technologies, pp 68–73
Dua M, Aggarwal RK, Biswas M (2017) Discriminative training using heterogeneous feature vector for
Hindi automatic speech recognition system. In: Proceedings of international conference on computer
and applications (ICCA), pp 158–162
Dua M, Aggarwal RK, Biswas M (2018a) GFCC based discriminatively trained noise robust continuous
ASR system for Hindi language. J Ambient Intell Humaniz Comput 10:1–14
Dua M, Aggarwal RK, Biswas M (2018b) Discriminative training using noise robust integrated features and refined HMM modeling. J Intell Syst. https://doi.org/10.1515/jisys-2017-0618
Dutta K, Sarma KK (2012) Multiple feature extraction for RNN-based Assamese speech recognition for
speech to text conversion application. In: Proceedings of the international conference on commu-
nications, devices and intelligent systems (CODIS), pp 600–603
EMILLE Corpora (2015) http://catalog.elra.info/search-result.php?keywords=W0037&language=en. Accessed 05 Jan 2018
Farooq O, Datta S, Shrotriya MC (2010) Wavelet sub-band based temporal features for robust Hindi pho-
neme recognition. Int J Wavelets Multi Resolut Inf Process 8(6):847–859
Fathima N, Patel T, Mahima C, Iyengar A (2018) TDNN-based multilingual speech recognition system for low resource Indian languages. In: Proceedings of Interspeech, pp 3197–3201
Gaikwad S, Gawali B, Mehrotra S (2013) Creation of Marathi speech corpus for automatic speech rec-
ognition. In: Proceedings of the international conference on oriental COCOSDA held jointly with
Asian spoken language research and evaluation (O-COCOSDA/CASLRE), pp 1–5
Ganesh AA, Ravichandran C (2013) Grapheme Gaussian model and prosodic syllable based Tamil
speech recognition system. In: Proceedings of the international conference on signal processing
and communication (ICSC), pp 401–406
Ghai W, Singh N (2012) Literature review on automatic speech recognition. Int J Comput Appl
41(8):42–50
Godambe T, Samudravijaya K (2011) Speech data acquisition for voice based agricultural information
retrieval. In: Proceedings of the all India DLA conference, Punjabi University, Patiala, pp 1–8
Grézl F, Karafiat M, Janda M (2011) Study of probabilistic and bottle-neck features in multilingual environ-
ment. In: IEEE workshop on automatic speech recognition and understanding (ASRU), pp 359–364
Gulzar T, Singh A, Rajoriya DK, Farooq N (2014) A systematic analysis of automatic speech recogni-
tion: an overview. Int J Curr Eng Technol 4(3):1664–1675
Gunasekaran S, Revathy K (2008) Fractal dimension analysis of audio signals for Indian musical instru-
ment recognition. In: Proceedings of the international conference on audio, language and image
processing, pp 257–261
Hasnat M, Mowla J, Khan M (2007) Isolated and continuous Bangla speech recognition: implementation, performance and application perspective. In: Proceedings of international symposium on natural language processing (SNLP), pp 1–6
Hassan F, Kotwal MRA, Huda MN (2011) Bangla ASR design by suppressing gender factor with gender-independent and gender-based HMM classifiers. In: Proceedings of the world congress on information and communication technologies (WICT), pp 1276–1281
Hegde RM, Murthy HA, Gadde VRR (2004) Continuous speech recognition using joint features derived
from the modified group delay function and MFCC. In: Proceedings of 8th international confer-
ence on spoken language processing, pp 1–4
Hegde S, Achary KK, Shetty S (2012) Isolated word recognition for Kannada language using support
vector machine. In: Wireless networks and computational intelligence, pp 262–269
Hemakumar G, Punitha P (2014) Automatic segmentation of Kannada speech signal into syllables and sub-words: noised and noiseless signals. Int J Sci Eng Res 5(1):1707–1711
Interspeech (2018) Low resource speech recognition challenge for Indian languages. https://www.microsoft.com/en-us/research/event/interspeech-2018-special-session-low-resource-speech-recognition-challenge-indian-languages/. Accessed 22 Feb 2018
Jain A, Prakash N, Agrawal SS (2011) Evaluation of MFCC for emotion identification in Hindi speech.
In: Proceedings of the 3rd international conference on communication software and networks
(ICCSN), pp 189–193
Jayanna HS (2009) Limited data speaker recognition. Ph.D. thesis, Indian Institute of Technology Guwahati
Jothilakshmi S, Ramalingam V, Palanivel S (2012) A hierarchical language identification system for
Indian languages. Digit Signal Process 22(3):544–553
Kadyan V, Mantri A, Aggarwal RK (2017) A heterogeneous speech feature vectors generation approach with hybrid HMM classifiers. Int J Speech Technol 20(4):761–769
Kadyan V, Mantri A, Aggarwal RK, Singh A (2018) A comparative study of deep neural network based Punjabi-ASR system. Int J Speech Technol 22(1):111–119
Kalyani N, Sunitha KVN (2010) Syllable analysis to build a dictation system in Telugu language. Int J Com-
put Sci Inf Secur (IJCSIS) 6(3):171–176
Kamble VV, Gaikwad BP, Rana DM (2014) Spontaneous emotion recognition for Marathi spoken words.
In: Proceedings of international conference on communications and signal processing (ICCSP), pp
1984–1990
Kandali AB, Routray A, Basu TK (2008) Emotion recognition from Assamese speeches using MFCC fea-
tures and GMM classifier. In: Proceedings of the IEEE region 10 conference, pp 1–5
Kandali AB, Routray A, Basu TK (2009) Vocal emotion recognition in five native languages of Assam
using new wavelet features. Int J Speech Technol 12:1–13
Kannadaguli P, Bhat V (2018) A comparison of Bayesian and HMM based approaches in machine learning
for emotion detection in native Kannada speaker. In: Proceedings of the IEEMA Engineer infinite
conference (eTechNxT), pp 1–6
Kaur A, Singh A (2016a) Power-normalized cepstral coefficients (PNCC) for Punjabi automatic speech rec-
ognition using phone based modelling in HTK. In: Proceedings of the 2nd international conference
on applied and theoretical computing and communication technology (iCATccT), Bangalore, India,
pp 372–375
Kaur A, Singh A (2016b) Optimizing feature extraction techniques constituting phone based modelling on
connected words for Punjabi automatic speech recognition. In: Proceedings of the 2nd international
conference on advances in computing, communications and informatics (ICACCI), Jaipur, India, pp
2104–2108
Koolagudi SG, Krothapalli RS (2011) Two stage emotion recognition based on speaking rate. Int J Speech
Technol 14(1):35–48
Koolagudi SG, Reddy R, Yadav J, Rao KS (2011) IITKGP-SEHSC: Hindi speech corpus for emotion analy-
sis. In: Proceedings of the international conference on devices and communications (ICDeCom), pp
1–5
Koolagudi SG, Rastogi D, Rao KS (2012) Identification of language using mel-frequency cepstral coeffi-
cients (MFCC). Procedia Eng 38:3391–3398
Kotwal MRA, Halim T, Almaji MMH, Hossain I, Huda MN (2012) Extraction of local features for tri-phone
based Bangla ASR. In: Proceedings of the ninth international conference on information technology:
new generations (ITNG), pp 668–673
Kumar Y, Singh N (2017) An automatic speech recognition system for spontaneous Punjabi speech corpus.
Int J Speech Technol 20(2):297–303
Kumar M, Rajput N, Verma A (2004) A large-vocabulary continuous speech recognition system for Hindi.
IBM J Res Dev 48(5–6):703–715
Kumar K, Aggarwal R, Jain A (2012) A Hindi speech recognition system for connected words using HTK.
Int J Comput Syst Eng 1:25–32
Kumar SBS, Rao KS, Pati D (2013a) Phonetic and prosodically rich transcribed speech corpus in Indian languages: Bengali and Odia. In: Proceedings of the international conference oriental COCOSDA held jointly with 2013 conference on Asian spoken language research and evaluation (O-COCOSDA/CASLRE), pp 1–5
Kumar A, Dua M, Choudhary A (2014a) Implementation and performance evaluation of continuous Hindi
speech recognition. In: Proceedings of international conference on electronics and communication
systems (ICECS), pp 1–5
Kumar A, Dua M, Choudhary T (2014b) Continuous Hindi speech recognition using monophone based
acoustic modeling. In: Proceedings of the international conference on advances in computer engineer-
ing & applications. pp 1–5
Kumar VR, Vydana HK, Vuppala AK (2015) Significance of GMM-UBM based modelling for Indian lan-
guage identification. Procedia Comput Sci 54:231–236
Kumari P, Deiv DS, Bhattacharya M (2014) Automatic speech recognition of accented Hindi data. In: Pro-
ceedings of the international conference on computation of power, energy, information and communi-
cation (ICCPEIC), pp 68–76
Kurian C (2014) A review on technological development of automatic speech recognition. Int J Soft Com-
put Eng 4(4):80–86
Kurian C, Balakrishnan K (2009) Speech recognition of Malayalam numbers. In: Proceedings of the world congress on nature and biologically inspired computing, pp 1475–1479
Lakshmi A, Murthy HA (2008) A new approach to continuous speech recognition in Indian languages. In:
Proceedings of the national conference on communication, pp 1–5
Lakshmi SG, Lakshmi A, Murthy HA, Nagarajan T (2009) Automatic transcription of continuous speech
into syllable-like units for Indian languages. Sadhana 34(2):221–233
Lata S, Arora S (2013) Laryngeal tonal characteristics of Punjabi—an experimental study. In: Proceedings
of the international conference on human computer interactions (ICHCI), pp 1–6
Maity S, Vuppala AK, Rao KS, Nandi D (2012) IITKGP-MLILSC speech database for language identifica-
tion. In: Proceedings of the national conference on communication, pp 1–5
Malde KD, Vachhani BB, Madhavi MC, Chhayani NH, Patil HA (2013) Development of speech corpora in
Gujarati and Marathi for phonetic transcription. In: Proceedings of the international conference on
oriental COCOSDA held jointly with Asian spoken language research and evaluation (O-COCOSDA/
CASLRE), pp 1–6
Malhotra K, Khosla A (2008) Automatic identification of gender and accent in spoken Hindi utterances with
regional Indian accents. In: Proceedings of the spoken language technology workshop, pp 309–312
Malhotra K, Khosla A (2013) Impact of regional Indian accents on spoken Hindi. In: Proceedings of the
international conference on oriental COCOSDA held jointly with 2013 conference on Asian spoken
language research and evaluation (O-COCOSDA/CASLRE), pp 1–4
Mandal S, Das B, Mitra P (2010) Shruti-II: a vernacular speech recognition system in Bengali and an appli-
cation for visually impaired community. In: Proceedings of the students’ technology symposium
(TechSym), pp 229–233
Mandal P, Jain S, Ojha G, Shukla A (2015) Development of Hindi speech recognition system of agricultural
commodities using deep neural network. In: Proceedings of sixteenth annual conference of the inter-
national speech communication association, pp 1241–1245
Manjunath KE, Rao KS (2014) Automatic phonetic transcription for read, extempore and conversation
speech for an Indian language: Bengali. In: Proceedings of the twentieth national conference on com-
munications (NCC), pp 1–6
Manjunath KE, Rao KS (2018) Improvement of phone recognition accuracy using articulatory features. Cir-
cuits Syst Signal Process 37(2):704–728
Mantena GV, Rajendran S, Gangashetty SV, Yegnanarayana B, Prahallad K (2011) Development of a spo-
ken dialogue system for accessing agricultural information in Telugu. In: Proceedings of the 9th inter-
national conference on natural language processing, pp 1–6
Mehta K, Anand RS (2010) Robust front-end and back-end processing for feature extraction for Hindi
speech recognition. In: Proceedings of the IEEE international conference on computational intelli-
gence and computing research (ICCIC), pp 1–4
Mittal T, Sharma RK (2016) Integrated search technique for parameter determination of SVM for speech
recognition. J Cent South Univ 23(6):1390–1398
Mittal T, Barthwal A, Koolagudi SG (2013) Age approximation from speech using Gaussian mixture mod-
els. In: Proceedings of 2nd international conference on advanced computing, networking and security
(ADCONS), pp 74–78
Mohamed FK, Lajish VL (2016) Nonlinear speech analysis and modeling for Malayalam vowel recognition.
Procedia Comput Sci 93:676–682
Mohamed A, Nair KR (2012) HMM/ANN hybrid model for continuous Malayalam speech recognition. Pro-
cedia Eng 30:616–622
Mohan A, Rose R (2013) Cross-lingual context sharing and parameter-tying for multi-lingual speech rec-
ognition. In: Proceedings of the IEEE workshop on automatic speech recognition and understanding
(ASRU), pp 126–131
Mohan A, Umesh S, Rose R (2012) Subspace based for Indian languages. In: 11th International conference
on information science, signal processing and their applications (ISSPA), pp 35–39
Mohan A, Rose R, Ghalehjegh SH, Umesh S (2014) Acoustic modelling for speech recognition in Indian
languages in an agricultural commodities task domain. Speech Commun 56:167–180
Muralikrishna H, Ananthakrishna T (2013) HMM based isolated Kannada digit recognition system using
MFCC. In: Proceedings of the international conference on advances in computing, communications
and informatics (ICACCI), pp 730–733
Musfir M, Krishnan KR, Murthy HA (2014) Analysis of fricatives, stop consonants and nasals in the auto-
matic segmentation of speech using the group delay algorithm. In: Proceedings of twentieth national
conference on communications (NCC), pp 1–6
Nair NU, Sreenivas TV (2010) Joint evaluation of multiple speech patterns for speech recognition and train-
ing. Comput Speech Lang 24(2):307–340
Pal M, Roy R, Khan S, Bepari MS, Basu J (2018) PannoMulloKathan: voice enabled mobile app for agri-
cultural commodity price dissemination in Bengali language. In: Interspeech, pp 1491–1492
Pandey L, Nathwani K (2018) LSTM based attentive fusion of spectral and prosodic information for key-
word spotting in Hindi language. In: Interspeech, pp 112–116
Pandey D, Mondal T, Agrawal SS, Bangalore S (2013) Development and suitability of Indian languages speech database for building Watson based ASR system. In: Proceedings of the international conference on oriental COCOSDA held jointly with Asian spoken language research and evaluation (O-COCOSDA/CASLRE), pp 1–6
Pandey A, Srivastava BML, Gangashetty SV (2017) Adapting monolingual resources for code-mixed Hindi-English speech recognition. In: Proceedings of international conference on Asian language processing (IALP), pp 218–221
Parameswarappa S, Narayana VN (2011) Target word sense disambiguation system for Kannada language.
In: Proceedings of the 3rd international conference on advances in recent technologies in communica-
tion and computing, pp 269–273
Patel TB, Patil HA (2016) Effectiveness of fundamental frequency (F0) and strength of excitation (SoE) for spoofed speech detection. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5105–5109
Patel T, Krishna DN, Fathima N, Shah N, Mahima C, Kumar D, Iyengar A (2018) Development of large vocabulary speech recognition system with keyword search for Manipuri. Proc Interspeech. https://doi.org/10.21437/Interspeech.2018-2133
Patil PP, Pardeshi SA (2014a) Marathi connected word speech recognition system. In: Proceedings of the
first international conference on networks and soft computing (ICNSC), pp 314–318
Patil PP, Pardeshi SA (2014b) Devnagari phoneme recognition system. In: Proceedings of the fourth inter-
national conference on advances in computing and communications (ICACC), pp 5–8
Patil VV, Rao P (2016) Detection of phonemic aspiration for spoken Hindi pronunciation evaluation. J Phon
54:202–221
Paul AK, Das D, Kamal M (2009) Bangla speech recognition system using LPC and ANN. In: Proceedings
of the 7th international conference on advances in pattern recognition, pp 171–174
Plauche M, Nallasamy U, Pal J, Wooters C, Ramachandran D (2006) Speech recognition for illiterate access
to information and technology. In: Proceedings of the international conference on information and
communication technologies and development, pp 83–92
Prasanna SRM, Pradhan G (2011) Significance of vowel-like regions for speaker verification under degraded
conditions. IEEE Trans Audio Speech Lang Process 19(8):2552–2565
Pravin P, Jethva H (2013) Neural network based Gujarati language speech recognition. Int J Comput Sci
Manag Res 2(5):2623–2627
Pulugundla B, Baskar MK, Kesiraju S, Egorova E, Karafiat M, Burget L, Černocký J (2018) BUT system for low resource Indian language ASR. In: Interspeech, pp 3182–3186
Radha V (2012) Speaker independent isolated speech recognition system for Tamil language using HMM.
Procedia Eng 30:1097–1102
Rahul L, Nandakishor S, Singh LJ, Dutta SK (2013) Design of Manipuri keywords spotting system using
HMM. In: Proceedings of the fourth national conference on computer vision, pattern recognition,
image processing and graphics (NCVPRIPG), pp 1–3
Raji SA, Sarin SA, Firoz SA, Babu AP (2010) Key-word based query recognition in a speech corpus by
using artificial neural networks. In: Proceedings of 2nd international conference on computational
intelligence, communication systems and networks, pp 33–36
Rajput N, Subramaniam LV, Verma A (2000) Adapting phonetic decision trees between languages for con-
tinuous speech recognition. In: Proceedings of the sixth international conference on spoken language
processing, pp 1–3
Ram CS, Ponnusamy R (2014) An effective automatic speech emotion recognition for Tamil language using
support vector machine. In: Proceedings of the international conference on issues and challenges in
intelligent computing techniques (ICICT), pp 19–23
Ramamohan S, Dandapat S (2006) Sinusoidal model-based analysis and classification of stressed speech.
IEEE Trans Audio Speech Lang Process 14(3):737–746
Rani NU, Girija PN (2012) Error analysis to improve the speech recognition accuracy on Telugu language.
Sadhana 37(6):747–761
Ranjan S (2010) A discrete wavelet transform based approach to Hindi speech recognition. In: Proceedings
of the international conference on signal acquisition and processing, pp 345–348
Ranjan R, Singh SK, Shukla A, Tiwari R (2010) Text-dependent multilingual speaker identification for Indian languages using artificial neural network. In: Proceedings of 3rd international conference on emerging trends in engineering and technology (ICETET), pp 632–635
Rao KS (2011) Application of prosody models for developing speech systems in Indian languages. Int J
Speech Technol 14:19–33
Renjith S, Manju KG (2017) Speech based emotion recognition in Tamil and Telugu using LPCC and Hurst parameters—a comparative study using KNN and ANN classifiers. In: Proceedings of international conference on circuit, power and computing technologies (ICCPCT), pp 1–6
Rojathai S, Venkatesulu M (2014) Noise robust Tamil speech word recognition system by means of PAC
features with ANFIS. In: Proceedings of the IEEE/ACIS 13th international conference on com-
puter and information science (ICIS), pp 435–440
Saad MK, Ashour W (2010) OSAC: open source Arabic corpora. In: Proceedings of the 6th international
conference on electrical and computer systems, pp 1–6
Sadanandam M, Prasad VK, Janaki V, Nagesh A (2012) Text independent language recognition system
using DHMM with new features. In: Proceedings of the IEEE 11th international conference on
signal processing (ICSP), pp 511–514
Saini P, Kaur P (2013) Automatic speech recognition: a review. Int J Eng Trends Technol 4(2):132–136
Samudravijaya K (2006) Development of multi-lingual spoken corpora of Indian languages. In: Proceed-
ings of the Chinese spoken language processing symposium, pp 792–801
Samudravijaya K (2014) HMMs as generative models of speech. In: Workshop on Text-to-Speech (TTS) synthesis. http://www.iitg.ac.in/samudravijaya/tutorialSlides/hmm4hts_samudravijaya140615.pdf. Accessed 20 Jan 2018
Samudravijaya K, Gogate MR (2006) Marathi speech database. In: Proceedings of the international sym-
posium on speech technology and processing systems and oriental COCOSDA, pp 21–24
Samudravijaya K, Ahuja R, Bondale N, Jose T, Krishnan S, Poddar P, Raveendran R (1998) A feature-
based hierarchical speech recognition system for Hindi. Sadhana 23(4):313–340
Samudravijaya K, Rao PVS, Agrawal SS (2002) Hindi speech database. In: Proceedings of the interna-
tional conference on spoken language processing, pp 456–464
Sangwan A, Mehrabani M, Hansen JH (2010) Automatic language analysis and identification based on
speech production knowledge. In: Proceedings of the IEEE international conference on acoustics
speech and signal processing (ICASSP), pp 5006–5009
Saraswathi S, Geetha T (2007) Morpheme based language model for Tamil speech recognition system.
Int Arab J Inf Technol 4(3):214–219
Sarfraz H, Hussain S, Bokhari R, Raza AA, Ullah I, Sarfraz Z, Parveen R (2010) Speech corpus development for a speaker independent spontaneous Urdu speech recognition system. In: Proceedings of O-COCOSDA, pp 1–6
Sarma BD, Prasanna SRM (2018) Acoustic–phonetic analysis for speech recognition: a review. IETE
Tech Rev 35(3):305–327
Sarma M, Dutta K, Sarma KK (2010) Speech corpus of Assamese numerals extracted using an adaptive
pre-emphasis filter for speech recognition. In: Proceedings of the international conference on com-
puter and communication technology (ICCCT), pp 461–466
Sarma BD, Sarma M, Sarma M, Prasanna SRM (2013) Development of Assamese phonetic engine:
some issues. In: Proceedings of the annual IEEE India conference (INDICON), pp 1–6
Sarma BD, Sarmah P, Lalhminghlui W, Prasanna SM (2015) Detection of Mizo tones. In: Proceedings
of sixteenth annual conference of the international speech communication association, pp 934–937
Sekhar CC, Yegnanarayana B (2002) A constraint satisfaction model for recognition of stop consonant-
vowel (SCV) utterances. IEEE Trans Speech Audio Process 10(7):472–480
Seshadri G, Yegnanarayana B (2011) Performance of an event-based instantaneous fundamen-
tal frequency estimator for distant speech signals. IEEE Trans Audio Speech Lang Process
19(7):1853–1864
Shah F (2009) Automatic emotion recognition from speech using artificial neural networks with gender-
dependent databases. In: Proceedings of the international conference on advances in computing,
control, and telecommunication technologies, pp 162–164
Shruti (2015) Bengali continuous ASR Speech Corpus 2015. http://cse.iitkgp.ac.in/pabitra/shruti-corpus.html. Accessed 20 Jan 2018
Singh B, Singh P (2011) Voice based user machine interface for Punjabi using Hidden Markov model.
Int J Comput Sci Technol 2(3):222–224
Singhvi A, Gupta P, Sanyal S (2008) Hierarchical phoneme classifier for Hindi speech. In: Proceedings
of the 9th international conference on signal processing, pp 571–574
Sinha S, Agrawal SS, Olsen J (2011) Development of Hindi mobile communication text and speech corpus. In: Proceedings of O-COCOSDA, pp 30–35
Sinha S, Agrawal SS, Jain A (2013) Continuous density hidden Markov model for context dependent Hindi speech recognition. In: Proceedings of the international conference on advances in computing, communications and informatics (ICACCI), pp 1953–1958
Sreenu G, Girija PN, Prasad MN, Nagamani M (2004) A human machine speaker dependent speech interac-
tive system. In: Proceedings of the IEEE INDICON, pp 349–351
Sriranjani R, Karthick BM, Umesh S (2014) Experiments on front-end techniques and segmentation model
for robust Indian Language speech recognizer. In: Proceedings of the twentieth national conference on
communications (NCC), pp 1–6
Sukumar AR, Shah AF, Anto PB (2010) Isolated question words recognition from speech queries by using
artificial neural networks. In: Proceedings of international conference on computing communication
and networking technologies, pp 1–4
Sunil KRK, Lajish VL (2012) Vowel phoneme recognition based on average energy information in the zero-
crossing intervals and its distribution using ANN. Int J Inf Sci Tech 2(6):33–42
Swain M, Routray A, Kabisatpathy P (2018) Databases, features and classifiers for speech emotion recogni-
tion: a review. Int J Speech Technol 21(1):93–120
Taheri A, Tarihi MR, Ali HV (2006) Fuzzy hidden Markov models and fuzzy NN models in speaker recognition. In: Proceedings of the 1st IEEE conference on industrial electronics and applications, pp 1–5
Thakur EA, Singla N, Patil VV (2011) Design of Hindi key word recognition system for home automation
system using MFCC and DTW. Int J Adv Eng Sci Technol 11:177–182
Thakur A, Kumar R, Kumar N (2013) Automatic speech recognition system for Hindi utterances with
regional Indian accents: a review. Int J Electron Commun Technol 4(3):1–6
Thalengala A, Shama K (2016) Study of sub-word acoustical models for Kannada isolated word recognition
system. Int J Speech Technol 19(4):817–826
Thangarajan R, Natarajan AM, Selvam M (2008) Word and triphone based approaches in continuous speech
recognition for Tamil language. WSEAS Trans Signal Process 4(3):76–86
Thasleema TM, Narayanan NK (2012) Wavelet transform based consonant–vowel (CV) classification using
support vector machines. In: Proceedings of the international conference on neural information pro-
cessing, pp 250–257
Thasleema TM, Kabeer V, Narayanan NK (2007) Malayalam vowel recognition based on linear predictive
coding parameters and k-NN algorithm. In: Proceedings of international conference on computational
intelligence and multimedia applications (ICCIMA 2007), pp 361–365
Udhyakumar N, Swaminathan R, Ramakrishnan SK (2004) Multilingual speech recognition for information retrieval in Indian context. In: Proceedings from the student research workshop, HLT/NAACL, Boston, MA, pp 1–6
Undha AG, Patil HA, Madhavi MC (2014) Exploiting speech source information for vowel landmark detection for low resource language. In: Proceedings of 9th international symposium on Chinese spoken language processing, pp 546–550
Upadhyay RK, Riyal MK (2010) Garhwali speech database. In: Proceedings of O-COCOSDA, pp 1–3
Vegesna V, Gurugubelli K, Vuppala A (2018) Application of emotion recognition and modification for emo-
tional Telugu speech recognition. Mob Netw Appl 24(1):193–201
Venkateswarlu RLK, Teja RR, Kumari RV (2012) Developing efficient speech recognition system for Tel-
ugu letter recognition. In: Proceedings of international conference on computing, communication and
applications, pp 1–6
Vydana HK, Vikash P, Vamsi T, Kumar KP, Vuppala AK (2015) Detection of emotionally significant regions of speech for emotion recognition. In: Proceedings of the annual IEEE India conference (INDICON), pp 1–6
Yegnanarayana B, Gangashetty SV (2011) Epoch-based analysis of speech signals. Sadhana 36(5):651–697
Yegnanarayana B, Prasanna SRM, Gupta CS (2005) Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system. IEEE Trans Speech Audio Process 13(4):575–582
