Development of Speaker-Independent Automatic Speech Recognition System For Kannada Language

Praveen Kumar P S · Jayanna H S

Received: date / Accepted: date
Abstract This article describes attempts made to establish a Continuous Speech Recognition (CSR) framework to recognize continuous speech in the Kannada language. Dealing with a regional language such as Kannada is a difficult challenge, since it is not resource-rich and no single standard speech database is available for it. In this article, modelling techniques such as monophone, triphone, deep neural network (DNN)-hidden Markov model (HMM) and Gaussian mixture model (GMM)-HMM based models are implemented in the Kaldi toolkit and used for continuous Kannada speech recognition (CKSR). Mel-frequency cepstral coefficients (MFCCs) are used to extract feature vectors from the speech data. Model performance is determined based on the word error rate (WER), and the obtained results are assessed against the well-known TIMIT and Aurora-4 datasets. This study found that Kaldi-based feature extraction recipes for the monophone, triphone, DNN-HMM and GMM-HMM acoustic models yielded WERs of 8.23%, 5.23%, 4.05% and 4.64% respectively. The experimental results suggest that the recognition rate on the Kannada speech data is higher than that obtained on the state-of-the-art databases.

Keywords Speech recognition · DNN · Continuous speech · HMM · Kannada dialect · Kaldi toolkit · monophone · triphone · WER
Praveen Kumar P S
Research scholar, Department of Electronics and Communication Engineering, Siddaganga Institute of Technology, Tumakuru, Karnataka, India
E-mail: pravin227@gmail.com

Jayanna H S
Professor and Head, Department of Information Science and Engineering, Siddaganga Institute of Technology, Tumakuru, Karnataka, India
E-mail: jayannahs@gmail.com
1 Introduction

The term "speech processing" encompasses several different approaches to the problem of recognizing human speech. These range from isolated word recognition to continuous speech recognition (CSR), from speaker-dependent to speaker-independent recognition, and from a limited vocabulary to a broad vocabulary. The easiest scenario is speaker-dependent, isolated word recognition over a small vocabulary; the most challenging is speaker-independent CSR over a large vocabulary. In any case, the problem of speech recognition (SR), as it has evolved over the years, remains computationally demanding: it requires powerful processors and a large amount of memory. As a result, several efforts have been made to speed up recognition using different methods.

Around 50 million people in India speak Kannada as their primary language, a significant share of the world's population. Surveys on speakers recently released by the Census of India show that Kannada continues to be the eighth most spoken language in the country. The literature, however, reveals little quantitative or standard work on Kannada SR. Today, voice is among the best means of interaction between a human being and a computer, and this interaction succeeds only if human beings are comfortable communicating with computers in their native language. Effective research into Kannada SR is therefore all the more essential.

Speech recognition is a technology that has been dreamed of and worked on for decades. However, despite all the advances of electronic science, speech control long remained an unsophisticated affair: what was meant to simplify our lives was traditionally disappointingly clunky and nothing more than a novelty feature. That began to change as big data, deep learning, machine learning and AI made their way into mainstream technology. As with any technology, what we know today must have originated from somewhere, from some time, from someone.
2 Overview of SR Technology

The first attempt at speech technology dates back to around 1000 A.D., with the design of an instrument that could respond to direct questions with "yes" or "no." While this trial did not involve any actual voice processing, the concept behind it remains one of the pillars of SR technology. Several decades ago, Bell Labs developed a system named "Audrey," which recognized the digits 1 to 9 spoken by a single voice. Subsequently, IBM designed a system to recognise and classify 16 spoken words. These accomplishments led to a growing number of technology firms concentrating on voice-related technology; even the Defense Ministry took steps in this direction. Slowly but steadily, engineers have been working to make it possible.
At the same time, there are also concerns about how fast and powerful the system model is. The efficiency of automatic speech recognition (ASR) therefore depends on the consistency of the system model. A lot of SR work is underway, but preserving decoding precision is difficult for researchers. Feature extraction, speaker normalization, acoustic modelling and language modelling are challenging tasks in SR. Modern general-purpose SR systems are based on HMMs. Toolkits such as Sphinx, the Hidden Markov Model Toolkit (HTK) and Julius are available for SR research. Kaldi, written in C++, is among the most widely used and most recent ASR research tools. SR technology developed using Kaldi produces high-quality networks that are fast enough for real-time recognition [32]. SR is a technique for converting speech data into the corresponding text, that is, for interpreting the human voice as machine-readable material. Deep learning is currently a very active topic in SR research, which is why natural language processing (NLP) researchers increasingly try to use it. Deep learning methods are becoming popular because they outperform many of the previously established SR models. This is also the era in which central processing units (CPUs) are being replaced by graphical processing units (GPUs), making it practical to train large models.

This work sets up a CSR system for the Kannada language using phoneme modelling, where each phoneme is represented by a 5-state HMM and each state is represented by a GMM. It also provides a study of monophone, triphone and hybrid modelling approaches for Kannada SR. The open-source Kaldi [33] toolkit is used to train and test the SR framework.

There are two implementations of the DNN in Kaldi. The first is Dan's version [34], which does not support RBM pre-training. The second is Karel's implementation [44], which supports Restricted Boltzmann Machine (RBM) pre-training, stochastic gradient descent training on GPUs and discriminative training [6]. The Kaldi toolkit is built in C++ on the OpenFst library and uses the BLAS and LAPACK linear algebra libraries. For this work, we have opted to incorporate the former DNN because it facilitates concurrent training on multiple CPUs [38]. This work is an attempt to build a continuous Kannada SR system. The development of such a system would help to convert the audiobooks available in Kannada into corresponding transcripts. It could also be very useful for digitizing old palm-leaf manuscripts simply by having someone read them aloud. Such efforts contribute to research on the development of SR systems for the Kannada language.

The organization of this paper is as follows: related work in the fields of ASR, ASR for continuous speech and ASR for Indian languages is discussed in Section 2. The motivation for developing ASR for the Kannada language is addressed in Section 3, followed by descriptions of Kannada phoneme characters and data collection in Sections 4 and 5 respectively. The model architecture is discussed in detail in Section 6. Technical details of the hybrid modelling techniques GMM-HMM, DNN-HMM and SGMM are discussed in Sections 7, 8 and 9 respectively. Unit preparation and validation
using the Kaldi SR toolkit is covered in Section 10. The training and testing of the Kannada ASR system are set out in Section 11. The analysis of the experiments and results is carried out in Section 12, and the conclusions are finally set out in Section 13.

3 Related Work

3.1 Automatic Speech Recognition
According to the authors in [1], ASR strives to build an intelligent device capable of automatically interpreting voice phonemes or phoneme strings from speech input signals. The authors in [35] defined ASR as a technology which enables computers to translate speech data, captured through a microphone or telephone input, into machine-readable text. A team of researchers did good work on isolated and continuous Bangla speech recognition [16]. They used an HMM classifier and sought to recognize both isolated and continuous words, using 100 Bangla words for their experiment. The speaker-dependent and speaker-independent recognition rates were 90% and 70% respectively. According to the authors in [3], ASR is defined as a method of converting a speech signal into a word sequence by means of an algorithm implemented in a machine, so that the system may produce and recognise speech input. Speaker-dependent accuracy was 95%, and speaker-independent accuracy was 91%. The quality of their research was good enough, but the scale was too limited for real-world results. Another study used a back-propagation neural network to recognize Bengali digits [18]. Automatic real-number identification was done by a team of researchers using CMU-SPHINX; the accuracy was 84% on personal computers and 74% on Android smartphones [29].

Research on DNNs for Russian SR is explained in [41]. In this work, a speaker adaptation method was proposed for context-dependent DNN-HMM acoustic models, with features derived from a GMM used as the DNN input features. Relative to speaker-independent context-dependent DNN-HMM systems, a relative reduction of 5% to 36% was observed on separate adaptation sets. Acoustic modelling based on DNNs for Russian speech using the Kaldi toolkit is presented in [36]. The researchers processed a Russian speech database with the main steps of the Kaldi Switchboard recipe. The SR results obtained were compared with those for English speech data; the absolute difference between the WERs for Russian and English was over 15%. The possibility of extracting features directly from a DNN, without translating the output probabilities into features appropriate for the GMM-HMM method, has been explored in [14]. Experiments were performed using a DNN with five hidden layers, whose outputs are used as input features of a GMM-HMM framework for SR. The WER reduction was contrasted with the probabilistic feature system, as was the reduction in model size, since only a part of the network was used.
Kurian [25] notes that ASR is beneficial for several purposes, such as public-service assistance through telephone directory feedback, database query identification applications, office recognition applications, speech-assistive applications, and automatic spoken-language translation into a foreign language. In [46], Shi-Xiong et al. suggested two key principles for ASR: the first is focused on large-scale training of sentence-level log-linear models, and the second on the typical elements, efficient lattice-based training and decoding. They developed an SR framework that is resilient to noise by applying the principle of standardised log-linear models. The suggested technique, the structured log-linear model (SLLM), accomplished a WER of 8.40%. The WERs achieved for HMM, support vector machines (SVM), multi-class SVM (MSVM) and merged SVM and HMM (SVM+HMM) are 9.45%, 7.98%, 8.08% and 8.55% respectively. The database used was AURORA-2, comprising a total of 8330 sentences. The researchers thus concluded that the SLLM approach performs better than the other algorithms in a noisy background.

Context-dependent DNN-HMMs are defined in [38]; their implementation lowered the WER from 26.9% to 19.7%. A variety of recent experiments have shown that DNN-HMM models perform better than conventional GMM-HMM models. A context-dependent model based on DNNs and vocabulary-based SR is provided in [10]. The DNNs have undirected links between the top two layers and directed connections to all of the layers below. A hybrid DNN-HMM architecture was used in this study; it was shown that the DNN-HMM model performs better than GMM-HMMs, and the researchers obtained a sentence error reduction of 4.9%. The Kaldi toolkit was used for DNN-based recognition of Italian children's speech in [9], where the Karel and Dan DNN training setups were discussed. SR results from Dan's implementation were slightly lower than Karel's DNN, but both implementations performed much better than the non-DNN configuration.

In [11], the researchers explored the use of acoustic features obtained using pitch-synchronous analysis alongside standard MFCC features. They investigated the mixing of complementary acoustic features in large-vocabulary CSR, integrating these representations both at the acoustic feature level, using heteroscedastic linear discriminant analysis (HLDA), and at the system level. The attributes derived from the pitch-synchronous analysis are particularly convincing when coupled with vocal tract length. Results show that combining conventional and pitch-synchronous acoustic features using HLDA yields a consistent and notable decrease in WER. Liang Lu and Steve Renals suggested the use of highway deep neural networks (HDNN) [26]. The authors contend that HDNNs are easier to control and more resilient than DNNs, and experimental data suggested that HDNNs are more reliable than standard DNNs: HDNNs reached a WER of 22% compared with 26.6% for the DNN. Studies on enhanced multi-party networking with 79 hours of training data were performed. The
13,984 words formed the corpus, of which 5 hours of data was used for training. The findings reveal that the SR systems produce a phone error rate (PER) of 24.21% and a WER of 4.12% respectively. The suggested modelling approach demonstrated a substantial improvement relative to the monophone acoustic model.

The researchers in [40] presented a speaker-independent CSR system for the Hindi language with a vocabulary of 600 words. The speech database has voice samples from 62 speakers, 40 of whom are male and 22 female. HTK and Sphinx were used to implement this system. An accuracy of 91.45% was obtained with MFCC at the front end and 8 GMM states.

Over the past two decades, much of the SR work has been based on HMMs [12], with GMMs used to describe the states of the HMM. The acoustic feature vectors consisted primarily of MFCCs or perceptual linear predictive (PLP) coefficients [17]. CSR has been an active area of study for quite some time, and numerous studies have been performed on the recognition of different languages such as Kannada, Punjabi, Tamil, Hindi and Telugu [24] [15] [20]. SR-related studies in Hindi using Kaldi are documented in [43] [39] [2].

The extensive literature survey shows that work on CKSR is not substantial. Since the performance of state-of-the-art techniques on CKSR is unknown, we conducted experiments by developing our own database of 2800 speakers, gathered throughout the state of Karnataka under real-world conditions, in order to examine the behaviour of state-of-the-art techniques on continuous Kannada speech. That database is named the continuous Kannada speech database (CKSD). Transcription and validation were performed on all speakers' wave scripts, and the phoneme-level lexicon was built according to the speech data.
4 Motivation for database creation

Speech is one of the most effective means of communication, and it is productive when the common man is able to reap its rewards. Most international organizations concentrate on European languages, particularly English. English is spoken in a wide portion of the globe, where native speakers have developed various accents in the USA, Australia, the UK, New Zealand and Canada, alongside non-native accents. In India, more people speak regional or native languages rather than English in day-to-day life. There are a significant number of regional varieties which are distinct from each other and, as a result, the phonetic characteristics of a regional language shift from region to region [13]. Kannada is one of the regional languages of India, spoken by over fifty million people, and is the principal language of the state of Karnataka (one of the southern Indian states). The main reason for developing the database is that there is no specific database available for Kannada. Collecting and validating large databases is difficult; this can be overcome by obtaining manually transcribed data and then adapting it using small transcribed data [22]. The speech, as well as the text corpus, are also intended
Table 2 List of Kannada gaadegalu/naannudigalu recorded from the people across the state of Karnataka (transliterated)

ati aase gatigeid:u
haal:uurige ul:idavane gaud:a
ban:daddellaa barali goovin:dana dayeyirali
hani hani seiridare hal:l:a tene tene seiridare bal:l:a
handndele uduruvaaga chigurele nagutittu
beiline eddu hola meiyitan:te
haagalakaayige beivinakaayi saakshhi
haavuu saayalilla koolu muriililla
hiriyakkana chaal:i mane man:digella
handa an:dre hendavuu baayi bid:uttade
huuvini:n:da naaru swarga seiritu
Figure 1 Schematic view of the CSR framework for Kannada dialect.

7.1 Feature Extraction
MFCC stands for Mel-frequency cepstral coefficients. The abbreviation comprises four words: Mel, frequency, cepstral and coefficients. The idea of MFCC is to transform the time-domain speech signal into the frequency domain in order to capture all the possible information in the voice signal. The cochlea in our ear has essentially more low-frequency filters and very few high-frequency filters; the Mel filters are used to imitate this. The concept of MFCC is thus to translate a time-domain signal into a frequency-domain signal while mimicking cochlear function using Mel filters. The cepstral coefficients are the inverse FFT of the logarithm of the spectrum. MFCC features were initially proposed for the recognition of monosyllabic words. MFCCs essentially represent the filter (vocal tract) in the source-filter model of speech. The first 13 (lower-dimensional) MFCCs describe the spectral envelope; the higher dimensions, which are discarded, convey the fine spectral detail. The envelope is adequate to represent the differences between many phonemes, so phonemes can be recognised through MFCCs. The standard frequency scale f_Hz is warped by logarithmic compression onto a linearly perceived pitch scale f_mel:

    f_mel = 1127 ln(1 + f_Hz / 700)                                   (1)
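The warping of Eq. (1) can be sketched directly; the constant 1127 implies the natural logarithm. This is an illustrative sketch, not Kaldi's implementation:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Warp a frequency in Hz onto the Mel scale, as in Eq. (1)."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel: float) -> float:
    """Inverse warping, from Mel back to Hz."""
    return 700.0 * (math.exp(f_mel / 1127.0) - 1.0)

# The scale is roughly linear below 1 kHz and logarithmic above;
# 1000 Hz maps to approximately 1000 mel.
print(round(hz_to_mel(1000.0)))   # 1000
```

The Mel filter-bank centre frequencies used in MFCC extraction are obtained by spacing points evenly on the Mel axis and mapping them back with the inverse function.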
7.2 Language Model (LM)
To restrict the word search, an LM is used. It determines which word may follow previously recognized words, and helps to limit the matching process substantially by removing words that are not feasible. The LM is a file used by the SR model to recognize speech; it contains a large list of words together with their likelihoods. LMs are used to limit the number of possible words that must be considered at any point of the search in a decoder. The result is faster decoding and greater precision. The N-gram LMs, which encode word-sequence statistics, are the most common LMs, alongside finite-state language models. To obtain a high degree of accuracy, the search-space constraint must be quite effective, which means that predicting the next word must be successful. Language models constrain the search either absolutely (by listing a small subset of possible extensions) or probabilistically (by calculating a probability for every possible following word). The former typically has a grammar connected to it that is compiled into a graph. An LM typically limits the vocabulary to the words it contains, which makes recognition of out-of-vocabulary words a challenging task. To cope with this, smaller units such as subwords or even phones may be included in the language model; in this case, however, the search-space constraint is normally weaker than with a word-based language model, as is the accuracy of the resulting recognition. Statistical LMs (SLMs), used where defining all possible legal word sequences a priori is not feasible, are ideal for free-form inputs such as dictation or spontaneous speech. Trigram SLMs are possibly the most common in ASR, and strike a strong balance between complexity and accuracy of approximation. A trigram model encodes the likelihood of a word given its immediate two-word history. In practice, trigram models are "backed off" to bigram and unigram models so that the decoder can score every potential word sequence.
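The back-off behaviour described above can be illustrated with a toy bigram model. The corpus and the fixed back-off weight below are made up for illustration; this is not the trigram LM used in this work:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_bigram(prev: str, word: str, backoff_weight: float = 0.4) -> float:
    """P(word | prev) with a crude fixed-weight back-off to the unigram.

    If the bigram (prev, word) was seen in training, use its relative
    frequency; otherwise fall back to the scaled unigram probability.
    """
    if (prev, word) in bigrams:
        return bigrams[(prev, word)] / unigrams[prev]
    return backoff_weight * unigrams[word] / total

# Seen bigram: "cat" follows "the" in 2 of the 3 occurrences of "the".
print(p_bigram("the", "cat"))   # 2/3
# Unseen bigram: "the" never follows "cat", so we back off to the unigram.
print(p_bigram("cat", "the"))
```

Production LMs (e.g. ARPA-format trigram models) use estimated back-off weights per history rather than a single constant, but the search-pruning role in the decoder is the same.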
7.3 Acoustic Model (AM)

To train an ASR system, training an AM and a language model (LM) is essential. Fundamental AM preparation includes:
– Monophone HMM training on a training subset.
– Aligning the data set with the monophone model.
– Triphone HMM training.

The AM is trained from audio files and transcripts. There are two major forms of models: one-phone (monophone) and three-phone (triphone) models.
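The difference between the two model forms can be illustrated by expanding a monophone string into context-dependent triphone units. The phone symbols and the `left-center+right` notation below are illustrative, not the paper's Kannada phone set:

```python
def to_triphones(phones):
    """Expand a monophone sequence into triphone units of the form
    left-center+right, using 'sil' as the utterance-boundary context."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# Monophone model: one HMM per phone, regardless of its neighbours.
monophones = ["n", "a", "m"]
# Triphone model: a separate (state-tied) HMM per phone-in-context.
print(to_triphones(monophones))   # ['sil-n+a', 'n-a+m', 'a-m+sil']
```

Because the number of distinct triphones grows cubically with the phone inventory, triphone systems tie HMM states with a decision tree rather than training each context separately.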
Figure 3 DNN-HMM hybrid model [45]
– fMLLR: feature-space maximum likelihood linear regression;
– Speaker-adapted training (SAT), i.e. maximum-likelihood training in the fMLLR feature space;
– Final DNN-HMM model training.

The DNN-HMM is trained with fMLLR-adapted features; the SAT fMLLR GMM system provides the decision tree and alignments. Furthermore, the statistical modelling of GMMs in an HMM is significantly insufficient where the data lie on or around a non-linear manifold in the data space. The DNN-HMM is a modern hybrid-model paradigm that has been proposed and commonly used in speech recognition in recent years. DNN models are better classifiers than GMMs and, with fewer parameters, can generalise even better over complex distributions. A DNN is a standard multi-layer perceptron with several layers that captures the underlying non-linear relationships in the data, where training is usually initialised with a pre-training algorithm.

DNNs model the distributions of different classes jointly; this is called "distributed" learning or, more properly, "tied" learning. In a GMM system, each senone is modelled separately with its own set of Gaussians, whereas in a DNN the features are classified together and the distribution over senone posteriors is calculated jointly. The alignment for training is calculated for the whole utterance, but the context seen by the classifier is different: DNNs can model much longer context. In a GMM system it is typical to model simply 7-9 frames in a row, and GMM models do not improve if the context is increased, due to the convexity of the distributions they model.
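In hybrid decoding, the DNN's jointly computed senone posteriors are typically converted to scaled likelihoods for the HMM by dividing by the senone priors, p(x|s) ∝ p(s|x)/p(s). A toy numpy sketch with random weights and made-up dimensions, not the trained network from this work:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden, n_senones = 39, 64, 10  # e.g. MFCC + deltas input

# A tiny 2-layer MLP standing in for the DNN acoustic model.
W1 = rng.standard_normal((n_features, n_hidden)) * 0.1
W2 = rng.standard_normal((n_hidden, n_senones)) * 0.1

def senone_posteriors(x):
    """Forward pass: all senone posteriors are computed jointly."""
    h = np.maximum(0.0, x @ W1)          # ReLU hidden layer
    z = h @ W2
    e = np.exp(z - z.max())              # numerically stable softmax
    return e / e.sum()

def scaled_likelihoods(x, priors):
    """Pseudo-likelihoods for HMM decoding: p(x|s) proportional to p(s|x)/p(s)."""
    return senone_posteriors(x) / priors

x = rng.standard_normal(n_features)
priors = np.full(n_senones, 1.0 / n_senones)  # flat priors for the sketch
post = senone_posteriors(x)
print(round(post.sum(), 6))   # 1.0
```

In a real system the priors are estimated from the senone counts in the training alignments, and the scaled likelihoods replace the GMM state likelihoods in the decoder.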
10 SGMM

ASR systems based on the GMM-HMM framework typically require the full training of individual GMMs in each HMM state: dedicated multivariate Gaussian mixtures are used for state-level modelling, and consequently no parameters are shared between states. A more recent modelling methodology used in the SR domain is the subspace Gaussian mixture model (SGMM) [32]. In the SGMM, the states are still described by Gaussian mixtures, but these parameters share a common structure across the states. The SGMM consists of a GMM within each context-dependent state j; instead of defining the state's parameters directly, a vector v_j ∈ R^S is specified in each state. The basic form of the SGMM can be described by the following equations:

    p(x | j) = Σ_{i=1}^{I} w_ji N(x; μ_ji, Σ_i),    μ_ji = M_i v_j

where the projection matrices M_i and covariances Σ_i are shared by all states.
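Assuming the standard SGMM formulation above (shared M_i and Σ_i, one low-dimensional vector v_j per state, and mixture weights derived from v_j via a softmax over shared weight vectors w_i), the per-state computation can be sketched with illustrative dimensions and random parameters, not values from this work:

```python
import numpy as np

rng = np.random.default_rng(1)

D, I, S = 5, 4, 3   # feature dim, shared Gaussians, subspace dim

# Globally shared parameters, common to all states:
M = rng.standard_normal((I, D, S))   # mean-projection matrices M_i
var = np.ones((I, D))                # shared (diagonal) covariances
w = rng.standard_normal((I, S))      # weight-projection vectors w_i

def state_gmm(v_j):
    """Derive a state's GMM from its vector v_j:
    mu_ji = M_i v_j and w_ji = softmax_i(w_i . v_j)."""
    mu = M @ v_j                         # (I, D) component means
    logits = w @ v_j
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights, mu

def log_gauss_diag(x, mu, var):
    """Log density of diagonal-covariance Gaussians, one per row of mu."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var,
                         axis=-1)

def state_loglike(x, v_j):
    """log p(x | j) = log sum_i w_ji N(x; mu_ji, Sigma_i)."""
    weights, mu = state_gmm(v_j)
    return np.log(np.sum(weights * np.exp(log_gauss_diag(x, mu, var))))

v_j = rng.standard_normal(S)   # each state stores only this S-dim vector
x = rng.standard_normal(D)
ll = state_loglike(x, v_j)
```

The parameter saving is the point: each state stores an S-dimensional vector instead of I full Gaussians, while the expensive structure (M_i, Σ_i, w_i) is estimated once and shared.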
Table 4 The WER of the hybrid modelling techniques at different phoneme levels for the continuous Kannada speech (CKSD) and TIMIT databases

                  WER_1         WER_2         WER_3         WER_4         WER_5
Model           CKSD  TIMIT   CKSD  TIMIT   CKSD  TIMIT   CKSD  TIMIT   CKSD  TIMIT
SGMM+MMI_it1    5.23  6.45    5.06  5.89    4.98  6.02    5.21  6.66    4.95  8.98
SGMM+MMI_it2    5.38  6.59    5.64  6.97    6.29  7.21    5.83  6.82    6.02  7.13
SGMM+MMI_it3    5.88  6.94    5.64  7.23    6.02  7.89    5.98  6.68    6.23  7.63
SGMM+MMI_it4    6.02  7.82    5.88  6.97    6.23  7.85    5.81  6.85    6.01  7.69
DNN+HMM         4.56  6.02    4.67  5.92    5.01  6.23    4.05  5.87    5.21  6.90
DNN+SGMM_it1    4.87  6.23    4.65  5.02    4.94  5.45    5.10  5.67    4.86  5.54
DNN+SGMM_it2    4.59  5.24    4.62  5.14    4.85  5.41    5.03  5.67    5.24  5.89
DNN+SGMM_it3    5.31  6.12    4.92  5.61    5.09  5.84    4.86  5.29    4.64  5.77
DNN+SGMM_it4    4.99  5.64    5.06  6.11    4.85  5.90    5.22  5.93    5.59  6.28

Table 5 The WER at different phoneme levels for the continuous Kannada speech (CKSD) and Aurora-4 databases

                  WER_1           WER_2           WER_3           WER_4           WER_5
Model           CKSD  Aurora-4  CKSD  Aurora-4  CKSD  Aurora-4  CKSD  Aurora-4  CKSD  Aurora-4
MONO            8.23  9.25      8.54  9.47      8.36  9.59      8.64  8.86      8.66  8.59
tri1_600_2400   7.52  7.86      7.48  8.45      7.26  8.21      7.81  8.58      7.65  8.68
tri1_600_4800   6.57  6.97      6.61  7.24      6.75  7.47      6.81  7.67      6.69  7.69
tri1_600_9600   6.24  6.54      6.37  7.35      6.48  7.81      6.22  7.29      6.35  7.03
tri2_600_2400   7.45  7.58      7.24  8.65      7.14  8.56      7.35  8.46      7.27  8.45
tri2_600_4800   6.52  6.98      6.27  7.25      6.29  7.64      6.35  7.61      6.33  7.97
tri2_600_9600   5.59  6.55      5.54  6.59      5.38  6.87      5.84  6.73      6.01  6.85
tri3_600_2400   5.79  6.25      5.61  6.58      5.54  6.69      5.58  6.67      5.88  6.95
tri3_600_4800   5.45  6.55      5.62  6.68      5.38  6.62      5.48  6.77      5.41  6.86
tri3_600_9600   5.12  6.62      5.34  6.84      5.27  6.85      5.23  6.23      5.32  6.97
SGMM            4.86  5.26      5.12  5.65      4.84  5.86      4.89  5.15      4.92  5.58

Table 6 The WER of the hybrid modelling techniques for the continuous Kannada speech (CKSD) and Aurora-4 databases

                  WER_1           WER_2           WER_3           WER_4           WER_5
Model           CKSD  Aurora-4  CKSD  Aurora-4  CKSD  Aurora-4  CKSD  Aurora-4  CKSD  Aurora-4
SGMM+MMI_it1    5.23  7.02      5.06  6.89      4.98  6.91      5.21  7.23      4.95  7.21
SGMM+MMI_it2    5.38  6.65      5.64  7.35      6.29  6.86      5.83  6.54      6.02  6.58
SGMM+MMI_it3    5.88  7.28      5.64  7.86      6.02  7.61      5.98  7.54      6.23  6.98
SGMM+MMI_it4    6.02  8.01      5.88  7.61      6.23  8.28      5.81  7.81      6.01  8.03
DNN+HMM         4.56  5.68      4.67  5.27      5.01  5.81      4.05  6.21      5.21  5.97
DNN+SGMM_it1    4.87  5.94      4.65  5.68      4.94  5.94      5.10  6.01      4.86  6.24
DNN+SGMM_it2    4.59  5.54      4.62  5.64      4.85  5.29      5.03  6.31      5.24  6.10
DNN+SGMM_it3    5.31  6.54      4.92  5.58      5.09  5.64      4.86  5.68      4.64  5.93
DNN+SGMM_it4    4.99  6.14      5.06  5.59      4.85  6.21      5.22  5.92      5.59  6.34
The comparisons between CKSD and the TIMIT database and between CKSD and the Aurora-4 database with respect to recognition rate are shown in Figure 4 and Figure 5 respectively. The plots reveal that the efficiency of the triphone modelling technique is greater than that of the monophone modelling technique. The recognition rate obtained on CKSD is also higher than that on the TIMIT and Aurora-4 databases.

In Figure 6 and Figure 7, a comparison of CKSD with the TIMIT database and the Aurora-4 database for the hybrid modelling techniques is shown. These plots show that the combination of DNN and HMM modelling performs better than the other hybrid modelling techniques for CKSD.
Figure 4 Performance comparison of the CKSD and TIMIT databases

Figure 5 Performance comparison of the CKSD and Aurora-4 databases
Also from the above plots, it is observed that the performance on CKSD has an edge over that on the TIMIT and Aurora-4 databases.
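The WER figures compared throughout are the standard word-level edit-distance measure; a minimal sketch of how such a score is computed:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent:
    (substitutions + deletions + insertions) / reference length,
    via dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference gives 25% WER.
print(wer("the cat sat down", "the cat sat up"))   # 25.0
```

Kaldi's scoring scripts compute the same quantity over the decoded transcripts, reporting the best WER across language-model weights.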
44 13 Conclusions
45
46 The CKSR system has been demonstrated in this study. The speech data have
47 been collected, transcribed and checked with the transcription system. Using
48 an SR toolkit, the ASR models were developed. In the lexicon were included
49 all alternate pronouncements for the Kannada speaking sentence. The WERs
Figure 6 The performance comparison of CKSD and TIMIT database for hybrid modelling techniques
Figure 7 The performance comparison of CKSD and Aurora-4 database for hybrid modelling techniques
obtained for the monophone, triphone1, triphone2, triphone3, SGMM+MMI, DNN+HMM and DNN+SGMM models are 8.36%, 6.22%, 5.38%, 5.12%, 4.84%, 4.98%, 4.05% and 4.59%, respectively. The recognition rate of the CKSR system is higher, and its WER lower, than those of the TIMIT and Aurora-4 databases. The SGMM-based and hybrid DNN-based modelling techniques achieved the lowest WERs for continuous Kannada speech data. These lowest-WER models (the SGMM- and DNN-based models) could
be used to build a stable ASR framework. The developed ASR system has been tested under degraded conditions with different speakers. The accuracy of the ASR system can be further improved by applying noise-reduction methods more effectively. The remaining challenge is to further improve system efficiency by appropriately increasing the number of speakers and the number of phonemes.
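For scale, the gain of the best hybrid model over the monophone baseline can be expressed as a relative WER reduction, computed directly from the figures reported above:

```python
# WERs (%) from the conclusions: monophone baseline vs. best hybrid model (DNN+HMM).
monophone_wer = 8.36
dnn_hmm_wer = 4.05

# Relative WER reduction: how much of the baseline's error the hybrid model removes.
relative_reduction = 100.0 * (monophone_wer - dnn_hmm_wer) / monophone_wer
```

This works out to roughly half of the monophone baseline's errors being eliminated by the DNN+HMM model.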