
Development of Speaker-Independent Automatic Speech Recognition System For Kannada Language

Praveen Kumar P S · Jayanna H S

Received: date / Accepted: date
Abstract This article describes the development of a Continuous Speech Recognition (CSR) framework to recognize continuous speech in the Kannada dialect. Building such a system is challenging because Kannada is a low-resource language for which no standard speech database is available. In this article, modelling techniques such as monophone, triphone, deep neural network (DNN)-hidden Markov model (HMM) and Gaussian mixture model (GMM)-HMM based models are implemented in the Kaldi toolkit and used for continuous Kannada speech recognition (CKSR). To extract feature vectors from the speech data, the Mel-frequency cepstral coefficient (MFCC) technique is used. Model performance is measured by the word error rate (WER), and the results are compared against well-known datasets such as TIMIT and Aurora-4. Using Kaldi-based feature-extraction recipes, the monophone, triphone, DNN-HMM and GMM-HMM acoustic models achieved WERs of 8.23%, 5.23%, 4.05% and 4.64% respectively. The experimental results suggest that the recognition rate on the Kannada speech data is higher than that obtained on the state-of-the-art databases.

Keywords Speech recognition · DNN · Continuous speech · HMM · Kannada dialect · Kaldi toolkit · monophone · triphone · WER
Praveen Kumar P S
Research scholar, Department of Electronics and Communication Engineering, Siddaganga Institute of Technology, Tumakuru, Karnataka, India
E-mail: pravin227@gmail.com

Jayanna H S
Professor and Head, Department of Information Science and Engineering, Siddaganga Institute of Technology, Tumakuru, Karnataka, India
E-mail: jayannahs@gmail.com
1 Introduction

The term "speech processing" encompasses several different approaches to the problem of recognizing human speech. It ranges from isolated word recognition to continuous speech recognition (CSR), from speaker-dependent to speaker-independent recognition, and from a limited vocabulary to a broad vocabulary. The easiest scenario is speaker-dependent, isolated word recognition over a small vocabulary, and the most challenging is speaker-independent CSR over a large vocabulary. In any case, the problem of speech recognition (SR), which has evolved over the years, remains computationally demanding; it requires powerful processors and a large amount of memory. As a result, several efforts have been made to speed up recognition using different methods. Around 50 million people in India speak Kannada as their primary language, a significant share of the world's population. Recent surveys on speakers released by the Census of India show that Kannada continues to be the eighth most spoken language in the country. However, the literature reveals little quantitative or standard work on Kannada SR. Today, voice is among the most natural means of interaction between a human being and a computer, and this interaction is effective only if people can communicate with computers in their native language. Effective research into Kannada SR is therefore essential.

Speech recognition technology has been dreamed of and worked on for decades. However, despite the advances of electronic science, speech control long remained an unsophisticated affair: what was meant to simplify our lives was often disappointingly clunky and little more than a novelty feature. That changed as big data, deep learning, machine learning and AI made their way into mainstream technology. As with any technology, what we know today originated somewhere, at some time, with someone.
2 Overview of SR Technology

The first attempt at speech technology dates back to around 1000 A.D., with the design of an instrument that could respond to direct questions with "yes" or "no." Although this device did not involve any actual voice processing, the concept behind it remains one of the pillars of SR technology. Several decades ago, Bell Labs developed a system named "Audrey," which could recognize the digits 1 to 9 spoken by a single voice. Subsequently, IBM designed a system to recognise and classify 16 spoken words. These accomplishments led to a growing number of technology firms concentrating on voice-related technology, and even defence agencies took an interest. Slowly but steadily, engineers have made it increasingly possible for machines to understand and respond to our verbalised commands. SR technology therefore has a long and rich history. With the introduction of emerging innovations such as cloud computing and continuing data collection, speech systems continually expand their ability to 'hear' and understand a larger diversity of phrases, languages and accents. It is easy to see how SR technology functions in smartphones, smart vehicles, smart home appliances, voice assistants and more. Yet the ease of communicating with digital assistants is deceptive: in reality, SR remains extremely difficult even now. This does not mean there is no progress; as of May 2017, Google's machine-learning algorithms had reached a 95% word accuracy score for the English language, which is also the threshold of human accuracy. Completing these SR systems will still take a great deal of time and domain knowledge; there are, after all, thousands of languages, accents and dialects. Google's word accuracy rating improved from 80% to an outstanding 95% between 2013 and 2017. SR is becoming so easy to use and so effective that analysts expected that by 2020, 50% of all web searches would be carried out by voice.

However, today's speech interfaces, such as Google Voice, Microsoft Cortana, Amazon's Alexa and Apple's Siri, would not be where they are today without the early pioneers who led the way. The high degree of rivalry between these technological giants and the growing proliferation of businesses building SR devices suggests that there is still a long way to go. Owing to the integration of emerging technology such as cloud-based processing and current data collection programs, these speech systems have steadily enhanced their ability to 'hear' and comprehend a broader range of words, languages and accents. At this point, the speech researchers' vision of the future does not appear to be as distant as one might imagine.
2.1 Categories of ASR

SR systems can be divided into various categories based on the constraints placed on the nature of the input speech.

– Nature of the utterance: In an isolated word recognition system, users are expected to utter words with a short pause between them. A connected word recognition system identifies terms taken from a limited set of words spoken without requiring a pause between words. Continuous speech recognition systems, on the other hand, understand naturally spoken sentences. Spontaneous SR frameworks can additionally handle disfluencies such as "ah", "am", false starts and the grammatical errors found in conversational speech. A keyword spotting system searches for a predefined set of terms and detects the presence of each of them in the input utterance.
– Number of speakers: A system is said to be speaker-independent if it can understand the voice of any speaker; such a system has learned the characteristics of a large number of speakers. A speaker-dependent system, in contrast, requires a significant amount of speech data from a single user for training, and the voice of another speaker is not well recognised by such a system. Speaker-adaptive systems are initially speaker-independent but can adapt to the voice of a new speaker, provided that a sufficient amount of that speaker's speech is available for adaptation; a traditional dictation system is an example of a speaker-adapted system.
– Spectral bandwidth: The telephone/mobile channel bandwidth is limited to 300-3400 Hz and thus attenuates the frequency components outside this band. Such speech is called narrow-band speech. Natural speech that does not travel through such a medium is considered wide-band speech; it covers a broader range, restricted only by the sampling frequency. As a result, ASR systems trained on wide-band speech are generally more accurate.
2.2 Sources of voice variability

Unlike written text, continuous speech has a streaming nature, so there are no well-established boundaries between phonemes or words. For example, in the absence of additional context, the phrase "six sheep" can easily be mistaken for "sick sheep." In written text, several instances of a letter look almost the same; the spectral and temporal properties of speech sounds, on the other hand, differ considerably depending on a variety of factors. A few significant factors are explained below.

– Physiological: Vowel waveforms can vary due to different pitch frequencies, and the dimensions of the vocal tract (head size) change the resonance frequencies of the oral cavity. In general, the resonance frequencies of adult males are lower than those of females, which in turn are lower than those of children. Thus, even if the pitch of two people is the same, their speech spectra can differ because of different head sizes.
– Environmental conditions: The presence of background noise decreases the signal-to-noise ratio. Background speech from neighbouring speakers gives rise to major confusion between speech sounds. Speech captured by a recording device contains not only the person's voice but also numerous reflections from walls and other reflective surfaces.
– Behavioural: Speaking rate varies considerably between people. Syntax and semantics influence the prosodic structure of the utterance, and dialect and word usage depend on the regional and social context of the speaker. The pronunciation of unfamiliar words can deviate from the norm; for example, the word "Bengaluru" may be mispronounced as "Bengluru." Such variations in expression aggravate the already difficult task of ASR.
There are many applications of speech signal processing, such as voice commands, voice dialling, voice-to-text converters, hands-free applications and voice queries to databases. An SR system has many advantages, but at the same time there are concerns about how fast and robust the system model is; the efficiency of automatic speech recognition (ASR) therefore depends on the quality of the system model. A great deal of SR work is under way, but preserving decoding accuracy remains difficult. Feature extraction, speaker normalisation, acoustic modelling and language modelling are all challenging tasks in SR. Most modern general-purpose SR systems are based on HMMs. Toolkits such as Sphinx, the Hidden Markov Model Toolkit (HTK) and Julius are available for SR research. Kaldi, written in C++, has recently become one of the most widely used ASR research toolkits; SR systems developed with Kaldi yield high-quality networks that are fast enough to recognise speech in real time [32]. SR is a technique for converting speech data into the corresponding text, i.e. for interpreting the human voice as machine-readable material. Deep learning is currently a very active topic in SR research, and many natural language processing (NLP) researchers are adopting it, since deep learning methods outperform many previously established SR models. This is also the era in which central processing units (CPUs) are being replaced by graphical processing units (GPUs), which makes it practical to train large models. This work sets up a CSR system for the Kannada language using phoneme modelling, where each phoneme is represented by a 5-state HMM and each state is represented by a GMM. It also provides a study of monophone, triphone and hybrid modelling approaches for Kannada SR. The open-source Kaldi [33] toolkit is used to train and test the SR framework.

There are two DNN implementations in Kaldi. The first is Dan's implementation [34], which does not use RBM pre-training. The second is Karel's implementation [44], which supports Restricted Boltzmann Machine (RBM) pre-training, stochastic gradient descent training on GPUs and discriminative training [6]. Kaldi is built in C++ on top of the OpenFst library and uses the BLAS and LAPACK linear algebra libraries. For this work, we chose this DNN implementation because it facilitates parallel training across multiple CPUs [38]. This work is an attempt to build a continuous Kannada SR system. Such a system would help convert the audiobooks available in Kannada into corresponding transcripts, and it could also be very useful for digitising old palm-leaf manuscripts simply by having someone read them aloud. These efforts will contribute to research on the development of SR systems for the Kannada language.
The organization of this paper is as follows. Related work on ASR, ASR for continuous speech and ASR for Indian languages is discussed in Section 3. The motivation for developing an ASR system for the Kannada dialect is addressed in Section 4, followed by descriptions of the Kannada phoneme characters and of data collection in Sections 5 and 6 respectively. The model architecture is discussed in detail in Section 7. Technical details of the GMM-HMM, DNN-HMM and SGMM modelling techniques are given in Sections 8, 9 and 10 respectively. Training and testing of the Kannada ASR system using the Kaldi SR toolkit are described in Section 11, the experiments and results are analysed in Section 12, and conclusions are drawn in Section 13.
3 Related Work

3.1 Automatic Speech Recognition
According to the authors in [1], ASR strives to build an intelligent device capable of interpreting phonemes or phoneme strings automatically from input speech signals. The authors in [35] define ASR as a technology that enables computers to translate speech captured through a microphone or telephone into machine-readable text. A team of researchers did good work on isolated and continuous Bangla speech recognition [16]; they used an HMM classifier to recognise both isolated and continuous words, using 100 Bangla words for their experiment, and obtained speaker-dependent and speaker-independent recognition rates of 90% and 70% respectively. According to the authors in [3], ASR is a method of converting a speech signal into a word sequence by means of an algorithm implemented on a machine, so that the system can produce and recognise speech input; their speaker-dependent accuracy was 95% and their speaker-independent accuracy was 91%. The quality of that research was good, but the scale was too limited for real-world results. Another study used a back-propagation neural network to recognise Bengali digits [18]. Automatic real-number identification was done by a team of researchers using CMU Sphinx; the accuracy was 84% on personal computers and 74% on Android smartphones [29].

Research on DNNs for Russian SR is described in [41]. In this work, a speaker adaptation method was proposed for context-dependent DNN-HMM acoustic models; features derived from a GMM are used as the input features of the DNN. Relative reductions of 5% to 36% over speaker-independent context-dependent DNN-HMM systems were observed on separate adaptation sets. Acoustic modelling based on DNNs for Russian speech using the Kaldi toolkit is presented in [36]. The researchers applied the main steps of the Kaldi Switchboard recipe to a Russian speech database and compared the SR results with those for English speech data; the absolute difference between the WERs for Russian and English was over 15%. The possibility of extracting features directly from a DNN, without translating the output probabilities into features appropriate for the GMM-HMM method, was explored in [14]. Experiments were performed with a DNN with five hidden layers, and the outputs of this DNN were used as input features of a GMM-HMM framework for SR. The reduction in WER was contrasted with that of the probabilistic-feature system, along with the reduction in model size, since only a part of the network was used.
The use of ASR for several purposes, such as public-service assistance through telephone directory enquiries, database query applications, office dictation applications, speech-assistive applications and automatic spoken-language translation into a foreign language, is described by Kurian [25]. In [46], Shi-Xiong et al. proposed two key principles for ASR: the first focuses on large-scale training of sentence-level log-linear models, and the second on the typical elements of efficient lattice-based training and decoding. By applying the principle of standardised log-linear models, they developed an SR framework that is resilient to noise. The suggested technique, the structured log-linear model (SLLM), achieved a WER of 8.40%, whereas the WERs achieved for HMM, Support Vector Machines (SVM), multi-class SVM (MSVM) and a merger of SVM and HMM (SVM+HMM) were 9.45%, 7.98%, 8.08% and 8.55% respectively. The database used was AURORA-2, comprising a total of 8330 sentences. The researchers therefore concluded that the SLLM approach performs better than the other algorithms in a noisy background.

Context-dependent DNN-HMMs are defined in [38]; their implementation lowered the WER from 26.9% to 19.7%, and a variety of recent experiments have shown that DNN-HMM models perform better than conventional GMM-HMM models. A context-dependent model based on DNNs for large-vocabulary SR is presented in [10]. These DNNs have undirected connections between the top two layers and directed connections between all the other layers. A hybrid DNN-HMM architecture was used in that study; it was shown that the DNN-HMM model performs better than GMM-HMMs, and the researchers obtained a sentence error reduction of 4.9%. The Kaldi toolkit was used for DNN-based recognition of Italian children's speech in [9], where both the Karel and Dan DNN training procedures were discussed. The SR results from Dan's implementation were slightly lower than those from Karel's DNN, but both implementations performed much better than the non-DNN configuration.

In [11], the researchers explored the use of acoustic features obtained from pitch-synchronous analysis alongside standard MFCC features, and investigated the combination of complementary acoustic features in large-vocabulary CSR. They integrated these representations both at the acoustic feature level, using heteroscedastic linear discriminant analysis (HLDA), and at the system level. The features derived from the pitch-synchronous analysis are particularly effective when combined with vocal tract length information. The results show that combining conventional and pitch-synchronous acoustic features using HLDA yields a consistent and notable decrease in WER. Liang Lu and Steve Renals proposed the use of highway deep neural networks (HDNNs) [26]. The authors argue that HDNNs are easier to control and more resilient than DNNs, and their experiments suggest that HDNNs are more reliable than standard DNNs: HDNNs reached a WER of 22%, compared with 26.6% for a DNN. Studies on augmented multi-party interaction data with 79 hours of training data were performed, and the results show that the performance of the HDNN acoustic model is slightly better than that of a traditional DNN acoustic model.

3.2 Continuous Speech Recognition
Continuous speech differs from isolated words and connected digits in that the words overlap; as a result, word boundaries are ambiguous and it is difficult to determine the starting point and end point of a phrase. CSR has been an influential area of study for some time now. A continuous Russian SR framework designed using DNNs is described in \cite{zulkarneev2013system}, where DNNs were used to estimate the state probabilities and recognition was carried out with the aid of a finite-state transducer (FST); it was shown that the proposed approach improves SR accuracy relative to HMMs. A further DNN study for the Russian SR system has also been presented. Recently, some excellent work has been done in several languages. Research on continuous Hindi SR was performed by a research group [42]; their data set consists of 1000 unique sentences, and the WER obtained was better than in many of the preceding works on Hindi.

Good research is also being done on the recognition of Serbian continuous speech [30], using 90 hours of speech data and 21,000 utterances. The results were satisfactory, with a WER of 2.19% for the GMM-HMM and 1.86% for the DNN (with three hidden layers). The efficiency of a large vocabulary continuous speech recognition (LVCSR) framework depends to a large degree on the accuracy of the phoneme-level recogniser. As a result, different approaches have been tried to improve phoneme recognition: combining feature sets, improving the mathematical models, and improving the speech, acoustic and language models. Work on acoustic modelling for LVCSR is also discussed in [27], where the researchers carried out an analytical study to find out which elements of a DNN-based acoustic model architecture are most relevant to the success of the SR framework. Growing model size and complexity were shown to be effective only up to a certain point. A comparison was also made between standard DNNs, convolutional neural networks and large, locally untied neural networks; the large, locally untied neural networks were shown to do marginally better.

3.3 ASR of Indian Languages

Much of the research work on ASR focuses on English and other European languages, but much effort has recently been made to develop ASR for Indian languages [7] [8] [4] [5]. In [28], the authors presented their work on the construction of an LVCSR system for the Tamil dialect using DNNs. They used 8 hours of Tamil speech collected from 30 speakers with a lexicon size of 13,984 words, of which 5 hours were used for training. The findings reveal that the SR systems produce a phone error rate (PER) of 24.21% and a WER of 4.12% respectively, and the suggested modelling approach demonstrated a substantial improvement relative to the monophone acoustic model.
The researchers in [40] developed a speaker-independent CSR system for the Hindi dialect with a vocabulary of 600 words. The speech database contains voice samples from 62 speakers, of whom 40 are male and 22 are female. HTK and Sphinx were used to implement the system, and an accuracy of 91.45% was obtained with MFCC features at the front end and 8 GMM states.

Over the past two decades, much of the SR work has been based on HMMs [12], using GMMs to describe the HMM states. The acoustic feature vectors consisted primarily of MFCCs or perceptual linear prediction (PLP) coefficients [17]. For quite some time, CSR has been an active area of study, and numerous studies have been performed on the recognition of different languages such as Kannada, Punjabi, Tamil, Hindi and Telugu [24] [15] [20]. SR-related work in Hindi using Kaldi is documented in [43] [39] [2].

The extensive literature survey shows that the existing work on CKSR is not substantial. Since the performance of state-of-the-art techniques on continuous Kannada speech is unknown, we conducted experiments by developing our own database of 2800 speakers, gathered throughout the state of Karnataka under real-world conditions. This database is named the continuous Kannada speech database (CKSD). Transcription and validation were performed on all the speakers' wave files, and the phoneme-level lexicon was built according to the speech data.

4 Motivation for database creation

Speech is one of the most effective means of communication, and it is productive only when the common man is able to reap its rewards. Most international organisations concentrate on European languages, particularly English. English is spoken in a large portion of the globe, where native speakers have developed various accents in the USA, Australia, the UK, New Zealand and Canada, in addition to non-native accents. In India, more people speak regional or native languages than English in day-to-day life. There are a significant number of regional varieties which are distinct from each other and, as a result, the phonetic characteristics of the regional language shift from region to region [13]. Kannada is one of the regional languages of India, spoken by over fifty million people, and is the principal language of the state of Karnataka (one of the southern Indian states). The main reason for developing the database is that no specific database is available for Kannada. Collecting and validating a large database is difficult; this can be overcome by obtaining manually transcribed data and then adapting it using a small amount of transcribed data [22]. The speech and text corpora are also intended to provide material for acoustic-phonetic studies and for the development and assessment of ASR systems.
5 Kannada Phoneme Characteristics

The South Indian language Kannada, or Canarese, is spoken in the State of Karnataka, India. Kannada is the 27th most commonly spoken language in the world. Kannada has evolved as a language since the BC era and, based on this evolution, can be categorised into four stages: Purva Halegannada (before the 10th century), Halegannada (between the 10th and 12th centuries), Nadugannada (between the 12th and 15th centuries), and Hosagannada (from the 15th century onwards). Hosagannada, the modern form of Kannada, uses 49 phonemic characters, categorised into three types:

– Swaragalu / vowels: There are thirteen vowels: ಅ ಆ ಇ ಈ ಉ ಊ ಋ ಎ ಏ ಐ ಒ ಓ ಔ. Based on the time taken to pronounce them, the swaras (vowels) take two forms. Hrasva swaras are independent vowels pronounced in a single matra period (matra kala): ಅ ಇ ಉ ಋ ಎ ಒ. Deerga swaras are independent vowels pronounced in two matras: ಆ ಈ ಊ ಏ ಐ ಓ ಔ.
– Vyanjanagalu / consonants: There are thirty-four consonants; to take a pronounceable form, they depend on vowels. They are classified into two types: the Vargeeya consonants ಕ ಖ ಗ ಘ ಙ, ಚ ಛ ಜ ಝ ಞ, ಟ ಠ ಡ ಢ ಣ, ತ ಥ ದ ಧ ನ, ಪ ಫ ಬ ಭ ಮ and the Avargeeya consonants ಯ ರ ಲ ವ ಶ ಷ ಸ ಹ ಳ.
– Yogavaahakagalu: The anusvara (ಅಂ) and the visarga (ಅಃ) are the two yogavaahakagalu.

The description of the Kannada phonemes and their ITRANS (Indian languages transliteration) forms is given in Table 1; the ITRANS of each phoneme is shown within brackets.

In Kannada, the basic rule is that when a dependent consonant combines with an independent vowel, an akshara (letter) is created:
Consonant (vyanjana) + Vowel (matra) = Letter (akshara). Example: ಕ್ + ಅ = ಕ.
All the consonants (vyanjanas) are combined with the vowels (matras) according to this rule to form the kagunitha of the Kannada alphabet. The sounds of all languages fall into two categories: vowels and consonants. Consonants are produced with some restriction or closure in the vocal tract, which blocks the flow of air from the lungs; they are categorised according to where the airflow is constricted in the vocal tract, also known as the place of articulation. A vowel is a sound pronounced with an open vocal tract such that there is no build-up of air pressure above the glottis at any time.
Table 1 The Kannada character analysis and its related ITRANS

6 Data Collection and Preparation

Several factors can affect the performance of an SR system, notably session variability and intra-speaker and inter-speaker variability. Bharat Sanchar Nigam Limited (BSNL) provided an Interactive Voice Response System (IVRS) call-flow telephone service for data collection. Continuous speech data were obtained from 2800 speakers (1680 male and 1120 female) in the age group of 8 to 80 years. Each speaker pronounced a set of ten phonetically rich and important Kannada sentences. The spoken data were collected in the real world from different areas of the state of Karnataka; the corpus contains 30,846 words and represents 30 municipalities, since spoken Kannada varies from region to region within the state. Throughout the collection, a 60:40 ratio (60% male speakers and 40% female speakers) was maintained.

The method used for transcription is Indic transliteration (IT3 to UTF-8). The continuous Kannada speech data collected from the speakers were transcribed from the word level down to the phoneme level, and tags were used during transcription for non-lexical sounds, also known as silence phones. Table 2 lists a few of the continuous Kannada sentences, known as Kannada gaadegalu/nannudigalu (proverbs). The collected speech data were manually transcribed and validated by supervisors at the word level.

7 Model Architecture

The CSR framework for the Kannada language involves the various units shown in Figure 1.
Table 2 List of Kannada gaadegalu/nannudigalu recorded from people across the state of Karnataka

English transliteration of the gaadegalu                    Kannada version of the gaadegalu
ati aase gatigeid:u                                         ಅತಿ ಆಸೆ ಗತಿ ಕೇಡು
haal:uurige ul:idavane gaud:a                               ಹಾಳೂರಿಗೆ ಉಳಿದವನೆ ಗೌಡ
ban:daddellaa barali goovin:dana dayeyirali                 ಬಂದದ್ದೆಲ್ಲಾ ಬರಲಿ ಗೋವಿಂದನ ದಯೆಯಿರಲಿ
hani hani seiridare hal:l:a tene tene seiridare bal:l:a     ಹನಿ ಹನಿ ಸೇರಿದರೆ ಹಳ್ಳ ತೆನೆ ತೆನೆ ಸೇರಿದರೆ ಬಳ್ಳ
handndele uduruvaaga chigurele nagutittu                    ಹಣ್ಣೆಲೆ ಉದುರುವಾಗ ಚಿಗುರೆಲೆ ನಗುತಿತ್ತು
beiline eddu hola meiyitan:te                               ಬೇಲಿನೆ ಎದ್ದು ಹೊಲ ಮೇಯಿತಂತೆ
haagalakaayige beivinakaayi saakshhi                        ಹಾಗಲಕಾಯಿಗೆ ಬೇವಿನಕಾಯಿ ಸಾಕ್ಷಿ
haavuu saayalilla koolu muriililla                          ಹಾವೂ ಸಾಯಲಿಲ್ಲ ಕೋಲು ಮುರೀಲಿಲ್ಲ
hiriyakkana chaal:i mane man:digella                        ಹಿರಿಯಕ್ಕನ ಚಾಳಿ ಮನೆ ಮಂದಿಗೆಲ್ಲ
handa an:dre hendavuu baayi bid:uttade                      ಹಣ ಅಂದ್ರೆ ಹೆಣವೂ ಬಾಯಿ ಬಿಡುತ್ತದೆ
huuvini:n:da naaru swarga seiritu                           ಹೂವಿನಿಂದ ನಾರು ಸ್ವರ್ಗ ಸೇರಿತು
Figure 1 Schematic view of the CSR framework for the Kannada dialect.

7.1 Feature Extraction

MFCC stands for Mel-frequency cepstral coefficients; the abbreviation comprises four terms: Mel, frequency, cepstral and coefficients. The idea of MFCC is to transform the time-domain speech signal into the frequency domain so as to capture as much information as possible from the voice signal. The cochlea in the human ear has many more low-frequency filters than high-frequency filters, and the Mel filters are used to imitate this. MFCC therefore converts the time-domain signal into a frequency-domain representation by mimicking the behaviour of the cochlea with a Mel filter bank. The cepstral coefficients are the inverse FFT of the logarithm of the spectrum. MFCC features were initially proposed for the recognition of monosyllabic words. The MFCCs essentially represent the filter (vocal tract) in the source-filter model of speech: the first 13 coefficients (the lower dimensions) describe the spectral envelope, while the higher dimensions, which are discarded, convey the fine spectral detail. The envelope is sufficient to distinguish many phonemes, so phonemes can be recognised through MFCCs. The standard frequency scale f_Hz is warped by logarithmic compression onto the perceptually linear Mel scale f_mel:

f_mel = 1127 log(1 + f_Hz / 700)    (1)
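To make the warping in Eq. (1) concrete, the short Python sketch below places Mel filter-bank edges for 8 kHz telephone speech. The 26-filter bank size and the 4 kHz upper limit are illustrative assumptions, not values taken from this work.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Warp a frequency in Hz onto the Mel scale, as in Eq. (1)."""
    return 1127.0 * np.log(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse warping, useful for placing Mel filter-bank edges."""
    return 700.0 * (np.exp(f_mel / 1127.0) - 1.0)

# Edges of an assumed 26-filter Mel bank for 8 kHz telephone speech:
# uniformly spaced on the Mel scale, denser at low frequencies in Hz.
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(4000.0), 26 + 2)
print(np.round(mel_to_hz(edges_mel), 1))
```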
7.2 Language Model (LM)

A language model (LM) is used to restrict the word search. It determines which words may follow previously recognised words and substantially limits the matching process by eliminating words that are not plausible. The LM is a file used by the SR system during recognition; it contains a large list of words and their probabilities. LMs limit the number of possible words that must be considered at any point of the search in the decoder, resulting in faster decoding and greater accuracy. N-gram LMs, which contain statistics of word sequences, and finite-state language models are the most common LMs. To obtain a high degree of accuracy, the search-space constraint must be quite effective, which means that the prediction of the next word must be accurate. Language models constrain the search either absolutely (by listing a small subset of possible extensions) or probabilistically (by calculating a probability for every possible successor word). The former typically has a grammar attached to it that is compiled into a graph. An LM usually limits the vocabulary to the words it contains, which makes recognition of words outside it a challenging task; to cope with this, smaller chunks such as subwords or even phones may be included in the language model, although in that case the search-space constraint, and hence the accuracy of the resulting recognition, is normally worse than for a word-based language model. Statistical LMs (SLMs), used when it is not feasible to define all possible legal word sequences a priori, are ideal for free-form inputs such as dictation or spontaneous speech. Trigram SLMs are probably the most common in ASR and offer a good compromise between complexity and accuracy of approximation. A trigram model encodes the probability of a word given its immediate two-word history. In practice, trigram models are "backed off" to bigram and unigram models so that the decoder can assign a probability to every potential word sequence.
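To make the back-off idea concrete, the sketch below estimates a trigram probability and falls back to bigram and unigram estimates when the trigram has not been seen. It is a toy illustration with made-up counts and a simple stupid-backoff penalty, not the IRSTLM recipe used in this work.

```python
from collections import Counter

# Toy corpus of whitespace-tokenised sentences (hypothetical transliterated words).
corpus = [
    "naanu shaalege hogutteene".split(),
    "naanu maneyalli iddene".split(),
    "avanu shaalege hoguttaane".split(),
]

uni, bi, tri = Counter(), Counter(), Counter()
for sent in corpus:
    toks = ["<s>", "<s>"] + sent + ["</s>"]
    for i in range(2, len(toks)):
        uni[toks[i]] += 1
        bi[(toks[i - 1], toks[i])] += 1
        tri[(toks[i - 2], toks[i - 1], toks[i])] += 1

def p_next(w2, w1, w, alpha=0.4):
    """Back off from trigram to bigram to unigram (stupid-backoff style)."""
    if tri[(w2, w1, w)] > 0 and bi[(w2, w1)] > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi[(w1, w)] > 0 and uni[w1] > 0:
        return alpha * bi[(w1, w)] / uni[w1]
    return alpha * alpha * uni[w] / sum(uni.values())

print(p_next("<s>", "naanu", "shaalege"))      # trigram seen: 0.5
print(p_next("avanu", "maneyalli", "iddene"))  # falls back to the bigram
```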
7.3 Acoustic Model (AM)

To train an ASR system, both an AM and a language model (LM) must be trained. Basic AM preparation includes:

– monophone HMM training on a subset of the training data;
– alignment of the data set with the monophone model;
– triphone HMM training.

The AM is trained from audio files and their transcripts. There are two major forms of model: one-phone (monophone) and three-phone (triphone) models. The AM is used in SR to describe the relationship between phonemes and the audio signal. The monophone model tries to match the sound of a single phone in isolation, whereas the triphone acoustic model relies more heavily on the phonetic context. For both models, features must be extracted from the speech signal. An AM is a file containing statistical representations of each of the sounds that make up a word; each of these statistical representations is assigned a label called a phoneme. There are roughly 40 distinct sounds in English that are useful for recognising speech, so there are 40 separate phonemes. Recognition or classification means that the maximum likelihood (ML) criterion can be used to assign a new sequence of observations to the most similar model. HMMs suffer from limitations, however: continuous HMMs trained with the Baum-Welch or Viterbi algorithms have limited discriminative power, since their parameters are estimated with the ML criterion. The operation of the AM in recognising a word is as follows (a small lexicon-lookup illustration is given after this list):

– The decoder listens to the sounds articulated by a person and scans for a matching HMM in the AM.
– When it detects a matching HMM, the decoder records the corresponding phoneme.
– The decoder keeps recording the corresponding phonemes until the user pauses.
– When the pause is reached, the decoder looks up the recorded sequence of phonemes in its pronunciation dictionary to decide which word was spoken.
– The decoder then searches for the corresponding word or sentence in the grammar file.
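As a minimal sketch of the dictionary-lookup step above, the following Python snippet maps recognised phoneme sequences to words. The lexicon entries are hypothetical ITRANS-style examples, not entries from the actual CKSD lexicon.

```python
# Toy pronunciation lexicon: phoneme sequences -> words (hypothetical entries).
lexicon = {
    ("a", "t", "i"): "ati",
    ("aa", "s", "e"): "aase",
    ("g", "a", "t", "i"): "gati",
}

def lookup(phones):
    """Greedy left-to-right lookup of recognised phonemes in the lexicon,
    mirroring the 'decide the word spoken when the pause is hit' step."""
    words, i = [], 0
    while i < len(phones):
        for j in range(len(phones), i, -1):        # longest match first
            if tuple(phones[i:j]) in lexicon:
                words.append(lexicon[tuple(phones[i:j])])
                i = j
                break
        else:
            i += 1                                  # skip an unmatchable phone
    return words

print(lookup(["a", "t", "i", "aa", "s", "e"]))      # ['ati', 'aase']
```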
7.4 Monophone model generation

The HMM is used here to model phonemes. Each phoneme is represented by a five-state HMM, and each state is represented by a GMM. The HMM models the sequence of feature vectors using a series of continuous-density states. MFCC features are used for monophone model generation. The audio signal is sampled at 8 kHz, so one 20 ms window contains 8000 × 0.020 = 160 samples, which are reduced to 13 static cepstral coefficients. The features are extracted by applying a 20 ms window shifted by 10 ms.
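A minimal framing sketch in plain NumPy, assuming the 8 kHz / 20 ms / 10 ms figures given above, makes the arithmetic explicit (the random signal simply stands in for real speech):

```python
import numpy as np

sr = 8000                      # sampling rate in Hz
win = int(0.020 * sr)          # 20 ms window -> 160 samples
hop = int(0.010 * sr)          # 10 ms shift  -> 80 samples

signal = np.random.randn(sr * 2)   # 2 s of dummy audio in place of real speech

# Slice the signal into overlapping frames; each frame later yields 13 MFCCs.
starts = np.arange(0, len(signal) - win + 1, hop)
frames = np.stack([signal[s:s + win] for s in starts])
print(frames.shape)   # (199, 160): about 100 frames per second, 160 samples each
```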
7.5 Creating triphone models

To optimise recognition, some transformations of the MFCCs are used in addition to the static features derived from each speech frame. These transformations are used to create the triphone models and include delta feature computation, linear discriminant analysis (LDA) and MLLT. A short sketch of the delta computation follows this list.

– The LDA transform is a linear transformation that reduces the dimensionality of the input features. Its purpose is to find a linear mapping of the feature vectors from an n-dimensional space to an m-dimensional space (m < n), which makes the system faster.
– Delta feature computation: the static MFCC features describe individual frames without taking into account the interaction between them, yet speech is continuous, so capturing the dynamic change across frames improves recognition. The delta features are the time derivatives of the sequence of static features. For instance, starting from 13 MFCC coefficients, the delta + delta-delta transform adds 13 delta and 13 delta-delta coefficients, which combine to give a 39-dimensional feature vector (13 + 13 + 13).
– MLLT estimation: MLLT estimates the parameters of a linear transformation that maximises the likelihood of the training data given diagonal-covariance GMMs; the transformed features are modelled better than the original features.
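A minimal NumPy sketch of the delta computation referred to above, using the standard regression formula; the two-frame window width (N = 2) is an assumption made here for illustration.

```python
import numpy as np

def deltas(feats, N=2):
    """Delta features d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2),
    with edge padding so the output has the same number of frames."""
    T, D = feats.shape
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    out = np.zeros_like(feats)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom

mfcc = np.random.randn(100, 13)   # 100 frames of 13 static MFCCs (dummy data)
d1 = deltas(mfcc)                 # 13 delta coefficients
d2 = deltas(d1)                   # 13 delta-delta coefficients
full = np.hstack([mfcc, d1, d2])
print(full.shape)                 # (100, 39)
```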
8 GMM-HMM Modelling

The hidden Markov model is one of the most effective and simplest classification models and has many different applications [37]. In the context of SR, speech data are taken and features are extracted from them. The GMM-HMM process is shown in Figure 2.

Figure 2 Working of HMM

To recognise speech, a classifier is needed that can identify which phone (if any) was uttered in every frame. A simple GMM can be used for this: for each phone class, a GMM is fitted using all the frames in which that phone occurs, and a new utterance is classified frame by frame by finding the most probable phone (i.e. the GMM that gives the highest likelihood of generating the 39-dimensional feature vector). However, this kind of model does not exploit the temporal dependencies in the acoustic signal: the classification depends only on the current frame, ignoring the context of the previous and next frames. Moreover, such a model assumes that there is no acoustic difference between the beginning, middle and end of a phone (just think of how the voice goes up and then down when 'ah' is uttered). The GMM-HMM model is a response to these problems. The HMM is a temporal model which assumes that the source (the observation generator) has some state that we do not observe directly (e.g. the position of the larynx, the shape of the oral cavity, tongue placement). The most common HMM architectures for SR use phone models consisting of three states, which can be interpreted as assuming that a phone, when uttered, has three distinct phases, a beginning, a middle and an ending, and that in each phase it sounds a bit different. Each state is modelled by a GMM that determines the likelihood of the observation in that state, and the observations are the frames. Wrapping it up: each frame in the sequence is classified as a particular state belonging to a particular phone; many frames may be generated by one state, several states sum up to a single phone and, going further, several phones build up a single word.
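The snippet below is a toy per-frame GMM phone classifier of the kind described above, written with scikit-learn; the three phone classes and the synthetic 39-dimensional frames are stand-ins for real aligned MFCC+delta features, and the HMM temporal structure used by Kaldi is deliberately left out.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic 39-dim frames for three hypothetical phone classes.
train = {ph: rng.normal(loc=i, scale=1.0, size=(500, 39))
         for i, ph in enumerate(["a", "i", "u"])}

# Fit one diagonal-covariance GMM per phone class (frame-level classifier only).
gmms = {ph: GaussianMixture(n_components=4, covariance_type="diag",
                            random_state=0).fit(X)
        for ph, X in train.items()}

test_frames = rng.normal(loc=1, scale=1.0, size=(10, 39))   # resembles class "i"
loglik = np.stack([gmms[ph].score_samples(test_frames) for ph in gmms])
best = [list(gmms)[k] for k in loglik.argmax(axis=0)]
print(best)   # most frames are classified as "i"
```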
9 DNN-HMM Modelling

Figure 3 shows the general structure of the hybrid DNN-HMM framework. Given the acoustic observations, the DNN is trained to predict the posterior probability of each context-dependent state. During decoding, the output probabilities are divided by the prior probability of each state, forming a "pseudo-likelihood" that is used in place of the state emission probabilities of the HMM [19]. The initial phase in training a DNN-HMM model is to train a GMM-HMM model on the data allocated for training. The standard Kaldi recipe for a DNN-based acoustic model comprises the following steps [21]:

– feature extraction (the features are 13 MFCCs + 13 deltas + 13 delta-deltas);
– monophone model training;
– triphone model training with delta and delta-delta features, LDA and the maximum likelihood linear transform;
– fMLLR feature-space adaptation (feature-space maximum likelihood linear regression);
– speaker-adapted training (SAT), i.e. training in the fMLLR-transformed feature space;
– final DNN-HMM model training.

Figure 3 DNN-HMM hybrid system model [45]

The DNN-HMM is trained with fMLLR-adapted features; the SAT fMLLR GMM system provides the decision tree and the alignments. Furthermore, the statistical modelling power of the GMMs in an HMM is significantly insufficient when the data lie on or around a non-linear manifold in the data space. DNN-HMM is a modern hybrid modelling paradigm that has been proposed and widely used in speech recognition in recent years. DNN models are better classifiers than GMMs and, with fewer parameters, can generalise better over complex distributions. A DNN is a standard multi-layer perceptron with several layers that captures the underlying non-linear relationships in the data, and its training is usually initialised with a pre-training algorithm.

DNNs model the distributions of the different classes jointly; this is called "distributed" or, more properly, "tied" learning. In a GMM system each senone is modelled separately with its own set of Gaussians, whereas in a DNN the features are classified together and the distribution over senone posteriors is computed jointly. The alignment for training is calculated for the whole utterance, but the context seen by the classifier is different: DNNs can model a much longer context. In a GMM system it is typical to model only 7-9 frames in a row, and GMM models do not improve when the context is increased, because of the convexity of the distributions they model.
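A minimal NumPy sketch of the hybrid scoring step described above: the DNN itself is replaced by random posteriors, and only the division by the state priors that yields the pseudo-likelihoods is illustrated; the frame and senone counts are assumed values, not figures from this work.

```python
import numpy as np

rng = np.random.default_rng(1)
T, S = 200, 1500          # frames in an utterance, number of senones (assumed sizes)

# Stand-in for DNN outputs: per-frame senone posteriors p(s | o_t), rows sum to 1.
logits = rng.normal(size=(T, S))
post = np.exp(logits - logits.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)

# State priors p(s), normally estimated from the frame counts of the
# forced alignments produced by the GMM-HMM system.
prior = rng.dirichlet(np.ones(S))

# Pseudo-likelihoods p(o_t | s) proportional to p(s | o_t) / p(s), used in place
# of the GMM emission likelihoods during HMM/FST decoding (in the log domain).
log_pseudo_lik = np.log(post + 1e-10) - np.log(prior + 1e-10)
print(log_pseudo_lik.shape)   # (200, 1500)
```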
10 SGMM

ASR systems based on the GMM-HMM framework typically require full training of individual GMMs in each HMM state; consequently, no parameters are shared between states. A more recent modelling methodology, the subspace Gaussian mixture model (SGMM), is used in the SR domain [32]. In the SGMM the states are still described by Gaussian mixtures, but these parameters share a common structure across the states. Conventional GMM-HMM acoustic modelling uses dedicated multivariate Gaussian mixtures for state-level modelling, whereas the SGMM keeps a GMM within each context-dependent state but, instead of defining the parameters directly, specifies a vector I_i ∈ R^r for each state. The basic form of the SGMM can be described by the following equations:

p(y | i) = Σ_{k=1}^{L} w_{ik} N(y; μ_{ik}, Σ_k)    (2)

μ_{ik} = M_k I_i    (3)

w_{ik} = exp(w_k^T I_i) / Σ_{k'=1}^{L} exp(w_{k'}^T I_i)    (4)

where y ∈ R^D is a feature vector and i ∈ {1, 2, ..., I} is a context-dependent state of the speech signal. The model of speech state i is a GMM with L Gaussians (L is between 200 and 2000), whose covariance matrices Σ_k, together with the parameters generating w_{ik} and μ_{ik}, are shared between states. The μ_{ik} and w_{ik} parameters are derived from I_i together with M_k, w_k and Σ_k. The detailed definition and effect of the SGMM parameterisation are given in [31].
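The following NumPy sketch simply evaluates Eqs. (2)-(4) for a toy configuration; all dimensions and parameter values are made up for illustration, since a real SGMM is estimated by Kaldi rather than constructed this way.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
D, L, r = 39, 8, 5          # feature dim, shared Gaussians, state-vector dim (toy sizes)

M = rng.normal(size=(L, D, r))          # shared mean-projection matrices M_k
w = rng.normal(size=(L, r))             # shared weight-projection vectors w_k
Sigma = [np.eye(D) for _ in range(L)]   # shared covariances Sigma_k
I_i = rng.normal(size=r)                # state-specific vector I_i

mu = M @ I_i                            # Eq. (3): mu_ik = M_k I_i, shape (L, D)
logits = w @ I_i
w_i = np.exp(logits - logits.max())
w_i /= w_i.sum()                        # Eq. (4): softmax over the shared Gaussians

y = rng.normal(size=D)                  # one observation vector
p = sum(w_i[k] * multivariate_normal.pdf(y, mean=mu[k], cov=Sigma[k])
        for k in range(L))              # Eq. (2): p(y | i)
print(p)
```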
11 Training and Testing

Speech files with high phonetic parity and coverage are required for training. The training data, together with the acoustic information, are used to create the LM and the AM. During recognition, the algorithm searches efficiently for the best sequence in a space defined by the observation sequence, the LM and the AM; this search process is referred to as decoding. The classical Viterbi algorithm can successfully solve the HMM decoding problem. In the testing stage, the spoken word sequence is mapped onto the words present in the lexicon. For training and testing of the Kannada ASR system [23] we used the CKSD, collected in real time through a Bharat Sanchar Nigam Limited (BSNL) telephone connection. At present 32 hours of data are available, of which 80% are used for training and 20% for testing.
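A small sketch of an 80/20 split over hypothetical utterance IDs is shown below. Whether the split in this work is done by speaker or by utterance is not stated; splitting by speaker is shown here because it matches the speaker-independent evaluation setting.

```python
import random

# Hypothetical utterance IDs of the form "<speaker>_<sentence>", e.g. "spk0042_s07".
utts = [f"spk{s:04d}_s{n:02d}" for s in range(2800) for n in range(10)]

# Split by speaker (not by utterance) so the 20% test set contains unseen speakers.
speakers = sorted({u.split("_")[0] for u in utts})
random.seed(0)
random.shuffle(speakers)
cut = int(0.8 * len(speakers))
train_spk, test_spk = set(speakers[:cut]), set(speakers[cut:])

train = [u for u in utts if u.split("_")[0] in train_spk]
test = [u for u in utts if u.split("_")[0] in test_spk]
print(len(train), len(test))   # 22400 5600 utterances (80% / 20%)
```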
12 Experiment and Result Analysis

The results reported here are obtained from a corpus of 10 spoken Kannada sentences uttered by 2800 speakers. All trials were carried out on an Ubuntu 18.04 LTS platform with an Intel Core i7 processor at a 3.70 GHz clock speed (64-bit). MFCC features and their derivatives are used for building the models. Kaldi uses an FST-based architecture, and the language model was built with the IRSTLM toolkit.
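For reference, WER is the word-level Levenshtein distance between the reference and hypothesis transcripts divided by the reference length. A minimal implementation is sketched below; it is not the Kaldi scoring script, and the example words are hypothetical transliterations.

```python
def wer(ref, hyp):
    """Word error rate = (substitutions + deletions + insertions) / len(ref),
    computed by dynamic programming over word-level edit distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("ati aase gati kedu", "ati aase gati"))   # 25.0 (one deletion)
```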
Table 3 The WER (%) at the different phoneme levels for the continuous Kannada speech database and the TIMIT database

Model            WER_1          WER_2          WER_3          WER_4          WER_5
                 CKSD  TIMIT    CKSD  TIMIT    CKSD  TIMIT    CKSD  TIMIT    CKSD  TIMIT
MONO             8.23  8.96     8.54  9.03     8.36  8.81     8.64  8.79     8.66  8.98
tri1_600_2400    7.52  7.86     7.48  7.86     7.26  7.59     7.81  8.06     7.65  8.15
tri1_600_4800    6.57  6.86     6.61  6.82     6.75  7.03     6.81  7.14     6.69  7.28
tri1_600_9600    6.24  6.54     6.37  6.85     6.48  6.94     6.22  6.76     6.35  6.86
tri2_600_2400    7.45  7.58     7.24  7.52     7.14  7.49     7.35  7.84     7.27  7.68
tri2_600_4800    6.52  6.98     6.27  6.94     6.29  6.84     6.35  6.85     6.33  6.73
tri2_600_9600    5.59  6.12     5.54  6.04     5.38  5.96     5.84  6.16     6.01  6.53
tri3_600_2400    5.79  6.25     5.61  6.09     5.54  6.02     5.58  6.12     5.88  6.24
tri3_600_4800    5.45  5.95     5.62  6.01     5.38  5.93     5.48  5.97     5.41  5.86
tri3_600_9600    5.12  5.62     5.34  5.96     5.27  5.81     5.23  5.88     5.32  5.59
SGMM             4.86  4.97     5.12  5.59     4.84  5.81     4.89  5.29     4.92  5.31
Table 3 displays the WERs obtained at the different phoneme levels. The monophone model has a WER of 8.23%, while the WERs for triphone1, triphone2, triphone3 and SGMM are 6.22%, 5.38%, 5.12% and 4.84% respectively. The corresponding WERs for the standard TIMIT database are also shown in the table. It can be observed that the SGMM and triphone modelling methods offer distinctly better accuracy than the monophone modelling method, and that the CKSD achieves a higher recognition rate than the TIMIT database.

Table 4 displays the WERs of the hybrid modelling techniques at the different phoneme levels. The DNN+HMM combination has a WER of 4.05%, compared with 4.65% for DNN+SGMM, while the SGMM+MMI combination gives a WER of 5.23%. The table shows that the DNN+HMM combination provides the best accuracy among the hybrid modelling techniques.

Similarly, the WERs at different phoneme levels for the CKSD and the Aurora-4 database are given in Table 5, and the WERs for the hybrid modelling techniques in Table 6. Both tables show that the WER for the CKSD is lower than that for the Aurora-4 database.
Table 4 The WER (%) for hybrid modelling techniques at different phoneme levels for the continuous Kannada speech database and the TIMIT database

Model            WER_1          WER_2          WER_3          WER_4          WER_5
                 CKSD  TIMIT    CKSD  TIMIT    CKSD  TIMIT    CKSD  TIMIT    CKSD  TIMIT
SGMM+MMI_it1     5.23  6.45     5.06  5.89     4.98  6.02     5.21  6.66     4.95  8.98
SGMM+MMI_it2     5.38  6.59     5.64  6.97     6.29  7.21     5.83  6.82     6.02  7.13
SGMM+MMI_it3     5.88  6.94     5.64  7.23     6.02  7.89     5.98  6.68     6.23  7.63
SGMM+MMI_it4     6.02  7.82     5.88  6.97     6.23  7.85     5.81  6.85     6.01  7.69
DNN+HMM          4.56  6.02     4.67  5.92     5.01  6.23     4.05  5.87     5.21  6.90
DNN+SGMM_it1     4.87  6.23     4.65  5.02     4.94  5.45     5.10  5.67     4.86  5.54
DNN+SGMM_it2     4.59  5.24     4.62  5.14     4.85  5.41     5.03  5.67     5.24  5.89
DNN+SGMM_it3     5.31  6.12     4.92  5.61     5.09  5.84     4.86  5.29     4.64  5.77
DNN+SGMM_it4     4.99  5.64     5.06  6.11     4.85  5.90     5.22  5.93     5.59  6.28
Table 5 The WER (%) at the different phoneme levels for the continuous Kannada speech and Aurora-4 databases

Model            WER_1             WER_2             WER_3             WER_4             WER_5
                 CKSD  Aurora-4    CKSD  Aurora-4    CKSD  Aurora-4    CKSD  Aurora-4    CKSD  Aurora-4
MONO             8.23  9.25        8.54  9.47        8.36  9.59        8.64  8.86        8.66  8.59
tri1_600_2400    7.52  7.86        7.48  8.45        7.26  8.21        7.81  8.58        7.65  8.68
tri1_600_4800    6.57  6.97        6.61  7.24        6.75  7.47        6.81  7.67        6.69  7.69
tri1_600_9600    6.24  6.54        6.37  7.35        6.48  7.81        6.22  7.29        6.35  7.03
tri2_600_2400    7.45  7.58        7.24  8.65        7.14  8.56        7.35  8.46        7.27  8.45
tri2_600_4800    6.52  6.98        6.27  7.25        6.29  7.64        6.35  7.61        6.33  7.97
tri2_600_9600    5.59  6.55        5.54  6.59        5.38  6.87        5.84  6.73        6.01  6.85
tri3_600_2400    5.79  6.25        5.61  6.58        5.54  6.69        5.58  6.67        5.88  6.95
tri3_600_4800    5.45  6.55        5.62  6.68        5.38  6.62        5.48  6.77        5.41  6.86
tri3_600_9600    5.12  6.62        5.34  6.84        5.27  6.85        5.23  6.23        5.32  6.97
SGMM             4.86  5.26        5.12  5.65        4.84  5.86        4.89  5.15        4.92  5.58
Table 6 The WER representation for hybrid modelling techniques for the continuous Kannada speech database and Aurora-4 database

Phonemes        WER_1            WER_2            WER_3            WER_4            WER_5
                CKSD  Aurora-4   CKSD  Aurora-4   CKSD  Aurora-4   CKSD  Aurora-4   CKSD  Aurora-4
SGMM+MMI_it1    5.23  7.02       5.06  6.89       4.98  6.91       5.21  7.23       4.95  7.21
SGMM+MMI_it2    5.38  6.65       5.64  7.35       6.29  6.86       5.83  6.54       6.02  6.58
SGMM+MMI_it3    5.88  7.28       5.64  7.86       6.02  7.61       5.98  7.54       6.23  6.98
SGMM+MMI_it4    6.02  8.01       5.88  7.61       6.23  8.28       5.81  7.81       6.01  8.03
DNN+HMM         4.56  5.68       4.67  5.27       5.01  5.81       4.05  6.21       5.21  5.97
DNN+SGMM_it1    4.87  5.94       4.65  5.68       4.94  5.94       5.10  6.01       4.86  6.24
DNN+SGMM_it2    4.59  5.54       4.62  5.64       4.85  5.29       5.03  6.31       5.24  6.10
DNN+SGMM_it3    5.31  6.54       4.92  5.58       5.09  5.64       4.86  5.68       4.64  5.93
DNN+SGMM_it4    4.99  6.14       5.06  5.59       4.85  6.21       5.22  5.92       5.59  6.34
Figure 4 and Figure 5 compare the recognition rates of the CKSD with those of the TIMIT and Aurora-4 databases, respectively. The plots show that the triphone modelling technique performs better than the monophone modelling technique, and that the CKSD achieves higher accuracy than both the TIMIT and Aurora-4 databases.

Figure 6 and Figure 7 compare the CKSD with the TIMIT and Aurora-4 databases for the hybrid modelling techniques. These plots show that the combination of DNN and HMM performs better than the other hybrid modelling techniques for the CKSD.
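Plots of this kind can be regenerated directly from the tabulated WERs. The Python sketch below is illustrative only: it assumes matplotlib and NumPy are available and uses the WER_1 values of the MONO, tri*_600_9600 and SGMM rows of Table 5 to draw a grouped bar chart comparing the CKSD and Aurora-4 results:

    # Illustrative grouped bar chart of CKSD versus Aurora-4 WERs.
    # Values are the WER_1 column of Table 5 (MONO, tri*_600_9600, SGMM);
    # this is a sketch, not the script used to produce Figures 4-7.
    import matplotlib.pyplot as plt
    import numpy as np

    models  = ["MONO", "tri1", "tri2", "tri3", "SGMM"]
    cksd    = [8.23, 6.24, 5.59, 5.12, 4.86]
    aurora4 = [9.25, 6.54, 6.55, 6.62, 5.26]

    x = np.arange(len(models))
    width = 0.35
    fig, ax = plt.subplots()
    ax.bar(x - width / 2, cksd, width, label="CKSD")
    ax.bar(x + width / 2, aurora4, width, label="Aurora-4")
    ax.set_xticks(x)
    ax.set_xticklabels(models)
    ax.set_ylabel("WER (%)")
    ax.set_title("WER comparison: CKSD vs Aurora-4 (Table 5, WER_1)")
    ax.legend()
    plt.tight_layout()
    plt.show()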
Figure 4 The performance comparison of CKSD and TIMIT database
Figure 5 The performance comparison of CKSD and Aurora-4 database

These plots also confirm that the CKSD performs better than both the TIMIT and the Aurora-4 databases.
13 Conclusions

The CKSR system has been demonstrated in this study. The speech data were collected, transcribed, and the transcriptions were verified. The ASR models were developed using the Kaldi toolkit, and all alternative pronunciations of the Kannada sentences were included in the lexicon.
Figure 6 The performance comparison of CKSD and TIMIT database for hybrid modelling techniques
Figure 7 The performance comparison of CKSD and Aurora-4 database for hybrid modelling techniques
The WERs obtained for the monophone, triphone1, triphone2, triphone3 and SGMM models and for the combinations of SGMM and MMI, DNN and HMM, and DNN and SGMM are 8.36%, 6.22%, 5.38%, 5.12%, 4.84%, 4.98%, 4.05% and 4.59%, respectively. The recognition rate of the CKSR system is higher than that obtained for the TIMIT and Aurora-4 databases. The SGMM and hybrid DNN-based modelling techniques achieved the lowest WERs for the continuous Kannada speech data.
These lowest-WER models (SGMM and the DNN-based models) could be used to build a stable ASR framework. The developed ASR system has also been tested with different speakers under degraded conditions. The accuracy of the ASR system can be improved further by applying noise reduction methods more effectively. The remaining challenge is to improve the system performance further by increasing the number of speakers and by increasing the number of phonemes appropriately.