
Digital Signal Processing 22 (2012) 544–553


A hierarchical language identification system for Indian languages


S. Jothilakshmi*, V. Ramalingam, S. Palanivel
Department of Computer Science and Engineering, Annamalai University, Annamalainagar 608 002, India

Article history: Available online 27 January 2012

Keywords: Language identification; Mel frequency cepstral coefficients; Shifted delta cepstral coefficients; Hidden Markov model; Gaussian mixture model; Neural networks; Indian languages

Abstract

Automatic spoken Language IDentification (LID) is the task of identifying the language from a short duration of speech signal uttered by an unknown speaker. In this work, an attempt has been made to develop a two level language identification system for Indian languages using acoustic features. In the first level, the system identifies the family of the spoken language, and then it is fed to the second level, which aims at identifying the particular language in the corresponding family. The performance of the system is analyzed for various acoustic features and different classifiers. The suitable acoustic feature and pattern classification model are suggested for effective identification of Indian languages. The system has been modeled using the hidden Markov model (HMM), Gaussian mixture model (GMM) and artificial neural networks (ANN). We studied the discriminative power of the system for the features mel frequency cepstral coefficients (MFCC), MFCC with delta and acceleration coefficients, and shifted delta cepstral (SDC) coefficients. Then the LID performance as a function of different training and testing set sizes has been studied. To carry out the experiments, a new database has been created for 9 Indian languages. It is shown that the GMM based LID system using MFCC with delta and acceleration coefficients performs well, with 80.56% accuracy. The performance of the GMM based LID system with SDC is also considerable.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

Automatic spoken Language IDentification (LID) is the task of identifying the language from a short duration of speech signal uttered by an unknown speaker [1,2]. In general, applications of language identification systems can be categorized into two classes, namely front-end for machines and front-end for human listeners.

As a front-end for machines, LID systems find many applications. In the case of a multi-language speech recognition system, if the language of the input signal is not known, it will be required to run speech recognizers for several languages in parallel, which is computationally very expensive. In this scenario, an LID system can be used as a front-end to reduce the number of speech recognizers to be activated, by considering the first n-best languages declared by the LID system.

Another application in this category is the speech-to-speech translation system, in which the source language is to be converted to the target language. Using an LID system as a front-end to identify the language of the input (source) speech signal, the speech recognizers' performance can be drastically improved. An important application in this category is at public service points, where information access is enabled by voice. In countries like India, China, South Africa, Vietnam, etc., where a multitude of languages coexist, it is not always possible to have trained operators who can assist in different languages at all public service points. An automatic language identification system can be extremely useful in such places to provide assistance. Once the language of the speaker is identified, further responses from the system can be given in the language of the speaker itself.

As a front-end for human listeners, an LID system can play an important role in routing an international telephone call to a human operator who is fluent in the language of the caller. In all the above mentioned applications, the LID system can be extremely useful if its performance is reasonably good. In the recent past, a number of significant results have been obtained in the area of language identification [3–5].

LID systems are categorized into two types, namely acoustic LID and phonotactic LID [6]. Acoustic LID determines the language directly on the basis of features derived from the speech signal [7]. In phonotactic LID, speech is first transcribed by a phoneme recognizer into strings or graphs of phonemes. On these, "language" models are trained to capture statistics of pairs and triples of phonemes.

* Corresponding author.
E-mail addresses: jothi.sekar@gmail.com (S. Jothilakshmi), aucsevr@yahoo.com (V. Ramalingam), spal_yughu@yahoo.com (S. Palanivel).

doi:10.1016/j.dsp.2011.11.008

It is common for language identification systems to need large sets of phonemically labeled data for training, usually one set for each language to be identified [8–10,3]. Such an approach to the language identification problem is known as Phonemic Recognition followed by Language Modeling (PRLM) [9,11–13]. While the PRLM algorithm is effective, the phonemic labeling required can

be a laborious and time consuming process [3], and can also create extensibility issues due to the large amount of time and effort required to add new language models. It is therefore advantageous to develop systems that perform the discrimination task without this phonemic modeling prerequisite. Attempts at developing acoustic based models have resulted in the use of techniques such as vector quantization (VQ) [14], support vector machines (SVM) [15], and the Gaussian mixture model (GMM) [16,17].

The aim of the proposed work is to develop an LID system for Indian languages which does not require any linguistic information, i.e., the system should be less complex and the inclusion of a new language should be a trivial task. As far as Indian languages are concerned, since most of the languages have the same origin, all the languages share a common set of phonemes, even though there are a few unique sounds. In this work, our intention is to suggest the more suitable acoustic features and modeling techniques for Indian languages to give a reasonable performance.

This paper describes a two level identification system for Indian languages using acoustic features. In the first level, the system identifies the family of the spoken language, and then it is fed to the second level, which aims at identifying the particular language in the corresponding family. The commonly used pattern recognition models in the literature for LID are the hidden Markov model (HMM), GMM, artificial neural networks (ANN) and SVM. The proposed system has been modeled using HMM, GMM and ANN for Indian languages. Any pattern recognition task relies on a proper choice of features in order to perform well. The most widely used features for LID are mel frequency cepstral coefficients (MFCC). Traditionally, language and speaker identification tasks use feature vectors containing cepstra and delta and acceleration cepstra. Recently, however, the shifted delta cepstrum (SDC) has been found to exhibit superior performance to the delta and acceleration cepstra in a number of language identification studies, due to its ability to incorporate additional temporal information, spanning multiple frames, into the feature vector. The performances of MFCC and SDC features are compared in this work. To carry out the experiments, a new database has been created for 9 Indian languages.

This paper is organized as follows: Section 2 briefly reviews the extraction of acoustic features used for LID. In Section 3, we briefly review the modeling techniques used for LID. The proposed method is presented in detail in Section 4. Section 5 describes the database and the experimental results. Section 6 concludes the paper.

2. Acoustic features

The most widely used features for LID are MFCC features. Traditionally, language and speaker identification tasks use feature vectors containing cepstra and delta and acceleration cepstra. Recently, however, the SDC has been found to exhibit superior performance to the delta and acceleration cepstra in a number of LID studies, due to its ability to incorporate additional temporal information, spanning multiple frames, into the feature vector. The performances of MFCC and SDC features are compared in this work.

2.1. Mel frequency cepstral coefficients

Mel frequency cepstral coefficients have proved to be one of the most successful feature representations in speech related recognition tasks. The mel-cepstrum exploits auditory principles, as well as the decorrelating property of the cepstrum [18]. The procedure of MFCC computation is shown in Fig. 1 and is described as follows:

Fig. 1. Extraction of MFCC from speech signal.

• Preemphasis: The digitized speech signal S = s(n), n = 1, ..., Nt (where Nt is the total number of samples in the speech stream) is put through a low order digital system to spectrally flatten the signal and to make it less susceptible to finite precision effects later in the signal processing. The output of the preemphasis network, ŝ(n), is related to the input s(n) by the difference equation

ŝ(n) = s(n) − α s(n − 1)    (1)

where α is the scaling factor, which varies from 0 to 1. The most common value for α is around 0.95.

• Frame blocking: Speech analysis usually assumes that the signal properties change relatively slowly with time. This allows examination of a short time window of speech to extract parameters presumed to remain fixed for the duration of the window. Thus, to model dynamic parameters, we must divide the signal into successive windows or analysis frames, so that the parameters can be calculated often enough to follow the relevant changes. In this step the preemphasized speech signal ŝ(n) is blocked into frames of N samples, with adjacent frames separated by M samples. If we denote the lth frame of speech by ŝl(n), and there are L frames within the entire speech signal, then

ŝl(n) = ŝ(Ml + n),  n = 0, 1, ..., N − 1,  l = 0, 1, ..., L − 1    (2)

We used a frame rate of 125 frames/s, where each frame was 16 ms in duration with an overlap of 50 percent between adjacent frames.

• Windowing: The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of the frame. The window must be selected to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N − 1, then the result of windowing the signal is

s̃l(n) = ŝl(n) w(n),  0 ≤ n ≤ N − 1    (3)

The Hamming window is used for our work, which has the form

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1    (4)

• Computing spectral coefficients: The spectral coefficients of the windowed frames are computed using the fast Fourier transform (FFT), as follows:

X(k) = Σ_{n=0}^{N−1} s̃l(n) exp(−j(2πk/N)n),  0 ≤ k ≤ N − 1    (5)

• Computing mel spectral coefficients: The spectral coefficients of each frame are then weighted by a series of filter frequency responses whose center frequencies and bandwidths roughly match those of the auditory critical band filters. These filters follow the mel scale, whereby band edges and center frequencies of the filters are linear for low frequency and increase logarithmically with increasing frequency. These filters are

called mel-scale filters, and collectively a mel-scale filter bank. As shown in Fig. 2, the filters used are triangular and they are equally spaced along the mel-scale, which is defined by

Mel(f) = 2595 log10(1 + f/700)    (6)

where f is the linear scale frequency and Mel(f) is the corresponding mel scale frequency. Each short term Fourier transform (STFT) magnitude coefficient is multiplied by the corresponding filter gain and the results are accumulated.

Fig. 2. Mel scale filter bank.

• Computing mel frequency cepstral coefficients: The discrete cosine transform (DCT) is applied to the log of the mel spectral coefficients E(i) to obtain the mel frequency cepstral coefficients as follows:

x(m) = sqrt(2/Mf) Σ_{i=0}^{Mf−1} E(i) cos((2i + 1)mπ / (2Mf)),  m = 1, ..., Mf    (7)

where Mf is the number of filters in the filter bank. Finally, cepstral mean subtraction is performed to reduce channel effects.

2.2. Shifted delta cepstral coefficients

The use of shifted delta cepstral feature vectors allows a pseudo-prosodic feature vector to be computed without having to explicitly find or model the prosodic structure of the speech signal [19]. A shifted delta operation is applied to frame based acoustic feature vectors in order to create the new combined feature vectors for each frame.

The SDC coefficients are computed, for a cepstral frame at time t, according to

Δcn(t, i) = cn(t + iP + D) − cn(t + iP − D),  n = 0, ..., Nf − 1,  i = 0, ..., kb − 1    (8)

where n indexes the cepstral coefficient, D is the lag of the deltas, P is the distance between successive delta computations, and i is the SDC block number. The final feature vector is obtained by concatenation of kb blocks of Nf parameters.

The computation of the shifted delta feature vectors is a relatively simple procedure: the cepstral feature vectors are first computed as described above; the acoustic feature vectors spaced D frames apart are then differenced; and kb differenced feature vector frames, spaced P frames apart, are stacked to form a new feature vector. Fig. 3 gives a graphical depiction of this process.

The configuration 7–1–3–7 for Nf–D–P–kb has been used in this work to extract the SDC feature vector. For each frame, the 49 SDC coefficients are appended to the 7 direct MFCC coefficients, so 56 coefficients are used in total.

3. Acoustic modeling for LID

Language recognition can be seen as a classification problem with each language representing a class. Many classifier designs exist in the machine learning literature for high-dimension vector classification. In this work, we used three conventional techniques, namely HMM, GMM and ANN. We now describe the three classifiers in more detail.

3.1. Hidden Markov model

The hidden Markov model (HMM) consists of two stochastic processes: one is an unobservable Markov chain with a finite number of states, an initial state probability distribution and a state transition probability matrix, and the other is a set of probability density functions associated with each state. The probability density function can be either discrete (discrete HMM) or continuous (continuous HMM) [20]. The continuous HMM is characterized by the following [21]:

• Nx, the number of states in the model. We denote the individual states as s = {s1, s2, ..., sNx}, and the state at time t as qt.

• The state-transition probability distribution A = {aij}, where

aij = P[qt+1 = sj | qt = si],  1 ≤ i, j ≤ Nx    (9)

defines the probability of transition from state si to sj at time t, with the constraint

Σ_{j=1}^{Nx} aij = 1,  1 ≤ i ≤ Nx    (10)

• The observation probability density function in state i, B = {bi(O)}, where

bi(O) = Σ_{k=1}^{M} Cik p(O, μik, Σik),  1 ≤ i ≤ Nx    (11)

where Cik is the mixture coefficient for the kth mixture component in state i, M is the number of components in a Gaussian mixture model, and p(O, μik, Σik) is a Gaussian probability density function with mean μik and covariance Σik.

• The initial state distribution π = {πi}, where

πi = P[q1 = si],  1 ≤ i ≤ Nx    (12)

Given appropriate values of Nx, M, A, B, and π, the HMM can be used both as a generator of observations and as a model of how a given observation sequence O = (o1 o2 ... oT) was generated by an appropriate HMM, where T is the length of the observation sequence. The compact notation λ = (A, B, π) is used to indicate the complete parameter set of the model. A 5 state (Nx = 5) left–right HMM with two mixtures (M = 2) in each state has been used for modeling the feature vectors of each language. In an ergodic HMM, any state can be reached from any other state in a single step, i.e., aij > 0 for all i, j.

3.2. Gaussian mixture models

The basis for using the Gaussian mixture model (GMM) is that the distribution of feature vectors extracted from an individual's speech data can be modeled by a mixture of Gaussian densities [20]. For an Nf dimensional feature vector x, the mixture density function for class s is defined as

p(x|λs) = Σ_{i=1}^{M} αi^s fi^s(x)    (13)
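To make the scoring concrete, the mixture density of Eq. (13) and its use for classifying an utterance by total log-likelihood can be sketched as follows. This is our own illustration, not the authors' code; the diagonal-covariance form and the toy parameters are assumptions:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Frame-wise log of Eq. (13), p(x | lambda_s) = sum_i alpha_i f_i(x),
    for diagonal-covariance Gaussian components.
    X: (T, d) feature frames; weights: (M,); means, variances: (M, d)."""
    d = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]                   # (T, M, d)
    log_det = np.sum(np.log(variances), axis=1)                # (M,)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (T, M)
    log_comp = -0.5 * (d * np.log(2.0 * np.pi) + log_det[None, :] + mahal)
    a = np.log(weights)[None, :] + log_comp                    # (T, M)
    m = np.max(a, axis=1, keepdims=True)                       # stable log-sum-exp
    return np.squeeze(m, axis=1) + np.log(np.sum(np.exp(a - m), axis=1))

def identify_language(X, models):
    """Pick the language whose GMM gives the highest summed log-likelihood
    over the whole feature-vector sequence, as described in Section 3.2."""
    scores = {lang: gmm_log_likelihood(X, *p).sum() for lang, p in models.items()}
    return max(scores, key=scores.get)
```

During recognition this mirrors the procedure of Section 3.2: one GMM per language scores the utterance, and the maximum-likelihood model is chosen.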

Fig. 3. Calculation of the shifted delta feature vectors.
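The stacking shown in Fig. 3 and Eq. (8) is easy to state in code. The sketch below is our own illustration, not the authors' implementation; in particular, the edge-clamping of out-of-range frame indices is an assumption, since the paper does not specify a boundary policy:

```python
import numpy as np

def sdc(cep, d=1, P=3, k=7):
    """Shifted delta cepstra, Eq. (8):
    delta_c(t, i) = c(t + i*P + d) - c(t + i*P - d),  i = 0, ..., k-1,
    with the k difference blocks stacked into one vector per frame.
    cep: (T, N) cepstral frames -> returns (T, k*N)."""
    T = cep.shape[0]
    t = np.arange(T)
    blocks = []
    for i in range(k):
        plus = np.clip(t + i * P + d, 0, T - 1)   # assumed edge clamping
        minus = np.clip(t + i * P - d, 0, T - 1)
        blocks.append(cep[plus] - cep[minus])     # (T, N)
    return np.hstack(blocks)

# The paper's 7-1-3-7 configuration: 7 static MFCCs per frame plus
# 7 x 7 = 49 SDC values gives the 56-dimensional feature vector.
mfcc = np.random.randn(200, 7)
features = np.hstack([mfcc, sdc(mfcc, d=1, P=3, k=7)])
```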

The mixture density function is a weighted linear combination of M component unimodal Gaussian densities fi^s(·). Each Gaussian density function fi^s(·) is parameterized by the mean vector μi^s and the covariance matrix Σi^s:

fi^s(x) = (1 / sqrt((2π)^Nf |Σi^s|)) exp(−(1/2)(x − μi^s)^T (Σi^s)^{−1} (x − μi^s))    (14)

where (Σi^s)^{−1} and |Σi^s| denote the inverse and determinant of the covariance matrix Σi^s, respectively. The mixture weights (α1^s, α2^s, ..., αM^s) satisfy the constraint Σ_{i=1}^{M} αi^s = 1. Collectively, the parameters of the class s model are denoted as λs = {αi^s, μi^s, Σi^s}, i = 1, 2, ..., M. The number of mixture components is chosen empirically for a given data set. The parameters of the GMM are estimated using the iterative expectation-maximization (EM) algorithm [22].

In the language identification task, a GMM is created for each language. Under the GMM assumption, the likelihood of a feature vector is represented by a weighted sum of multivariate Gaussian densities. During recognition, an unknown speech utterance is represented by a sequence of feature vectors, and the log-likelihood produced by each model is calculated.

Because of the quasi-parametric nature of GMMs, arbitrary feature space segmentations can be modeled to very high accuracy given enough training data and enough mixture components. With a GMM, the user is not restricted to specific functional forms, as in truly parametric modeling, yet the size of the model only grows with the complexity of the problem being solved, unlike fully non-parametric methods. Also, GMMs are capable of modeling the density of data points along an arbitrary feature space segmentation curve. This allows a GMM to discriminate between classes that lie on the same feature space curve but have different densities along that curve. The acoustic feature vectors of the phonemes of different languages are one example of this.

For the most part, the phonemes of the languages lie within the same bounded surface of the feature space, with the differentiating factor between languages being the distribution of data points along the feature space surface. While these traits of GMMs are exceptionally noteworthy, one must be careful to ensure that enough training data and mixture components are used to describe the classification boundary in the feature space accurately. The more mixture components are used, the more accurate the model can become, but increasing the number of mixtures also increases the amount of training data and time required for processing. Such trade-offs should be considered when using GMMs. In this experiment, 256 mixtures are used.

3.3. Artificial neural network

Artificial neural networks are capable of learning complex mappings between inputs and outputs, and are particularly useful when the underlying statistics of the considered task are not well understood. A multi-layer perceptron typically has three layers of

Fig. 4. Effect of acoustic features for various test utterance duration. (a) 3 s. (b) 10 s. (c) 30 s.

nodes [23,24]. The input layer takes feature vectors as inputs, and each node in the output layer corresponds to a language. Therefore, we have as many nodes in the first layer as the dimensionality of the feature vector. The non-linear functions at the hidden and output nodes are sigmoid. For an Lc-class LID task, we have Lc nodes in the output layer. The number of nodes in the hidden layer is chosen empirically to yield a good compromise between generalization and regression of the training data. During training, the output node target corresponding to the language of the input vector is set to one, and the other output node targets are set to zero. The training can be done by an error back-propagation algorithm.

4. LID system

In this work, a hierarchical language identification system has been proposed for Indian languages. The languages of India belong to four major linguistic families, namely Indo Aryan, Dravidian, Austro-Asiatic and Tibeto-Burman [25]. The largest of these in terms of speakers is Indo Aryan, which is spoken by 75.28% of the people. The second largest is the Dravidian family, which is spoken by 22.5% of the people, whereas Austro-Asiatic is spoken by 1.13% of the people and Tibeto-Burman by 0.97% [26]. So, nearly 98.0% of the people in India speak languages from the Aryan and Dravidian families. Hence, the proposed two level system is designed to identify the languages of the Aryan and Dravidian families. In the first level, the system identifies the family of the spoken language, and then it is fed to the second level, which aims at identifying the particular language in the corresponding family. The acoustic systems are an interesting compromise between complexity and performance. This work explores the use of different types of acoustic features for Indian language identification for three types of classifiers, namely HMM, GMM and ANN.

5. Experimental results

All the experiments described in this paper were conducted on our own database. It comprises broadcast news in 9 languages. In the Dravidian family, all four languages, namely Tamil (Ta), Telugu (Te), Kannada (Ka) and Malayalam (Mal), are used. In the Aryan family, the major languages, namely Hindi (Hi), Bengali (Be), Marathi (Ma), Gujarati (Gu), and Punjabi (Pu), are selected. This database contains a total of 9 hours of broadcast data from the Doordarshan television network, which broadcasts programs in all regional languages of India.

While training the models, it is important to be consistent with the amount of training data used. For the majority of the experiments conducted, training feature vectors for each language to be modeled are randomly selected as the training set. The testing and training sets for all of the experiments presented here are mutually exclusive, and approximately equal in size.

Prior to automatic LID, the speech signals are first pre-processed by removing the silences using a short term energy function. This system consists primarily of three steps: feature extraction, family identification, and language identification. In the first step, from the speech data of all the languages under consideration, the features are extracted from all the frames, with a frame duration of 20 ms and a frame-shift of 10 ms.

Three key topics are addressed by conducting a systematic series of experiments. For the topic of feature extraction, we studied the discriminative power of the system for the features MFCC, MFCC with delta and acceleration coefficients, and shifted delta cepstral coefficients. Then the LID performance as a function of different training and testing set sizes has been studied. Finally, the performance of the proposed approach is compared for different modeling techniques, namely HMM, GMM and neural networks.

5.1. HMM based LID system

When modeling an acoustic system with HMMs, the model size is dictated by two major settings, the number of states and the number of Gaussian mixtures per state. For our experiments, 5 state HMMs with 2 mixtures/state are chosen. Separate HMM models are created for the Aryan and Dravidian families and for each language.

The features used for comparison are 19 MFCC coefficients, 39 MFCC with delta and acceleration coefficients, and 7 MFCC coefficients concatenated with 49 SDC coefficients. The usual practice is to use SDC coefficients without concatenating them with the original cepstral features. Doing so, however, leads to significant degradation in performance. Figs. 4(a), (b) and (c) show the effect of different features on the performance of the HMM based system. The results show that MFCC with delta and acceleration coefficients and SDC with MFCC coefficients are the best acoustic features to classify the Indian languages. This is due to their ability to incorporate temporal information. The highest performance obtained by MFCC features is 46.45% and by MFCC with delta and acceleration coefficients is 50.56%, whereas 48.44% is obtained by SDC features.

It is well known that the size of the training set often greatly affects the performance of a pattern classification system. The experiments are carried out with various speech segments for each language to study the effects of training set size. The training set is randomly selected for each language. The aim of this work is to design a less complex acoustic system for Indian language identification. Hence, the performance of the system with 200 seconds,

Fig. 5. Effect of training set size for various features. (a) MFCC features. (b) MFCC with delta and acceleration features. (c) SDC features.

Fig. 6. Effect of test utterance duration for various training sets. (a) 200 s. (b) 300 s. (c) 400 s.

Table 1
Language-wise performance of the HMM based LID system (in %).

Be Gu Hi Ma Pu Ka Mal Ta Te Average
HMM based 30.0 35.0 20.0 35.0 40.0 100 50.0 85.0 60.0 50.56

300 seconds and 400 seconds of speech segments from each language has been studied. The variation in the performance of MFCC, MFCC with delta and acceleration, and SDC features for these training sets is analyzed. The average LID error rate as a function of the training data size for the HMM based classifier is shown in Figs. 5(a), (b) and (c). The results show that, for all the features, the average error rate reduces as the training set size increases.

Selecting segments that have enough information for identifying the language is clearly needed to achieve better performance. As the duration of the test segments varies greatly, an analysis has been made of the performance of this LID system with different durations of test signals. The results are shown in Figs. 6(a), (b) and (c). As expected, the performance of the system improves when using longer segments. The improvement is clearly significant when selecting segments of duration above 20 seconds.

The best performance of the LID using the HMM based acoustic system is given in Table 1. This is achieved for MFCC with delta and acceleration coefficients. The training set size is 400 seconds for each language and the test utterance duration is 30 seconds. The average performance is severely affected by the poorer performance of the Aryan languages.

The observations made on the false acceptance and false rejection analysis (Table 2) and the confusion matrix (Table 3) show that there is a strong bias with some of the languages. The specific observations are given below:

Table 2
False rejection (FR) and false acceptance (FA) (in %).

Language    False rejection    False acceptance
Be          70.0               0.0
Gu          65.0               0.0
Hi          80.0               2.5
Ma          65.0               1.9
Pu          60.0               1.9
Ka          0.0                4.4
Mal         50.0               9.4
Ta          15.0               28.13
Te          40.0               7.5

• Performance for the Dravidian family languages is good when compared to the Aryan family.
• Malayalam and Tamil (7th and 8th rows in Table 2 and 7th and 8th columns in Table 3) are severely biased.
• Even though the performance for the language Kannada is good, it is biased, and other languages are severely confused with it.

The family-wise performance of the HMM based system for various features is given in Table 4. The observations show that the HMM system performs well for the Dravidian family when compared to the Aryan family.
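The false rejection and false acceptance rates of Table 2 can be derived mechanically from a confusion matrix laid out as in Table 3 (rows = actual class, columns = assigned class): FR for a language is the share of its own trials that were mislabeled, and FA is the share of the other languages' trials assigned to it. A minimal sketch of that bookkeeping, written by us for illustration:

```python
import numpy as np

def fr_fa(conf):
    """Per-class false rejection and false acceptance rates (in %)
    from a confusion matrix whose rows are true classes and whose
    columns are assigned classes."""
    conf = np.asarray(conf, dtype=float)
    row_tot = conf.sum(axis=1)            # trials per true class
    correct = np.diag(conf)
    fr = 100.0 * (row_tot - correct) / row_tot
    # false acceptances for class j: off-diagonal mass in column j,
    # relative to all trials whose true class is not j
    col_tot = conf.sum(axis=0)
    fa = 100.0 * (col_tot - correct) / (conf.sum() - row_tot)
    return fr, fa
```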

Fig. 7. Effect of acoustic features for various test utterance duration. (a) 3 s. (b) 10 s. (c) 30 s.

Table 3
Confusion matrix (LID system using HMM). Rows correspond to the actual class of the test data, columns to the assigned class of the test data.

      Be  Gu  Hi  Ma  Pu  Ka  Mal  Ta  Te
Be     3   6   0   0   0   3    3   0   5
Gu     0   0   3   7   0   7    3   0   0
Hi     0   0   0   0   4   8    0   4   4
Ma     0   0   7   0   0   7    3   3   0
Pu     8   0   0   0   0   6    0   0   6
Ka     0   0   0   0   0  20    0   0   0
Mal    0   0   0   0   0   0   10  10   0
Ta     0   0   0   0   0   0    0  17   3
Te     0   0   4   0   0   0    0   4  12

Table 4
Family-wise average performance of HMM based system (in %).

Features        Aryan    Dravidian
MFCC            27.0     39.03
MFCC + D + A    32.0     73.75
SDC             36.66    55.0

5.2. GMM based LID system

This section presents and discusses the experimental data collected on the GMM based LID task. When modeling an acoustic system with GMMs, the model size is dictated by the number of Gaussian mixtures. In this work, 256 mixtures are used, for a number of heuristic reasons:

• This gives a reasonable and consistent estimate of the number of phonemes that can be expected in any given language.
• Keeping the number of mixtures low allows faster algorithm speed and GMM training.
• It reduces the memory constraints on the system compared to using a higher number of mixture components.
• It requires less training data to obtain accurate results than higher mixture orders.

As in Section 5.1, this section compares the different types of features, namely MFCC, MFCC with delta and acceleration, and SDC features, for language identification with Gaussian mixture models. The features used for comparison are 19 MFCC coefficients, 39 MFCC with delta and acceleration coefficients, and 7 MFCC coefficients concatenated with 49 SDC coefficients. Figs. 7(a), (b) and (c) show the effect of different features on the performance of the GMM based system. The results show that the MFCC with delta and acceleration coefficients perform well with GMM. The highest performance obtained by MFCC features is 65.15%; by MFCC with delta and acceleration coefficients it is 80.56%, whereas 78.59% is obtained by SDC features.

As discussed in Section 5.1, we proceed with various speech segments for each language to study the effects of training set size with the GMM classifier. The performance of the GMM based LID system with 200 seconds, 300 seconds and 400 seconds of speech segments from each language has been studied. The variation in the performance of MFCC, MFCC with delta and acceleration, and SDC features for these training sets is analyzed. The average LID error rate as a function of the training data size for the GMM based system is shown in Figs. 8(a), (b) and (c). The results show that, for all the features, the average error rate reduces as the training set size increases.

As the duration of the test segments varies greatly, an analysis has been made of the performance of this GMM based LID system with different durations of test signals. The results are shown in Figs. 9(a), (b) and (c). As expected, the performance of the system improves when using longer segments. The improvement is clearly significant when selecting segments of duration above 20 seconds.

The best performance of the LID using the GMM based acoustic system is 80.56%, which is given in Table 5. This is achieved for MFCC with delta and acceleration coefficients. The training set size is 400 seconds for each language and the test utterance duration is 30 seconds.

The observations made on the false acceptance and false rejection analysis (Table 6) and the confusion matrix (Table 7) show that there is a strong bias with some of the languages. The specific observations are given below:

• Performance for the Dravidian family languages is good when compared to that for the Aryan family.
• Bengali, Tamil and Telugu (1st, 7th and 8th rows in Table 6 and 1st, 7th and 8th columns in Table 7) are severely biased.
• Even though the performance for the languages Gujarati and Kannada is good, they are not biased, and other languages are not confused with them to the same extent.

The best family-wise performance of the GMM based system for various features is given in Table 8. The experiments show that the performance of MFCC with delta and acceleration coefficients and SDC features is good. The observations show that the system performs well for the Dravidian family when compared to the Aryan family.

5.3. Neural networks based LID system

This section presents and discusses the experimental data collected on the neural networks based LID system.
S. Jothilakshmi et al. / Digital Signal Processing 22 (2012) 544–553 551

Fig. 8. Effect of training set size for various features. (a) MFCC features. (b) MFCC with delta and acceleration features. (c) SDC features.

Fig. 9. Effect of test utterance duration for various training sets. (a) 200 s. (b) 300 s. (c) 400 s.

Table 5
Language-wise performance of the GMM based LID system (in %).

Be Gu Hi Ma Pu Ka Mal Ta Te Average
GMM based 45.0 100 80.0 80.0 80.0 80.0 100 100 60.0 80.56

Table 6 Table 8
False rejection (FR) and false acceptance (FA) (in %). Family-wise performance of GMM based system (in %).

Language False rejection False acceptance Features Aryan Dravidian


Be 55.0 8.1 MFCC 62.13 68.93
Gu 0.0 0 .0 MFCC + D + A 77.0 85.0
Hi 20.0 0 .0 SDC 73.46 85.0
Ma 20.0 0 .0
Pu 20.0 0 .0
Ka 20.0 0.63 ture of the neural network for MFCC features is 19L u 39N u 29N u
Mal 0.0 1.25 (excluding output layer). The network structure used for MFCC
Ta 0.0 5.0
Te 40.0 6.9
with delta and acceleration coefficients is 39L u 59N u 49N u (ex-
cluding output layer). The structure of the neural network for SDC
features is 56L u 68N u 58N u (excluding output layer) where L u
denotes a linear unit, and N u denotes a non-linear unit. The non-
Table 7 linear units use tanh(s) as the activation function, where s is the
Confusion matrix (LID system using GMM). Rows correspond to the actual class of
activation value of the unit. The back propagation learning algo-
the test data, columns to the assigned class of the test data.
rithm is used to adjust the weights of the network to minimize
Be Gu Hi Ma Pu Ka Mal Ta Te the mean square error for each acoustic feature vector.
Be 9 0 0 0 0 0 0 2 9 The experiments are carried out by varying the size of the
Gu 0 20 0 0 0 0 0 0 0 training set and test utterance duration as done for HMM and
Hi 0 0 16 0 0 0 0 4 0
Ma 1 0 0 16 0 1 0 2 0
GMM based LID systems. From the results it is observed that
Pu 0 0 0 0 16 0 2 0 2 the performance of this system is very poor for all features. The
Ka 4 0 0 0 0 16 0 0 0 highest performance obtained is 11.72%. There is no significant im-
Mal 0 0 0 0 0 0 20 0 0 provement in the performance even though the size of the training
Ta 0 0 0 0 0 0 0 20 0
set and the test utterance duration are varied. Moreover, it takes
Te 8 0 0 0 0 0 0 0 12
longer training time when compared to the training time taken by
HMM and GMM based systems. The system is heavily biased toward particular languages most of the time. Hence, it is concluded that the neural network is not a suitable classifier for Indian languages.
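For concreteness, the kind of network evaluated here (a linear input layer followed by two tanh hidden layers, as described in Section 5.3) can be sketched in pure Python. The 9-unit output layer and the random initial weights below are assumptions for illustration only; the paper specifies the hidden structure but not the output layer or initialization.

```python
import math
import random

def mlp_forward(x, layers):
    """Forward pass of a tanh multilayer perceptron.

    The input layer is linear (features pass straight in); every
    subsequent unit applies tanh(s) to its activation value s.
    """
    a = x
    for W, b in layers:
        s = [sum(w * ai for w, ai in zip(row, a)) + bi
             for row, bi in zip(W, b)]
        a = [math.tanh(si) for si in s]
    return a

def random_layer(n_in, n_out, rng):
    # Small random weights, zero biases (illustrative initialization).
    return ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

rng = random.Random(0)
sizes = [19, 39, 29, 9]  # 19 MFCC inputs, 39 and 29 hidden units, 9 outputs (assumed)
layers = [random_layer(i, o, rng) for i, o in zip(sizes, sizes[1:])]

frame = [rng.gauss(0.0, 1.0) for _ in range(19)]  # one synthetic feature frame
scores = mlp_forward(frame, layers)
language = max(range(len(scores)), key=scores.__getitem__)
```

In training, back propagation would adjust each weight matrix and bias vector to minimize the mean square error between the output and the target language for each acoustic feature vector.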
5.4. Discussion

Based on the results obtained from this hierarchical LID system, we can conclude that, for identifying Indian languages, MFCC with delta and acceleration coefficients outperforms the other feature vector types examined. The performance of the SDC features is also good, but the computational complexity of SDC is higher than that of MFCC with delta and acceleration coefficients. Comparing the models used in this analysis, GMM performs better than HMM and neural networks: the highest performance of the GMM based system is 80.56%, of the HMM based system 50.56%, and of the neural network based system 11.72%. These results are obtained with a limited amount of training data.

A thorough analysis of the errors was performed on the above-described LID methods. It is noticed that the errors in identifying the languages correctly are due to the following reasons.
• The Indian languages are similar in nature.
• The recorded speech signals are news bulletins recorded within a single week. As they are recorded from the Doordarshan network, much of the information is similar and some words are common to many languages. Hence, there are more confusions.
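The GMM decision rule underlying the results discussed above assigns a test utterance to the language whose model yields the highest average per-frame log-likelihood. A minimal pure-Python sketch with toy diagonal-covariance mixtures follows; the weights, means and variances are illustrative placeholders, not the trained 256-mixture models of this work.

```python
import math

def logpdf_diag(x, mean, var):
    # log N(x; mean, diag(var)) for one Gaussian component
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(frame, gmm):
    # log p(frame | GMM) via log-sum-exp over (weight, mean, var) components
    logs = [math.log(w) + logpdf_diag(frame, m, v) for w, m, v in gmm]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def identify(frames, models):
    # Average frame log-likelihood under each language model; pick the best.
    scores = {lang: sum(gmm_loglik(f, g) for f in frames) / len(frames)
              for lang, g in models.items()}
    return max(scores, key=scores.get)

# Two toy 2-mixture, 2-dimensional language models (illustrative only).
models = {
    "Tamil": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "Hindi": [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
}
frames = [[0.1, -0.2], [0.3, 0.5], [0.8, 1.1]]  # synthetic "feature" frames
print(identify(frames, models))  # → Tamil (the frames lie near that model's means)
```

In the hierarchical system, the same rule is applied twice: first over the family models, then over the language models within the chosen family.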
6. Conclusion and future work

In this paper, a hierarchical approach is proposed for spoken LID of Indian languages. The method first identifies the family of the language and then identifies the particular language within that family, so the number of comparisons required to identify a language is small. As this is an acoustic LID system, its complexity is low. The LID task is carried out using different acoustic features, namely MFCC, MFCC with delta and acceleration coefficients, and SDC features. To identify the best modeling technique for classifying Indian languages, the LID system is developed using HMM, GMM and neural networks, and the results are compared. It is shown that the GMM based LID system using MFCC with delta and acceleration coefficients performs well, with 80.56% accuracy. The performance of the GMM based LID system with SDC is also considerable. In this system, only acoustic features are used; if prosodic features are combined, better performance can be expected. It is noticed that the automatic LID system is biased toward some languages; an attempt can be made to address this. In the Aryan family, only 5 languages are considered. Further experiments can be carried out on a larger set of broadcast sources and more languages such as Oriya, Kashmiri and Urdu.

Dr. S. Jothilakshmi received the B.E. degree in Electronics and Communication Engineering from Govt. College of Engineering, Salem in 1994. She received the M.E. degree in Computer Science and Engineering from Annamalai University in the year 2005. She has been with Annamalai University since 1999. She completed her Ph.D. degree in Computer Science and Engineering at Annamalai University in 2011. She published 6 papers in international journals and conferences. Her research interests include speech processing, image and video processing, and pattern classification.

Dr. V. Ramalingam received the M.Sc. degree in Statistics from Annamalai University in 1980. He received the M.Tech degree in Computer Applications from Indian Institute of Technology Delhi in the year 1995. He has been with Annamalai University since 1982. He completed his Ph.D. degree in Computer Science and Engineering at Annamalai University in 2006. He published more than 40 papers in international conferences and journals. His research interests include image and video processing, natural language processing and neural networks.
Dr. S. Palanivel received the B.E. (Hons) degree in Computer Science and Engineering from Mookambigai College of Engineering in 1989. He received the M.E. degree in Computer Science and Engineering from Government College of Technology in the year 1994. He has been with Annamalai University since 1994. He completed his Ph.D. degree in Computer Science and Engineering at Indian Institute of Technology Madras under the quality improvement programme (QIP) sponsored by Annamalai University, in the year 2005. He published 26 papers in international journals and 14 in international conferences. His research interests include speech processing, image and video processing, and pattern classification techniques.